Posted to user@flink.apache.org by Hao Sun <ha...@zendesk.com> on 2017/10/06 18:16:29 UTC

TM get killed/disconnected after a while

Hi, I am running Flink 1.3.2 on Kubernetes. I am not sure why one of my
TaskManagers (TMs) is sometimes killed. Is there a way to debug this? Thanks

===== Logs ====

*2017-10-05 22:36:42,631 INFO
org.apache.flink.runtime.instance.InstanceManager             - Registered
TaskManager at fps-flink-taskmanager-2384273947-9n4kc
(akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274/user/taskmanager)
as 330ff7eeaabfe2b7289fee4a0e36c4b2. Current number of registered hosts is
2. Current number of alive task slots is 2.*
2017-10-05 22:37:04,974 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying
Source: KafkaSource(maxwell.users) -> MaxwellFilter->Maxwell(maxwell.users)
-> FixedDelayWatermark(maxwell.users) ->
MaxwellFPSEvent->InfluxDBData(maxwell.users) -> (Sink:
influxdbSink(maxwell.users), Sink: PrintSink(maxwell.users)) (1/1) (attempt
#0) to fps-flink-taskmanager-2384273947-9n4kc
*2017-10-06 06:08:55,657 WARN  akka.remote.ReliableDeliverySupervisor
                  - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Disassociated]*
2017-10-06 06:08:55,832 WARN  Remoting
                - Tried to associate with unreachable remote address
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted.]
2017-10-06 06:09:01,232 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:09:03,416 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:11,174 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:11,440 WARN  Remoting
                - Tried to associate with unreachable remote address
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted.]
2017-10-06 06:09:21,232 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:09:27,460 WARN  Remoting
                - Tried to associate with unreachable remote address
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted.]
2017-10-06 06:09:31,173 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:41,179 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:09:51,174 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:57,475 WARN  Remoting
                - Tried to associate with unreachable remote address
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted.]
2017-10-06 06:10:01,179 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:10:06,173 WARN  akka.remote.RemoteWatcher
                 - Detected unreachable:
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]
2017-10-06 06:10:06,177 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Task
manager akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274/user/taskmanager
terminated.
java.lang.Exception: TaskManager was lost/killed:
55d3143ccecec7878f7df169208795d0 @ fps-flink-taskmanager-2384273947-9n4kc
(dataPort=37448)
java.lang.Exception: TaskManager was lost/killed:
55d3143ccecec7878f7df169208795d0 @ fps-flink-taskmanager-2384273947-9n4kc
(dataPort=37448)
2017-10-06 06:10:06,188 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:10:06,240 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Unregistered task manager fps-flink-taskmanager-2384273947-9n4kc/
10.225.132.78. Number of registered task managers 3. Number of available
slots 3.
2017-10-06 06:10:16,247 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:10:26,284 WARN  akka.remote.ReliableDeliverySupervisor
                - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by:
[fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:10:27,495 WARN  Remoting
                - Tried to associate with unreachable remote address
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted.]
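
One way to read the timeline out of a pasted log like the one above is to join the hard-wrapped lines back into records and compare timestamps. The following is only a minimal sketch: parse_records and seconds_between are hypothetical helper names, and the record format is assumed from the lines shown in this post (optional leading '*', timestamp, level, then the message wrapped over several lines).

```python
import re
from datetime import datetime

# Start of a log record: optional '*' (email bold marker), timestamp, level.
RECORD_START = re.compile(
    r"^\*?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(INFO|WARN|ERROR|DEBUG)\b"
)


def parse_records(text):
    """Join hard-wrapped lines back into (timestamp, level, message) records.

    Lines without a leading timestamp (wrapped text, exception lines) are
    appended to the record that precedes them.
    """
    records = []
    for line in text.splitlines():
        m = RECORD_START.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            records.append([ts, m.group(2), line[m.end():].strip()])
        elif records and line.strip():
            records[-1][2] += " " + line.strip()
    # Collapse runs of whitespace left over from the wrapping.
    return [(ts, level, " ".join(msg.split())) for ts, level, msg in records]


def seconds_between(records, first_marker, second_marker):
    """Seconds from the first record containing first_marker to the first
    record containing second_marker."""
    start = next(ts for ts, _, msg in records if first_marker in msg)
    end = next(ts for ts, _, msg in records if second_marker in msg)
    return (end - start).total_seconds()


# Two records copied from the log in this post.
SAMPLE = """\
*2017-10-06 06:08:55,657 WARN  akka.remote.ReliableDeliverySupervisor
                  - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed,
address is now gated for [5000] ms. Reason: [Disassociated]*
2017-10-06 06:10:06,173 WARN  akka.remote.RemoteWatcher
                 - Detected unreachable:
[akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]
"""

records = parse_records(SAMPLE)
gap = seconds_between(records, "Reason: [Disassociated]", "Detected unreachable")
print(f"{gap:.3f} s")  # 70.516 s
```

For this log, that gives about 70 seconds between the first "Disassociated" warning and the RemoteWatcher marking the TaskManager unreachable. The script only summarizes what the JobManager saw; it does not say why the TaskManager went away.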

Re: TM get killed/disconnected after a while

Posted by Patrick Lucas <pa...@data-artisans.com>.
Hi,

Can you provide a bit more info about your setup, such as which Kubernetes
resources you are using (Deployments, Services)?

Is the pod running the taskmanager killed by Kubernetes or does it fail?
Can you provide the output of kubectl describe pod <pod> and kubectl logs
<pod> of the taskmanager pod that exited?

--
Patrick Lucas
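
The two commands requested above can be collected with a small script. This is a rough sketch, not part of the original reply: kubectl_debug_commands and run_all are hypothetical helper names, the namespace argument is an assumption about the setup, and the extra "kubectl logs --previous" call (logs of the previous container instance, useful if the container was restarted in place) is an addition beyond what was asked for.

```python
import subprocess


def kubectl_debug_commands(pod, namespace=None):
    """Build the kubectl invocations for inspecting one pod.

    Returns a list of argv lists; nothing is executed here.
    """
    ns_args = ["-n", namespace] if namespace else []
    return [
        ["kubectl", "describe", "pod", pod] + ns_args,
        ["kubectl", "logs", pod] + ns_args,
        # Addition: logs of the previous container instance, if any.
        ["kubectl", "logs", pod, "--previous"] + ns_args,
    ]


def run_all(pod, namespace=None):
    """Run each command and print its output (requires kubectl on PATH)."""
    for argv in kubectl_debug_commands(pod, namespace):
        print("$ " + " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout or result.stderr)


# Pod name taken from the log in the original post.
for argv in kubectl_debug_commands("fps-flink-taskmanager-2384273947-9n4kc"):
    print(" ".join(argv))
```

Calling run_all("<pod>") executes the commands against the current kubectl context. Since the JobManager log reports "Name does not resolve" for the pod, the pod may already have been replaced, in which case these commands will likely fail for that name and only the replacement pod can be inspected.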
