Posted to user@flink.apache.org by Seonhee Yun <se...@gmail.com> on 2017/09/03 17:06:23 UTC

Fwd: HA : My job didn't restart even if task manager restarted.

I have a 4-node Flink cluster running Flink 1.3.2 and Hadoop 2.6.5. The
configuration is identical on all 4 nodes.

Running a YARN session with a Flink job, I killed the TaskManager
(YarnTaskManager) process on node #2 (flink-02).

After the ResourceManager started a new TM container, the job remained in
FAILED status.

But when I killed the JobManager (YarnApplicationMasterRunner) process,
everything behaved as expected: my job restarted fine.

I’m using DFS for the storageDir.

The Flink configuration file and log files are attached. If any other
information is needed, please let me know.
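For reference, the HA-related part of flink-conf.yaml in a setup like this
looks roughly as follows. This is only a sketch with placeholder values, not
the attached configuration; in particular the ZooKeeper quorum address is an
assumption (the /flink root and /cluster_one id match what appears in the
logs below):

```yaml
# ZooKeeper-based HA for the YARN session (Flink 1.3.x keys)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181   # placeholder quorum
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one
high-availability.storageDir: hdfs:///flink/ha/                     # DFS storageDir

# let YARN restart the ApplicationMaster (JobManager) on failure
yarn.application-attempts: 10
```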



Here is the use case:
------------------------------------------------------------------------

1. start yarn session

flink-01 ~]$ flink/bin/yarn-session.sh -n 4 -s 8 -nm flink -d

2. submit job

flink-01 ~]$ flink/bin/flink run -d -p 3 -c a.b.c.Application abc.jar

3. find the TM pid and kill it

flink-02 ~]$ jps | grep YarnTaskManager

4022 YarnTaskManager

flink-02 ~]$ kill 4022



And the JM logs:
------------------------------------------------------------------------

2017-09-04 00:50:57,830 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

2017-09-04 00:50:58,013 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Container container_1504449399430_0005_01_000004 failed. Exit status: 143

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container container_1504449399430_0005_01_000004 in state COMPLETE : exitStatus=143 diagnostics=Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

Killed by external signal

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Total number of failed containers so far: 1

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Received new container: container_1504449399430_0005_01_000005 - Remaining pending container requests: 0

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Launching TaskManager in container ContainerInLaunch @ 1504453858014: Container: [ContainerId: container_1504449399430_0005_01_000005, NodeId: flink-02:34821, NodeHttpAddress: flink-02:8042, Resource: <memory:8192, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.0.5:34821 }, ] on host flink-02

2017-09-04 00:50:58,015 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - Opening proxy : flink-02:34821

2017-09-04 00:50:58,015 INFO  org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://flink@flink-02:45104/user/taskmanager terminated.

2017-09-04 00:50:58,016 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (3/3) (33b33a9c15eb0107b320e375374ac07f) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:474)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:100)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:103)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-09-04 00:50:58,018 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job vp (6e73f6babf36c9321efc0524a82dede1) switched from state RUNNING to FAILING.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

    (stack trace identical to the first one above)

2017-09-04 00:50:58,027 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (1/3) (02232baaa7b3f88982cbf40cb5ebe489) switched from RUNNING to CANCELING.

2017-09-04 00:50:58,029 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (2/3) (952c34a277df34aa7e3216a8b96ae2d5) switched from RUNNING to CANCELING.

2017-09-04 00:50:58,029 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1

2017-09-04 00:50:58,030 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager flink-02/10.1.0.5. Number of registered task managers 2. Number of available slots 16.

2017-09-04 00:50:58,065 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (1/3) (02232baaa7b3f88982cbf40cb5ebe489) switched from CANCELING to CANCELED.

2017-09-04 00:50:58,071 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (2/3) (952c34a277df34aa7e3216a8b96ae2d5) switched from CANCELING to CANCELED.

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job vp (6e73f6babf36c9321efc0524a82dede1) if no longer possible.

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job vp (6e73f6babf36c9321efc0524a82dede1) switched from state FAILING to FAILED.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

    (stack trace identical to the first one above)

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job vp (6e73f6babf36c9321efc0524a82dede1) because the restart strategy prevented it.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

    (stack trace identical to the first one above)

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 6e73f6babf36c9321efc0524a82dede1

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /flink/cluster_one/checkpoints/6e73f6babf36c9321efc0524a82dede1 from ZooKeeper

2017-09-04 00:50:58,102 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2017-09-04 00:50:58,102 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/6e73f6babf36c9321efc0524a82dede1 from ZooKeeper

2017-09-04 00:50:58,118 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Removed job graph 6e73f6babf36c9321efc0524a82dede1 from ZooKeeper.

2017-09-04 00:51:01,000 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-02 (akka.tcp://flink@flink-02:37281/user/taskmanager) as 6f8a6941cddb361de1643fa47b71a1db. Current number of registered hosts is 3. Current number of alive task slots is 24.

2017-09-04 00:51:01,000 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_1504449399430_0005_01_000005 has started.

2017-09-04 00:51:02,868 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-02:45104]] Caused by: [Connection refused: flink-02/10.1.0.5:45104]

(the same WARN is logged roughly every 5 seconds, from 00:51:07,883 through 00:53:58,573)

2017-09-04 00:54:03,592 ERROR Remoting - Association to [akka.tcp://flink@flink-02:45104] with UID [1818461746] irrecoverably failed. Quarantining address.

java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.

    at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:336)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:189)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
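The decisive entry in the log above is "Could not restart the job vp ...
because the restart strategy prevented it". In Flink 1.3 a job is only
restarted automatically when a restart strategy is in effect: with
checkpointing enabled, a fixed-delay strategy is applied implicitly;
otherwise the default is no restart, and the job goes straight to FAILED
even though YARN brings up a replacement TaskManager container. One way to
configure an explicit strategy is in flink-conf.yaml; the values below are
illustrative, not taken from the attached configuration:

```yaml
# retry the job 3 times, waiting 10 s between attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

The same strategy can also be set per job in the program itself via
ExecutionEnvironment#setRestartStrategy.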

Re: Fwd: HA : My job didn't restart even if task manager restarted.

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

sorry for the late response!
I'm not familiar with the details of the failure recovery but Till (in CC)
knows the code in depth.
Maybe he can figure out what's going on.

Best, Fabian

2017-09-06 5:35 GMT+02:00 sunny yun <se...@gmail.com>:

> I am still struggling to solve this problem.
> I have no doubt that the job should restart automatically after the
> TaskManager restarts in YARN mode. Is that a misunderstanding?
>
> The problem seems to be that *the JobManager still tries to connect to the
> old TaskManager even after the new TaskManager container has been
> created.*
> When I killed the TM on node #2, a new TM container was created on node
> #3, but according to the log file the JM still tried to connect to the TM
> on node #2. (This is not the log I posted before; I found it while
> continuing to test. Normally the TM is re-created on the same node after
> being killed.)
> So the new TM doesn't know about the job, and the JM shows the job as
> failed.
>
> If anyone has succeeded in the same situation (YARN + TM failure), please
> tell me.
> That would be a big help.
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.
> n4.nabble.com/
>

Re: Fwd: HA : My job didn't restart even if task manager restarted.

Posted by sunny yun <se...@gmail.com>.
I am still struggling to solve this problem.
I have no doubt that the job should restart automatically after the
TaskManager restarts in YARN mode. Is that a misunderstanding?

The problem seems to be that *the JobManager still tries to connect to the
old TaskManager even after the new TaskManager container has been created.*
When I killed the TM on node #2, a new TM container was created on node #3,
but according to the log file the JM still tried to connect to the TM on
node #2. (This is not the log I posted before; I found it while continuing
to test. Normally the TM is re-created on the same node after being killed.)
So the new TM doesn't know about the job, and the JM shows the job as
failed.

If anyone has succeeded in the same situation (YARN + TM failure), please
tell me.
That would be a big help.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/