Posted to user@flink.apache.org by "Chan, Regina" <Re...@gs.com> on 2017/08/28 17:12:50 UTC

Flink Yarn Session failures

Hi,

I'm trying to understand why about 9 minutes pass between the last attempt to start a container and the point where the YarnApplicationMasterRunner finally receives the SIGTERM that kills it.

Client:



Calc Engine: 2017-08-28 12:39:23,596 INFO  org.apache.flink.yarn.YarnClusterClient                       - Waiting until all TaskManagers have connected

Calc Engine: Waiting until all TaskManagers have connected

Calc Engine: 2017-08-28 12:39:23,600 INFO  org.apache.flink.yarn.YarnClusterClient                       - Starting client actor system.

Calc Engine: 2017-08-28 12:39:24,077 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started

Calc Engine: 2017-08-28 12:39:24,366 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@dlp-qa-176378-023.dc.gs.com:39353]

Calc Engine: 2017-08-28 12:39:24,609 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (0/4)

Calc Engine: TaskManager status (0/4)

Calc Engine: 2017-08-28 12:39:29,864 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (1/4)

Calc Engine: TaskManager status (1/4)

Calc Engine: 2017-08-28 12:39:30,389 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (2/4)

Calc Engine: TaskManager status (2/4)

Calc Engine: 2017-08-28 12:41:04,920 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (1/4)

Calc Engine: TaskManager status (1/4)

Calc Engine: 2017-08-28 12:41:13,775 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (0/4)

Calc Engine: TaskManager status (0/4)

Calc Engine: 2017-08-28 12:50:43,133 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@d191303-019.dc.gs.com:58084] has failed, address is now gated for [5000] ms. Reason: [Disassociated]



Logs:


Container id: container_e71_1503688027943_30786_01_000013

Exit code: 134

Stack trace: ExitCodeException exitCode=134:

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)

        at org.apache.hadoop.util.Shell.run(Shell.java:455)

        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

        at java.util.concurrent.FutureTask.run(FutureTask.java:262)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

        at java.lang.Thread.run(Thread.java:745)



Shell output: main : command provided 1

main : user is delp

main : requested yarn user is delp



Container exited with a non-zero exit code 134
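
For context, an exit code above 128 usually follows the shell convention of 128 plus the number of the signal that ended the process, so 134 would point to signal 6 (SIGABRT on Linux). The sketch below only does that arithmetic. The helper name `describe_exit_code` is made up for illustration, and it assumes the container launcher follows the 128+N convention, which the logs here do not confirm.

```python
import signal

def describe_exit_code(code):
    """Map a shell-style exit code to a signal name, assuming the 128+N convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the platform's name for this signal number.
            return signal.Signals(signum).name
        except ValueError:
            return "unknown signal %d" % signum
    return "exit status %d (no signal)" % code

print(describe_exit_code(134))
```

If that reading is right, the TaskManager JVM aborted itself rather than being killed by YARN, so the container's own stdout/stderr would be the next place to look.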



17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Total number of failed containers so far: 5

17/08/28 12:39:51 ERROR yarn.YarnFlinkResourceManager: Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.

17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Shutting down cluster with status FAILED : Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.

17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Unregistering application from the YARN Resource Manager

17/08/28 12:39:51 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue

java.lang.InterruptedException

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)

        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)

        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-016.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-013.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454

17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-010.dc.gs.com:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:41:04 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://flink@d191303-010.dc.gs.com:48786]

17/08/28 12:41:04 INFO yarn.YarnJobManager: Task manager akka.tcp://flink@d191303-010.dc.gs.com:48786/user/taskmanager terminated.

17/08/28 12:41:04 INFO instance.InstanceManager: Unregistered task manager d191303-010.dc.gs.com/10.79.252.104. Number of registered task managers 1. Number of available slots 2.

17/08/28 12:41:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://flink@d191303-016.dc.gs.com:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:41:13 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://flink@d191303-016.dc.gs.com:58367]

17/08/28 12:41:13 INFO yarn.YarnJobManager: Task manager akka.tcp://flink@d191303-016.dc.gs.com:58367/user/taskmanager terminated.

17/08/28 12:41:13 INFO instance.InstanceManager: Unregistered task manager d191303-016.dc.gs.com/10.79.162.181. Number of registered task managers 0. Number of available slots 0.

17/08/28 12:50:42 INFO yarn.YarnApplicationMasterRunner: RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard root cache directory /tmp/flink-web-d1eebf19-098f-419e-859e-101cfd6c0749

17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard jar upload directory /tmp/flink-web-4d9bcf76-ddcb-4dbe-b91d-4a8d8da3d716

17/08/28 12:50:42 INFO blob.BlobServer: Stopped BLOB server at 0.0.0.0:35815
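
To put a number on the gap, the timestamps in the application master log above can be subtracted directly. This is a minimal sketch, assuming the `yy/MM/dd HH:mm:ss` prefix seen on these lines. The function names `parse_ts` and `gap_seconds` are hypothetical, and the two input strings are abbreviated copies of the last task manager unregistration and the SIGTERM entry.

```python
from datetime import datetime

def parse_ts(line):
    # The first 17 characters hold the timestamp, e.g. "17/08/28 12:41:13".
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")

def gap_seconds(earlier, later):
    # Difference between the timestamps of two log lines, in seconds.
    return (parse_ts(later) - parse_ts(earlier)).total_seconds()

last_tm_gone = "17/08/28 12:41:13 INFO instance.InstanceManager: Unregistered task manager"
sigterm = "17/08/28 12:50:42 INFO yarn.YarnApplicationMasterRunner: RECEIVED SIGNAL 15: SIGTERM."

gap = gap_seconds(last_tm_gone, sigterm)
print("%d min %d s" % divmod(int(gap), 60))
```

Measured this way the gap is 569 seconds. Measured from the 12:39:51 shutdown decision instead, the same function gives 651 seconds.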




Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 *  (212) 902-5697