Posted to commits@samza.apache.org by "Jagadish (JIRA)" <ji...@apache.org> on 2017/07/14 17:40:00 UTC

[jira] [Updated] (SAMZA-1359) Handle phantom container notifications cleanly during RM fail-over

     [ https://issues.apache.org/jira/browse/SAMZA-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jagadish updated SAMZA-1359:
----------------------------
    Description: 
At LinkedIn, we have observed phantom notifications for containers that Samza is not actively managing. We should improve our AppMaster's logic to be resilient to these phantom notifications. Currently, we act on them - for example, we increment the number of failed containers and mark the job as unhealthy. These actions fire false alerts, and automated remediation kicks in based on those alerts. At LinkedIn, this has affected our ability to upgrade the Samza cluster because several jobs have to be restarted owing to these phantom notifications.
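
A minimal sketch of the kind of guard we could add (hypothetical; the class and method names below are illustrative, not the actual Samza code): before acting on a completed-container callback, check that the YARN container ID maps to a container the AppMaster is currently managing, and ignore the notification otherwise.

{code}
// Hypothetical sketch of the proposed guard; PhantomNotificationGuard,
// runningContainers and markFailedAndRequestReplacement are illustrative
// names, not the actual Samza classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PhantomNotificationGuard {
  // YARN container ID -> Samza logical container ID, for containers we launched.
  private final Map<String, Integer> runningContainers = new ConcurrentHashMap<>();

  public void onContainerCompleted(String yarnContainerId, int exitCode) {
    Integer samzaContainerId = runningContainers.remove(yarnContainerId);
    if (samzaContainerId == null) {
      // Phantom/duplicate notification: we are not managing this container, so
      // do not touch failed-container counts or job-health metrics.
      return;
    }
    markFailedAndRequestReplacement(samzaContainerId, exitCode);
  }

  private void markFailedAndRequestReplacement(int samzaContainerId, int exitCode) {
    // Increment the failed-container count, update job health, and request a
    // replacement container for this logical container ID.
  }
}
{code}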

Here's the investigation during an upgrade and a rolling bounce of the Samza ResourceManagers.
1. Samza starts container 13 on a certain host for this job.
{code}
2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner [INFO] Claimed container ID 13 for container container_1490916233553_1471_01_000016 on node ltx1-app19956.prod.linkedin.com (http://ltx1-app19956.prod.linkedin.com:8042/node/containerlogs/container_1490916233553_1471_01_000016).
{code}

2. YARN reports that container 13 is lost due to a failure of the NodeManager on the host where the container was running.

{code}
2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: 13
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Removing container ID 13 from completed containers
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Matching container ID found 13 org.apache.samza.clustermanager.SamzaResource@e4e2d63a
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Got an exit code of -100. This means that container container_1490916233553_1471_01_000016 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. or due to preemption by the RM
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Released container container_1490916233553_1471_01_000016 was assigned task group ID 13. Requesting a refactor container for the task group.
2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401
{code}
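
For reference, the -100 in this notification corresponds to ContainerExitStatus.ABORTED in the YARN API (container released by the AM or lost with its node). A small sketch of how a completed status can be classified before deciding how to react (assumes hadoop-yarn-api on the classpath; this is not the actual Samza handling):

{code}
// Sketch only: classifies a completed-container status using the public YARN
// exit-status constants; the mapping to behavior is illustrative.
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class ExitStatusClassifier {
  public static String classify(ContainerStatus status) {
    switch (status.getExitStatus()) {
      case ContainerExitStatus.SUCCESS:   return "completed normally";
      case ContainerExitStatus.ABORTED:   return "released by the AM or lost with its node (-100)";
      case ContainerExitStatus.PREEMPTED: return "preempted by the RM";
      default:                            return "failed with exit code " + status.getExitStatus();
    }
  }
}
{code}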

3. Samza asks YARN to restart container 13 on a different host.

{code}
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Requesting resources on  ANY_HOST for container 13
2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] Received new token for : ltx1-app19939.prod.linkedin.com:1158
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container allocated from RM on ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, requestTimestampMs=1496988880401} 
2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. Assigning request for container_id 13 with timestamp 1496988880401 to resource container_1490916233553_1471_01_000027
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 13 on hostname ltx1-app19939.prod.linkedin.com
2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] Got available container ID (13) for container: Container: [ContainerId: container_1490916233553_1471_01_000027, NodeId: ltx1-app19939.prod.linkedin.com:1158, NodeHttpAddress: ltx1-app19939.prod.linkedin.com:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.149.196.44:1158 }, ]
{code}
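
For context, a request for ANY_HOST is simply a YARN container request with no node or rack constraint. A minimal, hypothetical sketch using the Hadoop AMRMClient API (not the actual Samza request path):

{code}
// Hypothetical sketch: build an "any host" container request with the Hadoop
// AMRMClient API; AnyHostRequest is an illustrative name.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AnyHostRequest {
  public static ContainerRequest create(int memoryMb, int vCores) {
    Resource capability = Resource.newInstance(memoryMb, vCores);
    // Passing null for nodes and racks means the container may be placed on any host.
    return new ContainerRequest(capability, null, null, Priority.newInstance(0));
  }
}
{code}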


4. Container 13 is re-started and runs on a different host.
{code}
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using command ./__package//bin/run-container.sh
2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using environment variables: 
ltx1-app1348.prod.linkedin.com
ltx1-app1355.prod.linkedin.com
ltx1-app20440.prod.linkedin.com
{code}


5. A cluster-wide upgrade happens and the machines in the cluster have to be bounced. We typically do a rolling bounce of the RMs to give the passive RM an opportunity to take ownership. RM fail-over happens and state is re-synced between the RMs; Samza's AppMaster (which had a connection to the previously active RM) resyncs its state with the newly active RM.

{code}
2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to fail over immediately.
java.net.ConnectException: Call From ltx1-app19958.prod.linkedin.com/10.149.196.108 to ltx1-app1348.prod.linkedin.com:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] ApplicationMaster is out of sync with ResourceManager, hence resyncing.
{code}

6. A duplicate (phantom) notification is received for the same container (note the same container ID: container_1490916233553_1471_01_000016). This notification declares that the container exited with code 154.

We get a 'Container Killed by ResourceManager' error with a non-zero exit code (154):

{code}
2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container Killed by ResourceManager
Container exited with a non-zero exit code 154
, ExitStatus: 154, ]
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: -1
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] No matching container id found for SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', diagnostics='Container Killed by ResourceManager
Container exited with a non-zero exit code 154
', exitCode=154}
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container failed for some reason. Let's start it again
2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 154 - Container Killed by ResourceManager
Container exited with a non-zero exit code 154
 containerID is -1
{code}

7. We act on this phantom notification (resetting job health and other metrics). This causes false alerts to fire, automated remediation kicks in, and all jobs in the cluster have to be restarted.
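
One way to make the completed-container callback idempotent across an RM resync (hypothetical sketch; CompletedContainerDeduper is an illustrative name) is to remember the YARN container IDs that have already been processed and drop repeats:

{code}
// Hypothetical sketch: remember processed YARN container IDs so that a
// notification redelivered after RM fail-over cannot re-trigger failure handling.
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CompletedContainerDeduper {
  private final Set<String> processed = Collections.newSetFromMap(new ConcurrentHashMap<>());

  /** Returns true only the first time a given container ID is seen. */
  public boolean shouldProcess(String yarnContainerId) {
    // Set.add returns false when the ID was already present, i.e. a duplicate.
    return processed.add(yarnContainerId);
  }
}
{code}

In practice this set would need to be bounded or expired over time, but it illustrates the idea.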

NodeManager scenario that triggered the phantom notification:

We have observed that when a NodeManager is restarted after a failure, NM recovery kicks in and tries to re-take ownership of the containers it was managing in its previous run. In some cases, a container may have been killed or may have exited before the NodeManager comes back up, which results in a timeout in the NM recovery process.

Here are the relevant logs:
a. The NM is killed due to some errors. Its last log entry is at 2017-06-09 06:01:47 and the next startup message appears at 06:17:29, so the NM was down from 6:01 to 6:17.
{code}
06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   host = ltx1-app19956.prod.linkedin.com/10.149.196.106
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.3
{code}


b. NM recovery kicks in and tries to re-own the previous containers.
{code}
2017-06-09 06:17:32,899 [INFO] [main] containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307) - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with exit code -1000
2017-06-09 06:17:33,115 [INFO] [main] containermanager.ContainerManagerImpl.waitForRecoveredContainers
2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304) - Adding container_1490916233553_1471_01_000016 to application application_1490916233553_1471
2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED
2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - Reacquiring container_1490916233553_1471_01_000016 with pid 23848
2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
2017-06-09 06:17:35,094 Sending diagnostics: [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0]
{code}

c. Recovery times out.
{code}
2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324) - Failed to read cgroup tasks file.
java.io.FileNotFoundException: /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
        at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - Unable to recover container container_1490916233553_1471_01_000016
java.io.IOException: Timeout while waiting for exit code from container_1490916233553_1471_01_000016
        at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning up container container_1490916233553_1471_01_000016
2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from KILLING to EXITED_WITH_FAILURE
2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to DONE
2017-06-09 06:17:37,493 [WARN] [AsyncDispatcher event handler] nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - USER=samza-trust-repartition      OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE    APPID=application_1490916233553_1471    CONTAINERID=container_1490916233553_1471_01_000016
{code}

d. When an RM fail-over happens and state is re-synced between the RMs, a phantom notification is triggered for this already-completed container.



> Handle phantom container notifications cleanly during RM fail-over
> ------------------------------------------------------------------
>
>                 Key: SAMZA-1359
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1359
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)