Posted to commits@samza.apache.org by "Jagadish (JIRA)" <ji...@apache.org> on 2017/07/25 18:23:00 UTC

[jira] [Resolved] (SAMZA-1359) Handle phantom container notifications cleanly during RM fail-over

     [ https://issues.apache.org/jira/browse/SAMZA-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jagadish resolved SAMZA-1359.
-----------------------------
    Resolution: Fixed

> Handle phantom container notifications cleanly during RM fail-over
> ------------------------------------------------------------------
>
>                 Key: SAMZA-1359
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1359
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Jagadish
>
> At LinkedIn, we have observed phantom notifications for containers that are not actively managed by Samza. We should improve the AppMaster's logic to be resilient to these phantom notifications. Currently, we take actions based on them - for example, we increment the number of failed containers, mark the job health as false, etc. These actions fire false alerts, which in turn trigger automated remediation. This has affected our ability to upgrade the Samza cluster at LinkedIn, because several jobs have to be restarted owing to these phantom notifications.
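> For illustration, here is a minimal, hypothetical sketch (names like TrackedResourceRegistry are made up and are not the actual AppMaster classes) of the kind of bookkeeping that would let completion callbacks distinguish containers this AM attempt is actively managing from stale ones:
> {code}
> import java.util.Map;
> import java.util.Optional;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical sketch: track the YARN container IDs this AppMaster attempt is
> // actively managing, so a completion notification can be checked against the
> // set before any failure counters or job-health flags are touched.
> public class TrackedResourceRegistry {
>   // YARN container ID (e.g. container_1490916233553_1471_01_000016) -> Samza container ID (e.g. 13)
>   private final Map<String, String> yarnIdToSamzaId = new ConcurrentHashMap<>();
>
>   public void onLaunch(String yarnContainerId, String samzaContainerId) {
>     yarnIdToSamzaId.put(yarnContainerId, samzaContainerId);
>   }
>
>   // Returns the Samza container ID only if the completed YARN container is one
>   // we are actively managing; phantom notifications return Optional.empty().
>   public Optional<String> complete(String completedYarnContainerId) {
>     return Optional.ofNullable(yarnIdToSamzaId.remove(completedYarnContainerId));
>   }
> }
> {code}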
> Here is the investigation from an upgrade during which the Samza ResourceManagers were rolling-bounced.
> 1. Samza starts container 13 on a certain host for this job.
> {code}
> 2017-06-07 19:46:20.226 [Container Allocator Thread] YarnContainerRunner [INFO] Claimed container ID 13 for container container_1490916233553_1471_01_000016 on node hostname_masked19956masked_host (http://hostname_masked19956masked_host:8042/node/containerlogs/container_1490916233553_1471_01_000016).
> {code}
> 2. Yarn reports that container 13 is lost due to a failure of its NodeManager on the host at which the container was running.
> {code}
> 2017-06-09 06:14:40.398 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container released on a *lost* node, ExitStatus: -100, ]
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: 13
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Removing container ID 13 from completed containers
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Matching container ID found 13 org.apache.samza.clustermanager.SamzaResource@e4e2d63a
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Got an exit code of -100. This means that container container_1490916233553_1471_01_000016 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. or due to preemption by the RM
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Released container container_1490916233553_1471_01_000016 was assigned task group ID 13. Requesting a refactor container for the task group.
> 2017-06-09 06:14:40.401 [AMRM Callback Handler Thread] SamzaResourceRequest [INFO] Resource Request created for 13 on ANY_HOST at 1496988880401
> {code}
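> For reference, the exit status -100 seen above is YARN's ContainerExitStatus.ABORTED. A simplified illustration of how the exit statuses that appear in this timeline can be classified (not the actual ContainerProcessManager logic):
> {code}
> import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
>
> // Illustration only: map the YARN exit statuses that appear in this timeline
> // to a coarse category.
> public class ExitStatusClassifier {
>   public static String classify(int exitStatus) {
>     switch (exitStatus) {
>       case ContainerExitStatus.SUCCESS:       // 0: clean exit
>         return "completed";
>       case ContainerExitStatus.ABORTED:       // -100: released by the AM or 'lost' due to node failure
>         return "released-or-lost";
>       case ContainerExitStatus.DISKS_FAILED:  // -101: local disks on the NM went bad
>         return "disks-failed";
>       case ContainerExitStatus.PREEMPTED:     // -102: preempted by the scheduler
>         return "preempted";
>       default:                                // e.g. the 154 seen later in this timeline
>         return "failed-with-exit-code-" + exitStatus;
>     }
>   }
> }
> {code}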
> 3. Samza makes a request to Yarn for restarting container 13 on a different host.
> {code}
> 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Requesting resources on  ANY_HOST for container 13
> 2017-06-09 06:14:40.402 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Making a request for ANY_HOST ANY_HOST
> 2017-06-09 06:14:42.400 [AMRM Heartbeater thread] AMRMClientImpl [INFO] Received new token for : hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container allocated from RM on hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container allocated from RM on hostname_masked19939masked_host
> 2017-06-09 06:14:42.401 [AMRM Callback Handler Thread] ResourceRequestState [INFO] Host affinity not enabled. Saving the samzaResource container_1490916233553_1471_01_000027 in the buffer for ANY_HOST
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Cancelling request SamzaResourceRequest{numCores=1, memoryMB=4000, preferredHost='ANY_HOST', requestID='f74a8fcf-be17-42eb-8824-1c32b4504933', containerID=13, requestTimestampMs=1496988880401} 
> 2017-06-09 06:14:44.310 [Container Allocator Thread] AbstractContainerAllocator [INFO] Found available resources on ANY_HOST. Assigning request for container_id 13 with timestamp 1496988880401 to resource container_1490916233553_1471_01_000027
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 13 on hostname hostname_masked19939masked_host
> 2017-06-09 06:14:44.310 [Container Allocator Thread] YarnContainerRunner [INFO] Got available container ID (13) for container: Container: [ContainerId: container_1490916233553_1471_01_000027, NodeId: hostname_masked19939masked_host:1158, NodeHttpAddress: hostname_masked19939masked_host:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service:  }, ]
> {code}
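> (For context: an ANY_HOST request like the one logged above boils down to an AMRMClient container request with no node or rack constraint. A rough sketch under that assumption; this is not Samza's actual YarnClusterResourceManager code.)
> {code}
> import org.apache.hadoop.yarn.api.records.Priority;
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
>
> // Illustration only: requesting a container on any host is a ContainerRequest
> // with null node and rack lists.
> public class AnyHostRequester {
>   public static void requestAnyHost(AMRMClientAsync<AMRMClient.ContainerRequest> amClient,
>                                     int memoryMb, int vCores) {
>     Resource capability = Resource.newInstance(memoryMb, vCores); // e.g. 4096 MB, 1 vcore as in the log
>     Priority priority = Priority.newInstance(0);                  // matches "Priority: 0" above
>     AMRMClient.ContainerRequest request =
>         new AMRMClient.ContainerRequest(capability, null /* nodes */, null /* racks */, priority);
>     amClient.addContainerRequest(request);
>   }
> }
> {code}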
> 4. Container 13 is re-started and runs on a different host.
> {code}
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using command ./__package//bin/run-container.sh
> 2017-06-09 06:14:44.311 [Container Allocator Thread] YarnContainerRunner [INFO] Container ID 13 using environment variables: 
> hostname_masked1348masked_host
> hostname_masked1355masked_host
> hostname_masked20440masked_host
> {code}
> 5. A cluster-wide upgrade happens and the machines in the cluster have to be bounced. We typically do a rolling bounce of the RMs to give the passive RM an opportunity to take ownership. An RM fail-over happens and state is re-synced from the passive to the active RM. Samza's AppMaster (which had a connection to the previously active RM) resyncs its state from the other RM.
> {code}
> 2017-06-14 23:18:20.281 [AMRM Heartbeater thread] RetryInvocationHandler [INFO] Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over yarn-prod-ltx1-samza-rm-2. Trying to fail over immediately.
> java.net.ConnectException: Call From hostname_masked19958masked_host/ to hostname_masked1348masked_host:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> 2017-06-14 23:19:39.203 [AMRM Heartbeater thread] AMRMClientImpl [WARN] ApplicationMaster is out of sync with ResourceManager, hence resyncing.
> {code}
> 6. A phantom notification (duplicate) is received for the same container (Note the same container ID: container_1490916233553_1471_01_000016). This notification declares that the container exited with code 154.
> {code}
> Get a Container killed by RM error with non zero exit code 154
> 2017-06-14 23:19:39.211 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Container completed from RM ContainerStatus: [ContainerId: container_1490916233553_1471_01_000016, State: COMPLETE, Diagnostics: Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> , ExitStatus: 154, ]
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] YarnClusterResourceManager [INFO] Completed container had ID: -1
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] No matching container id found for SamzaResourceStatus{resourceID='container_1490916233553_1471_01_000016', diagnostics='Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
> ', exitCode=154}
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container failed for some reason. Let's start it again
> 2017-06-14 23:19:39.212 [AMRM Callback Handler Thread] ContainerProcessManager [INFO] Container container_1490916233553_1471_01_000016 failed with exit code . 154 - Container Killed by ResourceManager
> Container exited with a non-zero exit code 154
>  containerID is -1
> {code}
> 7. We take action based on this phantom notification (and reset job health and other metrics). This fires automated false alerts and causes remediation to kick in. All jobs in the cluster then have to be restarted.
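> A hypothetical guard for this case (again a sketch, not the actual ContainerProcessManager): completion notifications whose YARN container ID is not in the set of actively managed containers are logged and dropped instead of being counted against job health.
> {code}
> import java.util.List;
> import java.util.Set;
> import org.apache.hadoop.yarn.api.records.Container;
> import org.apache.hadoop.yarn.api.records.ContainerStatus;
> import org.apache.hadoop.yarn.api.records.NodeReport;
> import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
>
> // Hypothetical callback handler: only containers this AM attempt launched are
> // allowed to affect failure counters or trigger replacement requests.
> public class PhantomAwareCallbackHandler implements AMRMClientAsync.CallbackHandler {
>   private final Set<String> activeYarnContainerIds; // maintained on launch/stop, e.g. by the registry sketched earlier
>
>   public PhantomAwareCallbackHandler(Set<String> activeYarnContainerIds) {
>     this.activeYarnContainerIds = activeYarnContainerIds;
>   }
>
>   @Override
>   public void onContainersCompleted(List<ContainerStatus> statuses) {
>     for (ContainerStatus status : statuses) {
>       String yarnId = status.getContainerId().toString();
>       if (!activeYarnContainerIds.remove(yarnId)) {
>         // Phantom / stale notification (e.g. replayed after an RM fail-over): ignore it.
>         continue;
>       }
>       // Known container: update failure metrics and request a replacement here.
>     }
>   }
>
>   // Remaining callbacks reduced to no-ops for brevity.
>   @Override public void onContainersAllocated(List<Container> containers) { }
>   @Override public void onShutdownRequest() { }
>   @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
>   @Override public void onError(Throwable e) { }
>   @Override public float getProgress() { return 0.0f; }
> }
> {code}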
> *What scenarios on the Node Manager side trigger the phantom notification?*
> We have observed that if a NodeManager is re-started after a failure, NM recovery kicks in and tries to take ownership of the containers it was managing in its previous run. In some cases, a container could have been killed or exited before the NodeManager comes back up, which results in a timeout in the NM recovery process.
> Here are the relevant logs:
> a. NM is killed due to some errors and auto re-started. It was down from 06:01 to 06:17.
> {code}
> 06:17:29,470 [INFO] [main] util.LogAdapter.info(LogAdapter.java:45) - STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NodeManager
> STARTUP_MSG:   host = hostname_masked19956masked_host
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 2.7.3
> {code}
> b. NM recovery kicks in and tries to take ownership of the previous containers
> {code}
> 2017-06-09 06:17:32,899 [INFO] [main] containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:307) - Recovering container_1490916233553_1471_01_000016 in state LAUNCHED with exit code -1000
> 2017-06-09 06:17:33,115 [INFO] [main] containermanager.ContainerManagerImpl.waitForRecoveredContainers
> 2017-06-09 06:17:33,274 [INFO] [AsyncDispatcher event handler] application.ApplicationImpl$InitContainerTransition.transition(ApplicationImpl.java:304) - Adding container_1490916233553_1471_01_000016 to application application_1490916233553_1471
> 2017-06-09 06:17:33,387 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZING to LOCALIZED
> 2017-06-09 06:17:33,427 [INFO] [ContainersLauncher #39] nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:181) - Reacquiring container_1490916233553_1471_01_000016 with pid 23848
> reqcquireing pid
> 2017-06-09 06:17:33,449 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from LOCALIZED to RUNNING
> 2017-06-09 06:17:35,094 Sending diagnostics
> [container_1490916233553_1471_01_000016, CreateTime: 1496864777558, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0]
> {code}
> c. Recovery times out because the container already exited (due to an unrelated error)
> {code}
> 2017-06-09 06:17:35,592 [WARN] [ContainersLauncher #39] util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:324) - Failed to read cgroup tasks file.
> java.io.FileNotFoundException: /cgroup/cpu/restrain.slice/container_1490916233553_1471_01_000016/tasks (No such file or directory)
>         at java.io.FileInputStream.open0(Native Method)
>         at java.io.FileInputStream.open(FileInputStream.java:195)
>         at java.io.FileInputStream.<init>(FileInputStream.java:138)
>         at java.io.FileInputStream.<init>(FileInputStream.java:93)
>         at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.checkAndDeleteCgroup(CgroupsLCEResourcesHandler.java:308)
>         at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.deleteCgroup(CgroupsLCEResourcesHandler.java:339)
>         at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.clearLimits(CgroupsLCEResourcesHandler.java:388)
>         at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.postExecute(CgroupsLCEResourcesHandler.java:402)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:354)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-06-09 06:17:36,608 [ERROR] [ContainersLauncher #39] launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:88) - Unable to recover container container_1490916233553_1471_01_000016
> java.io.IOException: Timeout while waiting for exit code from container_1490916233553_1471_01_000016
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:202)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:352)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-06-09 06:17:37,223 [INFO] [AsyncDispatcher event handler] launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:370) - Cleaning up container container_1490916233553_1471_01_000016
> 2017-06-09 06:17:37,465 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1490916233553_1471_01_000016 transitioned from KILLING to EXITED_WITH_FAILURE
> 2017-06-09 06:17:37,493 [INFO] [AsyncDispatcher event handler] container.ContainerImpl.handle(ContainerImpl.java:1135) - Container container_1477351569246_0429_02_000129 transitioned from EXITED_WITH_FAILURE to DONE
> 2017-06-09 06:17:37,493 [WARN] [AsyncDispatcher event handler] nodemanager.NMAuditLogger.logFailure(NMAuditLogger.java:150) - USER=samza-trust-repartition      OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE    APPID=application_1490916233553_1471    CONTAINERID=container_1490916233553_1471_01_000016
> {code}
> d. When an RM fail-over happens, the state is re-synced from the passive to the active RM, triggering a phantom notification.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)