Posted to yarn-dev@hadoop.apache.org by "Anil Sadineni (Jira)" <ji...@apache.org> on 2020/03/23 01:15:00 UTC

[jira] [Resolved] (YARN-10205) NodeManager stateful restart feature did not work as expected - information only (Resolved)

     [ https://issues.apache.org/jira/browse/YARN-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anil Sadineni resolved YARN-10205.
----------------------------------
    Resolution: Not A Problem

> NodeManager stateful restart feature did not work as expected - information only (Resolved)
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10205
>                 URL: https://issues.apache.org/jira/browse/YARN-10205
>             Project: Hadoop YARN
>          Issue Type: Test
>          Components: graceful, nodemanager, rolling upgrade, yarn
>            Reporter: Anil Sadineni
>            Priority: Major
>
> *TL;DR* This is an information-only Jira on the stateful restart feature of the NodeManager. The unexpected behavior of this feature was caused by the systemd process configuration in this case. Please read below for more details - 
> Stateful restart of the NodeManager (YARN-1336) was introduced in Hadoop 2.6. This feature worked as expected in Hadoop 2.6 for us. Recently we upgraded our clusters from 2.6 to 2.9.2 along with some OS upgrades, and the feature was broken after the upgrade. One of the initial suspicions was LinuxContainerExecutor, as we started using it in this upgrade. 
> yarn-site.xml has all required configurations to enable this feature - 
> {{yarn.nodemanager.recovery.enabled: 'true'}}
> {{yarn.nodemanager.recovery.dir:'<nm_recovery_dir>'}}
> {{yarn.nodemanager.recovery.supervised: 'true'}}
> {{yarn.nodemanager.address: '0.0.0.0:8041'}}
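> For reference, the same properties expressed as a minimal yarn-site.xml sketch (the recovery directory is left as a placeholder, as above; it should point at local storage that survives an NM restart):
> {code:xml}
> <configuration>
>   <!-- Enable NodeManager recovery so running containers can be reacquired after an NM restart -->
>   <property>
>     <name>yarn.nodemanager.recovery.enabled</name>
>     <value>true</value>
>   </property>
>   <!-- Local directory where the NM persists its recovery state -->
>   <property>
>     <name>yarn.nodemanager.recovery.dir</name>
>     <value><nm_recovery_dir></value>
>   </property>
>   <!-- NM runs under supervision: do not clean up containers on NM exit, expect recovery on restart -->
>   <property>
>     <name>yarn.nodemanager.recovery.supervised</name>
>     <value>true</value>
>   </property>
>   <!-- Pin the NM port (rather than an ephemeral one) so containers can be reacquired after restart -->
>   <property>
>     <name>yarn.nodemanager.address</name>
>     <value>0.0.0.0:8041</value>
>   </property>
> </configuration>
> {code}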
> While containers were running and the NM was restarted, the exception below was constantly observed in the NodeManager logs - 
> {quote}
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_000043
> java.io.IOException: *Timeout while waiting for exit code from container_e37_1583181000856_0008_01_000043*
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_000018
> java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_000018
>         at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,242 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> 2020-03-05 17:45:18,243 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> {quote}
> After some digging into what was causing the missing exit file, we identified at the OS level that the running container processes go down as soon as the NM goes down. The process tree looked perfectly fine, as the container-executor forks the child processes as expected. We dug deeper into various parts of the code to see if anything there caused the failure. 
> One question was whether we had broken anything in our internal repo after we forked 2.9.2 from open source. We looked into different areas of the code such as the NM shutdown hook and cleanup process, the NM state store on container launch, NM aux services, the container-executor, and the shell launch and cleanup hooks. Everything looked fine. 
> Eventually we identified that the hadoop-nodemanager systemd unit was configured with the default KillMode, which is control-group. [https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=]
> This causes systemd to send a terminate signal to all processes in the unit's control group, including the container child processes, as soon as the NM daemon goes down, whether through a stop command or via kill -9. 
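> As an illustration only, a minimal systemd drop-in sketch that avoids this behavior by switching KillMode so that only the main NM process is signaled on stop (the unit name hadoop-nodemanager is taken from above; the exact override we applied may differ):
> {code}
> # /etc/systemd/system/hadoop-nodemanager.service.d/killmode.conf  (hypothetical drop-in path)
> [Service]
> # The default, control-group, signals every process in the unit's cgroup,
> # including the containers forked by container-executor.
> # "process" signals only the main NM process, leaving containers running
> # so the restarted NM can reacquire them.
> KillMode=process
> {code}
> After adding such a drop-in, run {{systemctl daemon-reload}} and restart the unit; the effective setting can be checked with {{systemctl show -p KillMode hadoop-nodemanager}}.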
> With that addressed, NM stateful restart works as expected. As part of the migration we moved all daemons from monit to systemd, and this bug appears to have been introduced around that time. 
> I am sharing this information here in case it is helpful to anyone who runs into a similar problem.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org