Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/11/04 20:11:27 UTC

[jira] [Updated] (YARN-4331) Restarting NodeManager leaves orphaned containers

     [ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-4331:
-----------------------------
    Summary: Restarting NodeManager leaves orphaned containers  (was: Killing NodeManager leaves orphaned containers)

Note that killing the nodemanager itself with SIGKILL should not, by itself, cause the containers to be killed.  Instead the problem seems to be that when the nodemanager restarts it either fails to reacquire the containers that were running, or it reacquires them but the RM fails to tell the NM to kill them when it re-registers.  Updating the summary accordingly.  Also, by "the AM and its container" I assume you mean the application master and some other container that the AM launched.  Please correct me if I'm wrong.

Is work-preserving nodemanager restart enabled on this cluster?  Without it a nodemanager cannot track the containers that were previously running, so after a restart it will not be able to reacquire and kill them.  If those containers don't exit on their own they will "leak" and continue running outside of YARN's knowledge.  If that feature is not enabled on the nodemanager then this behavior is expected, since killing it with SIGKILL gave the nodemanager no chance to perform any container cleanup on its own.
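
For reference, here is a minimal sketch of how to check the relevant settings from Java; it assumes the standard 2.7.x property names (yarn.nodemanager.recovery.enabled and yarn.nodemanager.recovery.dir) and that yarn-site.xml is on the classpath, and the class itself is hypothetical:

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical helper, not from the Hadoop source tree: prints whether
// work-preserving NM restart is configured on this node.
public class NmRecoveryCheck {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // Work-preserving NM restart is off by default in 2.7.x.
    boolean recoveryEnabled =
        conf.getBoolean("yarn.nodemanager.recovery.enabled", false);
    // Local directory where the NM persists container state for recovery.
    String recoveryDir = conf.get("yarn.nodemanager.recovery.dir", "(unset)");
    System.out.println("recovery enabled: " + recoveryEnabled
        + ", recovery dir: " + recoveryDir);
  }
}
{code}

If I recall correctly, work-preserving restart also wants yarn.nodemanager.address pinned to a fixed port rather than an ephemeral one, so the restarted NM comes back on the same address.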

If restart is enabled on the nodemanager then this behavior could be correct if the running application told the RM that its containers should not be killed when AM attempts fail.  In that case the container is intentionally left running, and it's up to the AM to reacquire it by some means.  (I believe the RM does provide a bit of help there in the AM-RM protocol.)
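
To illustrate that path: the public API names below (setKeepContainersAcrossApplicationAttempts and getContainersFromPreviousAttempts) are real, but the surrounding code is a rough, hypothetical sketch rather than anything Samza actually does:

{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class KeepContainersSketch {

  // Client side: ask the RM to leave the app's containers running when an
  // AM attempt fails, instead of killing them.
  static void optIn(ApplicationSubmissionContext appContext) {
    appContext.setKeepContainersAcrossApplicationAttempts(true);
  }

  // AM side: on registration the RM hands back any containers that survived
  // the previous attempt; the new AM is responsible for reacquiring them.
  static List<Container> reacquire(
      AMRMClient<AMRMClient.ContainerRequest> amRmClient,
      String host, int port, String trackingUrl) throws Exception {
    RegisterApplicationMasterResponse resp =
        amRmClient.registerApplicationMaster(host, port, trackingUrl);
    return resp.getContainersFromPreviousAttempts();
  }
}
{code}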

If the containers were supposed to be killed when the AM attempt failed then we need to figure out which of the two possibilities above is the problem.  Could you look in the NM logs and see whether it said it was able to reacquire the previously running containers when it came back up?  If it didn't then we need to figure out why, and log snippets from around the restart/recovery would be a big help.  If it did reacquire the containers and registered with the RM with those containers, then apparently the RM didn't tell the NM to kill the undesired ones.  In that case the log from the RM side around the time the NM re-registered would be helpful.

> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>
>                 Key: YARN-4331
>                 URL: https://issues.apache.org/jira/browse/YARN-4331
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.7.1
>            Reporter: Joseph
>            Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by killing the nodemanager.
> I'm running YARN 2.7.1 with RM state stored in ZooKeeper, and I am deploying Samza jobs.
> Steps:
> {quote}1. Deploy a job
> 2. Issue a kill -9 signal to the nodemanager
> 3. We should see the AM and its container still running without the nodemanager
> 4. The AM should die, but the container still keeps running
> 5. Restarting the nodemanager brings up a new AM and container but leaves the orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.


