Posted to yarn-issues@hadoop.apache.org by "Joseph (JIRA)" <ji...@apache.org> on 2015/11/05 12:57:27 UTC

[jira] [Commented] (YARN-4331) Restarting NodeManager leaves orphaned containers

    [ https://issues.apache.org/jira/browse/YARN-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991575#comment-14991575 ] 

Joseph commented on YARN-4331:
------------------------------

[~jlowe] Thanks for your comments, they were very helpful.
yarn.resourcemanager.work-preserving-recovery.enabled is indeed set to false. We disabled it because we run Samza jobs on the YARN cluster, and they don't work well with this feature turned on (https://issues.apache.org/jira/browse/SAMZA-750).

Apologies for my ignorance in this area, but if the application master (AM) is dead, shouldn't it be the responsibility of the container to kill itself? I'd imagine every container should be required to heartbeat to its application master and kill itself if it misses a few heartbeats.
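
To make the heartbeat idea concrete, here is a minimal sketch of a container-side watchdog. It is only an illustration: HeartbeatWatchdog, ping, max_missed and on_dead are hypothetical names, not YARN or Samza APIs, and how a container would actually reach its AM is left to the caller.

```python
import time


class HeartbeatWatchdog:
    """Counts consecutive missed heartbeats and gives up after a limit.

    All names here are hypothetical; `ping` stands in for whatever call
    a container would use to reach its application master.
    """

    def __init__(self, ping, max_missed=3, interval_s=1.0, sleep=time.sleep):
        self.ping = ping              # callable returning truthy if the AM answered
        self.max_missed = max_missed  # consecutive misses tolerated before giving up
        self.interval_s = interval_s  # pause between heartbeats
        self.sleep = sleep            # injectable so the loop can be tested quickly
        self.missed = 0

    def beat(self):
        """Send one heartbeat; return True while the container should keep running."""
        try:
            ok = bool(self.ping())
        except Exception:
            # An unreachable AM counts as a missed heartbeat.
            ok = False
        # A successful heartbeat resets the counter; a failure increments it.
        self.missed = 0 if ok else self.missed + 1
        return self.missed < self.max_missed

    def run(self, on_dead):
        """Heartbeat until too many are missed in a row, then call on_dead()."""
        while self.beat():
            self.sleep(self.interval_s)
        on_dead()
```

In a real container, on_dead would presumably flush state and exit the process (for example via sys.exit), so that an orphaned container stops processing once its AM has been gone for max_missed intervals.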


> Restarting NodeManager leaves orphaned containers
> -------------------------------------------------
>
>                 Key: YARN-4331
>                 URL: https://issues.apache.org/jira/browse/YARN-4331
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, yarn
>    Affects Versions: 2.7.1
>            Reporter: Joseph
>            Priority: Critical
>
> We are seeing a lot of orphaned containers running in our production clusters.
> I tried to simulate this locally on my machine and can replicate the issue by killing nodemanager.
> I'm running YARN 2.7.1 with RM state stored in ZooKeeper and deploying Samza jobs.
> Steps:
> {quote}1. Deploy a job 
> 2. Issue a kill -9 signal to nodemanager 
> 3. We should see the AM and its container running without nodemanager
> 4. AM should die but the container still keeps running
> 5. Restarting nodemanager brings up new AM and container but leaves the orphaned container running in the background
> {quote}
> This is effectively causing double processing of data.
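
As a rough illustration of how the leftover containers from the steps above might be spotted, the sketch below relies on the fact that a YARN container ID embeds its application attempt number (container_<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>, optionally with an epoch segment). It flags running containers whose attempt is older than the current one. The function names and the current_attempts mapping are hypothetical, and gathering the list of running container IDs on a node is assumed to happen elsewhere.

```python
import re

# container_<clusterTs>_<appSeq>_<attempt>_<containerSeq>, with an optional
# epoch segment such as container_e17_... after an RM restart.
_CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)$")


def parse_container_id(container_id):
    """Return (application_id, attempt_number) for a YARN container ID."""
    m = _CONTAINER_ID.match(container_id)
    if m is None:
        raise ValueError(f"not a YARN container ID: {container_id!r}")
    cluster_ts, app_seq, attempt, _container_seq = m.groups()
    return f"application_{cluster_ts}_{app_seq}", int(attempt)


def find_stale_containers(running_ids, current_attempts):
    """List running containers that belong to an older application attempt.

    running_ids      -- container IDs found running on a node (assumed input)
    current_attempts -- hypothetical mapping of application ID to the
                        attempt number that is currently active
    """
    stale = []
    for cid in running_ids:
        app_id, attempt = parse_container_id(cid)
        current = current_attempts.get(app_id)
        # Applications missing from the mapping are skipped, not flagged.
        if current is not None and attempt < current:
            stale.append(cid)
    return stale
```

This only identifies candidates by ID. Whether a flagged container is truly orphaned would still need to be confirmed against the ResourceManager before killing anything.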



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)