Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/06/02 22:03:59 UTC

[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update

     [ https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5197:
-----------------------------
    Attachment: YARN-5197.001.patch

RMNodeImpl checks the list of running containers reported by the node against launchedContainers but not vice versa, so containers that disappear from the node's reports are not detected.  Here's a patch that detects when the RM thinks more containers are running on the node than were reported and then finds the containers that are lost.  Each lost container generates a corresponding aborted completion event for the scheduler.  The search for lost containers is performed only when the counts indicate at least one is missing, so it's low cost in the normal case.
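
To illustrate the check described above, here is a minimal standalone sketch.  It is not the code in the patch: LostContainerDetector, containerLaunched and processHeartbeat are hypothetical names, container IDs are plain strings, and it assumes a heartbeat never lists the same container as both running and completed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the RM-side view of one node; not the real RMNodeImpl.
class LostContainerDetector {
    // Container IDs the RM believes are running on the node.
    private final Set<String> launchedContainers = new HashSet<>();

    void containerLaunched(String containerId) {
        launchedContainers.add(containerId);
    }

    // Processes one heartbeat and returns the IDs of containers the RM was
    // tracking that the node no longer reports (the "lost" containers).
    // Assumes 'running' and 'completed' are disjoint.
    List<String> processHeartbeat(Set<String> running, Set<String> completed) {
        launchedContainers.addAll(running);      // forward check: node -> RM
        launchedContainers.removeAll(completed); // normal completions

        List<String> lost = new ArrayList<>();
        // Every reported running container is tracked at this point, so a
        // larger tracked count means at least one container vanished.
        // The full scan runs only in that case.
        if (launchedContainers.size() > running.size()) {
            for (String id : launchedContainers) {
                if (!running.contains(id)) {
                    lost.add(id);
                }
            }
            launchedContainers.removeAll(lost);
        }
        return lost;
    }
}
```

In this sketch the caller would turn each returned ID into an aborted completion event for the scheduler; how that event is built is outside what the sketch covers.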

I updated MockNM as part of this patch since lots of tests were getting away with lazy mocking of a real NM.  They were only specifying container state deltas in the heartbeat and sending empty heartbeats between those state changes.  With this patch, the RM interprets those empty heartbeats as a loss of all actively running containers, which broke those tests.  The patch therefore also updates MockNM to track containers and continue reporting them until they have been marked completed, just like a real node should.  That was simpler than updating all the users of MockNM to maintain their list of active container statuses explicitly.
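
The tracking behavior described above can be sketched roughly as follows.  This is an illustration only, not the actual MockNM change: TrackingMockNode, its State enum, and the update and heartbeat methods are hypothetical, and the sketch assumes a completed container is reported once and then dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mock node that remembers container states between heartbeats.
class TrackingMockNode {
    enum State { RUNNING, COMPLETE }

    // Last known state of every container that has not finished reporting.
    private final Map<String, State> tracked = new LinkedHashMap<>();

    // Records a state delta supplied by a test.
    void update(String containerId, State state) {
        tracked.put(containerId, state);
    }

    // Builds the next heartbeat.  Running containers are repeated on every
    // heartbeat; a completed container is included once and then forgotten
    // (an assumption of this sketch).
    Map<String, State> heartbeat() {
        Map<String, State> report = new LinkedHashMap<>(tracked);
        tracked.values().removeIf(s -> s == State.COMPLETE);
        return report;
    }
}
```

With a mock like this, a test that supplies no new delta still sends a heartbeat listing its running containers, so the RM does not treat them as lost.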

> RM leaks containers if running container disappears from node update
> --------------------------------------------------------------------
>
>                 Key: YARN-5197
>                 URL: https://issues.apache.org/jira/browse/YARN-5197
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2, 2.6.4
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-5197.001.patch
>
>
> Once a node reports a container running in a status update, the corresponding RMNodeImpl will track the container in its launchedContainers map.  If the node somehow misses sending the completed container status to the RM and the container simply disappears from subsequent heartbeats, the container will leak in launchedContainers forever and the container completion event will not be sent to the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
