Posted to yarn-issues@hadoop.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2016/09/08 22:30:20 UTC

[jira] [Reopened] (YARN-5566) Client-side NM graceful decom is not triggered when jobs finish

     [ https://issues.apache.org/jira/browse/YARN-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter reopened YARN-5566:
---------------------------------

I've discovered that the tests added to {{TestResourceTrackerService}} in the branch-2.8 version of the patch have a race condition.  If a DECOMMISSIONING node receives the heartbeat that transitions it to DECOMMISSIONED, the node can be removed from the active node list quickly enough that, by the time the test code checks the node's status, the lookup returns null and the test fails.  This can easily be reproduced by adding a sleep between sending the heartbeat and waiting for the DECOMMISSIONED state.

I missed a small change to the {{waitForState}} method when I borrowed the tests from YARN-4676.  That change lets the test also look up nodes in the inactive node list, which is where DECOMMISSIONED nodes end up.
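To illustrate the shape of that fix, here is a minimal standalone sketch (not the actual {{TestResourceTrackerService}} code; the maps below are invented stand-ins for the RM's active and inactive node lists): the lookup falls through to the inactive list, so a node that has already been moved there no longer comes back null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class WaitForStateSketch {

  enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

  // Hypothetical stand-ins for the RM's active and inactive node maps.
  static Map<String, NodeState> activeNodes = new HashMap<>();
  static Map<String, NodeState> inactiveNodes = new HashMap<>();

  static NodeState lookupNode(String nodeId) {
    NodeState state = activeNodes.get(nodeId);
    if (state == null) {
      // The missing piece: a DECOMMISSIONED node has been moved to the
      // inactive list, so check there before giving up.
      state = inactiveNodes.get(nodeId);
    }
    return state;
  }

  static boolean waitForState(String nodeId, NodeState expected, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (lookupNode(nodeId) == expected) {
        return true;
      }
      TimeUnit.MILLISECONDS.sleep(10);
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulate the race: by the time the test checks, the node has already
    // moved from the active to the inactive list.
    inactiveNodes.put("host1:1234", NodeState.DECOMMISSIONED);
    System.out.println(
        waitForState("host1:1234", NodeState.DECOMMISSIONED, 1000));  // true
  }
}
```

Without the inactive-list fallback, the same scenario would spin until the timeout and the test would fail on a null node.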

> Client-side NM graceful decom is not triggered when jobs finish
> ---------------------------------------------------------------
>
>                 Key: YARN-5566
>                 URL: https://issues.apache.org/jira/browse/YARN-5566
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>             Fix For: 2.8.0, 3.0.0-alpha2
>
>         Attachments: YARN-5566-branch-2.8-004.patch, YARN-5566.001.patch, YARN-5566.002.patch, YARN-5566.003.patch, YARN-5566.004.branch-2.8.patch, YARN-5566.004.patch
>
>
> I was testing the client-side NM graceful decommission and noticed that it was always waiting for the timeout, even if all jobs running on that node (or even the cluster) had already finished.
> For example:
> # JobA is running with at least one container on NodeA
> # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> NodeA enters DECOMMISSIONING state
> # JobA finishes at 6:00am and there are no other jobs running on NodeA
> # User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA
> NodeA should have decommissioned at 6:00am.
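The intended behavior in the scenario above can be sketched as follows. This is a hypothetical simplification, not the actual ResourceManager code; the node-state and container bookkeeping here is invented for illustration. When a container finishes on a DECOMMISSIONING node and no containers remain, the node should be decommissioned immediately rather than waiting out the client-side timeout.

```java
import java.util.HashMap;
import java.util.Map;

public class DecomSketch {

  enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

  static Map<String, NodeState> nodeStates = new HashMap<>();
  static Map<String, Integer> runningContainers = new HashMap<>();

  // Hypothetical hook invoked when a container completes on a node.
  static void onContainerFinished(String nodeId) {
    int remaining = runningContainers.merge(nodeId, -1, Integer::sum);
    // If the node is draining and nothing is left running, complete the
    // decommission now instead of waiting for the timeout (the bug was
    // that this check never fired, so NodeA sat idle until 8:00am).
    if (remaining == 0 && nodeStates.get(nodeId) == NodeState.DECOMMISSIONING) {
      nodeStates.put(nodeId, NodeState.DECOMMISSIONED);
    }
  }

  public static void main(String[] args) {
    nodeStates.put("NodeA", NodeState.DECOMMISSIONING);
    runningContainers.put("NodeA", 1);
    onContainerFinished("NodeA");  // JobA's last container finishes
    System.out.println(nodeStates.get("NodeA"));  // DECOMMISSIONED
  }
}
```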



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org