Posted to yarn-issues@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2016/09/01 16:59:20 UTC

[jira] [Commented] (YARN-5566) client-side NM graceful decom doesn't trigger when jobs finish

    [ https://issues.apache.org/jira/browse/YARN-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455996#comment-15455996 ] 

Junping Du commented on YARN-5566:
----------------------------------

From the description above, the root cause appears to be that the RM receives a container status after RMApp has performed its app-finished transition (which removes the app from runningApplications). The RM then adds the application back to RMNode's runningApplications and never removes it again. I am not 100% sure, as the RM log is not included.
[~rkanter], if you check the timestamps of the call to "runningApplications.add(containerAppId);" (in RMNodeImpl) and of AppFinishedTransition (in RMAppImpl) for the same app when this issue happens, you should reach the same conclusion. The current fix is the right one, as we should always check the application's status in the context before adding it to RMNode's runningApplications.
+1. The 004 patch LGTM. [~kasha], please feel free to commit it today; otherwise I will commit it tomorrow.
BTW, the patch for branch-2.8 should be slightly different. Robert, can you deliver one for 2.8 as well? Thanks!
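
For readers following the thread, the suspected ordering problem can be illustrated with a small standalone sketch. This is not the Hadoop code and not the 004 patch: the classes Context and Node, the map finishedByAppId, and the method names are hypothetical stand-ins for the RM context, RMNodeImpl, and the app-finished handling. The sketch only assumes what the comment describes, namely that a late container status re-adds a finished app unless the app's state is checked first.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RunningAppsSketch {

    /** Hypothetical stand-in for the RM context: appId -> "has finished" flag. */
    static final class Context {
        final Map<String, Boolean> finishedByAppId = new ConcurrentHashMap<>();
    }

    /** Hypothetical stand-in for the per-node bookkeeping discussed above. */
    static final class Node {
        final Set<String> runningApplications = new HashSet<>();
        private final Context context;
        private final boolean checkAppState;

        Node(Context context, boolean checkAppState) {
            this.context = context;
            this.checkAppState = checkAppState;
        }

        /** Called when a container status for appId arrives from the node. */
        void onContainerStatus(String appId) {
            if (checkAppState) {
                Boolean finished = context.finishedByAppId.get(appId);
                // Skip apps that are unknown to the context or already finished.
                if (finished == null || finished) {
                    return;
                }
            }
            runningApplications.add(appId);
        }

        /** Called when the app finishes; the node stops tracking it. */
        void onAppFinished(String appId) {
            runningApplications.remove(appId);
        }
    }

    /** Replays the suspected event order and returns what the node still tracks. */
    public static Set<String> replayLateStatus(boolean checkAppState) {
        Context ctx = new Context();
        Node node = new Node(ctx, checkAppState);

        ctx.finishedByAppId.put("app_1", false); // app is running
        node.onContainerStatus("app_1");         // regular status: app is tracked
        ctx.finishedByAppId.put("app_1", true);  // app finishes
        node.onAppFinished("app_1");             // node stops tracking it
        node.onContainerStatus("app_1");         // late status arrives afterwards
        return node.runningApplications;
    }

    public static void main(String[] args) {
        System.out.println("without check: " + replayLateStatus(false));
        System.out.println("with check:    " + replayLateStatus(true));
    }
}
```

Without the check, the replay leaves app_1 in the set after it has finished, so a node in this state would never look idle. With the check, the late status is ignored and the set ends up empty.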

> client-side NM graceful decom doesn't trigger when jobs finish
> --------------------------------------------------------------
>
>                 Key: YARN-5566
>                 URL: https://issues.apache.org/jira/browse/YARN-5566
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: YARN-5566.001.patch, YARN-5566.002.patch, YARN-5566.003.patch, YARN-5566.004.patch
>
>
> I was testing the client-side NM graceful decommission and noticed that it was always waiting for the timeout, even if all jobs running on that node (or even the cluster) had already finished.
> For example:
> # JobA is running with at least one container on NodeA
> # User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> NodeA enters DECOMMISSIONING state
> # JobA finishes at 6:00am and there are no other jobs running on NodeA
> # User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA
> NodeA should have decommissioned at 6:00am.
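
The timeline in the description can be worked through as a small calculation. The sketch below is illustrative only: the class DecomTimelineSketch and the method expectedDecommissionTime are hypothetical names, and it assumes the intended rule is "decommission when the last app on the node finishes, or at the timeout, whichever comes first". It uses LocalTime, so it ignores timelines that cross midnight.

```java
import java.time.Duration;
import java.time.LocalTime;

public class DecomTimelineSketch {

    /**
     * Returns the time at which the node is expected to leave DECOMMISSIONING:
     * the moment its last app finishes, bounded by the start time and the deadline.
     */
    public static LocalTime expectedDecommissionTime(
            LocalTime decomStart, Duration timeout, LocalTime lastAppFinish) {
        LocalTime deadline = decomStart.plus(timeout);
        if (lastAppFinish.isBefore(decomStart)) {
            // Nothing was running when decommissioning started.
            return decomStart;
        }
        return lastAppFinish.isBefore(deadline) ? lastAppFinish : deadline;
    }

    public static void main(String[] args) {
        LocalTime start = LocalTime.of(5, 0);      // decom requested at 5:00am
        Duration timeout = Duration.ofHours(3);    // 3 hour timeout
        LocalTime jobAFinish = LocalTime.of(6, 0); // JobA finishes at 6:00am

        System.out.println("expected: "
                + expectedDecommissionTime(start, timeout, jobAFinish));
        System.out.println("observed (timeout): " + start.plus(timeout));
    }
}
```

With the values from the example, the expected time is 6:00am, while the reported behavior matches the 8:00am deadline.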


