Posted to yarn-issues@hadoop.apache.org by "D M Murali Krishna Reddy (Jira)" <ji...@apache.org> on 2021/06/22 10:48:00 UTC

[jira] [Commented] (YARN-10825) Yarn Service containers not getting killed after NM shutdown

    [ https://issues.apache.org/jira/browse/YARN-10825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367223#comment-17367223 ] 

D M Murali Krishna Reddy commented on YARN-10825:
-------------------------------------------------

From my analysis, with yarn.nodemanager.recovery.supervised enabled, MapReduce jobs behave as follows. Some time after the NM is shut down, the RM marks the node as lost. On UPDATED_NODES_TRANSITION, the AM removes all task attempts for containers launched on the lost node and launches the next task attempt. When an old container then sends a *status update*, the AM treats it as an illegal task, and TaskAttemptListenerImpl returns feedback with taskFound set to false. In Task.java, the container then kills itself.
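
To make the MR flow above concrete, here is a minimal, self-contained sketch of it. It is only a simplified model and not the Hadoop source: the class and method names (OrphanedAttemptSketch, AppMaster, Task, Feedback, onNodeLost, heartbeat) and the attempt IDs are hypothetical, and only the taskFound flag is taken from the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the MapReduce behaviour described above.
// All names here are hypothetical; this is not the actual Hadoop code.
public class OrphanedAttemptSketch {

    /** Reply to a status update; taskFound mirrors the flag mentioned above. */
    static final class Feedback {
        final boolean taskFound;

        Feedback(boolean taskFound) {
            this.taskFound = taskFound;
        }
    }

    /** Stand-in for the AM side (the role TaskAttemptListenerImpl plays in MR). */
    static final class AppMaster {
        private final Set<String> knownAttempts = new HashSet<>();

        void register(String attemptId) {
            knownAttempts.add(attemptId);
        }

        // Node reported lost: forget the attempt that ran there
        // and start tracking its replacement.
        void onNodeLost(String lostAttemptId, String replacementAttemptId) {
            knownAttempts.remove(lostAttemptId);
            knownAttempts.add(replacementAttemptId);
        }

        // A status update from an attempt the AM no longer tracks
        // is answered with taskFound = false.
        Feedback statusUpdate(String attemptId) {
            return new Feedback(knownAttempts.contains(attemptId));
        }
    }

    /** Stand-in for the task side (the role Task.java plays in MR). */
    static final class Task {
        final String attemptId;
        boolean running = true;

        Task(String attemptId) {
            this.attemptId = attemptId;
        }

        // The task contacts the AM itself, so this still works
        // even when the NM on its node is down.
        void heartbeat(AppMaster am) {
            if (!am.statusUpdate(attemptId).taskFound) {
                running = false; // a real task would exit its process here
            }
        }
    }

    public static void main(String[] args) {
        AppMaster am = new AppMaster();
        Task oldTask = new Task("attempt_0");
        am.register(oldTask.attemptId);

        oldTask.heartbeat(am);
        System.out.println("before node loss, running = " + oldTask.running);

        am.onNodeLost("attempt_0", "attempt_1");
        oldTask.heartbeat(am);
        System.out.println("after node loss, running = " + oldTask.running);
    }
}
```

The point of the sketch is that the kill is driven by the container side: the old attempt learns it is orphaned from the AM's reply and stops on its own, without involving the NM.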

 

In YARN services, however, I couldn't find any direct communication from the container to the AM comparable to the *status update* in MR jobs. So I think the AM has no way to reach the container directly to get it killed. As far as I can see, the only path is from the AM to the RM, then from the RM to the NM, and from there to the container, which is not possible while the NM itself is down.

 

[~billie], [~eyang], [~prabhujoseph], could you have a look at this issue?

 

Thanks!

> Yarn Service containers not getting killed after NM shutdown
> ------------------------------------------------------------
>
>                 Key: YARN-10825
>                 URL: https://issues.apache.org/jira/browse/YARN-10825
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.1.1
>            Reporter: Sushanta Sen
>            Assignee: D M Murali Krishna Reddy
>            Priority: Major
>
> When yarn.nodemanager.recovery.supervised is enabled and the NM is shut down, new containers are launched after the RM sends the node lost event to the AM, but the existing containers on the lost node are not killed. The issue occurs only for YARN services; for normal jobs the behavior is correct.


