
Posted to yarn-issues@hadoop.apache.org by "Junping Du (JIRA)" <ji...@apache.org> on 2015/04/10 17:18:12 UTC

[jira] [Commented] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers

    [ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14489761#comment-14489761 ] 

Junping Du commented on YARN-3474:
----------------------------------

Can we just set "yarn.resourcemanager.connect.max-wait.ms" to a value larger than 900 seconds? A YARN admin could mistakenly forget to clear the flag, in which case applications and containers could stay pending forever. So whichever way we go, we need a timeout here (to guard against operator error).
For your concrete scenario, one interesting option is to let the admin extend the timeout while the cluster is running. That would probably have to go through a znode, because the RM is unavailable, but it could bring extra configuration complexity.
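
For reference, raising the limit statically would look roughly like the yarn-site.xml fragment below. The one-hour value is only an illustration, not a recommendation; the stock default is 900000 ms (15 minutes).

```xml
<!-- yarn-site.xml: illustrative value only -->
<property>
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <!-- 3600000 ms = 1 hour; the default is 900000 ms (15 minutes) -->
  <value>3600000</value>
</property>
```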

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes active and recovers apps and attempts. Apps are not affected.
> If some failure or bug prevents both RMs from starting normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340]; RM could not connect to ZK properly), the NM will kill the containers running on it once it has been unable to heartbeat with the RM for some time (the max retry time is 15 mins by default). Then all apps will be killed.
> In a production cluster, we might hit such cases, and fixing the bugs might take more than 15 mins. To keep apps from being affected and killed by the NM, a YARN admin could set a flag (a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell the NM to wait for the RM to come back and not kill running containers. After the bugs are fixed and the RM starts normally, the admin clears the flag.
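
To make the trade-off concrete, here is a minimal sketch of how the proposed flag could be combined with the upper bound asked for in the comment above. It is not NM code: the class, enum, and method names are hypothetical, the 6-hour flagged limit is an arbitrary assumption, and reading the znode is reduced to a boolean parameter.

```java
import java.time.Duration;

// Hypothetical sketch, not part of YARN: all names here are made up for illustration.
class WaitForRmPolicy {

    /** What the NM could do while heartbeats to the RM keep failing. */
    enum Action { KEEP_WAITING, KILL_CONTAINERS }

    /**
     * The wait flag does not disable the timeout; it only swaps in a longer one,
     * so a forgotten flag cannot leave containers pending forever.
     */
    static Action decide(Duration sinceLastHeartbeat,
                         boolean waitFlagSet,
                         Duration normalMaxWait,
                         Duration flaggedMaxWait) {
        Duration limit = waitFlagSet ? flaggedMaxWait : normalMaxWait;
        return sinceLastHeartbeat.compareTo(limit) < 0
                ? Action.KEEP_WAITING
                : Action.KILL_CONTAINERS;
    }

    public static void main(String[] args) {
        Duration normal = Duration.ofMinutes(15); // default max retry time from the issue
        Duration flagged = Duration.ofHours(6);   // assumed upper bound while the flag is set

        // 20 minutes without the RM, no flag: past the 15-minute limit.
        System.out.println(decide(Duration.ofMinutes(20), false, normal, flagged));
        // Same outage with the flag set: still inside the extended limit.
        System.out.println(decide(Duration.ofMinutes(20), true, normal, flagged));
        // Flag forgotten for 7 hours: the extended limit still expires.
        System.out.println(decide(Duration.ofHours(7), true, normal, flagged));
    }
}
```

With this shape, the open question is only where the extended limit comes from: a static config value, or something the admin can change while the RM is down.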



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)