
Posted to yarn-issues@hadoop.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2014/01/10 20:22:15 UTC

[jira] [Commented] (YARN-1336) Work-preserving nodemanager restart

    [ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868167#comment-13868167 ] 

Ming Ma commented on YARN-1336:
-------------------------------

Jason, nice work and thanks for driving this. A couple of comments:

1. One of the scenarios for NM restart is an NM config update. For that scenario, it might be worth calling out that having the NM support dynamic config reload is one design option; not necessarily something we should do.

2. It seems your design is based on a quick NM restart, so there is no need to kill the existing containers during NM restart. That keeps the design simple. There is another scenario where we want to decommission the node and would like to preserve the state of long-running tasks. For that, the RM and AM will somehow need to know about it so that they can checkpoint the tasks and resume them on other nodes. Lots of work has been done in the preemption space for that. Is that something covered here?

3. ShuffleHandler support. ShuffleHandler is a component above YARN. There might be some scenarios where we just need to update the NM without updating ShuffleHandler, or the other way around. I don't know your approach. Would making ShuffleHandler an out-of-process service help? During NM restart the ShuffleHandler process just keeps running, and the NM creates a proxy object to reconnect to the ShuffleHandler process. If we end up having several AuxiliaryServices for different types of applications, the out-of-process approach also makes resource utilization easier to manage and reduces the impact of one type of AuxiliaryService on the others.
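
To make the out-of-process idea in point 3 a bit more concrete, here is a rough sketch of the reattach logic, written in Python only for brevity. It is not YARN code: ServiceRecord, ServiceRegistry, attach_or_launch, and the launch / is_alive callbacks are all hypothetical names, and the JSON file stands in for whatever state store the NM would actually use. The idea is that the NM persists where each auxiliary service process is running, and after a restart it reconnects to a process that is still alive instead of starting a new one.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ServiceRecord:
    """Where a (hypothetical) out-of-process auxiliary service is running."""
    name: str
    host: str
    port: int
    pid: int


class ServiceRegistry:
    """Persists service records so they survive a restart of the parent."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load_all(self) -> Dict[str, dict]:
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)

    def save(self, record: ServiceRecord) -> None:
        records = self._load_all()
        records[record.name] = asdict(record)
        # Write to a temp file and rename so a crash never leaves a torn file.
        tmp = self._path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(records, f)
        os.replace(tmp, self._path)

    def lookup(self, name: str) -> Optional[ServiceRecord]:
        raw = self._load_all().get(name)
        return ServiceRecord(**raw) if raw else None


def attach_or_launch(
    registry: ServiceRegistry,
    name: str,
    launch: Callable[[str], ServiceRecord],
    is_alive: Callable[[ServiceRecord], bool],
) -> Tuple[ServiceRecord, bool]:
    """Return (record, launched).

    Reattach to a recorded service that is still alive; otherwise launch
    a new one and persist its record for the next restart.
    """
    record = registry.lookup(name)
    if record is not None and is_alive(record):
        return record, False
    record = launch(name)
    registry.save(record)
    return record, True
```

On startup the NM side would call attach_or_launch once per configured auxiliary service and build its proxy from the returned host and port. How liveness is checked (pid probe, ping over the proxy connection, and so on) is left to the is_alive callback here on purpose, since that choice depends on the approach taken in the design.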

> Work-preserving nodemanager restart
> -----------------------------------
>
>                 Key: YARN-1336
>                 URL: https://issues.apache.org/jira/browse/YARN-1336
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Jason Lowe
>
> This serves as an umbrella ticket for tasks related to work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)