Posted to yarn-issues@hadoop.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/01/04 18:25:55 UTC

[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart

    [ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862359#comment-13862359 ] 

Bikas Saha commented on YARN-1489:
----------------------------------

Here is an idea:
The RM allows the app to send it some data during registration. This data could include the AM port information etc. The RM could then sync this data to the NMs during the NM heartbeat. The NM already maintains per-app-attempt info, and this data would be added to that. The app's containers running on a node could query their NM for this attempt data and get the information about the new app attempt. This would be a scalable and efficient solution.
The data per NM will be small, since it would be size-checked and proportional to the number of app attempts. The NM could give access to an attempt's data only to the containers that belong to that attempt. Only local containers should be able to communicate with their NM for such information. This could be done via a local access token that the NM supplies whenever it launches a container.
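
To make the flow concrete, here is a rough sketch in Python (not YARN code). All class names, method names and the size limit are hypothetical and only illustrate the idea. It also assumes that access is scoped per application, so that containers surviving from an earlier attempt can look up the newest attempt's data; the exact scoping would need to be decided.

```python
import secrets

# Hypothetical limit; the proposal only says the data would be size checked.
MAX_ATTEMPT_DATA_BYTES = 1024


class ResourceManager:
    """Toy RM: stores the data an AM attempt sends at registration."""

    def __init__(self):
        self.attempt_data = {}  # app_id -> (attempt_id, data)

    def register_am(self, app_id, attempt_id, data):
        # Reject oversized payloads so the per-NM state stays small.
        if len(data) > MAX_ATTEMPT_DATA_BYTES:
            raise ValueError("registration data exceeds size limit")
        self.attempt_data[app_id] = (attempt_id, data)

    def heartbeat_response(self, app_ids):
        # Send back data only for the apps the NM says it is running.
        return {a: self.attempt_data[a]
                for a in app_ids if a in self.attempt_data}


class NodeManager:
    """Toy NM: syncs attempt data on heartbeat and serves local containers."""

    def __init__(self, rm):
        self.rm = rm
        self.attempt_data = {}  # app_id -> (attempt_id, data)
        self.tokens = {}        # local access token -> app_id

    def launch_container(self, app_id):
        # The token is handed to the container at launch time.
        token = secrets.token_hex(16)
        self.tokens[token] = app_id
        return token

    def heartbeat(self):
        running_apps = set(self.tokens.values())
        self.attempt_data.update(self.rm.heartbeat_response(running_apps))

    def query_attempt_data(self, token):
        # Only holders of a token issued by this NM may read attempt data.
        app_id = self.tokens.get(token)
        if app_id is None:
            raise PermissionError("unknown local access token")
        return self.attempt_data.get(app_id)


if __name__ == "__main__":
    rm = ResourceManager()
    nm = NodeManager(rm)
    token = nm.launch_container("app_1")          # container outlives attempt 1
    rm.register_am("app_1", 2, b"am-port=4242")   # restarted AM registers
    nm.heartbeat()                                # NM picks up the new data
    print(nm.query_attempt_data(token))
```

In this sketch a container that holds no valid token gets a PermissionError, and a container whose NM has not yet heartbeated since the new AM registered gets None and would have to retry.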

> [Umbrella] Work-preserving ApplicationMaster restart
> ----------------------------------------------------
>
>                 Key: YARN-1489
>                 URL: https://issues.apache.org/jira/browse/YARN-1489
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two can potentially be done at the app level, but it is good to have a common solution for all apps wherever possible.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)