Posted to yarn-issues@hadoop.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2013/04/04 21:20:16 UTC

[jira] [Commented] (YARN-540) RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event

    [ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622681#comment-13622681 ] 

Bikas Saha commented on YARN-540:
---------------------------------

This is a known issue. The problem here is that the RM state store is essentially a write-ahead log. But in the application unregister/finish case, the application has already finished before the RM stores that fact in its state. So the RM by itself cannot avoid this problem. Since it's a race condition, we may choose not to fix it unless we see this happen often in practice.
The solutions that come to mind are:
1) finishApplicationMaster() blocks until the finish is stored in the store. This risks getting blocked on a slow or unavailable store. Also, the RM does a bunch of other things before an application finishes, so it may not be able to remove the application from the store until all those steps are complete.
2) finishApplicationMaster() becomes a 2-step process in which, in the second step, the app waits for the RM to change the app's state to "FINISHED" before exiting.
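
To make the race and option 2 concrete, here is a minimal toy sketch in Java. It is not the YARN API: ToyRM, isFinishRecorded(), processRemoveEvents(), restartAndRecover() and finishTwoStep() are hypothetical names invented for illustration, and "finish recorded" is modeled simply as the app having been removed from an in-memory store. It only shows why a one-step finish can leave stale state after a restart, and how a confirm-before-exit step would close that window.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race and of option 2. None of these names are YARN APIs. */
public class TwoStepFinishSketch {

    /** Hypothetical stand-in for the RM plus its state store. */
    static class ToyRM {
        private final Set<String> store = new HashSet<>();          // apps persisted in the store
        private final Set<String> pendingRemoval = new HashSet<>(); // queued REMOVE_APP events

        void submit(String appId) { store.add(appId); }

        /** Step 1: the AM unregisters; the removal is only queued, not yet persisted. */
        void finishApplicationMaster(String appId) { pendingRemoval.add(appId); }

        /** The dispatcher drains queued REMOVE_APP events into the store. */
        void processRemoveEvents() {
            store.removeAll(pendingRemoval);
            pendingRemoval.clear();
        }

        /** Step 2 query: true once the finish is reflected in the store. */
        boolean isFinishRecorded(String appId) { return !store.contains(appId); }

        /** Simulated restart: queued events are lost, only the store survives. */
        Set<String> restartAndRecover() {
            pendingRemoval.clear();
            return new HashSet<>(store);
        }
    }

    /** Option 2: after unregistering, the AM polls (up to maxPolls) until the RM confirms. */
    static boolean finishTwoStep(ToyRM rm, String appId, int maxPolls) {
        rm.finishApplicationMaster(appId);
        for (int i = 0; i < maxPolls; i++) {
            if (rm.isFinishRecorded(appId)) return true;
            rm.processRemoveEvents(); // stands in for the RM making progress between polls
        }
        return rm.isFinishRecorded(appId);
    }

    public static void main(String[] args) {
        // One-step finish: the RM restarts before REMOVE_APP is processed,
        // so the finished app is still in the store and gets reloaded.
        ToyRM racy = new ToyRM();
        racy.submit("app_1");
        racy.finishApplicationMaster("app_1");
        System.out.println("one-step, recovered after restart: " + racy.restartAndRecover());

        // Two-step finish: the AM exits only after the store reflects the finish.
        ToyRM safe = new ToyRM();
        safe.submit("app_2");
        boolean confirmed = finishTwoStep(safe, "app_2", 3);
        System.out.println("two-step, confirmed: " + confirmed
                + ", recovered after restart: " + safe.restartAndRecover());
    }
}
```

The bounded poll count is a design choice in this sketch: it keeps the app from waiting forever on a slow or unavailable store, which is the concern raised for option 1.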
                
> RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> When a job succeeds and successfully calls finishApplicationMaster, but the RM is shut down and the restart-dispatcher is stopped before it can process the REMOVE_APP event, the next time the RM comes back it will reload the existing state files even though the job has succeeded.
