Posted to mapreduce-issues@hadoop.apache.org by "Karthik Kambatla (JIRA)" <ji...@apache.org> on 2014/01/13 18:54:52 UTC

[jira] [Commented] (MAPREDUCE-5718) MR AM should tolerate RM failover during commit

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869750#comment-13869750 ] 

Karthik Kambatla commented on MAPREDUCE-5718:
---------------------------------------------

The failure comes from the following snippet. If the previous AM started the commit but recorded neither success nor failure, we treat that as an error state.
{code}
        if (commitSuccess) {
          shutDownMessage = "We crashed after successfully committing. Recovering.";
          forcedState = JobStateInternal.SUCCEEDED;
        } else if (commitFailure) {
          shutDownMessage = "We crashed after a commit failure.";
          forcedState = JobStateInternal.FAILED;
        } else {
          //The commit is still pending, commit error
          shutDownMessage = "We crashed durring a commit";
          forcedState = JobStateInternal.ERROR;
        }
{code}

To fix this, we can do either of
# Treat the lack of a success/failure file as an artifact of the previous commit failing due to the RM restart, and re-attempt the commit. The only downside seems to be when the commit itself is buggy: we'll end up trying to commit up to the number of attempts allowed.
# Make sure the AM deletes the commit file before failing. Given that the RM/NM kill the containers, making sure we delete the commit file before dying can be a little more involved.
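
For illustration only, the first option could look roughly like the sketch below. It is a standalone model of the decision, not the actual MRAppMaster code: the class name, the {{decide}} method, the {{RETRY_COMMIT}} outcome and the {{isLastAttempt}} flag are all hypothetical names introduced here, and falling back to ERROR on the last attempt is an assumption about how the retries would be bounded.

```java
// Standalone sketch of option 1; all names here are hypothetical.
class CommitRecoverySketch {

    // Stand-in for JobStateInternal. RETRY_COMMIT is NOT a real Hadoop state;
    // it only marks "re-attempt the commit" in this sketch.
    enum Outcome { SUCCEEDED, FAILED, ERROR, RETRY_COMMIT }

    // Decide what a new AM attempt should do, given which marker files the
    // previous attempt left behind after starting a commit.
    static Outcome decide(boolean commitSuccess, boolean commitFailure,
                          boolean isLastAttempt) {
        if (commitSuccess) {
            // Previous attempt crashed after a successful commit.
            return Outcome.SUCCEEDED;
        } else if (commitFailure) {
            // Previous attempt crashed after recording a commit failure.
            return Outcome.FAILED;
        }
        // Neither marker exists: assume the previous attempt was killed
        // mid-commit (for example by an RM failover) and retry, unless no
        // attempts remain (assumed bound, see lead-in).
        return isLastAttempt ? Outcome.ERROR : Outcome.RETRY_COMMIT;
    }

    public static void main(String[] args) {
        // Commit started, no success/failure marker, attempts remaining.
        System.out.println(decide(false, false, false));
    }
}
```

With this shape, a buggy commit still ends in ERROR, but only after the allowed attempts are used up, which is the downside noted in the first option.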

[~revans2] - do you think it is reasonable to go with the first option? 

> MR AM should tolerate RM failover during commit
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>              Labels: ha
>
> While testing RM HA, we ran into this issue where if the RM fails over while an MR AM is in the middle of a commit, the subsequent AM gets spawned but dies with a diagnostic message - "We crashed durring a commit". 


