Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2014/01/15 16:37:21 UTC

[jira] [Commented] (MAPREDUCE-5718) MR AM should tolerate RM restart/failover during commit

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872189#comment-13872189 ] 

Jason Lowe commented on MAPREDUCE-5718:
---------------------------------------

This is closely related to MAPREDUCE-5485.  The problem here is that the output committer is user-pluggable code, and we can't assume what it does or whether it can be safely restarted after crashing midway through the commit.  This is one of the reasons job commits are not retried by the AM, and by extension we can't assume it's safe to retry in another AM attempt.  That's why the AM goes out of its way to indicate via a file that it's starting to do the job commit and avoids repeating it on an AM restart if that file is still present.  Whether the retry is because the AM crashed or because the AM was restarted due to an RM restart, the end effect is the same -- it's not safe to retry a job commit in the general case.
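
The marker-file guard described above can be sketched roughly as follows. This is an illustration only, not the actual MR AM code: the class name CommitGuard, the marker file name, and the use of a local directory in place of the job's staging directory are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a "commit started" marker; not the real AM implementation.
class CommitGuard {
    enum Decision { RUN_COMMIT, REFUSE_COMMIT_IN_DOUBT }

    private final Path marker;

    CommitGuard(Path stagingDir) {
        // Marker file name is made up for this example.
        this.marker = stagingDir.resolve("commit-started");
    }

    // Called by an AM attempt just before it would run the job commit.
    Decision beforeCommit() throws IOException {
        try {
            // createFile fails if the file already exists, so the check and
            // the creation happen in a single step.
            Files.createFile(marker);
            return Decision.RUN_COMMIT;
        } catch (FileAlreadyExistsException e) {
            // An earlier attempt began committing; we cannot know how far it
            // got, so the commit is not repeated.
            return Decision.REFUSE_COMMIT_IN_DOUBT;
        }
    }
}
```

With this guard the first attempt proceeds to commit, while any later attempt that still sees the marker refuses, regardless of whether the earlier attempt died from an AM crash or from an RM restart.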

If we had an API by which the output committer could tell the AM whether it's safe to retry a job commit, that would help.
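
One possible shape for such an API is sketched below. Everything in it is hypothetical: RetryAwareCommitter, isCommitRetrySafe, and CommitRecovery are names invented for this example and are not taken from Hadoop's OutputCommitter. The sketch assumes the safe default is "not retryable", so existing committers keep today's behavior unless they opt in.

```java
// Hypothetical interface; names are illustrative, not Hadoop's actual API.
interface RetryAwareCommitter {
    void commitJob() throws Exception;

    // Conservative default: a committer that says nothing is treated as
    // unsafe to retry, matching the current behavior.
    default boolean isCommitRetrySafe() {
        return false;
    }
}

// Hypothetical helper showing how a new AM attempt could use the hint.
class CommitRecovery {
    // previousCommitInDoubt is true when an earlier attempt started the job
    // commit and there is no record that it finished.
    static boolean shouldRetryCommit(RetryAwareCommitter committer,
                                     boolean previousCommitInDoubt) {
        return !previousCommitInDoubt || committer.isCommitRetrySafe();
    }
}
```

A committer whose commitJob is idempotent would override isCommitRetrySafe to return true, and only then would a restarted AM repeat a commit that is in doubt.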

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while an MR AM is in the middle of a commit, the subsequent AM gets spawned but dies with a diagnostic message - "We crashed durring a commit". 


