Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/11/26 22:02:59 UTC

[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Attachment: MAPREDUCE-4813.patch

Posting a rough patch for comment.  It adds a new, internal COMMITTING state to JobImpl.  It's missing the state transition tests for the new state and also breaks a fair number of tests that are confused by it.  I wanted to get this out there for initial comment in case this isn't the direction people think this should go.
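
For readers who want a picture of the general idea, here is a minimal, standalone sketch of an intermediate COMMITTING state. It is not taken from the attached patch and does not use any Hadoop classes. The class name CommittingStateSketch, the methods onAllTasksDone, onCommitCompleted and getStateForHeartbeat, and the decision to run the commit on a separate executor thread are all assumptions made for illustration. The state is changed under the write lock, the lock is released, and only then does the slow commit run, so a reader that needs the lock (standing in for the heartbeat path) is not blocked for the duration of the commit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; not the JobImpl code from the patch.
public class CommittingStateSketch {
    enum State { RUNNING, COMMITTING, SUCCEEDED }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Assumption: the commit is handed to a dedicated thread.
    private final ExecutorService committer = Executors.newSingleThreadExecutor();
    private State state = State.RUNNING;

    // Transition taken when the last task finishes. The write lock is held
    // only long enough to record the new state, not for the commit itself.
    void onAllTasksDone(Runnable commitJob) {
        lock.writeLock().lock();
        try {
            state = State.COMMITTING;
        } finally {
            lock.writeLock().unlock();
        }
        // The potentially slow commit runs without holding the lock.
        committer.execute(() -> {
            commitJob.run();
            onCommitCompleted();
        });
    }

    // Second, short transition once the commit has returned.
    // A real implementation would also need a path for commit failure.
    private void onCommitCompleted() {
        lock.writeLock().lock();
        try {
            state = State.SUCCEEDED;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Stand-in for any caller that must take the read lock, such as a
    // heartbeat thread asking for the job's current state.
    State getStateForHeartbeat() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    void shutdown() throws InterruptedException {
        committer.shutdown();
        committer.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        CommittingStateSketch job = new CommittingStateSketch();
        CountDownLatch commitMayFinish = new CountDownLatch(1);

        // Simulate a commit that stays busy until the latch is released.
        job.onAllTasksDone(() -> {
            try {
                commitMayFinish.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The read succeeds while the commit is still in progress.
        System.out.println(job.getStateForHeartbeat()); // prints COMMITTING

        commitMayFinish.countDown();
        job.shutdown();
        System.out.println(job.getStateForHeartbeat()); // prints SUCCEEDED
    }
}
```

If the commit were instead run inside the locked section of onAllTasksDone, the first getStateForHeartbeat call would block until the commit returned, which is the behavior described in the issue below.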
                
> AM timing out during job commit
> -------------------------------
>
>                 Key: MAPREDUCE-4813
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4813.patch
>
>
> The AM calls the output committer's {{commitJob}} method synchronously during JobImpl state transitions, which means the JobImpl write lock is held the entire time the job is being committed.  Holding the write lock prevents the RM allocator thread from heartbeating to the RM.  Therefore, if committing the job takes too long (e.g., the job has a very large number of files to commit and/or the namenode is bogged down), the AM appears unresponsive to the RM and the RM kills the AM attempt.
