Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/11/21 16:32:05 UTC

[jira] [Created] (MAPREDUCE-4813) AM timing out during job commit

Jason Lowe created MAPREDUCE-4813:
-------------------------------------

             Summary: AM timing out during job commit
                 Key: MAPREDUCE-4813
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster
    Affects Versions: 2.0.1-alpha, 0.23.3
            Reporter: Jason Lowe
            Priority: Critical


The AM calls the output committer's {{commitJob}} method synchronously during JobImpl state transitions, which means the JobImpl write lock is held the entire time the job is being committed.  Holding the write lock prevents the RM allocator thread from heartbeating to the RM.  Therefore if committing the job takes too long (e.g.: the job has tons of files to commit and/or the namenode is bogged down) then the AM appears to be unresponsive to the RM and the RM kills the AM attempt.
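
The blocking described above can be reproduced in miniature. This is only a sketch, not the actual JobImpl code: the class and method names are hypothetical, and it assumes the heartbeat path needs the job's read lock while the dispatcher holds the write lock for the whole commit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the AM; none of these names come from Hadoop.
class BlockedHeartbeatSketch {

    /**
     * Simulates a commit that holds the write lock for commitMs and reports
     * whether a "heartbeat" could take the read lock within heartbeatWaitMs.
     */
    static boolean heartbeatSucceeds(long commitMs, long heartbeatWaitMs) {
        ReentrantReadWriteLock jobLock = new ReentrantReadWriteLock();
        CountDownLatch commitStarted = new CountDownLatch(1);

        // Plays the dispatcher: commits synchronously inside a state transition.
        Thread dispatcher = new Thread(() -> {
            jobLock.writeLock().lock();
            try {
                commitStarted.countDown();
                Thread.sleep(commitMs); // stand-in for a slow commitJob()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                jobLock.writeLock().unlock();
            }
        });
        dispatcher.start();

        try {
            commitStarted.await();
            // Plays the allocator thread: it needs job state before heartbeating.
            boolean gotLock =
                jobLock.readLock().tryLock(heartbeatWaitMs, TimeUnit.MILLISECONDS);
            if (gotLock) {
                jobLock.readLock().unlock();
            }
            dispatcher.join();
            return gotLock;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat during 500 ms commit: " + heartbeatSucceeds(500, 50));
        System.out.println("heartbeat during 10 ms commit:  " + heartbeatSucceeds(10, 2000));
    }
}
```

When the simulated commit outlasts the heartbeat's wait, the heartbeat never gets the lock, which is the situation in which the RM would consider the AM unresponsive.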

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli reassigned MAPREDUCE-4813:
--------------------------------------------------

    Assignee: Jason Lowe
    


[jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506299#comment-13506299 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4813:
----------------------------------------------------

bq. So I still think we need this
Agreed, also we need a fix before MAPREDUCE-4815 is resolved. So let's get this in. Looking at the patch now.
                


[jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507040#comment-13507040 ] 

Hadoop QA commented on MAPREDUCE-4813:
--------------------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12555453/MAPREDUCE-4813.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3085//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3085//console

This message is automatically generated.
                


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Attachment: MAPREDUCE-4813.patch

Patch that fixes the unit test failures and adds some testing of the new COMMITTING state.  As a bonus, most of the tests in TestJobImpl actually test a JobImpl object rather than a mock of it.

                


[jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504304#comment-13504304 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4813:
----------------------------------------------------

Started looking at the patch but realized something: once we fix MAPREDUCE-4815, commitJob won't be expensive anymore, will it? We still need to make sure that a hung DFS move doesn't make the AM time out, but I believe that is handled automatically, e.g. via RPC timeouts.
                


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Target Version/s: 2.0.3-alpha, 0.23.6
              Status: Patch Available  (was: Open)
    


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Attachment: MAPREDUCE-4813.patch

Posting a rough patch for comment.  It adds a new, internal COMMITTING state to JobImpl.  It's missing the state transition tests for the new state and also breaks a fair number of tests that are confused by the new state.  I wanted to get this out there for initial comment in case this isn't the direction people think this should go.
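
For readers following along, here is a rough sketch of the idea behind a COMMITTING wait state. It is not the patch itself: the class, enum and event names are made up for illustration, and it assumes a single dispatcher thread draining an event queue while the commit runs on its own thread and reports back with an event.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical miniature of a job state machine; not the real JobImpl.
class CommittingStateSketch {
    enum State { RUNNING, COMMITTING, SUCCEEDED, FAILED }
    enum Event { TASKS_DONE, COMMIT_COMPLETED, COMMIT_FAILED }

    private final BlockingQueue<Event> events = new LinkedBlockingQueue<>();
    private final Runnable committer;    // stand-in for the user's commitJob()
    private State state = State.RUNNING; // only touched by the dispatcher loop

    CommittingStateSketch(Runnable committer) {
        this.committer = committer;
    }

    // Every transition returns quickly; the slow commit runs elsewhere.
    private void handle(Event event) {
        switch (state) {
            case RUNNING:
                if (event == Event.TASKS_DONE) {
                    state = State.COMMITTING;
                    Thread commitThread = new Thread(() -> {
                        try {
                            committer.run();
                            events.add(Event.COMMIT_COMPLETED);
                        } catch (RuntimeException e) {
                            events.add(Event.COMMIT_FAILED);
                        }
                    }, "commit-thread");
                    commitThread.setDaemon(true);
                    commitThread.start();
                }
                break;
            case COMMITTING:
                if (event == Event.COMMIT_COMPLETED) {
                    state = State.SUCCEEDED;
                } else if (event == Event.COMMIT_FAILED) {
                    state = State.FAILED;
                }
                break;
            default:
                break; // terminal states ignore further events
        }
    }

    /** Drives the sketch from "all tasks done" to a terminal state. */
    static State runJob(Runnable committer) {
        CommittingStateSketch job = new CommittingStateSketch(committer);
        job.events.add(Event.TASKS_DONE);
        try {
            while (job.state != State.SUCCEEDED && job.state != State.FAILED) {
                job.handle(job.events.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return job.state;
    }

    public static void main(String[] args) {
        System.out.println(runJob(() -> { }));
        System.out.println(runJob(() -> { throw new RuntimeException("commit failed"); }));
    }
}
```

Because the dispatcher only waits on the queue while in COMMITTING, it stays free to process other events (and other threads are not locked out) for as long as the commit takes.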
                


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Status: Patch Available  (was: Open)
    


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Attachment: MAPREDUCE-4813.patch

Thanks for the review, Vinod!  I've attached a patch that hopefully addresses most of your comments.

I agree that abortJob, setupJob, etc. need to be handled too, as those could also take an arbitrary amount of time.  Adding a new top-level service, associated events for that service, and new state machine wait states will be a bit involved, and I'm keen on getting a fix for the now common case of long job commits.  If it's OK with you, I'd like to tackle that review comment in a separate JIRA.
                


[jira] [Updated] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-4813:
-----------------------------------------------

    Status: Open  (was: Patch Available)
    


[jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506328#comment-13506328 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4813:
----------------------------------------------------

Some comments on the patch:
 - Similar to JobCommitFailedEvent, add an event class for JOB_COMMIT_COMPLETED.
 - JobImpl.checkJobCompleteSuccess() and corresponding return variables should be renamed to mean checkIfJobReadyForCommit(). Similarly, checkJobForCompletion(job).
 - For now, we may just be addressing MAPREDUCE-4815, but the same argument of the committer being arbitrary user code is valid for other calls like abortJob and setupJob too. We will need states capturing those calls, and we will need to put them on separate threads so that the dispatcher isn't blocked. We can do that later, but to be future-proof, let's move the committer thread to a top-level service à la TaskCleaner. We may even re-purpose TaskCleanerImpl for this. Scope the effort and split it as you see fit.
 - Interrupting and joining the commit thread is only meaningful in the case of kill-during-commit, so let's move that code there. Also, earlier we never supported kill-during-commit, but now we do, and the patch puts a 60-second upper bound on commitJob() before abortJob(). Comparing this with 1.*, we do allow kill-during-commit there, as commit happens in a separate JVM. So interrupt and join seems fine; let's just put in a config so that we can tweak it if there is ever a need.
 - The test looks good. Can you extend it to include kill-during-commit too? That will also validate that the dispatcher isn't blocked anymore because of a long commit.
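
The interrupt-and-join handling for kill-during-commit discussed in the list above might look roughly like the sketch below. It is not taken from the patch: the class, method and parameter names are hypothetical, and `cancelTimeoutMs` stands in for whatever configurable upper bound is eventually chosen (the review mentions 60 seconds in the current patch).

```java
// Hypothetical helper; names and structure are illustrative only.
class KillDuringCommitSketch {

    /** Starts the (possibly slow) commit on its own daemon thread. */
    static Thread startCommit(Runnable commitJob) {
        Thread commitThread = new Thread(commitJob, "commit-thread");
        commitThread.setDaemon(true);
        commitThread.start();
        return commitThread;
    }

    /**
     * Interrupts the commit thread and waits at most cancelTimeoutMs for it
     * to exit. Returns true if the thread stopped in time, in which case the
     * caller could go on to abortJob(). cancelTimeoutMs must be positive,
     * because Thread.join(0) waits forever.
     */
    static boolean cancelCommit(Thread commitThread, long cancelTimeoutMs) {
        commitThread.interrupt();
        try {
            commitThread.join(cancelTimeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !commitThread.isAlive();
    }

    /** A fake commit that either honours or ignores interruption. */
    static Runnable slowCommit(boolean honoursInterrupt) {
        return () -> {
            while (true) {
                try {
                    Thread.sleep(60_000);
                    return; // commit "finished"
                } catch (InterruptedException e) {
                    if (honoursInterrupt) {
                        return; // give up the commit when asked to
                    }
                    // otherwise keep going, like user code that swallows interrupts
                }
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(cancelCommit(startCommit(slowCommit(true)), 2_000));
        System.out.println(cancelCommit(startCommit(slowCommit(false)), 200));
    }
}
```

The bounded join matters because the committer is arbitrary user code: it may ignore the interrupt entirely, and without an upper bound the kill would hang just as the commit did.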
                


[jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504635#comment-13504635 ] 

Jason Lowe commented on MAPREDUCE-4813:
---------------------------------------

MAPREDUCE-4815 only addresses FileOutputCommitter and friends, but the committer is arbitrary user code.  It could be doing all sorts of things including connecting to databases, etc.  So I still think we need this, although the priority of it is reduced given how many things are built from FileOutputCommitter.
                
