You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2012/05/12 06:40:28 UTC

[jira] [Created] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Tom White created MAPREDUCE-4252:
------------------------------------

             Summary: MR2 job never completes with 1 pending task
                 Key: MAPREDUCE-4252
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.1
            Reporter: Tom White
            Assignee: Tom White


This was found by ATM:

bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411477#comment-13411477 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #310 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/310/])
    svn merge -c 1359747. FIXES: MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359750)

     Result = UNSTABLE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359750
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410452#comment-13410452 ] 

Hadoop QA commented on MAPREDUCE-4252:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12535854/MAPREDUCE-4252.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2565//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2565//console

This message is automatically generated.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274804#comment-13274804 ] 

Bikas Saha commented on MAPREDUCE-4252:
---------------------------------------

I hesitate at the idea of introducing more events, unless absolutely necessary. FAILED seems to be a good enough event to inform the task that an attempt has failed. FAIL_FETCH_FAILURE now means for any other kind of failure in the future I would have to introduce more events and introduce more arcs in the state machine. Large state machines are hard to understand. 

While the change itself seems correct, I am not able to convince myself that it is the best fix.

What surprises me is that the success of an attempt did not end up terminating the concurrent attempts. I would expect the speculative attempt to be Killed during the call to AttemptTransitionSucceeded(). Did that not work?

Or was there a close race condition that the speculative task failed at the same time as the succeeded attempt completed? So the KILLED event from the SUCCEEDED task raced against the FAILED event in the TaskAttemptImpl state machine, with the FAILED event winning?

What do you think about the following approach? Allow MapRetroActiveFailureTransition to return SUCCEEDED as a possible state. In that transition, if the failed attempt is not the same as the attempt that SUCCEEDED, then that failure would not change the state of the Task. Task would remain in succeeded state. We can rename MapRetroActiveFailureTransition to something more appropriate.

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410600#comment-13410600 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #2438 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2438/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409781#comment-13409781 ] 

Hadoop QA commented on MAPREDUCE-4252:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12535714/MAPREDUCE-4252.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2560//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2560//console

This message is automatically generated.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410437#comment-13410437 ] 

Robert Joseph Evans commented on MAPREDUCE-4252:
------------------------------------------------

Assuming Jenkins comes back OK I am a +1 on this patch (I did my thorough review).
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444383#comment-13444383 ] 

Bikas Saha commented on MAPREDUCE-4252:
---------------------------------------

We could move the check for TaskType after the check for successful task attempt has been made. This is similar to the pattern for MapRetroActiveFailureTransition(). So the fix is a simple downward movement of the check for TaskType.

The check for whether the task is MAP should still be there (and is also there for MapRetroActiveFailureTransition()) because after a REDUCE has completed with SUCCEEDED there is no reason to fail it because its outputs are already preserved in HDFS. So any attempt to fail it would be a bug in the code.

We could also potentially improve the check of a successful attempt in MapRetroActiveKilledTransition() by using the condition in MapRetroActiveFailureTransition().

I opened YARN-61 to track this. Currently assigned to me because I wrote MapRetroactiveKilledTransition. [~chaku88] Feel free to assign this to yourself.

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.1.0-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410454#comment-13410454 ] 

Hadoop QA commented on MAPREDUCE-4252:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12535854/MAPREDUCE-4252.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2566//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2566//console

This message is automatically generated.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13275980#comment-13275980 ] 

Siddharth Seth commented on MAPREDUCE-4252:
-------------------------------------------

bq. Do you mean decrement numberUncompletedAttempts when a RUNNING attempt fails/killed after the task is in SUCCEEDED state? Normally they would trigger their own Failed/Killed transition that takes care of the decrement.
After a task goes to SUCCEEDED, FAILED/KILLED attempts are ignored.
1. attemp1 starts
2. speculative attempt starts
3. attempt 1 completes - Task moves to SUCCEEDED state
4. speculative attempt is KILLED
5. T_ATTEMPT_KILLED is ignored.
6. attemp1 1 fails with TOO_MANY_FETCH_FAILURES
The job will effectively hang, since a new task attempt isn't started.

@Tom, this is another case which leads to the same situation of a stuck job. I can open a separate jira if you plan to commit this patch as is.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-4252:
---------------------------------

    Attachment: MAPREDUCE-4252.patch

Well spotted, Bobby. I updated the patch to fail if there is an internal error, and I've also fixed the bug so the error doesn't occur any more.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-4252:
---------------------------------

    Status: Patch Available  (was: Open)
    
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410508#comment-13410508 ] 

Tom White commented on MAPREDUCE-4252:
--------------------------------------

Thanks Bobby.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-4252:
---------------------------------

    Attachment: MapReduce.png
                MAPREDUCE-4252.patch

The problem is that a speculative task attempt can cause a previously SUCCEEDED task to transition to SCHEDULED (or FAILED).

1. A task attempt is started.
2. A speculative task attempt for the same task is started.
3. The speculative task attempt completes and causes the task to transition to SUCCEEDED.
4. The initial task attempt fails and causes the task to transition to SCHEDULED.

No more task attempts are scheduled so the job never completes. I have written a unit test to demonstrate this, which fails.

The situation occurs because TaskEventType.T_ATTEMPT_FAILED can cause a task to transition from the SUCCEEDED state (see attached state diagram). This type is caused by one of:

* FailedTransition - when a task attempt fails 
* DeallocateContainerTransition - when a task attempt goes from assigned/unassigned to killed/failed
* TooManyFetchFailureTransition - when a reducer fails to get the map output

Only the last one seems like a good reason to transition a task from a previously SUCCEEDED state. The first two should leave a task in the SUCCEEDED state, and the third should be handled with a new type (T_ATTEMPT_FAILED_RETROSPECTIVELY or T_ATTEMPT_TOO_MANY_FETCH_FAILURE) which transitions to SCHEDULED (or FAILED).



                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4252:
-------------------------------------------

          Resolution: Fixed
       Fix Version/s: 3.0.0
                      2.0.1-alpha
                      0.23.3
    Target Version/s: 2.0.0-alpha, 0.23.3  (was: 0.23.3, 2.0.0-alpha)
              Status: Resolved  (was: Patch Available)

Thanks for the patch Tom.

I put this into trunk, branch-2 and branch-0.23.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "chackaravarthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443920#comment-13443920 ] 

chackaravarthy commented on MAPREDUCE-4252:
-------------------------------------------

Hi Tom,

This problem has been handled when speculative task launched for map task and other attempt got failed (not killed)
Can the similar kind of scenario can happen in case of reduce task?

Consider the following scenario for reduce task in case of speculation (one attempt got killed):
{quote}
1. A task attempt is started.
2. A speculative task attempt for the same task is started.
3. The first task attempt completes and causes the task to transition to SUCCEEDED.
4. Then speculative task attempt will be killed because of the completion of first attempt.
{quote}

As a result, internal error will be thrown from this attempt (*TaskImpl.MapRetroactiveKilledTransition*) and hence task attempt failure leads to job failure.

{code:title=TaskImpl.MapRetroactiveKilledTransition|borderStyle=solid}
if (!TaskType.MAP.equals(task.getType())) {
        LOG.error("Unexpected event for REDUCE task " + event.getType());
        task.internalError(event.getType());
      }
{code}

So, do we need to have following code in MapRetroactiveKilledTransition also just like in MapRetroactiveFailureTransition.

{noformat}
if (event instanceof TaskTAttemptEvent) {
        TaskTAttemptEvent castEvent = (TaskTAttemptEvent) event;
        if (task.getState() == TaskState.SUCCEEDED &&
            !castEvent.getTaskAttemptID().equals(task.successfulAttempt)) {
          // don't allow a different task attempt to override a previous
          // succeeded state
          return TaskState.SUCCEEDED;
        }
      }
{noformat}

please check whether this is a valid case and give your suggestion.

Thanks
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.1.0-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410173#comment-13410173 ] 

Hadoop QA commented on MAPREDUCE-4252:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12535714/MAPREDUCE-4252.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2564//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2564//console

This message is automatically generated.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-4252:
---------------------------------

    Attachment: MAPREDUCE-4252.patch

Here's a new patch which implements Bikas' suggestion of not allowing task attempt failures override a previously successful task attempt. This passes the same unit test as before.

Siddharth - please open another JIRA for the case you described above.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-4252:
---------------------------------

    Attachment: MAPREDUCE-4252.patch

Here's a patch that implements the suggested change in my previous comment. It passes the test I wrote that was otherwise failing; however, I'd like to hear if folks think this is the right way to fix the problem.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410793#comment-13410793 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2508 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2508/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Closed] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy closed MAPREDUCE-4252.
------------------------------------

    
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.2-alpha
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410669#comment-13410669 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #2456 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2456/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410835#comment-13410835 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #2441 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2441/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13275995#comment-13275995 ] 

Bikas Saha commented on MAPREDUCE-4252:
---------------------------------------

@Sid ,thanks for clarifying.

@Tom, Let me clarify a bit. When the speculative attempt succeeded, then AttemptSucceededTransition would have KILLED the other running attempts. In this case, the other attempt was not KILLED but it FAILED on its own.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274552#comment-13274552 ] 

Robert Joseph Evans commented on MAPREDUCE-4252:
------------------------------------------------

The patch looks good to me.  I have read through the code and looked at the state transition graphs and I don't see any issues with it.  +1

On a side note I would like to get this into branch-0.23 as well, patch applies with no problems there.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444326#comment-13444326 ] 

Siddharth Seth commented on MAPREDUCE-4252:
-------------------------------------------

Chackaravarthy, that does seem like a valid case - either in case of reduce speculation or a race between a reduce completing and a node being marked as unhealthy. The 'successful attempt' check already exists in MapRetroactiveKilledTransition. However, the REDUCE check will fail before this. Could you please open a separate jira. 
>From a quick look, ignoring the event for Reduce tasks should be sufficient.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.1.0-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409839#comment-13409839 ] 

Robert Joseph Evans commented on MAPREDUCE-4252:
------------------------------------------------

In the transitions for MapRetroactiveFailureTransition the postStates are listed as only TaskState.SCHEDULED and TaskState.FAILED.  TaskState.SUCCEEDED needs to also be in there. 

This is causing an exception which results in internal error event to be sent, that we are not detecting in the tests.  It would be good to have the tests look for these internal error events, for all of the TaskImpl Tests (but that might be a separate JIRA).

{noformat}
2012-07-09 15:00:22,423 ERROR [main] impl.TaskImpl (TaskImpl.java:handle(606)) - Can't handle this event at current state for task_1341864022346_0001_m_000001
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_FAILED at SUCCEEDED
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:383)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl.handle(TaskImpl.java:604)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.TestTaskImpl.testSpeculativeTaskAttemptSucceedsEvenIfFirstFails(TestTaskImpl.java:485)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
        at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
        at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
        at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:81)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)
{noformat}

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4252:
-------------------------------------------

    Target Version/s: 2.0.0-alpha, 0.23.3  (was: 2.0.0-alpha)

The patch applies with no changes to branch-0.23 as well.  It would be great if you could pull it in there too.  For what it is worth a quick look at the patch looks good, but I don't want to +1 it without a more in-depth look.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410604#comment-13410604 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2506 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2506/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411523#comment-13411523 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1133 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1133/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409748#comment-13409748 ] 

Bikas Saha commented on MAPREDUCE-4252:
---------------------------------------

Thanks! I like this because it makes sure that failures of other attempts do not end up erroneously failing the succeeded attempt. +1.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411419#comment-13411419 ] 

Hudson commented on MAPREDUCE-4252:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1100 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1100/])
    MAPREDUCE-4252. MR2 job never completes with 1 pending task (Tom White via bobby) (Revision 1359747)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359747
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskImpl.java

                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4252) MR2 job never completes with 1 pending task

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274173#comment-13274173 ] 

Hadoop QA commented on MAPREDUCE-4252:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12526660/MAPREDUCE-4252.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2385//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2385//console

This message is automatically generated.
                
> MR2 job never completes with 1 pending task
> -------------------------------------------
>
>                 Key: MAPREDUCE-4252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: Tom White
>            Assignee: Tom White
>         Attachments: MAPREDUCE-4252.patch, MAPREDUCE-4252.patch, MapReduce.png
>
>
> This was found by ATM:
> bq. I ran a teragen with 1000 map tasks. Many task attempts failed, but after 999 of the tasks had completed, the job is now sitting forever with 1 task "pending".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira