You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (Created) (JIRA)" <ji...@apache.org> on 2011/10/20 14:13:10 UTC

[jira] [Created] (MAPREDUCE-3228) MR AM hangs when one node goes bad

MR AM hangs when one node goes bad
----------------------------------

                 Key: MAPREDUCE-3228
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2
    Affects Versions: 0.23.0
            Reporter: Vinod Kumar Vavilapalli
            Priority: Blocker
             Fix For: 0.23.0


Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.

Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3228:
-----------------------------------------------

    Attachment: MAPREDUCE-3228-20111020.txt

Adding timers for both {{startContainer()}} and {{stopContainer()}} so that MR AM doesn't get stuck on faulty nodes.

Need to do some testing.
                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137833#comment-13137833 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #874 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/874/])
    MAPREDUCE-3228. Fixed MR AM to timeout RPCs to bad NodeManagers. Contributed by Vinod K V.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189879
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3228:
-----------------------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137865#comment-13137865 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #65 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/65/])
    Merge -c 1189879 from trunk to branch-0.23 to fix MAPREDUCE-3228.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189880
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137333#comment-13137333 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1187 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1187/])
    MAPREDUCE-3228. Fixed MR AM to timeout RPCs to bad NodeManagers. Contributed by Vinod K V.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189879
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Vinod Kumar Vavilapalli (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli reassigned MAPREDUCE-3228:
--------------------------------------------------

    Assignee: Vinod Kumar Vavilapalli
    
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137328#comment-13137328 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1171 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1171/])
    MAPREDUCE-3228. Fixed MR AM to timeout RPCs to bad NodeManagers. Contributed by Vinod K V.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189879
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3228:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks Vinod!
                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137327#comment-13137327 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1248 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1248/])
    MAPREDUCE-3228. Fixed MR AM to timeout RPCs to bad NodeManagers. Contributed by Vinod K V.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189879
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137123#comment-13137123 ] 

Hadoop QA commented on MAPREDUCE-3228:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12501089/MAPREDUCE-3228-20111027.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 170 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed the unit tests build

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-web-proxy.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1173//console

This message is automatically generated.
                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137342#comment-13137342 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #74 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/74/])
    Merge -c 1189879 from trunk to branch-0.23 to fix MAPREDUCE-3228.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189880
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137120#comment-13137120 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3228:
----------------------------------------------------

Verified this on a single node secure setup. No side effects because of the patch.
                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3228:
-----------------------------------------------

    Attachment: MAPREDUCE-3228-20111027.txt

Updated patch with tests.
                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138287#comment-13138287 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #53 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/53/])
    Merge -c 1189879 from trunk to branch-0.23 to fix MAPREDUCE-3228.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189880
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137330#comment-13137330 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #78 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/78/])
    Merge -c 1189879 from trunk to branch-0.23 to fix MAPREDUCE-3228.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189880
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137325#comment-13137325 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #78 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/78/])
    Merge -c 1189879 from trunk to branch-0.23 to fix MAPREDUCE-3228.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189880
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3228) MR AM hangs when one node goes bad

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138368#comment-13138368 ] 

Hudson commented on MAPREDUCE-3228:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #846 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/846/])
    MAPREDUCE-3228. Fixed MR AM to timeout RPCs to bad NodeManagers. Contributed by Vinod K V.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1189879
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestContainerLauncher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFail.java

                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt, MAPREDUCE-3228-20111027.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went real bad, the job had three containers running on the node. Eventually, AM marked the tasks as timedout and initiated cleanup of the failed containers via {{stopContainer()}}. The later got stuck at the faulty node, the tasks are stuck in FAIL_CONTAINER_CLEANUP stage and the job lies in there waiting for ever.
> Thanks to [~Karams] for helping with this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira