Posted to mapreduce-issues@hadoop.apache.org by "Ravi Prakash (Created) (JIRA)" <ji...@apache.org> on 2011/12/23 00:53:33 UTC

[jira] [Created] (MAPREDUCE-3596) Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build

Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build
-------------------------------------------------------------------------------------

                 Key: MAPREDUCE-3596
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2
    Affects Versions: 0.23.0
            Reporter: Ravi Prakash
            Priority: Critical


Courtesy [~vinaythota]
{quote}
Ran the sort benchmark a couple of times, and every time the job hung after completing 99% of the map phase. Some map tasks failed, and some of the pending map tasks were never scheduled.
Cluster size is 350 nodes.

Build Details:
==============
Version:        0.23.1.1112091615, 1212592
Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 

ResourceManager version:        0.23.1.1112091615 from 1212681 by someone source checksum 6e54430abdc912c91c05b9208a3361de on Fri Dec 9 16:52:07 PST 2011
Hadoop version:         0.23.1.1112091615 from 1212592 by someone source checksum 999b78e0eadace831529ee78ed29c8e1 on Fri Dec 9 16:25:27 PST 2011
{quote}




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Attachment: MAPREDUCE-3596-20120111.txt

It turns out MAPREDUCE-3530 didn't work. I could reproduce the bug in a test-case.

The attached patch should fix this. Essentially, I am cleaning up the data structures a little later, so that any outstanding updates from the NM about new containers can be filtered out before they are sent to the schedulers. I also added one more check to prevent containers belonging to finished applications from reaching the scheduler.

I also added a test case that fails without the patch and passes with it.
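
For illustration only (this is not the attached patch), the filtering idea described above could be sketched as follows. The names `ReportedContainer` and `HeartbeatFilter` and the string IDs are hypothetical stand-ins rather than actual YARN types, and the sketch assumes the RM tracks a set of live applications that it consults before forwarding a NodeManager report.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one container entry in a NodeManager heartbeat.
final class ReportedContainer {
    final String applicationId;
    final String containerId;

    ReportedContainer(String applicationId, String containerId) {
        this.applicationId = applicationId;
        this.containerId = containerId;
    }
}

// Hypothetical filter; not the real ResourceTrackerService or RMNodeImpl code.
final class HeartbeatFilter {
    private final Set<String> liveApplications = new HashSet<>();

    void applicationStarted(String applicationId) {
        liveApplications.add(applicationId);
    }

    void applicationFinished(String applicationId) {
        liveApplications.remove(applicationId);
    }

    // Keep only containers whose application is still live, so that
    // containers of finished applications never reach the scheduler.
    List<ReportedContainer> filter(List<ReportedContainer> reported) {
        List<ReportedContainer> accepted = new ArrayList<>();
        for (ReportedContainer c : reported) {
            if (liveApplications.contains(c.applicationId)) {
                accepted.add(c);
            }
        }
        return accepted;
    }
}
```

In this sketch a late heartbeat that still mentions a container of a finished application is simply dropped instead of being passed on as a new container.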
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Attachment: MAPREDUCE-3596-20120112.1.txt

Updated patch: the scheduler now informs the node about any containers that the scheduler does not know about.

Fortunately, I could tweak my unit tests to reproduce this corner case as well. The test hangs without the fix and passes with it.
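
As a rough sketch of that behaviour (not the code in the attached patch), a scheduler could hand unknown containers back to the node for cleanup. `NodeView`, `UnknownContainerSketch`, and their methods are hypothetical names chosen for this example; the real changes are in the scheduler and RMNode classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the RM-side view of a node.
final class NodeView {
    final List<String> containersToCleanUp = new ArrayList<>();

    void cleanUpContainer(String containerId) {
        containersToCleanUp.add(containerId);
    }
}

// Hypothetical scheduler fragment; not the real CapacityScheduler or FifoScheduler.
final class UnknownContainerSketch {
    private final Set<String> knownContainers = new HashSet<>();

    void containerAllocated(String containerId) {
        knownContainers.add(containerId);
    }

    void containerCompleted(String containerId) {
        knownContainers.remove(containerId);
    }

    // Process the containers a node reports as running. Containers the
    // scheduler no longer tracks are handed back to the node for cleanup
    // instead of being treated as valid. Returns the number of recognised
    // containers.
    int nodeUpdate(NodeView node, List<String> reportedRunning) {
        int recognised = 0;
        for (String containerId : reportedRunning) {
            if (knownContainers.contains(containerId)) {
                recognised++;
            } else {
                node.cleanUpContainer(containerId);
            }
        }
        return recognised;
    }
}
```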
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186192#comment-13186192 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #138 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/138/])
    merge MAPREDUCE-3596 from trunk. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231303
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185870#comment-13185870 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1539 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1539/])
    MAPREDUCE-3596. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231297
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Attachment: MAPREDUCE-3596-20120111.1.txt

Fixing test failure.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Hadoop Flags: Reviewed
          Status: Patch Available  (was: Open)
    
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180877#comment-13180877 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3596:
----------------------------------------------------

I see an exception in the RM in the attached logs (rm1), which points to the already-fixed issue MAPREDUCE-3530. We will need a newer set of logs if this is still an issue.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181505#comment-13181505 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3596:
----------------------------------------------------

It looks like MAPREDUCE-3530 didn't work; I am seeing the same exception again.

Orthogonal to this, we should make sure the RM crashes when the dispatcher gets an exception. I created MAPREDUCE-3634 for this.
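
To make the fail-fast idea concrete, here is a minimal sketch, assuming a simple single-threaded event queue. `FailFastDispatcher` and its `onFatal` callback are hypothetical and are not the YARN dispatcher API; in a real RM the callback would presumably terminate the process, while here it is injected so the behaviour can be observed.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical dispatcher that stops and reports loudly on the first
// handler failure, instead of leaving the process alive with a dead
// event loop.
final class FailFastDispatcher {
    private final Queue<String> events = new ArrayDeque<>();
    private final Consumer<String> handler;
    private final Consumer<Throwable> onFatal;
    private boolean stopped = false;

    FailFastDispatcher(Consumer<String> handler, Consumer<Throwable> onFatal) {
        this.handler = handler;
        this.onFatal = onFatal;
    }

    void post(String event) {
        events.add(event);
    }

    boolean isStopped() {
        return stopped;
    }

    // Handle queued events in order. On the first exception, stop
    // dispatching and invoke the fatal callback. Returns the number of
    // events handled successfully.
    int drain() {
        int handled = 0;
        while (!stopped && !events.isEmpty()) {
            String event = events.poll();
            try {
                handler.accept(event);
                handled++;
            } catch (RuntimeException e) {
                stopped = true;
                onFatal.accept(e);
            }
        }
        return handled;
    }
}
```

The point of the design is that a failure in event handling becomes visible immediately, rather than showing up later as a job that hangs.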
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2, logs.tar.bz2


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186198#comment-13186198 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #160 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/160/])
    merge MAPREDUCE-3596 from trunk. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231303
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2


        

[jira] [Updated] (MAPREDUCE-3596) Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build

Posted by "Ravi Prakash (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Prakash updated MAPREDUCE-3596:
------------------------------------

    Description: 
Courtesy [~vinaythota]
{quote}
Ran the sort benchmark a couple of times, and every time the job hung after completing 99% of the map phase. Some map tasks failed, and some of the pending map tasks were never scheduled.
Cluster size is 350 nodes.

Build Details:
==============

Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
{quote}




  was:
Courtesy [~vinaythota]
{quote}
Ran the sort benchmark a couple of times, and every time the job hung after completing 99% of the map phase. Some map tasks failed, and some of the pending map tasks were never scheduled.
Cluster size is 350 nodes.

Build Details:
==============

Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
ResourceManager version:        revision 1212681 by someone source checksum 6e54430abdc912c91c05b9208a3361de on Fri Dec 9 16:52:07 PST 2011
Hadoop version:         revision 1212592 by someone source checksum 999b78e0eadace831529ee78ed29c8e1 on Fri Dec 9 16:25:27 PST 2011
{quote}




    
> Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Priority: Critical
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3596:
--------------------------------------

    Priority: Blocker  (was: Critical)

Marking as blocker since this has been seen more than once.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Priority: Blocker
>         Attachments: logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Status: Open  (was: Patch Available)

Thanks for the review, Robert.

Makes sense, updating the patch.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Updated] (MAPREDUCE-3596) Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build

Posted by "Ravi Prakash (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Prakash updated MAPREDUCE-3596:
------------------------------------

    Description: 
Courtesy [~vinaythota]
{quote}
Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
Cluster size is 350 nodes.

Build Details:
==============

Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
ResourceManager version:        revision 1212681 by someone source checksum
6e54430abdc912c91c05b9208a3361de on Fri Dec 9 16:52:07 PST 2011
Hadoop version:         revision 1212592 by someone source checksum 999b78e0eadace831529ee78ed29c8e1 on
Fri Dec 9 16:25:27 PST 2011
{quote}




  was:
Courtesy [~vinaythota]
{quote}
Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
Cluster size is 350 nodes.

Build Details:
==============
Version:        0.23.1.1112091615, 1212592
Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 

ResourceManager version:        0.23.1.1112091615 from 1212681 by someone source checksum
6e54430abdc912c91c05b9208a3361de on Fri Dec 9 16:52:07 PST 2011
Hadoop version:         0.23.1.1112091615 from 1212592 by someone source checksum 999b78e0eadace831529ee78ed29c8e1 on
Fri Dec 9 16:25:27 PST 2011
{quote}




    
> Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Priority: Critical
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum
> 6e54430abdc912c91c05b9208a3361de on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone source checksum 999b78e0eadace831529ee78ed29c8e1 on
> Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185869#comment-13185869 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1612 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1612/])
    MAPREDUCE-3596. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231297
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3596:
--------------------------------------

    Attachment: logs.tar.bz2

Should've seen the exception...
Anyway, attaching sections of the logs from another run with the same exception: the RM, AM, and NM that caused the error.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185901#comment-13185901 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1557 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1557/])
    MAPREDUCE-3596. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231297
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185394#comment-13185394 ] 

Hadoop QA commented on MAPREDUCE-3596:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510446/MAPREDUCE-3596-20120112.1.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1609//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1609//console

This message is automatically generated.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Assigned] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Arun C Murthy (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned MAPREDUCE-3596:
----------------------------------------

    Assignee: Vinod Kumar Vavilapalli
    
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Hadoop Flags:   (was: Reviewed)
          Status: Patch Available  (was: Open)
    
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Commented] (MAPREDUCE-3596) Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build

Posted by "Ravi Prakash (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175188#comment-13175188 ] 

Ravi Prakash commented on MAPREDUCE-3596:
-----------------------------------------

OK, here's how far I've got:

{noformat}
$ grep attempt_1324018664143_0002_m -r container_1324018664143_0002_01_000001/ | grep "Created attempt" | awk '{print $10}' | sort | uniq  | grep "_1$"
attempt_1324018664143_0002_m_009775_1
attempt_1324018664143_0002_m_012988_1
attempt_1324018664143_0002_m_013199_1
{noformat}

That is, three maps had to be retried. The first succeeded on being retried:
{noformat}
2011-12-16 07:09:11,013 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1324018664143_0002_m_009775_1
{noformat}
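
For anyone who prefers a script to the grep pipeline above, here is a minimal Python sketch of the same filter. It assumes the AM log lines have the "Created attempt attempt_..." shape shown in this comment; the function name retried_map_attempts is hypothetical and not part of Hadoop.

```python
import re

# Matches AM log lines such as
# "... TaskImpl: Created attempt attempt_1324018664143_0002_m_012988_1"
# and captures the full attempt ID plus its trailing attempt number.
CREATED = re.compile(r"Created attempt (attempt_\d+_\d+_m_\d+_(\d+))")


def retried_map_attempts(lines):
    """Return the sorted, de-duplicated map attempt IDs that end in _1.

    This mirrors: grep "Created attempt" | sort | uniq | grep "_1$"
    """
    found = set()
    for line in lines:
        match = CREATED.search(line)
        # Attempt number 1 is the first retry (attempt 0 is the original).
        if match and match.group(2) == "1":
            found.add(match.group(1))
    return sorted(found)
```

The function takes any iterable of lines, so it can be fed an open log file or the concatenated contents of the container log directory.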

The other two failed, for different reasons that don't seem to me to be related to this investigation. In any case, after the failure:
{noformat}
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing attempt_1324018664143_0002_m_012988_0 of type TA_CONTAINER_LAUNCH_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1324018664143_0002_m_012988_0 TaskAttempt Transitioned from ASSIGNED to FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing task_1324018664143_0002_m_012988 of type T_ATTEMPT_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Created attempt attempt_1324018664143_0002_m_012988_1
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node someNode
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing attempt_1324018664143_0002_m_012988_1 of type TA_RESCHEDULE
2011-12-16 07:09:15,870 INFO [Thread-31] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In HistoryEventHandler TASK_FINISHED
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1324018664143_0002_m_012988_1 TaskAttempt Transitioned from NEW to UNASSIGNED
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_REQ
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Added attempt_1324018664143_0002_m_012988_1 to list of failed maps
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Added priority=priority: 5, 
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: addResourceRequest: applicationId=2 priority=5 resourceName=* numContainers=1 #asks=1
{noformat}
After that, the attempt is never heard from again in the AM logs. The same happens for the other attempt.

I could not find the resource request in the RM logs.
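
To check for other attempts in the same situation, the AM log can be scanned for attempts whose last recorded transition leaves them in UNASSIGNED. The following is only a rough Python sketch: it assumes the "TaskAttempt Transitioned from X to Y" line format shown above, and the helper names last_states and stuck_attempts are hypothetical.

```python
import re

# Matches AM log lines such as
# "... attempt_1324018664143_0002_m_012988_1 TaskAttempt Transitioned from NEW to UNASSIGNED"
TRANSITION = re.compile(
    r"(attempt_\d+_\d+_[mr]_\d+_\d+) TaskAttempt Transitioned from (\w+) to (\w+)"
)


def last_states(lines):
    """Map each attempt ID to the last state it transitioned to, in log order."""
    states = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            # Later lines overwrite earlier ones, so the final value is the last state seen.
            states[match.group(1)] = match.group(3)
    return states


def stuck_attempts(lines, stuck_state="UNASSIGNED"):
    """Return sorted attempt IDs whose last logged state equals stuck_state."""
    return sorted(
        attempt for attempt, state in last_states(lines).items()
        if state == stuck_state
    )
```

Any attempt this reports from a hung job's AM log was requested but, as far as the AM log shows, never assigned a container, which is the pattern described in this comment.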

                

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185850#comment-13185850 ] 

Siddharth Seth commented on MAPREDUCE-3596:
-------------------------------------------

+1. Patch looks good. I also ran sort a couple of times with this patch and MAPREDUCE-3656 applied; the runs completed without hitting either issue.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Fix Version/s: 0.23.1
           Status: Patch Available  (was: Open)
    

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185313#comment-13185313 ] 

Siddharth Seth commented on MAPREDUCE-3596:
-------------------------------------------

The NPE could still happen if the startContainer request to the NM is delayed. The patch fixes the case where a newly launched container is reported in an NM heartbeat while the RM is still aware that this container needs to be cleaned up.
If the newly launched container (launched late because of a delayed startContainer) is reported to the RM in a subsequent heartbeat, after the RM has told the NM to clean up the container and has cleared its own list of containersToCleanup, we will end up with the same NPE and the container will run to completion.

One possible option would be to have the NM keep track of containers it has been asked to clean up but isn't aware of yet.
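
A minimal sketch of that NM-side option, under loud assumptions: `PendingCleanupTracker` and its methods are hypothetical names invented for illustration, container IDs are plain strings, and none of this is the actual NodeManager code. It only shows the bookkeeping idea: remember a cleanup request for a container that has not been seen yet, and refuse the late startContainer when it finally arrives.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; these are not YARN classes.
class PendingCleanupTracker {
  private final Set<String> running = new HashSet<>();
  private final Set<String> pendingCleanup = new HashSet<>();

  // Called when a heartbeat response asks for a container to be cleaned up.
  synchronized void onCleanupRequest(String containerId) {
    if (!running.remove(containerId)) {
      // The container is not known yet: remember the request for later.
      pendingCleanup.add(containerId);
    }
  }

  // Called when a (possibly delayed) startContainer arrives.
  // Returns true if the container was launched, false if it was refused.
  synchronized boolean onStartContainer(String containerId) {
    if (pendingCleanup.remove(containerId)) {
      return false; // cleanup was already requested, so do not launch
    }
    running.add(containerId);
    return true;
  }

  synchronized boolean isRunning(String containerId) {
    return running.contains(containerId);
  }
}
```

A real implementation would also need some way to expire remembered requests for containers that never arrive, which this sketch leaves out.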
                

        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Status: Patch Available  (was: Open)
    

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185356#comment-13185356 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3596:
----------------------------------------------------

Thanks Robert.

And a very good catch, Sid. Doing it on the NM is more complicated because the heartbeat thread is different from the one serving the AM-NM RPC. I am leaning towards doing it on the RM itself, inside the scheduler.
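
A rough sketch of what handling this on the RM side could look like, assuming a much simplified scheduler: `SchedulerSketch`, its map of live containers, and the string IDs are all hypothetical stand-ins, not the real scheduler classes. When an NM reports a container the scheduler no longer tracks, the sketch queues it for cleanup again instead of dereferencing a missing entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not the YARN scheduler classes.
class SchedulerSketch {
  // Containers the scheduler still tracks, mapped to their owning application.
  private final Map<String, String> liveContainers = new HashMap<>();

  void allocate(String containerId, String appId) {
    liveContainers.put(containerId, appId);
  }

  void release(String containerId) {
    liveContainers.remove(containerId);
  }

  // Processes the containers an NM reports as running in a heartbeat and
  // returns the IDs the NM should be told to clean up.
  List<String> onNodeHeartbeat(List<String> reportedRunning) {
    List<String> toCleanup = new ArrayList<>();
    for (String containerId : reportedRunning) {
      String appId = liveContainers.get(containerId);
      if (appId == null) {
        // Already released on the RM side: request cleanup again
        // rather than assuming the lookup succeeded.
        toCleanup.add(containerId);
      }
    }
    return toCleanup;
  }
}
```

Because the check runs on every heartbeat, a container that shows up late is still caught, regardless of when the delayed startContainer reached the NM.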
                

        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3596:
--------------------------------------

    Attachment: logs.tar.bz2

Attached some parts of the AM and RM logs.
am1/rm1 - first 2 map failures
am2/rm2 - 3rd map failure
am3/rm3 - last bit before the job was killed.

The first failed map was retried successfully. The remaining 2 never got containers allocated.

This looks like it may be an issue on the RM, although the RM logs aren't very useful because DEBUG logging wasn't enabled. The AM-side table looks OK. After the second failed map, 1 container was requested with priority=5 (never allocated):
{noformat}
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: addResourceRequest: applicationId=2 priority=5 resourceName=* numContainers=1 #asks=1
{noformat}

After the third failed map, 2 containers were requested with priority=5 (never allocated):
{noformat}
2011-12-16 07:26:07,641 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: addResourceRequest: applicationId=2 priority=5 resourceName=* numContainers=2 #asks=1
{noformat}

Towards the end, all reduce tasks are around 0.3328 complete and pendingMaps stays at 2.
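
For anyone repeating this kind of log reading, here is a small sketch that pulls the priority and numContainers fields out of the addResourceRequest lines quoted above. `ResourceRequestLine` is a hypothetical helper written for this note, not part of Hadoop, and the pattern assumes the field order seen in these two log lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading AM logs; not part of Hadoop.
class ResourceRequestLine {
  // Assumes the field order seen in the RMContainerRequestor lines above.
  private static final Pattern ADD_REQUEST = Pattern.compile(
      "addResourceRequest: applicationId=(\\d+) priority=(\\d+)"
          + " resourceName=(\\S+) numContainers=(\\d+)");

  final int priority;
  final String resourceName;
  final int numContainers;

  private ResourceRequestLine(int priority, String resourceName, int numContainers) {
    this.priority = priority;
    this.resourceName = resourceName;
    this.numContainers = numContainers;
  }

  // Returns the parsed request, or null if the line is not an addResourceRequest entry.
  static ResourceRequestLine parse(String logLine) {
    Matcher m = ADD_REQUEST.matcher(logLine);
    if (!m.find()) {
      return null;
    }
    return new ResourceRequestLine(
        Integer.parseInt(m.group(2)), m.group(3), Integer.parseInt(m.group(4)));
  }
}
```

Following numContainers for priority=5 over time makes the symptom easy to see: the ask grows from 1 to 2 and never shrinks.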
                

        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Attachment: MAPREDUCE-3596-20120112.txt

Addressing review comment.
                

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185872#comment-13185872 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #373 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/373/])
    merge MAPREDUCE-3596 from trunk. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231303
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184725#comment-13184725 ] 

Hadoop QA commented on MAPREDUCE-3596:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510310/MAPREDUCE-3596-20120111.1.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1603//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1603//console

This message is automatically generated.
                

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185881#comment-13185881 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #385 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/385/])
    merge MAPREDUCE-3596 from trunk. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231303
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                

        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185057#comment-13185057 ] 

Robert Joseph Evans commented on MAPREDUCE-3596:
------------------------------------------------

+1 (non-binding). I just have one very minor comment, so feel free to ignore it; my +1 stands with or without it.

In RMNodeImpl's getAppsToCleanup and getContainersToCleanUp, it would probably be slightly more efficient to do something like {code}
this.readLock.lock();
try {
  return new ArrayList<ContainerId>(this.containersToClean);
} finally {
  this.readLock.unlock();
}{code} This ensures that, even for large lists, ArrayList only needs to allocate its internal array once.
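
To make the suggestion concrete outside of RMNodeImpl, here is a self-contained sketch of the same copy-under-read-lock pattern. `CleanupList` is a hypothetical stand-in class, the element type is a plain string rather than ContainerId, and using a HashSet for the backing collection is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the pattern; not the real RMNodeImpl.
class CleanupList {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Set<String> containersToClean = new HashSet<>();

  void add(String containerId) {
    lock.writeLock().lock();
    try {
      containersToClean.add(containerId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Returns an independent copy taken while holding the read lock.
  // The copy constructor sizes the new list from the source collection
  // instead of growing it one element at a time.
  List<String> snapshot() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(containersToClean);
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Callers get a list they can iterate without holding the lock, and later additions do not show up in a snapshot that was already taken.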
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, logs.tar.bz2, logs.tar.bz2
>
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after completion 99% map phase. There are some map tasks failed. Also it's not scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 2011
> {quote}


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186188#comment-13186188 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #925 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/925/])
    MAPREDUCE-3596. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231297
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3596:
--------------------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed to trunk and branch-0.23. Thanks, Vinod.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182780#comment-13182780 ] 

Siddharth Seth commented on MAPREDUCE-3596:
-------------------------------------------

From another set of logs, the sequence of events is:
1. The AM calls startContainer.
2. The NM receives the call and starts processing it, but takes about 1 minute 20 seconds to finish.
3. Meanwhile, the AM times out the call after 1 minute and sends a release-container request to the RM.
4. The RM ends up removing its references to the container.
5. The NM sends a containerStarted event to the RM, which ends up causing the NPE.

From a quick look at the code, if the AM's release event had gone out after the NM's containerStarted event, things would have been handled.
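
The ordering in steps 4 and 5 can be reproduced with a small model. This is only a sketch under stated assumptions: ReleaseRaceDemo, its map of live containers, and the String ids are hypothetical and far simpler than the real RM scheduler state. It shows how a late containerStarted report finds nothing to dereference once the release has been processed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, much-simplified model of the RM's view of live containers.
class ReleaseRaceDemo {
  // containerId -> owning application id
  private final Map<String, String> liveContainers = new HashMap<String, String>();

  void allocate(String containerId, String appId) {
    liveContainers.put(containerId, appId);
  }

  // Step 4: the AM's release makes the RM drop its reference.
  void release(String containerId) {
    liveContainers.remove(containerId);
  }

  // Step 5: an unguarded handler dereferences the entry without a null check.
  int containerStarted(String containerId) {
    String appId = liveContainers.get(containerId); // null once released
    return appId.length(); // NullPointerException when appId is null
  }

  // Replays the problematic ordering: release first, late start report second.
  static boolean reproduces() {
    ReleaseRaceDemo rm = new ReleaseRaceDemo();
    rm.allocate("container_1", "app_1");
    rm.release("container_1");
    try {
      rm.containerStarted("container_1");
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }
}
```

With the events in the other order (containerStarted before release), the lookup succeeds, which matches the observation that the reverse ordering would have been handled.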
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Amol Kekre (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184439#comment-13184439 ] 

Amol Kekre commented on MAPREDUCE-3596:
---------------------------------------

Sid, the above optimizations are good. But at the top level, the RM should gracefully handle a containerStarted event from the NM for a released container. Maybe the RM should just log it, saying "Error! got a containerStarted event for a release container", and ignore the event.
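
A log-and-ignore guard of the kind suggested here might look like the sketch below. It is not the committed patch: GuardedScheduler, its map of live containers, the String ids, and the boolean return value are assumptions for illustration, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical scheduler fragment that tolerates late containerStarted reports.
class GuardedScheduler {
  private static final Logger LOG =
      Logger.getLogger(GuardedScheduler.class.getName());

  // containerId -> owning application id
  private final Map<String, String> liveContainers = new HashMap<String, String>();

  void allocate(String containerId, String appId) {
    liveContainers.put(containerId, appId);
  }

  void release(String containerId) {
    liveContainers.remove(containerId);
  }

  // Returns true if the event was applied, false if it was logged and ignored.
  boolean containerStarted(String containerId) {
    String appId = liveContainers.get(containerId);
    if (appId == null) {
      // The container was already released (or was never known): do not fail.
      LOG.warning("Got a containerStarted event for released or unknown container "
          + containerId + "; ignoring");
      return false;
    }
    return true;
  }
}
```

Returning a flag instead of throwing keeps the event loop alive; a real handler would presumably also need to make sure the NM is told to clean up the stray container.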
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185868#comment-13185868 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #363 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/363/])
    merge MAPREDUCE-3596 from trunk. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231303
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186205#comment-13186205 ] 

Hudson commented on MAPREDUCE-3596:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #958 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/958/])
    MAPREDUCE-3596. Fix scheduler to handle cleaned up containers, which NMs may subsequently report as running. (Contributed by Vinod Kumar Vavilapalli)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231297
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java

                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.1.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3596:
--------------------------------------

    Status: Open  (was: Patch Available)
    
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185194#comment-13185194 ] 

Hadoop QA commented on MAPREDUCE-3596:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510415/MAPREDUCE-3596-20120112.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1606//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1606//console

This message is automatically generated.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182904#comment-13182904 ] 

Siddharth Seth commented on MAPREDUCE-3596:
-------------------------------------------

As for the NM taking a long time to process a startContainer call, it would be interesting to see whether changing RPC thread priorities makes a difference, or whether RPC traffic can be prioritized over shuffle. It would also be worth trying to reduce what the startContainer call does in the NM.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: logs.tar.bz2, logs.tar.bz2
>
>


        

[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Summary: Sort benchmark got hang after completion of 99% map phase  (was: Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build)

Please avoid using internal numbering. It won't make sense to outsiders (like me) anyway :)

Regardless, can you please provide more information? AM logs are a good start. Thanks!
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Priority: Critical
>


        

[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185196#comment-13185196 ] 

Robert Joseph Evans commented on MAPREDUCE-3596:
------------------------------------------------

I looked at the new patch and I am still +1 (non-binding), just to be official, because it is a new patch.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.1.txt, MAPREDUCE-3596-20120111.txt, MAPREDUCE-3596-20120112.txt, logs.tar.bz2, logs.tar.bz2
>
>


[jira] [Commented] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184702#comment-13184702 ] 

Hadoop QA commented on MAPREDUCE-3596:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510305/MAPREDUCE-3596-20120111.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1602//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1602//console

This message is automatically generated.
                
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.txt, logs.tar.bz2, logs.tar.bz2
>
>


[jira] [Updated] (MAPREDUCE-3596) Sort benchmark got hang after completion of 99% map phase

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3596:
-----------------------------------------------

    Status: Open  (was: Patch Available)
    
> Sort benchmark got hang after completion of 99% map phase
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3596-20120111.txt, logs.tar.bz2, logs.tar.bz2
>
>
