Posted to mapreduce-issues@hadoop.apache.org by "Ramgopal N (Created) (JIRA)" <ji...@apache.org> on 2011/11/03 11:03:33 UTC

[jira] [Created] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-3339
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.0
            Reporter: Ramgopal N


I have only one NM running.
I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.

The NM log shows this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171317#comment-13171317 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3339:
----------------------------------------------------

Looking at the patch now for review/commit.
                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3339_v1.txt, MR3339_v2.txt

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173695#comment-13173695 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #323 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/323/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.
svn merge -c 1221523 --ignore-ancestry ../../trunk/

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221524
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3339-20111220.txt, MR3339_v1.txt, MR3339_v2.txt

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174052#comment-13174052 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/901/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221523
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173620#comment-13173620 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1459 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1459/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221523

                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159830#comment-13159830 ] 

Siddharth Seth commented on MAPREDUCE-3339:
-------------------------------------------

Could you please provide more details on how to reproduce this, if you're still seeing it?
                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173629#comment-13173629 ] 

Siddharth Seth commented on MAPREDUCE-3339:
-------------------------------------------

Thanks for the updated patch and the commit, Vinod.
                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161950#comment-13161950 ] 

Siddharth Seth commented on MAPREDUCE-3339:
-------------------------------------------

Ran into this as well. This is the same as https://issues.apache.org/jira/browse/MAPREDUCE-3314?focusedCommentId=13139942&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13139942

The AM keeps getting allocated containers on a single node. Because that node has already been blacklisted for previous failures, the AM releases these containers and just keeps running, so the job never makes progress.

The warning 'Event EventType: KILL_CONTAINER sent to absent container' is expected - the RM tries to tell the NM to kill the container, since the AM could have LAUNCHEd it and then RELEASEd it.

Leaving this Jira open to add a configuration option to the AM, so that it ignores blacklisting in such situations.
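
As a rough illustration of the option proposed above (not the committed patch), the sketch below models an AM that stops honoring its blacklist once the blacklisted share of the cluster reaches a threshold. The class name, the method names, the percentage parameter, and the cluster node count input are all hypothetical; none of them are taken from the Hadoop source.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of AM-side node blacklisting.
// None of these names come from the Hadoop code base.
final class BlacklistSketch {
    private final Set<String> blacklistedHosts = new HashSet<>();
    // Assumed setting: percentage of cluster nodes that may be blacklisted
    // before the blacklist is ignored. A negative value disables the check.
    private final int ignoreThresholdPercent;
    // Assumed input: total number of nodes in the cluster.
    private final int clusterNodeCount;

    BlacklistSketch(int ignoreThresholdPercent, int clusterNodeCount) {
        this.ignoreThresholdPercent = ignoreThresholdPercent;
        this.clusterNodeCount = clusterNodeCount;
    }

    void blacklist(String host) {
        blacklistedHosts.add(host);
    }

    // True once the blacklisted share of the cluster reaches the threshold.
    boolean ignoringBlacklist() {
        if (ignoreThresholdPercent < 0 || clusterNodeCount <= 0) {
            return false;
        }
        return blacklistedHosts.size() * 100 >= ignoreThresholdPercent * clusterNodeCount;
    }

    // Decide whether a container allocated on this host is used or released.
    boolean shouldUseContainerOn(String host) {
        return ignoringBlacklist() || !blacklistedHosts.contains(host);
    }

    public static void main(String[] args) {
        // Single-node cluster, as in the report: the only NM gets blacklisted.
        BlacklistSketch strict = new BlacklistSketch(-1, 1);
        strict.blacklist("nm1");
        System.out.println(strict.shouldUseContainerOn("nm1")); // prints false

        BlacklistSketch relaxed = new BlacklistSketch(33, 1);
        relaxed.blacklist("nm1");
        System.out.println(relaxed.shouldUseContainerOn("nm1")); // prints true
    }
}
```

With the check disabled, the single blacklisted node is rejected forever, which matches the hang described in the report. With a threshold in place, the same node becomes usable again once too much of the cluster is blacklisted.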
                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173622#comment-13173622 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #301 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/301/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.
svn merge -c 1221523 --ignore-ancestry ../../trunk/

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221524

                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164156#comment-13164156 ] 

Hadoop QA commented on MAPREDUCE-3339:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12506389/MR3339_v1.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1404//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1404//console

This message is automatically generated.
                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159864#comment-13159864 ] 

Mahadev konar commented on MAPREDUCE-3339:
------------------------------------------

Ramgopal,
 Is this with the default scheduler (Fifo)?


                

[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3339:
--------------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3339:
--------------------------------------

    Attachment: MR3339_v2.txt

Modified the way the AM finds out knownNodes based on feedback from Hitesh and Vinod.

knownNodes are now reported by the RM on each allocate call.

One known issue with this approach: AM blacklisting is host based, while the known node count is NodeManager based. If there are multiple NMs on a node, disabling blacklisting may not work. AM blacklisting needs to move to being NM based instead of node based.
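
The approach described in this comment can be sketched roughly as follows. This is a minimal illustration only, not the code from the attached patch: the class BlacklistTracker, its method names, the integer percentage arithmetic, and the ">=" comparison are all assumptions made for the example. It assumes the node count arrives with each allocate response and that blacklisting is ignored once the blacklisted share of known nodes reaches a configured percentage.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; names and semantics are assumed, not taken from the patch. */
class BlacklistTracker {
    // Blacklisting is tracked per host, as the comment above describes.
    private final Set<String> blacklistedHosts = new HashSet<>();
    // Assumed meaning: ignore the blacklist once this percentage of nodes is blacklisted.
    private final int ignoreThresholdPercent;
    // Latest node count reported by the RM (counts NodeManagers, not hosts).
    private int knownNodeCount = 0;

    BlacklistTracker(int ignoreThresholdPercent) {
        this.ignoreThresholdPercent = ignoreThresholdPercent;
    }

    /** Called on every allocate response with the node count reported by the RM. */
    void onAllocateResponse(int reportedNodeCount) {
        this.knownNodeCount = reportedNodeCount;
    }

    void blacklistHost(String host) {
        blacklistedHosts.add(host);
    }

    /** True when so many nodes are blacklisted that the blacklist should be ignored. */
    boolean isBlacklistIgnored() {
        if (knownNodeCount <= 0) {
            return false; // no node count received yet
        }
        int blacklistedPercent = blacklistedHosts.size() * 100 / knownNodeCount;
        return blacklistedPercent >= ignoreThresholdPercent;
    }

    /** A host is usable if it is not blacklisted or the blacklist is being ignored. */
    boolean canUseHost(String host) {
        return isBlacklistIgnored() || !blacklistedHosts.contains(host);
    }
}
```

With a single NM, as in the original report, blacklisting that one host puts 100% of the known nodes on the blacklist, so the sketch ignores the blacklist and the host stays usable. The known issue shows up when two NMs share one host: the reported count is 2 but only one host can be blacklisted, so the computed share is 50% and a threshold above that is never reached.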
                

[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3339:
-----------------------------------------------

    Hadoop Flags: Reviewed
          Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173619#comment-13173619 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #312 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/312/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.
svn merge -c 1221523 --ignore-ancestry ../../trunk/

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221524
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                

[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3339:
--------------------------------------

    Status: Open  (was: Patch Available)
    

[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3339:
-----------------------------------------------

    Attachment: MAPREDUCE-3339-20111220.txt

Attaching patch with trivial update to the configuration names, changed to {{job.node-blacklisting.enable}} and {{job.node-blacklisting.ignore-threshold-node-percent}}.
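
As a rough illustration of how the two renamed settings might be read, here is a small sketch. It uses java.util.Properties rather than Hadoop's own configuration classes so that it stays self-contained. The key strings are the suffixes quoted in the comment above; any prefix the real keys carry is not shown in this thread. The class BlacklistSettings and the default values (true and 33) are assumptions made for the example.

```java
import java.util.Properties;

/** Hypothetical holder for the two settings named in the comment above. */
class BlacklistSettings {
    // Key suffixes as quoted in the comment; a full key prefix is not shown in this thread.
    static final String ENABLE_KEY = "job.node-blacklisting.enable";
    static final String IGNORE_PERCENT_KEY =
        "job.node-blacklisting.ignore-threshold-node-percent";

    final boolean enabled;
    final int ignoreThresholdPercent;

    private BlacklistSettings(boolean enabled, int ignoreThresholdPercent) {
        this.enabled = enabled;
        this.ignoreThresholdPercent = ignoreThresholdPercent;
    }

    /** Reads both settings; the defaults used here are assumed for illustration. */
    static BlacklistSettings from(Properties props) {
        boolean enabled = Boolean.parseBoolean(props.getProperty(ENABLE_KEY, "true"));
        int percent = Integer.parseInt(props.getProperty(IGNORE_PERCENT_KEY, "33"));
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException(
                IGNORE_PERCENT_KEY + " must be between 0 and 100, got " + percent);
        }
        return new BlacklistSettings(enabled, percent);
    }
}
```

Validating the percentage up front keeps a mistyped value from silently disabling or permanently enforcing the blacklist.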
                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170621#comment-13170621 ] 

Hadoop QA commented on MAPREDUCE-3339:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12507620/MR3339_v2.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.mapreduce.v2.app.TestStagingCleanup
                  org.apache.hadoop.mapreduce.v2.app.TestJobEndNotifier

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1466//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1466//console

This message is automatically generated.
                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173672#comment-13173672 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1482 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1482/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221523
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                

[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174098#comment-13174098 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #934 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/934/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221523
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                

[jira] [Assigned] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth reassigned MAPREDUCE-3339:
-----------------------------------------

    Assignee: Siddharth Seth
    

[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3339:
-----------------------------------------------

    Fix Version/s: 0.23.1
           Status: Open  (was: Patch Available)

Patch looks good overall.

Let's correct the job-blacklisting related config items to all follow the same naming pattern.

The test is good, BTW!
                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173626#comment-13173626 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1532 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1532/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221523
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3339-20111220.txt, MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3339:
--------------------------------------

    Attachment: MR3339_v1.txt

Changes in the patch 
- Node blacklisting is disabled if a certain configured percentage of nodes known to the AM are blacklisted.
- This is computed based on nodes on which containers have been allocated by the RM - not requested hosts.
- Blacklisting can be re-enabled if the AM becomes aware of additional hosts.

To reproduce the issue, run a job whose maps always fail, configured so that FailedAttemptsToBlacklist < numTaskAttempts.
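
The three points above can be illustrated with a minimal sketch. This is not the code from the attached patch: the class name, method names, and the choice of ">=" for the threshold comparison are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of threshold-based blacklisting; names are illustrative
// and do not come from the MAPREDUCE-3339 patch.
class BlacklistSketch {
    private final int maxFailuresPerNode;     // failures before a node is blacklisted
    private final int ignoreThresholdPercent; // blacklisting is ignored at or above this percentage
    // Only nodes on which containers were actually allocated are counted as known.
    private final Set<String> knownNodes = new HashSet<>();
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    BlacklistSketch(int maxFailuresPerNode, int ignoreThresholdPercent) {
        this.maxFailuresPerNode = maxFailuresPerNode;
        this.ignoreThresholdPercent = ignoreThresholdPercent;
    }

    // Record that a container was allocated on this node.
    void containerAllocated(String node) {
        knownNodes.add(node);
    }

    // Record a failed attempt; blacklist the node once it reaches the limit.
    void containerFailed(String node) {
        knownNodes.add(node); // a failure implies a container ran on the node
        int count = failures.merge(node, 1, Integer::sum);
        if (count >= maxFailuresPerNode) {
            blacklisted.add(node);
        }
    }

    // True when the share of blacklisted nodes among known nodes reaches the threshold.
    boolean blacklistingIgnored() {
        if (knownNodes.isEmpty()) {
            return false;
        }
        return blacklisted.size() * 100 / knownNodes.size() >= ignoreThresholdPercent;
    }

    // A node is usable if it is not blacklisted, or if blacklisting is currently ignored.
    boolean isUsable(String node) {
        return blacklistingIgnored() || !blacklisted.contains(node);
    }
}
```

With a single NM, the first blacklisted node already makes up 100% of the known nodes, so in this sketch the blacklist is ignored and the job can keep requesting containers instead of hanging. Once containers are allocated on further hosts, the percentage drops and the blacklist takes effect again.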
                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3339_v1.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3339:
-----------------------------------------------

      Resolution: Fixed
    Release Note: Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold.
          Status: Resolved  (was: Patch Available)

Just committed this to trunk and branch-0.23. Thanks Sid!
                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3339-20111220.txt, MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Updated] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3339:
--------------------------------------

    Status: Patch Available  (was: Open)
    
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>         Attachments: MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174063#comment-13174063 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #114 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/114/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.
svn merge -c 1221523 --ignore-ancestry ../../trunk/

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221524
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3339-20111220.txt, MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Ramgopal N (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159842#comment-13159842 ] 

Ramgopal N commented on MAPREDUCE-3339:
---------------------------------------

Run with only one NM. Submit a big job and continuously kill the child processes on that NM. After some time the NM stops respawning the child processes and the job hangs. Nowhere in the logs is the node reported as blacklisted.
 
                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173604#comment-13173604 ] 

Hadoop QA commented on MAPREDUCE-3339:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508138/MAPREDUCE-3339-20111220.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1483//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1483//console

This message is automatically generated.
                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3339-20111220.txt, MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359


[jira] [Commented] (MAPREDUCE-3339) Job is getting hanged indefinitely,if the child processes are killed on the NM. KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173667#comment-13173667 ] 

Hudson commented on MAPREDUCE-3339:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #134 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/134/])
    MAPREDUCE-3339. Fixed MR AM to stop considering node blacklisting after the number of nodes blacklisted crosses a threshold. Contributed by Siddharth Seth.
svn merge -c 1221523 --ignore-ancestry ../../trunk/

vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221524
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMCommunicator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/AllocateResponsePBImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java

                
> Job is getting hanged indefinitely,if the child processes are killed on the NM.  KILL_CONTAINER eventtype is continuosly sent to the containers that are not existing
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3339
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3339
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ramgopal N
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3339-20111220.txt, MR3339_v1.txt, MR3339_v2.txt
>
>
> I have only one NM running.
> I submitted a job, and all the child processes on the NM were killed continuously. This made the job hang indefinitely.
> The NM logs show this WARN message: org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1320301910500_0004_01_001359
