Posted to mapreduce-issues@hadoop.apache.org by "Amol Kekre (JIRA)" <ji...@apache.org> on 2011/07/16 00:34:02 UTC

[jira] [Created] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

NPE in AM causes it to lose containers which are never returned back to RM
--------------------------------------------------------------------------

                 Key: MAPREDUCE-2693
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
            Reporter: Amol Kekre
            Priority: Critical
             Fix For: 0.23.0


The following exception in the AM of an application at the top of the queue causes this. Once it happens, the AM keeps
obtaining containers from the RM and simply loses them. Eventually, on a cluster with multiple jobs, no more scheduling
happens because of these lost containers.

It happens when there are blacklisted nodes at the app level in the AM. A bug in the AM
(RMContainerRequestor.containerFailedOnHost(hostName)) causes this: nodes are simply removed from the request table.
We should make sure the RM also knows about this update.

========================================================================
11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM. 
java.lang.NullPointerException
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
        at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
        at java.lang.Thread.run(Thread.java:619)
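
The failure mode described above can be sketched in isolation. This is a minimal illustration only, not the Hadoop code and not the attached patch: the class RequestTableSketch and all of its method names are hypothetical, and the request table is reduced to priority -> host -> outstanding count. It assumes only what the report states, namely that a blacklisted host's entry is removed from the AM's local table and that a later decrement for a container assigned on that host then hits a missing entry.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical, simplified model of an AM-side request table:
 * priority -> resource name (host) -> outstanding container count.
 * Not the actual RMContainerRequestor implementation.
 */
class RequestTableSketch {
    private final Map<Integer, Map<String, Integer>> table = new HashMap<>();

    /** Record that `count` more containers are wanted on `host` at `priority`. */
    void addRequest(int priority, String host, int count) {
        table.computeIfAbsent(priority, p -> new HashMap<>())
             .merge(host, count, Integer::sum);
    }

    /** Models blacklisting as described in the report: the host's entries are dropped locally. */
    void removeHost(String host) {
        for (Map<String, Integer> byHost : table.values()) {
            byHost.remove(host);
        }
    }

    /** Unguarded decrement: throws NullPointerException when the entry is missing. */
    void decUnguarded(int priority, String host) {
        Map<String, Integer> byHost = table.get(priority);
        int remaining = byHost.get(host); // unboxing a null Integer throws here
        byHost.put(host, remaining - 1);
    }

    /** Guarded decrement: returns false instead of throwing when the entry is missing. */
    boolean decGuarded(int priority, String host) {
        Map<String, Integer> byHost = table.get(priority);
        if (byHost == null) {
            return false;
        }
        Integer remaining = byHost.get(host);
        if (remaining == null) {
            return false;
        }
        if (remaining <= 1) {
            byHost.remove(host);
        } else {
            byHost.put(host, remaining - 1);
        }
        return true;
    }

    /** Outstanding count for the entry, or 0 when no entry exists. */
    int outstanding(int priority, String host) {
        Map<String, Integer> byHost = table.get(priority);
        if (byHost == null) {
            return 0;
        }
        Integer remaining = byHost.get(host);
        return remaining == null ? 0 : remaining;
    }
}
```

A guard such as decGuarded only avoids the exception. As the report notes, the RM would still need to learn about the removed request so that both sides agree on what is outstanding, and the AM would presumably need to hand back a container it cannot use rather than drop it.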

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066608#comment-13066608 ] 

Sharad Agarwal commented on MAPREDUCE-2693:
-------------------------------------------

This is due to a bug in job-level node blacklisting. Indirectly, it is related to how the AM and RM keep the request table. I will provide the fix shortly.


[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Arun C Murthy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115198#comment-13115198 ] 

Arun C Murthy commented on MAPREDUCE-2693:
------------------------------------------

Sharad - is this still valid? Thanks.
                

[jira] [Assigned] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah reassigned MAPREDUCE-2693:
--------------------------------------

    Assignee: Hitesh Shah  (was: Sharad Agarwal)
    

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Attachment: MR-2693.2.patch

Updated previous patch with minor optimizations.
                

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117224#comment-13117224 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-2693:
----------------------------------------------------

I know the details; I was the one who originally ran into this while running YARN on a cluster :)

The description of the ticket is comprehensive enough to fix the bug.
                

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131614#comment-13131614 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #57 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/57/])
    Merge -c 1186529 from trunk to branch-0.23 to complete fix for MAPREDUCE-2693.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186530
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Sharad Agarwal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117006#comment-13117006 ] 

Sharad Agarwal commented on MAPREDUCE-2693:
-------------------------------------------

Yes, this bug is still valid, but it only appears if job-level node blacklisting is enabled.

Sigh! I may not have the bandwidth to work on this in the short term. If someone else wants to take this up, feel free. Thanks!
                

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131068#comment-13131068 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1116 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1116/])
    MAPREDUCE-2693. Fix NPE in job-blacklisting. Contributed by Hitesh Shah.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186529
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131090#comment-13131090 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1133 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1133/])
    MAPREDUCE-2693. Fix NPE in job-blacklisting. Contributed by Hitesh Shah.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186529
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131576#comment-13131576 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #45 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/45/])
    Merge -c 1186529 from trunk to branch-0.23 to complete fix for MAPREDUCE-2693.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186530
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130291#comment-13130291 ] 

Hadoop QA commented on MAPREDUCE-2693:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12499617/MR-2693.2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 160 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.mapreduce.TestJobMonitorAndPrint

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1060//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1060//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1060//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1060//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1060//console

This message is automatically generated.
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>

        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Attachment: MR-2693.3.patch

Address code review comments. 
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-2693:
-----------------------------------------------

    Status: Open  (was: Patch Available)

Sorry this took time; it's an involved change. Mostly looks good. A few comments:

RMContainerRequestor:
 - Make the constructor with event-argument invoke the other constructor.
 - {{containerFailedOnHost()}}:
   -- Do we need to remove the rack entries from ask and remoteRequestTable also? (The TODO at the end)
   -- Use {{BuilderUtils.newResourceRequest()}} for constructing zeroedRequest.
 - {{getFilteredContainerRequest()}}: Why look for both IP addresses and host-names to check if they are/aren't blacklisted?

RMContainerAllocator:
 - Can the checks for illegal resource size (allocated.getResource().getMemory() < mapResourceReqt || maps.isEmpty()) be moved one level up, so that we don't need to repeat them in both _assign()_ and _getContainerReqToReplace()_?
 - Log message: "Could not find a valid request to which this allocated container maps to". Also add that this container is going to be released?

Test: It is not clear to me why we need five iterations in that loop. Is it possible to make it deterministic or more explicit?

What about currently running tasks? Do we want to kill them too if we mark the node for blacklisting?

General: Wrap lines longer than 80 chars, only those which the patch touches of course :)
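
The RMContainerAllocator comment above (moving the resource-size check one level up and logging that the container will be released) could look roughly like the sketch below. Only the condition itself comes from the review comment; the class AssignSketch, its fields, and its methods are hypothetical illustrations and do not reflect the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of hoisting a shared validity check; not code from the patch. */
public class AssignSketch {
    private final int mapResourceReqt;                        // memory a map task needs
    private final List<String> pendingMaps = new ArrayList<>();
    private final List<String> released = new ArrayList<>();

    public AssignSketch(int mapResourceReqt) {
        this.mapResourceReqt = mapResourceReqt;
    }

    public void addPendingMap(String taskId) {
        pendingMaps.add(taskId);
    }

    /** The single place where an allocated container is judged usable for a map. */
    private boolean isUsableForMap(int allocatedMemory) {
        return allocatedMemory >= mapResourceReqt && !pendingMaps.isEmpty();
    }

    /**
     * Entry point for each allocated container. The check runs once here,
     * so the helper below does not have to repeat it.
     * Returns the task the container was assigned to, or null if it was released.
     */
    public String handleAllocated(String containerId, int allocatedMemory) {
        if (!isUsableForMap(allocatedMemory)) {
            // The message states explicitly that the container is being released.
            System.out.println("Could not find a valid request for container "
                + containerId + "; releasing it");
            released.add(containerId);
            return null;
        }
        return assign();
    }

    /** Assumes the caller has already validated the container. */
    private String assign() {
        return pendingMaps.remove(0);
    }

    public List<String> released() {
        return released;
    }
}
```

Keeping the check in one method means a later change to the condition cannot drift between the two call sites.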
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Arun C Murthy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117018#comment-13117018 ] 

Arun C Murthy commented on MAPREDUCE-2693:
------------------------------------------

No worries, can you pls provide more info and I can get someone else to take this up? Thanks.
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Sharad Agarwal
>            Priority: Critical
>             Fix For: 0.23.0
>
>

        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Status: Patch Available  (was: Open)
    
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>

        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Status: Open  (was: Patch Available)
    
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130909#comment-13130909 ] 

Hitesh Shah commented on MAPREDUCE-2693:
----------------------------------------

bq. Do we need to remove the rack entries from ask and remoteRequestTable also? (The TODO at the end) 

I don't believe we should be blacklisting a rack based on a single node's failure. This probably needs a bit more thought in terms of how we decide to blacklist racks. Node failures could be correlated with rack/switch failures. I updated the comment with some more information on what we need to account for when blacklisting a rack, and I will probably open a jira that we can use as a discussion board for deciding which approach to apply when blacklisting a rack.

bq. getFilteredContainerRequest(): Why look for both IP addresses and host-names to check if they are/aren't blacklisted? 

Had added that as there was some confusion in the code about handling hostnames and IPs. Given that containers now also use hostnames, all code in the allocator/requestor has been changed to use hostnames only.

bq. Test: It is not clear to me why we need five iterations in that loop, is it possible to make it deterministic or more explicit?

That was required because nodes blacklisted by the AM could still be assigned back to it by the RM. Changed the code around a bit to mark the blacklisted nodes as not healthy, which makes the test cleaner and deterministic.
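
For readers following the thread, the failure mode under discussion can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patches: the class RequestTableSketch and its methods are hypothetical names, and the real RMContainerRequestor keeps a richer table keyed by priority, resource name and capability. The sketch shows a per-host table where blacklisting a hostname removes its entry, so a later decrement for a container that the RM still assigns on that host has to tolerate the missing entry instead of dereferencing null.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a per-host request table; not the Hadoop class. */
class RequestTableSketch {
    // Outstanding container requests, keyed by hostname only (no IP addresses).
    private final Map<String, Integer> pendingByHost = new HashMap<>();
    private final Set<String> blacklistedHosts = new HashSet<>();

    /** Records one more outstanding request unless the host is blacklisted. */
    void addRequest(String host) {
        if (blacklistedHosts.contains(host)) {
            return;
        }
        pendingByHost.merge(host, 1, Integer::sum);
    }

    /** Blacklists a host and drops its entry from the table. */
    void blacklist(String host) {
        blacklistedHosts.add(host);
        pendingByHost.remove(host);
    }

    /**
     * Decrements the outstanding count for a host.
     * Returns false when no entry exists (for example, the host was
     * blacklisted after the request was made) rather than failing.
     */
    boolean decrement(String host) {
        Integer pending = pendingByHost.get(host);
        if (pending == null) {
            return false; // entry already removed; nothing to decrement
        }
        if (pending <= 1) {
            pendingByHost.remove(host);
        } else {
            pendingByHost.put(host, pending - 1);
        }
        return true;
    }

    /** Outstanding request count for a host, or 0 if there is no entry. */
    int pending(String host) {
        return pendingByHost.getOrDefault(host, 0);
    }
}
```

In a real AM the caller would also have to hand the unexpected container back to the RM when decrement returns false; that part is left out here because it depends on the allocator's release path.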

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127994#comment-13127994 ] 

Hadoop QA commented on MAPREDUCE-2693:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12499113/MR-2693.1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1030//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1030//console

This message is automatically generated.
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch
>
>

        

[jira] [Assigned] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned MAPREDUCE-2693:
----------------------------------------

    Assignee: Sharad Agarwal

Sharad, I think you/Vinod were looking at this... can you please check? Thanks.

> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Sharad Agarwal
>            Priority: Critical
>             Fix For: 0.23.0
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131070#comment-13131070 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #26 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/26/])
    Merge -c 1186529 from trunk to branch-0.23 to complete fix for MAPREDUCE-2693.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186530
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131069#comment-13131069 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #26 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/26/])
    Merge -c 1186529 from trunk to branch-0.23 to complete fix for MAPREDUCE-2693.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186530
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-2693:
-------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this after testing on a secure cluster. Thanks Hitesh!
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>

        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130947#comment-13130947 ] 

Hadoop QA commented on MAPREDUCE-2693:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12499746/MR-2693.3.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 160 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1073//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1073//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1073//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1073//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1073//console

This message is automatically generated.
                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
>
> The following exception in the AM of an application at the top of the queue causes this. Once it happens, the AM keeps
> obtaining containers from the RM and simply loses them. Eventually, on a cluster with multiple jobs, no more scheduling
> happens because of these lost containers.
> It happens when there are blacklisted nodes at the app level in the AM. A bug in the AM
> (RMContainerRequestor.containerFailedOnHost(hostName)) causes this: nodes are simply removed from the
> request-table. We should make sure the RM also knows about this update.
> ========================================================================
> 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
> 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
>         at java.lang.Thread.run(Thread.java:619)
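
A minimal sketch of the failure mode described in the report, assuming a heavily simplified request table keyed only by resource name. Only the method names containerFailedOnHost and decResourceRequest come from the report; the class RequestTableSketch, its fields, and the helper methods are hypothetical and are not the code from the MR-2693 patches. The sketch shows two ideas: a decrement that tolerates an entry already removed by blacklisting, and recording a zero-count ask so the removal would also reach the RM.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for the AM's request table; not Hadoop code. */
final class RequestTableSketch {
    // resourceName (host, rack, or "*") -> outstanding container count
    private final Map<String, Integer> remoteRequests = new HashMap<>();
    // Entries changed since the last heartbeat; these would go to the RM as the next "ask".
    private final Map<String, Integer> ask = new HashMap<>();

    void addRequest(String resourceName) {
        int updated = remoteRequests.merge(resourceName, 1, Integer::sum);
        ask.put(resourceName, updated);
    }

    /** Blacklist a host: drop its entry and queue a zero-count ask so the RM learns of it. */
    void containerFailedOnHost(String hostName) {
        if (remoteRequests.remove(hostName) != null) {
            ask.put(hostName, 0);
        }
    }

    /**
     * Decrement the count for a resource. Returns false when the entry is gone
     * (for example, removed by blacklisting) instead of dereferencing null.
     */
    boolean decResourceRequest(String resourceName) {
        Integer current = remoteRequests.get(resourceName);
        if (current == null) {
            return false; // an unguarded "current - 1" here would throw a NullPointerException
        }
        int updated = Math.max(current - 1, 0);
        if (updated == 0) {
            remoteRequests.remove(resourceName);
        } else {
            remoteRequests.put(resourceName, updated);
        }
        ask.put(resourceName, updated);
        return true;
    }

    Integer outstanding(String resourceName) {
        return remoteRequests.get(resourceName);
    }

    Map<String, Integer> pendingAsk() {
        return ask;
    }
}
```

In this sketch, a container assigned on a host that was blacklisted in the meantime makes decResourceRequest return false rather than abort the heartbeat, so the caller can still decide what to do with the container.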


        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Status: Patch Available  (was: Open)
    
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch
>


        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Affects Version/s: 0.23.0
    
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch
>


        

[jira] [Updated] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-2693:
-----------------------------------

    Attachment: MR-2693.1.patch
    
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch
>


        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131084#comment-13131084 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #27 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/27/])
    Merge -c 1186529 from trunk to branch-0.23 to complete fix for MAPREDUCE-2693.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186530
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>


        

[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131066#comment-13131066 ] 

Hudson commented on MAPREDUCE-2693:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1195 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1195/])
    MAPREDUCE-2693. Fix NPE in job-blacklisting. Contributed by Hitesh Shah.

acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1186529
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java

                
> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>         Attachments: MR-2693.1.patch, MR-2693.2.patch, MR-2693.3.patch
>
