
Posted to mapreduce-issues@hadoop.apache.org by "Siddharth Seth (Created) (JIRA)" <ji...@apache.org> on 2011/11/23 05:18:40 UTC

[jira] [Created] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

MR AM can hang if containers are allocated on a node blacklisted by the AM
--------------------------------------------------------------------------

                 Key: MAPREDUCE-3460
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mr-am, mrv2
    Affects Versions: 0.23.0
            Reporter: Siddharth Seth
            Priority: Blocker


When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager that it has blacklisted, it tries to find a corresponding container request.
The lookup uses the hostname to find the matching container request, and can end up returning any of the ContainerRequests that may have requested a container on this node. That container request is cleaned to remove the bad node, and then added back to the RM 'ask' list.
The AM clears the 'ask' list after each heartbeat. The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable'), but it never gets added back to the 'ask' set, which is what is sent to the RM.
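
The divergence between 'remoteRequestsTable' and the 'ask' set described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class AskListSketch, its methods, and the use of priority 20 for a regular map request are made up for illustration and are not the real RMContainerRequestor code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: names mirror the issue text, not the real RMContainerRequestor.
class AskListSketch {
    // Long-lived bookkeeping: priority -> containers the AM still wants.
    final Map<Integer, Integer> remoteRequestsTable = new HashMap<>();
    // Priorities to send on the next heartbeat; emptied after each send.
    final Set<Integer> ask = new HashSet<>();

    void addRequest(int priority) {
        remoteRequestsTable.merge(priority, 1, Integer::sum);
        ask.add(priority);
    }

    // Re-queue a request after a container was rejected (blacklisted node).
    void reAsk(int priority) {
        ask.add(priority);
    }

    // Returns what goes on the wire, then clears the ask set.
    Set<Integer> heartbeat() {
        Set<Integer> sent = new HashSet<>(ask);
        ask.clear();
        return sent;
    }

    // Priorities still wanted locally that were absent from a heartbeat.
    Set<Integer> wantedButNotSent(Set<Integer> sent) {
        Set<Integer> missing = new HashSet<>(remoteRequestsTable.keySet());
        missing.removeAll(sent);
        return missing;
    }

    public static void main(String[] args) {
        AskListSketch am = new AskListSketch();
        am.addRequest(5);   // failed-map request
        am.addRequest(20);  // regular map request (priority value assumed)
        am.heartbeat();     // both priorities are sent once

        // A priority-5 container arrives on a blacklisted host. The
        // hostname-based lookup picks the priority-20 request instead.
        am.reAsk(20);

        Set<Integer> second = am.heartbeat();
        System.out.println("sent: " + second);
        System.out.println("never re-sent: " + am.wantedButNotSent(second));
    }
}
```

In this model the second heartbeat carries only the wrongly chosen priority, while the priority=5 entry stays in the table and is never sent again, which matches the hang described in the report.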

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161880#comment-13161880 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1429 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1429/])
    MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162111#comment-13162111 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #241 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/241/])
    merge MAPREDUCE-3460 from trunk

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hitesh Shah (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158714#comment-13158714 ] 

Hitesh Shah commented on MAPREDUCE-3460:
----------------------------------------

Based on Sid's theory, the problem would be in RMContainerAllocator#getContainerReqToReplace.

{code}
-      if (PRIORITY_FAST_FAIL_MAP.equals(priority) 
-          || PRIORITY_MAP.equals(priority)) {
+      if (PRIORITY_FAST_FAIL_MAP.equals(priority)) {
+        while (toBeReplaced == null && earlierFailedMaps.size() > 0) {
+          TaskAttemptId tId = earlierFailedMaps.removeFirst();
+          if (maps.containsKey(tId)) {
+            toBeReplaced = maps.remove(tId);
+          }
+        }
+        return toBeReplaced;
+      }
+      else if (PRIORITY_MAP.equals(priority)) {
{code}
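
The added branch in the diff above can be read as a small standalone routine. The following is a minimal sketch, assuming plain strings as stand-ins for TaskAttemptId and ContainerRequest; the class name FailedMapReplacementSketch and the method replaceFastFailRequest are hypothetical, and only the field names earlierFailedMaps and maps come from the diff.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// Stand-in for the fast-fail branch; strings replace the real Hadoop types.
class FailedMapReplacementSketch {
    // Attempt IDs of maps that failed earlier, oldest first.
    final LinkedList<String> earlierFailedMaps = new LinkedList<>();
    // Pending map requests keyed by attempt ID.
    final Map<String, String> maps = new HashMap<>();

    // Pick the request to replace for a fast-fail container: only a request
    // belonging to an earlier failed map that is still pending qualifies.
    String replaceFastFailRequest() {
        String toBeReplaced = null;
        while (toBeReplaced == null && !earlierFailedMaps.isEmpty()) {
            String tId = earlierFailedMaps.removeFirst();
            if (maps.containsKey(tId)) {
                toBeReplaced = maps.remove(tId);
            }
        }
        return toBeReplaced;
    }

    public static void main(String[] args) {
        FailedMapReplacementSketch s = new FailedMapReplacementSketch();
        s.earlierFailedMaps.add("attempt_1"); // no longer pending, gets skipped
        s.earlierFailedMaps.add("attempt_2");
        s.maps.put("attempt_2", "request for attempt_2");
        s.maps.put("attempt_3", "request for attempt_3"); // regular map
        System.out.println(s.replaceFastFailRequest());
        System.out.println(s.replaceFastFailRequest());
    }
}
```

The point of the sketch is that a fast-fail container is matched by attempt ID against the failed-map queue, not by hostname, so a regular map request such as the one for attempt_3 is never picked by this branch.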
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Priority: Blocker
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161907#comment-13161907 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #257 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/257/])
    merge MAPREDUCE-3460 from trunk

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                

[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Open  (was: Patch Available)
    
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161899#comment-13161899 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1380 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1380/])
    MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158865#comment-13158865 ] 

Mahadev konar commented on MAPREDUCE-3460:
------------------------------------------

Great.

Thanks Hitesh.

Bobby, can you try it out and see if you can add a test case?

As for the long term goal of cleaning up the if then else, we'll have to give it some thought before we go there. Hopefully 0.23 will be stable soon.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Priority: Blocker
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160955#comment-13160955 ] 

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

OK, I understand why they all keep going to h1: there is no way to request anything but h1, so it requests with a *.  When h1 heartbeats back in and has free space on it, it still gets a container assigned to it.  I don't see any evidence of requests being lost, even without the patch in this situation.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161709#comment-13161709 ] 

Hadoop QA commented on MAPREDUCE-3460:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12505898/MR3460_v4.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1384//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1384//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1384//console

This message is automatically generated.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162124#comment-13162124 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #96 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/96/])
    merge MAPREDUCE-3460 from trunk

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160387#comment-13160387 ] 

Siddharth Seth commented on MAPREDUCE-3460:
-------------------------------------------

Bobby, the test still fails both with and without the change. I think the failed container needs to be sent after the first container is allocated, with the second container request and the failed map request sent after that.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161316#comment-13161316 ] 

Hadoop QA commented on MAPREDUCE-3460:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12505827/MR3460_v3.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1383//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1383//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1383//console

This message is automatically generated.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3460:
--------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161680#comment-13161680 ] 

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

I don't know for sure yet whether the test simulates the situation, but yesterday, before I left, one of the tests we were running got into this state and I was able to poke around a little bit.  I have the complete set of AM and RM logs for that period, and I am walking through them now to understand exactly what happened and to try to reproduce it.

From what I have seen so far, the following is the sequence of events.
{noformat}
2011-12-01 19:05:48,480 ASSIGNED CONTAINER container_1322524316055_0237_01_000002 TO HOST H2
2011-12-01 19:05:48,483 ASSIGNED CONTAINER container_1322524316055_0237_01_000003 TO HOST H2
2011-12-01 19:05:50,469 ASSIGNED CONTAINER container_1322524316055_0237_01_000002 TO ATTEMPT attempt_1322524316055_0237_m_000000_0
2011-12-01 19:05:50,476 ASSIGNED CONTAINER container_1322524316055_0237_01_000003 TO ATTEMPT attempt_1322524316055_0237_m_000001_0
2011-12-01 19:06:11,541 ASSIGNED CONTAINER container_1322524316055_0237_01_000004 TO HOST H2
2011-12-01 19:06:11,542 ASSIGNED CONTAINER container_1322524316055_0237_01_000005 TO HOST H2
2011-12-01 19:06:12,539 ATTEMPT attempt_1322524316055_0237_m_000000_0 FAILED
2011-12-01 19:06:12,540 ATTEMPT attempt_1322524316055_0237_m_000001_0 FAILED
2011-12-01 19:06:12,545 ASSIGNED CONTAINER container_1322524316055_0237_01_000004 TO ATTEMPT attempt_1322524316055_0237_m_000002_0
2011-12-01 19:06:12,555 ASSIGNED CONTAINER container_1322524316055_0237_01_000005 TO ATTEMPT attempt_1322524316055_0237_m_000003_0
2011-12-01 19:06:12,573 1 FAILURES ON H2
2011-12-01 19:06:12,574 2 FAILURES ON H2
2011-12-01 19:06:20,573 ASSIGNED CONTAINER container_1322524316055_0237_01_000006 TO HOST H2
2011-12-01 19:06:20,574 ASSIGNED CONTAINER container_1322524316055_0237_01_000007 TO HOST H2
2011-12-01 19:06:20,585 ATTEMPT attempt_1322524316055_0237_m_000002_0 FAILED
2011-12-01 19:06:20,586 ATTEMPT attempt_1322524316055_0237_m_000003_0 FAILED
2011-12-01 19:06:20,589 ASSIGNED CONTAINER container_1322524316055_0237_01_000006 TO ATTEMPT attempt_1322524316055_0237_m_000001_1
2011-12-01 19:06:20,592 ASSIGNED CONTAINER container_1322524316055_0237_01_000007 TO ATTEMPT attempt_1322524316055_0237_m_000000_1
2011-12-01 19:06:20,605 3 FAILURES ON H2
2011-12-01 19:06:20,607 4 FAILURES ON H2
2011-12-01 19:06:20,608 BLACKLISTED H2
2011-12-01 19:06:23,998 ASSIGNED CONTAINER container_1322524316055_0237_01_000008 TO HOST H2
2011-12-01 19:06:23,999 ASSIGNED CONTAINER container_1322524316055_0237_01_000009 TO HOST H2
2011-12-01 19:06:26,647 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO HOST H1
2011-12-01 19:06:26,649 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO HOST H1
2011-12-01 19:06:28,635 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO ATTEMPT attempt_1322524316055_0237_m_000004_0
2011-12-01 19:06:28,640 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO ATTEMPT attempt_1322524316055_0237_m_000005_0
2011-12-01 19:06:40,839 ASSIGNED CONTAINER container_1322524316055_0237_01_000012 TO HOST H1
2011-12-01 19:06:40,840 ASSIGNED CONTAINER container_1322524316055_0237_01_000013 TO HOST H1
2011-12-01 19:06:42,675 ASSIGNED CONTAINER container_1322524316055_0237_01_000012 TO ATTEMPT attempt_1322524316055_0237_m_000006_0
2011-12-01 19:06:42,682 ASSIGNED CONTAINER container_1322524316055_0237_01_000013 TO ATTEMPT attempt_1322524316055_0237_m_000007_0
2011-12-01 19:06:45,698 ASSIGNED CONTAINER container_1322524316055_0237_01_000014 TO HOST H1
2011-12-01 19:06:45,699 ASSIGNED CONTAINER container_1322524316055_0237_01_000015 TO HOST H1
2011-12-01 19:06:46,698 ASSIGNED CONTAINER container_1322524316055_0237_01_000014 TO ATTEMPT attempt_1322524316055_0237_m_000008_0
2011-12-01 19:06:46,703 ASSIGNED CONTAINER container_1322524316055_0237_01_000015 TO ATTEMPT attempt_1322524316055_0237_m_000009_0
{noformat}

After that, it looks like the scheduler has several requested containers to assign, but it never assigns any of them, and the AM never asks for anything new.
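
The pattern in the event summary above can be checked mechanically. The following is a minimal sketch, not part of any patch: it takes a few of the condensed event lines above (this is the summary format used in this comment, not the raw AM log format) and lists containers that were assigned on a host after that host was blacklisted and were never handed to an attempt. The function name is made up for illustration.

```python
import re

# Event lines copied from the summary above (tail of the run only).
EVENTS = """\
2011-12-01 19:06:20,608 BLACKLISTED H2
2011-12-01 19:06:23,998 ASSIGNED CONTAINER container_1322524316055_0237_01_000008 TO HOST H2
2011-12-01 19:06:23,999 ASSIGNED CONTAINER container_1322524316055_0237_01_000009 TO HOST H2
2011-12-01 19:06:26,647 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO HOST H1
2011-12-01 19:06:26,649 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO HOST H1
2011-12-01 19:06:28,635 ASSIGNED CONTAINER container_1322524316055_0237_01_000010 TO ATTEMPT attempt_1322524316055_0237_m_000004_0
2011-12-01 19:06:28,640 ASSIGNED CONTAINER container_1322524316055_0237_01_000011 TO ATTEMPT attempt_1322524316055_0237_m_000005_0
"""

TO_HOST = re.compile(r"ASSIGNED CONTAINER (\S+) TO HOST (\S+)")
TO_ATTEMPT = re.compile(r"ASSIGNED CONTAINER (\S+) TO ATTEMPT (\S+)")
BLACKLISTED = re.compile(r"BLACKLISTED (\S+)")


def stranded_containers(text):
    """Return {container: host} for containers that arrived on an already
    blacklisted host and were never assigned to a task attempt."""
    blacklisted = set()
    stranded = {}
    for line in text.splitlines():
        match = BLACKLISTED.search(line)
        if match:
            blacklisted.add(match.group(1))
            continue
        match = TO_HOST.search(line)
        if match:
            container, host = match.groups()
            if host in blacklisted:
                stranded[container] = host
            continue
        match = TO_ATTEMPT.search(line)
        if match:
            # The container was used after all, so it is not stranded.
            stranded.pop(match.group(1), None)
    return stranded


stranded = stranded_containers(EVENTS)
```

On these lines, the two containers ending in 000008 and 000009 (both on H2) are the ones that were handed out after the blacklisting and never matched to an attempt.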
                

[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Attachment: MR-3460.txt

Sid, you were correct.  It was not exercising the expected code.  I was confused because the FAST_FAIL_MAP container was still being assigned.  It was just not sent to the scheduler before the node was blacklisted.

I have updated the test, and also the code itself.  The original patch was updating the list of failed maps and also the list of pending maps, but this caused the actual allocation of the container to fail later on.
                

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162134#comment-13162134 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #114 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/114/])
    Merge MAPREDUCE-3460 from trunk

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162129#comment-13162129 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #883 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/883/])
    MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                

[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160448#comment-13160448 ] 

Siddharth Seth commented on MAPREDUCE-3460:
-------------------------------------------

New container requests ignore node blacklisting - and make an entry into {{mapsHostMapping}}. That would be one way to recreate this issue (or alternately fix it).

Something like
1. request _1 on h1
2. am heartbeat
3. h1 heartbeat 
4. am heartbeat - container assigned
5. fail _1 on h1
6. request fast_fail replacement for _1
7. am heartbeat - to update request
8. request _3 on h3 / h1,h3
9. h1 heartbeat - to schedule (RM only aware of fast_fail _1 at this point)
10. am heartbeat - to get a fast_fail allocated on a blacklisted node.
11. h1 heartbeat
12. h3 heartbeat
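
To make the consequence of that sequence concrete, here is a toy model of the bookkeeping described in the issue. It is only a sketch under stated assumptions: `ToyAM` and `ToyRM` are invented names, `table` and `ask` stand in for 'remoteRequestsTable' and the 'ask' set, requests are tracked per priority only, and 20 is just a placeholder priority for an ordinary map request (the issue only gives priority 5 for the failed map). It also assumes the RM lowers its own count when it hands out a container.

```python
class ToyRM:
    def __init__(self):
        self.outstanding = {}          # priority -> containers the RM still owes

    def heartbeat(self, ask):
        self.outstanding.update(ask)   # the RM only learns what is in 'ask'

    def allocate(self, priority):
        self.outstanding[priority] -= 1


class ToyAM:
    def __init__(self):
        self.table = {}                # stand-in for 'remoteRequestsTable'
        self.ask = {}                  # stand-in for the 'ask' set

    def request(self, priority):
        self.table[priority] = self.table.get(priority, 0) + 1
        self.ask[priority] = self.table[priority]

    def heartbeat(self, rm):
        rm.heartbeat(dict(self.ask))
        self.ask.clear()               # 'ask' is cleaned after each heartbeat

    def reject_container(self, matched_priority):
        # Only the request returned by the lookup is put back into 'ask'.
        self.ask[matched_priority] = self.table[matched_priority]


def run(matched_priority):
    rm, am = ToyRM(), ToyAM()
    am.request(20)                     # ordinary map that listed h1
    am.request(5)                      # fast-fail replacement
    am.heartbeat(rm)
    rm.allocate(5)                     # container lands on blacklisted h1
    am.reject_container(matched_priority)
    am.heartbeat(rm)
    return am.table[5], rm.outstanding[5]


buggy = run(matched_priority=20)       # hostname lookup picked the wrong request
fixed = run(matched_priority=5)        # lookup picked the priority-5 request
```

In the `buggy` run the AM still wants one priority-5 container while the toy RM believes it owes none, which is the kind of mismatch that would leave the job waiting forever. In the `fixed` run both sides agree.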

                

[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Attachment: MR3460_v4.txt

Yes, Sid, it did reproduce the issue.  Thanks for doing that.  I am just uploading a new patch that fixes some spelling mistakes I introduced.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162143#comment-13162143 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #916 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/916/])
    MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158848#comment-13158848 ] 

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

+1 to Hitesh's patch at least as a quick fix.  I can try and reproduce the issue here and verify that the patch does indeed fix the issue.  I can also add in a few unit tests for it and turn it into a real patch if you like.

I would also like some feedback on a potential (long-term) refactor of the code, which would be done in a separate JIRA after 0.23 stabilizes.  It seems to me that the root cause of this issue is that a special condition for a FAST_FAIL_MAP was missed.  The code right now is written with lots of if/else statements separating map tasks from reduce tasks, and also from failed map tasks, etc.  I think it would be cleaner to replace the if statements with classes that use polymorphism to change the methods called.  This would make the different handling of a failed map versus a normal map or a reduce more evident.  It would also force the internal data structures that keep track of the different types of tasks to be combined.  This is just something that popped into my head while trying to evaluate Hitesh's fix.  I have not really evaluated what it would take to make it work; I would just like some feedback on the idea before filing a JIRA about it.
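
As a rough illustration of the shape such a refactor might take (a sketch only, in Python for brevity; every class and function name here is hypothetical, and the priorities other than 5 for a failed map are placeholders):

```python
class TaskRequest:
    """Hypothetical base class: one subclass per kind of task."""
    priority = None

    def __init__(self, attempt_id, hosts=()):
        self.attempt_id = attempt_id
        self.hosts = list(hosts)

    def hosts_to_ask(self, blacklist):
        # Drop blacklisted hosts; fall back to "*" if nothing is left.
        return [h for h in self.hosts if h not in blacklist] or ["*"]


class MapRequest(TaskRequest):
    priority = 20                      # placeholder value


class FailedMapRequest(MapRequest):
    priority = 5                       # priority given in the issue description


class ReduceRequest(TaskRequest):
    priority = 10                      # placeholder value

    def hosts_to_ask(self, blacklist):
        return ["*"]                   # assumes reduces have no locality preference


def find_request(pending, priority, host):
    """Match an allocated container by priority first, then by host.
    A request with no host list accepts any host."""
    for request in pending:
        if request.priority == priority and (not request.hosts or host in request.hosts):
            return request
    return None


# One combined list instead of separate per-type structures.
pending = [MapRequest("m_3", ["h1", "h3"]), FailedMapRequest("m_1")]
match = find_request(pending, priority=5, host="h1")
```

With a single `pending` list and the priority carried by the request object, a container allocated at priority 5 on h1 resolves to the failed-map request `m_1` rather than to whichever request happened to mention h1.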
                

[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161068#comment-13161068 ] 

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

Sid, I think I may have found a bug in the scheduler/MR-AM, but I am not really sure about it, and I would like your feedback on it.

When I run the unit test above I see the hosts (NMs) are registered with the RM using "host:port", but when we request a container in the tests the request only has "host" in it.  The scheduler seems to indicate that when it assigns a container to a host it is because it is rack local, not data local.  As part of this, the host-specific request does not seem to be cleared out from the scheduler even though it is not part of the new ask.  If I switch it over to requesting a container on a particular "host:port", then the scheduler will find the container to be data local and clear out the host, rack, and * requests.  This seems to work OK, but I thought that when we requested a container due to data locality we used just the host name, because that is what HDFS returns to us.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3460:
-------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Open  (was: Patch Available)

Cancelling to address the issues Sid pointed out.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Assigned] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans reassigned MAPREDUCE-3460:
----------------------------------------------

    Assignee: Robert Joseph Evans
    
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160439#comment-13160439 ] 

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

I think I must be doing it wrong somehow, or I don't understand the order of things you are requesting.  I am doing the following and it passes on both:

# request _1 on h1
# am heartbeat()
# h1 heartbeat()
# am heartbeat() //Get _1 container back
# fail _1 so h1 is blacklisted
# request _3 on h3
# request fast fail map _2 on h1
... (More heartbeats to schedule things)

This does not work to reproduce the issue because any requests for h1 added after h1 is blacklisted will have h1 removed.

If I move the fast fail map request above h1 being blacklisted, then when the container comes back for h1 the AM sees that h1 is blacklisted.  It will not find the request in the mapsHostMapping and will resort to pulling a request out of maps, which still works.  The only way we are going to get this deadlock is if somehow maps is empty.  I don't really see how the patch changes that.  I really don't understand all of what the code is doing, so I could just be completely wrong about it.
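
The lookup described above can be modelled with a small Java sketch. This is a simplified toy under stated assumptions, not the RMContainerAllocator code: only the names mapsHostMapping and maps come from the comment, while the class MapRequestLookup, its methods, and the use of plain string ids are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model: host-indexed lookup with a fallback to any pending map request.
final class MapRequestLookup {
    // host -> ids of pending map requests that named that host
    private final Map<String, Deque<String>> mapsHostMapping = new HashMap<>();
    // all pending map requests, oldest first
    private final Set<String> maps = new LinkedHashSet<>();
    private final Set<String> blacklisted = new HashSet<>();

    void request(String id, String... hosts) {
        maps.add(id);
        for (String host : hosts) {
            // Requests added after a host is blacklisted have that host removed.
            if (!blacklisted.contains(host)) {
                mapsHostMapping.computeIfAbsent(host, h -> new ArrayDeque<>()).add(id);
            }
        }
    }

    void blacklist(String host) {
        blacklisted.add(host);
        mapsHostMapping.remove(host);
    }

    // Returns the request id matched to a container on 'host', or null when
    // nothing is pending (the case that would leave the container unmatched).
    String assign(String host) {
        Deque<String> local = mapsHostMapping.get(host);
        while (local != null && !local.isEmpty()) {
            String id = local.poll();
            if (maps.remove(id)) {
                return id;            // host-local match
            }
        }
        Iterator<String> it = maps.iterator();
        if (!it.hasNext()) {
            return null;              // maps is empty
        }
        String id = it.next();        // fallback: oldest pending request
        it.remove();
        return id;
    }
}
```

In this model a container arriving on a blacklisted host misses the host index and falls through to maps, so the only unmatched case is an empty maps set.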

                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3460:
--------------------------------------

    Attachment: MR3460_v3.txt

Bobby, could you please see if this test simulates the situation? Meanwhile, I am looking at your last comment about locality.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161888#comment-13161888 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #246 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/246/])
    merge MAPREDUCE-3460 from trunk

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159583#comment-13159583 ] 

Siddharth Seth commented on MAPREDUCE-3460:
-------------------------------------------

Thanks for adding the unit test, Bobby. The test passes with and without the change to the RMContainerAllocator. To reproduce the issue, the MockRM needs to allocate a prio=5 container on h1 (and the MR AM needs to send back a release for this container). The AM was losing track of allocated containers with priority=5, hosts=empty, and the host blacklisted by the AM.
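
The failure mode in the issue description (the request stays in 'remoteRequestsTable' but never returns to the 'ask' set) can be illustrated with a toy Java model. Everything here is a simplification made for illustration: the class AskSketch, its methods, and the representation of requests as a count per priority are assumptions, and only the names ask and remoteRequestsTable come from the issue text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of the bookkeeping described in the issue.
final class AskSketch {
    // priority -> containers the AM still wants (stands in for remoteRequestsTable)
    final Map<Integer, Integer> remoteRequestsTable = new HashMap<>();
    // priorities to send on the next heartbeat (stands in for the 'ask' set)
    final Set<Integer> ask = new TreeSet<>();

    void addRequest(int priority) {
        remoteRequestsTable.merge(priority, 1, Integer::sum);
        ask.add(priority);
    }

    // Returns what would be sent to the RM, then cleans the ask set.
    Map<Integer, Integer> heartbeat() {
        Map<Integer, Integer> sent = new TreeMap<>();
        for (int priority : ask) {
            sent.put(priority, remoteRequestsTable.getOrDefault(priority, 0));
        }
        ask.clear();
        return sent;
    }

    // A container arrived on a blacklisted host and was released. The table
    // still holds the request; the RM only hears about it again if the
    // priority is put back into the ask set.
    void containerRejected(int priority, boolean reAddToAsk) {
        if (reAddToAsk) {
            ask.add(priority);
        }
    }
}
```

Calling containerRejected(5, false) and then heartbeat() sends nothing for priority 5 even though the table still lists it, which is the hang; passing true makes the next heartbeat carry the request again.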
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Attachment: MR-3460.txt

Adding a patch with the fix by Hitesh and a unit test to verify that it works.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

     Target Version/s: 0.23.1, 0.24.0
    Affects Version/s: 0.24.0
               Status: Patch Available  (was: Open)
    
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Patch Available  (was: Open)

Oh, and I will be filing a JIRA for the FifoScheduler issue.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162119#comment-13162119 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Hdfs-HAbranch-build #4 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/4/])
    merge MAPREDUCE-3460 from trunk

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209740
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160945#comment-13160945 ] 

Robert Joseph Evans commented on MAPREDUCE-3460:
------------------------------------------------

Sid, that didn't do it.  It fails both with and without the patch.  For some reason it looks like after steps 11 and 12 all AM heartbeats still have the containers scheduled on h1 (even though it is blacklisted).  I am investigating it.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3460:
-------------------------------------------

    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Open  (was: Patch Available)

Addressing Sid's comments.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161295#comment-13161295 ] 

Siddharth Seth commented on MAPREDUCE-3460:
-------------------------------------------

bq. When I run the unit test above I see the hosts (NMs) are registered with the RM using "host:port", but when we request a container in the tests the request only has "host" in it. The scheduler seems to indicate that when it assigns a container to a host it is because it is rack local, not data local. As part of this, the host-specific request does not seem to be cleared out from the scheduler even though it is not part of the new ask. If I switch it over to requesting a container on a particular "host:port", then the scheduler will find the container to be data local and clear out the host, rack, and * requests. This seems to work OK, but I thought that when we requested a container due to data locality we used just the host name, because that is what HDFS returns to us.

Good catch! Like you said, the request shouldn't care about the port for data locality. The FifoScheduler seems to be using the entire nodeAddress for allocating containers - which is incorrect. The capacity scheduler appears to be working as it should though - using only the hostname to allocate containers.
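
To make the difference concrete, here is a minimal sketch of hostname-only matching. It is not the FifoScheduler or CapacityScheduler code: HostLocality, hostOf, and isNodeLocal are hypothetical names, and the sketch assumes addresses of the form "host" or "host:port" (IPv6 literals are not handled).

```java
// Hypothetical helper, not Hadoop code: compares a requested location with a
// node address by hostname only, ignoring any ":port" suffix.
final class HostLocality {
    private HostLocality() {}

    // Returns the host part of "host" or "host:port".
    static String hostOf(String address) {
        int colon = address.lastIndexOf(':');
        return colon < 0 ? address : address.substring(0, colon);
    }

    // True when the request and the node refer to the same host.
    static boolean isNodeLocal(String requestedHost, String nodeAddress) {
        return hostOf(requestedHost).equals(hostOf(nodeAddress));
    }

    public static void main(String[] args) {
        // Comparing the full node address fails for a host-only request.
        System.out.println("h1".equals("h1:1234"));        // false
        // Comparing by hostname matches, whatever the port is.
        System.out.println(isNodeLocal("h1", "h1:1234"));  // true
        System.out.println(isNodeLocal("h1", "h2:1234"));  // false
    }
}
```

With a full-address comparison, a request for "h1" never matches a node registered as "h1:1234", so the host-level request would not be treated as data local.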
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Updated] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3460:
--------------------------------------

          Resolution: Fixed
       Fix Version/s: 0.23.1
    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Resolved  (was: Patch Available)

Committed to trunk and branch-0.23. Thanks Hitesh and Bobby.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3460.txt, MR-3460.txt, MR3460_v3.txt, MR3460_v4.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160148#comment-13160148 ] 

Hadoop QA commented on MAPREDUCE-3460:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12505632/MR-3460.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1369//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1369//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1369//console

This message is automatically generated.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt, MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161881#comment-13161881 ] 

Hudson commented on MAPREDUCE-3460:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1355 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1355/])
    MAPREDUCE-3460. MR AM can hang if containers are allocated on a node blacklisted by the AM. (Contributed by Hitesh Shah and Robert Joseph Evans)

sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209737
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerRequestor.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java

                


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159447#comment-13159447 ] 

Hadoop QA commented on MAPREDUCE-3460:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12505513/MR-3460.txt
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1362//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1362//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-examples.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1362//console

This message is automatically generated.
                
> MR AM can hang if containers are allocated on a node blacklisted by the AM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3460
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3460
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: MR-3460.txt
>
>
> When an AM is assigned a FAILED_MAP (priority = 5) container on a nodemanager which it has blacklisted - it tries to
> find a corresponding container request.
> This uses the hostname to find the matching container request - and can end up returning any of the ContainerRequests which may have requested a container on this node. This container request is cleaned to remove the bad node - and then added back to the RM 'ask' list.
> The AM cleans the 'ask' list after each heartbeat - The RM Allocator is still aware of the priority=5 container (in 'remoteRequestsTable') - but this never gets added back to the 'ask' set - which is what is sent to the RM.


[jira] [Commented] (MAPREDUCE-3460) MR AM can hang if containers are allocated on a node blacklisted by the AM

Posted by "Brian Cho (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265692#comment-13265692 ] 

Brian Cho commented on MAPREDUCE-3460:
--------------------------------------

Was a JIRA ever filed for using hostname:port instead of only hostname in FifoScheduler?
                
