You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (Created) (JIRA)" <ji...@apache.org> on 2011/11/04 14:14:59 UTC

[jira] [Created] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
---------------------------------------------------------------------

                 Key: MAPREDUCE-3353
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2, resourcemanager
    Affects Versions: 0.23.0
            Reporter: Vinod Kumar Vavilapalli
             Fix For: 0.23.1


When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222848#comment-13222848 ] 

Hadoop QA commented on MAPREDUCE-3353:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12517166/MAPREDUCE-3353-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 29 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 503 javac compiler warnings (more than the trunk's current 501 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.mapreduce.v2.app.TestRMContainerAllocator

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2008//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2008//console

This message is automatically generated.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hitesh Shah (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah reassigned MAPREDUCE-3353:
--------------------------------------

    Assignee:     (was: Hitesh Shah)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

Attaching patch. This passes all tests locally and has a functional test for main changes.
1) providing all unusable nodes on app attempt registration
2) providing delta updates of nodes thereafter to a running app
I will scan through the changes for final clean ups and look at adding some more tests if necessary.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha reassigned MAPREDUCE-3353:
-------------------------------------

    Assignee: Bikas Saha  (was: Vinod Kumar Vavilapalli)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238393#comment-13238393 ] 

Robert Joseph Evans commented on MAPREDUCE-3353:
------------------------------------------------

Arun,

It looks like you put in MAPREDUCE-3533 instead of MAPREDUCE-3353 in CHANGES.txt, could you please fix it.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3353:
-----------------------------------------------

    Priority: Critical  (was: Major)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

New patch after latest git pull. Removed unreferenced import. Reused newly added BuilderUtil function.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned MAPREDUCE-3353:
----------------------------------------

    Assignee: Vinod Kumar Vavilapalli
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238617#comment-13238617 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2005 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2005/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305457)

     Result = SUCCESS
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305457
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3353:
-------------------------------------

    Fix Version/s: trunk
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.0, trunk
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239475#comment-13239475 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1032 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1032/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305457)

     Result = SUCCESS
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305457
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

New patch. I have not change ConcurrentMap implementations. They use ConcurrentHashmap that have safe iterators, although java docs claim that the iterator itself can be accessed on 1 thread. 
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3353:
-------------------------------------

    Status: Open  (was: Patch Available)

Canceling patch while comments are addressed.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222947#comment-13222947 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

The test failure is unrelated to the patch and happens on trunk. MAPREDUCE-3976.

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233157#comment-13233157 ] 

Hadoop QA commented on MAPREDUCE-3353:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519006/MAPREDUCE-3353-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 20 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 507 javac compiler warnings (more than the trunk's current 505 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2071//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2071//console

This message is automatically generated.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

Adding new patch with latest changes pulled and merged.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219400#comment-13219400 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

The changes turned out to be more than initially expected. I have the code done and will start on the tests.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

1) Fixed the javadoc warning.

2) The javac warnings are because of events handlers being called in NodesListManager.java and are similar to pre-existing warnings.
===========
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java:[142,19] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java:[155,21] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java:[244,35] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
===========
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233248#comment-13233248 ] 

Hadoop QA commented on MAPREDUCE-3353:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519012/MAPREDUCE-3353-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 17 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 507 javac compiler warnings (more than the trunk's current 505 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2074//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2074//console

This message is automatically generated.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Amol Kekre (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219337#comment-13219337 ] 

Amol Kekre commented on MAPREDUCE-3353:
---------------------------------------

any updates?
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215901#comment-13215901 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

1) Add a ClusterManager to RM that currently only provides bad node information. All node management will incrementally be moved to it. Eg. RmContext.RMnodes and RMContext.invalidNodes could move into ClusterManager. This could be a new class or be an enhancement of NodesListManager.
2) Node is bad when it gets a heartbeat with health=unhealthy or when the livenessmonitor reports lost node. In both cases a SchedulerNodeRemove event is issued. Similar for nodes becoming healthy. Additionally, these will now emit events to update the ClusterManager with corresponding events.
3) In AppMasterService, after calling Scheduler.allocate(), it will call ClusterManager.getUnusableNodes(). These will be passed to  RMAttempt object which will calculate the delta with previous known bad machines. The delta will be sent with the allocate response and the current list will be saved to calculate the next delta.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228739#comment-13228739 ] 

Arun C Murthy commented on MAPREDUCE-3353:
------------------------------------------

Looks good, some minor nits:

# RegisterApplicationMasterResponse.(get,set)UnusableNodes is not used right now, so let's add it later when we need it.
# AMResponse.getUpdatedNodes could use javadocs.
# BuilderUtils.createNodeReport
# RMContextImpl.applications shud be changed to ConcurrentSkipListMap to be safe - we should open a separate jira to fix the signature of RMContext.get* which return ConcurrentMap
# RMAppImpl.pullRMNodeUpdates needs a writeLock since it's clearing
# RMAppImpl shud ignore NodeUpdate in COMPLETED state (thus we can remove the 'if' condition in RMAppNodeUpdateTransition).

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239428#comment-13239428 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #997 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/997/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305457)

     Result = FAILURE
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305457
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238603#comment-13238603 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #719 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/719/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305458)

     Result = SUCCESS
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305458
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215306#comment-13215306 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

A potential solution would be the following
1) have the scheduler interface return the set of bad nodes on which it has stopped scheduling. This keeps the decision of which node is bad in the scheduler. The scheduler is the ultimate authority on what runs on a node and should tell its clients whether about the nodes that it is not considering for scheduling.
2) 1) above could be done as another interface API or piggybacked on the scheduler.allocate() API.
3) The response could contain all the known bad nodes or deltas to the previous response. Deltas are cheaper to send but are susceptible to message loss and retransmission. Also, deltas would have to be divided into new bad nodes and new good nodes.
4) The AM might want to know the type of bad node. Say lost or unhealthy etc. The bad nodes information could be enhanced via querying the RMNode object for the actual reason/health.

As an enhancement, we could add a new RMNodeMananger entity that manages all the RMNodes. The above functionality could move from the scheduler into RMNodeManager (though it would need to be in sync with the scheduler). After that, getting detailed information may not need direct access to RMNode object. Potentially, other interactions with RMNode could be forwarded through the RMNodeManager. But this would be a fairly significant refactoring thats best left to a separate future work item.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233568#comment-13233568 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

They have already been clarified in the first patch submission. Pasting from an earlier comment.
>>>
The javac warnings are because of events handlers being called in NodesListManager.java and are similar to pre-existing warnings.
===========
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java:[142,19] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java:[155,21] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java:[244,35] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
===========
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3353:
-----------------------------------------------

    Target Version/s: 0.23.2  (was: 0.23.1)

bq. In addition, we need AMs to act on the nodes information. Filing a separate ticket.
Filed MAPREDUCE-3921.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238703#comment-13238703 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1941 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1941/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305457)

     Result = ABORTED
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305457
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

Done with planned changes. Patch for review.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238601#comment-13238601 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #1930 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1930/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305457)

     Result = SUCCESS
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305457
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3353:
-----------------------------------------------

    Fix Version/s:     (was: 0.23.1)
                   0.23.2

Targeting 0.23.2
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Open  (was: Patch Available)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Open  (was: Patch Available)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238683#comment-13238683 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #738 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/738/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305458)

     Result = ABORTED
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305458
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Open  (was: Patch Available)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Open  (was: Patch Available)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215328#comment-13215328 ] 

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

Not doing deltas on the RM-AM channel does not seem viable because of high frequency message traffic. Sending information about 100 bad nodes at 100 bytes per node for 1000AM's every second is about 10MB/s of traffic.
Sending deltas means tracking last and current states on the RM on a per AM attempt basis. That would not be good to do in the scheduler because its not the responsibility of the scheduler. So this needs to be done on each RMAttempt object. The RMAttempt object gets the current list of bad nodes and compares it with its last known list of bad nodes. Additions and deletions are sent to the AM as new bad and good nodes.
Alternatively, each RMNode could send an event to each RMAppAttempt for healthy->unhealthy and vice versa transitions. These events could be accumulated and copied to the AM via the allocate response.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233178#comment-13233178 ] 

Hadoop QA commented on MAPREDUCE-3353:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519012/MAPREDUCE-3353-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 17 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 507 javac compiler warnings (more than the trunk's current 505 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2072//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2072//console

This message is automatically generated.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hitesh Shah (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah reassigned MAPREDUCE-3353:
--------------------------------------

    Assignee: Hitesh Shah
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233361#comment-13233361 ] 

Arun C Murthy commented on MAPREDUCE-3353:
------------------------------------------

Bikas, can you pls look at the javac warnings? Thanks.

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238624#comment-13238624 ] 

Hudson commented on MAPREDUCE-3353:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #729 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/729/])
    MAPREDUCE-3353. Fixed commit msg to point to right jira. (Revision 1305458)

     Result = SUCCESS
acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1305458
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt

                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Amol Kekre (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212963#comment-13212963 ] 

Amol Kekre commented on MAPREDUCE-3353:
---------------------------------------

Vinod,
Should this be in a .23.1 RC or can we move it to .23.2?
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Open  (was: Patch Available)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3353:
-------------------------------------

          Resolution: Fixed
       Fix Version/s:     (was: 0.23.2)
                      0.23.3
    Target Version/s: 0.23.3  (was: 0.23.2)
              Status: Resolved  (was: Patch Available)

I just committed this. Thanks Bikas!
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.3
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Amol Kekre (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222111#comment-13222111 ] 

Amol Kekre commented on MAPREDUCE-3353:
---------------------------------------

Not sure why this jira is marked critical. This only impacts if a node goes bad during AM life span right? If so given 3 attempts by MR, how important this jira (Major?).
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220493#comment-13220493 ] 

Hadoop QA commented on MAPREDUCE-3353:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12516746/MAPREDUCE-3353-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 29 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    -1 javac.  The applied patch generated 504 javac compiler warnings (more than the trunk's current 502 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1977//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1977//console

This message is automatically generated.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hitesh Shah (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated MAPREDUCE-3353:
-----------------------------------

    Priority: Critical  (was: Major)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Priority: Critical
>             Fix For: 0.23.1
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215952#comment-13215952 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3353:
----------------------------------------------------

+1 for the latest proposal, we can use NodeListManager itself instead of a new ClusterManager.

In addition, we need AMs to act on the nodes information. Filing a separate ticket.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Status: Patch Available  (was: Open)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220602#comment-13220602 ] 

Hadoop QA commented on MAPREDUCE-3353:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12516770/MAPREDUCE-3353-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 29 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 503 javac compiler warnings (more than the trunk's current 501 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1980//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1980//console

This message is automatically generated.
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3353:
----------------------------------

    Attachment: MAPREDUCE-3353-branch-0.23.patch

uploading patch again to trigger hudson
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes

Posted by "Arun C Murthy (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3353:
-------------------------------------

    Priority: Major  (was: Critical)
    
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch, MAPREDUCE-3353-branch-0.23.patch
>
>
> When a node gets lost or turns faulty, AM needs to know about that event so that it can take some action like for e.g. re-executing map tasks whose intermediate output live on that faulty node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira