Posted to mapreduce-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (Created) (JIRA)" <ji...@apache.org> on 2012/02/24 22:51:49 UTC

[jira] [Created] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
------------------------------------------------------------------------------------

                 Key: MAPREDUCE-3921
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mr-am, mrv2
    Affects Versions: 0.23.0
            Reporter: Vinod Kumar Vavilapalli
             Fix For: 0.23.2




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293281#comment-13293281 ] 

Hudson commented on MAPREDUCE-3921:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2415 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2415/])
    MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.1-alpha
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Assigned] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Arun C Murthy (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned MAPREDUCE-3921:
----------------------------------------

    Assignee: Bikas Saha
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261127#comment-13261127 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

I see what you are saying about fetch failures and bad nodes. I am open to both approaches. The way it's currently done is based on discussions I had with Vinod long ago.

Changed JobImpl to ignore JOB_UPDATED_NODES events in NEW and INITED states.

Changed to TaskId.getTaskType()

Yes, the removed entry is marked OBSOLETE. But I still don't understand why that would be done if the current entry is not successful. Why lose the previously successful entry when the current one is not successful itself? This should be done only when the current entry is successful.

I like the change to store NodeId's instead of ContainerId's in the AssignedRequests map. I would like to make it a separate change and not merge it with this one. There might be other gotchas to doing that.
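
As a rough illustration of the JobImpl change mentioned above (ignoring node updates in the NEW and INITED states), here is a minimal sketch. The enums and the NodeUpdateGate class are hypothetical stand-ins, not the real JobImpl state machine, and the sketch assumes that no attempts have been scheduled before the job reaches RUNNING.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified stand-ins for the job states and event types
// discussed in this thread; the real JobImpl has more of both.
enum SketchJobState { NEW, INITED, RUNNING }

enum SketchJobEventType { JOB_START, JOB_UPDATED_NODES, JOB_TASK_COMPLETED }

final class NodeUpdateGate {
    // Assumption: in these states no task attempts have been scheduled yet,
    // so a node health update has nothing to act on and can be dropped.
    private static final Set<SketchJobState> IGNORE_NODE_UPDATES =
        EnumSet.of(SketchJobState.NEW, SketchJobState.INITED);

    private NodeUpdateGate() {}

    /** Returns true if the job should act on the event in the given state. */
    static boolean shouldHandle(SketchJobState state, SketchJobEventType type) {
        if (type == SketchJobEventType.JOB_UPDATED_NODES) {
            return !IGNORE_NODE_UPDATES.contains(state);
        }
        // Other event types are outside the scope of this sketch.
        return true;
    }
}
```

Ignoring the event, rather than treating it as an invalid transition, covers the unlikely case where an allocate call returns node updates before the job has started.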
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247943#comment-13247943 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12521616/MAPREDUCE-3921.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 509 javac compiler warnings (more than the trunk's current 507 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
                  org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
                  org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
                  org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2165//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2165//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261167#comment-13261167 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

I will change the map to store Container directly. Return ContainerId and NodeId in the getters from the container object.

About the OBSOLETE part: I get how it is used. What I don't get is why we are marking a previously successful task as obsolete and invalid upon the completion of a new task without first checking if the new task was itself successful or not.
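
For readers following the OBSOLETE discussion, the bookkeeping in question can be sketched roughly as below. This is a simplified illustration based only on how this thread describes the behavior. The names (CompletionEventLog, successIndexByTask, onAttemptCompleted) are hypothetical and this is not the JobImpl code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical status values for the sketch.
enum SketchEventStatus { SUCCEEDED, FAILED, KILLED, OBSOLETE }

final class SketchCompletionEvent {
    final String taskId;
    final String attemptId;
    SketchEventStatus status;

    SketchCompletionEvent(String taskId, String attemptId, SketchEventStatus status) {
        this.taskId = taskId;
        this.attemptId = attemptId;
        this.status = status;
    }
}

final class CompletionEventLog {
    // Append-only list of completion events, in arrival order.
    private final List<SketchCompletionEvent> events = new ArrayList<>();
    // taskId -> index in 'events' of that task's current successful attempt.
    private final Map<String, Integer> successIndexByTask = new HashMap<>();

    void onAttemptCompleted(SketchCompletionEvent event) {
        // Step 1: drop any earlier successful entry for this task and mark it
        // OBSOLETE, regardless of how the new attempt ended.
        Integer previous = successIndexByTask.remove(event.taskId);
        if (previous != null) {
            events.get(previous).status = SketchEventStatus.OBSOLETE;
        }
        events.add(event);
        // Step 2: only a successful attempt is indexed again.
        if (event.status == SketchEventStatus.SUCCEEDED) {
            successIndexByTask.put(event.taskId, events.size() - 1);
        }
    }

    boolean hasSuccessfulAttempt(String taskId) {
        return successIndexByTask.containsKey(taskId);
    }

    SketchEventStatus statusAt(int index) {
        return events.get(index).status;
    }
}
```

With this ordering, a FAILED or KILLED completion that arrives after a success leaves the task with no SUCCEEDED entry in the list, which is the behavior being asked about.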
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260172#comment-13260172 ] 

Siddharth Seth commented on MAPREDUCE-3921:
-------------------------------------------

A couple of questions and suggestions.
- Does 'node unhealthy' need to be treated differently from 'TooManyFetchFailures' (Killed versus Failed)? Treating these attempts as Killed means node failures do not count towards the limit on task attempts.
- JOB_UPDATED_NODES needs to be handled in the JOB_INIT state. Very small chance of hitting this.
- Minor: JobImpl.actOnUsableNode can get the task type from the id itself. It doesn't need to fetch the actual task.
- Minor: "if this attempt is not successful" this comment in JobImpl can be removed. It's removing an entry from a successfulAttempt index.
- In KilledAfterSuccessTransition - createJobCounterUpdateEventTAFailed should be createJobCounterUpdateEventTAKilled
- TaskImpl.handleAttemptCompletion - finishedAttempts - this will end up double counting the same task attempt. It's used in some other transition.
- Does the JobHistoryParser need some more changes - to unset fields which may have been set previously by Map/ReduceAttemptSuccessfulEvents and TaskFinishedEvent?
- For running tasks - shouldn't running Reduce attempts also be killed?
- RMContainerAllocator.handleUpdatedNodes - instead of fetching the nodeId via appContext, job etc - the nodeId can be stored with the AssignedRequest. 1) getTask, getAttempt require readLocks - can avoid these calls every second. 2) There's an unlikely race where the nodeId may not be assigned in the TaskAttempt (if the dispatcher thread is backlogged).
- TaskAttemptId.getNodeId() can be avoided. getContainerManagerAddress can be used instead.


Not related to this patch.
Does JOB_TASK_COMPLETED need to be handled (ignored) in additional states?
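
The suggestion above about RMContainerAllocator.handleUpdatedNodes (keeping the nodeId alongside the assigned request instead of looking it up through the job and task) could look roughly like this. It is a minimal sketch under stated assumptions: AssignedRequestsSketch and its methods are hypothetical, and plain strings stand in for TaskAttemptId and NodeId.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class AssignedRequestsSketch {
    // attemptId -> nodeId of the container the attempt was assigned to.
    // Strings are placeholders for the real id types.
    private final Map<String, String> nodeByAttempt = new LinkedHashMap<>();

    void assign(String attemptId, String nodeId) {
        nodeByAttempt.put(attemptId, nodeId);
    }

    void release(String attemptId) {
        nodeByAttempt.remove(attemptId);
    }

    /**
     * Returns the attempts currently assigned to any of the given unusable
     * nodes, in assignment order. No job, task, or attempt lookup is needed,
     * so no read locks are taken on those objects.
     */
    List<String> attemptsOn(Set<String> unusableNodes) {
        List<String> affected = new ArrayList<>();
        for (Map.Entry<String, String> entry : nodeByAttempt.entrySet()) {
            if (unusableNodes.contains(entry.getValue())) {
                affected.add(entry.getKey());
            }
        }
        return affected;
    }
}
```

Because the nodeId is recorded at assignment time, the result does not depend on whether the dispatcher thread has already set the node on the attempt itself.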
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238877#comment-13238877 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519999/MAPREDUCE-3921-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 509 javac compiler warnings (more than the trunk's current 507 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2105//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2105//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-1.patch
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261094#comment-13261094 ] 

Siddharth Seth commented on MAPREDUCE-3921:
-------------------------------------------

bq. Yes, I think. Based on our experience, here we are pre-emptively taking action on a task that might actually be ok. And it should be an infrequent action.
bq. My understanding of existing behavior in mrv1 was that only maps are pre-emptively terminated for performance reasons.
I think 'fetch failure' / 'node unhealthy' should be considered in the same way - at least for the purpose of counting towards the allowed_task_failure limit. Ideally for the task's state as well. There's currently no way to distinguish between a task causing a node to go unhealthy versus other problems. My guess is 'fetch failures' are more often than not caused by a bad tracker, as opposed to a bad task.
WRT killing reduce tasks on unhealthy node - I'm not sure what was done in 20 (From a quick look, couldn't find the code which kills map tasks either). It'd be best if Vinod or others with more knowledge and history about how and why 20 deals with this pitch in.

bq. My understanding was that scheduling happens when the job moves from INIT to RUNNING state via the StartTransition(). Unless allocate is called on RM it will not return any unhealthy machines. So I thought that JOB_UPDATED_EVENT can never come until the job moves into the RUNNING state. Can you please point out the scenario you are thinking about?
Calls to allocate() start once the RMCommunicator service is started - which happens before a JOB_START event is sent. Very unlikely - but there's an extremely remote possibility of an allocate call completing before a job moves into the START state. 

bq. Unless you really want this, I would prefer it the way it's currently written. I prefer not to depend on string name encodings.
It's safe to use TaskId.getTaskType() - don't need to explicitly depend on string name encoding. Avoids the extra task lookups.

bq. That was a question I had and put it in the comments. It seems that for a TaskAttemptCompletedEventTransition the code removes the previous successful entry from successAttemptCompletionEventNoMap. It then checks if the current attempt is successful, and in that case adds it to the successAttemptCompletionEventNoMap. But what if the current attempt is not successful? We have now removed the previous successful attempt too. Is that the desired behavior? This question is independent of this jira.
It also marks the removed entry as OBSOLETE - so the taskAttemptCompletionEvents list doesn't have any SUCCESSFUL attempts for the specific taskId.

bq. I have moved the finishedTask increment out of that function and made it explicit in every transition that requires it to be that way.
In the same context I have a question in comments in MapRetroactiveFailureTransition. Why is this not calling handleAttemptCompletion? My understanding is that handleAttemptCompletion is used to notify reducers about changes in map outputs. So if a map was failed after success then reducers should know about it so that they can abandon its outputs before getting too many fetch failures. Is that not so?
It is calling it via AttemptFailedTransition.transition(). That's the bit which also counts the failure towards the allowed_failure_limit.

bq. Sorry I did not find getContainerManagerAddress(). The map in AssignedRequests stores ContainerId and it's not possible to get nodeId from it. What are you proposing?
Correction - it's called getAssignedContainerMgrAddress. In any case, I was proposing storing the container's NodeId with the AssignedRequest - that completely removes the need to fetch the actual task.

bq. It does not look like it but there may be race conditions I have not thought of. But looking further, it seems that the action on this event checks for job completion in TaskCompletedTransition. TaskCompletedTransition increments job.completedTaskCount irrespective of whether the task has succeeded/killed or failed. Now, TaskCompletedTransition.checkJobCompleteSuccess() checks job.completedTaskCount == job.tasks.size() for completion. How is this working? Won't enough killed tasks/failed + completed tasks trigger job completion? Or is that expected behavior?
It checks for failure before attempting the SUCCESS check - so that should work. Unless I'm missing something - Tasks could complete after a Job moves to state FAILED - which would end up generating this event.
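
The ordering described in the last answer (failure is checked before the success check) can be illustrated with a small sketch. The names and the simple failed-task threshold are assumptions made for illustration and do not reproduce the actual TaskCompletedTransition logic.

```java
// Hypothetical outcome type for the sketch.
enum SketchJobOutcome { RUNNING, SUCCEEDED, FAILED }

final class JobCompletionSketch {
    private JobCompletionSketch() {}

    /**
     * completedTasks counts every finished task, whether it succeeded,
     * failed, or was killed, as described in the thread.
     * allowedFailedTasks is an assumed, simplified failure threshold.
     */
    static SketchJobOutcome check(int totalTasks, int completedTasks,
                                  int failedTasks, int allowedFailedTasks) {
        // Failure is evaluated first, so a job with too many failed tasks
        // never reaches the success comparison below.
        if (failedTasks > allowedFailedTasks) {
            return SketchJobOutcome.FAILED;
        }
        // Only if the failure check passes does "all tasks completed"
        // translate into success.
        if (completedTasks == totalTasks) {
            return SketchJobOutcome.SUCCEEDED;
        }
        return SketchJobOutcome.RUNNING;
    }
}
```

For example, with four tasks all completed and one of them failed, check(4, 4, 1, 0) yields FAILED rather than SUCCEEDED, because the failure branch runs first.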

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293172#comment-13293172 ] 

Siddharth Seth commented on MAPREDUCE-3921:
-------------------------------------------

+1 lgtm. Thanks Bikas.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249128#comment-13249128 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12521796/MAPREDUCE-3921-1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 509 javac compiler warnings (more than the trunk's current 507 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2171//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2171//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-6.patch

Review comments fixed.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261843#comment-13261843 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

Here is how I read TaskAttemptCompletedEventTransition in JobImpl.
Task T has a new attempt completion event for attempt T2.
At this point T1 is a successful attempt that has been recorded in successAttemptCompletionEventNoMap, so it is still a valid successful event.
T is removed from successAttemptCompletionEventNoMap and its T1 entry in taskAttemptCompletionEvents is marked obsolete.
Then T2's status is checked and, if it is successful, T is added to successAttemptCompletionEventNoMap.

This means that while retry T2 was running, T was considered successful because it was still in successAttemptCompletionEventNoMap.
1) So if we do *not* want to leave a task marked successful while it is being retried, then T should already have been removed from successAttemptCompletionEventNoMap and T1 marked obsolete before T2 started. Removing T only after T2 completes is not correct.
2) However, if we want T to remain successful until we have another successful attempt, then it should be removed from successAttemptCompletionEventNoMap only when T2 is successful. But currently we remove T from successAttemptCompletionEventNoMap regardless of T2's status.
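
The two readings above can be contrasted in a minimal sketch. This is not the actual JobImpl code: the class, the Event type, and both method names are hypothetical, and only the two collection names are borrowed from the comment. recordAsDescribed models the behaviour as read above (the previous success is dropped whatever T2's status is), while recordKeepingSuccess models option 2 (the previous success is kept until another attempt succeeds).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; not the real JobImpl.
class CompletionEventSketch {
    enum Status { SUCCEEDED, FAILED, KILLED, OBSOLETE }

    // Assumed stand-in for a task attempt completion event.
    static final class Event {
        final String taskId;
        final String attemptId;
        Status status;

        Event(String taskId, String attemptId, Status status) {
            this.taskId = taskId;
            this.attemptId = attemptId;
            this.status = status;
        }
    }

    // Names taken from the comment; the element types are assumptions.
    final List<Event> taskAttemptCompletionEvents = new ArrayList<>();
    final Map<String, Integer> successAttemptCompletionEventNoMap = new HashMap<>();

    // Behaviour as read in the comment: the old success is removed and
    // marked obsolete before the new attempt's status is even looked at.
    void recordAsDescribed(Event event) {
        taskAttemptCompletionEvents.add(event);
        int eventNo = taskAttemptCompletionEvents.size() - 1;
        Integer previous = successAttemptCompletionEventNoMap.remove(event.taskId);
        if (previous != null) {
            taskAttemptCompletionEvents.get(previous).status = Status.OBSOLETE;
        }
        if (event.status == Status.SUCCEEDED) {
            successAttemptCompletionEventNoMap.put(event.taskId, eventNo);
        }
    }

    // Option 2: the old success stays valid until another attempt succeeds.
    void recordKeepingSuccess(Event event) {
        taskAttemptCompletionEvents.add(event);
        if (event.status != Status.SUCCEEDED) {
            return; // a failed or killed retry does not invalidate the old success
        }
        int eventNo = taskAttemptCompletionEvents.size() - 1;
        Integer previous = successAttemptCompletionEventNoMap.put(event.taskId, eventNo);
        if (previous != null) {
            taskAttemptCompletionEvents.get(previous).status = Status.OBSOLETE;
        }
    }

    boolean isSuccessful(String taskId) {
        return successAttemptCompletionEventNoMap.containsKey(taskId);
    }

    public static void main(String[] args) {
        CompletionEventSketch sketch = new CompletionEventSketch();
        sketch.recordAsDescribed(new Event("T", "T1", Status.SUCCEEDED));
        sketch.recordAsDescribed(new Event("T", "T2", Status.FAILED));
        System.out.println("T successful after failed retry: " + sketch.isSuccessful("T"));
    }
}
```

Under these assumptions, a failed T2 leaves T without any successful attempt in the first variant, which is the inconsistency point 2) describes.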


                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-3921:
-------------------------------------

    Status: Open  (was: Patch Available)

Sorry to come in late.
Some clarifications:
# MR1 JT kills all running tasks on a TT when it's deemed 'lost'.
# It also kills all completed maps on that TT for 'active' jobs.
# The tasks are marked KILLED rather than FAILED and thus don't count against the job's failures, which is correct since it wasn't the job's fault.

Hope this helps.
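
The three points above can be sketched as follows. This is an illustrative model only, not JobTracker code: the Attempt type, the State and Type enums, and both method names are hypothetical. It shows running attempts and completed maps of an active job on the lost node being marked KILLED, so that a failure count based on FAILED attempts is unaffected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the MR1 behaviour described above; not real JobTracker code.
class LostNodeSketch {
    enum Type { MAP, REDUCE }
    enum State { RUNNING, SUCCEEDED, FAILED, KILLED }

    // Assumed minimal view of a task attempt.
    static final class Attempt {
        final String id;
        final String node;
        final Type type;
        State state;

        Attempt(String id, String node, Type type, State state) {
            this.id = id;
            this.node = node;
            this.type = type;
            this.state = state;
        }
    }

    // Marks affected attempts KILLED and returns their ids in encounter order.
    static List<String> onNodeLost(String lostNode, List<Attempt> attempts, boolean jobActive) {
        List<String> killed = new ArrayList<>();
        for (Attempt attempt : attempts) {
            if (!attempt.node.equals(lostNode)) {
                continue;
            }
            boolean running = attempt.state == State.RUNNING;
            // Completed map output lived on the lost node, so it must be regenerated.
            boolean completedMap = jobActive
                    && attempt.type == Type.MAP
                    && attempt.state == State.SUCCEEDED;
            if (running || completedMap) {
                attempt.state = State.KILLED; // KILLED, not FAILED: not the job's fault
                killed.add(attempt.id);
            }
        }
        return killed;
    }

    // Only FAILED attempts count against the job in this sketch.
    static int countFailures(List<Attempt> attempts) {
        int failures = 0;
        for (Attempt attempt : attempts) {
            if (attempt.state == State.FAILED) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(new Attempt("m1", "n1", Type.MAP, State.SUCCEEDED));
        attempts.add(new Attempt("r1", "n1", Type.REDUCE, State.RUNNING));
        System.out.println("killed: " + onNodeLost("n1", attempts, true));
        System.out.println("failures: " + countFailures(attempts));
    }
}
```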
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Closed] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy closed MAPREDUCE-3921.
------------------------------------

    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.2-alpha
>
>         Attachments: MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249164#comment-13249164 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

Attached new patch.
1) Cleaned up asserts, logs and minor comments. The javac warnings are the same as the pre-existing warnings around the use of raw types for events.
2) Replaced the newly added TaskEventType.T_ATTEMPT_KILLED_AFTER_SUCCESS with the existing TaskEventType.T_ATTEMPT_KILLED. The successful attempt was being killed, so it makes sense to reuse the existing code flow. There was some reason (which is lost in my notes) for which I had added a new event type, but after looking at the code I don't see any reason to do so now.
3) All map task completion events (succeeded, killed etc.) are being synced with the reducers. When a map task is killed because of a bad node, that event will be sent to the reducer. Then when the rerun completes, the reducer will know about it, just like any other case of a change in map outputs. All of this is pre-existing functionality, based on my understanding of the code and on talking offline with Vinod. So your concerns about informing the reducers about the newly killed map task are already addressed by the pre-existing code flow.
4) AM recovery. I was having trouble trying to manually create failures on a real cluster. So I went ahead and enhanced the newly added TestMRApp.testUpdatedNodes() with AM recovery. The test now checks for successful tasks being killed and rerun on node failure. Then the AM is restarted and the test verifies that those completed tasks are recovered. While that worked and this patch passed the tests, a variant of the test exposed a different problem.

In recovery mode, the recovery service assigns a success status to any task that has a FINISHED event reported. The only way that status can change is if there is a FAILED event for that task, in which case a failed status is assigned to it. So once a task is marked with a success status, it remains so even when subsequent events kill the successful task attempt and mark it invalid.
Next, the recovery service adds all tasks with a success status to a completedTasks collection. It then enumerates the events and processes them. When it hits a TaskEventType.*_KILLED/FAILED/SUCCEEDED event, it removes that attempt from completedTasks. Recovery does not complete until all attempts of all completedTasks are removed. Now the following sequence of events can happen for tasks A and B, where A1 represents task attempt 1 of A.
completedTasks contains A and B. A1 and A2 have succeeded; A2 was a rerun of A1. B1 has succeeded and B2 was running when the AM crashed.
A1- container request is processed. It uses the nodeid info from A1 to work.
B1- container request is processed. It uses the nodeid info from B1 to work.
A1- Succeeded removes A1
B1- Succeeded removes B1
A2- container request is processed. It uses the nodeid info from A2 to work.
B2- container request is processed. It uses the nodeid info from B2 to work. But there is no such info, as it is populated on task completion. The AM crashed here while trying to resolve the nodeid.
If the AM had not crashed, the following would have happened:
A2- Succeeded removes A2
There is no FAILED/KILLED/SUCCEEDED event for B2 since it was running when the AM crashed. So it seems the AM would never move out of recovery.

If the above is correct, there seem to be two problems:
1) While recovery is in progress, events are handled for task attempts that are not in a completed state. I am not sure if the recovery design allows this and the current crash is simply a case of missing info.
2) Expecting every task attempt of a completedTask to have a KILLED/FAILED/SUCCEEDED entry. This seems to be clearly wrong in the current scenario.
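
The stuck-recovery scenario can be reproduced with a small model. This is a sketch under the assumptions stated in this comment, not the real recovery service: the class and method names are hypothetical, and attempts are represented as plain strings. Every attempt of a completed task is treated as pending until a terminal (KILLED/FAILED/SUCCEEDED) event for it is replayed; recovery is considered finished only when nothing is pending.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the recovery bookkeeping described above.
class RecoverySketch {
    // Returns the attempts still pending after all terminal events are replayed.
    // In this model, recovery completes only if the returned set is empty.
    static Set<String> pendingAfterReplay(Map<String, List<String>> completedTasks,
                                          List<String> terminalEventAttempts) {
        Set<String> pending = new LinkedHashSet<>();
        for (List<String> attempts : completedTasks.values()) {
            pending.addAll(attempts);
        }
        for (String attempt : terminalEventAttempts) {
            pending.remove(attempt); // a terminal event clears that attempt
        }
        return pending;
    }

    public static void main(String[] args) {
        Map<String, List<String>> completedTasks = new LinkedHashMap<>();
        completedTasks.put("A", Arrays.asList("A1", "A2"));
        completedTasks.put("B", Arrays.asList("B1", "B2"));
        // B2 was still running at crash time, so it has no terminal event.
        List<String> terminalEvents = Arrays.asList("A1", "B1", "A2");
        System.out.println("still pending: " + pendingAfterReplay(completedTasks, terminalEvents));
    }
}
```

With the sequence from the comment, B2 never leaves the pending set, which matches problem 2): an attempt that was running at crash time cannot be expected to have a terminal entry.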

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-branch-0.23.patch
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248512#comment-13248512 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

1) By AM Recovery do you mean recovery after restart? The test added in this patch checks that the AM restarts a previously successful task when the node (on which it ran) goes bad. See TestMRApp.java.
2) My understanding was that the new version of the map is a pre-emptively created copy. The running reduces would use their existing inputs. Is that not the case? Are reducers informed about new locations for map outputs on the fly?
4), 6) The comments are for reviewers to clarify those points, e.g. some of the code was taken from similar actions elsewhere. They set the finish time and I was not sure if that was the correct thing to do. I don't think the assert is necessary given the current code, but do you usually put in asserts?
5) The log means that this task was not started and hence further history events are not being added. This is similar to other places in the code.

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-branch-0.23.patch

Patch with above implemented.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251014#comment-13251014 ] 

Robert Joseph Evans commented on MAPREDUCE-3921:
------------------------------------------------

I have been looking at the patch and I think it looks good to me. However, it is rather large, and there are some open questions in the code that I cannot answer, so I would feel more comfortable if Vinod or Sid gave it a quick once-over before I checked it in.

Also, it looks like some of the imports in TestMRApp.java have changed, and a quick upmerge would be good for it to apply cleanly.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292189#comment-13292189 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12531492/MAPREDUCE-3921-10.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2448//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2448//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293035#comment-13293035 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12531700/MAPREDUCE-3921-11.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2450//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2450//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-10.patch

Fixed the counters, including for the FAILED transition, which had the same issue.
Suppressed the unchecked warning.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Fix Version/s: trunk
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293228#comment-13293228 ] 

Hudson commented on MAPREDUCE-3921:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #2364 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2364/])
    MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.1-alpha
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293276#comment-13293276 ] 

Hudson commented on MAPREDUCE-3921:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #2342 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2342/])
    MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.1-alpha
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-branch-0.23.patch

The javadoc warning and one javac warning were caused by a spurious import of sun libraries that Eclipse had inserted.
The remaining 2 javac warnings are similar to existing warnings.
======
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java:[626,25] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java:[647,29] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
======
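
For context, warnings of this kind are what javac reports when a generic handler is invoked through its raw type. The following is a minimal sketch only: `Handler<T>`, `Recorder` and the dispatch methods are hypothetical stand-ins written for illustration, not the real org.apache.hadoop.yarn.event.EventHandler or any code from the patch. It shows how the warning arises and two common ways of dealing with it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generic event handler; not the YARN interface.
interface Handler<T> {
  void handle(T event);
}

class RawHandlerSketch {
  // Records every event it receives so the calls below can be observed.
  static final class Recorder implements Handler<String> {
    final List<String> seen = new ArrayList<>();

    @Override
    public void handle(String event) {
      seen.add(event);
    }
  }

  // Calling through the raw type compiles, but javac -Xlint:unchecked reports
  // "unchecked call to handle(T) as a member of the raw type Handler".
  // The annotation silences the warning without changing behaviour.
  @SuppressWarnings({"rawtypes", "unchecked"})
  static void dispatchRaw(Handler handler, String event) {
    handler.handle(event);
  }

  // Declaring the type argument avoids the warning altogether.
  static void dispatchTyped(Handler<String> handler, String event) {
    handler.handle(event);
  }
}
```

Parameterizing the reference is usually the cleaner option; suppression tends to be used when the raw type comes from an API that cannot easily be changed.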
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>



[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261214#comment-13261214 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12524088/MAPREDUCE-3921-7.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 494 javac compiler warnings (more than the trunk's current 492 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.yarn.server.TestContainerManagerSecurity
                  org.apache.hadoop.yarn.server.resourcemanager.security.TestApplicationTokens
                  org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
                  org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
                  org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
                  org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
                  org.apache.hadoop.mapred.TestClientRedirect
                  org.apache.hadoop.mapreduce.TestYarnClientProtocolProvider
                  org.apache.hadoop.mapreduce.security.TestJHSSecurity

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2305//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2305//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261001#comment-13261001 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12524035/MAPREDUCE-3921-5.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 494 javac compiler warnings (more than the trunk's current 492 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.yarn.server.TestContainerManagerSecurity
                  org.apache.hadoop.yarn.server.resourcemanager.security.TestApplicationTokens
                  org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
                  org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
                  org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
                  org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
                  org.apache.hadoop.mapred.TestClientRedirect
                  org.apache.hadoop.mapreduce.security.TestJHSSecurity

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2302//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2302//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260834#comment-13260834 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

bq. Does 'node unhealthy' need to be treated differently from 'TooManyFetchFailures' ? Killed versus Failed. This ends up with NodeFailures not counting towards the limit on task attempts.
Yes, I think so. Based on our experience, we are pre-emptively taking action here on a task that might actually be OK, and it should be an infrequent action.
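
To illustrate the Killed-versus-Failed distinction being discussed, here is a minimal sketch. The `Outcome` enum, the `exhausted` method and the `maxFailedAttempts` parameter are assumptions made up for this example and are not taken from TaskImpl or the patch; the point is only that killed attempts do not consume the attempt budget.

```java
import java.util.List;

class AttemptLimitSketch {
  // Hypothetical terminal states of a task attempt.
  enum Outcome { SUCCEEDED, FAILED, KILLED }

  // Returns true when the task has used up its attempts. Only FAILED
  // outcomes are counted; KILLED attempts (for example, ones killed because
  // their node became unhealthy) do not consume the budget.
  static boolean exhausted(List<Outcome> outcomes, int maxFailedAttempts) {
    int failed = 0;
    for (Outcome outcome : outcomes) {
      if (outcome == Outcome.FAILED) {
        failed++;
      }
    }
    return failed >= maxFailedAttempts;
  }
}
```

Under this counting, a task whose attempts are repeatedly killed because of node health changes can still be retried, whereas the same number of failures would end it.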

bq. JOB_UPDATED_NODES needs to be handled in the JOB_INIT state. Very small chance of hitting this.
My understanding was that scheduling happens when the job moves from the INIT to the RUNNING state via the StartTransition(). Unless allocate is called on the RM, it will not return any unhealthy machines. So I thought that JOB_UPDATED_NODES can never arrive until the job moves into the RUNNING state. Can you please point out the scenario you are thinking about?
I can make the change for safety reasons, just in case.
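
As a rough sketch of what handling the event "for safety" could look like, the toy state machine below accepts JOB_UPDATED_NODES in INIT and simply ignores it. The class, its two states and its counter are hypothetical and greatly simplified; the real JobImpl uses the YARN state machine factory, which is not reproduced here.

```java
class JobStateSketch {
  enum State { INIT, RUNNING }
  enum Event { JOB_START, JOB_UPDATED_NODES }

  private State state = State.INIT;
  private int nodeUpdatesProcessed = 0;

  State getState() {
    return state;
  }

  int getNodeUpdatesProcessed() {
    return nodeUpdatesProcessed;
  }

  void handle(Event event) {
    switch (state) {
      case INIT:
        if (event == Event.JOB_START) {
          state = State.RUNNING;
        }
        // JOB_UPDATED_NODES is deliberately ignored here: nothing has been
        // scheduled yet, so there are no attempts to act on.
        break;
      case RUNNING:
        if (event == Event.JOB_UPDATED_NODES) {
          nodeUpdatesProcessed++;
        }
        // A repeated JOB_START is ignored in this simplified sketch.
        break;
      default:
        throw new IllegalStateException("Unhandled state " + state);
    }
  }
}
```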

bq. Minor: JobImpl.actOnUsableNode can get the task type from the id itself. It doesn't need to fetch the actual task.
Unless you really want this, I would prefer it the way it's currently written. I prefer not to depend on string name encodings.

bq. Minor: "if this attempt is not successful" this comment in JobImpl can be removed. It's removing an entry from a successfulAttempt index.
That was a question I had, so I put it in the comments. It seems that for a TaskAttemptCompletedEventTransition the code removes the previous successful entry from successAttemptCompletionEventNoMap. It then checks whether the current attempt is successful and, in that case, adds it to successAttemptCompletionEventNoMap. But what if the current attempt is not successful? We have now removed the previous successful attempt too. Is that the desired behavior? This question is independent of this jira.
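
The remove-then-conditionally-add sequence in question can be sketched as follows. This is an illustration under assumptions: the class, the String task ids and the method names are invented for the example and do not reproduce JobImpl.

```java
import java.util.HashMap;
import java.util.Map;

class SuccessIndexSketch {
  // Maps a task id to the event number of its latest successful attempt.
  private final Map<String, Integer> successEventNoByTask = new HashMap<>();

  // Mirrors the sequence described in the comment: drop the earlier entry
  // first, then re-add only when the new attempt succeeded.
  void onAttemptCompleted(String taskId, int eventNo, boolean succeeded) {
    successEventNoByTask.remove(taskId);
    if (succeeded) {
      successEventNoByTask.put(taskId, eventNo);
    }
  }

  // Returns null when the task currently has no recorded successful attempt.
  Integer successfulEventNo(String taskId) {
    return successEventNoByTask.get(taskId);
  }
}
```

With this shape, a later unsuccessful attempt for the same task leaves the index without any entry for it, which is the behaviour the question is about.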

bq. In KilledAfterSuccessTransition - createJobCounterUpdateEventTAFailed should be createJobCounterUpdateEventTAKilled
Done.

bq. TaskImpl.handleAttemptCompletion - finishedAttempts - this will end up double counting the same task attempt. It's used in some other transition.
I have moved the finishedTask increment out of that function and made it explicit in every transition that requires it.
In the same context, I have a question in the comments in MapRetroactiveFailureTransition: why does it not call handleAttemptCompletion? My understanding is that handleAttemptCompletion is used to notify reducers about changes in map outputs. So if a map is failed after it succeeded, reducers should know about it so that they can abandon its outputs before getting too many fetch failures. Is that not so?

bq. Does the JobHistoryParser need some more changes - to unset fields which may have been set previously by Map/ReduceAttemptSuccessfulEvents and TaskFinishedEvent
Done. Reset all fields set in handleTaskFinishedEvent. Others are already handled in the existing code.

bq. For running tasks - shouldn't running Reduce attempts also be killed ?
My understanding of existing behavior in mrv1 was that only maps are pre-emptively terminated for performance reasons.

bq. RMContainerAllocator.handleUpdatedNodes - instead of fetching the nodeId via appContext, job etc - the nodeId can be stored with the AssignedRequest. 1) getTask, getAttempt require readLocks - can avoid these calls every second. 2) There's an unlikely race where the nodeId may not be assigned in the TaskAttempt (if the dispatcher thread is backlogged). TaskAttemptId.getNodeId() can be avoided. getContainerManagerAddress can be used instead.
Sorry, I did not find getContainerManagerAddress(). The map in AssignedRequests stores ContainerId and it's not possible to get the nodeId from it. What are you proposing?

bq. Not related to this patch. Does JOB_TASK_COMPLETED need to be handled (ignored) in additional states?
It does not look like it, but there may be race conditions I have not thought of. Looking further, though, it seems that the action on this event checks for job completion in TaskCompletedTransition. TaskCompletedTransition increments job.completedTaskCount irrespective of whether the task has succeeded, been killed, or failed. Now, TaskCompletedTransition.checkJobCompleteSuccess() checks job.completedTaskCount == job.tasks.size() for completion. How is this working? Won't enough killed/failed tasks plus completed tasks trigger job completion? Or is that expected behavior?
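
The concern can be illustrated with a small sketch. It assumes, as described above, that every task reaching a terminal state bumps the counter. The names are hypothetical and this is not the real TaskCompletedTransition. Whether other transitions in the real code move the job to a failed or killed state before this check fires is exactly the open question.

```java
import java.util.List;

/** Hypothetical model of a count-based completion check. */
class JobCompletionSketch {
  enum TaskState { RUNNING, SUCCEEDED, FAILED, KILLED }

  // Mirrors the check as described: any terminal task increments the counter,
  // and the job is considered complete when the counter equals the task count.
  static boolean completeByCount(List<TaskState> tasks) {
    int completedTaskCount = 0;
    for (TaskState state : tasks) {
      if (state != TaskState.RUNNING) {
        completedTaskCount++;
      }
    }
    return completedTaskCount == tasks.size();
  }

  // Stricter variant, shown only for contrast: every task must have succeeded.
  static boolean completeBySuccess(List<TaskState> tasks) {
    for (TaskState state : tasks) {
      if (state != TaskState.SUCCEEDED) {
        return false;
      }
    }
    return true;
  }
}
```

With one succeeded, one killed and one failed task, the count-based check reports completion while the stricter variant does not.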
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-3.patch

Attaching patch after pulling latest changes and improved test for AM recovery.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293609#comment-13293609 ] 

Hudson commented on MAPREDUCE-3921:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1107/])
    MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

     Result = FAILURE
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.1-alpha
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-4.patch

New patch with latest synced changes from trunk.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921.patch

Some cleanup and creating new diff wrt trunk since the previous one was for 0.23
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260860#comment-13260860 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

Added a new patch for the comments above.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261135#comment-13261135 ] 

Siddharth Seth commented on MAPREDUCE-3921:
-------------------------------------------

bq. Yes, the removed entry is marked OBSOLETE. But I still dont understand why that would be done if the current entry is not successful. Why lose the previously successful entry when the current one is not successful itself? This should be done only when the current entry is successful.
This list is sent over to reduce tasks - which do consider the OBSOLETE state when deciding on which map outputs need to be fetched.

bq. I like the change to store NodeId's instead of ContainerId's in the AssignedRequests map. I would like to make it a separate change and not merge it with this one. There might be other gotchas to doing that.
Both NodeId and ContainerId can be stored (I believe containerId is required). That should be a reasonably simple change - and will allow TA_KILLs to be sent directly.
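
A rough sketch of what storing both ids could look like, so that attempts on an unusable node can be found without going through the job and task objects. Everything here is an assumption for illustration: the class and method names are hypothetical, and plain strings stand in for the real ContainerId, NodeId and TaskAttemptId types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the AssignedRequests bookkeeping. */
class AssignedRequestsSketch {
  static final class Assignment {
    final String containerId;
    final String nodeId;

    Assignment(String containerId, String nodeId) {
      this.containerId = containerId;
      this.nodeId = nodeId;
    }
  }

  // Attempt id -> container and node it was assigned to (insertion ordered).
  private final Map<String, Assignment> byAttempt = new LinkedHashMap<>();

  void add(String attemptId, String containerId, String nodeId) {
    byAttempt.put(attemptId, new Assignment(containerId, nodeId));
  }

  // The container id is still available for callers that need it.
  String containerFor(String attemptId) {
    Assignment assignment = byAttempt.get(attemptId);
    return assignment == null ? null : assignment.containerId;
  }

  // Attempts that could be sent a kill directly because their node is unusable.
  List<String> attemptsOnNodes(Set<String> unusableNodes) {
    List<String> toKill = new ArrayList<>();
    for (Map.Entry<String, Assignment> entry : byAttempt.entrySet()) {
      if (unusableNodes.contains(entry.getValue().nodeId)) {
        toKill.add(entry.getKey());
      }
    }
    return toKill;
  }
}
```

The lookup needs no locks on the task or attempt objects, which is the point of the suggestion. How this would be wired into RMContainerAllocator is left open.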
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248585#comment-13248585 ] 

Robert Joseph Evans commented on MAPREDUCE-3921:
------------------------------------------------

For the testing I just want to be sure that nothing catastrophically bad happens in these cases.  If a failed task is not detected until the reducer fails to fetch data from it, that is fine with me, but if the AM dies or hangs, or if there is somehow data corruption, I would really like to avoid that.
 
By AM Recovery I mean that when the AM dies, i.e. it was on a bad node, the RM will restart it.  The AM then looks through the JobHistory logs to find out which tasks finished successfully before it died, and which ones need to be restarted.  I just want to be sure that if a map task is restarted because a node is unhealthy and the AM also is restarted that the recovery code will handle that case correctly.
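
The case I am worried about can be sketched as a replay over the history log. This is only an illustration under assumptions: the event type and all names are hypothetical and this is not the real recovery code or JobHistoryParser. The idea is that a success followed by a kill or failure of the same attempt must not leave the task looking finished after recovery.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical replay of history events during AM recovery. */
class RecoverySketch {
  enum Kind { ATTEMPT_SUCCEEDED, ATTEMPT_FAILED, ATTEMPT_KILLED }

  static final class HistoryEvent {
    final String taskId;
    final String attemptId;
    final Kind kind;

    HistoryEvent(String taskId, String attemptId, Kind kind) {
      this.taskId = taskId;
      this.attemptId = attemptId;
      this.kind = kind;
    }
  }

  // Returns the tasks that can be treated as finished and need not be rerun.
  static Set<String> tasksToKeep(List<HistoryEvent> log) {
    // Task id -> attempt id currently considered successful.
    Map<String, String> successfulAttempt = new LinkedHashMap<>();
    for (HistoryEvent event : log) {
      if (event.kind == Kind.ATTEMPT_SUCCEEDED) {
        successfulAttempt.put(event.taskId, event.attemptId);
      } else if (event.attemptId.equals(successfulAttempt.get(event.taskId))) {
        // The previously successful attempt was killed or failed afterwards,
        // so the earlier success must be forgotten.
        successfulAttempt.remove(event.taskId);
      }
    }
    return successfulAttempt.keySet();
  }
}
```

A kill or failure of a different attempt of the same task leaves the recorded success in place in this sketch. Only the attempt that had succeeded can undo it.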

bq. Are reducers informed about new locations for map outputs on the fly?
That is my understanding; otherwise no reducer could be launched until all mappers had finished, and all reducers would have to be relaunched if a map task disappeared on a bad node.

bq. I dont think the assert is necessary given the current code but do you usually put in asserts?
I don't usually put in asserts.  But I don't really like dangling TODO's lying around.  If it is something that needs to be done I feel we should either do it or file a JIRA to track it so it gets done.  If it is not something that needs to be done then we don't need a TODO for it.  If this is a copy and paste TODO I am OK with leaving it.  That is the reason I did not comment on the other TODOs added into the code, I could see where they were copied from.

bq. The log means that this task was not started and hence further history events are not being added. This is similar to other places in the code
Yes, I can see the place where it was copied from.  What I am referring to is that the KilledTransition, where this looks like it came from, handles the kill event coming in from many different states.  In some of these states it is reasonable to have a launch time of 0.  In KilledAfterSuccessTransition, as the name implies, it seems very difficult to have a taskattempt in the "SUCCESS" state that had no launch time.  A task that finished successfully but was never run seems odd to me. If you want to leave the check in for defensive programming I am happy with that, but I would prefer the log message not to be at debug level, so that someone looking can see that something odd happened here.

bq. The comments are for reviewers to clarify those points. eg. Some of the code was taken from similar actions elsewhere. They set the finish time and I was not sure if that was the correct thing to do.
It seems logical that if you are killing a task we want to be sure the finish time is set, so just set it. But it should already have been set for the SUCCESS case, so I would just leave it off, though I really don't know for sure.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249867#comment-13249867 ] 

Robert Joseph Evans commented on MAPREDUCE-3921:
------------------------------------------------

Someone pointed out to me that my comment is a bit confusing.  When I said two nodes going down very close to one another, I meant that for this to happen we would need two nodes that had the correct processes running on them to go down in quick succession.  But now that I think about it more, I am not even sure that it would expose the issue.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-5.patch
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251055#comment-13251055 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

Thanks! Robert, could you look at MAPREDUCE-4128 please? That's pretty small and would make this less risky.

I have a new patch for this based on the patch for MAPREDUCE-4128. That will clean up the patch based off the latest changes to trunk.

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226794#comment-13226794 ] 

Bikas Saha commented on MAPREDUCE-3921:
---------------------------------------

Attaching a patch that builds on MAPREDUCE-3353:
1) RMContainerAllocator receives node updates along with allocated containers.
2) It sends a KILL event to map task attempts running on unusable nodes.
3) It sends a JobUpdatedNode event to JobImpl.
4) JobImpl maintains a mapping of nodes to the successful task attempts that have run on them.
5) On receiving updated nodes, JobImpl sends a KILL event to the map task attempts from 4).
6) Successfully completed tasks retroactively move to the KILLED state if their successful task attempt is the same as the task attempt in 5). They then reschedule another attempt.
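
The bookkeeping in steps 4) to 6) can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in the patch: the class NodeUpdateSketch, its method names, and the use of plain strings for node and task attempt IDs are all assumptions made for the example. Only the idea of a node-to-successful-attempts mapping and the KILL event come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the bookkeeping described in steps 4) to 6).
// Node IDs and task attempt IDs are plain strings here for brevity.
final class NodeUpdateSketch {
  // Step 4: node -> successful map task attempts that ran on it.
  private final Map<String, Set<String>> succeededOnNode = new HashMap<>();
  // Records the attempts that would be sent a KILL event.
  private final List<String> killedAttempts = new ArrayList<>();

  // Called when a map task attempt succeeds on a node.
  void attemptSucceeded(String nodeId, String attemptId) {
    succeededOnNode.computeIfAbsent(nodeId, k -> new HashSet<>()).add(attemptId);
  }

  // Step 5: on an updated-nodes event, kill the successful attempts
  // that ran on nodes which are no longer usable.
  void nodesUpdated(Map<String, Boolean> nodeUsable) {
    for (Map.Entry<String, Boolean> e : nodeUsable.entrySet()) {
      if (e.getValue()) {
        continue; // node is usable, nothing to kill
      }
      Set<String> attempts = succeededOnNode.remove(e.getKey());
      if (attempts != null) {
        killedAttempts.addAll(attempts); // step 6 would reschedule each task
      }
    }
  }

  List<String> getKilledAttempts() {
    return killedAttempts;
  }
}
```

In this sketch a node that is reported as usable is simply skipped, so only attempts whose output lived on an unusable node are marked for a KILL event.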
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253678#comment-13253678 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12522596/MAPREDUCE-3921-3.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified test files.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 508 javac compiler warnings (more than the trunk's current 506 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these unit tests:
                  org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
                  org.apache.hadoop.yarn.server.TestDiskFailures
                  org.apache.hadoop.yarn.server.TestContainerManagerSecurity
                  org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
                  org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
                  org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
                  org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
                  org.apache.hadoop.mapred.TestMiniMRClasspath
                  org.apache.hadoop.mapreduce.v2.TestMRJobs
                  org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers
                  org.apache.hadoop.mapred.TestMiniMRBringup
                  org.apache.hadoop.mapred.TestMiniMRChildTask
                  org.apache.hadoop.mapred.TestReduceFetch
                  org.apache.hadoop.mapred.TestClusterMRNotification
                  org.apache.hadoop.mapred.TestReduceFetchFromPartialMem
                  org.apache.hadoop.mapred.TestJobCounters
                  org.apache.hadoop.mapreduce.TestChild
                  org.apache.hadoop.mapred.TestMiniMRClientCluster
                  org.apache.hadoop.ipc.TestSocketFactory
                  org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService
                  org.apache.hadoop.mapreduce.v2.TestMROldApiJobs
                  org.apache.hadoop.mapreduce.v2.TestSpeculativeExecution
                  org.apache.hadoop.mapreduce.lib.output.TestJobOutputCommitter
                  org.apache.hadoop.mapred.TestClientRedirect
                  org.apache.hadoop.mapred.TestLazyOutput
                  org.apache.hadoop.mapred.TestJobCleanup
                  org.apache.hadoop.mapreduce.TestMapReduceLazyOutput
                  org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath
                  org.apache.hadoop.mapreduce.v2.TestMRAppWithCombiner
                  org.apache.hadoop.conf.TestNoDefaultsJobConf
                  org.apache.hadoop.mapreduce.v2.TestRMNMInfo
                  org.apache.hadoop.mapred.TestClusterMapReduceTestCase
                  org.apache.hadoop.mapreduce.v2.TestNonExistentJob
                  org.apache.hadoop.mapred.TestJobSysDirWithDFS
                  org.apache.hadoop.mapreduce.v2.TestUberAM
                  org.apache.hadoop.mapreduce.v2.TestMiniMRProxyUser
                  org.apache.hadoop.mapred.TestJobName
                  org.apache.hadoop.mapreduce.security.TestJHSSecurity

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2222//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2222//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3921:
--------------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: trunk)
                   2.0.1-alpha
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.1-alpha
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3921:
--------------------------------------

       Fix Version/s:     (was: 0.23.2)
    Target Version/s: 0.23.3, 2.0.1-alpha
              Status: Patch Available  (was: Open)

Submitting to jenkins.

Minor stuff:
In TaskAttemptImpl, createJobCounterUpdateEventTAKilled - SLOTS_MILLIS_MAPS shouldn't be updated if a task_attempt is transitioning from SUCCEEDED to FAILED.
Some minor formatting changes are required in RMContainerAllocator (spacing in the for loop), and the warnings in the same class need fixing.
Otherwise, lgtm.

Bobby, the patch doesn't apply to the 23 branch. It has a dependency on MAPREDUCE-3958. Do you want to pull that into 23 as well?
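
The SLOTS_MILLIS_MAPS remark can be illustrated with a small sketch. It is not the actual TaskAttemptImpl code: the class SlotMillisSketch, its enums, and the method countersOnKill are hypothetical, and the reasoning that the slot time was already counted when the attempt succeeded is an assumption about why the counter should be skipped.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of the counter update discussed above.
final class SlotMillisSketch {
  enum Counter { SLOTS_MILLIS_MAPS, NUM_KILLED_MAPS }
  enum State { RUNNING, SUCCEEDED }

  // Builds the counter increments for a map attempt that is being killed.
  // Slot time is only added when the attempt was not already SUCCEEDED,
  // on the assumption that it was counted once at success time.
  static Map<Counter, Long> countersOnKill(State previous, long slotMillis) {
    Map<Counter, Long> updates = new EnumMap<>(Counter.class);
    updates.put(Counter.NUM_KILLED_MAPS, 1L);
    if (previous != State.SUCCEEDED) {
      updates.put(Counter.SLOTS_MILLIS_MAPS, slotMillis);
    }
    return updates;
  }
}
```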

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Issue Type: Improvement  (was: Bug)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293536#comment-13293536 ] 

Hudson commented on MAPREDUCE-3921:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1074 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1074/])
    MAPREDUCE-3921. MR AM should act on node health status changes. Contributed by Bikas Saha. (Revision 1349065)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349065
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/JobUpdatedNodesEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/event/TaskAttemptKillEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestMRApp.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRMContainerAllocator.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeState.java

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 2.0.1-alpha
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238670#comment-13238670 ] 

Hadoop QA commented on MAPREDUCE-3921:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519982/MAPREDUCE-3921-branch-0.23.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    -1 javac.  The applied patch generated 510 javac compiler warnings (more than the trunk's current 507 warnings).

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2102//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2102//console

This message is automatically generated.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Siddharth Seth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261811#comment-13261811 ] 

Siddharth Seth commented on MAPREDUCE-3921:
-------------------------------------------

Thanks for the updated patch, Bikas. Will take a look. Still waiting for input from the MR veterans on some of the previous comments: how things were handled in 20, specifically for killing map/reduce tasks on unhealthy nodes, and treating 'node unhealthy' similarly to 'fetch failure' (state Killed / Failed, as well as counting towards max_attempts).

bq. About the OBSOLETE part. I get how it is used. What I don't get is why we are marking a previously successful task as obsolete and invalid upon the completion of a new task without first checking if the new task was itself successful or not.
Are you considering leaving the task in SUCCESSFUL state, even if it's being retried, so that the Reduce *may* be able to pull data before there's a new SUCCESSFUL attempt?
Otherwise, marking the attempt as OBSOLETE and removing the task from successAttemptCompletionEventNoMap (tracks only SUCCESSFUL attempts) seems like the correct thing to do.
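
The behaviour under discussion might look roughly like the sketch below. It is an illustration only, not JobImpl itself: the class CompletionEventsSketch, the Event type, and the field successEventNo (standing in for successAttemptCompletionEventNoMap) are hypothetical simplifications with string IDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a completion-event list plus a map that tracks
// only the currently successful attempt of each task.
final class CompletionEventsSketch {
  enum Status { SUCCEEDED, FAILED, KILLED, OBSOLETE }

  static final class Event {
    final String taskId;
    final String attemptId;
    Status status;

    Event(String taskId, String attemptId, Status status) {
      this.taskId = taskId;
      this.attemptId = attemptId;
      this.status = status;
    }
  }

  private final List<Event> events = new ArrayList<>();
  // task ID -> index of that task's current SUCCEEDED event.
  private final Map<String, Integer> successEventNo = new HashMap<>();

  // Any new completion for a task marks its earlier successful event
  // OBSOLETE, whether or not the new completion is itself a success.
  void attemptCompleted(String taskId, String attemptId, Status status) {
    Integer previous = successEventNo.remove(taskId);
    if (previous != null) {
      events.get(previous).status = Status.OBSOLETE;
    }
    events.add(new Event(taskId, attemptId, status));
    if (status == Status.SUCCEEDED) {
      successEventNo.put(taskId, events.size() - 1);
    }
  }

  Status statusOf(int eventNo) {
    return events.get(eventNo).status;
  }

  boolean hasSuccessfulAttempt(String taskId) {
    return successEventNo.containsKey(taskId);
  }
}
```

In this sketch, a KILLED completion that follows a success leaves the task with no entry in the success map until a later attempt succeeds, which is the window the question about reducers pulling data refers to.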
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249844#comment-13249844 ] 

Robert Joseph Evans commented on MAPREDUCE-3921:
------------------------------------------------

I took a quick look at the code and it looks good to me.  As for the recovery error you discovered, could you please file a follow-up JIRA for it? It is a preexisting issue that can be caused by AM recovery with speculative execution.  This patch may expose the issue more frequently, but not enough to really worry me.  You need two nodes going down very close to one another, which is possible but not frequent.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Patch Available  (was: Open)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Robert Joseph Evans (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-3921:
-------------------------------------------

    Status: Patch Available  (was: Open)

Kicking Jenkins.
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-7.patch

Changed the assignedRequests maps to use Container as the value instead of ContainerId.
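
A rough sketch of why holding the whole Container as the map value can help: given a node that went bad, the attempts assigned to it can be found straight from the map. This is only an illustration under assumptions. The class AssignedRequestsSketch, the simplified Container type, the String ids and the attemptsOnNode helper are all hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AssignedRequestsSketch {

    // Hypothetical stand-in for a container record: it knows its own id
    // and the node it was allocated on.
    static final class Container {
        final String containerId;
        final String nodeId;

        Container(String containerId, String nodeId) {
            this.containerId = containerId;
            this.nodeId = nodeId;
        }
    }

    // Returns the attempt ids whose assigned container lives on the given node.
    // With only a ContainerId as the map value, the node would not be reachable here.
    static List<String> attemptsOnNode(Map<String, Container> assigned, String nodeId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Container> e : assigned.entrySet()) {
            if (e.getValue().nodeId.equals(nodeId)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Container> assigned = new LinkedHashMap<>();
        assigned.put("attempt_m_0", new Container("container_1", "nodeA"));
        assigned.put("attempt_m_1", new Container("container_2", "nodeB"));
        assigned.put("attempt_r_0", new Container("container_3", "nodeA"));

        // Attempts that would need attention if nodeA became unusable.
        System.out.println(attemptsOnNode(assigned, "nodeA"));
    }
}
```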
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Status: Open  (was: Patch Available)
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch
>
>



        

[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Robert Joseph Evans (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248411#comment-13248411 ] 

Robert Joseph Evans commented on MAPREDUCE-3921:
------------------------------------------------

A few minor comments on the patch, and some questions about the manual testing that was done on it.  Overall the patch looks very good; once the javac warnings are addressed and I know manual testing was performed, I am +1 on it.

  # Have you tested this with AM Recovery?  Specifically I would like to see the AM recover when a map task finished successfully and then was killed because the node went bad.
  # Have you tested this with reduces?  The code will reschedule the map task, but I don't really see where/if it informs the reducer that it is rescheduling the map task until that new task finishes successfully.  I believe that the reducer would just ignore an update for a task it has already fetched successfully, but I just want to be sure it was tested.
  # NodeState.isUnhealthy()  (Very minor) I think it would be cleaner to have it be {code}
return this == UNHEALTHY ||
       this == DECOMMISSIONED ||
       this == LOST;
{code}
  # KilledAfterSuccessTransition.transition() There is some commented out code {code}
// why set a wrong finish time ???
//set the finish time
//taskAttempt.setFinishTime();
{code} Is this needed? If not, please remove it.
  # KilledAfterSuccessTransition.transition() I am a bit confused by the log statement {code}
if (taskAttempt.getLaunchTime() != 0) {
  ...
}else {
  LOG.debug("Not generating HistoryFinish event since start event not generated for taskAttempt: "
      + taskAttempt.getID());
}
{code} Is this really needed (it looks like a copy and paste from the KilledTransition)?  When would we even get a successful job that did not have a launch time?  I would rather have it be an ERROR or WARN than a debug if we did see this in this transition.
  # TaskAttemptCompletedEventTransition.transition() {code}
// TODO assert nodeId is not null
{code} Please either add in the assert or remove the TODO.
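
For the NodeState.isUnhealthy() suggestion above, a self-contained sketch of the shape I have in mind. The wrapper class NodeStateSketch and the RUNNING value are assumptions made only so the example compiles on its own; the real enum in the patch may carry more states.

```java
public class NodeStateSketch {

    // Simplified stand-in for the node state enum. UNHEALTHY, DECOMMISSIONED
    // and LOST come from the review comment; RUNNING is assumed for contrast.
    enum NodeState {
        RUNNING, UNHEALTHY, DECOMMISSIONED, LOST;

        // A node counts as unhealthy if it is in any of the three bad states.
        boolean isUnhealthy() {
            return this == UNHEALTHY ||
                   this == DECOMMISSIONED ||
                   this == LOST;
        }
    }

    public static void main(String[] args) {
        for (NodeState s : NodeState.values()) {
            System.out.println(s + " unhealthy? " + s.isUnhealthy());
        }
    }
}
```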

                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-11.patch
    
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: trunk
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-10.patch, MAPREDUCE-3921-11.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>



        

[jira] [Updated] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy

Posted by "Bikas Saha (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated MAPREDUCE-3921:
----------------------------------

    Attachment: MAPREDUCE-3921-9.patch

Adding a patch in which the AM sends a kill event to map and reduce tasks, to keep the behavior similar to the JT in mrv1.
One thing to note is that the RM also terminates such containers. However, in the AM such RM terminations mark the task attempt as FAILED, whereas in this case we need to mark it as KILLED. So the AM has to send the kill event itself, so that it pre-empts the transition to FAILED that the normal handling of such cases would otherwise cause.
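
The ordering argument can be illustrated with a toy model in which whichever terminal event reaches a running attempt first decides its final state. This is a simplified sketch and not the AM's actual state machine; KillBeforeFailSketch, Attempt, onKill and onContainerTerminated are hypothetical names.

```java
public class KillBeforeFailSketch {

    enum AttemptState { RUNNING, KILLED, FAILED }

    // Toy task attempt: only a RUNNING attempt reacts to events, so the
    // first terminal event wins and later ones are ignored.
    static final class Attempt {
        private AttemptState state = AttemptState.RUNNING;

        // Kill sent by the AM, e.g. because the node was reported unusable.
        void onKill() {
            if (state == AttemptState.RUNNING) {
                state = AttemptState.KILLED;
            }
        }

        // Container termination reported by the RM.
        void onContainerTerminated() {
            if (state == AttemptState.RUNNING) {
                state = AttemptState.FAILED;
            }
        }

        AttemptState state() {
            return state;
        }
    }

    public static void main(String[] args) {
        // AM kill arrives first: the later RM termination no longer changes the state.
        Attempt killedFirst = new Attempt();
        killedFirst.onKill();
        killedFirst.onContainerTerminated();
        System.out.println(killedFirst.state());

        // No AM kill: the RM termination alone marks the attempt FAILED.
        Attempt terminatedOnly = new Attempt();
        terminatedOnly.onContainerTerminated();
        System.out.println(terminatedOnly.state());
    }
}
```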
                
> MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-1.patch, MAPREDUCE-3921-3.patch, MAPREDUCE-3921-4.patch, MAPREDUCE-3921-5.patch, MAPREDUCE-3921-6.patch, MAPREDUCE-3921-7.patch, MAPREDUCE-3921-9.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921.patch
>
>

