Posted to mapreduce-issues@hadoop.apache.org by "Jian He (JIRA)" <ji...@apache.org> on 2016/05/11 23:04:13 UTC
[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280963#comment-15280963 ] 

Jian He commented on MAPREDUCE-6513:
------------------------------------

It looks like TaskAttemptKillEvent will be sent twice for each mapper.
The first time is at the code below in RMContainerAllocator#handleUpdatedNodes: in response to this event, JobImpl will send a TaskAttemptKillEvent for each mapper on the unusable node.
{code}
      // send event to the job to act upon completed tasks
      eventHandler.handle(new JobUpdatedNodesEvent(getJob().getID(),
          updatedNodes));
{code}
The second time is at this code in the same method:
{code}
            // If map, reschedule next task attempt.
            boolean rescheduleNextAttempt = (i == 0) ? true : false;
            eventHandler.handle(new TaskAttemptKillEvent(tid,
                "TaskAttempt killed because it ran on unusable node"
                    + taskAttemptNodeId, rescheduleNextAttempt));
          }
{code}

This is how it has been for a long time; I'm not sure why. With the new change, will this cause more container requests to get scheduled?
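
To make the question concrete, here is a minimal sketch of the concern. It does not use the real Hadoop classes: KillEvent, ToyHandler, and requestsForOneMapper are hypothetical stand-ins, and it assumes both paths ask for a reschedule, which may not match the actual JobImpl behaviour. It only shows that if two kill events arrive for the same attempt and each one triggers a reschedule, the number of container requests doubles unless the second kill is ignored.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateKillSketch {

    /** Hypothetical stand-in for a kill event; NOT Hadoop's TaskAttemptKillEvent. */
    static final class KillEvent {
        final String attemptId;
        final boolean rescheduleNextAttempt;

        KillEvent(String attemptId, boolean rescheduleNextAttempt) {
            this.attemptId = attemptId;
            this.rescheduleNextAttempt = rescheduleNextAttempt;
        }
    }

    /** Toy handler that records one container request per reschedule it performs. */
    static final class ToyHandler {
        private final boolean ignoreDuplicates;
        private final Set<String> alreadyKilled = new HashSet<>();
        private final List<String> containerRequests = new ArrayList<>();

        ToyHandler(boolean ignoreDuplicates) {
            this.ignoreDuplicates = ignoreDuplicates;
        }

        void handle(KillEvent event) {
            // Set.add returns false when the attempt was already killed once.
            boolean firstKill = alreadyKilled.add(event.attemptId);
            if (ignoreDuplicates && !firstKill) {
                return; // second kill for the same attempt is a no-op
            }
            if (event.rescheduleNextAttempt) {
                containerRequests.add(event.attemptId);
            }
        }

        int requestCount() {
            return containerRequests.size();
        }
    }

    /** Sends two kills for one mapper attempt and returns the resulting request count. */
    static int requestsForOneMapper(boolean ignoreDuplicates) {
        ToyHandler handler = new ToyHandler(ignoreDuplicates);
        String attempt = "attempt_m_000000_0"; // made-up attempt id
        // Path 1 (assumed): the job reacts to the updated-nodes event and kills the attempt.
        handler.handle(new KillEvent(attempt, true));
        // Path 2 (assumed): the allocator sends its own kill for the same attempt.
        handler.handle(new KillEvent(attempt, true));
        return handler.requestCount();
    }

    public static void main(String[] args) {
        System.out.println("naive handler: " + requestsForOneMapper(false));
        System.out.println("deduplicating handler: " + requestsForOneMapper(true));
    }
}
```

Whether the real task attempt state machine behaves like the naive or the deduplicating handler here is exactly what I'm asking about.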

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob.zhao
>            Assignee: Varun Saxena
>            Priority: Critical
>         Attachments: MAPREDUCE-6513.01.patch, MAPREDUCE-6513.02.patch, MAPREDUCE-6513.03.patch, MAPREDUCE-6513.3.branch-2.8.patch, MAPREDUCE-6513.3_1.branch-2.7.patch, MAPREDUCE-6513.3_1.branch-2.8.patch
>
>
> While a job with many tasks was in progress, one node became unstable due to an OS issue. After the node became unstable, the maps on this node changed to the KILLED state.
> The maps that were running on the unstable node are rescheduled; all of them are in the SCHEDULED state, waiting for the RM to assign containers. Ask requests for maps were seen until the node was good again (all of those failed); there were no ask requests after that. But the AM keeps preempting the reducers (it keeps recycling them).
> In the end, the reducers are waiting for the mappers to complete, and the mappers never get containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?


