You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "shilongfei (Jira)" <ji...@apache.org> on 2020/01/13 03:34:00 UTC
[jira] [Commented] (MAPREDUCE-7253) MR job got hanged forever when map task timeout

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013996#comment-17013996 ] 

shilongfei commented on MAPREDUCE-7253:
---------------------------------------

The log of this failed map is as follows. The first attempt is scheduled at 9:46, the resource is allocated at 9:50, the second attempt is scheduled at 9:52, and no subsequent resources are allocated.

 
{code:java}
2020-01-02 09:46:14,181 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2020-01-02 09:50:11,102 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
2020-01-02 09:50:11,146 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
2020-01-02 09:52:11,944 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1577272892024_639901_m_007585_1 TaskAttempt Transitioned from NEW to UNASSIGNED
{code}
 

From the AM allocation container log, it can be seen that the first attempt to the current map is assigned to the container at 9:50, and the second attempt has not been
{code:java}
2020-01-02 09:46:16,377 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_e922_1577272892024_639901_01_000002 to attempt_1577272892024_639901_m_000250_0
2020-01-02 09:46:16,379 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_e922_1577272892024_639901_01_000003 to attempt_1577272892024_639901_m_000839_0
......
2020-01-02 09:50:11,095 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_e922_1577272892024_639901_01_016146 to attempt_1577272892024_639901_m_007585_0
......
{code}
Looking at the container application log, it can be found that after 9:52 AM has applied for resources, but there is no second attempt allocated to the map.
{code:java}
2020-01-02 09:52:30,084 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2020-01-02 09:52:31,093 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 3
2020-01-02 09:52:33,110 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2020-01-02 09:52:34,119 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2020-01-02 09:55:11,234 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
{code}
The requested resources are allocated to reduce
{code:java}
2020-01-02 09:52:31,093 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:52:33,110 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:52:34,119 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:55:11,234 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce{code}
 

Because the timeout event does not carry the rescheduler flag
{code:java}
if(taskTimedOut) {
  // task is lost, remove from the list and raise lost event
  iterator.remove();
  eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(entry
      .getKey(), "AttemptID:" + entry.getKey().toString()
      + " Timed out after " + taskTimeOut / 1000 + " secs"));
  eventHandler.handle(new TaskAttemptEvent(entry.getKey(),
      TaskAttemptEventType.TA_TIMED_OUT));
}{code}

> MR job got hanged forever when map task timeout
> -----------------------------------------------
>
>                 Key: MAPREDUCE-7253
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7253
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: shilongfei
>            Priority: Major
>         Attachments: 2020-01-02 11-15-42屏幕截图.png, 2020-01-02 11-15-58屏幕截图.png, 2020-01-02 11-16-14屏幕截图.png
>
>
> When the MR map job fails and the reduce job has started to be scheduled, the failed map will not be scheduled, it will always be in the NEW state, and the reduce job will also be blocked.
> So why was the map job not scheduled in time?  !2020-01-02 11-15-58屏幕截图.png!
> !2020-01-02 11-15-42屏幕截图.png!
> !2020-01-02 11-16-14屏幕截图.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org