Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/04/22 15:32:00 UTC
[jira] [Commented] (MAPREDUCE-6329) Failure of start map task on NM cause job hang

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507067#comment-14507067 ] 

Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------

Scanning the AM logs, it looks like this may be a situation where the AM is waiting for the RM to allocate a new container but the RM thinks all asks are fulfilled.  We would need to look into the RM logs to try to verify.

I noticed this odd sequence in the AM log:
{noformat}
2015-04-20 21:36:37,225 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 2
[...]
2015-04-20 21:36:37,236 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1428390739155_23973_01_000002 to attempt_1428390739155_23973_m_000000_0
[...]
2015-04-20 21:36:37,246 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1428390739155_23973_01_000003 to attempt_1428390739155_23973_m_000001_0
[... container 3 proceeds to fail to launch ...]
2015-04-20 21:36:38,259 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1428390739155_23973_01_000003
[...]
2015-04-20 21:36:39,276 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1428390739155_23973_01_000004
2015-04-20 21:36:39,276 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1428390739155_23973_01_000004
{noformat}

I see the AM received two containers per the "Got allocated containers 2" log message, presumably containers 000002 and 000003.  Then the AM is suddenly notified of a completed container 000004 that apparently was never allocated to it.  I do not see a corresponding "Got allocated" message that would indicate the AM ever saw container 000004.  That may explain why the AM is stuck.  If the RM thought it allocated a container to the AM and that container was then released, the RM will think all asks are satisfied.  However, the AM would need to re-ask for the final map container or the job will not progress.  We need to look into the RM log to find the RM's perspective on what happened to container_1428390739155_23973_01_000004.
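
For anyone who wants to check other AM logs for the same pattern, here is a rough sketch of the comparison described above: collect every container id that appears in an "Assigned container" message and flag any "Received completed container" message whose id was never assigned. The function name and the sample list are hypothetical and only for illustration; the regexes assume the RMContainerAllocator message wording shown in the excerpt above, which may differ in other Hadoop versions.

```python
import re

# Patterns assume the RMContainerAllocator wording quoted in the AM log excerpt.
ASSIGNED = re.compile(r"Assigned container (container_\S+) to (attempt_\S+)")
COMPLETED = re.compile(r"Received completed container (container_\S+)")


def find_unassigned_completions(lines):
    """Return ids of completed containers the AM never logged as assigned."""
    assigned = set()
    orphans = []
    for line in lines:
        match = ASSIGNED.search(line)
        if match:
            assigned.add(match.group(1))
            continue
        match = COMPLETED.search(line)
        if match and match.group(1) not in assigned:
            orphans.append(match.group(1))
    return orphans


# Sample input: the relevant lines from the excerpt above, copied verbatim.
PREFIX = ("INFO [RMCommunicator Allocator] "
          "org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ")
sample = [
    "2015-04-20 21:36:37,236 " + PREFIX
    + "Assigned container container_1428390739155_23973_01_000002"
      " to attempt_1428390739155_23973_m_000000_0",
    "2015-04-20 21:36:37,246 " + PREFIX
    + "Assigned container container_1428390739155_23973_01_000003"
      " to attempt_1428390739155_23973_m_000001_0",
    "2015-04-20 21:36:38,259 " + PREFIX
    + "Received completed container container_1428390739155_23973_01_000003",
    "2015-04-20 21:36:39,276 " + PREFIX
    + "Received completed container container_1428390739155_23973_01_000004",
]

for container_id in find_unassigned_completions(sample):
    print("completed but never assigned:", container_id)
```

To run it against a real AM syslog, pass the open file object instead of the sample list. A hit only shows the AM-side symptom; the RM log is still needed to see why the RM handed out and then completed that container.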

> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
>                 Key: MAPREDUCE-6329
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>         Attachments: syslog.tgz
>
>
> During a rolling NM update, the AM failed to start a container on an NM.
> The job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)