Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/04/22 15:32:00 UTC
[jira] [Commented] (MAPREDUCE-6329) Failure of start map task on NM cause job hang
[ https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507067#comment-14507067 ]
Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------
Scanning the AM logs, it looks like this may be a situation where the AM is waiting for the RM to allocate a new container but the RM thinks all asks are fulfilled. We would need to look into the RM logs to try to verify.
I noticed this odd sequence in the AM log:
{noformat}
2015-04-20 21:36:37,225 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 2
[...]
2015-04-20 21:36:37,236 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1428390739155_23973_01_000002 to attempt_1428390739155_23973_m_000000_0
[...]
2015-04-20 21:36:37,246 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1428390739155_23973_01_000003 to attempt_1428390739155_23973_m_000001_0
[... container 3 proceeds to fail to launch ...]
2015-04-20 21:36:38,259 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1428390739155_23973_01_000003
[...]
2015-04-20 21:36:39,276 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1428390739155_23973_01_000004
2015-04-20 21:36:39,276 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1428390739155_23973_01_000004
{noformat}
I see the AM received two containers, per the "Got allocated containers 2" log message, presumably containers 000002 and 000003. Then suddenly the AM is notified of a completed container 000004 that apparently was never allocated to it. I do not see a corresponding "Got allocated" message that would indicate the AM ever saw container 000004. That may explain why the AM is stuck. If the RM thought it allocated a container to the AM and that container was then released, the RM will think all asks are satisfied. However, the AM would need to re-ask for the final map container or the job will not progress. We need to look into the RM log and find the RM's perspective on what happened to container_1428390739155_23973_01_000004.
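For anyone who wants to repeat this check on a full AM syslog, here is a minimal sketch in Python. It is not part of Hadoop; the function name and the regular expressions are assumptions based only on the message formats visible in the excerpt above. It reports containers that the AM saw complete without ever having logged an assignment for them.

```python
import re

# Patterns assumed from the RMContainerAllocator messages quoted above.
ASSIGNED = re.compile(r"Assigned container (container_\S+) to (attempt_\S+)")
COMPLETED = re.compile(r"Received completed container (container_\S+)")


def find_unassigned_completions(lines):
    """Return container IDs reported complete with no earlier 'Assigned container' line."""
    assigned = set()
    orphans = []
    for line in lines:
        match = ASSIGNED.search(line)
        if match:
            assigned.add(match.group(1))
            continue
        match = COMPLETED.search(line)
        if match and match.group(1) not in assigned:
            orphans.append(match.group(1))
    return orphans


# Message text taken from the excerpt above (timestamps and class names trimmed).
sample = [
    "Got allocated containers 2",
    "Assigned container container_1428390739155_23973_01_000002 to attempt_1428390739155_23973_m_000000_0",
    "Assigned container container_1428390739155_23973_01_000003 to attempt_1428390739155_23973_m_000001_0",
    "Received completed container container_1428390739155_23973_01_000003",
    "Received completed container container_1428390739155_23973_01_000004",
    "Container complete event for unknown container id container_1428390739155_23973_01_000004",
]

print(find_unassigned_completions(sample))
```

On the sample lines this flags only container 000004. To scan a real log, pass an open file object instead of the list, since the function only iterates over lines.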
> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
> Key: MAPREDUCE-6329
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Attachments: syslog.tgz
>
>
> During a rolling update of the NMs, the AM failed to start a container on an NM.
> The job then hung.
> AM logs are attached.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)