You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/07/13 16:41:34 UTC
[jira] [Commented] (MAPREDUCE-4437) Race in MR ApplicationMaster can cause reducers to never be scheduled

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413783#comment-13413783 ] 

Jason Lowe commented on MAPREDUCE-4437:
---------------------------------------

This is an excerpt of the AM log from a sleep job with 20 mappers and 20 reducers and mapreduce.job.reduce.slowstart.completedmaps=1.0:

{noformat}
...
2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1342108144362_0002_01_000017
2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1342108144362_0002_01_000015
2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule...
2012-07-12 15:51:35,721 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 20
2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000014_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000005_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000009_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,721 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000007_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000010_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000003_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000018_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000012_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000008_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000002_0: 
2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000017_0: 
2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000011_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000001_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000016_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,722 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000015_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:35,723 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1342108144362_0002_m_000013_0: Container killed by the ApplicationMaster.

2012-07-12 15:51:36,695 INFO [Socket Reader #1 for port 50500] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for 2 (auth:SIMPLE)
2012-07-12 15:51:36,696 INFO [Socket Reader #1 for port 50500] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for 2 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$BlockingInterface
2012-07-12 15:51:36,715 INFO [Socket Reader #2 for port 50500] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for 2 (auth:SIMPLE)
2012-07-12 15:51:36,716 INFO [Socket Reader #2 for port 50500] SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for 2 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$BlockingInterface
2012-07-12 15:51:36,723 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:36,725 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:36,752 INFO [IPC Server handler 0 on 50500] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP   job_1342108144362_0002. Report-size will be 20
2012-07-12 15:51:36,774 INFO [IPC Server handler 0 on 50500] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE   job_1342108144362_0002. Report-size will be 20
2012-07-12 15:51:37,727 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:37,729 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:38,731 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:38,733 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:18 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:38,905 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1342108144362_0002_m_000002_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2012-07-12 15:51:38,905 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1342108144362_0002_m_000002_0
2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1342108144362_0002_m_000002 Task Transitioned from RUNNING to SUCCEEDED
2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1342108144362_0002_m_000017_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 19
2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1342108144362_0002_m_000017_0
2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1342108144362_0002_m_000017 Task Transitioned from RUNNING to SUCCEEDED
2012-07-12 15:51:38,906 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 20
2012-07-12 15:51:39,735 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:39,736 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:40,739 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:40,741 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:41,743 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
2012-07-12 15:51:41,744 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReduces:20 ScheduledMaps:0 ScheduledReduces:0 AssignedMaps:0 AssignedReduces:0 completedMaps:20 completedReduces:0 containersAllocated:25 containersReleased:5 hostLocalAssigned:0 rackLocalAssigned:0 availableResources(headroom):memory: 112640
...
{noformat}

Note the {{Recalculating schedule...}} message in the above log is the last occurrence of that message in the entire log.  What appears to have happened is the RM notified the AM of all 20 containers having completed *before* all of them were fully processed by the AM.  Therefore when it went to recalculate the reducer schedule, the slow start threshold had not yet been met since it thought only 18 of the 20 maps had completed.  Later it processed the remaining two map task completions, but that did *not* recalculate the reducer schedule.  No other container completions or events will occur after that, so the reducer schedule never gets recalculated.  Without recalculating the reducer schedule, the AM will perpetually mistakenly think it should not schedule any reducers, and the job hangs.

                
> Race in MR ApplicationMaster can cause reducers to never be scheduled
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4437
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4437
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>
> If the MR AM is notified of container completion by the RM before the AM receives notification of the container cleanup from the NM then it can fail to schedule reducers indefinitely.  Logs showing the issue to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira