Posted to mapreduce-issues@hadoop.apache.org by "Bibin A Chundatt (JIRA)" <ji...@apache.org> on 2019/03/21 07:39:00 UTC
[jira] [Commented] (MAPREDUCE-6190) If a task stucks before its
first heartbeat, it never timeouts and the MR job becomes stuck
[ https://issues.apache.org/jira/browse/MAPREDUCE-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797892#comment-16797892 ]
Bibin A Chundatt commented on MAPREDUCE-6190:
---------------------------------------------
[~uranus]
During one of our tests, we found that setting mapreduce.task.timeout=0, which is used to disable the task timeout, no longer works.
If the task timeout is configured as zero, the task still fails with the stuck timeout when the TaskStatus is null, as in the ping branch of the code below.
{code}
if (sendProgress) {
  // we need to send progress update
  updateCounters();
  checkTaskLimits();
  taskStatus.statusUpdate(taskProgress.get(),
                          taskProgress.toString(),
                          counters);
  amFeedback = umbilical.statusUpdate(taskId, taskStatus);
  taskFound = amFeedback.getTaskFound();
  taskStatus.clearStatus();
}
else {
  // send ping
  amFeedback = umbilical.statusUpdate(taskId, null);
  taskFound = amFeedback.getTaskFound();
}
{code}
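To make the expected behaviour concrete, here is a minimal standalone sketch. It is not the Hadoop implementation: the class TimeoutSketch, the method isTimedOut, and all parameter names are hypothetical. It also assumes that a task timeout of zero is meant to disable the stuck-timeout check as well, which is my reading of the intended behaviour rather than something confirmed by the patch.

```java
// Hypothetical illustration only; none of these names exist in Hadoop.
final class TimeoutSketch {
    private TimeoutSketch() {}

    /**
     * Decides whether a task attempt should be treated as timed out.
     *
     * @param hasReportedStatus true once the attempt has sent a non-null
     *                          status; false while only pings were seen
     * @param lastSeenMs        time of the last heartbeat, in milliseconds
     * @param nowMs             current time, in milliseconds
     * @param taskTimeoutMs     regular task timeout; assumed here that a
     *                          value <= 0 disables every timeout check
     * @param stuckTimeoutMs    timeout applied before the first real status
     */
    static boolean isTimedOut(boolean hasReportedStatus,
                              long lastSeenMs,
                              long nowMs,
                              long taskTimeoutMs,
                              long stuckTimeoutMs) {
        // Assumption: zero (or negative) means "timeouts disabled",
        // including for attempts that have only pinged so far.
        if (taskTimeoutMs <= 0) {
            return false;
        }
        // Before the first real status, use the stuck timeout;
        // afterwards, use the regular task timeout.
        long limitMs = hasReportedStatus ? taskTimeoutMs : stuckTimeoutMs;
        return limitMs > 0 && (nowMs - lastSeenMs) > limitMs;
    }
}
```

With this check, an attempt that has only ever pinged is never failed when the task timeout is zero, while a positive task timeout still lets the stuck timeout apply before the first status update.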
> If a task stucks before its first heartbeat, it never timeouts and the MR job becomes stuck
> -------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6190
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6190
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.1
> Reporter: Ankit Malhotra
> Assignee: Zhaohui Xin
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: MAPREDUCE-6190.001.patch, MAPREDUCE-6190.002.patch, MAPREDUCE-6190.003.patch, MAPREDUCE-6190.004.patch, MAPREDUCE-6190.005.patch
>
>
> Trying to figure out a weird issue we started seeing on our CDH5.1.0 cluster with map reduce jobs on YARN.
> We had a job stuck for hours because one of the mappers never started up fully. The map task had two attempts: the first one failed, the AM tried to schedule a second one, and that second attempt was stuck at STATE: STARTING, STATUS: NEW. A node was never assigned, and the task, along with the job, was stuck indefinitely.
> The AM logs had this being logged again and again:
> {code}
> 2014-12-09 19:25:12,347 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1408745633994_450952_02_003807
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce preemption successful attempt_1408745633994_450952_r_000048_1000
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1408745633994_450952_r_000050_1000
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: completedMapPercent 0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 finalReduceMemLimit:1720320 netScheduledMapMem:2560 netScheduledReduceMem:1722880
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
> 2014-12-09 19:25:14,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:78 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
> 2014-12-09 19:25:14,359 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
> {code}
> After we killed the task manually, the AM started the task again, scheduled it, and ran it successfully, completing the task and the job with it.
> Some quick code grepping led us here:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-app/2.3.0/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java#397
> But we still don't quite understand why this would happen once in a while, or why the job would suddenly be OK once the stuck task is manually killed.
> Note: Other jobs succeed on the cluster while this job is stuck.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)