Posted to mapreduce-issues@hadoop.apache.org by "Zhaohui Xin (JIRA)" <ji...@apache.org> on 2018/11/22 03:27:00 UTC

[jira] [Comment Edited] (MAPREDUCE-6190) MR Job is stuck because of one mapper stuck in STARTING

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695520#comment-16695520 ] 

Zhaohui Xin edited comment on MAPREDUCE-6190 at 11/22/18 3:26 AM:
------------------------------------------------------------------

We solved this problem.

This problem had existed in our cluster for about a year, occurring roughly once a month. We finally found that a disk problem led to very long container localization times, up to a few hours.

We added a container start-up timeout parameter so that tasks with a problematic start-up are actively failed.
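
To illustrate the idea (only a rough sketch, not the actual patch; the configuration key and class names below are hypothetical): the AM records when each task attempt enters STARTING and periodically fails any attempt that has stayed there longer than the configured limit.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of a container start-up timeout check.
 * The configuration key and this class are illustrative only;
 * they are not the actual MAPREDUCE-6190 patch.
 */
public class ContainerStartupMonitor {

  // e.g. "mapreduce.task.container-startup.timeout-ms" (hypothetical key)
  private final long startupTimeoutMs;

  // attempt id -> time (ms) when the attempt entered the STARTING state
  private final Map<String, Long> startingSince = new ConcurrentHashMap<>();

  // callback that marks an attempt as failed so the AM reschedules it
  private final Consumer<String> failAttempt;

  public ContainerStartupMonitor(long startupTimeoutMs, Consumer<String> failAttempt) {
    this.startupTimeoutMs = startupTimeoutMs;
    this.failAttempt = failAttempt;
  }

  /** Call when an attempt transitions to STARTING (localization begins). */
  public void attemptStarting(String attemptId) {
    startingSince.put(attemptId, System.currentTimeMillis());
  }

  /** Call when an attempt is launched or finishes normally. */
  public void attemptLaunched(String attemptId) {
    startingSince.remove(attemptId);
  }

  /** Periodic check: actively fail attempts stuck in STARTING for too long. */
  public void checkTimeouts() {
    long now = System.currentTimeMillis();
    startingSince.forEach((attemptId, since) -> {
      if (now - since > startupTimeoutMs) {
        startingSince.remove(attemptId);
        failAttempt.accept(attemptId);
      }
    });
  }
}
{code}

Failing the timed-out attempt lets the AM schedule a replacement instead of waiting indefinitely on a slow localization, which is the same effect as the manual kill described in the original report below.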
 
 [~ajisakaa], can you assign this issue to me? I can attach my patch here. Thank you. :D



> MR Job is stuck because of one mapper stuck in STARTING
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6190
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6190
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Ankit Malhotra
>            Priority: Major
>
> Trying to figure out a weird issue we started seeing on our CDH5.1.0 cluster with MapReduce jobs on YARN.
> We had a job stuck for hours because one of the mappers never started up fully. The map task had 2 attempts: the first one failed, the AM tried to schedule a second one, and that second attempt was stuck at STATE: STARTING, STATUS: NEW. A node never got assigned, and the task, along with the job, was stuck indefinitely.
> The AM logs had this being logged again and again:
> {code}
> 2014-12-09 19:25:12,347 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1408745633994_450952_02_003807
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce preemption successful attempt_1408745633994_450952_r_000048_1000
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
> 2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1408745633994_450952_r_000050_1000
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: completedMapPercent 0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 finalReduceMemLimit:1720320 netScheduledMapMem:2560 netScheduledReduceMem:1722880
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
> 2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
> 2014-12-09 19:25:14,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:78 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
> 2014-12-09 19:25:14,359 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
> {code}
> On killing the task manually, the AM started the task up again, scheduled it, and ran it successfully, completing the task and, with it, the job.
> Some quick code grepping led us here:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-app/2.3.0/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java#397
> But we still don't quite understand why this happens once in a while, or why the job is suddenly fine once the stuck task is manually killed.
> Note: Other jobs succeed on the cluster while this job is stuck.


