Posted to mapreduce-issues@hadoop.apache.org by "Devaraj K (Commented) (JIRA)" <ji...@apache.org> on 2012/04/11 14:41:17 UTC

[jira] [Commented] (MAPREDUCE-4030) If the nodemanager on which the map task executed goes down before the map output is consumed by the reducer, then the job fails with a shuffle error

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251506#comment-13251506 ] 

Devaraj K commented on MAPREDUCE-4030:
--------------------------------------

Hi Nishan, before the reducer bails out, it checks the reducer's health and progress; it only bails out when all of the conditions below hold:

1. failureCounts.size() >= maxFailedUniqueFetches, or failureCounts.size() == (totalMaps - doneMaps)

2. !reducerHealthy, where reducerHealthy is ((totalFailures / (totalFailures + doneMaps)) < 0.5f)

3. !reducerProgressedEnough, where reducerProgressedEnough is ((doneMaps / totalMaps) >= 0.5f), or reducerStalled

I think that in your case all or most of the maps had run on the node that went down, the above conditions were satisfied, and the reducer therefore failed without those maps being re-run.
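
For reference, here is a rough, self-contained sketch of how these three checks combine (the real logic lives in ShuffleScheduler.checkReducerHealth()). The class name, method signature, and the numbers in main() are invented for illustration; only the field names and the 0.5f thresholds follow the conditions listed above.

// Rough sketch only -- not the actual Hadoop source. The class name, method
// signature, and the sample numbers in main() are invented for illustration;
// the field names and 0.5f thresholds follow the conditions listed above.
import java.util.HashSet;
import java.util.Set;

public class ShuffleBailOutSketch {

    static boolean shouldBailOut(Set<String> failureCounts,  // map attempts with fetch failures
                                 int maxFailedUniqueFetches,
                                 int totalMaps,
                                 int doneMaps,                // map outputs already copied
                                 int totalFailures,           // total failed fetch attempts
                                 boolean reducerStalled) {

        // 1. Too many unique fetch failures, or every not-yet-copied map has failed.
        boolean tooManyUniqueFailures =
                failureCounts.size() >= maxFailedUniqueFetches
                || failureCounts.size() == (totalMaps - doneMaps);

        // 2. Health: fewer than half of all fetch attempts have failed.
        boolean reducerHealthy =
                ((float) totalFailures / (totalFailures + doneMaps)) < 0.5f;

        // 3. Progress: at least half of the map outputs have been copied.
        boolean reducerProgressedEnough =
                ((float) doneMaps / totalMaps) >= 0.5f;

        // The reducer bails out ("Exceeded MAX_FAILED_UNIQUE_FETCHES") only when
        // all three checks point to a shuffle that cannot recover on its own.
        return tooManyUniqueFailures
                && !reducerHealthy
                && (!reducerProgressedEnough || reducerStalled);
    }

    public static void main(String[] args) {
        // Invented numbers: 100 maps, only 10 outputs copied, 5 unique maps failing,
        // 30 failed fetch attempts overall -- the reducer would bail out here.
        Set<String> failedMaps = new HashSet<>();
        for (int i = 0; i < 5; i++) {
            failedMaps.add("attempt_m_00000" + i);
        }
        boolean bail = shouldBailOut(failedMaps, 5, 100, 10, 30, false);
        System.out.println("bail out? " + bail);  // prints: bail out? true
    }
}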

Can you please check and confirm whether the same case happened in your env?
	
                
> If the nodemanager on which the map task executed goes down before the map output is consumed by the reducer, then the job fails with a shuffle error
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4030
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4030
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Nishan Shetty
>            Assignee: Devaraj K
>
> My cluster has 2 NMs.
> The value of "mapreduce.job.reduce.slowstart.completedmaps" is set to 1.
> While the job was in progress and the mappers had finished about 99% of their work, one of the NMs went down.
> The job then failed with the following trace:
> "Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:123) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:371) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:148) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:143) Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253) at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:240) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:152) "

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira