Posted to mapreduce-dev@hadoop.apache.org by "Qi Liu (JIRA)" <ji...@apache.org> on 2009/11/10 21:17:28 UTC

[jira] Created: (MAPREDUCE-1202) Checksum error on a single reducer does not trigger too many fetch failures for mapper during shuffle

Checksum error on a single reducer does not trigger too many fetch failures for mapper during shuffle
-----------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1202
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobtracker
    Affects Versions: 0.20.1
            Reporter: Qi Liu
            Priority: Critical


During one run of a large map-reduce job, a single reducer kept throwing a ChecksumException when trying to shuffle map output from one mapper. The map output for that particular reducer is believed to be corrupted, since the mapper node has known disk issues. However, even after hundreds of retries to fetch that output, each of which was reported to the JobTracker, the mapper was never declared to have too many fetch failures by the JobTracker.
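
The JobTracker-side threshold appears to be where these notifications get lost. Below is a hypothetical standalone model of the check in JobInProgress.fetchFailureNotification, as I read the 0.20 code (constant names and values are approximate, not verified against this exact release): a map output is declared lost only once the notification count reaches both a small absolute minimum and at least half of the running reduces, so hundreds of reports from one reducer out of several thousand can never trip it.

import java.util.HashMap;
import java.util.Map;

// Hypothetical model of JobInProgress.fetchFailureNotification (branch 0.20).
// Names and constants are approximations for illustration only.
public class FetchFailureThresholdModel {
  // Minimum number of notifications before a map output can be declared lost.
  static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
  // Notifications must also cover at least this fraction of running reduces.
  static final float MAX_ALLOWED_FETCH_FAILURES_PERCENT = 0.5f;

  private final Map<String, Integer> fetchFailuresPerMap =
      new HashMap<String, Integer>();

  // Returns true if the JobTracker would declare the map output lost
  // and re-execute the map.
  boolean notifyFetchFailure(String mapAttemptId, int runningReduces) {
    Integer failures = fetchFailuresPerMap.get(mapAttemptId);
    failures = (failures == null) ? 1 : failures + 1;
    fetchFailuresPerMap.put(mapAttemptId, failures);

    float failureRate = (float) failures / runningReduces;
    return failures >= MAX_FETCH_FAILURES_NOTIFICATIONS
        && failureRate >= MAX_ALLOWED_FETCH_FAILURES_PERCENT;
  }

  public static void main(String[] args) {
    FetchFailureThresholdModel jt = new FetchFailureThresholdModel();
    // The job below has at least 5397 reduces (the failing one is r_005396).
    // Even after 113 notifications from that one reducer, 113 / 5397 ~= 0.02,
    // far below the 0.5 threshold, so the map is never declared lost.
    boolean lost = false;
    for (int i = 0; i < 113; i++) {
      lost = jt.notifyFetchFailure("attempt_200911010621_0023_m_039676_0", 5397);
    }
    System.out.println("map declared lost after 113 reports: " + lost); // false
  }
}

If this reading is right, the fraction check effectively assumes many reducers each failing a few times, and cannot catch a single reducer failing indefinitely against one corrupted map output.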

Here is the log:
2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200911010621_0023_m_039676_0, compressed len: 449177, decompressed len: 776729
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 776729 bytes (449177 raw bytes) into RAM from attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200911010621_0023_m_039676_0
org.apache.hadoop.fs.ChecksumException: Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0 copy failed: attempt_200911010621_0023_m_039676_0 from xx.yy.com
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: org.apache.hadoop.fs.ChecksumException: Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)

2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200911010621_0023_r_005396_0: Failed fetch #113 from attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200911010621_0023_m_039676_0 even after MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to the JobTracker
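
Note that the reducer side seems to behave as intended here. A ChecksumException during the in-memory shuffle surfaces as a read error, which, as I read the 0.20 ReduceTask code, causes a report on every failed attempt rather than only after MAX_FETCH_RETRIES_PER_MAP retries. A minimal sketch of that condition (names approximate, for illustration only):

class ReducerReportSketch {
  // Hypothetical mirror of the reducer-side condition in
  // ReduceTask$ReduceCopier.copyOutput (branch 0.20); names approximate.
  static boolean shouldReportFetchFailure(int failedFetches,
                                          int maxFetchRetriesPerMap,
                                          boolean readError) {
    // A ChecksumException in shuffleInMemory counts as a read error, so the
    // reducer notifies the JobTracker on every failed attempt -- consistent
    // with the "Failed fetch #113 ... reporting to the JobTracker" lines above.
    return failedFetches >= maxFetchRetriesPerMap || readError;
  }
}

So the reports are being generated and delivered; the bug is that the JobTracker discards them via the fraction check described above.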


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.