Posted to common-dev@hadoop.apache.org by Michael Stack <st...@archive.org> on 2006/04/12 03:51:42 UTC

'Corrupt GZIP trailer' and other irrecoverable IOEs.

I had a job fail in SequenceFile$Reader#next because a record was 
corrupt.  Every attempt at reading failed with 'Corrupt GZIP trailer' 
(see the stack trace at the end).  Looks like IOEs of this type -- i.e. 
irrecoverable IOEs -- need to have their maps rescheduled somehow.  One 
mechanism would be to report this type of error as an FSError.  The 
tasktracker would then shut itself down and take itself out of the pool of 
machines.  This is a little radical but should let the job complete. 

But I suppose it's kinda tough distinguishing the 'irrecoverables' from 
the 'recoverables' (e.g. timeouts). 
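To illustrate the distinction, here's a rough sketch in plain Java of what such a classifier might look like.  The class and method names are made up for illustration -- this is not Hadoop API -- and the heuristics (EOF mid-record, a corrupt compression trailer) are just the cases from this mail:

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical sketch, not Hadoop code: separate IOEs that a retry of the
// same split can never fix from transient ones worth retrying.
public class IOEClassifier {

    // True for IOEs caused by bad bytes on disk: retrying the read of the
    // same data will fail identically every time.
    static boolean isIrrecoverable(IOException e) {
        if (e instanceof EOFException) {
            // Hit end-of-stream in the middle of a record: truncated file.
            return true;
        }
        String msg = e.getMessage();
        // The compressed payload itself is damaged.
        return msg != null && msg.contains("Corrupt GZIP trailer");
    }

    public static void main(String[] args) {
        // A timeout is transient -- a retry may well succeed.
        System.out.println(isIrrecoverable(new java.net.SocketTimeoutException()));
        // A corrupt trailer fails the same way on every attempt.
        System.out.println(isIrrecoverable(new IOException("Corrupt GZIP trailer")));
    }
}
```

Of course, matching on exception type and message text is brittle -- which is exactly why telling the two classes apart is hard.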

I've started trying to keep tabs.

Another 'irrecoverable'-looking IOE that I've seen is the following:
...
060404 220614 task_r_3dd1gh  IOException null at 1294336. Skipping entries.
060404 220614 task_r_3dd1gh java.io.EOFException
060404 220614 task_r_3dd1gh     at 
java.io.DataInputStream.readFully(DataInputStream.java:178)
060404 220614 task_r_3dd1gh     at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:56)
060404 220614 task_r_3dd1gh     at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:94)
060404 220614 task_r_3dd1gh     at 
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:394)
060404 220614 task_r_3dd1gh     at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:209)
060404 220614 task_r_3dd1gh     at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:709)

I've gotten this a good few times.  

I recently added 'keep-going' code modeled on the handleChecksumError 
code in SequenceFile.  It tries to just skip over records/files that 
throw IOEs.  It's made it more likely jobs will complete on our 
hardware, but this is less than optimal for a couple of reasons: 1) it 
will also skip records whose IOEs are recoverable, and 2) it breaks the 
mapreduce 'determinate' results property. 
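For what it's worth, the 'keep-going' loop looks roughly like this.  This is a sketch against a stand-in reader interface, not the actual SequenceFile.Reader code:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of the 'keep-going' approach: swallow the IOE from
// next() and move on to the following record instead of failing the task.
public class SkippingReader {

    // Stand-in for SequenceFile$Reader#next; returns null at end of file.
    interface RecordReader {
        String next() throws IOException;
    }

    // Reads everything it can into 'out'; returns how many reads it skipped.
    static int readAll(RecordReader reader, List<String> out) {
        int skipped = 0;
        while (true) {
            try {
                String rec = reader.next();
                if (rec == null) {
                    break;          // clean end of file
                }
                out.add(rec);
            } catch (IOException e) {
                skipped++;          // corrupt record: note it and keep going
            }
        }
        return skipped;
    }
}
```

This is where both problems above show up: the catch block can't tell a corrupt record from a transient failure, and which records survive depends on where the corruption happened to land.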

Thanks,
St.Ack


Here's the GZIP exception I mentioned up top:

060323 180210 task_r_ack5c7 0.853471% reduce > reduce
060323 180210 task_r_2u5g4x  Error running child
060323 180210 task_r_g6d5ir 0.78022665% reduce > reduce
060323 180210 task_r_2u5g4x java.lang.RuntimeException: 
java.io.IOException: Corrupt GZIP trailer
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:703)
060323 180210 task_r_2u5g4x Caused by: java.io.IOException: Corrupt GZIP 
trailer
060323 180210 task_r_2u5g4x     at 
java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:175)
060323 180210 task_r_2u5g4x     at 
java.util.zip.GZIPInputStream.read(GZIPInputStream.java:89)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.io.WritableUtils.readCompressedByteArray(WritableUtils.java:35)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.io.WritableUtils.readCompressedString(WritableUtils.java:70)
060323 180210 task_r_2u5g4x     at 
org.apache.nutch.parse.ParseText.readFields(ParseText.java:44)
060323 180210 task_r_2u5g4x     at 
org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:59)
060323 180210 task_r_2u5g4x     at 
org.apache.nutch.parse.ParseImpl.read(ParseImpl.java:69)
060323 180210 task_r_2u5g4x     at 
org.apache.nutch.fetcher.FetcherOutput.readFields(FetcherOutput.java:47)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:344)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060323 180210 task_r_2u5g4x     at 
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060323 180210 task_r_2u5g4x     ... 3 more