Posted to common-dev@hadoop.apache.org by "Suhas Gogate (JIRA)" <ji...@apache.org> on 2008/08/04 22:42:44 UTC

[jira] Created: (HADOOP-3898) avoid bzip2 decompressor throwing exception on corrupted (prematurely truncated) file

avoid bzip2 decompressor throwing exception on corrupted (prematurely truncated) file
-------------------------------------------------------------------------------------

                 Key: HADOOP-3898
                 URL: https://issues.apache.org/jira/browse/HADOOP-3898
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.17.1
            Reporter: Suhas Gogate


When running a map-reduce streaming job with bzip2-compressed input, the job fails with one of the two Java exceptions shown below.

This seems to happen when one of the bz2 input files is corrupted (probably when the file is prematurely truncated). An example command follows the stack traces.

Can we fix the bzip2 decompressor so that it does not throw these two exceptions?


2008-07-16 07:23:39,605 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: mark/reset not supported
       at java.io.InputStream.reset(InputStream.java:334)
       at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:117)
       at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
       at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

or

2008-07-16 20:49:28,020 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: CRC error
        at org.apache.tools.bzip2r.CBZip2InputStream.cadvise(CBZip2InputStream.java:74)
        at org.apache.tools.bzip2r.CBZip2InputStream.crcError(CBZip2InputStream.java:378)
        at org.apache.tools.bzip2r.CBZip2InputStream.endBlock(CBZip2InputStream.java:351)
        at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:851)
        at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:903)
        at org.apache.tools.bzip2r.CBZip2InputStream.read(CBZip2InputStream.java:240)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:102)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
        at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
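
A note on the first trace: it ends in java.io.InputStream.reset(), whose base implementation always throws IOException("mark/reset not supported") unless a subclass overrides it. The following is a minimal, Hadoop-independent sketch of that behaviour and of the usual guard (check markSupported(), otherwise wrap the stream in a BufferedInputStream). The class MarkResetDemo, the NoMarkStream stand-in and the helper names are hypothetical and are not taken from Bzip2TextInputFormat; whether the same guard is the right fix inside BZip2LineRecordReader is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetDemo {

    /** Hypothetical stand-in for a stream that does not override mark/reset. */
    public static final class NoMarkStream extends InputStream {
        private final InputStream in;

        public NoMarkStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }
    }

    /** Marks, reads one byte, then resets; returns "ok" or the exception message. */
    public static String tryReset(InputStream s) {
        try {
            s.mark(16);   // a no-op in the base InputStream class
            s.read();
            s.reset();    // the base InputStream class always throws here
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    /** Returns the stream itself if it supports mark/reset, else a buffered wrapper. */
    public static InputStream ensureMarkSupported(InputStream s) {
        return s.markSupported() ? s : new BufferedInputStream(s);
    }

    public static void main(String[] args) {
        byte[] data = "a\r\nb".getBytes(StandardCharsets.US_ASCII);
        InputStream raw = new NoMarkStream(new ByteArrayInputStream(data));

        System.out.println(tryReset(raw));                       // mark/reset not supported
        System.out.println(tryReset(ensureMarkSupported(raw)));  // ok
    }
}
```

This only illustrates where the message comes from. It does not address the second trace, where the CRC check in CBZip2InputStream fails on the damaged data itself.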


Example:
$HADOOP_HOME/bin/hadoop jar -libjars $<path>/jars/bzip2.jar \
$HADOOP_HOME/hadoop-streaming.jar \
  -inputformat org.apache.hadoop.mapred.Bzip2TextInputFormat \
  -mapper "cat" \
  -reducer "cat" \
  -numReduceTasks 20 \
  -input '<path>/corrupt-data.bz2'  \
  -output bzip2_bug_example \
  -jobconf stream.num.map.output.key.fields=1 \
  -jobconf stream.num.reduce.output.fields=1 \
  -jobconf num.key.fields.for.partition=1


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3898) avoid bzip2 decompressor throwing exception on corrupted (prematurely truncated) file

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619701#action_12619701 ] 

Chris Douglas commented on HADOOP-3898:
---------------------------------------

The bzip2 compression codec is slated for 0.19 (current trunk), not 0.17. Does this problem exist in trunk, or only in the 0.17.1+ version you're running?

