Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2018/06/15 10:14:00 UTC

[jira] [Commented] (HADOOP-15543) IndexOutOfBoundsException when reading bzip2-compressed SequenceFile

    [ https://issues.apache.org/jira/browse/HADOOP-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513637#comment-16513637 ] 

Steve Loughran commented on HADOOP-15543:
-----------------------------------------

It'd probably be good to stick that up somewhere (home.apache.org?) so we can have a look at it.

What happens if you try to use the OS unzip tools?

I think we need to work out whether this is something wrong with the reader code or whether the writer has generated something bad, then find out who knows the native code well enough to fix it.
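
A minimal sketch of that reader-side check, assuming only hadoop-common on the classpath (the input path argument and the io.native.lib.available override are illustrative, not from the report): iterate the file with a standalone SequenceFile.Reader, with the native library forced off, so the failure reproduces outside a MapReduce job. If this pure-Java pass also throws, native libbz2 is off the hook and it's the Java reader or the writer.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Force the pure-Java bzip2 decoder so native libbz2 is out of the picture.
    conf.setBoolean("io.native.lib.available", false);
    try (SequenceFile.Reader reader = new SequenceFile.Reader(
        conf, SequenceFile.Reader.file(new Path(args[0])))) {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long records = 0;
      while (reader.next(key, value)) {  // CBZip2InputStream.read() runs under here
        records++;
      }
      System.out.println("read " + records + " records OK");
    }
  }
}
{code}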

FWIW, I don't see any changes in the native bzip code since 2015 (HADOOP-10027).

Looking at the Java code, the only 3.1 change in the area is HADOOP-6852 and the BZip2Codec. I'm going to tag that as the cause unless we can see otherwise.
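
To exercise that codec in isolation, here's a hedged round-trip sketch (mine, not from the report; it uses only stock hadoop-common APIs): compress a buffer with BZip2Codec and read it back through its input stream. An IndexOutOfBoundsException in the read loop would implicate the reader side on its own.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class BZip2RoundTrip {
  public static void main(String[] args) throws IOException {
    BZip2Codec codec = new BZip2Codec();
    codec.setConf(new Configuration());

    // Several MB of random bytes: incompressible data forces multiple bzip2
    // blocks, which is where offset bookkeeping would go wrong.
    byte[] data = new byte[8 * 1024 * 1024];
    new Random(42).nextBytes(data);

    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (CompressionOutputStream out = codec.createOutputStream(compressed)) {
      out.write(data);
    }

    try (CompressionInputStream in = codec.createInputStream(
        new ByteArrayInputStream(compressed.toByteArray()))) {
      byte[] buf = new byte[64 * 1024];
      long total = 0;
      for (int n; (n = in.read(buf, 0, buf.length)) != -1; ) {
        total += n;  // an IndexOutOfBoundsException here points at the reader
      }
      System.out.println("decompressed " + total + " bytes");
    }
  }
}
{code}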


> IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
> --------------------------------------------------------------------
>
>                 Key: HADOOP-15543
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Sebastian Nagel
>            Priority: Major
>
> When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with: 
> {noformat}
> IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
> {noformat}
> The SequenceFile (669 MB) has been written with the properties
> - mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
> - mapreduce.output.fileoutputformat.compress.type=BLOCK
> using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).
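> For reference, a sketch of the equivalent programmatic job setup (the class name is made up; the two properties above map to stock FileOutputFormat/SequenceFileOutputFormat setters):
> {code:java}
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.compress.BZip2Codec;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>
> public class ConfigureBZip2Output {
>   public static void main(String[] args) throws Exception {
>     Job job = Job.getInstance();
>     job.setOutputFormatClass(SequenceFileOutputFormat.class);
>     // mapreduce.output.fileoutputformat.compress.codec=...BZip2Codec
>     FileOutputFormat.setCompressOutput(job, true);
>     FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
>     // mapreduce.output.fileoutputformat.compress.type=BLOCK
>     SequenceFileOutputFormat.setOutputCompressionType(job,
>         SequenceFile.CompressionType.BLOCK);
>   }
> }
> {code}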
> The error was seen on two development systems (local mode, no native bzip2 lib configured/installed) and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2.
> The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0. The SequenceFile is read successfully when these Hadoop packages are used.
> If required I can share the SequenceFile. It's a Nutch CrawlDb (contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java] objects).
> Full stack trace as seen with 3.1.0:
> {noformat}
> 2018-06-15 10:34:43,198 INFO  mapreduce.Job -  map 93% reduce 0%
> 2018-06-15 10:34:43,532 WARN  mapred.LocalJobRunner - job_local543410164_0001
> java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
> Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
>         at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
>         at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
>         at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
>         at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>         at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}


