Posted to common-issues@hadoop.apache.org by "Huw (Jira)" <ji...@apache.org> on 2022/09/09 04:31:00 UTC

[jira] [Comment Edited] (HADOOP-15543) IndexOutOfBoundsException when reading bzip2-compressed SequenceFile

    [ https://issues.apache.org/jira/browse/HADOOP-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17602108#comment-17602108 ] 

Huw edited comment on HADOOP-15543 at 9/9/22 4:30 AM:
------------------------------------------------------

I've recently hit this issue as well.

I believe the issue is indeed in the Reader: I have on hand a small Haskell program that reads a subset of SequenceFiles (block-compressed only, with a limited set of Writable codecs), and it reads the problematic files just fine.
{code:java}
java.lang.IndexOutOfBoundsException: offs(153) + len(154) > dest.length(268).
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
	at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:200)
{code}
Note that len is exactly offs + 1, which seems suspect.
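For context on why that relation is suspect: DataInputStream.readFully(byte[]) fills a single buffer by repeatedly calling read(buf, offs, len), where offs is the number of bytes already read and len the number still missing, so offs + len always equals the buffer length on every call. The values above (153 + 154 against a 268-byte buffer) cannot come from a correct readFully loop. A small self-contained sketch (plain JDK, no Hadoop code; the class and stream names here are mine) that checks this invariant:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyInvariant {
    // Wraps a stream and checks every (offs, len) pair that readFully passes down.
    static class CheckingStream extends FilterInputStream {
        CheckingStream(InputStream in) { super(in); }

        @Override
        public int read(byte[] b, int offs, int len) throws IOException {
            // Invariant for readFully(byte[]): offs + len == b.length.
            if (offs + len != b.length) {
                throw new AssertionError("offs(" + offs + ") + len(" + len
                        + ") != dest.length(" + b.length + ")");
            }
            // Return short reads on purpose, to force readFully to loop.
            return super.read(b, offs, Math.min(len, 7));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] src = new byte[268];
        byte[] dest = new byte[268];
        try (DataInputStream in = new DataInputStream(
                new CheckingStream(new ByteArrayInputStream(src)))) {
            in.readFully(dest);  // loops ~39 times; the invariant holds on each call
        }
        System.out.println("ok");
    }
}
```

Since readFully on the caller's side keeps the invariant, a violation like the one in the stack trace points at the bookkeeping inside CBZip2InputStream.read rather than at the caller.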



> IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
> --------------------------------------------------------------------
>
>                 Key: HADOOP-15543
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Sebastian Nagel
>            Priority: Major
>
> When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with: 
> {noformat}
> IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
> {noformat}
> The SequenceFile (669 MB) has been written with the properties
>  - mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
>  - mapreduce.output.fileoutputformat.compress.type=BLOCK
> using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).
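> For reference, the same write-side settings expressed as command-line overrides for a ToolRunner-based job (a sketch; the jar, class, and path placeholders are hypothetical, and output compression must also be switched on):
> {noformat}
> hadoop jar <job.jar> <MainClass> \
>   -D mapreduce.output.fileoutputformat.compress=true \
>   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
>   -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
>   <input> <output>
> {noformat}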
> The error was seen on two development systems (local mode, no native bzip2 lib configured/installed) and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2.
> The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0. The SequenceFile is read successfully when these Hadoop packages are used.
> If required I can share the SequenceFile. It's a Nutch CrawlDb (contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java] objects).
> Full-stack as seen with 3.1.0:
> {noformat}
> 2018-06-15 10:34:43,198 INFO  mapreduce.Job -  map 93% reduce 0%
> 2018-06-15 10:34:43,532 WARN  mapred.LocalJobRunner - job_local543410164_0001
> java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
> Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
>         at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
>         at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at java.io.DataInputStream.readFully(DataInputStream.java:169)
>         at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
>         at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
>         at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
>         at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>         at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org