Posted to user@beam.apache.org by Shannon Duncan <jo...@liveramp.com> on 2019/09/04 22:09:53 UTC

[Java] Compressed SequenceFile

I have successfully been using the sequence file source located here:

https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java

However, we recently started doing block-level compression with bzip2 on the
SequenceFile. This is supported out of the box on the Hadoop side of things.
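
For reference, this is roughly how such a file can be produced with the stock
Hadoop API (a minimal sketch; the output path and records are made up for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;

public class WriteBzip2Seq {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // BLOCK compression compresses batches of keys/values together,
    // as opposed to RECORD compression, which compresses each value alone.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/tmp/example.seq")), // hypothetical path
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(
            SequenceFile.CompressionType.BLOCK, new BZip2Codec()))) {
      writer.append(new Text("key1"), new Text("value1"));
      writer.append(new Text("key2"), new Text("value2"));
    }
  }
}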

However, when reading the files back in, most records parse just fine, but a
handful of records throw:

####
Exception in thread "main" java.lang.IndexOutOfBoundsException: offs(1368)
+ len(1369) > dest.length(1467).
at
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
####

I've gone in circles looking at this. It seems the last record read from the
SequenceFile in each thread hits this during value retrieval (the key reads
just fine, but the value throws this error).
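
Reading the same file with a plain SequenceFile.Reader outside of Beam should
show whether the codec itself is at fault rather than the Beam source (a rough
sketch; the path argument is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSeqCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(new Path(args[0])))) {
      Text key = new Text();
      Text value = new Text();
      long count = 0;
      // next() deserializes the key and then the value; the value read is
      // where the IndexOutOfBoundsException above surfaces.
      while (reader.next(key, value)) {
        count++;
      }
      System.out.println("Read " + count + " records without error");
    }
  }
}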

Any clues as to what this could be?

The file is KV<Text, Text>, i.e. its header reads
"SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text(org.apache.hadoop.io.compress.BZip2Codec"

Any help is appreciated!

- Shannon

Re: [Java] Compressed SequenceFile

Posted by Alexey Romanenko <ar...@gmail.com>.
Thank you for letting us know about the root cause.

Yes, indeed, all our Hadoop-related components depend on Hadoop version 2.7.3. So I think running the pipeline against a different minor version should be OK in general, but I am not sure anyone has tested it against another major version, like 3.x.

At the same time, starting from CDH 6.0 [1], Cloudera includes Hadoop 3.0 in their distribution. So we probably need to move to Hadoop 3.0 in Beam as well.

[1] https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-common/

Re: [Java] Compressed SequenceFile

Posted by Shannon Duncan <jo...@liveramp.com>.
Something must have changed with the bzip2 codec in later versions of Hadoop.
When I get time, I'll investigate which version actually breaks it and see
what changed.

Re: [Java] Compressed SequenceFile

Posted by Lukasz Cwik <lc...@google.com>.
Sorry for the poor experience and thanks for sharing a solution with others.

Re: [Java] Compressed SequenceFile

Posted by Shannon Duncan <jo...@liveramp.com>.
FYI, this was due to the Hadoop version. 3.2.0 was throwing this error, but I
rolled back to the version in Google's pom.xml (2.7.4) and it is working fine now.

Kind of annoying, because I wasted several hours jumping through hoops trying
to get 3.2.0 working :(
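
For anyone else who hits this, a quick sanity check (a small sketch using
Hadoop's VersionInfo utility) to confirm which Hadoop version actually ended
up on the classpath:

import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
  public static void main(String[] args) {
    // Prints the Hadoop version resolved at runtime, e.g. "2.7.4" or "3.2.0".
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
  }
}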
