Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2010/04/16 20:15:38 UTC

Working with my gzipped sequence file

    at org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
    ... 21 more


The compression being used here - gzip - is not splittable, which could be
why you are seeing this exception. Can you try a different compression
scheme, such as bzip2, or perhaps not compressing the files at all?
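
For reference, the compression choice is made where the file is written, so
this is a one-line change on the writing side. A minimal sketch, with a
made-up path and Text/Text types standing in for the real ones:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteUncompressedSeq {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // CompressionType.NONE sidesteps the codec question entirely; the
    // createWriter overload that also takes a CompressionCodec is where a
    // different codec (e.g. bzip2) would be passed instead of gzip.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/example.seq"),
        Text.class, Text.class,
        SequenceFile.CompressionType.NONE);
    writer.append(new Text("key"), new Text("value"));
    writer.close();
  }
}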


1) Can I just set the split size VERY VERY high, so that Hive never splits
these files? My files were produced by a MapReduce program, so they are
already quite small. I really do not want to force a change upstream.
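
Something like this is what I have in mind for 1) - an untested sketch, and
mapred.min.split.size is the 0.20-era property name:

import org.apache.hadoop.mapred.JobConf;

public class OneSplitPerFile {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // FileInputFormat never splits below the minimum split size, so pinning
    // it at Long.MAX_VALUE should make each input file a single split.
    job.setLong("mapred.min.split.size", Long.MAX_VALUE);
    System.out.println(job.getLong("mapred.min.split.size", 1));
  }
}

From the Hive CLI the equivalent would presumably be
set mapred.min.split.size=9223372036854775807; before running the query.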

2) From the other post, the key/value types of the sequence file should be
ByteWritable/Text. Currently my keys/values are Text/Text, and my data is in
the key... so

I have already written my own SequenceRecordReader that swaps the key and
the value, but it is not working. So I am thinking:

1. For the key, emit a dummy ByteWritable, maybe 'A'
2. Write the original key into the value

Will this work? Are there other gotchas here?
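
Roughly what I mean, as a sketch rather than my actual class - the
SwappingRecordReader name and the inner-reader wiring are made up for
illustration, using the old mapred API:

import java.io.IOException;

import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class SwappingRecordReader implements RecordReader<ByteWritable, Text> {
  private final RecordReader<Text, Text> inner; // real sequence file reader
  private final Text innerKey = new Text();
  private final Text innerValue = new Text();

  public SwappingRecordReader(RecordReader<Text, Text> inner) {
    this.inner = inner;
  }

  public boolean next(ByteWritable key, Text value) throws IOException {
    if (!inner.next(innerKey, innerValue)) {
      return false;
    }
    key.set((byte) 'A');  // 1. dummy ByteWritable key
    value.set(innerKey);  // 2. the original key becomes the value
    return true;
  }

  public ByteWritable createKey() { return new ByteWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() throws IOException { return inner.getPos(); }
  public float getProgress() throws IOException { return inner.getProgress(); }
  public void close() throws IOException { inner.close(); }
}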

Thank you,
Edward

Re: Working with my gzipped sequence file

Posted by Edward Capriolo <ed...@gmail.com>.

FYI, the problem here is that Hadoop NEEDS the native libraries to work with
gzip block-compressed sequence files. For whatever reason the dfs -text tool
can open them, but MapReduce can't. Upstream should report the error better.
Messages like:

Trying to load native libs...
can't do it, falling back to...

Should be replaced with:

trying to load native libs...
FALLING BACK TO JAVA LIBS THAT WON'T WORK ANYWAY!!!
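
If anyone wants to check up front whether the native zlib actually loaded,
a quick sketch using Hadoop's own NativeCodeLoader and ZlibFactory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // These only report what loaded; they do not fix anything.
    System.out.println("native hadoop lib loaded: "
        + NativeCodeLoader.isNativeCodeLoaded());
    System.out.println("native zlib usable: "
        + ZlibFactory.isNativeZlibLoaded(conf));
  }
}

If both print false, the tasks are on the pure-Java fallback, which is
exactly the situation described above.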