Posted to user@avro.apache.org by Eric Hauser <ew...@gmail.com> on 2011/10/01 03:42:32 UTC

Compression and splittable Avro files in Hadoop

A coworker and I were having a conversation today about choosing a
compression algorithm for some data we are storing in Hadoop.  We have
been using avro-utils (https://github.com/tomslabs/avro-utils) for our
Map/Reduce jobs and Haivvreo for integration with Hive.  By default, the
avro-utils OutputFormat uses deflate compression.  Even though
deflate/zlib/gzip-compressed files are not normally splittable, we
concluded that Avro data files are always splittable because the
individual blocks within the file are compressed rather than the entire
file.
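
For reference, here is roughly how we picture one input split being
consumed, sketched against the plain DataFileReader API (the file name
and the split offsets below are made up for illustration):

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SplitReadSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical split boundaries handed to one map task.
    long splitStart = 64L * 1024 * 1024;
    long splitEnd = 128L * 1024 * 1024;

    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        new File("part-00000.avro"), new GenericDatumReader<GenericRecord>());

    // Skip forward to the first block boundary (sync marker) after splitStart.
    reader.sync(splitStart);

    // Read whole blocks until we cross splitEnd; each block is decompressed
    // on its own, independently of the rest of the file.
    while (reader.hasNext() && !reader.pastSync(splitEnd)) {
      GenericRecord record = reader.next();
      // process(record) ...
    }
    reader.close();
  }
}

If the sync markers work the way we think, each map task only has to
decompress the blocks that begin inside its own split.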

Is this accurate?  Thanks.

Re: Compression and splittable Avro files in Hadoop

Posted by Yang <te...@gmail.com>.
This is my approach: although you could use Avro data files, I used my
own scheme.

I use SequenceFile, RCFile, or TFile as an "envelope", serialize the
Avro record into a byte array, and write that into the envelope as the
payload.  In some tests I ran, the TFile envelope was the fastest.
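
Roughly, the write side looks like this (a minimal sketch, not my
actual code; the schema, path, and record below are just placeholders):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class EnvelopeWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
        + "\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The SequenceFile is the "envelope"; BLOCK compression compresses
    // batches of values together, and the container stays splittable.
    SequenceFile.Writer envelope = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/events.seq"),
        NullWritable.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);

    GenericDatumWriter<GenericRecord> datumWriter =
        new GenericDatumWriter<GenericRecord>(schema);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);

    GenericRecord record = new GenericData.Record(schema);
    record.put("msg", "hello");

    // Serialize the record to a byte array and append it as the payload.
    bytes.reset();
    datumWriter.write(record, encoder);
    encoder.flush();
    envelope.append(NullWritable.get(), new BytesWritable(bytes.toByteArray()));

    envelope.close();
  }
}

RCFile and TFile have their own writer APIs, but the pattern is the
same: the envelope takes care of compression and splitting, and the
payload is just the Avro binary encoding of the record.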




Re: Compression and splittable Avro files in Hadoop

Posted by Eric Hauser <ew...@gmail.com>.
Thanks Scott.  I remember reading another one of your threads about
the sync interval, but I had forgotten to change it.  We will do some
experimentation with the compression level and the sync interval.



Re: Compression and splittable Avro files in Hadoop

Posted by Scott Carey <sc...@apache.org>.
Yes, Avro data files are always splittable.

You may want to increase the default block size in the files if this is
for MapReduce.  The block size can often have a bigger impact on the
compression ratio than the compression level setting.

If you are sensitive to write performance, you may also want a lower
deflate compression level.  Read performance stays relatively constant
for deflate as the compression level changes (except for uncompressed
level 0), but write performance varies quite a bit between compression
levels 1 and 9 -- typically by a factor of 5 or 6.
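
To put rough numbers on those two knobs, here is a minimal sketch with
the plain DataFileWriter (the 1 MB sync interval and deflate level 3
below are arbitrary example values, not recommendations):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class TunedWriterSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
        + "\"fields\":[{\"name\":\"msg\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));

    // Lower deflate levels write much faster; read speed barely changes.
    writer.setCodec(CodecFactory.deflateCodec(3));

    // Bigger blocks (the sync interval) often improve the compression ratio
    // more than raising the compression level does.
    writer.setSyncInterval(1024 * 1024);

    writer.create(schema, new File("events.avro"));

    GenericRecord record = new GenericData.Record(schema);
    record.put("msg", "hello");
    writer.append(record);
    writer.close();
  }
}

If the files come out of a MapReduce job instead, the avro-mapred
AvroOutputFormat exposes similar knobs (deflate level and sync
interval) through the job configuration, if I remember right; I don't
know offhand what avro-utils passes through.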
