Posted to dev@avro.apache.org by Joe Crobak <jo...@gmail.com> on 2010/12/18 22:05:17 UTC

sync interval for AvroOutputFormat

AvroOutputFormat supports setting deflate level, but not the sync interval.
 Was this a conscious decision (i.e. would there be drawbacks of making the
sync interval larger)?

In some tests that I've done, Avro data files were over 50% smaller when I
upped the sync interval to 2MB (default is 16000 bytes).  I also saw a
modest speedup in building the files (I suspect my program was IO-bound).

Would folks support a patch to add setting a sync interval as a static
configuration option to AvroOutputFormat?

Best,
Joe

Re: sync interval for AvroOutputFormat

Posted by Joe Crobak <jo...@gmail.com>.
On Sun, Dec 19, 2010 at 6:14 PM, Scott Carey <sc...@richrelevance.com> wrote:

>
> On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:
>
> > AvroOutputFormat supports setting deflate level, but not the sync interval.
> > Was this a conscious decision (i.e. would there be drawbacks of making the
> > sync interval larger)?
> >
> > In some tests that I've done, Avro data files were over 50% smaller when I
> > upped the sync interval to 2MB (default is 16000 bytes).  I also saw a
> > modest speedup in building the files (I suspect my program was IO-bound).
> >
> > Would folks support a patch to add setting a sync interval as a static
> > configuration option to AvroOutputFormat?
>
> Yes, it makes sense to expose that.
>

In that case, I'd be happy to file a ticket and create a patch.


>
> Out of curiosity, how much of an improvement do you get for going to 64000
> bytes?  A larger default for the MapReduce case makes sense, but 2MB may be
> on the large side.  M/R has to split the file at sync boundaries and you
> don't want those to end up too far from the HDFS block boundaries.
>

Here are the compression ratios I'm seeing (block size, compression ratio):

16384 0.217
32768 0.164
65536 0.132
131072 0.116
262144 0.108
524288 0.104
1048576 0.102
2097152 0.100

So the sweet-spot for this data seems to be around 128K-256K, which is
within 7.7% - 16% of "optimal" (where optimal is the uncompressed file
compressed with command-line gzip).
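
As a rough cross-check of the numbers above, here is a small Python sketch that computes how far each measurement sits from the best one, and how much each doubling of the block size buys. One assumption to flag: the ratio for command-line gzip is not quoted in this message, so the sketch uses the 2 MB measurement as a stand-in reference. The function names are made up for illustration.

```python
# (block size in bytes, compression ratio) pairs copied from the table above.
measurements = [
    (16384, 0.217),
    (32768, 0.164),
    (65536, 0.132),
    (131072, 0.116),
    (262144, 0.108),
    (524288, 0.104),
    (1048576, 0.102),
    (2097152, 0.100),
]

# Assumption: the 2 MB ratio stands in for "optimal", because the gzip
# baseline value itself is not given in the thread.
reference = measurements[-1][1]


def overhead_vs_reference(ratio, ref=reference):
    """Percent by which `ratio` exceeds the reference ratio."""
    return (ratio / ref - 1.0) * 100.0


def gain_from_doubling(rows):
    """Percent reduction in output size for each doubling of block size."""
    return [
        (size_b, (1.0 - ratio_b / ratio_a) * 100.0)
        for (_, ratio_a), (size_b, ratio_b) in zip(rows, rows[1:])
    ]


if __name__ == "__main__":
    for size, ratio in measurements:
        print(f"{size:>8} {ratio:.3f} +{overhead_vs_reference(ratio):.1f}% vs 2 MB")
    for size, gain in gain_from_doubling(measurements):
        print(f"doubling to {size:>8}: {gain:.1f}% smaller")
```

Against the 2 MB reference, 128K and 256K come out roughly 16% and 8% larger, which is close to the range quoted above, and each doubling past 256K shrinks the output by under 4%.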


>
> The file format default is moderately sized because for many non M/R use
> cases, syncing to disk more regularly is a good idea.  With the default
> deflate lookback window of 32k, compression ratio as a function of block size
> tends to have a sharp elbow near that size.  In my experiments,  compression
> ratio did not go up after blocks that are about 120k in size, and was only
> moderately better than 16000 byte blocks.  But my data isn't your data.
>

Thanks for this suggestion -- I had only looked at the two extremes.  If the
ability to configure the size is exposed, then I should be able to do some
tests to see how these window sizes affect performance for our application.

Thanks,
Joe

Re: sync interval for AvroOutputFormat

Posted by Scott Carey <sc...@richrelevance.com>.
On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:

> AvroOutputFormat supports setting deflate level, but not the sync interval.
> Was this a conscious decision (i.e. would there be drawbacks of making the
> sync interval larger)?
> 
> In some tests that I've done, Avro data files were over 50% smaller when I
> upped the sync interval to 2MB (default is 16000 bytes).  I also saw a
> modest speedup in building the files (I suspect my program was IO-bound).
> 
> Would folks support a patch to add setting a sync interval as a static
> configuration option to AvroOutputFormat?

Yes, it makes sense to expose that.

Out of curiosity, how much of an improvement do you get for going to 64000 bytes?  A larger default for the MapReduce case makes sense, but 2MB may be on the large side.  M/R has to split the file at sync boundaries and you don't want those to end up too far from the HDFS block boundaries.
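
To put a rough number on the split concern, here is a back-of-the-envelope Python sketch. It assumes a 64 MB HDFS block size and that a split reads at most about one sync interval past its block boundary. Both are assumptions for illustration, not measurements, and the function name is made up.

```python
def worst_case_overshoot(sync_interval, hdfs_block_size):
    """Fraction of an HDFS block that a split may read beyond its boundary,
    assuming the overshoot is bounded by one sync interval."""
    return sync_interval / hdfs_block_size


HDFS_BLOCK = 64 * 1024 * 1024  # assumed HDFS block size (64 MB)

if __name__ == "__main__":
    for interval in (16000, 64000, 131072, 2 * 1024 * 1024):
        print(interval, f"{worst_case_overshoot(interval, HDFS_BLOCK):.4%}")
```

Under those assumptions a 2MB interval is a few percent of a block, while 64000 bytes is around a tenth of a percent.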

The file format default is moderately sized because for many non M/R use cases, syncing to disk more regularly is a good idea.  With the default deflate lookback window 32k, compression ratio as a function of block size tends to have a sharp elbow near that size.  In my experiments,  compression ratio did not go up after blocks that are about 120k in size, and was only moderately better than 16000 byte blocks.  But my data isn't your data.
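
For anyone who wants to try this on their own data, here is a minimal Python sketch of the kind of experiment described above: compress fixed-size blocks independently and compare the overall ratio across block sizes. Python's zlib stands in for Avro's deflate codec here, and the generated records are synthetic and hypothetical, so the absolute numbers will not match anyone's real files.

```python
import random
import zlib


def make_records(n_bytes, seed=0):
    """Build synthetic log-like data (hypothetical, not data from this thread)."""
    rng = random.Random(seed)
    events = [b"click", b"view", b"purchase", b"search"]
    out = bytearray()
    while len(out) < n_bytes:
        out += b"user=%06d,event=%s,ts=%d\n" % (
            rng.randrange(100000),
            rng.choice(events),
            rng.randrange(10**9),
        )
    return bytes(out[:n_bytes])


def blockwise_ratio(data, block_size, level=6):
    """Compress each block on its own, as a block-based container would,
    and return total compressed size / uncompressed size."""
    compressed = 0
    for start in range(0, len(data), block_size):
        compressed += len(zlib.compress(data[start:start + block_size], level))
    return compressed / len(data)


if __name__ == "__main__":
    data = make_records(4 * 1024 * 1024)
    for block_size in (16000, 32768, 65536, 131072, 262144, 2097152):
        print(block_size, round(blockwise_ratio(data, block_size), 3))
```

Swapping make_records for a read of real serialized records should show where the elbow falls for a given data set.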
> 
> Best,
> Joe