Posted to user@avro.apache.org by snikhil0 <sn...@telenav.com> on 2012/04/18 19:23:25 UTC

Avro + Snappy changing blocksize of snappy compression

I am experimenting with Avro and Snappy and want to plot the size of the
compressed Avro data file as a function of the compression block size. I
am doing this by setting the configuration value
"io.compression.codec.snappy.buffersize". Unfortunately, this is not
working: more precisely, for buffer sizes from 256K to 2MB the output
Avro (Snappy-compressed) data file always comes out the same size. What
am I missing? Has anyone had success with this?

Thanks,
Nikhil

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Avro-Snappy-changing-blocksize-of-snappy-compression-tp3920732p3920732.html
Sent from the Avro - Users mailing list archive at Nabble.com.

Re: Avro + Snappy changing blocksize of snappy compression

Posted by Tatu Saloranta <ts...@gmail.com>.
On Wed, Apr 18, 2012 at 10:23 AM, snikhil0 <sn...@telenav.com> wrote:
> I am experimenting with Avro and snappy and want to plot the size of the
> compressed avro datafile as a function of varying compression block size. I
> am doing this by setting the configuration value for
> "io.compression.codec.snappy.buffersize". Unfortunately, this is not
> working: or more precisely for buffer sizes 256K to 2MB I get the same size
> output avro (snappyfied) data file. What am I missing? Someone had success
> with this?

Snappy uses blocks of 64k (like most LZ compressors), so there should
be little benefit from block sizes larger than this; blocks are
compressed independently of each other (back references are up to 8k
or such anyway). There are some compressors that can use larger
buffers, like bzip2 (I think), but those are the exception rather than
the rule.

-+ Tatu +-

Re: Avro + Snappy changing blocksize of snappy compression

Posted by Tatu Saloranta <ts...@gmail.com>.
On Wed, Apr 18, 2012 at 2:18 PM, Scott Carey <sc...@apache.org> wrote:
> Try a range from smaller block sizes (4k) and up.  256K is a larger block
> size than many compression codecs are sensitive to.

Agreed: most codecs only go up to 32k or 64k (in fact, Snappy may use
just 32k, not 64k).
Deflate doesn't benefit from blocks above 64k either, nor does LZF.
The only codecs that I think use larger buffers are bzip2 and LZMA,
both of which are typically way too slow to be used for streaming data
processing anyway.

So testing up to 64k is usually enough.
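
The window argument can be checked directly for deflate using only the
JDK. The sketch below is an illustration added for this point, not code
from the thread: the class DeflateWindowDemo and its helper names are
made up for the example. It deflates a run of random bytes followed by
an exact copy of that run. When the copy sits 8K or 16K behind, deflate
should find it and the output should be roughly half the input; when
the copy sits 64K or more behind, it falls outside deflate's 32K window
(RFC 1951) and the output should be about as large as the input. It
says nothing about Snappy itself, whose Java bindings are not part of
the JDK.

```java
import java.util.Random;
import java.util.zip.Deflater;

class DeflateWindowDemo {

    // Deflate the whole input at the default level and return the compressed byte count.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] out = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(out);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    // Build n incompressible random bytes followed by an exact copy of them,
    // so the only redundancy in the input lies exactly n bytes back.
    static byte[] repeatedNoise(int n, long seed) {
        byte[] noise = new byte[n];
        new Random(seed).nextBytes(noise);
        byte[] doubled = new byte[2 * n];
        System.arraycopy(noise, 0, doubled, 0, n);
        System.arraycopy(noise, 0, doubled, n, n);
        return doubled;
    }

    public static void main(String[] args) {
        for (int kib : new int[] {8, 16, 64, 256}) {
            byte[] input = repeatedNoise(kib << 10, 7L);
            System.out.printf("copy distance=%dK input=%d deflated=%d%n",
                    kib, input.length, deflatedSize(input));
        }
    }
}
```

Once the redundancy lies further back than the codec can reference, a
larger buffer gives the compressor nothing more to work with.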

-+ Tatu +-

Re: Avro + Snappy changing blocksize of snappy compression

Posted by Scott Carey <sc...@apache.org>.
Try a range from smaller block sizes (4k) and up.  256K is a larger block
size than many compression codecs are sensitive to.

Also, for reference, try it with the deflate codec at a few different
compression levels -- 1, 3, 5, and 7 should show a trend with respect to
block size.  As the compression level increases, the compressor can take
advantage of larger blocks.

In the deflate/gzip case that I have explored heavily, the effectiveness
of the block size also varies significantly depending on the
characteristics of the data being compressed.


(note: gzip uses deflate compression)
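
The experiment suggested above can be approximated without Avro in the
loop, using the JDK's java.util.zip.Deflater. What follows is a rough
sketch under stated assumptions, not Avro's implementation: it assumes
each block is compressed independently of the others, it uses synthetic
text-like data, and the class BlockSizeSweep and all of its method names
are hypothetical. It prints the total compressed size for each
combination of deflate level and block size.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

class BlockSizeSweep {

    // Deflate data[off, off+len) as one independent block; return the compressed byte count.
    static int deflateBlock(byte[] data, int off, int len, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data, off, len);
            deflater.finish();
            byte[] out = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(out);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    // Split data into blockSize chunks, compress each one separately, and sum the sizes.
    static long compressedSize(byte[] data, int blockSize, int level) {
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            total += deflateBlock(data, off, len, level);
        }
        return total;
    }

    // Assumption: synthetic record-like text built from a small vocabulary.
    // Real data will behave differently, so substitute your own bytes here.
    static byte[] sampleData(int size, long seed) {
        String[] words = {"avro", "snappy", "deflate", "block", "sync", "codec", "record", "schema"};
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder(size + 16);
        while (sb.length() < size) {
            sb.append(words[rnd.nextInt(words.length)]).append(rnd.nextInt(1000)).append('\n');
        }
        return sb.substring(0, size).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(1 << 20, 42L);
        int[] blockSizes = {4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20};
        int[] levels = {1, 3, 5, 7};
        for (int level : levels) {
            for (int blockSize : blockSizes) {
                System.out.printf("level=%d block=%dK compressed=%d%n",
                        level, blockSize >> 10, compressedSize(data, blockSize, level));
            }
        }
    }
}
```

Since the result depends heavily on the data, the numbers from the
synthetic input are only a starting point; replacing sampleData with a
sample of the real records gives a more meaningful curve.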

On 4/18/12 1:33 PM, "snikhil0" <sn...@telenav.com> wrote:

>I had tried the sync Interval as well and I get the same results: meaning no
>change in final avro data file.
>
>Nikhil



Re: Avro + Snappy changing blocksize of snappy compression

Posted by snikhil0 <sn...@telenav.com>.
I tried the sync interval as well and got the same result: no change in
the size of the final Avro data file.

Nikhil


Re: Avro + Snappy changing blocksize of snappy compression

Posted by Harsh J <ha...@cloudera.com>.
Hey Nikhil,

When using Avro data files, you probably need to tweak the writer's sync
interval to change the size of the chunks that get compressed:
http://avro.apache.org/docs/1.6.3/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)

-- 
Harsh J