Posted to user@avro.apache.org by felix gao <gr...@gmail.com> on 2011/03/02 06:05:37 UTC

Avro speed comparison with raw logs

Hello groups,

I am running some comparison tests on a data set that I converted to Avro
with the deflate level set to 6. The original logs consist of 2880
uncompressed HTTP access logs with a total size of 1.4TB. The compressed
Avro output is about 2/3 of that size. However, when I ran a Pig job on
the raw logs, it was blazing fast during the initial map phase, finishing in
under 40 min. When I ran the same Pig job on the Avro files, the initial map
phase took 8 minutes to finish only 10%. Is there any way to
figure out what is slowing down the map?

Thanks,

Felix

Re: Avro speed comparison with raw logs

Posted by Tatu Saloranta <ts...@gmail.com>.
On Wed, Mar 30, 2011 at 6:51 PM, Scott Carey <sc...@richrelevance.com> wrote:
> gzip/deflate is approximately the same speed to decompress for all
> compression levels.
> However, for compression, it varies by a factor of 5 or so between the
> fastest (1) and slowest (9).
>
> This is a useful link for gzip performance characteristics:
> http://tukaani.org/lzma/benchmarks.html

Also, a new project that compares performance & efficiency
(time/space) of JVM-accessible compression codecs is at:

https://github.com/ning/jvm-compressor-benchmark

Although by default it does not yet compare deflate levels, it would be
easy to modify to do that. Currently it includes 2 deflate codecs, bzip2,
quicklz, lzf and snappy (via JNI).

-+ Tatu +-

ps. It would be really nice to have benchmarks for "big data" use
cases for codecs -- jvm-serialization-benchmark for example just deals
with individual small messages. But there are multiple applicable data
formats, with very little good detailed comparative performance
benchmarking. :-/

Re: Avro speed comparison with raw logs

Posted by felix gao <gr...@gmail.com>.
Doug,

I am using Avro 1.4.1, and the problem is not Avro being slow; it is that
AvroStorage does a recursive schema validation, which makes it so slow. It is
fixed now.

Felix

On Fri, Mar 4, 2011 at 9:25 AM, Doug Cutting <cu...@apache.org> wrote:

> On 03/01/2011 09:05 PM, felix gao wrote:
> > I am running some comparison tests on a data set that I converted to
> > avro with deflator set to level 6. The original logs consists of 2880
> > uncompressed http access logs with a total size of 1.4TB. The Compressed
> > avro log is about 2/3 of the size.  However, when I ran the same pig job
> > on the raw logs, it is blazing fast during the initial map phase.
> > Finished in under 40 min. When I ran the same pig job with avro files,
> > the initial map phase took 8 minutes to only finish 10%.  I am wondering
> > is there any way to figure out what is slowing down the map?
>
> What version of Avro are you using?  How are you integrating Avro with Pig?
>
> Also, for speed, you might try level=1 (Deflater.BEST_SPEED).
>
> Doug
>

Re: Avro speed comparison with raw logs

Posted by Scott Carey <sc...@richrelevance.com>.
gzip/deflate is approximately the same speed to decompress for all
compression levels.
However, for compression, it varies by a factor of 5 or so between the
fastest (1) and slowest (9).

This is a useful link for gzip performance characteristics:
http://tukaani.org/lzma/benchmarks.html
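
For anyone who wants to check the claim above on their own data, here is a
minimal sketch that times deflate compression and decompression at levels
1, 6 and 9. It uses Python's standard zlib module rather than Avro or the
JVM Deflater, and the helper names (make_sample_log, time_deflate) and the
synthetic access-log payload are assumptions made for illustration, not
anything from this thread. Absolute numbers will differ from a Hadoop
cluster; only the relative trend between levels is of interest.

```python
import time
import zlib


def make_sample_log(num_lines=20000):
    """Build a synthetic, repetitive access-log-like payload (hypothetical data)."""
    lines = [
        '10.0.%d.%d - - [02/Mar/2011:06:05:37 +0000] "GET /item/%d HTTP/1.1" 200 %d\n'
        % (i % 256, (i * 7) % 256, i % 1000, 500 + (i * 37) % 5000)
        for i in range(num_lines)
    ]
    return "".join(lines).encode("ascii")


def time_deflate(data, level):
    """Return (compressed_size, compress_seconds, decompress_seconds) for one level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    # Guard against a broken round trip before trusting the timings.
    if restored != data:
        raise ValueError("round trip failed")
    return len(compressed), compress_seconds, decompress_seconds


if __name__ == "__main__":
    payload = make_sample_log()
    for level in (1, 6, 9):
        size, c_sec, d_sec = time_deflate(payload, level)
        print("level=%d size=%d compress=%.4fs decompress=%.4fs"
              % (level, size, c_sec, d_sec))
```

Replacing make_sample_log with a slice of real log data gives a more
representative picture, since compression ratio and speed depend heavily on
the input.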

On 3/4/11 9:25 AM, "Doug Cutting" <cu...@apache.org> wrote:

>On 03/01/2011 09:05 PM, felix gao wrote:
>> I am running some comparison tests on a data set that I converted to
>> avro with deflator set to level 6. The original logs consists of 2880
>> uncompressed http access logs with a total size of 1.4TB. The Compressed
>> avro log is about 2/3 of the size.  However, when I ran the same pig job
>> on the raw logs, it is blazing fast during the initial map phase.
>> Finished in under 40 min. When I ran the same pig job with avro files,
>> the initial map phase took 8 minutes to only finish 10%.  I am wondering
>> is there any way to figure out what is slowing down the map?
>
>What version of Avro are you using?  How are you integrating Avro with
>Pig?
>
>Also, for speed, you might try level=1 (Deflater.BEST_SPEED).
>
>Doug


Re: Avro speed comparison with raw logs

Posted by Doug Cutting <cu...@apache.org>.
On 03/01/2011 09:05 PM, felix gao wrote:
> I am running some comparison tests on a data set that I converted to
> avro with deflator set to level 6. The original logs consists of 2880
> uncompressed http access logs with a total size of 1.4TB. The Compressed
> avro log is about 2/3 of the size.  However, when I ran the same pig job
> on the raw logs, it is blazing fast during the initial map phase.
> Finished in under 40 min. When I ran the same pig job with avro files,
> the initial map phase took 8 minutes to only finish 10%.  I am wondering
> is there any way to figure out what is slowing down the map?

What version of Avro are you using?  How are you integrating Avro with Pig?

Also, for speed, you might try level=1 (Deflater.BEST_SPEED).

Doug