Posted to user@pig.apache.org by W W <ww...@gmail.com> on 2012/11/18 17:25:50 UTC

Lzo compression

hello

In Alan Gates' Programming Pig, in the chapter "Making Pig Fly", it is mentioned:
"In testing we did while developing this feature we saw performance
improvements of up to 4x when using LZO, and slight performance degradation
when using gzip."
(http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html)


I've tried using LZO as the compression codec (it took me a couple of days
to compile it), and also gzip.
The result with gzip is the same as mentioned in the book, but the result
with LZO is not an improvement of up to 4x; I see almost no improvement, or
slight degradation as well.

I enabled compression between map and reduce, and also between M/R jobs
with "pig.tmpfilecompression=true  pig.tmpfilecompression.codec=lzo".
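
For reference, a minimal sketch of how I set this up at the top of the Pig
script (the map-output properties are the Hadoop 1.x names; treat the exact
property names as an assumption and check them against your Hadoop version):

```
-- intermediate (between-jobs) compression, as quoted above
set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;
-- map-output (between map and reduce) compression is a Hadoop setting
set mapred.compress.map.output true;
set mapred.map.output.compression.codec com.hadoop.compression.lzo.LzoCodec;
```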

From the counters I can see that HDFS bytes are reduced to about 1/3
compared to no compression.
I can see the following in the log on the TaskTracker:

2012-11-18 16:14:11,638 INFO
com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl
library
2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec:
Successfully loaded & initialized native-lzo library
2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor


The data volume is about 6 GB in total, and I have 100 CPUs + 150 GB of
memory spread across 10 nodes.
My Pig script compiles into 4 M/R jobs.  The operations of the jobs are:
  MAP_ONLY  -->  HASH_JOIN  -->  GROUP_BY  -->  HASH_JOIN .

My guess at the reason is that I/O is not a bottleneck for me, but it was
one in Alan Gates' case when he wrote the book.

Does anyone have any clue why I didn't gain any improvement?


Thanks
Regards
Xingbang Wang

Re: Lzo compression

Posted by Kannan Shah <sh...@gmail.com>.
I think you nailed it with "I guess I/O is not a bottleneck for me". Yes,
when you can have a dedicated CPU, in-stream decompression is faster than
I/O, but if your downstream process is complicated, you probably won't see
much benefit, because the decompression process will be waiting for the
downstream process.

You'll see a little benefit if your Pig job (the downstream process) is
faster than I/O but possibly slower than the decompression.
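
A back-of-envelope way to see this (my own sketch, not from any Pig or
Hadoop source; all numbers are made up for illustration): a task overlaps
reading its input with processing it, so its time is roughly
max(read time, CPU time). Compression shrinks the read side but adds
decompression work to the CPU side.

```python
def stage_time(data_gb, disk_gbps, cpu_secs, decompress_secs=0.0, ratio=1.0):
    """Rough pipeline model: the stage is bound by its slower side."""
    read_secs = (data_gb * ratio) / disk_gbps   # time to pull the (compressed) bytes
    return max(read_secs, cpu_secs + decompress_secs)

# I/O-bound job: 6 GB at 0.1 GB/s, only 10 s of CPU work.
# LZO at ~1/3 the size gives a big win even with 5 s spent decompressing.
print(stage_time(6, 0.1, cpu_secs=10))                                  # 60 s
print(stage_time(6, 0.1, cpu_secs=10, decompress_secs=5, ratio=1/3))    # ~20 s

# CPU-bound job: same data, 90 s of joins and group-bys.
# The reads were already hidden behind the CPU work, so LZO only adds cost.
print(stage_time(6, 0.1, cpu_secs=90))                                  # 90 s
print(stage_time(6, 0.1, cpu_secs=90, decompress_secs=5, ratio=1/3))    # 95 s
```

Same data, same codec: a 3x speedup in one case, a slight slowdown in the
other, which matches what the book reported versus what you measured.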

Kannan



-- 
Kannan Shah

Analytical-Modeling Staff Scientist
Financial Services - Modeling
SAS Institute
San Diego

Detection-and-Estimation Group
Data Fusion Laboratory
Philadelphia