Posted to user@pig.apache.org by Thejas Nair <th...@hortonworks.com> on 2012/10/21 07:22:32 UTC

AvroStorage compression ratio

Based on the AvroStorage code and documentation, it looks like compression
is enabled by default, with the codec set to "deflate". But the file size is
almost the same as that of the uncompressed tab-separated text data.

This is probably a bug in AvroStorage, but I wanted to check whether this is
somehow expected before I open a JIRA to track it.

Uncompressed txt              2.12 GB
avro (default compression)    2.09 GB
avro + snappy compression     2.09 GB
lzo compressed txt            0.69 GB


Thanks,
Thejas


Re: AvroStorage compression ratio

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
For me it was:

27.5G for uncompressed tab-delimited plain txt
when compressed:
Format                       Size
sequence files               1.6G
avro deflate with level 1    2.9G
avro deflate with level 5    2.4G
avro deflate with level 9    2.2G
avro snappy                  4.1G
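
To make the two sets of figures in this thread easier to compare, here is a small sketch that expresses each output size as a percentage of its uncompressed text baseline. The sizes are copied from the two messages; the dictionary and function names are only illustrative and are not part of Pig or Avro.

```python
# Sizes as reported in this thread. Units differ between the two reports
# (GB vs. G), but each percentage is taken against its own baseline.
THEJAS_SIZES_GB = {
    "uncompressed txt": 2.12,
    "avro (default compression)": 2.09,
    "avro + snappy compression": 2.09,
    "lzo compressed txt": 0.69,
}

RUSLAN_SIZES_G = {
    "uncompressed txt": 27.5,
    "sequence files": 1.6,
    "avro deflate with level 1": 2.9,
    "avro deflate with level 5": 2.4,
    "avro deflate with level 9": 2.2,
    "avro snappy": 4.1,
}


def percent_of_baseline(sizes, baseline_key="uncompressed txt"):
    """Return {format: size as a percentage of the baseline, 1 decimal}."""
    baseline = sizes[baseline_key]
    return {
        name: round(100.0 * size / baseline, 1)
        for name, size in sizes.items()
        if name != baseline_key
    }


if __name__ == "__main__":
    for label, sizes in (("Thejas", THEJAS_SIZES_GB), ("Ruslan", RUSLAN_SIZES_G)):
        print(label)
        for name, pct in percent_of_baseline(sizes).items():
            print(f"  {name:<28} {pct:5.1f}% of uncompressed")
```

A value close to 100% means the output is about as large as the uncompressed input, which is the symptom reported in the original message.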

I was using this:
https://ccp.cloudera.com/display/CDHDOC/Avro+Usage#AvroUsage-Pig
with CDH 3

Best Regards

On Tue, Oct 23, 2012 at 2:51 AM, Thejas Nair <th...@hortonworks.com> wrote:
> What was the compression ratio you saw?
> I get the correct results, but the data size is almost same as uncompressed
> text.

Re: AvroStorage compression ratio

Posted by Thejas Nair <th...@hortonworks.com>.
What was the compression ratio you saw?
I get the correct results, but the data size is almost the same as that of
the uncompressed text.

searches = load '/user/testuser/aol_search_logs.txt'
    as (ID : int, Query : chararray, QueryTime : chararray,
        ItemRank : int, ClickURL : chararray);
store searches into '/user/testuser/aol_search_logs.avro'
    using AvroStorage();

I also tried -

SET avro.output.codec snappy
SET mapred.output.compress true
searches = load '/user/testuser/aol_search_logs.avro'
    using org.apache.pig.piggybank.storage.avro.AvroStorage();
store searches into '/user/testuser/aol_search_logs.snappy.avro'
    using org.apache.pig.piggybank.storage.avro.AvroStorage();
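
One way to narrow this down might be to check which codec was actually recorded in an output file. The Avro object container format stores an "avro.codec" entry in the file header metadata, and a missing entry means the "null" (uncompressed) codec. The sketch below reads only that header using the Python standard library; the function names are my own, and it assumes a part file has already been copied out of HDFS to local disk.

```python
MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag, variable-length base-128)."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated Avro header")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0, 1, 2, 3 -> 0, -1, 1, -2
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' or 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map of an Avro container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            return meta
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)


def avro_codec(stream):
    """Return the codec name from the header, or 'null' if absent."""
    return read_avro_metadata(stream).get("avro.codec", b"null").decode("ascii")
```

For example, with a hypothetical local copy named part-m-00000.avro, `with open("part-m-00000.avro", "rb") as f: print(avro_codec(f))` would print the codec name. A result of "null" would suggest the codec settings never reached the writer, whereas "deflate" or "snappy" would point to the data itself compressing poorly in this layout.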

-Thejas



On 10/22/12 6:02 AM, Ruslan Al-Fakikh wrote:
> How do you generate your Avro files?


Re: AvroStorage compression ratio

Posted by Ruslan Al-Fakikh <me...@gmail.com>.
How do you generate your Avro files?
It worked OK for me with:

SET avro.mapred.deflate.level 5
inputData = LOAD 'input path'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();
STORE inputData INTO 'output path'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();

But I did these tests a long time ago with an old version.

Ruslan
