Posted to common-user@hadoop.apache.org by Mark <st...@gmail.com> on 2011/10/30 03:26:44 UTC

LZO Compression

How can one test that LZO compression is configured correctly? I can find

RE: LZO Compression

Posted by Tim Broberg <Ti...@exar.com>.
Here are the issues that I'm aware of:

 *   Compression ratios are comparable.
 *   Snappy decompression is about twice as fast.
 *   LZO is "splittable": it can be decompressed in pieces natively, without wrapping the data in an Avro or sequence file. This requires a separate operation to generate an index file that identifies where the blocks sit in the main file (see the indexer sketch below).
 *   LZO has to be downloaded and installed separately because its license is incompatible with the Hadoop license.
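For that indexing step, hadoop-lzo ships an indexer. A minimal sketch, assuming the jar landed in /usr/lib/hadoop/lib and using a hypothetical log file as the target:

$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/mark/logs/access.lzo

This writes /user/mark/logs/access.lzo.index next to the data, after which an LZO-aware input format such as hadoop-lzo's LzoTextInputFormat can split the file across map tasks. For small files there is also the single-process com.hadoop.compression.lzo.LzoIndexer.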

    - Tim.

________________________________________
From: Mark [static.void.dev@gmail.com]
Sent: Sunday, October 30, 2011 9:33 AM
To: common-user@hadoop.apache.org
Subject: Re: LZO Compression

Thanks for the info, very helpful.

What's the difference between LZO and Snappy? I like how Cloudera has
Snappy support, so it looks like I'm going to go with that, but I just
wanted to know the tradeoffs.

Thanks again

On 10/29/11 8:52 PM, Harsh J wrote:
> Hey Mark,
>
> (Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)
>
> On 30-Oct-2011, at 7:59 AM, Mark wrote:
>
>> Email was sent a bit prematurely.
>>
>> Anyway. How can one test that LZO compression is configured correctly? I've found multiple sources on how to compile the hadoop-lzo jars and native files, but nowhere did I see a definitive way to test that the installation/configuration is correct.
> You can run the compression codec test per node, or run a job that reads or writes with that codec.
>
> Single node test example, using an available test jar:
>
> $ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop org.apache.hadoop.io.compress.TestCodec -count 1000 -codec com.hadoop.compression.lzo.LzoCodec
>
>> Also, when is this compression enabled? Is it enabled on every file I write? Do I somehow have to specify that I want to use this format? For example we have a rather large directory of server logs ... /user/mark/logs. How can we enable compression on this directory?
>>
> Compression in HDFS is purely a client-side setting. You can't enable it 'globally'.
>
> For jobs, you may set {File}OutputFormat#setOutputCompressorClass(…) to the desired codec class to have the final job outputs written with that codec (compression of the output streams is toggled by {File}OutputFormat#setCompressOutput(…)). To optimize the transient stages, you can use JobConf#setMapOutputCompressorClass(…) and toggle it with JobConf#setCompressMapOutput(…).
>
> Reading compressed files back again is handled automagically by your Hadoop framework, and should require no settings.
>
> Hence, for a fully distributed test of your LZO install (which you have hopefully built with Todd's easy packaging tool at https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple wordcount, parameterized on the command line (or configured via mapred-site.xml), using an available example jar:
>
> $ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -Dmapred.output.compress=true inputDir outputDir
>
> Hope this helps!
>
> --
> Harsh J


Re: LZO Compression

Posted by Mark <st...@gmail.com>.
Thanks for the info, very helpful.

What's the difference between LZO and Snappy? I like how Cloudera has
Snappy support, so it looks like I'm going to go with that, but I just
wanted to know the tradeoffs.

Thanks again

On 10/29/11 8:52 PM, Harsh J wrote:
> Hey Mark,
>
> (Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)
>
> On 30-Oct-2011, at 7:59 AM, Mark wrote:
>
>> Email was sent a bit prematurely.
>>
>> Anyway. How can one test that LZO compression is configured correctly? I've found multiple sources on how to compile the hadoop-lzo jars and native files, but nowhere did I see a definitive way to test that the installation/configuration is correct.
> You can run the compression codec test per node, or run a job that reads or writes with that codec.
>
> Single node test example, using an available test jar:
>
> $ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop org.apache.hadoop.io.compress.TestCodec -count 1000 -codec com.hadoop.compression.lzo.LzoCodec
>
>> Also, when is this compression enabled? Is it enabled on every file I write? Do I somehow have to specify that I want to use this format? For example we have a rather large directory of server logs ... /user/mark/logs. How can we enable compression on this directory?
>>
> Compression in HDFS is purely a client-side setting. You can't enable it 'globally'.
>
> For jobs, you may set {File}OutputFormat#setOutputCompressorClass(…) to the desired codec class to have the final job outputs written with that codec (compression of the output streams is toggled by {File}OutputFormat#setCompressOutput(…)). To optimize the transient stages, you can use JobConf#setMapOutputCompressorClass(…) and toggle it with JobConf#setCompressMapOutput(…).
>
> Reading compressed files back again is handled automagically by your Hadoop framework, and should require no settings.
>
> Hence, for a fully distributed test of your LZO install (which you have hopefully built with Todd's easy packaging tool at https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple wordcount, parameterized on the command line (or configured via mapred-site.xml), using an available example jar:
>
> $ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -Dmapred.output.compress=true inputDir outputDir
>
> Hope this helps!
>
> --
> Harsh J

Re: LZO Compression

Posted by Harsh J <ha...@cloudera.com>.
Hey Mark,

(Before you jump in with LZO, perhaps consider using Snappy+SequenceFiles?)

On 30-Oct-2011, at 7:59 AM, Mark wrote:

> Email was sent a bit prematurely.
> 
> Anyway. How can one test that LZO compression is configured correctly? I've found multiple sources on how to compile the hadoop-lzo jars and native files, but nowhere did I see a definitive way to test that the installation/configuration is correct.

You can run the compression codec test per node, or run a job that reads or writes with that codec.

Single node test example, using an available test jar:

$ HADOOP_CLASSPATH=/usr/lib/hadoop/hadoop-test-0.20.2-cdh3u2.jar hadoop org.apache.hadoop.io.compress.TestCodec -count 1000 -codec com.hadoop.compression.lzo.LzoCodec

> Also, when is this compression enabled? Is it enabled on every file I write? Do I somehow have to specify that I want to use this format? For example we have a rather large directory of server logs ... /user/mark/logs. How can we enable compression on this directory?
> 

Compression in HDFS is purely a client-side setting. You can't enable it 'globally'.

For jobs, you may set {File}OutputFormat#setOutputCompressorClass(…) to the desired codec class to have the final job outputs written with that codec (compression of the output streams is toggled by {File}OutputFormat#setCompressOutput(…)). To optimize the transient stages, you can use JobConf#setMapOutputCompressorClass(…) and toggle it with JobConf#setCompressMapOutput(…).
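
For instance, a minimal driver sketch using the old mapred API (MyJob and the choice of LzoCodec here are just illustrative):

import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import com.hadoop.compression.lzo.LzoCodec;

JobConf conf = new JobConf(MyJob.class); // MyJob is a hypothetical driver class

// Write the final job output LZO-compressed.
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

// Also compress the transient map output between the map and reduce stages.
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(LzoCodec.class);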

Reading compressed files back again is handled automagically by your Hadoop framework, and should require no settings.

Hence, for a fully distributed test of your LZO install (which you have hopefully built with Todd's easy packaging tool at https://github.com/toddlipcon/hadoop-lzo-packager), you can run a simple wordcount, parameterized on the command line (or configured via mapred-site.xml), using an available example jar:

$ hadoop jar /usr/lib/hadoop/hadoop-examples-0.20.2-cdh3u2.jar wordcount -Dmapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -Dmapred.output.compress=true inputDir outputDir
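
Equivalently, a sketch of the same two properties set as site defaults in mapred-site.xml (jobs can still override them client-side):

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>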

Hope this helps!

--
Harsh J

RE: LZO Compression

Posted by Tim Broberg <Ti...@exar.com>.
I've been fiddling with the unit tests for the compression codecs to demonstrate basic installation functionality:

"java org/apache/hadoop/io/compress/TestCodec -codec org.apache.hadoop.io.compress.SnappyCodec" seems to be working OK for Snappy.

    - Tim.

________________________________________
From: Mark [static.void.dev@gmail.com]
Sent: Saturday, October 29, 2011 7:29 PM
To: common-user@hadoop.apache.org
Subject: Re: LZO Compression

Email was sent a bit prematurely.

Anyway. How can one test that LZO compression is configured correctly?
I've found multiple sources on how to compile the hadoop-lzo jars and
native files, but nowhere did I see a definitive way to test that the
installation/configuration is correct.

Also, when is this compression enabled? Is it enabled on every file I
write? Do I somehow have to specify that I want to use this format? For
example we have a rather large directory of server logs ...
/user/mark/logs. How can we enable compression on this directory?

Thanks for your help

On 10/29/11 7:26 PM, Mark wrote:
> How can one test that LZO compression is configured correctly? I can find


Re: LZO Compression

Posted by Mark <st...@gmail.com>.
Email was sent a bit prematurely.

Anyway. How can one test that LZO compression is configured correctly?
I've found multiple sources on how to compile the hadoop-lzo jars and
native files, but nowhere did I see a definitive way to test that the
installation/configuration is correct.

Also, when is this compression enabled? Is it enabled on every file I 
write? Do I somehow have to specify that I want to use this format? For 
example we have a rather large directory of server logs ... 
/user/mark/logs. How can we enable compression on this directory?

Thanks for your help

On 10/29/11 7:26 PM, Mark wrote:
> How can one test that LZO compression is configured correctly? I can find