Posted to user@pig.apache.org by Marek Miglinski <mm...@seven.com> on 2012/02/01 09:34:29 UTC
Snappy in Mapreduce
Hello guys,
I have Cloudera's CDH3u2 package installed on a 3-node cluster, and I've added the following to mapred-site.xml:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
I've also added the following to my Pig job properties:
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>pig.tmpfilecompression</name>
<value>true</value>
</property>
<property>
<name>pig.tmpfilecompression.codec</name>
<value>lzo</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
So I want Pig to compress its intermediate data with LZO but MapReduce to use Snappy. However, judging by the tasktracker details ("Map output bytes"), the data is not compressed at all, which hurts performance badly (IO is at 100% most of the time)... What am I doing wrong, and how do I fix it?
Thanks,
Marek M.
Re: Snappy in Mapreduce
Posted by Harsh J <ha...@cloudera.com>.
Marek,
"Map output bytes" is the raw number of bytes emitted by the mapper; it is
counted before compression is applied. If this is an MR job, you probably
want to look at the FILE_BYTES_WRITTEN counter for the map phase, or
"Reduce shuffle bytes" for the reduce phase.
--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
RE: Snappy in Mapreduce
Posted by Marek Miglinski <mm...@seven.com>.
Thank you for the help; that opened my eyes.
I've noticed that with LZO compression, "Map output bytes" is 296,608,592,100 while "HDFS_BYTES_WRITTEN" is 57,941,932,388. Does that mean the output compression ratio is 296,608,592,100 / 57,941,932,388 ≈ 5.1x? Why is it so low for the SequenceFile format?
Other statistics (Map / Reduce / Total):
Counter                  Map              Reduce           Total
FILE_BYTES_READ          121,983,712,033  135,435,145,919  257,418,857,952
HDFS_BYTES_READ          23,721,946,243   0                23,721,946,243
FILE_BYTES_WRITTEN       188,046,014,425  135,437,054,645  323,483,069,070
HDFS_BYTES_WRITTEN       0                57,941,932,388   57,941,932,388
Reduce input groups      0                1,895,637,970    1,895,637,970
Combine output records   3,791,275,940    272,362,481      4,063,638,421
Map input records        1,895,637,976    0                1,895,637,976
Reduce shuffle bytes     0                65,503,257,420   65,503,257,420
Reduce output records    0                1,895,637,970    1,895,637,970
Spilled Records          5,436,423,030    3,871,926,741    9,308,349,771
Map output bytes         296,608,592,100  0                296,608,592,100
SPLIT_RAW_BYTES          73,060           0                73,060
Map output records       1,895,637,976    0                1,895,637,976
Combine input records    3,791,275,946    272,362,481      4,063,638,427
Reduce input records     0                1,895,637,970    1,895,637,970
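As a rough sketch of the arithmetic (values copied from the counters above; treating "Reduce shuffle bytes" as the compressed size of the map output is an approximation, since it counts bytes fetched by the reducers):

```python
# Job counters copied from the statistics above (totals column).
map_output_bytes = 296_608_592_100     # uncompressed bytes emitted by mappers
reduce_shuffle_bytes = 65_503_257_420  # bytes actually fetched by reducers
hdfs_bytes_written = 57_941_932_388    # final job output written to HDFS

# Approximate map-output compression ratio: the shuffle transfers the
# compressed map output, so uncompressed / shuffled estimates the ratio.
map_ratio = map_output_bytes / reduce_shuffle_bytes
print(f"map output compression: {map_ratio:.2f}x")  # ~4.53x

# Uncompressed map output vs. final HDFS output. Note this mixes
# compression with reduce-side aggregation, so it is not a pure
# compression ratio.
output_ratio = map_output_bytes / hdfs_bytes_written
print(f"map output vs final output: {output_ratio:.2f}x")  # ~5.12x
```

So the map output is in fact being compressed; comparing "Map output bytes" directly against "HDFS_BYTES_WRITTEN" conflates compression with how much data the reducers actually emit.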
Thanks,
Marek M.
Re: Snappy in Mapreduce
Posted by Harsh J <ha...@cloudera.com>.
Also, if you want finalized outputs in LZO, set
"mapred.output.compression.codec" to that codec. You have it set to
Snappy presently.
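A corrected fragment might look like the following (a sketch, assuming the hadoop-lzo codec referenced earlier in the thread is installed on the cluster):

```xml
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

This leaves mapred.map.output.compression.codec set to Snappy for the intermediate map output while the final job output is written with LZO.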
--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about