Posted to user@pig.apache.org by Marek Miglinski <mm...@seven.com> on 2012/02/01 09:34:29 UTC

Snappy in Mapreduce

Hello guys,

I have Cloudera's CDH3U2 package installed on a 3-node cluster, and I've added the following to mapred-site.xml:
    <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
    </property>

    <property>
        <name>mapred.map.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>

I have also added these to my Pig job properties:
                <property>
                    <name>io.compression.codec.lzo.class</name>
                    <value>com.hadoop.compression.lzo.LzoCodec</value>
                </property>
                <property>
                    <name>pig.tmpfilecompression</name>
                    <value>true</value>
                </property>
                <property>
                    <name>pig.tmpfilecompression.codec</name>
                    <value>lzo</value>
                </property>
                <property>
                    <name>mapred.output.compress</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.output.compression.codec</name>
                    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
                </property>
                <property>
                    <name>mapred.output.compression.type</name>
                    <value>BLOCK</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.map.output.compression.codec</name>
                    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
                </property>
                <property>
                    <name>mapreduce.map.output.compress</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.map.output.compress.codec</name>
                    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
                </property>
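
(For reference, a minimal sketch of the same properties set per-script, assuming Pig 0.8 or later where the "set" statement is available:)

    -- per-script equivalents of the job properties above
    set pig.tmpfilecompression true;
    set pig.tmpfilecompression.codec lzo;
    set mapred.output.compress true;
    set mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
    set mapred.compress.map.output true;
    set mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;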

So I want Pig to compress its temporary data with LZO, but MapReduce to compress map output with Snappy. Yet from what I see in the tasktracker details ("Map output bytes"), the data is not being compressed at all, which hurts performance badly (IO sits at 100% most of the time)... What am I doing wrong, and how do I fix it?


Thanks,
Marek M.

Re: Snappy in Mapreduce

Posted by Harsh J <ha...@cloudera.com>.
Marek,

"Map output bytes" is the real number of bytes emitted by the mappers;
that counter is taken before compression is applied. If this is an MR
job, you probably want to look at the FILE_BYTES_WRITTEN counter for
the map phase, or "Reduce shuffle bytes" for the reduce phase.
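
If it helps, those counters can also be read from the command line; a
rough sketch, assuming the Hadoop 0.20-era counter group names (the
job id below is a made-up placeholder):

    # total local-disk bytes written for the job; the map/reduce split is on the JobTracker web UI
    hadoop job -counter job_201202010934_0001 FileSystemCounters FILE_BYTES_WRITTEN
    # total bytes fetched by the reducers during the shuffle (post-compression)
    hadoop job -counter job_201202010934_0001 'org.apache.hadoop.mapred.Task$Counter' REDUCE_SHUFFLE_BYTES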

On Wed, Feb 1, 2012 at 2:04 PM, Marek Miglinski <mm...@seven.com> wrote:
> [...]



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about

RE: Snappy in Mapreduce

Posted by Marek Miglinski <mm...@seven.com>.
Thank you for the help; that opened my eyes.

I've noticed that while using LZO compression, "Map output bytes" is 296,608,592,100 and "HDFS_BYTES_WRITTEN" is 57,941,932,388. Does that mean the reducer output compression ratio is 296,608,592,100 / 57,941,932,388 ≈ 5.12x? Why is it so small for the SequenceFile format?

Other statistics:

Counter                  Map              Reduce           Total
FILE_BYTES_READ          121,983,712,033  135,435,145,919  257,418,857,952
HDFS_BYTES_READ          23,721,946,243   0                23,721,946,243
FILE_BYTES_WRITTEN       188,046,014,425  135,437,054,645  323,483,069,070
HDFS_BYTES_WRITTEN       0                57,941,932,388   57,941,932,388

Reduce input groups      0                1,895,637,970    1,895,637,970
Combine output records   3,791,275,940    272,362,481      4,063,638,421
Map input records        1,895,637,976    0                1,895,637,976
Reduce shuffle bytes     0                65,503,257,420   65,503,257,420
Reduce output records    0                1,895,637,970    1,895,637,970
Spilled Records          5,436,423,030    3,871,926,741    9,308,349,771
Map output bytes         296,608,592,100  0                296,608,592,100
SPLIT_RAW_BYTES          73,060           0                73,060
Map output records       1,895,637,976    0                1,895,637,976
Combine input records    3,791,275,946    272,362,481      4,063,638,427
Reduce input records     0                1,895,637,970    1,895,637,970


Thanks,
Marek M.
________________________________________
From: Harsh J [harsh@cloudera.com]
Sent: Wednesday, February 01, 2012 1:23 PM
To: user@pig.apache.org
Subject: Re: Snappy in Mapreduce

Also, if you want finalized outputs in LZO, set
"mapred.output.compression.codec" to that codec. You have it set to
Snappy presently.

On Wed, Feb 1, 2012 at 2:04 PM, Marek Miglinski <mm...@seven.com> wrote:
> [...]

Re: Snappy in Mapreduce

Posted by Harsh J <ha...@cloudera.com>.
Also, if you want finalized outputs in LZO, set
"mapred.output.compression.codec" to that codec. You have it set to
Snappy presently.
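
For example, the property would become (a minimal sketch, reusing the
LZO codec class already named in your config):

    <property>
        <name>mapred.output.compression.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>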

On Wed, Feb 1, 2012 at 2:04 PM, Marek Miglinski <mm...@seven.com> wrote:
> [...]



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about