Posted to common-dev@hadoop.apache.org by Nt Never <nt...@gmail.com> on 2007/09/05 19:10:31 UTC
Re: map output compression codec setting issue
Hi Arun,
thanks for your reply, I am CCing this e-mail to hadoop-dev. I will create
the appropriate JIRA tickets today. Here are a few insights about my
experience with Hadoop compression (all my comments apply to 0.13.0):
1. Map output compression: besides the issue I mentioned to you guys about
choosing two different codecs for map output and overall job output, it
works very well for us. I have been using non-native map output compression
on jobs that generate over 6 TB of data with no problems. Since I am using
0.13.0, because of HADOOP-1193, I could test LZO native on very small jobs
only. Our benchmarks show no degradation in performance whatsoever when
using native-LZO.
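[Editor's note: the codec-property clash described above is easy to demonstrate without a cluster. The sketch below is a hypothetical, self-contained mock of the 0.13.0 JobConf behavior (a plain HashMap stands in for Hadoop's Configuration; the class and method shapes mirror the code quoted later in this thread): because both setters write the same key, whichever codec is set last wins for both map outputs and final job outputs.]

```java
import java.util.HashMap;
import java.util.Map;

public class CodecPropertyClash {
    // Mock of the two JobConf setters as they behave in 0.13.0: both of
    // them store the codec class name under the *same* configuration key.
    static final String SHARED_KEY = "mapred.output.compression.codec";

    static void setMapOutputCompressorClass(Map<String, String> conf, String codec) {
        conf.put(SHARED_KEY, codec);   // used for intermediate map outputs
    }

    static void setOutputCompressorClass(Map<String, String> conf, String codec) {
        conf.put(SHARED_KEY, codec);   // also used for the final job output
    }

    static String demo() {
        Map<String, String> conf = new HashMap<String, String>();
        setMapOutputCompressorClass(conf, "org.apache.hadoop.io.compress.LzoCodec");
        setOutputCompressorClass(conf, "org.apache.hadoop.io.compress.GzipCodec");
        // The second call clobbered the first: map outputs will now be
        // gzip-compressed too, even though LZO was requested for them.
        return conf.get(SHARED_KEY);
    }

    public static void main(String[] args) {
        System.out.println(demo());  // org.apache.hadoop.io.compress.GzipCodec
    }
}
```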
2. Compression type configuration: we noticed a small issue with the
configuration here. If "io.seqfile.compression.type" is set to NONE in
hadoop-site.xml, M/R jobs will not do any compression and there is no way to
override it programmatically. As a matter of fact, each worker machine will
end up using the value read from the local hadoop conf folder. I like the
fact that each worker reads this property locally when creating generic
SequenceFile(s), but, IMHO, the behavior of M/R jobs should be set in
JobConf only. This issue is very easy to reproduce.
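[Editor's note: the precedence Riccardo expects (a job-level setting shadows the cluster-wide hadoop-site.xml default) can be sketched with plain java.util.Properties and its defaults mechanism. This is a hypothetical illustration of the desired behavior, not Hadoop code; the method name is invented.]

```java
import java.util.Properties;

public class CompressionTypePrecedence {
    static String effectiveType(String siteValue, String jobValue) {
        // Cluster-wide defaults, as read from each worker's hadoop-site.xml.
        Properties site = new Properties();
        site.setProperty("io.seqfile.compression.type", siteValue);

        // Per-job configuration layered on top; a job-level setting
        // should shadow the site default for that job's M/R output.
        Properties job = new Properties(site);
        if (jobValue != null) {
            job.setProperty("io.seqfile.compression.type", jobValue);
        }
        return job.getProperty("io.seqfile.compression.type");
    }

    public static void main(String[] args) {
        // Site says NONE, but the job asks for BLOCK: the job should win.
        System.out.println(effectiveType("NONE", "BLOCK"));  // BLOCK
        // No job-level setting: fall back to the site default.
        System.out.println(effectiveType("NONE", null));     // NONE
    }
}
```

The bug reported above is that 0.13.0 effectively does the opposite for M/R jobs: the worker-local site value wins and cannot be overridden.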
3. Non-native GzipCodec: the codec returns Java's
java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream when native
compression is not available. However, lines 197, 238, 299, and 357 of
SequenceFile (basically all the createWriter() methods that select a
compression codec) will throw an IllegalArgumentException if the GzipCodec
is selected but the native library is *not* available. Why is that?
4. Reduce reported progress when consuming compressed map outputs: is
generally incorrect, with reducers reporting over 220% completion. This is
regardless of whether native compression is used or not.
Best,
Riccardo
On 9/5/07, Arun C Murthy <ar...@yahoo-inc.com> wrote:
>
> Hi Riccardo,
>
> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
> >Thanks Devaraj, good to hear from you.
> >
> >Actually, if you guys are interested, I have been testing Hadoop
> compression
> >(native and non-native), in the last 5 days on a cluster of 200 machines
> >(running 0.12.3, with HDFS as file system). I have a few insights you
> >guys might be interested in. I am just trying to figure out what the proper
> >channels would be, that is why I contacted you first. Thanks.
> >
>
> You are absolutely correct. Please file a jira (and a patch if you are so
> inclined! *smile*) to request a separate property for the 2 codecs.
>
> We'd love to hear any insights/opinion/ideas about the compression stuff
> you've been working on, please don't hesitate to mail hadoop-dev@ or file
> jira issues about any of them...
>
> thanks!
> Arun
>
> >Riccardo
> >
> >
> >On 9/4/07, Devaraj Das <dd...@yahoo-inc.com> wrote:
> >>
> >> Hi Riccardo,
> >> Thanks for contacting me. I am doing good and hope you are doing great
> >> too!
> >> I am copying this mail to Arun who is our compression expert. Arun pls
> >> respond to the mail.
> >> Thanks,
> >> Devaraj
> >>
> >> ------------------------------
> >> *From:* Nt Never [mailto:ntnever@gmail.com]
> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
> >> *To:* ddas@yahoo-inc.com
> >> *Subject:* map output compression codec setting issue
> >>
> >> Hi Devaraj,
> >>
> >> how have you been doing? I finally got around to doing some extensive
> >> testing
> >> with Hadoop's compression. I am aware of HADOOP-1193 and HADOOP-1545,
> so I
> >> am waiting for the release of 0.15.0 before I do more benchmarks.
> However,
> >> I noticed what seems to be a bug in JobConf. The property
> >> "mapred.output.compression.codec" is used when setting and getting the
> >> map output compression codec, thus making it impossible to use a
> >> different codec for map outputs and overall job outputs. The methods
> >> that affect this behavior are in lines 341-371 of JobConf in Hadoop 0.13.0:
> >>
> >> /**
> >>  * Set the given class as the compression codec for the map outputs.
> >>  * @param codecClass the CompressionCodec class that will compress the
> >>  *                   map outputs
> >>  */
> >> public void setMapOutputCompressorClass(Class<? extends CompressionCodec>
> >>                                         codecClass) {
> >>   setCompressMapOutput(true);
> >>   setClass("mapred.output.compression.codec", codecClass,
> >>            CompressionCodec.class);
> >> }
> >>
> >> /**
> >>  * Get the codec for compressing the map outputs
> >>  * @param defaultValue the value to return if it is not set
> >>  * @return the CompressionCodec class that should be used to compress the
> >>  *         map outputs
> >>  * @throws IllegalArgumentException if the class was specified, but not
> >>  *         found
> >>  */
> >> public Class<? extends CompressionCodec>
> >> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue) {
> >>   String name = get("mapred.output.compression.codec");
> >>   if (name == null) {
> >>     return defaultValue;
> >>   } else {
> >>     try {
> >>       return getClassByName(name).asSubclass(CompressionCodec.class);
> >>     } catch (ClassNotFoundException e) {
> >>       throw new IllegalArgumentException("Compression codec " + name +
> >>                                          " was not found.", e);
> >>     }
> >>   }
> >> }
> >>
> >> This could be easily fixed by using a different property, for example,
> >> "map.output.compression.codec". Should I create an issue on JIRA for
> >> this? Thanks.
> >>
> >> Riccardo
> >>
> >>
>
RE: map output compression codec setting issue
Posted by Devaraj Das <dd...@yahoo-inc.com>.
> >4. Reduce reported progress when consuming compressed map
> outputs: is
> >generally incorrect, with reducers reporting over 220%
> completion. This
> >is regardless of whether native compression is used or not.
>
> This smells like a bug, please file a jira asap!
> I'm guessing this could be due to the fact that we are
> checking the size of uncompressed key/value pairs rather than
> the compressed sizes. Devaraj?
>
Riccardo, pls file a jira issue for this one.
Thanks,
Devaraj.
> -----Original Message-----
> From: Arun C Murthy [mailto:arunc@yahoo-inc.com]
> Sent: Thursday, September 06, 2007 12:01 AM
> To: Nt Never
> Cc: Devaraj Das; hadoop-dev@lucene.apache.org
> Subject: Re: map output compression codec setting issue
>
> Riccardo,
>
> On Wed, Sep 05, 2007 at 10:10:31AM -0700, Nt Never wrote:
> >Hi Arun,
> >
> >thanks for your reply, I am CCing this e-mail to hadoop-dev. I will
> >create the appropriate JIRA tickets today. Here are a few insights
> >about my experience with Hadoop compression (all my comments
> apply to 0.13.0):
> >
>
> Thanks!
>
> >1. Map output compression: besides the issue I mentioned to you guys
> >about choosing two different codecs for map output and overall job
> >output, it works very well for us. I have been using non-native map
> >output compression on jobs that generate over 6 TB of data with no
> >problems. Since I am using 0.13.0, because of HADOOP-1193, I
> could test
> >LZO native on very small jobs only. Our benchmarks show no
> degradation
> >in performance whatsoever when using native-LZO.
>
> That is good to hear, please keep us posted on things you
> notice with 0.14.* and beyond (i.e. post H-1193).
>
> >2. Compression type configuration: we noticed a small issue with the
> >configuration here. If "io.seqfile.compression.type" is set
> >to NONE in
> >hadoop-site.xml, M/R jobs will not do any compression and
> there is no
> >way to override it programmatically. As a matter of fact,
> each worker
> >machine will end up using the value read from the local hadoop conf
> >folder. I like the fact that each worker reads this property locally
> >when creating generic SequenceFile(s), but, IMHO, the
> behavior of M/R
> >jobs should be set in JobConf only. This issue is very easy
> to reproduce.
>
> This is a known bug where JobConf is overridden by
> hadoop-site.xml, please see:
> http://issues.apache.org/jira/browse/HADOOP-785
>
> >3. Non-native GzipCodec: the codec returns Java's
> >java.util.zip.GZIPOutputStream and
> >java.util.zip.GZIPInputStream when
> >native compression is not available. However, lines 197,
> 238, 299, and
> >357 of SequenceFile (basically all the createWriter() methods that
> >select a compression codec) will throw an
> IllegalArgumentException if
> >the GzipCodec is selected but the native library is *not*
> available. Why is that?
>
> The issue with java.util.zip.GZIPInputStream is that it
> doesn't let you access the underlying decompressor, hence we
> cannot do a 'reset' and reuse it - this is required for SequenceFiles.
>
> See http://issues.apache.org/jira/browse/HADOOP-441#action_12430068
>
> >4. Reduce reported progress when consuming compressed map
> outputs: is
> >generally incorrect, with reducers reporting over 220%
> completion. This
> >is regardless of whether native compression is used or not.
>
> This smells like a bug, please file a jira asap!
> I'm guessing this could be due to the fact that we are
> checking the size of uncompressed key/value pairs rather than
> the compressed sizes. Devaraj?
>
> thanks,
> Arun
>
> >
> >Best,
> >
> >Riccardo
> >
> >
> >On 9/5/07, Arun C Murthy <ar...@yahoo-inc.com> wrote:
> >>
> >> Hi Riccardo,
> >>
> >> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
> >> >Thanks Devaraj, good to hear from you.
> >> >
> >> >Actually, if you guys are interested, I have been testing Hadoop
> >> compression
> >> >(native and non-native), in the last 5 days on a cluster of 200
> >> >machines (running 0.12.3, with HDFS as file system). I have a few
> >> >insights you guys might be interested in. I am just trying to figure
> >> >out what the proper channels would be, that is why I contacted you
> >> >first. Thanks.
> >> >
> >>
> >> You are absolutely correct. Please file a jira (and a patch if you
> >> are so inclined! *smile*) to request a separate property
> for the 2 codecs.
> >>
> >> We'd love to hear any insights/opinion/ideas about the compression
> >> stuff you've been working on, please don't hesitate to mail
> >> hadoop-dev@ or file jira issues about any of them...
> >>
> >> thanks!
> >> Arun
> >>
> >> >Riccardo
> >> >
> >> >
> >> >On 9/4/07, Devaraj Das <dd...@yahoo-inc.com> wrote:
> >> >>
> >> >> Hi Riccardo,
> >> >> Thanks for contacting me. I am doing good and hope you
> are doing
> >> >> great too!
> >> >> I am copying this mail to Arun who is our compression
> expert. Arun
> >> >> pls respond to the mail.
> >> >> Thanks,
> >> >> Devaraj
> >> >>
> >> >> ------------------------------
> >> >> *From:* Nt Never [mailto:ntnever@gmail.com]
> >> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
> >> >> *To:* ddas@yahoo-inc.com
> >> >> *Subject:* map output compression codec setting issue
> >> >>
> >> >> Hi Devaraj,
> >> >>
> >> >> how have you been doing? I finally got around to doing some
> >> >> extensive testing
> >> >> with Hadoop's compression. I am aware of HADOOP-1193 and
> >> >> HADOOP-1545,
> >> so I
> >> >> am waiting for the release of 0.15.0 before I do more
> benchmarks.
> >> However,
> >> >> I noticed what seems to be a bug in JobConf. The property "
> >> >> mapred.output.compression.codec" is used when setting
> and getting
> >> >> the
> >> map
> >> >> output compression codec, thus making it impossible to use a
> >> >> different
> >> codec
> >> >> for map outputs and overall job outputs. The methods
> that affect
> >> >> this behavior are in line 341-371 of JobConf in Hadoop 0.13.0:
> >> >>
> >> >> /**
> >> >>  * Set the given class as the compression codec for the map outputs.
> >> >>  * @param codecClass the CompressionCodec class that will compress the
> >> >>  *                   map outputs
> >> >>  */
> >> >> public void setMapOutputCompressorClass(Class<? extends CompressionCodec>
> >> >>                                         codecClass) {
> >> >>   setCompressMapOutput(true);
> >> >>   setClass("mapred.output.compression.codec", codecClass,
> >> >>            CompressionCodec.class);
> >> >> }
> >> >>
> >> >> /**
> >> >>  * Get the codec for compressing the map outputs
> >> >>  * @param defaultValue the value to return if it is not set
> >> >>  * @return the CompressionCodec class that should be used to compress the
> >> >>  *         map outputs
> >> >>  * @throws IllegalArgumentException if the class was specified, but not
> >> >>  *         found
> >> >>  */
> >> >> public Class<? extends CompressionCodec>
> >> >> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue) {
> >> >>   String name = get("mapred.output.compression.codec");
> >> >>   if (name == null) {
> >> >>     return defaultValue;
> >> >>   } else {
> >> >>     try {
> >> >>       return getClassByName(name).asSubclass(CompressionCodec.class);
> >> >>     } catch (ClassNotFoundException e) {
> >> >>       throw new IllegalArgumentException("Compression codec " + name +
> >> >>                                          " was not found.", e);
> >> >>     }
> >> >>   }
> >> >> }
> >> >>
> >> >> This could be easily fixed by using a different property, for example,
> >> >> "map.output.compression.codec". Should I create an issue on JIRA for
> >> >> this? Thanks.
> >> >>
> >> >> Riccardo
> >> >>
> >> >>
> >>
>
Re: map output compression codec setting issue
Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Riccardo,
On Wed, Sep 05, 2007 at 10:10:31AM -0700, Nt Never wrote:
>Hi Arun,
>
>thanks for your reply, I am CCing this e-mail to hadoop-dev. I will create
>the appropriate JIRA tickets today. Here are a few insights about my
>experience with Hadoop compression (all my comments apply to 0.13.0):
>
Thanks!
>1. Map output compression: besides the issue I mentioned to you guys about
>choosing two different codecs for map output and overall job output, it
>works very well for us. I have been using non-native map output compression
>on jobs that generate over 6 TB of data with no problems. Since I am using
>0.13.0, because of HADOOP-1193, I could test LZO native on very small jobs
>only. Our benchmarks show no degradation in performance whatsoever when
>using native-LZO.
That is good to hear, please keep us posted on things you notice with 0.14.* and beyond (i.e. post H-1193).
>2. Compression type configuration: we noticed a small issue with the
>configuration here. If "io.seqfile.compression.type" is set to NONE in
>hadoop-site.xml, M/R jobs will not do any compression and there is no way to
>override it programmatically. As a matter of fact, each worker machine will
>end up using the value read from the local hadoop conf folder. I like the
>fact that each worker reads this property locally when creating generic
>SequenceFile(s), but, IMHO, the behavior of M/R jobs should be set in
>JobConf only. This issue is very easy to reproduce.
This is a known bug where JobConf is overridden by hadoop-site.xml, please see:
http://issues.apache.org/jira/browse/HADOOP-785
>3. Non-native GzipCodec: the codec returns Java's
>java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream when native
>compression is not available. However, lines 197, 238, 299, and 357 of
>SequenceFile (basically all the createWriter() methods that select a
>compression codec) will throw an IllegalArgumentException if the GzipCodec
>is selected but the native library is *not* available. Why is that?
The issue with java.util.zip.GZIPInputStream is that it doesn't let you access the underlying decompressor, hence we cannot do a 'reset' and reuse it - this is required for SequenceFiles.
See http://issues.apache.org/jira/browse/HADOOP-441#action_12430068
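[Editor's note: the 'reset' Arun mentions is java.util.zip.Inflater.reset(), which lets one decompressor instance be pooled and reused across many compressed blocks. A raw Inflater supports this, as the self-contained JDK-only sketch below shows; GZIPInputStream constructs and hides its own Inflater, so SequenceFile cannot reuse it the same way. Helper names here are illustrative, not Hadoop APIs.]

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflaterReuse {
    // Decompress one block with a caller-supplied Inflater, then reset it
    // so the same instance can be reused for the next block without
    // allocating new native resources -- the pattern SequenceFile needs,
    // and exactly what GZIPInputStream's hidden Inflater prevents.
    static String inflate(Inflater inf, byte[] compressed, int originalLen) throws Exception {
        inf.setInput(compressed);
        byte[] out = new byte[originalLen];
        int n = inf.inflate(out);
        inf.reset();  // ready for the next block
        return new String(out, 0, n, "UTF-8");
    }

    // Compress a small buffer with a throwaway raw Deflater.
    static byte[] deflate(byte[] data) {
        Deflater def = new Deflater();
        def.setInput(data);
        def.finish();
        byte[] buf = new byte[data.length * 2 + 64];
        int n = def.deflate(buf);
        def.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    public static void main(String[] args) throws Exception {
        Inflater shared = new Inflater();
        // Two independent blocks decompressed with the *same* Inflater.
        String a = inflate(shared, deflate("block one".getBytes("UTF-8")), 9);
        String b = inflate(shared, deflate("block two".getBytes("UTF-8")), 9);
        shared.end();
        System.out.println(a + " / " + b);  // block one / block two
    }
}
```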
>4. Reduce reported progress when consuming compressed map outputs: is
>generally incorrect, with reducers reporting over 220% completion. This is
>regardless of whether native compression is used or not.
This smells like a bug, please file a jira asap!
I'm guessing this could be due to the fact that we are checking the size of uncompressed key/value pairs rather than the compressed sizes. Devaraj?
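[Editor's note: if Arun's guess is right, the arithmetic is easy to reproduce: progress is bytes-consumed divided by expected total, but the numerator counts *uncompressed* key/value bytes while the denominator is the *compressed* map output size. A hypothetical sketch of that mismatch (names are illustrative, not the actual shuffle code):]

```java
public class ReduceProgressBug {
    // Suspected buggy progress computation: the denominator is the
    // compressed size of the fetched map outputs, but the numerator
    // accumulates the sizes of the key/value pairs *after* decompression.
    static double buggyProgress(long uncompressedBytesRead, long compressedTotalBytes) {
        return (double) uncompressedBytesRead / compressedTotalBytes;
    }

    public static void main(String[] args) {
        // Say 1000 units of compressed map output inflate to 2200 units of
        // raw key/value data (a ~2.2x ratio, plausible with LZO or gzip).
        long compressedTotal = 1000L;
        long uncompressedRead = 2200L;
        // Consuming everything reports 220% completion instead of 100%,
        // matching the over-220% figure reported above.
        System.out.println(buggyProgress(uncompressedRead, compressedTotal));  // 2.2
    }
}
```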
thanks,
Arun
>
>Best,
>
>Riccardo
>
>
>On 9/5/07, Arun C Murthy <ar...@yahoo-inc.com> wrote:
>>
>> Hi Riccardo,
>>
>> On Tue, Sep 04, 2007 at 12:12:19PM -0700, Nt Never wrote:
>> >Thanks Devaraj, good to hear from you.
>> >
>> >Actually, if you guys are interested, I have been testing Hadoop
>> compression
>> >(native and non-native), in the last 5 days on a cluster of 200 machines
>> >(running 0.12.3, with HDFS as file system). I have a few insights you
>> >guys might be interested in. I am just trying to figure out what the
>> >proper channels would be, that is why I contacted you first. Thanks.
>> >
>>
>> You are absolutely correct. Please file a jira (and a patch if you are so
>> inclined! *smile*) to request a separate property for the 2 codecs.
>>
>> We'd love to hear any insights/opinion/ideas about the compression stuff
>> you've been working on, please don't hesitate to mail hadoop-dev@ or file
>> jira issues about any of them...
>>
>> thanks!
>> Arun
>>
>> >Riccardo
>> >
>> >
>> >On 9/4/07, Devaraj Das <dd...@yahoo-inc.com> wrote:
>> >>
>> >> Hi Riccardo,
>> >> Thanks for contacting me. I am doing good and hope you are doing great
>> >> too!
>> >> I am copying this mail to Arun who is our compression expert. Arun pls
>> >> respond to the mail.
>> >> Thanks,
>> >> Devaraj
>> >>
>> >> ------------------------------
>> >> *From:* Nt Never [mailto:ntnever@gmail.com]
>> >> *Sent:* Tuesday, September 04, 2007 10:24 PM
>> >> *To:* ddas@yahoo-inc.com
>> >> *Subject:* map output compression codec setting issue
>> >>
>> >> Hi Devaraj,
>> >>
>> >> how have you been doing? I finally got around to doing some extensive
>> >> testing
>> >> with Hadoop's compression. I am aware of HADOOP-1193 and HADOOP-1545,
>> so I
>> >> am waiting for the release of 0.15.0 before I do more benchmarks.
>> However,
>> >> I noticed what seems to be a bug in JobConf. The property "
>> >> mapred.output.compression.codec" is used when setting and getting the
>> map
>> >> output compression codec, thus making it impossible to use a different
>> codec
>> >> for map outputs and overall job outputs. The methods that affect this
>> >> behavior are in line 341-371 of JobConf in Hadoop 0.13.0:
>> >>
>> >> /**
>> >>  * Set the given class as the compression codec for the map outputs.
>> >>  * @param codecClass the CompressionCodec class that will compress the
>> >>  *                   map outputs
>> >>  */
>> >> public void setMapOutputCompressorClass(Class<? extends CompressionCodec>
>> >>                                         codecClass) {
>> >>   setCompressMapOutput(true);
>> >>   setClass("mapred.output.compression.codec", codecClass,
>> >>            CompressionCodec.class);
>> >> }
>> >>
>> >> /**
>> >>  * Get the codec for compressing the map outputs
>> >>  * @param defaultValue the value to return if it is not set
>> >>  * @return the CompressionCodec class that should be used to compress the
>> >>  *         map outputs
>> >>  * @throws IllegalArgumentException if the class was specified, but not
>> >>  *         found
>> >>  */
>> >> public Class<? extends CompressionCodec>
>> >> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue) {
>> >>   String name = get("mapred.output.compression.codec");
>> >>   if (name == null) {
>> >>     return defaultValue;
>> >>   } else {
>> >>     try {
>> >>       return getClassByName(name).asSubclass(CompressionCodec.class);
>> >>     } catch (ClassNotFoundException e) {
>> >>       throw new IllegalArgumentException("Compression codec " + name +
>> >>                                          " was not found.", e);
>> >>     }
>> >>   }
>> >> }
>> >>
>> >> This could be easily fixed by using a different property, for example,
>> >> "map.output.compression.codec". Should I create an issue on JIRA for
>> >> this? Thanks.
>> >>
>> >> Riccardo
>> >>
>> >>
>>