Posted to user@mahout.apache.org by Luke Forehand <lu...@networkedinsights.com> on 2012/03/06 22:40:37 UTC

override mapreduce compression?

Hello,

Is there a way to run the mahout kmeans program from the command line with a parameter that overrides (and disables) the reducer task compression?  I have tried several different ways of specifying the -D parameter, but I can't seem to get any options to pass through to the hadoop mapreduce configuration.

Thanks!
Luke

Re: override mapreduce compression?

Posted by Sean Owen <sr...@gmail.com>.
The client can override cluster defaults unless the cluster marks them "final".

On Wed, Mar 7, 2012 at 9:02 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Don't the hadoop site.xml settings on the driver's client usually
> overshadow whatever is on the cluster? Or do you not have the privs
> to change that either?
>
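For reference, a sketch of what such a locked-down cluster-side setting might look like (a hypothetical fragment of a cluster's mapred-site.xml; the keys are the 0.20-era names discussed in this thread). When a property carries <final>true</final>, client-side site.xml entries and -D overrides are ignored:

```xml
<!-- Hypothetical cluster-side mapred-site.xml fragment.
     <final>true</final> prevents clients from overriding the value. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
  <final>true</final>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <final>true</final>
</property>
```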

Re: override mapreduce compression?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Don't the hadoop site.xml settings on the driver's client usually
overshadow whatever is on the cluster? Or do you not have the privs
to change that either?

On Tue, Mar 6, 2012 at 4:54 PM, Luke Forehand
<lu...@networkedinsights.com> wrote:
> Our operations guy handles our hadoop configuration, and I think he has
> set up our hadoop conf to compress everything.  I'm trying to subvert him
> :-)  I think the HADOOP_OPTS trick will work for me; that makes
> sense.  Thanks!
>
> -Luke
>
> On 3/6/12 6:46 PM, "Sean Owen" <sr...@gmail.com> wrote:
>
>>Eh, hmm, does this job compress by default? I don't have the code here.
>>That is not generally how Hadoop works but you could make it do this. I
>>don't know if there's an override.
>>On Mar 7, 2012 12:40 AM, "Luke Forehand" <
>>luke.forehand@networkedinsights.com> wrote:
>>
>>> Why should it not be compressed in the first place?
>>>
>>> Here is the header of one of the reducer parts that was written into
>>> /mahout/kmeans/clusters-5-final
>>>
>>> SEQ
>>>org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>>>  )org.apache.hadoop.io.compress.SnappyCodec
>>>
>>>
>>> On 3/6/12 6:33 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>>
>>> >Ok but you're talking about reducer output not mapper. It should not be
>>> >compressed in the first place.
>>> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
>>> >luke.forehand@networkedinsights.com> wrote:
>>> >
>>> >> I want the results of the kmeans clustering to be uncompressed or
>>> >> compressed in a way that my users can natively decompress on their
>>> >> machines.  All our other hadoop jobs use Snappy compression when
>>>writing
>>> >> output, but our users don't have Snappy and don't particularly want
>>>to
>>> >> install it (especially because of problems installing on mac).  I'll
>>>try
>>> >> adding this param to the HADOOP_OPTS and in the longterm probably
>>>come
>>> >>up
>>> >> with a cleaner way to do this.  Thanks!
>>> >>
>>> >> -Luke
>>> >>
>>> >> On 3/6/12 6:24 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>> >>
>>> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>>> >> >recall). Or you configure this in your Hadoop config files.  It has
>>>no
>>> >> >meaning to the driver script. Why do you want to disable compression
>>> >> >after the mapper?
>>> >> >
>>> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
>>> >> ><lu...@networkedinsights.com> wrote:
>>> >> >> I tried the following and it does not work:
>>> >> >>
>>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000
>>>-cd
>>> >>0.01
>>> >> >> -x 100 \
>>> >> >> -Dmapreduce.map.output.compress=false
>>> >> >>
>>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000
>>>-cd
>>> >>0.01
>>> >> >> -x 100 \
>>> >> >>-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>> >> >>
>>> >> >>
>>> >> >> And still getting the default codec being used (which is Snappy in
>>> >>this
>>> >> >> case and I don't want the users to have to install native snappy
>>> >>which
>>> >> >>is
>>> >> >> why I'm trying to override this param).  Passing -Dkey=value on
>>>the
>>> >> >>mahout
>>> >> >> command line does not seem to have any effect on the mapreduce job
>>> >> >> configuration from what I can tell.  Any ideas?
>>> >> >>
>>> >> >> -Luke
>>> >> >>
>>> >> >> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>> >> >>
>>> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think
>>> >>the
>>> >> >>>key was mapred.output.compress in Hadoop 0.20.0.
>>> >> >>>I am not sure if there is reducer compression built-in, but, I
>>>could
>>> >> >>>have missed it.
>>> >> >>>
>>> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>>> >> >>><lu...@networkedinsights.com> wrote:
>>> >> >>>> Hello,
>>> >> >>>>
>>> >> >>>> Is there a way to run the mahout kmeans program from the command
>>> >>line,
>>> >> >>>>with a parameter that will override (and disable) the reducer
>>>task
>>> >> >>>>compression?  I have tried several different ways of specifying
>>>-D
>>> >> >>>>parameter but I can't seem to get any options to pass through to
>>>the
>>> >> >>>>hadoop mapreduce configuration.
>>> >> >>>>
>>> >> >>>> Thanks!
>>> >> >>>> Luke
>>> >> >>
>>> >>
>>> >>
>>>
>>>
>

Re: override mapreduce compression?

Posted by Luke Forehand <lu...@networkedinsights.com>.
Our operations guy handles our hadoop configuration, and I think he has
set up our hadoop conf to compress everything.  I'm trying to subvert him
:-)  I think the HADOOP_OPTS trick will work for me; that makes
sense.  Thanks!

-Luke

On 3/6/12 6:46 PM, "Sean Owen" <sr...@gmail.com> wrote:

>Eh, hmm, does this job compress by default? I don't have the code here.
>That is not generally how Hadoop works but you could make it do this. I
>don't know if there's an override.
>On Mar 7, 2012 12:40 AM, "Luke Forehand" <
>luke.forehand@networkedinsights.com> wrote:
>
>> Why should it not be compressed in the first place?
>>
>> Here is the header of one of the reducer parts that was written into
>> /mahout/kmeans/clusters-5-final
>>
>> SEQ  
>>org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>>  )org.apache.hadoop.io.compress.SnappyCodec
>>
>>
>> On 3/6/12 6:33 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>
>> >Ok but you're talking about reducer output not mapper. It should not be
>> >compressed in the first place.
>> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
>> >luke.forehand@networkedinsights.com> wrote:
>> >
>> >> I want the results of the kmeans clustering to be uncompressed or
>> >> compressed in a way that my users can natively decompress on their
>> >> machines.  All our other hadoop jobs use Snappy compression when
>>writing
>> >> output, but our users don't have Snappy and don't particularly want
>>to
>> >> install it (especially because of problems installing on mac).  I'll
>>try
>> >> adding this param to the HADOOP_OPTS and in the longterm probably
>>come
>> >>up
>> >> with a cleaner way to do this.  Thanks!
>> >>
>> >> -Luke
>> >>
>> >> On 3/6/12 6:24 PM, "Sean Owen" <sr...@gmail.com> wrote:
>> >>
>> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>> >> >recall). Or you configure this in your Hadoop config files.  It has
>>no
>> >> >meaning to the driver script. Why do you want to disable compression
>> >> >after the mapper?
>> >> >
>> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
>> >> ><lu...@networkedinsights.com> wrote:
>> >> >> I tried the following and it does not work:
>> >> >>
>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000
>>-cd
>> >>0.01
>> >> >> -x 100 \
>> >> >> -Dmapreduce.map.output.compress=false
>> >> >>
>> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000
>>-cd
>> >>0.01
>> >> >> -x 100 \
>> >> >>-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>> >> >>
>> >> >>
>> >> >> And still getting the default codec being used (which is Snappy in
>> >>this
>> >> >> case and I don't want the users to have to install native snappy
>> >>which
>> >> >>is
>> >> >> why I'm trying to override this param).  Passing -Dkey=value on
>>the
>> >> >>mahout
>> >> >> command line does not seem to have any effect on the mapreduce job
>> >> >> configuration from what I can tell.  Any ideas?
>> >> >>
>> >> >> -Luke
>> >> >>
>> >> >> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
>> >> >>
>> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think
>> >>the
>> >> >>>key was mapred.output.compress in Hadoop 0.20.0.
>> >> >>>I am not sure if there is reducer compression built-in, but, I
>>could
>> >> >>>have missed it.
>> >> >>>
>> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>> >> >>><lu...@networkedinsights.com> wrote:
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> Is there a way to run the mahout kmeans program from the command
>> >>line,
>> >> >>>>with a parameter that will override (and disable) the reducer
>>task
>> >> >>>>compression?  I have tried several different ways of specifying
>>-D
>> >> >>>>parameter but I can't seem to get any options to pass through to
>>the
>> >> >>>>hadoop mapreduce configuration.
>> >> >>>>
>> >> >>>> Thanks!
>> >> >>>> Luke
>> >> >>
>> >>
>> >>
>>
>>


Re: override mapreduce compression?

Posted by Sean Owen <sr...@gmail.com>.
Eh, hmm, does this job compress by default? I don't have the code here.
That is not generally how Hadoop works but you could make it do this. I
don't know if there's an override.
On Mar 7, 2012 12:40 AM, "Luke Forehand" <
luke.forehand@networkedinsights.com> wrote:

> Why should it not be compressed in the first place?
>
> Here is the header of one of the reducer parts that was written into
> /mahout/kmeans/clusters-5-final
>
> SEQ  org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
>  )org.apache.hadoop.io.compress.SnappyCodec
>
>
> On 3/6/12 6:33 PM, "Sean Owen" <sr...@gmail.com> wrote:
>
> >Ok but you're talking about reducer output not mapper. It should not be
> >compressed in the first place.
> >On Mar 7, 2012 12:29 AM, "Luke Forehand" <
> >luke.forehand@networkedinsights.com> wrote:
> >
> >> I want the results of the kmeans clustering to be uncompressed or
> >> compressed in a way that my users can natively decompress on their
> >> machines.  All our other hadoop jobs use Snappy compression when writing
> >> output, but our users don't have Snappy and don't particularly want to
> >> install it (especially because of problems installing on mac).  I'll try
> >> adding this param to the HADOOP_OPTS and in the longterm probably come
> >>up
> >> with a cleaner way to do this.  Thanks!
> >>
> >> -Luke
> >>
> >> On 3/6/12 6:24 PM, "Sean Owen" <sr...@gmail.com> wrote:
> >>
> >> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
> >> >recall). Or you configure this in your Hadoop config files.  It has no
> >> >meaning to the driver script. Why do you want to disable compression
> >> >after the mapper?
> >> >
> >> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
> >> ><lu...@networkedinsights.com> wrote:
> >> >> I tried the following and it does not work:
> >> >>
> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd
> >>0.01
> >> >> -x 100 \
> >> >> -Dmapreduce.map.output.compress=false
> >> >>
> >> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
> >> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd
> >>0.01
> >> >> -x 100 \
> >> >>-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
> >> >>
> >> >>
> >> >> And still getting the default codec being used (which is Snappy in
> >>this
> >> >> case and I don't want the users to have to install native snappy
> >>which
> >> >>is
> >> >> why I'm trying to override this param).  Passing -Dkey=value on the
> >> >>mahout
> >> >> command line does not seem to have any effect on the mapreduce job
> >> >> configuration from what I can tell.  Any ideas?
> >> >>
> >> >> -Luke
> >> >>
> >> >> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
> >> >>
> >> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think
> >>the
> >> >>>key was mapred.output.compress in Hadoop 0.20.0.
> >> >>>I am not sure if there is reducer compression built-in, but, I could
> >> >>>have missed it.
> >> >>>
> >> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
> >> >>><lu...@networkedinsights.com> wrote:
> >> >>>> Hello,
> >> >>>>
> >> >>>> Is there a way to run the mahout kmeans program from the command
> >>line,
> >> >>>>with a parameter that will override (and disable) the reducer task
> >> >>>>compression?  I have tried several different ways of specifying -D
> >> >>>>parameter but I can't seem to get any options to pass through to the
> >> >>>>hadoop mapreduce configuration.
> >> >>>>
> >> >>>> Thanks!
> >> >>>> Luke
> >> >>
> >>
> >>
>
>

Re: override mapreduce compression?

Posted by Luke Forehand <lu...@networkedinsights.com>.
Why should it not be compressed in the first place?

Here is the header of one of the reducer parts that was written into
/mahout/kmeans/clusters-5-final

SEQ  org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster
 )org.apache.hadoop.io.compress.SnappyCodec


On 3/6/12 6:33 PM, "Sean Owen" <sr...@gmail.com> wrote:

>Ok but you're talking about reducer output not mapper. It should not be
>compressed in the first place.
>On Mar 7, 2012 12:29 AM, "Luke Forehand" <
>luke.forehand@networkedinsights.com> wrote:
>
>> I want the results of the kmeans clustering to be uncompressed or
>> compressed in a way that my users can natively decompress on their
>> machines.  All our other hadoop jobs use Snappy compression when writing
>> output, but our users don't have Snappy and don't particularly want to
>> install it (especially because of problems installing on mac).  I'll try
>> adding this param to the HADOOP_OPTS and in the longterm probably come
>>up
>> with a cleaner way to do this.  Thanks!
>>
>> -Luke
>>
>> On 3/6/12 6:24 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>
>> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>> >recall). Or you configure this in your Hadoop config files.  It has no
>> >meaning to the driver script. Why do you want to disable compression
>> >after the mapper?
>> >
>> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
>> ><lu...@networkedinsights.com> wrote:
>> >> I tried the following and it does not work:
>> >>
>> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd
>>0.01
>> >> -x 100 \
>> >> -Dmapreduce.map.output.compress=false
>> >>
>> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd
>>0.01
>> >> -x 100 \
>> >>-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>> >>
>> >>
>> >> And still getting the default codec being used (which is Snappy in
>>this
>> >> case and I don't want the users to have to install native snappy
>>which
>> >>is
>> >> why I'm trying to override this param).  Passing -Dkey=value on the
>> >>mahout
>> >> command line does not seem to have any effect on the mapreduce job
>> >> configuration from what I can tell.  Any ideas?
>> >>
>> >> -Luke
>> >>
>> >> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
>> >>
>> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think
>>the
>> >>>key was mapred.output.compress in Hadoop 0.20.0.
>> >>>I am not sure if there is reducer compression built-in, but, I could
>> >>>have missed it.
>> >>>
>> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>> >>><lu...@networkedinsights.com> wrote:
>> >>>> Hello,
>> >>>>
>> >>>> Is there a way to run the mahout kmeans program from the command
>>line,
>> >>>>with a parameter that will override (and disable) the reducer task
>> >>>>compression?  I have tried several different ways of specifying -D
>> >>>>parameter but I can't seem to get any options to pass through to the
>> >>>>hadoop mapreduce configuration.
>> >>>>
>> >>>> Thanks!
>> >>>> Luke
>> >>
>>
>>
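The check Luke does above, eyeballing the SequenceFile header for a codec class name, can be scripted. A minimal sketch; the bytes written below are a stand-in for the first bytes of a real part file, not an actual SequenceFile:

```shell
# Build a stand-in for the start of a SequenceFile header (illustrative only;
# a real header also carries version, flag, and metadata bytes).
printf 'SEQ\006org.apache.hadoop.io.Text+org.apache.mahout.clustering.kmeans.Cluster)org.apache.hadoop.io.compress.SnappyCodec' > /tmp/fake-part-r-00000

# Check the SequenceFile magic, then extract the codec named in the header.
head -c 3 /tmp/fake-part-r-00000   # SEQ
echo
grep -ao 'org\.apache\.hadoop\.io\.compress\.[A-Za-z]*Codec' /tmp/fake-part-r-00000
# prints: org.apache.hadoop.io.compress.SnappyCodec
```

On a real file, `hadoop fs -text <part-file>` will decode a compressed SequenceFile for you, though it too needs the codec (e.g. native Snappy) available on the local machine.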


Re: override mapreduce compression?

Posted by Sean Owen <sr...@gmail.com>.
Ok but you're talking about reducer output not mapper. It should not be
compressed in the first place.
On Mar 7, 2012 12:29 AM, "Luke Forehand" <
luke.forehand@networkedinsights.com> wrote:

> I want the results of the kmeans clustering to be uncompressed or
> compressed in a way that my users can natively decompress on their
> machines.  All our other hadoop jobs use Snappy compression when writing
> output, but our users don't have Snappy and don't particularly want to
> install it (especially because of problems installing on mac).  I'll try
> adding this param to the HADOOP_OPTS and in the longterm probably come up
> with a cleaner way to do this.  Thanks!
>
> -Luke
>
> On 3/6/12 6:24 PM, "Sean Owen" <sr...@gmail.com> wrote:
>
> >-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
> >recall). Or you configure this in your Hadoop config files.  It has no
> >meaning to the driver script. Why do you want to disable compression
> >after the mapper?
> >
> >On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
> ><lu...@networkedinsights.com> wrote:
> >> I tried the following and it does not work:
> >>
> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
> >> -x 100 \
> >> -Dmapreduce.map.output.compress=false
> >>
> >> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
> >> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
> >> -x 100 \
> >>
> >>-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
> >>
> >>
> >> And still getting the default codec being used (which is Snappy in this
> >> case and I don't want the users to have to install native snappy which
> >>is
> >> why I'm trying to override this param).  Passing -Dkey=value on the
> >>mahout
> >> command line does not seem to have any effect on the mapreduce job
> >> configuration from what I can tell.  Any ideas?
> >>
> >> -Luke
> >>
> >> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
> >>
> >>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
> >>>key was mapred.output.compress in Hadoop 0.20.0.
> >>>I am not sure if there is reducer compression built-in, but, I could
> >>>have missed it.
> >>>
> >>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
> >>><lu...@networkedinsights.com> wrote:
> >>>> Hello,
> >>>>
> >>>> Is there a way to run the mahout kmeans program from the command line,
> >>>>with a parameter that will override (and disable) the reducer task
> >>>>compression?  I have tried several different ways of specifying -D
> >>>>parameter but I can't seem to get any options to pass through to the
> >>>>hadoop mapreduce configuration.
> >>>>
> >>>> Thanks!
> >>>> Luke
> >>
>
>

Re: override mapreduce compression?

Posted by Luke Forehand <lu...@networkedinsights.com>.
I want the results of the kmeans clustering to be uncompressed, or
compressed in a way that my users can natively decompress on their
machines.  All our other hadoop jobs use Snappy compression when writing
output, but our users don't have Snappy and don't particularly want to
install it (especially because of problems installing it on Mac).  I'll try
adding this param to HADOOP_OPTS and in the long term probably come up
with a cleaner way to do this.  Thanks!

-Luke

On 3/6/12 6:24 PM, "Sean Owen" <sr...@gmail.com> wrote:

>-D arguments are to the JVM so need to be set in HADOOP_OPTS (as I
>recall). Or you configure this in your Hadoop config files.  It has no
>meaning to the driver script. Why do you want to disable compression
>after the mapper?
>
>On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
><lu...@networkedinsights.com> wrote:
>> I tried the following and it does not work:
>>
>> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
>> -x 100 \
>> -Dmapreduce.map.output.compress=false
>>
>> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
>> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
>> -x 100 \
>> 
>>-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>
>>
>> And still getting the default codec being used (which is Snappy in this
>> case and I don't want the users to have to install native snappy which
>>is
>> why I'm trying to override this param).  Passing -Dkey=value on the
>>mahout
>> command line does not seem to have any effect on the mapreduce job
>> configuration from what I can tell.  Any ideas?
>>
>> -Luke
>>
>> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
>>
>>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
>>>key was mapred.output.compress in Hadoop 0.20.0.
>>>I am not sure if there is reducer compression built-in, but, I could
>>>have missed it.
>>>
>>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>>><lu...@networkedinsights.com> wrote:
>>>> Hello,
>>>>
>>>> Is there a way to run the mahout kmeans program from the command line,
>>>>with a parameter that will override (and disable) the reducer task
>>>>compression?  I have tried several different ways of specifying -D
>>>>parameter but I can't seem to get any options to pass through to the
>>>>hadoop mapreduce configuration.
>>>>
>>>> Thanks!
>>>> Luke
>>


Re: override mapreduce compression?

Posted by Sean Owen <sr...@gmail.com>.
-D arguments are JVM arguments, so they need to be set in HADOOP_OPTS (as I
recall). Or you can configure this in your Hadoop config files.  It has no
meaning to the driver script. Why do you want to disable compression
after the mapper?

On Wed, Mar 7, 2012 at 12:11 AM, Luke Forehand
<lu...@networkedinsights.com> wrote:
> I tried the following and it does not work:
>
> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
> -x 100 \
> -Dmapreduce.map.output.compress=false
>
> mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
> /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
> -x 100 \
> -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>
>
> And still getting the default codec being used (which is Snappy in this
> case and I don't want the users to have to install native snappy which is
> why I'm trying to override this param).  Passing -Dkey=value on the mahout
> command line does not seem to have any effect on the mapreduce job
> configuration from what I can tell.  Any ideas?
>
> -Luke
>
> On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:
>
>>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
>>key was mapred.output.compress in Hadoop 0.20.0.
>>I am not sure if there is reducer compression built-in, but, I could
>>have missed it.
>>
>>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
>><lu...@networkedinsights.com> wrote:
>>> Hello,
>>>
>>> Is there a way to run the mahout kmeans program from the command line,
>>>with a parameter that will override (and disable) the reducer task
>>>compression?  I have tried several different ways of specifying -D
>>>parameter but I can't seem to get any options to pass through to the
>>>hadoop mapreduce configuration.
>>>
>>> Thanks!
>>> Luke
>
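A sketch of the HADOOP_OPTS route suggested above (untested against a live cluster; the keys are the 0.20-era property names discussed in this thread, and the cluster must not mark them final for the override to take effect):

```shell
# Put the -D system properties in HADOOP_OPTS so they reach the driver JVM
# rather than being treated as arguments to the mahout launcher script.
export HADOOP_OPTS="-Dmapred.output.compress=false -Dmapreduce.map.output.compress=false"
echo "$HADOOP_OPTS"

# Then run the job as before (illustrative, not executed here):
# mahout kmeans -i /mahout/sparse/test1/tfidf-vectors \
#   -c /mahout/initial-clusters/test1 -o /mahout/kmeans/test1 \
#   -k 10000 -cd 0.01 -x 100
```

Whether the driver actually copies these system properties into the job Configuration depends on the Mahout version; inspecting the submitted job's configuration in the JobTracker UI would confirm it.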

Re: override mapreduce compression?

Posted by Luke Forehand <lu...@networkedinsights.com>.
I tried the following and it does not work:

mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
/mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
-x 100 \
-Dmapreduce.map.output.compress=false

mahout kmeans -i /mahout/sparse/test1/tfidf-vectors -c
/mahout/initial-clusters/test1 -o /mahout/kmeans/test1 -k 10000 -cd 0.01
-x 100 \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec


I'm still getting the default codec (Snappy in this case), and I don't
want users to have to install native Snappy, which is why I'm trying to
override this param.  Passing -Dkey=value on the mahout command line does
not seem to have any effect on the mapreduce job configuration, from what
I can tell.  Any ideas?

-Luke

On 3/6/12 3:48 PM, "Sean Owen" <sr...@gmail.com> wrote:

>Mapper compression? -Dmapreduce.map.output.compress=false. I think the
>key was mapred.output.compress in Hadoop 0.20.0.
>I am not sure if there is reducer compression built-in, but, I could
>have missed it.
>
>On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
><lu...@networkedinsights.com> wrote:
>> Hello,
>>
>> Is there a way to run the mahout kmeans program from the command line,
>>with a parameter that will override (and disable) the reducer task
>>compression?  I have tried several different ways of specifying -D
>>parameter but I can't seem to get any options to pass through to the
>>hadoop mapreduce configuration.
>>
>> Thanks!
>> Luke


Re: override mapreduce compression?

Posted by Sean Owen <sr...@gmail.com>.
Mapper compression? -Dmapreduce.map.output.compress=false. I think the
key was mapred.output.compress in Hadoop 0.20.0.
I am not sure whether there is built-in reducer compression, but I could
have missed it.

On Tue, Mar 6, 2012 at 9:40 PM, Luke Forehand
<lu...@networkedinsights.com> wrote:
> Hello,
>
> Is there a way to run the mahout kmeans program from the command line, with a parameter that will override (and disable) the reducer task compression?  I have tried several different ways of specifying -D parameter but I can't seem to get any options to pass through to the hadoop mapreduce configuration.
>
> Thanks!
> Luke