Posted to dev@mahout.apache.org by beneo_7 <be...@163.com> on 2010/12/28 07:45:17 UTC

where i can set -Dmapred.map.tasks=X

I read in Mahout in Action that I should set -Dmapred.map.tasks=X,
but it did not work with Hadoop.

Re: where i can set -Dmapred.map.tasks=X

Posted by Sean Owen <sr...@gmail.com>.
You set this when launching the jobs with the "hadoop" command. It
should come first among the arguments.
While it's not guaranteed to control the number of tasks, it is a
strong hint: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
If it's not working, check whether something is overriding your setting.
This is more a question about Hadoop than about Mahout.
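Sean's ordering point comes down to Hadoop's GenericOptionsParser, which consumes the leading -D arguments before the tool ever sees its own options. The following is a simplified, hypothetical sketch of that split, not the real Hadoop parser (which is stricter about where generic options may appear, hence the "first among the arguments" advice):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mini-parser mimicking what Hadoop's GenericOptionsParser
// does with -Dkey=value arguments: it peels them off and stores them as
// configuration properties before the tool parses its remaining options.
public class GenericOptsSketch {
    static Map<String, String> props = new LinkedHashMap<>();

    static List<String> stripGenericOptions(String[] args) {
        List<String> remaining = new ArrayList<>();
        for (String arg : args) {
            if (arg.startsWith("-D") && arg.contains("=")) {
                String kv = arg.substring(2);          // drop the "-D"
                int eq = kv.indexOf('=');
                props.put(kv.substring(0, eq), kv.substring(eq + 1));
            } else {
                remaining.add(arg);                    // left for the tool itself
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> rest = stripGenericOptions(
            new String[] {"-Dmapred.map.tasks=10", "-i", "input", "-o", "output"});
        System.out.println(props);  // {mapred.map.tasks=10}
        System.out.println(rest);   // [-i, input, -o, output]
    }
}
```

If the -D values never show up in the job's Configuration, the parser never saw them, which is exactly the failure mode discussed in the rest of this thread.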

2010/12/28 beneo_7 <be...@163.com>:
> I read in Mahout in Action that I should set -Dmapred.map.tasks=X,
> but it did not work with Hadoop.
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Shige Takeda <st...@yahoo-inc.com>.
Jeff, I can help on this task if you don't mind.
One complexity I found was the case where one driver kicks off both
MapReduce and sequential jobs. The sequential one may not really need a
conf, but it passes conf to new FileSystem(uri, conf) to get a FileSystem,
and getConf() returns null there, resulting in a NullPointerException.
Thanks,
-- Shige
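The null-conf case Shige mentions can be guarded against with a trivial fallback. A self-contained sketch follows; Hadoop's real Configuration is replaced by an empty stub so the example compiles on its own, and the driver class names are illustrative, not actual Mahout code:

```java
// Sketch of a null-conf guard for a driver that may run sequentially.
class Configuration {}

class SequentialDriver {
    private Configuration conf;  // may never be set when run sequentially

    Configuration getConf() { return conf; }
    void setConf(Configuration c) { conf = c; }

    // Fall back to a fresh Configuration instead of handing null to
    // something like new FileSystem(uri, conf) and hitting an NPE.
    Configuration getConfOrDefault() {
        Configuration c = getConf();
        return (c != null) ? c : new Configuration();
    }
}

public class NullConfGuard {
    public static void main(String[] args) {
        SequentialDriver driver = new SequentialDriver();
        // Even though setConf() was never called, this cannot NPE:
        System.out.println(driver.getConfOrDefault() != null);  // true
    }
}
```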

Jeff Eastman wrote:
> Ok, this seems to be a more widespread problem. Let's identify all the places that need to be touched and I will commit them all at the same time.
>
> -----Original Message-----
> From: Shige Takeda [mailto:stakeda@yahoo-inc.com]
> Sent: Tuesday, January 04, 2011 9:03 AM
> To: user@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> Hello,
>
> Coincidentally, I came across the same problem last week and found the
> cause: Seq2Sparse's main didn't use ToolRunner.run(Tool,String[]),
> which automatically feeds -D parameters into a Configuration object
> that is then accessible via Configurable.getConf().
>
> Also, I see that a lot of driver main functions, especially around the
> clustering code, don't use ToolRunner.run(Tool,String[]) but
> ToolRunner.run(Configuration,Tool,String[]). A problem with the latter
> is that it doesn't consider the passed -D parameters.
>
> See the difference in this javadoc:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html
>
> FYI, a specific problem for me is that -Dmapred.job.queue.name=something
> is required when I run a job in the company's Hadoop cluster.
>
> Btw, any correction or suggestion to my comment is welcome, as I'm
> still learning the code, having started only last month.
>
> Thanks,
> -- Shige Takeda
>
> On 1/3/2011 8:27 PM, Jeff Eastman wrote:
>    
>> Seq2Sparse has this problem too? Not good. Users really need -D
>> abilities there. How about you JIRA your patch and I will get it in?
>>
>>
>> On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
>>      
>>> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a lot
>>> of sense there since it is a composite job and i am not sure if
>>> configuration is propagated to those. But i got it too if need be.
>>>
>>> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>    wrote:
>>>
>>>        
>>>> Resolved in mahout-574.
>>>>
>>>>
>>>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>>>
>>>>          
>>>>> Yes, it could indeed. See my previous email which shows the problem unique
>>>>> to this class.
>>>>>
>>>>>
>>>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>>>>
>>>>>            
>>>>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>>>>
>>>>>>
>>>>>>              
>
>
>    

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ok, thanks, that's the way then.

On Tue, Jan 4, 2011 at 1:37 PM, Sebastian Schelter <ss...@apache.org> wrote:

> IIRC nothing more than calling ToolRunner.run(...) with the current
> configuration from within your job class is needed to propagate the
> configuration when invoking other jobs.
>
> o.a.m.cf.taste.hadoop.item.RecommenderJob which internally calls
> RowSimilarityJob had the problem a while ago.
>
> --sebastian
>
> Am 04.01.2011 22:04, schrieb Dmitriy Lyubimov:
> > Sean,
> >
> > so, is there's a comment or document on how to propagate configuration to
> > multiple jobs? or perhaps an example driver class that adheres to that?
> >
> >
> > On Tue, Jan 4, 2011 at 12:30 PM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> As a side point, the long-standing push to standardize on some
> >> approach for running MapReduce jobs (or groups of them), embodied in
> >> AbstractJob, would also solve this since details like this are handled
> >> already. It'd be good to move towards that model, not only because it
> >> fixes this and avoids some similar future issues, but for the sake of
> >> standardization.
> >>
> >>
> >> On Tue, Jan 4, 2011 at 12:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>> Jeff, he meant that those that _don't_ use ToolRunner can't parse -D.
> >> Those
> >>> that do use, can.
> >>>
> >>> I did patch for seq2sparse. It worked reasonably well for me (in a
> >> strange
> >>> way). However, I am hesitant to offer it. The reason like i said is
> that
> >>> unlike seqdirectory job, seq2sparse uses a lot of jobs and in order to
> >> make
> >>> use of -D parameters, it needs to make sure that either every one of
> them
> >> is
> >>> launched thru a ToolRunner, or propagate obtained Configuration object
> to
> >>> them explicitly using API-ish approach. Which my patch doesn't really
> >> take
> >>> care of to a due extent, there's more work to be done to do so.
> >>>
> >>> (BTW i realize my ssvd work suffers from this too).
> >>>
> >>> -d
> >>
> >
>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Sebastian Schelter <ss...@apache.org>.
IIRC, nothing more than calling ToolRunner.run(...) with the current
configuration from within your job class is needed to propagate the
configuration when invoking other jobs.

o.a.m.cf.taste.hadoop.item.RecommenderJob, which internally calls
RowSimilarityJob, had the problem a while ago.

--sebastian
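Sebastian's pattern, sketched below with placeholder job classes. This assumes the Hadoop 0.20 API discussed in this thread; OuterJob and InnerJob are illustrative stand-ins for pairs like RecommenderJob and RowSimilarityJob, not actual Mahout code:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Inner job: any Tool works here.
class InnerJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() here sees the outer job's Configuration,
        // including any -D settings from the command line.
        return 0;
    }
}

// Outer job: forwards its already-parsed Configuration to the inner job.
public class OuterJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        String[] innerArgs = {};  // the inner job's own options would go here
        // Crucial detail: pass getConf(), not a fresh Configuration,
        // so the -D settings survive the hop to the inner job.
        return ToolRunner.run(getConf(), new InnerJob(), innerArgs);
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new OuterJob(), args));
    }
}
```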

Am 04.01.2011 22:04, schrieb Dmitriy Lyubimov:
> Sean,
> 
> so, is there's a comment or document on how to propagate configuration to
> multiple jobs? or perhaps an example driver class that adheres to that?
> 
> 
> On Tue, Jan 4, 2011 at 12:30 PM, Sean Owen <sr...@gmail.com> wrote:
> 
>> As a side point, the long-standing push to standardize on some
>> approach for running MapReduce jobs (or groups of them), embodied in
>> AbstractJob, would also solve this since details like this are handled
>> already. It'd be good to move towards that model, not only because it
>> fixes this and avoids some similar future issues, but for the sake of
>> standardization.
>>
>>
>> On Tue, Jan 4, 2011 at 12:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>> Jeff, he meant that those that _don't_ use ToolRunner can't parse -D.
>> Those
>>> that do use, can.
>>>
>>> I did patch for seq2sparse. It worked reasonably well for me (in a
>> strange
>>> way). However, I am hesitant to offer it. The reason like i said is that
>>> unlike seqdirectory job, seq2sparse uses a lot of jobs and in order to
>> make
>>> use of -D parameters, it needs to make sure that either every one of them
>> is
>>> launched thru a ToolRunner, or propagate obtained Configuration object to
>>> them explicitly using API-ish approach. Which my patch doesn't really
>> take
>>> care of to a due extent, there's more work to be done to do so.
>>>
>>> (BTW i realize my ssvd work suffers from this too).
>>>
>>> -d
>>
> 


Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes, it does. Ok, will do tomorrow. Thank you, Jeff.
-Dmitriy

On Mon, Jan 3, 2011 at 8:27 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Seq2Sparse has this problem too? Not good. Users really need -D abilities
> there. How about you JIRA your patch and I will get it in?
>
>
>
> On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
>
>> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a
>> lot
>> of sense there since it is a composite job and i am not sure if
>> configuration is propagated to those. But i got it too if need be.
>>
>> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>
>>  wrote:
>>
>> Resolved in mahout-574.
>>>
>>>
>>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<jdog@windwardsolutions.com
>>> >wrote:
>>>
>>> Yes, it could indeed. See my previous email which shows the problem
>>>> unique
>>>> to this class.
>>>>
>>>>
>>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>>>
>>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>>>
>>>>>
>>>>>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sean,

So, is there a comment or document on how to propagate the configuration to
multiple jobs? Or perhaps an example driver class that adheres to that?


On Tue, Jan 4, 2011 at 12:30 PM, Sean Owen <sr...@gmail.com> wrote:

> As a side point, the long-standing push to standardize on some
> approach for running MapReduce jobs (or groups of them), embodied in
> AbstractJob, would also solve this since details like this are handled
> already. It'd be good to move towards that model, not only because it
> fixes this and avoids some similar future issues, but for the sake of
> standardization.
>
>
> On Tue, Jan 4, 2011 at 12:30 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > Jeff, he meant that those that _don't_ use ToolRunner can't parse -D.
> Those
> > that do use, can.
> >
> > I did patch for seq2sparse. It worked reasonably well for me (in a
> strange
> > way). However, I am hesitant to offer it. The reason like i said is that
> > unlike seqdirectory job, seq2sparse uses a lot of jobs and in order to
> make
> > use of -D parameters, it needs to make sure that either every one of them
> is
> > launched thru a ToolRunner, or propagate obtained Configuration object to
> > them explicitly using API-ish approach. Which my patch doesn't really
> take
> > care of to a due extent, there's more work to be done to do so.
> >
> > (BTW i realize my ssvd work suffers from this too).
> >
> > -d
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Sean Owen <sr...@gmail.com>.
As a side point, the long-standing push to standardize on some
approach for running MapReduce jobs (or groups of them), embodied in
AbstractJob, would also solve this since details like this are handled
already. It'd be good to move towards that model, not only because it
fixes this and avoids some similar future issues, but for the sake of
standardization.


On Tue, Jan 4, 2011 at 12:30 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Jeff, he meant that those that _don't_ use ToolRunner can't parse -D. Those
> that do use, can.
>
> I did patch for seq2sparse. It worked reasonably well for me (in a strange
> way). However, I am hesitant to offer it. The reason like i said is that
> unlike seqdirectory job, seq2sparse uses a lot of jobs and in order to make
> use of -D parameters, it needs to make sure that either every one of them is
> launched thru a ToolRunner, or propagate obtained Configuration object to
> them explicitly using API-ish approach. Which my patch doesn't really take
> care of to a due extent, there's more work to be done to do so.
>
> (BTW i realize my ssvd work suffers from this too).
>
> -d

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jeff, he meant that the drivers that _don't_ use ToolRunner can't parse -D;
those that do use it can.

I did a patch for seq2sparse. It worked reasonably well for me (in a strange
way). However, I am hesitant to offer it. The reason, like I said, is that
unlike the seqdirectory job, seq2sparse uses a lot of jobs, and in order to
make use of -D parameters, it needs to ensure that either every one of them
is launched through a ToolRunner, or the obtained Configuration object is
propagated to them explicitly through the API. My patch doesn't really take
care of that to a due extent; there's more work to be done.

(BTW, I realize my ssvd work suffers from this too.)

-d



On Tue, Jan 4, 2011 at 9:43 AM, Jeff Eastman <je...@narus.com> wrote:

> It's odd though, that kmeans works correctly with multiple -D arguments,
> even though it uses the ToolRunner.run(Configuration,Tool,String[]). Are you
> sure about the semantics difference? It's not obvious from the javadocs.
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@narus.com]
> Sent: Tuesday, January 04, 2011 9:09 AM
> To: user@mahout.apache.org
> Subject: RE: where i can set -Dmapred.map.tasks=X
>
> Ok, this seems to be a more widespread problem. Let's identify all the
> places that need to be touched and I will commit them all at the same time.
>
> -----Original Message-----
> From: Shige Takeda [mailto:stakeda@yahoo-inc.com]
> Sent: Tuesday, January 04, 2011 9:03 AM
> To: user@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> Hello,
>
> Coincidentally, I came across the same problem last week and found the
> cause: Seq2Sparse's main didn't use ToolRunner.run(Tool,String[]),
> which automatically feeds -D parameters into a Configuration object
> that is then accessible via Configurable.getConf().
>
> Also, I see that a lot of driver main functions, especially around the
> clustering code, don't use ToolRunner.run(Tool,String[]) but
> ToolRunner.run(Configuration,Tool,String[]). A problem with the latter
> is that it doesn't consider the passed -D parameters.
>
> See the difference in this javadoc:
>
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html
>
> FYI, a specific problem for me is that -Dmapred.job.queue.name=something
> is required when I run a job in the company's Hadoop cluster.
>
> Btw, any correction or suggestion to my comment is welcome, as I'm
> still learning the code, having started only last month.
>
> Thanks,
> -- Shige Takeda
>
> On 1/3/2011 8:27 PM, Jeff Eastman wrote:
> > Seq2Sparse has this problem too? Not good. Users really need -D
> > abilities there. How about you JIRA your patch and I will get it in?
> >
> >
> > On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
> >> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a
> lot
> >> of sense there since it is a composite job and i am not sure if
> >> configuration is propagated to those. But i got it too if need be.
> >>
> >> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>
> wrote:
> >>
> >>> Resolved in mahout-574.
> >>>
> >>>
> >>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<
> jdog@windwardsolutions.com>wrote:
> >>>
> >>>> Yes, it could indeed. See my previous email which shows the problem
> unique
> >>>> to this class.
> >>>>
> >>>>
> >>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
> >>>>
> >>>>> Could it be because of SequenceFileFromDirectory is not an
> AbstractJob?
> >>>>>
> >>>>>
>
>
>

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
It's odd, though, that kmeans works correctly with multiple -D arguments even though it uses ToolRunner.run(Configuration,Tool,String[]). Are you sure about the semantics difference? It's not obvious from the javadocs.

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@narus.com] 
Sent: Tuesday, January 04, 2011 9:09 AM
To: user@mahout.apache.org
Subject: RE: where i can set -Dmapred.map.tasks=X

Ok, this seems to be a more widespread problem. Let's identify all the places that need to be touched and I will commit them all at the same time.

-----Original Message-----
From: Shige Takeda [mailto:stakeda@yahoo-inc.com] 
Sent: Tuesday, January 04, 2011 9:03 AM
To: user@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

Hello,

Coincidentally, I came across the same problem last week and found the
cause: Seq2Sparse's main didn't use ToolRunner.run(Tool,String[]),
which automatically feeds -D parameters into a Configuration object
that is then accessible via Configurable.getConf().

Also, I see that a lot of driver main functions, especially around the
clustering code, don't use ToolRunner.run(Tool,String[]) but
ToolRunner.run(Configuration,Tool,String[]). A problem with the latter
is that it doesn't consider the passed -D parameters.

See the difference in this javadoc:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html

FYI, a specific problem for me is that -Dmapred.job.queue.name=something
is required when I run a job in the company's Hadoop cluster.

Btw, any correction or suggestion to my comment is welcome, as I'm
still learning the code, having started only last month.

Thanks,
-- Shige Takeda

On 1/3/2011 8:27 PM, Jeff Eastman wrote:
> Seq2Sparse has this problem too? Not good. Users really need -D
> abilities there. How about you JIRA your patch and I will get it in?
>
>
> On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
>> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a lot
>> of sense there since it is a composite job and i am not sure if
>> configuration is propagated to those. But i got it too if need be.
>>
>> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>   wrote:
>>
>>> Resolved in mahout-574.
>>>
>>>
>>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>>
>>>> Yes, it could indeed. See my previous email which shows the problem unique
>>>> to this class.
>>>>
>>>>
>>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>>>
>>>>>



RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
Ok, this seems to be a more widespread problem. Let's identify all the places that need to be touched and I will commit them all at the same time.

-----Original Message-----
From: Shige Takeda [mailto:stakeda@yahoo-inc.com] 
Sent: Tuesday, January 04, 2011 9:03 AM
To: user@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

Hello,

Coincidentally, I came across the same problem last week and found the
cause: Seq2Sparse's main didn't use ToolRunner.run(Tool,String[]),
which automatically feeds -D parameters into a Configuration object
that is then accessible via Configurable.getConf().

Also, I see that a lot of driver main functions, especially around the
clustering code, don't use ToolRunner.run(Tool,String[]) but
ToolRunner.run(Configuration,Tool,String[]). A problem with the latter
is that it doesn't consider the passed -D parameters.

See the difference in this javadoc:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html

FYI, a specific problem for me is that -Dmapred.job.queue.name=something
is required when I run a job in the company's Hadoop cluster.

Btw, any correction or suggestion to my comment is welcome, as I'm
still learning the code, having started only last month.

Thanks,
-- Shige Takeda

On 1/3/2011 8:27 PM, Jeff Eastman wrote:
> Seq2Sparse has this problem too? Not good. Users really need -D
> abilities there. How about you JIRA your patch and I will get it in?
>
>
> On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
>> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a lot
>> of sense there since it is a composite job and i am not sure if
>> configuration is propagated to those. But i got it too if need be.
>>
>> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>   wrote:
>>
>>> Resolved in mahout-574.
>>>
>>>
>>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>>
>>>> Yes, it could indeed. See my previous email which shows the problem unique
>>>> to this class.
>>>>
>>>>
>>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>>>
>>>>>



Re: where i can set -Dmapred.map.tasks=X

Posted by Shige Takeda <st...@yahoo-inc.com>.
Hello,

Coincidentally, I came across the same problem last week and found the
cause: Seq2Sparse's main didn't use ToolRunner.run(Tool,String[]),
which automatically feeds -D parameters into a Configuration object
that is then accessible via Configurable.getConf().

Also, I see that a lot of driver main functions, especially around the
clustering code, don't use ToolRunner.run(Tool,String[]) but
ToolRunner.run(Configuration,Tool,String[]). A problem with the latter
is that it doesn't consider the passed -D parameters.

See the difference in this javadoc:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html

FYI, a specific problem for me is that -Dmapred.job.queue.name=something
is required when I run a job in the company's Hadoop cluster.

Btw, any correction or suggestion to my comment is welcome, as I'm
still learning the code, having started only last month.

Thanks,
-- Shige Takeda
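The fix Shige describes is the standard ToolRunner driver pattern. A minimal sketch follows; the driver name is illustrative, and it assumes the Hadoop 0.20 API linked above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal driver sketch. ToolRunner.run(Tool, String[]) invokes the
// GenericOptionsParser, so -Dkey=value arguments are folded into the
// Configuration that getConf() returns inside run().
public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already carries the -D settings
        System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}
```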

On 1/3/2011 8:27 PM, Jeff Eastman wrote:
> Seq2Sparse has this problem too? Not good. Users really need -D
> abilities there. How about you JIRA your patch and I will get it in?
>
>
> On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
>> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a lot
>> of sense there since it is a composite job and i am not sure if
>> configuration is propagated to those. But i got it too if need be.
>>
>> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>   wrote:
>>
>>> Resolved in mahout-574.
>>>
>>>
>>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>>
>>>> Yes, it could indeed. See my previous email which shows the problem unique
>>>> to this class.
>>>>
>>>>
>>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>>>
>>>>>



Re: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Seq2Sparse has this problem too? Not good. Users really need -D 
abilities there. How about you JIRA your patch and I will get it in?


On 1/3/11 7:43 PM, Dmitriy Lyubimov wrote:
> Jeff, i also have a similar patch for seq2sparse. Not sure if it makes a lot
> of sense there since it is a composite job and i am not sure if
> configuration is propagated to those. But i got it too if need be.
>
> On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>
>> Resolved in mahout-574.
>>
>>
>> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>
>>> Yes, it could indeed. See my previous email which shows the problem unique
>>> to this class.
>>>
>>>
>>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>>
>>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>>
>>>>


Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jeff, I also have a similar patch for seq2sparse. Not sure it makes a lot
of sense there, since it is a composite job and I am not sure the
configuration is propagated to the sub-jobs. But I've got it too if need be.

On Mon, Jan 3, 2011 at 5:36 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Resolved in mahout-574.
>
>
> On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>> Yes, it could indeed. See my previous email which shows the problem unique
>> to this class.
>>
>>
>> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>>
>>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>>
>>>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Resolved in mahout-574.

On Mon, Jan 3, 2011 at 3:49 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Yes, it could indeed. See my previous email which shows the problem unique
> to this class.
>
>
> On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
>
>> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>>
>>

Re: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, it could indeed. See my previous email which shows the problem 
unique to this class.

On 1/3/11 3:30 PM, Dmitriy Lyubimov wrote:
> Could it be because of SequenceFileFromDirectory is not an AbstractJob?
>
> On Mon, Jan 3, 2011 at 3:21 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>
>> I printed out arguments that it supplies to hadoop program driver:
>>
>> [seqdirectory, -Dfs.default.name=file:///, -Dmapred.job.tracker=local, -c,
>> UTF-8, -o, /home/dmitriy/projects/testcollections/reuters-seqfiles, -i,
>> /home/dmitriy/projects/testcollections/reuters-extracted/]
>>
>>
>> So it seems to be doing the right thing with the ordering now but it still
>> doesn't work for some reason with this particular command line.
>>
>> -Dmitriy
>>
>>
>> On Mon, Jan 3, 2011 at 3:17 PM, Dmitriy Lyubimov<dl...@gmail.com>wrote:
>>
>>> Jeff,
>>> now it stopped complaining about first -D but started doing so about the
>>> second one.
>>>
>>>
>>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
>>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>>> 11/01/03 15:16:13 ERROR text.SequenceFilesFromDirectory: Exception
>>> org.apache.commons.cli2.OptionException: Unexpected -Dfs.default.name=file:///
>>> while processing Options
>>>
>>>          at
>>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>>          at
>>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:201)
>>>
>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>          at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>          at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>          at
>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>          at
>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>          at
>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:183)
>>>
>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>          at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>          at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>
>>>
>>> On Mon, Jan 3, 2011 at 1:04 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>>
>>>> Yes, I committed a small patch on the 29th. Try a new trunk build.
>>>>
>>>>
>>>> On 1/3/11 12:37 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> so did you get around to fixing this? i am having this little bugger all
>>>>> over the place, including book examples that don't work directly if i
>>>>> have
>>>>> hadoop setup on my machine such as in the following:
>>>>>
>>>>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name
>>>>> =file:///
>>>>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>>>>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>>>>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>>>>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>>>>> 11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
>>>>> org.apache.commons.cli2.OptionException: Unexpected
>>>>> -Dmapred.job.tracker=local while processing Options
>>>>>          at
>>>>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>>>>          at
>>>>>
>>>>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
>>>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>          at
>>>>>
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>          at
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>          at
>>>>>
>>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>          at
>>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>          at
>>>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
>>>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>          at
>>>>>
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>          at
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>>
>>>>>
>>>>> Thanks.
>>>>> -Dmitriy
>>>>>
>>>>> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>
>>>>>   wrote:
>>>>>
>>>>>   The patch to MahoutDriver involves the code in the for loop at lines
>>>>>> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
>>>>>> argsList at position 1, else at the end. I will commit a patch for this
>>>>>> tonight as I have not got my Narus CLA signed yet.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>>>> Sent: Wednesday, December 29, 2010 11:46 AM
>>>>>> To: user@mahout.apache.org
>>>>>> Cc: dev@mahout.apache.org
>>>>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>>>>
>>>>>> ok, thank you, Jeff. Good to know. I actually expected to rely on this
>>>>>> for
>>>>>> a
>>>>>> wide range of issues (most common being task jvm parameters override).
>>>>>>
>>>>>>
>>>>>>


Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Could it be because SequenceFileFromDirectory is not an AbstractJob?

On Mon, Jan 3, 2011 at 3:21 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I printed out arguments that it supplies to hadoop program driver:
>
> [seqdirectory, -Dfs.default.name=file:///, -Dmapred.job.tracker=local, -c,
> UTF-8, -o, /home/dmitriy/projects/testcollections/reuters-seqfiles, -i,
> /home/dmitriy/projects/testcollections/reuters-extracted/]
>
>
> So it seems to be doing the right thing with the ordering now but it still
> doesn't work for some reason with this particular command line.
>
> -Dmitriy
>
>
> On Mon, Jan 3, 2011 at 3:17 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> Jeff,
>> now it stopped complaining about first -D but started doing so about the
>> second one.
>>
>>
>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>> 11/01/03 15:16:13 ERROR text.SequenceFilesFromDirectory: Exception
>> org.apache.commons.cli2.OptionException: Unexpected -Dfs.default.name=file:///
>> while processing Options
>>
>>         at
>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>         at
>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:201)
>>
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at
>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:183)
>>
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>>
>> On Mon, Jan 3, 2011 at 1:04 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>>
>>> Yes, I committed a small patch on the 29th. Try a new trunk build.
>>>
>>>
>>> On 1/3/11 12:37 PM, Dmitriy Lyubimov wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> so did you get around to fixing this? i am having this little bugger all
>>>> over the place, including book examples that don't work directly if i
>>>> have
>>>> hadoop setup on my machine such as in the following:
>>>>
>>>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name
>>>> =file:///
>>>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>>>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>>>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>>>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>>>> 11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
>>>> org.apache.commons.cli2.OptionException: Unexpected
>>>> -Dmapred.job.tracker=local while processing Options
>>>>         at
>>>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>>>         at
>>>>
>>>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at
>>>>
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>         at
>>>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>         at
>>>>
>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>         at
>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>         at
>>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at
>>>>
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>         at
>>>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>
>>>>
>>>> Thanks.
>>>> -Dmitriy
>>>>
>>>> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>
>>>>  wrote:
>>>>
>>>>  The patch to MahoutDriver involves the code in the for loop at lines
>>>>> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
>>>>> argsList at position 1, else at the end. I will commit a patch for this
>>>>> tonight as I have not got my Narus CLA signed yet.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>>> Sent: Wednesday, December 29, 2010 11:46 AM
>>>>> To: user@mahout.apache.org
>>>>> Cc: dev@mahout.apache.org
>>>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>>>
>>>>> ok, thank you, Jeff. Good to know. I actually expected to rely on this
>>>>> for
>>>>> a
>>>>> wide range of issues (most common being task jvm parameters override).
>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It works for me with this command line:

./bin/mahout kmeans -Dmapred.reduce.tasks=10 -Dmapred.map.tasks=10 
-Dfs.default.name=file:/// -Dmapred.job.tracker=local -i foo -c bar -o 
baz -x 10

but not with yours:

./bin/mahout seqdirectory -Dmapred.job.tracker=local 
-Dfs.default.name=file:/// -c UTF-8 -i 
/home/dmitriy/projects/testcollections/reuters-extracted/ -o 
/home/dmitriy/projects/testcollections/reuters-seqfiles

This seems to be a different problem.


On 1/3/11 3:21 PM, Dmitriy Lyubimov wrote:
> I printed out arguments that it supplies to hadoop program driver:
>
> [seqdirectory, -Dfs.default.name=file:///, -Dmapred.job.tracker=local, -c,
> UTF-8, -o, /home/dmitriy/projects/testcollections/reuters-seqfiles, -i,
> /home/dmitriy/projects/testcollections/reuters-extracted/]
>
>
> So it seems to be doing the right thing with the ordering now but it still
> doesn't work for some reason with this particular command line.
>
> -Dmitriy
>
> On Mon, Jan 3, 2011 at 3:17 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>
>> Jeff,
>> now it stopped complaining about first -D but started doing so about the
>> second one.
>>
>>
>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>> 11/01/03 15:16:13 ERROR text.SequenceFilesFromDirectory: Exception
>> org.apache.commons.cli2.OptionException: Unexpected -Dfs.default.name=file:///
>> while processing Options
>>
>>          at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>          at
>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:201)
>>
>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>          at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>          at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>          at
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>          at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>          at
>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:183)
>>
>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>          at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>          at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>>
>> On Mon, Jan 3, 2011 at 1:04 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>>
>>> Yes, I committed a small patch on the 29th. Try a new trunk build.
>>>
>>>
>>> On 1/3/11 12:37 PM, Dmitriy Lyubimov wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> so did you get around to fixing this? i am having this little bugger all
>>>> over the place, including book examples that don't work directly if i
>>>> have
>>>> hadoop setup on my machine such as in the following:
>>>>
>>>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name
>>>> =file:///
>>>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>>>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>>>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>>>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>>>> 11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
>>>> org.apache.commons.cli2.OptionException: Unexpected
>>>> -Dmapred.job.tracker=local while processing Options
>>>>          at
>>>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>>>          at
>>>>
>>>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
>>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>          at
>>>>
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>          at
>>>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>>          at
>>>>
>>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>          at
>>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>          at
>>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
>>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>          at
>>>>
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>          at
>>>>
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>
>>>>
>>>> Thanks.
>>>> -Dmitriy
>>>>
>>>> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>
>>>>   wrote:
>>>>
>>>>   The patch to MahoutDriver involves the code in the for loop at lines
>>>>> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
>>>>> argsList at position 1, else at the end. I will commit a patch for this
>>>>> tonight as I have not got my Narus CLA signed yet.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>>> Sent: Wednesday, December 29, 2010 11:46 AM
>>>>> To: user@mahout.apache.org
>>>>> Cc: dev@mahout.apache.org
>>>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>>>
>>>>> ok, thank you, Jeff. Good to know. I actually expected to rely on this
>>>>> for
>>>>> a
>>>>> wide range of issues (most common being task jvm parameters override).
>>>>>
>>>>>
>>>>>


Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I printed out arguments that it supplies to hadoop program driver:

[seqdirectory, -Dfs.default.name=file:///, -Dmapred.job.tracker=local, -c,
UTF-8, -o, /home/dmitriy/projects/testcollections/reuters-seqfiles, -i,
/home/dmitriy/projects/testcollections/reuters-extracted/]


So it seems to be doing the right thing with the ordering now but it still
doesn't work for some reason with this particular command line.

-Dmitriy

On Mon, Jan 3, 2011 at 3:17 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Jeff,
> now it stopped complaining about first -D but started doing so about the
> second one.
>
>
> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
> /home/dmitriy/projects/testcollections/reuters-seqfiles
> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
> 11/01/03 15:16:13 ERROR text.SequenceFilesFromDirectory: Exception
> org.apache.commons.cli2.OptionException: Unexpected -Dfs.default.name=file:///
> while processing Options
>
>         at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>         at
> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:201)
>
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>         at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>         at
> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:183)
>
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
>
> On Mon, Jan 3, 2011 at 1:04 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>> Yes, I committed a small patch on the 29th. Try a new trunk build.
>>
>>
>> On 1/3/11 12:37 PM, Dmitriy Lyubimov wrote:
>>
>>> Hi Jeff,
>>>
>>> so did you get around to fixing this? i am having this little bugger all
>>> over the place, including book examples that don't work directly if i
>>> have
>>> hadoop setup on my machine such as in the following:
>>>
>>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name
>>> =file:///
>>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>>> 11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
>>> org.apache.commons.cli2.OptionException: Unexpected
>>> -Dmapred.job.tracker=local while processing Options
>>>         at
>>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>>         at
>>>
>>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at
>>>
>>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>         at
>>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>         at
>>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>
>>>
>>> Thanks.
>>> -Dmitriy
>>>
>>> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>
>>>  wrote:
>>>
>>>  The patch to MahoutDriver involves the code in the for loop at lines
>>>> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
>>>> argsList at position 1, else at the end. I will commit a patch for this
>>>> tonight as I have not got my Narus CLA signed yet.
>>>>
>>>> -----Original Message-----
>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>> Sent: Wednesday, December 29, 2010 11:46 AM
>>>> To: user@mahout.apache.org
>>>> Cc: dev@mahout.apache.org
>>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>>
>>>> ok, thank you, Jeff. Good to know. I actually expected to rely on this
>>>> for
>>>> a
>>>> wide range of issues (most common being task jvm parameters override).
>>>>
>>>>
>>>>
>>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jeff,
now it stopped complaining about first -D but started doing so about the
second one.

bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
-c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
/home/dmitriy/projects/testcollections/reuters-seqfiles
Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
11/01/03 15:16:13 ERROR text.SequenceFilesFromDirectory: Exception
org.apache.commons.cli2.OptionException: Unexpected -Dfs.default.name=file:///
while processing Options
        at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
        at
org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:201)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:183)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


On Mon, Jan 3, 2011 at 1:04 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Yes, I committed a small patch on the 29th. Try a new trunk build.
>
>
> On 1/3/11 12:37 PM, Dmitriy Lyubimov wrote:
>
>> Hi Jeff,
>>
>> so did you get around to fixing this? i am having this little bugger all
>> over the place, including book examples that don't work directly if i have
>> hadoop setup on my machine such as in the following:
>>
>> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name
>> =file:///
>> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
>> /home/dmitriy/projects/testcollections/reuters-seqfiles
>> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
>> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
>> 11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
>> org.apache.commons.cli2.OptionException: Unexpected
>> -Dmapred.job.tracker=local while processing Options
>>         at
>> org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>>         at
>>
>> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at
>>
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at
>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>>
>> Thanks.
>> -Dmitriy
>>
>> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>
>>  wrote:
>>
>>  The patch to MahoutDriver involves the code in the for loop at lines
>>> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
>>> argsList at position 1, else at the end. I will commit a patch for this
>>> tonight as I have not got my Narus CLA signed yet.
>>>
>>> -----Original Message-----
>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>> Sent: Wednesday, December 29, 2010 11:46 AM
>>> To: user@mahout.apache.org
>>> Cc: dev@mahout.apache.org
>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>
>>> ok, thank you, Jeff. Good to know. I actually expected to rely on this
>>> for
>>> a
>>> wide range of issues (most common being task jvm parameters override).
>>>
>>>
>>>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, I committed a small patch on the 29th. Try a new trunk build.

On 1/3/11 12:37 PM, Dmitriy Lyubimov wrote:
> Hi Jeff,
>
> so did you get around to fixing this? i am having this little bugger all
> over the place, including book examples that don't work directly if i have
> hadoop setup on my machine such as in the following:
>
> bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
> -c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
> /home/dmitriy/projects/testcollections/reuters-seqfiles
> Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
> No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
> 11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
> org.apache.commons.cli2.OptionException: Unexpected
> -Dmapred.job.tracker=local while processing Options
>          at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
>          at
> org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>          at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>          at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>
>
> Thanks.
> -Dmitriy
>
> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>  wrote:
>
>> The patch to MahoutDriver involves the code in the for loop at lines
>> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
>> argsList at position 1, else at the end. I will commit a patch for this
>> tonight as I have not got my Narus CLA signed yet.
>>
>> -----Original Message-----
>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>> Sent: Wednesday, December 29, 2010 11:46 AM
>> To: user@mahout.apache.org
>> Cc: dev@mahout.apache.org
>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>
>> ok, thank you, Jeff. Good to know. I actually expected to rely on this for
>> a
>> wide range of issues (most common being task jvm parameters override).
>>
>>


Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Hi Jeff,

so did you get around to fixing this? i am having this little bugger all
over the place, including book examples that don't work directly if i have
hadoop setup on my machine such as in the following:

bin/mahout seqdirectory -Dmapred.job.tracker=local -Dfs.default.name=file:///
-c UTF-8 -i /home/dmitriy/projects/testcollections/reuters-extracted/ -o
/home/dmitriy/projects/testcollections/reuters-seqfiles
Running on hadoop, using HADOOP_HOME=/home/dmitriy/tools/hadoop
No HADOOP_CONF_DIR set, using /home/dmitriy/tools/hadoop/conf
11/01/03 12:32:06 ERROR text.SequenceFilesFromDirectory: Exception
org.apache.commons.cli2.OptionException: Unexpected
-Dmapred.job.tracker=local while processing Options
        at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
        at
org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:187)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:182)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)


Thanks.
-Dmitriy

On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman <je...@narus.com> wrote:

> The patch to MahoutDriver involves the code in the for loop at lines
> 203-216. If the arg.startsWith("-D") then the arg needs to be added to
> argsList at position 1, else at the end. I will commit a patch for this
> tonight as I have not got my Narus CLA signed yet.
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Wednesday, December 29, 2010 11:46 AM
> To: user@mahout.apache.org
> Cc: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> ok, thank you, Jeff. Good to know. I actually expected to rely on this for
> a
> wide range of issues (most common being task jvm parameters override).
>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Lance Norskog <go...@gmail.com>.
Ah!

It is nice having custom code to fix this unique -D problem: someone
won't change it back to HashMap and break it.
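
Jeff's fix, as quoted further down in this post, can be sketched as a toy standalone class (the class and method names here are ours, not MahoutDriver's): any argument starting with "-D" is inserted at index 1, right behind the program name, so generic Hadoop options always end up ahead of job-specific options.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (class and method names are ours, not MahoutDriver's) of
// the fix described in this thread: arguments starting with "-D" are
// inserted at index 1, right behind the program name, so that generic
// Hadoop options always precede job-specific options.
public class ArgOrderSketch {

    static List<String> orderArgs(String program, String... rawArgs) {
        List<String> argsList = new ArrayList<>();
        argsList.add(program);
        for (String arg : rawArgs) {
            if (arg.startsWith("-D")) {
                argsList.add(1, arg); // generic option: front of the list
            } else {
                argsList.add(arg);    // job-specific option: end of the list
            }
        }
        return argsList;
    }

    public static void main(String[] args) {
        System.out.println(orderArgs("seqdirectory",
                "-Dmapred.job.tracker=local", "-Dfs.default.name=file:///",
                "-c", "UTF-8"));
        // prints [seqdirectory, -Dfs.default.name=file:///,
        //         -Dmapred.job.tracker=local, -c, UTF-8]
    }
}
```

Note that inserting each -D argument at index 1 reverses their relative order, which is consistent with the argument dump Dmitriy prints elsewhere in the thread (fs.default.name ends up before mapred.job.tracker even though it was typed second).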

On Wed, Dec 29, 2010 at 10:43 PM, Ted Dunning <te...@gmail.com> wrote:
> Actually, I think that the Tree collections order things according the
> comparator that is supplied explicitly or implicitly.  Ties cause
> over-writing.
>
> The LinkedHashMap will preserve order of addition.
>
> On Wed, Dec 29, 2010 at 9:57 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:
>
>> Good point. I thought the logic was awkward, testing startsWith twice, so I
>> went with the more direct solution.
>>
>> On 12/29/10 6:29 PM, Lance Norskog wrote:
>>
>>> The Tree Map and Set classes preserve the order of addition to the
>>> Map/Set.
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: where i can set -Dmapred.map.tasks=X

Posted by Ted Dunning <te...@gmail.com>.
Actually, I think that the Tree collections order things according the
comparator that is supplied explicitly or implicitly.  Ties cause
over-writing.

The LinkedHashMap will preserve order of addition.
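
A toy, self-contained demonstration of the difference (this is not Mahout code): given the same insertion order, a TreeMap hands keys back in natural (comparator) order, so a -D option typed first can lose its place at the front, while a LinkedHashMap hands them back exactly as added.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy demonstration (not Mahout code) of the ordering difference:
// a TreeMap iterates keys in comparator (here natural) order, while a
// LinkedHashMap preserves the order in which keys were put in.
public class MapOrderDemo {

    // Put the keys into the given map and return its iteration order.
    static List<String> keyOrder(Map<String, String> map, String... keys) {
        for (String key : keys) {
            map.put(key, "");
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        String[] argv = {"-Dmapred.map.tasks=10", "-i", "--tempDir"};
        // Natural ordering pushes "--tempDir" ahead of the -D option,
        // so a -D argument typed first no longer stays first.
        System.out.println(keyOrder(new TreeMap<>(), argv));
        // prints [--tempDir, -Dmapred.map.tasks=10, -i]
        System.out.println(keyOrder(new LinkedHashMap<>(), argv));
        // prints [-Dmapred.map.tasks=10, -i, --tempDir]
    }
}
```

With the insertion-ordered variant, map iteration order and command-line order coincide, which is exactly the property the driver needs.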

On Wed, Dec 29, 2010 at 9:57 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Good point. I thought the logic was awkward, testing startsWith twice, so I
> went with the more direct solution.
>
> On 12/29/10 6:29 PM, Lance Norskog wrote:
>
>> The Tree Map and Set classes preserve the order of addition to the
>> Map/Set.
>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Good point. I thought the logic was awkward, testing startsWith twice, 
so I went with the more direct solution.

On 12/29/10 6:29 PM, Lance Norskog wrote:
> The Tree Map and Set classes preserve the order of addition to the Map/Set.
>
> On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman<je...@narus.com>  wrote:
>> The patch to MahoutDriver involves the code in the for loop at lines 203-216. If the arg.startsWith("-D") then the arg needs to be added to argsList at position 1, else at the end. I will commit a patch for this tonight as I have not got my Narus CLA signed yet.
>>
>> -----Original Message-----
>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>> Sent: Wednesday, December 29, 2010 11:46 AM
>> To: user@mahout.apache.org
>> Cc: dev@mahout.apache.org
>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>
>> ok, thank you, Jeff. Good to know. I actually expected to rely on this for a
>> wide range of issues (most common being task jvm parameters override).
>>
>> On Wed, Dec 29, 2010 at 11:29 AM, Jeff Eastman<je...@narus.com>  wrote:
>>
>>> I've found the problem: the MahoutDriver uses a Map to organize the command
>>> line arguments and this reorders them so that the -D arguments may not be
>>> first. This causes them to be treated as job-specific options, causing the
>>> failures. I'm working on a fix.
>>>
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Jeff Eastman [mailto:jeastman@narus.com]
>>> Sent: Tuesday, December 28, 2010 5:19 PM
>>> To: dev@mahout.apache.org
>>> Subject: RE: where i can set -Dmapred.map.tasks=X
>>>
>>> That's where I'm beginning to look too. It seems the driver code is working
>>> correctly (I thought I had tested that) but the CLI isn't.
>>>
>>> The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks
>>> didn't work either.
>>>
>>> -----Original Message-----
>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>> Sent: Tuesday, December 28, 2010 5:15 PM
>>> To: dev@mahout.apache.org
>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>
>>> Oh, so you are trying to set number of reduce tasks. i missed that,
>>> original
>>> post was about # of map tasks. sorry.
>>>
>>> No, no idea why that error pops up in mahout command line. i would need to
>>> dig into the mahout's cli code -- i don't think i dug that deep there
>>> before.
>>>
>>> On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman<je...@narus.com>  wrote:
>>>
>>>> It's very odd: when I run k-means from Eclipse and add
>>>> -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
>>>> job.getNumReduceTasks() is set correctly to 10. When I run the same
>>> command
>>>> line using bin/mahout, however, it fails with "Unexpected
>>>> -Dmapred.reduce.tasks=10 while processing Job-Specific Options".
>>>>
>>>> The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I
>>> ...
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>> Sent: Tuesday, December 28, 2010 4:55 PM
>>>> To: dev@mahout.apache.org
>>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>>
>>>> PPS it doesn't tell you what InputFileFormat actually uses for it as a
>>>> property, and i don't remember off the top of my head either. but i assume you
>>>> could use them with -D as well.
>>>>
>>>> On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov<dl...@gmail.com>
>>>> wrote:
>>>>
>>>>> In particular, QJob is one of the drivers that uses that, in the
>>>> following
>>>>> way:
>>>>>
>>>>> if (minSplitSize > 0)
>>>>>   SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
>>>>>
>>>>> Interesting peculiarity about that parameter is that in the current
>>> hadoop
>>>>> release for anything derived from InputFileFormat it ensures that all
>>>> splits
>>>>> are at least that big and the last split is at least 1.1 times that
>>> big.
>>>> I
>>>>> am not quite sure why special treatment for the last split but that's
>>> how
>>>> it
>>>>> goes there.
>>>>>
>>>>> -Dmitriy
>>>>>
>>>>>
>>>>> On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov<dlieu.7@gmail.com
>>>>> wrote:
>>>>>
>>>>>> Jeff,
>>>>>>
>>>>>> it's mahout-376 patch i don't think it is committed. the driver class
>>>>>> there is SSVDCli, for your convenience you can find it here :
>>>>>>
>>> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>>>>>> but like i said, i did not try to use it with -D option since i wanted
>>>> to
>>>>>> give an explicit option to increase split size if needed (and a help
>>> for
>>>>>> it). Another reason is that solver has a series of jobs and only those
>>>>>> reading the source matrix have anything to do with the split size.
>>>>>>
>>>>>>
>>>>>> -d
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman<je...@narus.com>
>>>> wrote:
>>>>>>> What's the driver class? If the -D parameters are working for you I
>>>> want
>>>>>>> to compare to the clustering drivers.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>>>>> Sent: Tuesday, December 28, 2010 4:37 PM
>>>>>>> To: dev@mahout.apache.org
>>>>>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>>>>>
>>>>>>> as far as i understand, this option is not forced. I suspect it
>>>> actually
>>>>>>> means 'minimum degree of parallelism'. so if you expect to use that
>>> to
>>>>>>> reduce number of mappers, i don't think this is expected to work so
>>>> much.
>>>>>>> The ones that do enforce anything are min split size and max split
>>> size
>>>> in
>>>>>>> file input so i guess you can try those. I rely on them (and open it
>>> up
>>>>>>> as a
>>>>>>> job-specific option) in stochastic svd.
>>>>>>>
>>>>>>> but usually forcing split size to increase creates a 'supersplits'
>>>>>>> problem,
>>>>>>> where a lot of data is moved around to just supply data to mappers.
>>>> which
>>>>>>> is
>>>>>>> perhaps why this option is meant to increase parallelism only, but
>>>>>>> probably
>>>>>>> not to decrease it.
>>>>>>>
>>>>>>> -d
>>>>>>>
>>>>>>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman<je...@narus.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This is supposed to be a generic option. You should be able to
>>>> specify
>>>>>>>> Hadoop options such as this on the command line invocation of your
>>>>>>> favorite
>>>>>>>> Mahout routine, but I'm having a similar problem setting
>>>>>>>> -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
>>>> and
>>>>>>>> without a space after the -D.
>>>>>>>>
>>>>>>>> Can someone point me to a Mahout command where this does work? Both
>>>>>>> drivers
>>>>>>>> extend AbstractJob and do the usual option processing pushups. I
>>>> don't
>>>>>>> have
>>>>>>>> Hadoop source locally so I can't debug the generic options parsing.
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: beneo_7 [mailto:beneo_7@163.com]
>>>>>>>> Sent: Monday, December 27, 2010 10:45 PM
>>>>>>>> To: dev@mahout.apache.org
>>>>>>>> Subject: where i can set -Dmapred.map.tasks=X
>>>>>>>>
>>>>>>>> i read in Mahout in Action that I should set -Dmapred.map.tasks=X
>>>>>>>> but it did not work for hadoop
>>>>>>>>
>>>>>>
>
>


Re: where i can set -Dmapred.map.tasks=X

Posted by Lance Norskog <go...@gmail.com>.
The TreeMap and TreeSet classes preserve the order of addition to the Map/Set.

On Wed, Dec 29, 2010 at 11:50 AM, Jeff Eastman <je...@narus.com> wrote:
> The patch to MahoutDriver involves the code in the for loop at lines 203-216. If the arg.startsWith("-D") then the arg needs to be added to argsList at position 1, else at the end. I will commit a patch for this tonight as I have not got my Narus CLA signed yet.
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Wednesday, December 29, 2010 11:46 AM
> To: user@mahout.apache.org
> Cc: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> ok, thank you, Jeff. Good to know. I actually expected to rely on this for a
> wide range of issues (most common being task jvm parameters override).
>
> On Wed, Dec 29, 2010 at 11:29 AM, Jeff Eastman <je...@narus.com> wrote:
>
>> I've found the problem: the MahoutDriver uses a Map to organize the command
>> line arguments and this reorders them so that the -D arguments may not be
>> first. This causes them to be treated as job-specific options, causing the
>> failures. I'm working on a fix.
>>
>> Jeff
>>
>> -----Original Message-----
>> From: Jeff Eastman [mailto:jeastman@narus.com]
>> Sent: Tuesday, December 28, 2010 5:19 PM
>> To: dev@mahout.apache.org
>> Subject: RE: where i can set -Dmapred.map.tasks=X
>>
>> That's where I'm beginning to look too. It seems the driver code is working
>> correctly (I thought I had tested that) but the CLI isn't.
>>
>> The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks
>> didn't work either.
>>
>> -----Original Message-----
>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>> Sent: Tuesday, December 28, 2010 5:15 PM
>> To: dev@mahout.apache.org
>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>
>> Oh, so you are trying to set number of reduce tasks. i missed that,
>> original
>> post was about # of map tasks. sorry.
>>
>> No, no idea why that error pops up in mahout command line. i would need to
>> dig into the mahout's cli code -- i don't think i dug that deep there
>> before.
>>
>> On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:
>>
>> > It's very odd: when I run k-means from Eclipse and add
>> > -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
>> > job.getNumReduceTasks() is set correctly to 10. When I run the same
>> command
>> > line using bin/mahout, however, it fails with "Unexpected
>> > -Dmapred.reduce.tasks=10 while processing Job-Specific Options".
>> >
>> > The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I
>> ...
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>> > Sent: Tuesday, December 28, 2010 4:55 PM
>> > To: dev@mahout.apache.org
>> > Subject: Re: where i can set -Dmapred.map.tasks=X
>> >
>> > PPS it doesn't tell you what InputFileFormat actually uses for it as a
>> > property, and i don't remember off the top of my head either. but i assume you
>> > could use them with -D as well.
>> >
>> > On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > wrote:
>> >
>> > > In particular, QJob is one of the drivers that uses that, in the
>> > following
>> > > way:
>> > >
>> > > if (minSplitSize > 0)
>> > >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
>> > >
>> > > Interesting peculiarity about that parameter is that in the current
>> hadoop
>> > > release for anything derived from InputFileFormat it ensures that all
>> > splits
>> > > are at least that big and the last split is at least 1.1 times that
>> big.
>> > I
>> > > am not quite sure why special treatment for the last split but that's
>> how
>> > it
>> > > goes there.
>> > >
>> > > -Dmitriy
>> > >
>> > >
>> > > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> > >wrote:
>> > >
>> > >> Jeff,
>> > >>
>> > >> it's mahout-376 patch i don't think it is committed. the driver class
>> > >> there is SSVDCli, for your convenience you can find it here :
>> > >>
>> >
>> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>> > >>
>> > >> but like i said, i did not try to use it with -D option since i wanted
>> > to
>> > >> give an explicit option to increase split size if needed (and a help
>> for
>> > >> it). Another reason is that solver has a series of jobs and only those
>> > >> reading the source matrix have anything to do with the split size.
>> > >>
>> > >>
>> > >> -d
>> > >>
>> > >>
>> > >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
>> > wrote:
>> > >>
>> > >>> What's the driver class? If the -D parameters are working for you I
>> > want
>> > >>> to compare to the clustering drivers.
>> > >>>
>> > >>> -----Original Message-----
>> > >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>> > >>> Sent: Tuesday, December 28, 2010 4:37 PM
>> > >>> To: dev@mahout.apache.org
>> > >>> Subject: Re: where i can set -Dmapred.map.tasks=X
>> > >>>
>> > >>> as far as i understand, this option is not forced. I suspect it
>> > actually
>> > >>> means 'minimum degree of parallelism'. so if you expect to use that
>> to
>> > >>> reduce number of mappers, i don't think this is expected to work so
>> > much.
>> > >>> The ones that do enforce anything are min split size and max split
>> size
>> > in
>> > >>> file input so i guess you can try those. I rely on them (and open it
>> up
>> > >>> as a
>> > >>> job-specific option) in stochastic svd.
>> > >>>
>> > >>> but usually forcing split size to increase creates a 'supersplits'
>> > >>> problem,
>> > >>> where a lot of data is moved around to just supply data to mappers.
>> > which
>> > >>> is
>> > >>> perhaps why this option is meant to increase parallelism only, but
>> > >>> probably
>> > >>> not to decrease it.
>> > >>>
>> > >>> -d
>> > >>>
>> > >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
>> > >>> wrote:
>> > >>>
>> > >>> > This is supposed to be a generic option. You should be able to
>> > specify
>> > >>> > Hadoop options such as this on the command line invocation of your
>> > >>> favorite
>> > >>> > Mahout routine, but I'm having a similar problem setting
>> > >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
>> > and
>> > >>> > without a space after the -D.
>> > >>> >
>> > >>> > Can someone point me to a Mahout command where this does work? Both
>> > >>> drivers
>> > >>> > extend AbstractJob and do the usual option processing pushups. I
>> > don't
>> > >>> have
>> > >>> > Hadoop source locally so I can't debug the generic options parsing.
>> > >>> >
>> > >>> > -----Original Message-----
>> > >>> > From: beneo_7 [mailto:beneo_7@163.com]
>> > >>> > Sent: Monday, December 27, 2010 10:45 PM
>> > >>> > To: dev@mahout.apache.org
>> > >>> > Subject: where i can set -Dmapred.map.tasks=X
>> > >>> >
>> > >>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
>> > >>> > but it did not work for hadoop
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com
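One caveat on the note above: in the JDK, TreeMap and TreeSet order their entries by key comparison (natural ordering or a supplied Comparator), while LinkedHashMap and LinkedHashSet are the classes that preserve insertion order. A quick self-contained check:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Compares key ordering: TreeMap sorts by key, while LinkedHashMap keeps the
// order in which keys were put().
public class MapOrderDemo {

  public static List<String> keyOrder(Map<String, String> map, String... keys) {
    for (String key : keys) {
      map.put(key, "");
    }
    return new ArrayList<>(map.keySet());
  }

  public static void main(String[] args) {
    // Insertion order: a job-specific flag first, then a generic -D option.
    String[] keys = {"-i", "-Dmapred.reduce.tasks=10"};
    System.out.println("TreeMap:       " + keyOrder(new TreeMap<>(), keys));
    System.out.println("LinkedHashMap: " + keyOrder(new LinkedHashMap<>(), keys));
  }
}
```

With the keys above, the TreeMap happens to sort the -D option first ('D' sorts before 'i'), but that is an accident of spelling; only an insertion-ordered collection keeps command-line arguments in the order the user typed them.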

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
The patch to MahoutDriver involves the code in the for loop at lines 203-216. If the arg.startsWith("-D") then the arg needs to be added to argsList at position 1, else at the end. I will commit a patch for this tonight as I have not got my Narus CLA signed yet.

-----Original Message-----
From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com] 
Sent: Wednesday, December 29, 2010 11:46 AM
To: user@mahout.apache.org
Cc: dev@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

ok, thank you, Jeff. Good to know. I actually expected to rely on this for a
wide range of issues (most common being task jvm parameters override).

On Wed, Dec 29, 2010 at 11:29 AM, Jeff Eastman <je...@narus.com> wrote:

> I've found the problem: the MahoutDriver uses a Map to organize the command
> line arguments and this reorders them so that the -D arguments may not be
> first. This causes them to be treated as job-specific options, causing the
> failures. I'm working on a fix.
>
> Jeff
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@narus.com]
> Sent: Tuesday, December 28, 2010 5:19 PM
> To: dev@mahout.apache.org
> Subject: RE: where i can set -Dmapred.map.tasks=X
>
> That's where I'm beginning to look too. It seems the driver code is working
> correctly (I thought I had tested that) but the CLI isn't.
>
> The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks
> didn't work either.
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 5:15 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> Oh, so you are trying to set number of reduce tasks. i missed that,
> original
> post was about # of map tasks. sorry.
>
> No, no idea why that error pops up in mahout command line. i would need to
> dig into the mahout's cli code -- i don't think i dug that deep there
> before.
>
> On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > It's very odd: when I run k-means from Eclipse and add
> > -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> > job.getNumReduceTasks() is set correctly to 10. When I run the same
> command
> > line using bin/mahout, however, it fails with "Unexpected
> > -Dmapred.reduce.tasks=10 while processing Job-Specific Options".
> >
> > The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I
> ...
> >
> >
> >
> > -----Original Message-----
> > From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> > Sent: Tuesday, December 28, 2010 4:55 PM
> > To: dev@mahout.apache.org
> > Subject: Re: where i can set -Dmapred.map.tasks=X
> >
> > PPS it doesn't tell you what InputFileFormat actually uses for it as a
> > property, and i don't remember off the top of my head either. but i assume you
> > could use them with -D as well.
> >
> > On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > In particular, QJob is one of the drivers that uses that, in the
> > following
> > > way:
> > >
> > > if (minSplitSize > 0)
> > >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> > >
> > > Interesting peculiarity about that parameter is that in the current
> hadoop
> > > release for anything derived from InputFileFormat it ensures that all
> > splits
> > > are at least that big and the last split is at least 1.1 times that
> big.
> > I
> > > am not quite sure why special treatment for the last split but that's
> how
> > it
> > > goes there.
> > >
> > > -Dmitriy
> > >
> > >
> > > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >wrote:
> > >
> > >> Jeff,
> > >>
> > >> it's mahout-376 patch i don't think it is committed. the driver class
> > >> there is SSVDCli, for your convenience you can find it here :
> > >>
> >
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> > >>
> > >> but like i said, i did not try to use it with -D option since i wanted
> > to
> > >> give an explicit option to increase split size if needed (and a help
> for
> > >> it). Another reason is that solver has a series of jobs and only those
> > >> reading the source matrix have anything to do with the split size.
> > >>
> > >>
> > >> -d
> > >>
> > >>
> > >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
> > wrote:
> > >>
> > >>> What's the driver class? If the -D parameters are working for you I
> > want
> > >>> to compare to the clustering drivers.
> > >>>
> > >>> -----Original Message-----
> > >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> > >>> Sent: Tuesday, December 28, 2010 4:37 PM
> > >>> To: dev@mahout.apache.org
> > >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> > >>>
> > >>> as far as i understand, this option is not forced. I suspect it
> > actually
> > >>> means 'minimum degree of parallelism'. so if you expect to use that
> to
> > >>> reduce number of mappers, i don't think this is expected to work so
> > much.
> > >>> The ones that do enforce anything are min split size and max split
> size
> > in
> > >>> file input so i guess you can try those. I rely on them (and open it
> up
> > >>> as a
> > >>> job-specific option) in stochastic svd.
> > >>>
> > >>> but usually forcing split size to increase creates a 'supersplits'
> > >>> problem,
> > >>> where a lot of data is moved around to just supply data to mappers.
> > which
> > >>> is
> > >>> perhaps why this option is meant to increase parallelism only, but
> > >>> probably
> > >>> not to decrease it.
> > >>>
> > >>> -d
> > >>>
> > >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
> > >>> wrote:
> > >>>
> > >>> > This is supposed to be a generic option. You should be able to
> > specify
> > >>> > Hadoop options such as this on the command line invocation of your
> > >>> favorite
> > >>> > Mahout routine, but I'm having a similar problem setting
> > >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> > and
> > >>> > without a space after the -D.
> > >>> >
> > >>> > Can someone point me to a Mahout command where this does work? Both
> > >>> drivers
> > >>> > extend AbstractJob and do the usual option processing pushups. I
> > don't
> > >>> have
> > >>> > Hadoop source locally so I can't debug the generic options parsing.
> > >>> >
> > >>> > -----Original Message-----
> > >>> > From: beneo_7 [mailto:beneo_7@163.com]
> > >>> > Sent: Monday, December 27, 2010 10:45 PM
> > >>> > To: dev@mahout.apache.org
> > >>> > Subject: where i can set -Dmapred.map.tasks=X
> > >>> >
> > >>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
> > >>> > but it did not work for hadoop
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
ok, thank you, Jeff. Good to know. I actually expected to rely on this for a
wide range of issues (most common being task jvm parameters override).

On Wed, Dec 29, 2010 at 11:29 AM, Jeff Eastman <je...@narus.com> wrote:

> I've found the problem: the MahoutDriver uses a Map to organize the command
> line arguments and this reorders them so that the -D arguments may not be
> first. This causes them to be treated as job-specific options, causing the
> failures. I'm working on a fix.
>
> Jeff
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@narus.com]
> Sent: Tuesday, December 28, 2010 5:19 PM
> To: dev@mahout.apache.org
> Subject: RE: where i can set -Dmapred.map.tasks=X
>
> That's where I'm beginning to look too. It seems the driver code is working
> correctly (I thought I had tested that) but the CLI isn't.
>
> The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks
> didn't work either.
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 5:15 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> Oh, so you are trying to set number of reduce tasks. i missed that,
> original
> post was about # of map tasks. sorry.
>
> No, no idea why that error pops up in mahout command line. i would need to
> dig into the mahout's cli code -- i don't think i dug that deep there
> before.
>
> On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > It's very odd: when I run k-means from Eclipse and add
> > -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> > job.getNumReduceTasks() is set correctly to 10. When I run the same
> command
> > line using bin/mahout, however, it fails with "Unexpected
> > -Dmapred.reduce.tasks=10 while processing Job-Specific Options".
> >
> > The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I
> ...
> >
> >
> >
> > -----Original Message-----
> > From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> > Sent: Tuesday, December 28, 2010 4:55 PM
> > To: dev@mahout.apache.org
> > Subject: Re: where i can set -Dmapred.map.tasks=X
> >
> > PPS it doesn't tell you what InputFileFormat actually uses for it as a
> > property, and i don't remember off the top of my head either. but i assume you
> > could use them with -D as well.
> >
> > On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > In particular, QJob is one of the drivers that uses that, in the
> > following
> > > way:
> > >
> > > if (minSplitSize > 0)
> > >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> > >
> > > Interesting peculiarity about that parameter is that in the current
> hadoop
> > > release for anything derived from InputFileFormat it ensures that all
> > splits
> > > are at least that big and the last split is at least 1.1 times that
> big.
> > I
> > > am not quite sure why special treatment for the last split but that's
> how
> > it
> > > goes there.
> > >
> > > -Dmitriy
> > >
> > >
> > > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >wrote:
> > >
> > >> Jeff,
> > >>
> > >> it's mahout-376 patch i don't think it is committed. the driver class
> > >> there is SSVDCli, for your convenience you can find it here :
> > >>
> >
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> > >>
> > >> but like i said, i did not try to use it with -D option since i wanted
> > to
> > >> give an explicit option to increase split size if needed (and a help
> for
> > >> it). Another reason is that solver has a series of jobs and only those
> > >> reading the source matrix have anything to do with the split size.
> > >>
> > >>
> > >> -d
> > >>
> > >>
> > >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
> > wrote:
> > >>
> > >>> What's the driver class? If the -D parameters are working for you I
> > want
> > >>> to compare to the clustering drivers.
> > >>>
> > >>> -----Original Message-----
> > >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> > >>> Sent: Tuesday, December 28, 2010 4:37 PM
> > >>> To: dev@mahout.apache.org
> > >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> > >>>
> > >>> as far as i understand, this option is not forced. I suspect it
> > actually
> > >>> means 'minimum degree of parallelism'. so if you expect to use that
> to
> > >>> reduce number of mappers, i don't think this is expected to work so
> > much.
> > >>> The ones that do enforce anything are min split size and max split
> size
> > in
> > >>> file input so i guess you can try those. I rely on them (and open it
> up
> > >>> as a
> > >>> job-specific option) in stochastic svd.
> > >>>
> > >>> but usually forcing split size to increase creates a 'supersplits'
> > >>> problem,
> > >>> where a lot of data is moved around to just supply data to mappers.
> > which
> > >>> is
> > >>> perhaps why this option is meant to increase parallelism only, but
> > >>> probably
> > >>> not to decrease it.
> > >>>
> > >>> -d
> > >>>
> > >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
> > >>> wrote:
> > >>>
> > >>> > This is supposed to be a generic option. You should be able to
> > specify
> > >>> > Hadoop options such as this on the command line invocation of your
> > >>> favorite
> > >>> > Mahout routine, but I'm having a similar problem setting
> > >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> > and
> > >>> > without a space after the -D.
> > >>> >
> > >>> > Can someone point me to a Mahout command where this does work? Both
> > >>> drivers
> > >>> > extend AbstractJob and do the usual option processing pushups. I
> > don't
> > >>> have
> > >>> > Hadoop source locally so I can't debug the generic options parsing.
> > >>> >
> > >>> > -----Original Message-----
> > >>> > From: beneo_7 [mailto:beneo_7@163.com]
> > >>> > Sent: Monday, December 27, 2010 10:45 PM
> > >>> > To: dev@mahout.apache.org
> > >>> > Subject: where i can set -Dmapred.map.tasks=X
> > >>> >
> > >>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
> > >>> > but it did not work for hadoop
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
ok, thank you, Jeff. Good to know. I actually expected to rely on this for a
wide range of issues (most common being task jvm parameters override).

On Wed, Dec 29, 2010 at 11:29 AM, Jeff Eastman <je...@narus.com> wrote:

> I've found the problem: the MahoutDriver uses a Map to organize the command
> line arguments and this reorders them so that the -D arguments may not be
> first. This causes them to be treated as job-specific options, causing the
> failures. I'm working on a fix.
>
> Jeff
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@narus.com]
> Sent: Tuesday, December 28, 2010 5:19 PM
> To: dev@mahout.apache.org
> Subject: RE: where i can set -Dmapred.map.tasks=X
>
> That's where I'm beginning to look too. It seems the driver code is working
> correctly (I thought I had tested that) but the CLI isn't.
>
> The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks
> didn't work either.
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 5:15 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> Oh, so you are trying to set number of reduce tasks. i missed that,
> original
> post was about # of map tasks. sorry.
>
> No, no idea why that error pops up in mahout command line. i would need to
> dig into the mahout's cli code -- i don't think i dug that deep there
> before.
>
> On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > It's very odd: when I run k-means from Eclipse and add
> > -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> > job.getNumReduceTasks() is set correctly to 10. When I run the same
> command
> > line using bin/mahout, however, it fails with "Unexpected
> > -Dmapred.reduce.tasks=10 while processing Job-Specific Options".
> >
> > The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I
> ...
> >
> >
> >
> > -----Original Message-----
> > From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> > Sent: Tuesday, December 28, 2010 4:55 PM
> > To: dev@mahout.apache.org
> > Subject: Re: where i can set -Dmapred.map.tasks=X
> >
> > PPS it doesn't tell you what InputFileFormat actually uses for it as a
> > property, and i don't remember off the top of my head either. but i assume you
> > could use them with -D as well.
> >
> > On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > In particular, QJob is one of the drivers that uses that, in the
> > following
> > > way:
> > >
> > > if (minSplitSize > 0)
> > >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> > >
> > > Interesting peculiarity about that parameter is that in the current
> hadoop
> > > release for anything derived from InputFileFormat it ensures that all
> > splits
> > > are at least that big and the last split is at least 1.1 times that
> big.
> > I
> > > am not quite sure why special treatment for the last split but that's
> how
> > it
> > > goes there.
> > >
> > > -Dmitriy
> > >
> > >
> > > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >wrote:
> > >
> > >> Jeff,
> > >>
> > >> it's mahout-376 patch i don't think it is committed. the driver class
> > >> there is SSVDCli, for your convenience you can find it here :
> > >>
> >
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> > >>
> > >> but like i said, i did not try to use it with -D option since i wanted
> > to
> > >> give an explicit option to increase split size if needed (and a help
> for
> > >> it). Another reason is that solver has a series of jobs and only those
> > >> reading the source matrix have anything to do with the split size.
> > >>
> > >>
> > >> -d
> > >>
> > >>
> > >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
> > wrote:
> > >>
> > >>> What's the driver class? If the -D parameters are working for you I
> > want
> > >>> to compare to the clustering drivers.
> > >>>
> > >>> -----Original Message-----
> > >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> > >>> Sent: Tuesday, December 28, 2010 4:37 PM
> > >>> To: dev@mahout.apache.org
> > >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> > >>>
> > >>> as far as i understand, this option is not forced. I suspect it
> > actually
> > >>> means 'minimum degree of parallelism'. so if you expect to use that
> to
> > >>> reduce number of mappers, i don't think this is expected to work so
> > much.
> > >>> The ones that do enforce anything are min split size and max split
> size
> > in
> > >>> file input so i guess you can try those. I rely on them (and open it
> up
> > >>> as a
> > >>> job-specific option) in stochastic svd.
> > >>>
> > >>> but usually forcing split size to increase creates a 'supersplits'
> > >>> problem,
> > >>> where a lot of data is moved around to just supply data to mappers.
> > which
> > >>> is
> > >>> perhaps why this option is meant to increase parallelism only, but
> > >>> probably
> > >>> not to decrease it.
> > >>>
> > >>> -d
> > >>>
> > >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
> > >>> wrote:
> > >>>
> > >>> > This is supposed to be a generic option. You should be able to
> > specify
> > >>> > Hadoop options such as this on the command line invocation of your
> > >>> favorite
> > >>> > Mahout routine, but I'm having a similar problem setting
> > >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> > and
> > >>> > without a space after the -D.
> > >>> >
> > >>> > Can someone point me to a Mahout command where this does work? Both
> > >>> drivers
> > >>> > extend AbstractJob and do the usual option processing pushups. I
> > don't
> > >>> have
> > >>> > Hadoop source locally so I can't debug the generic options parsing.
> > >>> >
> > >>> > -----Original Message-----
> > >>> > From: beneo_7 [mailto:beneo_7@163.com]
> > >>> > Sent: Monday, December 27, 2010 10:45 PM
> > >>> > To: dev@mahout.apache.org
> > >>> > Subject: where i can set -Dmapred.map.tasks=X
> > >>> >
> > >>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
> > >>> > but it did not work for hadoop
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
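Background on the generic-option handling discussed in this thread: when a driver runs through Hadoop's ToolRunner, a GenericOptionsParser first pulls leading generic options such as -Dkey=value into the Configuration, and only the remaining arguments reach the job's own option parser; that is why a -D argument buried among job options triggers the "Unexpected" error. A simplified, standalone imitation of that splitting (an illustration, not Hadoop's actual implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone imitation of splitting -Dkey=value generic options away from
// job-specific arguments (in real Hadoop, GenericOptionsParser does this and
// stores the properties in the job Configuration).
public class GenericOptionSplit {

  public static Map<String, String> extractProps(List<String> args, List<String> remaining) {
    Map<String, String> props = new LinkedHashMap<>();
    for (String arg : args) {
      if (arg.startsWith("-D") && arg.indexOf('=') > 2) {
        String kv = arg.substring(2);              // strip the "-D"
        int eq = kv.indexOf('=');
        props.put(kv.substring(0, eq), kv.substring(eq + 1));
      } else {
        remaining.add(arg);                        // left for the job's own parser
      }
    }
    return props;
  }

  public static void main(String[] args) {
    List<String> rest = new ArrayList<>();
    Map<String, String> props = extractProps(
        Arrays.asList("-Dmapred.reduce.tasks=10", "-i", "input"), rest);
    System.out.println(props + " / remaining: " + rest);
  }
}
```

The real parser is stricter: generic options must appear before job-specific ones, and it also accepts a space between -D and the property. The simplification here is only meant to show the division of labor.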

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
I've found the problem: the MahoutDriver uses a Map to organize the command-line arguments, which reorders them so that the -D arguments may no longer come first. They then get treated as job-specific options, causing the failures. I'm working on a fix.

Jeff
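
The fix idea above can be sketched without any Hadoop or Mahout dependencies. This is a hypothetical illustration (class and method names are made up, not the actual MahoutDriver patch): partition the argument list so that generic "-Dkey=value" options come first, since Hadoop's GenericOptionsParser only honors them when they precede job-specific options.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fix: hoist generic "-D" options to the front
// of the argument list before handing the args to the job, so the generic
// options parser sees them before any job-specific options.
public class ArgOrderFix {

    public static String[] hoistGenericOptions(String[] args) {
        List<String> generic = new ArrayList<String>();
        List<String> jobSpecific = new ArrayList<String>();
        for (String arg : args) {
            if (arg.startsWith("-D")) {
                generic.add(arg);
            } else {
                jobSpecific.add(arg);
            }
        }
        generic.addAll(jobSpecific);
        return generic.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] fixed = hoistGenericOptions(new String[] {
            "-i", "input", "-Dmapred.reduce.tasks=10", "-o", "output"});
        System.out.println(String.join(" ", fixed));
        // -Dmapred.reduce.tasks=10 -i input -o output
    }
}
```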

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@narus.com] 
Sent: Tuesday, December 28, 2010 5:19 PM
To: dev@mahout.apache.org
Subject: RE: where i can set -Dmapred.map.tasks=X

That's where I'm beginning to look too. It seems the driver code is working correctly (I thought I had tested that) but the CLI isn't.

The original post was for -Dmapred.map.tasks but I noticed the reduce.tasks didn't work either.

-----Original Message-----
From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com] 
Sent: Tuesday, December 28, 2010 5:15 PM
To: dev@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

Oh, so you are trying to set number of reduce tasks. i missed that, original
post was about # of map tasks. sorry.

No, no idea why that error pops up in mahout command line. i would need to
dig into the mahout's cli code -- i don't thing i dug that deep there
before.

On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:

> It's very odd: when I run k-means from Eclipse and add
> -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> job.getNumReduceTasks() is set correctly to 10. When I run the same command
> line using bin/mahout; however, it fails:  with "Unexpected
> -Dmapred.reduce.tasks=10 while processing Job-Specific Options.
>
> The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks-10 -I ...
>
>
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 4:55 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> PPS it doesn't tell you what InputFileFormat actually uses for it as a
> property, and i don't remember on top of my head either. but i assume you
> could use them with -D as well.
>
> On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > In particular, QJob is one of the drivers that uses that , in the
> following
> > way:
> >
> > f ( minSplitSize>0)
> >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> >
> > Interestng pecularity about that parameter is that in the current hadoop
> > release for anything derived from InputFileFormat it ensures that all
> splits
> > are at least that big and the last split is at least times 1.1  that big.
> I
> > am not quite sure why special treatment for the last split but that's how
> it
> > goes there.
> >
> > -Dmitriy
> >
> >
> > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >
> >> Jeff,
> >>
> >> it's mahout-376 patch i don't think it is committed. the driver class
> >> there is SSVDCli, for your convenience you can find it here :
> >>
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> >>
> >> but like i said, i did not try to use it with -D option since i wanted
> to
> >> give an explicit option to increase split size if needed (and a help for
> >> it). Another reason is that solver has a series of jobs and only those
> >> reading the source matrix have anything to do with the split size.
> >>
> >>
> >> -d
> >>
> >>
> >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >>
> >>> What's the driver class? If the -D parameters are working for you I
> want
> >>> to compare to the clustering drovers
> >>>
> >>> -----Original Message-----
> >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> >>> Sent: Tuesday, December 28, 2010 4:37 PM
> >>> To: dev@mahout.apache.org
> >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> >>>
> >>> as far as i understand, this option is not forced. I suspect it
> actually
> >>> means 'minimum degree of parallelism'. so if you expect to use that to
> >>> reduce number of mappers, i don't think this is expected to work so
> much.
> >>> The one that do enforce anything are min split size and max split size
> in
> >>> file input so i guess you can try those. I rely on them (and open it up
> >>> as a
> >>> job-specific option) in stochastic svd.
> >>>
> >>> but usually forcing split size to increase creates a 'superslits'
> >>> problem,
> >>> where a lot of data is moved around to just supply data to mappers.
> which
> >>> is
> >>> perhaps why this option is meant to increase parallelism only, but
> >>> probably
> >>> not to decrease it.
> >>>
> >>> -d
> >>>
> >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
> >>> wrote:
> >>>
> >>> > This is supposed to be a generic option. You should be able to
> specify
> >>> > Hadoop options such as this on the command line invocation of your
> >>> favorite
> >>> > Mahout routine, but I'm having a similar problem setting
> >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> and
> >>> > without a space after the -D.
> >>> >
> >>> > Can someone point me to a Mahout command where this does work? Both
> >>> drivers
> >>> > extend AbstractJob and do the usual option processing pushups. I
> don't
> >>> have
> >>> > Hadoop source locally so I can't debug the generic options parsing.
> >>> >
> >>> > -----Original Message-----
> >>> > From: beneo_7 [mailto:beneo_7@163.com]
> >>> > Sent: Monday, December 27, 2010 10:45 PM
> >>> > To: dev@mahout.apache.org
> >>> > Subject: where i can set -Dmapred.map.tasks=X
> >>> >
> >>> > i read onMahout in Action that I should set -Dmapred.map.tasks=X
> >>> > but it did not work for hadoop
> >>> >
> >>>
> >>
> >>
> >
>

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
That's where I'm beginning to look too. It seems the driver code is working correctly (I thought I had tested that) but the CLI isn't.

The original post was about -Dmapred.map.tasks, but I noticed that mapred.reduce.tasks didn't work either.

-----Original Message-----
From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com] 
Sent: Tuesday, December 28, 2010 5:15 PM
To: dev@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

Oh, so you are trying to set number of reduce tasks. i missed that, original
post was about # of map tasks. sorry.

No, no idea why that error pops up in mahout command line. i would need to
dig into the mahout's cli code -- i don't thing i dug that deep there
before.

On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:

> It's very odd: when I run k-means from Eclipse and add
> -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> job.getNumReduceTasks() is set correctly to 10. When I run the same command
> line using bin/mahout; however, it fails:  with "Unexpected
> -Dmapred.reduce.tasks=10 while processing Job-Specific Options.
>
> The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks-10 -I ...
>
>
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 4:55 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> PPS it doesn't tell you what InputFileFormat actually uses for it as a
> property, and i don't remember on top of my head either. but i assume you
> could use them with -D as well.
>
> On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > In particular, QJob is one of the drivers that uses that , in the
> following
> > way:
> >
> > f ( minSplitSize>0)
> >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> >
> > Interestng pecularity about that parameter is that in the current hadoop
> > release for anything derived from InputFileFormat it ensures that all
> splits
> > are at least that big and the last split is at least times 1.1  that big.
> I
> > am not quite sure why special treatment for the last split but that's how
> it
> > goes there.
> >
> > -Dmitriy
> >
> >
> > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >
> >> Jeff,
> >>
> >> it's mahout-376 patch i don't think it is committed. the driver class
> >> there is SSVDCli, for your convenience you can find it here :
> >>
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> >>
> >> but like i said, i did not try to use it with -D option since i wanted
> to
> >> give an explicit option to increase split size if needed (and a help for
> >> it). Another reason is that solver has a series of jobs and only those
> >> reading the source matrix have anything to do with the split size.
> >>
> >>
> >> -d
> >>
> >>
> >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >>
> >>> What's the driver class? If the -D parameters are working for you I
> want
> >>> to compare to the clustering drovers
> >>>
> >>> -----Original Message-----
> >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> >>> Sent: Tuesday, December 28, 2010 4:37 PM
> >>> To: dev@mahout.apache.org
> >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> >>>
> >>> as far as i understand, this option is not forced. I suspect it
> actually
> >>> means 'minimum degree of parallelism'. so if you expect to use that to
> >>> reduce number of mappers, i don't think this is expected to work so
> much.
> >>> The one that do enforce anything are min split size and max split size
> in
> >>> file input so i guess you can try those. I rely on them (and open it up
> >>> as a
> >>> job-specific option) in stochastic svd.
> >>>
> >>> but usually forcing split size to increase creates a 'superslits'
> >>> problem,
> >>> where a lot of data is moved around to just supply data to mappers.
> which
> >>> is
> >>> perhaps why this option is meant to increase parallelism only, but
> >>> probably
> >>> not to decrease it.
> >>>
> >>> -d
> >>>
> >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
> >>> wrote:
> >>>
> >>> > This is supposed to be a generic option. You should be able to
> specify
> >>> > Hadoop options such as this on the command line invocation of your
> >>> favorite
> >>> > Mahout routine, but I'm having a similar problem setting
> >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> and
> >>> > without a space after the -D.
> >>> >
> >>> > Can someone point me to a Mahout command where this does work? Both
> >>> drivers
> >>> > extend AbstractJob and do the usual option processing pushups. I
> don't
> >>> have
> >>> > Hadoop source locally so I can't debug the generic options parsing.
> >>> >
> >>> > -----Original Message-----
> >>> > From: beneo_7 [mailto:beneo_7@163.com]
> >>> > Sent: Monday, December 27, 2010 10:45 PM
> >>> > To: dev@mahout.apache.org
> >>> > Subject: where i can set -Dmapred.map.tasks=X
> >>> >
> >>> > i read onMahout in Action that I should set -Dmapred.map.tasks=X
> >>> > but it did not work for hadoop
> >>> >
> >>>
> >>
> >>
> >
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Oh, so you are trying to set the number of reduce tasks. I missed that; the original
post was about the number of map tasks. Sorry.

No, no idea why that error pops up in the Mahout command line. I would need to
dig into Mahout's CLI code -- I don't think I dug that deep there
before.

On Tue, Dec 28, 2010 at 5:06 PM, Jeff Eastman <je...@narus.com> wrote:

> It's very odd: when I run k-means from Eclipse and add
> -Dmapred.reduce.tasks=10 as the first argument the driver loves it and
> job.getNumReduceTasks() is set correctly to 10. When I run the same command
> line using bin/mahout; however, it fails:  with "Unexpected
> -Dmapred.reduce.tasks=10 while processing Job-Specific Options.
>
> The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks-10 -I ...
>
>
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 4:55 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> PPS it doesn't tell you what InputFileFormat actually uses for it as a
> property, and i don't remember on top of my head either. but i assume you
> could use them with -D as well.
>
> On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > In particular, QJob is one of the drivers that uses that , in the
> following
> > way:
> >
> > f ( minSplitSize>0)
> >  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
> >
> > Interestng pecularity about that parameter is that in the current hadoop
> > release for anything derived from InputFileFormat it ensures that all
> splits
> > are at least that big and the last split is at least times 1.1  that big.
> I
> > am not quite sure why special treatment for the last split but that's how
> it
> > goes there.
> >
> > -Dmitriy
> >
> >
> > On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
> >
> >> Jeff,
> >>
> >> it's mahout-376 patch i don't think it is committed. the driver class
> >> there is SSVDCli, for your convenience you can find it here :
> >>
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
> >>
> >> but like i said, i did not try to use it with -D option since i wanted
> to
> >> give an explicit option to increase split size if needed (and a help for
> >> it). Another reason is that solver has a series of jobs and only those
> >> reading the source matrix have anything to do with the split size.
> >>
> >>
> >> -d
> >>
> >>
> >> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com>
> wrote:
> >>
> >>> What's the driver class? If the -D parameters are working for you I
> want
> >>> to compare to the clustering drovers
> >>>
> >>> -----Original Message-----
> >>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> >>> Sent: Tuesday, December 28, 2010 4:37 PM
> >>> To: dev@mahout.apache.org
> >>> Subject: Re: where i can set -Dmapred.map.tasks=X
> >>>
> >>> as far as i understand, this option is not forced. I suspect it
> actually
> >>> means 'minimum degree of parallelism'. so if you expect to use that to
> >>> reduce number of mappers, i don't think this is expected to work so
> much.
> >>> The one that do enforce anything are min split size and max split size
> in
> >>> file input so i guess you can try those. I rely on them (and open it up
> >>> as a
> >>> job-specific option) in stochastic svd.
> >>>
> >>> but usually forcing split size to increase creates a 'superslits'
> >>> problem,
> >>> where a lot of data is moved around to just supply data to mappers.
> which
> >>> is
> >>> perhaps why this option is meant to increase parallelism only, but
> >>> probably
> >>> not to decrease it.
> >>>
> >>> -d
> >>>
> >>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
> >>> wrote:
> >>>
> >>> > This is supposed to be a generic option. You should be able to
> specify
> >>> > Hadoop options such as this on the command line invocation of your
> >>> favorite
> >>> > Mahout routine, but I'm having a similar problem setting
> >>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
> and
> >>> > without a space after the -D.
> >>> >
> >>> > Can someone point me to a Mahout command where this does work? Both
> >>> drivers
> >>> > extend AbstractJob and do the usual option processing pushups. I
> don't
> >>> have
> >>> > Hadoop source locally so I can't debug the generic options parsing.
> >>> >
> >>> > -----Original Message-----
> >>> > From: beneo_7 [mailto:beneo_7@163.com]
> >>> > Sent: Monday, December 27, 2010 10:45 PM
> >>> > To: dev@mahout.apache.org
> >>> > Subject: where i can set -Dmapred.map.tasks=X
> >>> >
> >>> > i read onMahout in Action that I should set -Dmapred.map.tasks=X
> >>> > but it did not work for hadoop
> >>> >
> >>>
> >>
> >>
> >
>

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
It's very odd: when I run k-means from Eclipse and add -Dmapred.reduce.tasks=10 as the first argument, the driver loves it and job.getNumReduceTasks() is correctly set to 10. When I run the same command line using bin/mahout, however, it fails with "Unexpected -Dmapred.reduce.tasks=10 while processing Job-Specific Options".

The CLI invocation is: ./bin/mahout kmeans -Dmapred.reduce.tasks=10 -I ...
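
For context on why position matters: ToolRunner.run() routes the args through GenericOptionsParser, which consumes each leading -Dkey=value pair into the Configuration before the tool sees the rest. The following is a simplified, Hadoop-free sketch of that split (an illustration of the behavior, not the real parser):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of what GenericOptionsParser does with -D
// arguments: each -Dkey=value is absorbed into configuration properties,
// and everything else is left over for the tool's own option parsing.
public class DefineSplit {

    public static Map<String, String> extractDefines(String[] args, List<String> remaining) {
        Map<String, String> props = new LinkedHashMap<String, String>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("-D") && eq > 2) {
                // strip the "-D" prefix, split on the first '='
                props.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                remaining.add(arg);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        List<String> rest = new ArrayList<String>();
        Map<String, String> props = extractDefines(
            new String[] {"-Dmapred.reduce.tasks=10", "-i", "input"}, rest);
        System.out.println(props);  // {mapred.reduce.tasks=10}
        System.out.println(rest);   // [-i, input]
    }
}
```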



-----Original Message-----
From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com] 
Sent: Tuesday, December 28, 2010 4:55 PM
To: dev@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

PPS it doesn't tell you what InputFileFormat actually uses for it as a
property, and i don't remember on top of my head either. but i assume you
could use them with -D as well.

On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> In particular, QJob is one of the drivers that uses that , in the following
> way:
>
> f ( minSplitSize>0)
>  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
>
> Interestng pecularity about that parameter is that in the current hadoop
> release for anything derived from InputFileFormat it ensures that all splits
> are at least that big and the last split is at least times 1.1  that big. I
> am not quite sure why special treatment for the last split but that's how it
> goes there.
>
> -Dmitriy
>
>
> On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> Jeff,
>>
>> it's mahout-376 patch i don't think it is committed. the driver class
>> there is SSVDCli, for your convenience you can find it here :
>> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>>
>> but like i said, i did not try to use it with -D option since i wanted to
>> give an explicit option to increase split size if needed (and a help for
>> it). Another reason is that solver has a series of jobs and only those
>> reading the source matrix have anything to do with the split size.
>>
>>
>> -d
>>
>>
>> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com> wrote:
>>
>>> What's the driver class? If the -D parameters are working for you I want
>>> to compare to the clustering drovers
>>>
>>> -----Original Message-----
>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>> Sent: Tuesday, December 28, 2010 4:37 PM
>>> To: dev@mahout.apache.org
>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>
>>> as far as i understand, this option is not forced. I suspect it actually
>>> means 'minimum degree of parallelism'. so if you expect to use that to
>>> reduce number of mappers, i don't think this is expected to work so much.
>>> The one that do enforce anything are min split size and max split size in
>>> file input so i guess you can try those. I rely on them (and open it up
>>> as a
>>> job-specific option) in stochastic svd.
>>>
>>> but usually forcing split size to increase creates a 'superslits'
>>> problem,
>>> where a lot of data is moved around to just supply data to mappers. which
>>> is
>>> perhaps why this option is meant to increase parallelism only, but
>>> probably
>>> not to decrease it.
>>>
>>> -d
>>>
>>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
>>> wrote:
>>>
>>> > This is supposed to be a generic option. You should be able to specify
>>> > Hadoop options such as this on the command line invocation of your
>>> favorite
>>> > Mahout routine, but I'm having a similar problem setting
>>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
>>> > without a space after the -D.
>>> >
>>> > Can someone point me to a Mahout command where this does work? Both
>>> drivers
>>> > extend AbstractJob and do the usual option processing pushups. I don't
>>> have
>>> > Hadoop source locally so I can't debug the generic options parsing.
>>> >
>>> > -----Original Message-----
>>> > From: beneo_7 [mailto:beneo_7@163.com]
>>> > Sent: Monday, December 27, 2010 10:45 PM
>>> > To: dev@mahout.apache.org
>>> > Subject: where i can set -Dmapred.map.tasks=X
>>> >
>>> > i read onMahout in Action that I should set -Dmapred.map.tasks=X
>>> > but it did not work for hadoop
>>> >
>>>
>>
>>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PPS it doesn't tell you what property FileInputFormat actually uses for it, and I
don't remember off the top of my head either, but I assume you could set
them with -D as well.

On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> In particular, QJob is one of the drivers that uses that , in the following
> way:
>
> f ( minSplitSize>0)
>  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
>
> Interestng pecularity about that parameter is that in the current hadoop
> release for anything derived from InputFileFormat it ensures that all splits
> are at least that big and the last split is at least times 1.1  that big. I
> am not quite sure why special treatment for the last split but that's how it
> goes there.
>
> -Dmitriy
>
>
> On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> Jeff,
>>
>> it's mahout-376 patch i don't think it is committed. the driver class
>> there is SSVDCli, for your convenience you can find it here :
>> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>>
>> but like i said, i did not try to use it with -D option since i wanted to
>> give an explicit option to increase split size if needed (and a help for
>> it). Another reason is that solver has a series of jobs and only those
>> reading the source matrix have anything to do with the split size.
>>
>>
>> -d
>>
>>
>> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com> wrote:
>>
>>> What's the driver class? If the -D parameters are working for you I want
>>> to compare to the clustering drovers
>>>
>>> -----Original Message-----
>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>> Sent: Tuesday, December 28, 2010 4:37 PM
>>> To: dev@mahout.apache.org
>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>
>>> as far as i understand, this option is not forced. I suspect it actually
>>> means 'minimum degree of parallelism'. so if you expect to use that to
>>> reduce number of mappers, i don't think this is expected to work so much.
>>> The one that do enforce anything are min split size and max split size in
>>> file input so i guess you can try those. I rely on them (and open it up
>>> as a
>>> job-specific option) in stochastic svd.
>>>
>>> but usually forcing split size to increase creates a 'superslits'
>>> problem,
>>> where a lot of data is moved around to just supply data to mappers. which
>>> is
>>> perhaps why this option is meant to increase parallelism only, but
>>> probably
>>> not to decrease it.
>>>
>>> -d
>>>
>>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com>
>>> wrote:
>>>
>>> > This is supposed to be a generic option. You should be able to specify
>>> > Hadoop options such as this on the command line invocation of your
>>> favorite
>>> > Mahout routine, but I'm having a similar problem setting
>>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
>>> > without a space after the -D.
>>> >
>>> > Can someone point me to a Mahout command where this does work? Both
>>> drivers
>>> > extend AbstractJob and do the usual option processing pushups. I don't
>>> have
>>> > Hadoop source locally so I can't debug the generic options parsing.
>>> >
>>> > -----Original Message-----
>>> > From: beneo_7 [mailto:beneo_7@163.com]
>>> > Sent: Monday, December 27, 2010 10:45 PM
>>> > To: dev@mahout.apache.org
>>> > Subject: where i can set -Dmapred.map.tasks=X
>>> >
>>> > i read onMahout in Action that I should set -Dmapred.map.tasks=X
>>> > but it did not work for hadoop
>>> >
>>>
>>
>>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
In particular, QJob is one of the drivers that uses that, in the following
way:

if (minSplitSize > 0) {
  SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
}

An interesting peculiarity of that parameter is that, in the current Hadoop
release, for anything derived from FileInputFormat it ensures that all splits
are at least that big, and the last split may be up to 1.1 times that big. I
am not quite sure why the last split gets special treatment, but that's how it
goes there.
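
The split sizing described above can be illustrated without Hadoop on the classpath. This dependency-free sketch mirrors the FileInputFormat-style logic (clamping the split size between the configured min and max, plus the 1.1 "slop" factor for the last split); it is an illustration of the behavior, not the actual Hadoop source:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of FileInputFormat-style split sizing: the
// effective split size is clamped between the configured min and max, and
// a small tail (under 10% of a split, the 1.1 "slop" factor) is merged
// into the last split instead of becoming its own tiny split.
public class SplitSizing {

    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> splits = new ArrayList<Long>();
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits.add(remaining);  // last split may be up to 1.1 * splitSize
        }
        return splits;
    }

    public static void main(String[] args) {
        // raising minSize above the block size forces bigger (hence fewer) splits
        System.out.println(computeSplitSize(64, 128, Long.MAX_VALUE));  // 128
        // 205 bytes at split size 100: the 5-byte tail merges into the last split
        System.out.println(splitLengths(205, 100));  // [100, 105]
    }
}
```

This also shows why the option only reliably decreases the mapper count: a larger minimum split size changes how the input is carved up, whereas mapred.map.tasks is merely a hint.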

-Dmitriy


On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Jeff,
>
> it's mahout-376 patch i don't think it is committed. the driver class there
> is SSVDCli, for your convenience you can find it here :
> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>
> but like i said, i did not try to use it with -D option since i wanted to
> give an explicit option to increase split size if needed (and a help for
> it). Another reason is that solver has a series of jobs and only those
> reading the source matrix have anything to do with the split size.
>
>
> -d
>
>
> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com> wrote:
>
>> What's the driver class? If the -D parameters are working for you I want
>> to compare to the clustering drovers
>>
>> -----Original Message-----
>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>> Sent: Tuesday, December 28, 2010 4:37 PM
>> To: dev@mahout.apache.org
>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>
>> as far as i understand, this option is not forced. I suspect it actually
>> means 'minimum degree of parallelism'. so if you expect to use that to
>> reduce number of mappers, i don't think this is expected to work so much.
>> The one that do enforce anything are min split size and max split size in
>> file input so i guess you can try those. I rely on them (and open it up as
>> a
>> job-specific option) in stochastic svd.
>>
>> but usually forcing split size to increase creates a 'superslits' problem,
>> where a lot of data is moved around to just supply data to mappers. which
>> is
>> perhaps why this option is meant to increase parallelism only, but
>> probably
>> not to decrease it.
>>
>> -d
>>
>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com> wrote:
>>
>> > This is supposed to be a generic option. You should be able to specify
>> > Hadoop options such as this on the command line invocation of your
>> favorite
>> > Mahout routine, but I'm having a similar problem setting
>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
>> > without a space after the -D.
>> >
>> > Can someone point me to a Mahout command where this does work? Both
>> drivers
>> > extend AbstractJob and do the usual option processing pushups. I don't
>> have
>> > Hadoop source locally so I can't debug the generic options parsing.
>> >
>> > -----Original Message-----
>> > From: beneo_7 [mailto:beneo_7@163.com]
>> > Sent: Monday, December 27, 2010 10:45 PM
>> > To: dev@mahout.apache.org
>> > Subject: where i can set -Dmapred.map.tasks=X
>> >
>> > I read in Mahout in Action that I should set -Dmapred.map.tasks=X
>> > but it did not work for Hadoop
>> >
>>
>
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Jeff,

It's the MAHOUT-376 patch; I don't think it has been committed. The driver
class there is SSVDCli; for your convenience you can find it here:
https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd

But like I said, I did not try it with the -D option, since I wanted to give
an explicit option to increase the split size if needed (and help text for
it). Another reason is that the solver runs a series of jobs, and only those
reading the source matrix have anything to do with the split size.


-d

On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <je...@narus.com> wrote:

> What's the driver class? If the -D parameters are working for you I want to
> compare to the clustering drivers
>
> -----Original Message-----
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Tuesday, December 28, 2010 4:37 PM
> To: dev@mahout.apache.org
> Subject: Re: where i can set -Dmapred.map.tasks=X
>
> as far as i understand, this option is not forced. I suspect it actually
> means 'minimum degree of parallelism'. so if you expect to use that to
> reduce number of mappers, i don't think this is expected to work so much.
> The ones that do enforce anything are min split size and max split size in
> file input so i guess you can try those. I rely on them (and open it up as
> a
> job-specific option) in stochastic svd.
>
> but usually forcing split size to increase creates a 'supersplits' problem,
> where a lot of data is moved around to just supply data to mappers. which
> is
> perhaps why this option is meant to increase parallelism only, but probably
> not to decrease it.
>
> -d
>
> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com> wrote:
>
> > This is supposed to be a generic option. You should be able to specify
> > Hadoop options such as this on the command line invocation of your
> favorite
> > Mahout routine, but I'm having a similar problem setting
> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
> > without a space after the -D.
> >
> > Can someone point me to a Mahout command where this does work? Both
> drivers
> > extend AbstractJob and do the usual option processing pushups. I don't
> have
> > Hadoop source locally so I can't debug the generic options parsing.
> >
> > -----Original Message-----
> > From: beneo_7 [mailto:beneo_7@163.com]
> > Sent: Monday, December 27, 2010 10:45 PM
> > To: dev@mahout.apache.org
> > Subject: where i can set -Dmapred.map.tasks=X
> >
> > I read in Mahout in Action that I should set -Dmapred.map.tasks=X
> > but it did not work for Hadoop
> >
>

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
What's the driver class? If the -D parameters are working for you, I want to compare it to the clustering drivers.

-----Original Message-----
From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com] 
Sent: Tuesday, December 28, 2010 4:37 PM
To: dev@mahout.apache.org
Subject: Re: where i can set -Dmapred.map.tasks=X

as far as i understand, this option is not forced. I suspect it actually
means 'minimum degree of parallelism'. so if you expect to use that to
reduce number of mappers, i don't think this is expected to work so much.
The ones that do enforce anything are min split size and max split size in
file input so i guess you can try those. I rely on them (and open it up as a
job-specific option) in stochastic svd.

but usually forcing split size to increase creates a 'supersplits' problem,
where a lot of data is moved around to just supply data to mappers. which is
perhaps why this option is meant to increase parallelism only, but probably
not to decrease it.

-d

On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com> wrote:

> This is supposed to be a generic option. You should be able to specify
> Hadoop options such as this on the command line invocation of your favorite
> Mahout routine, but I'm having a similar problem setting
> -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
> without a space after the -D.
>
> Can someone point me to a Mahout command where this does work? Both drivers
> extend AbstractJob and do the usual option processing pushups. I don't have
> Hadoop source locally so I can't debug the generic options parsing.
>
> -----Original Message-----
> From: beneo_7 [mailto:beneo_7@163.com]
> Sent: Monday, December 27, 2010 10:45 PM
> To: dev@mahout.apache.org
> Subject: where i can set -Dmapred.map.tasks=X
>
> I read in Mahout in Action that I should set -Dmapred.map.tasks=X
> but it did not work for Hadoop
>

Re: where i can set -Dmapred.map.tasks=X

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
As far as I understand, this option is not forced; I suspect it actually
means 'minimum degree of parallelism'. So if you expect to use it to reduce
the number of mappers, I don't think that is expected to work. The ones that
do enforce anything are the min and max split sizes in file input, so I
guess you can try those. I rely on them (and open them up as a job-specific
option) in the stochastic SVD.

But forcing the split size to increase usually creates a 'supersplits'
problem, where a lot of data is moved around just to supply data to mappers,
which is perhaps why this option is meant to increase parallelism only, but
probably not to decrease it.
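For reference, the interaction between the block size and the min/max split settings can be sketched with the standard formula. This is a stdlib-only illustration (the wrapper class and method names are made up, though the formula matches the computeSplitSize() used by Hadoop's FileInputFormat, with mapred.min.split.size / mapred.max.split.size as the knobs):

```java
// Stdlib-only sketch of how a FileInputFormat-derived input picks a split
// size from the block size and the min/max split settings, and how that
// drives the mapper count (roughly one mapper per split).
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough mapper count for a single file, ignoring the 1.1x slop
    // applied to the last split.
    static long approxMappers(long fileLen, long blockSize,
                              long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        return Math.max(1, (fileLen + splitSize - 1) / splitSize);
    }
}
```

So raising the min split size above the block size is the lever that actually reduces the number of mappers: for instance, a 1 GB file with 64 MB blocks gets about 16 mappers by default, but only about 4 if the min split size is forced up to 256 MB.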

-d

On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <je...@narus.com> wrote:

> This is supposed to be a generic option. You should be able to specify
> Hadoop options such as this on the command line invocation of your favorite
> Mahout routine, but I'm having a similar problem setting
> -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
> without a space after the -D.
>
> Can someone point me to a Mahout command where this does work? Both drivers
> extend AbstractJob and do the usual option processing pushups. I don't have
> Hadoop source locally so I can't debug the generic options parsing.
>
> -----Original Message-----
> From: beneo_7 [mailto:beneo_7@163.com]
> Sent: Monday, December 27, 2010 10:45 PM
> To: dev@mahout.apache.org
> Subject: where i can set -Dmapred.map.tasks=X
>
> I read in Mahout in Action that I should set -Dmapred.map.tasks=X
> but it did not work for Hadoop
>

RE: where i can set -Dmapred.map.tasks=X

Posted by Jeff Eastman <je...@Narus.com>.
This is supposed to be a generic option. You should be able to specify Hadoop options such as this on the command-line invocation of your favorite Mahout routine, but I'm having a similar problem setting -Dmapred.reduce.tasks=10 with Canopy and k-Means, both with and without a space after the -D.

Can someone point me to a Mahout command where this does work? Both drivers extend AbstractJob and do the usual option-processing pushups. I don't have the Hadoop source locally, so I can't debug the generic-options parsing.
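For what it's worth, the mechanism that is supposed to pick these up is Hadoop's GenericOptionsParser, invoked via ToolRunner.run(), which strips the generic arguments (including -Dkey=value pairs) off the front of the command line before the job's own option parsing sees them. That is why the -D options have to come before the job-specific arguments. A stdlib-only toy model of that behavior (not the real parser, which is commons-cli based and handles more forms, e.g. "-D key=value" with a space, -fs, -jt, -libjars):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of how ToolRunner/GenericOptionsParser-style parsing consumes
// leading "-Dkey=value" arguments into the job configuration and stops at
// the first argument it doesn't recognize. Illustration only.
public class GenericOptsSketch {
    static Map<String, String> parseLeadingD(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("-D") && eq > 2) {
                props.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                break; // parsing stops; later -D args never reach the conf
            }
        }
        return props;
    }
}
```

Under this model, `mahout kmeans -Dmapred.reduce.tasks=10 -i ...` feeds the property through, while `mahout kmeans -i ... -Dmapred.reduce.tasks=10` leaves it unset. And for the -D to matter at all, the driver's main() has to go through ToolRunner.run(new SomeDriver(), args) (SomeDriver being a placeholder) so the parsed configuration is actually used.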

-----Original Message-----
From: beneo_7 [mailto:beneo_7@163.com] 
Sent: Monday, December 27, 2010 10:45 PM
To: dev@mahout.apache.org
Subject: where i can set -Dmapred.map.tasks=X

I read in Mahout in Action that I should set -Dmapred.map.tasks=X
but it did not work for Hadoop