Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/06/11 19:01:14 UTC

Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Over to dev list:

Sean, we currently have some jobs which accept numbers of mappers and 
reducers as optional command arguments and others that require the -D 
arguments to control the same, as you have written. It seems like our 
usability would improve if we adopted a consistent policy across all Mahout 
components. If so, would you argue that all use -D arguments for this 
control? What about situations where our default is not whatever Hadoop 
does by default? Would this result in noticeable behavior changes? Also, 
some algorithms don't work with arbitrary numbers of reducers and some 
don't use reducers at all. What would you suggest?
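
(For concreteness, a rough sketch of the two styles under discussion; the 
class name, the --numReducers option, and the command lines below are 
invented for illustration, not actual Mahout code.)

// Command-line shapes being compared:
//   hadoop jar example.jar ExampleJob -Dmapred.reduce.tasks=8 --input in --output out
//   hadoop jar example.jar ExampleJob --numReducers 8 --input in --output out
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // With the -D style, ToolRunner's GenericOptionsParser has already folded
    // mapred.reduce.tasks into getConf() before run() is called, so the Job
    // simply inherits it. (Mapper/input/output setup omitted for brevity.)
    Job job = new Job(getConf(), "example");

    // With the option style, the job must parse --numReducers from 'args'
    // itself and apply it explicitly, e.g. job.setNumReduceTasks(numReducers);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new ExampleJob(), args);
  }
}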

Jeff


On 6/11/10 9:35 AM, Sean Owen wrote:
> -Dmapred.map.tasks and same for reduce? These should be Hadoop params
> you set directly to Hadoop.
>
> On Fri, Jun 11, 2010 at 5:07 PM, Kris Jack<mr...@gmail.com>  wrote:
>    
>> Hi everyone,
>>
>> I am running code that uses some of the jobs defined in the
>> DistributedRowMatrix class and would like to know if I can define the number
>> of mappers and reducers that they use when running?  In particular, with the
>> jobs:
>>
>> - MatrixMultiplicationJob
>> - TransposeJob
>>
>> I am comfortable with changing the code to get this to work, but I was
>> wondering whether the algorithmic logic being employed would allow multiple
>> mappers and reducers.
>>
>> Thanks,
>> Kris
>>
>>      
>    
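
For reference, the kind of usage the quoted question is about, as I 
understand it. Treat this as a sketch: the DistributedRowMatrix constructor 
arguments and the exact product computed by times() are written from memory 
and should be checked against trunk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class DrmSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Equivalent of passing -Dmapred.reduce.tasks=8 on the command line;
    // whether the underlying jobs honor it is exactly what is being asked.
    conf.setInt("mapred.reduce.tasks", 8);

    // Constructor signature shown from memory -- check the current code.
    DistributedRowMatrix a =
        new DistributedRowMatrix(new Path("/data/A"), new Path("/tmp/drm"), 1000000, 100);
    a.setConf(conf);

    DistributedRowMatrix aT = a.transpose();   // runs TransposeJob

    // times() runs MatrixMultiplicationJob against a second matrix; as I
    // recall both operands must have the same number of rows.
    DistributedRowMatrix b =
        new DistributedRowMatrix(new Path("/data/B"), new Path("/tmp/drm"), 1000000, 50);
    b.setConf(conf);
    DistributedRowMatrix product = a.times(b);
  }
}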


Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Ted Dunning <te...@gmail.com>.
Sounds right to me.

On Sun, Jun 13, 2010 at 9:53 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> But these are possible solutions; can we agree for now on a statement of
> the problem?

Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I agree this is a usability defect at least. If setting the number of 
reducers is a common activity which users need to perform in running 
Mahout applications then we ought to have a standard way of specifying 
this in our APIs without exposing the full set of Hadoop options, 
especially to our non-power-users. This is the case for some 
applications already but others require the use of Hadoop-level -D 
arguments to achieve reasonable out-of-the-box parallelism even when 
running our examples. I think the usability defect is that some of our 
algorithms won't scale without it and that we don't have a standard way 
to specify this in our APIs.

If exposing --numReducers (and in one case at least --numMappers too) is 
duplicating the Hadoop-level arguments then perhaps instead adding 
--desiredParallelism or even --howManyNodesImRunningThisOn to all Mahout 
applications would give them the ability to optimize the lower-level 
Hadoop arguments in a more intelligent manner than they can do today. 
But these are possible solutions; can we agree for now on a statement of 
the problem?
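
Purely as a strawman of what I mean (nothing below exists in Mahout; the 
names are invented): a single user-facing hint that each driver maps onto 
per-step reducer counts itself, instead of exposing the raw Hadoop knobs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

final class ParallelismHint {
  static void configureSteps(Configuration conf, int desiredParallelism) throws Exception {
    // A small preparatory step might use only part of the hint...
    Job prepare = new Job(conf, "prepare");
    prepare.setNumReduceTasks(Math.max(1, desiredParallelism / 4));

    // ...while the heavyweight step uses all of it.
    Job compute = new Job(conf, "compute");
    compute.setNumReduceTasks(desiredParallelism);

    // A map-only step would simply ignore the hint.
    Job mapOnly = new Job(conf, "map-only");
    mapOnly.setNumReduceTasks(0);
  }
}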

On 6/11/10 1:17 PM, Ted Dunning wrote:
> I view this behavior as a bug in our code.  The default behavior should be
> reasonable.  When it is not, that isn't evidence that the user needs flags
> to fix the behavior ... it is evidence that we should fix the default
> behavior.
>
> (I hate buying products where the default for -Ddo-something-stupid is
> true)
>
> On Fri, Jun 11, 2010 at 12:29 PM, Jeff Eastman
> <jd...@windwardsolutions.com>wrote:
>
>    
>>> Do we have evidence the other way, that users regularly need to
>>> control this to achieve best performance? I personally actually never
>>> set it and let Hadoop base it on the file splits and blocks and such,
>>> which is a pretty good heuristic.
>>>
>>>
>>>
>>>        
>> Anecdotal: When I ran PFPGrowth on the accidents.dat database on a 4-data-node
>> cluster it only used a single reducer. Haven't yet tried that with -D
>> but I think others have. Before I added numReducers propagation to
>> seq2sparse it only launched a single reducer for the back-end steps; doing
>> Reuters and LDA took 3x longer than necessary on my cluster.
>> DistributedRowMatrix requires -D to achieve parallelism, as noted at the top
>> of this thread. I suspect there are others. I've observed Hadoop does a
>> pretty good job with mappers based upon file splits etc., but not so well at
>> reducers, which is why we have --numReducers in the first place.
>>      
>    


Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Ted Dunning <te...@gmail.com>.
I view this behavior as a bug in our code.  The default behavior should be
reasonable.  When it is not, that isn't evidence that the user needs flags
to fix the behavior ... it is evidence that we should fix the default
behavior.

(I hate buying products where the default for -Ddo-something-stupid is
true)

On Fri, Jun 11, 2010 at 12:29 PM, Jeff Eastman
<jd...@windwardsolutions.com>wrote:

>> Do we have evidence the other way, that users regularly need to
>> control this to achieve best performance? I personally actually never
>> set it and let Hadoop base it on the file splits and blocks and such,
>> which is a pretty good heuristic.
>>
>>
>>
> Anecdotal: When I ran PFPGrowth on the accidents.dat database on a 4-data-node
> cluster it only used a single reducer. Haven't yet tried that with -D
> but I think others have. Before I added numReducers propagation to
> seq2sparse it only launched a single reducer for the back-end steps; doing
> Reuters and LDA took 3x longer than necessary on my cluster.
> DistributedRowMatrix requires -D to achieve parallelism, as noted at the top
> of this thread. I suspect there are others. I've observed Hadoop does a
> pretty good job with mappers based upon file splits etc., but not so well at
> reducers, which is why we have --numReducers in the first place.

Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 6/11/10 11:42 AM, Sean Owen wrote:
> On Fri, Jun 11, 2010 at 7:33 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>    
>> complete enough for 'neophyte users' and 'regular users' and that only
>> 'power users' should be using the -D abstractions (and with that accepting
>> any idiosyncrasies that may result since we cannot guarantee how they may
>> interact).
>>      
> That's a reasonable rule. All you really need to specify is input and
> output, and Hadoop's defaults should work reasonably from there. So I
> view this as an argument to create --input and --output, and that's
> done.
>    
+1 The fact that --input and -Dmapred.input.dir both have the word 
'input' in them is a coincidence: they are not interchangeable, and our 
arguments have a broader function, especially when multiple job steps are 
invoked.
>> Since the degree of parallelism obtained is often a function of the number
>> of mappers/reducers specified, and since the degree of parallelism is
>> something our 'regular users' would reasonably need to control, perhaps
>> replacing the --numReducers options with --desiredParallelism (or something)
>> and having reasonable defaults on that for our neophytes would be better.
>> Then the implementation could take the user's desires into account and
>> internally manage the numbers of map and reduce tasks where it makes sense
>> to do so.
>>      
> On this flag in particular --
>
> It's an appealing idea, but how do the details work? For example, on
> the recommender jobs there are at least 4 mapreduces, each of which
> has a fairly different best parallelism setting. The big, last phase
> should be parallelized as much as possible; early phases would just be
> slowed down by using too many mappers.
>    
Good question, and I expect each algorithm would have to do some thinking 
about how best to honor the user's desired concurrency level. In many 
cases it would just influence the number of reducers, but file input 
split size has also been suggested as a control variable. And it might 
not be possible to honor it at all in some multistage jobs such as 
recommenders. How do you optimize that?
> What would the neophyte user using this flag do with it? Presumably
> the neophyte just wants it set to "optimal" or "as much as is
> reasonable" and that's basically what Hadoop is already doing, better
> than the user can determine.
>    
I think that neophytes just want something reasonable and would not be 
messing with this flag. But some of our jobs do not do something 
reasonable yet on multi-node clusters (see below).
> Encouraging the non-power-user to set number of mappers and reducers
> also has the potential to invite them to hurt performance.
>    
Yes, I don't want them messing with those parameters at all. It probably 
does not make sense to specify --desiredParallelism = 100 on a 4-node 
cluster, but a clever implementation might be able to spawn maybe 8 
reducers given that hint.
> Do we have evidence the other way, that users regularly need to
> control this to achieve best performance? I personally actually never
> set it and let Hadoop base it on the file splits and blocks and such,
> which is a pretty good heuristic.
>
>    
Anecdotal: When I ran PFPGrowth on the accidents.dat database on a 
4-data-node cluster it only used a single reducer. Haven't yet tried that 
with -D but I think others have. Before I added numReducers propagation 
to seq2sparse it only launched a single reducer for the back-end steps; 
doing Reuters and LDA took 3x longer than necessary on my cluster. 
DistributedRowMatrix requires -D to achieve parallelism, as noted at the 
top of this thread. I suspect there are others. I've observed Hadoop does 
a pretty good job with mappers based upon file splits etc., but not so 
well at reducers, which is why we have --numReducers in the first place.
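
To make the "propagation" concrete (option spelling and command line from 
memory, so double-check against the Reuters example scripts):

// Roughly the invocation in question:
//   bin/mahout seq2sparse -i reuters-seqfiles -o reuters-vectors --numReducers 4
//
// A back-end step that never calls setNumReduceTasks() silently falls back
// to the cluster's mapred.reduce.tasks, which out of the box is 1. Propagation
// just means handing the parsed value to every chained step:
import org.apache.hadoop.mapreduce.Job;

final class ReducerPropagation {
  static void apply(Job backEndStep, int numReducers) {
    backEndStep.setNumReduceTasks(numReducers);  // otherwise: cluster default (1)
  }
}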


Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Jun 11, 2010 at 7:33 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> complete enough for 'neophyte users' and 'regular users' and that only
> 'power users' should be using the -D abstractions (and with that accepting
> any idiosyncrasies that may result since we cannot guarantee how they may
> interact).

That's a reasonable rule. All you really need to specify is input and
output, and Hadoop's defaults should work reasonably from there. So I
view this as an argument to create --input and --output, and that's
done.


> Since the degree of parallelism obtained is often a function of the number
> of mappers/reducers specified, and since the degree of parallelism is
> something our 'regular users' would reasonably need to control, perhaps
> replacing the --numReducers options with --desiredParallelism (or something)
> and having reasonable defaults on that for our neophytes would be better.
> Then the implementation could take the user's desires into account and
> internally manage the numbers of map and reduce tasks where it makes sense
> to do so.

On this flag in particular --

It's an appealing idea, but how do the details work? For example, on
the recommender jobs there are at least 4 mapreduces, each of which
has a fairly different best parallelism setting. The big, last phase
should be parallelized as much as possible; early phases would just be
slowed down by using too many mappers.

What would the neophyte user using this flag do with it? Presumably
the neophyte just wants it set to "optimal" or "as much as is
reasonable" and that's basically what Hadoop is already doing, better
than the user can determine.

Encouraging the non-power-user to set number of mappers and reducers
also has the potential to invite them to hurt performance.

Do we have evidence the other way, that users regularly need to
control this to achieve best performance? I personally actually never
set it and let Hadoop base it on the file splits and blocks and such,
which is a pretty good heuristic.
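
For anyone following along, roughly the defaults being referred to (pre-0.21 
property names; the exact values depend on the cluster configuration):

import org.apache.hadoop.conf.Configuration;

final class HadoopDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Map side: the number of map tasks is derived from the input splits,
    // which for FileInputFormat track the HDFS block size (64MB by default)
    // unless split-size settings say otherwise.
    long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);

    // Reduce side: nothing is derived from the data; the job gets whatever
    // mapred.reduce.tasks says, and the stock value is 1.
    int reducers = conf.getInt("mapred.reduce.tasks", 1);

    System.out.println("block size = " + blockSize + ", reducers = " + reducers);
  }
}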

Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I completely agree that many of the Hadoop options are inappropriate as 
standard Mahout arguments. The challenge I see from a usability 
perspective is that the -D option introduces two different levels of 
abstraction into our user APIs. It's like exposing the full engine and 
transmission APIs in an automobile on the dashboard next to the cruise 
control buttons. I would argue that the Mahout APIs (our standard 
command line arguments) ought to be complete enough for 'neophyte users' 
and 'regular users' and that only 'power users' should be using the -D 
abstractions (and with that accepting any idiosyncrasies that may result 
since we cannot guarantee how they may interact).

Since the degree of parallelism obtained is often a function of the 
number of mappers/reducers specified, and since the degree of 
parallelism is something our 'regular users' would reasonably need to 
control, perhaps replacing the --numReducers options with 
--desiredParallelism (or something) and having reasonable defaults on 
that for our neophytes would be better. Then the implementation could 
take the user's desires into account and internally manage the numbers 
of map and reduce tasks where it makes sense to do so.

Said a little differently, the Configuration values set in the Drivers 
clearly need to come from our standard command arguments. So too do some 
of the Job values, but more indirectly as you note with --input and 
--output handling being managed internally to each job step. I think 
this also applies to --numMappers and --numReducers settings and that 
managing them internally via an application-level --desiredParallelism 
argument would be an improvement that would keep our API abstraction 
layers distinct.
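
(Sketching the distinction I mean; the names below are invented, not actual 
Mahout driver code.) Algorithm parameters travel on the Configuration, where 
mappers and reducers can read them back via context.getConfiguration(), while 
job-shape parameters like reducer count are applied to each Job directly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

final class DriverSketch {
  static Job buildStep(Configuration conf, double convergenceDelta, int desiredParallelism)
      throws Exception {
    // Value parsed from a standard command argument, pushed into the conf
    // for the map/reduce code to read.
    conf.set("example.convergenceDelta", String.valueOf(convergenceDelta));

    // Job-level knob applied directly, derived from the user's hint.
    Job step = new Job(conf, "step");
    step.setNumReduceTasks(desiredParallelism);
    return step;
  }
}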

On 6/11/10 10:13 AM, Sean Owen wrote:
> It's the same question as --input and -Dmapred.input.dir. The latter
> is the standard Hadoop parameter, which we have to support if only
> because this is something the user may be configuring in the XML
> configs, but also because it'll be familiar to Hadoop users I assume.
>
>
> Jobs can read and change these settings to implement additional
> restrictions, sure. For example, the user-supplied input and output
> dir are only used to control the first M/R input in a chain of M/Rs
> run by a job, and the output of its final M/R. In between, it's
> overriding this value on individual M/Rs as needed of course, to
> direct intermediate output elsewhere.
>
>
> So the question is not whether we need our own way to control Hadoop
> parameters at times -- we very much do, and this already happens and
> works internally. The question is merely one of command-line "UI",
> duplicating Hadoop flags with our own.
>
> I personally am inclined to not do this, as it's just more code, more
> possibilities to support and debug, more difference from the norm.
> However in the case of input and output I think we all agreed that
> such a basic flag might as well have its own custom version that works
> in the same way as the Hadoop one.
>
> I'd argue we wouldn't want to do the same thing for number of mappers
> and reducers. From there, why not duplicate about 10 other flags I can
> think of? compressing map output, reducer output, IO sort buffer size,
> etc etc.
>
>
> On Fri, Jun 11, 2010 at 6:01 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>    
>> Over to dev list:
>>
>> Sean, we currently have some jobs which accept numbers of mappers and
>> reducers as optional command arguments and others that require the -D
>> arguments to control the same, as you have written. It seems like our
>> usability would improve if we adopted a consistent policy across all Mahout
>> components. If so, would you argue that all use -D arguments for this
>> control? What about situations where our default is not whatever Hadoop does
>> by default? Would this result in noticeable behavior changes? Also, some
>> algorithms don't work with arbitrary numbers of reducers and some don't use
>> reducers at all. What would you suggest?
>>
>>      
>    


Re: Setting Number of Mappers and Reducers in DistributedRowMatrix Jobs

Posted by Sean Owen <sr...@gmail.com>.
It's the same question as --input and -Dmapred.input.dir. The latter
is the standard Hadoop parameter, which we have to support if only
because this is something the user may be configuring in the XML
configs, but also because it'll be familiar to Hadoop users I assume.


Jobs can read and change these settings to implement additional
restrictions, sure. For example, the user-supplied input and output
dir are only used to control the first M/R input in a chain of M/Rs
run by a job, and the output of its final M/R. In between, it's
overriding this value on individual M/Rs as needed of course, to
direct intermediate output elsewhere.
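
A bare-bones illustration of that wiring (generic Hadoop, not any particular 
Mahout job): the user-facing --input and --output only touch the ends of the 
chain, and the intermediate path is managed internally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

final class ChainedJob {
  static void run(Configuration conf, Path userInput, Path userOutput) throws Exception {
    Path intermediate = new Path(userOutput.toString() + "-intermediate");

    // First M/R: reads the user-supplied --input, writes somewhere internal.
    // (Mapper/reducer classes omitted for brevity.)
    Job first = new Job(conf, "phase-1");
    FileInputFormat.addInputPath(first, userInput);
    FileOutputFormat.setOutputPath(first, intermediate);
    first.waitForCompletion(true);

    // Final M/R: reads the internal path, writes the user-supplied --output.
    Job last = new Job(conf, "phase-2");
    FileInputFormat.addInputPath(last, intermediate);
    FileOutputFormat.setOutputPath(last, userOutput);
    last.waitForCompletion(true);
  }
}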


So the question is not whether we need our own way to control Hadoop
parameters at times -- we very much do, and this already happens and
works internally. The question is merely one of command-line "UI",
duplicating Hadoop flags with our own.

I personally am inclined to not do this, as it's just more code, more
possibilities to support and debug, more difference from the norm.
However in the case of input and output I think we all agreed that
such a basic flag might as well have its own custom version that works
in the same way as the Hadoop one.

I'd argue we wouldn't want to do the same thing for number of mappers
and reducers. From there, why not duplicate about 10 other flags I can
think of? compressing map output, reducer output, IO sort buffer size,
etc etc.


On Fri, Jun 11, 2010 at 6:01 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Over to dev list:
>
> Sean, we currently have some jobs which accept numbers of mappers and
> reducers as optional command arguments and others that require the -D
> arguments to control the same, as you have written. It seems like our
> usability would improve if we adopted a consistent policy across all Mahout
> components. If so, would you argue that all use -D arguments for this
> control? What about situations where our default is not whatever Hadoop does
> by default? Would this result in noticeable behavior changes? Also, some
> algorithms don't work with arbitrary numbers of reducers and some don't use
> reducers at all. What would you suggest?
>