Posted to user@mahout.apache.org by "Tim R. Havens" <ti...@gmail.com> on 2012/01/24 15:23:28 UTC

Re: LDA on single node is much faster than 20 nodes

Sean Owen <srowen <at> gmail.com> writes:

...snip...
> You can of course force it to use more mappers, and that's probably a good
> idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more overhead
> of spinning up mappers to process less data, and Hadoop's guess indicates
> that it thinks it's not efficient to use 20 workers. If you know that those
> other 18 are otherwise idle, my guess is you'd benefit from just making it
> use 20.
...

How can I accomplish this when running something like the following from the command line?

Is it possible to force the map tasks and reduce tasks to a higher number
in this example?  I've been running a few jobs like this with 'fpg', but
I haven't been able to find solid docs on how to increase the number of
mappers/reducers for the jobs.  Currently this runs on about 8-9M rows
of input on our cluster, but it never uses more than 2 map and 2 reduce
tasks per job.

mahout fpg -i /user/<user>/stopword_filtered/search_terms.txt \
           -o stopword_filtered/patterns \
           -g 5000 \
           -k 20 \
           -method mapreduce \
           -regex '[\ ]' \
           -s 120


RE: LDA on single node is much faster than 20 nodes

Posted by Paritosh Ranjan <pr...@xebia.com>.
-Dmapred.map.tasks=20 might not get you 20 mappers if the block size is such that only a few blocks (splits) cover the whole input.
I think decreasing the block size will automatically bring more mappers into action.

One block per mapper is the general concept (since the split size is equal to the block size by default).

I have tried using -Dmapred.map.tasks to force more mappers, but it did not work for me when the block size limited the number of blocks (splits) to be processed.
Decreasing the block size so that the data is written across more blocks (roughly as many blocks as there are map slots available) helps.

Decreasing split size can also help.

Just sharing my experience, in case it can help.
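
A minimal sketch of the split arithmetic Paritosh describes, assuming the job reads its input through the standard FileInputFormat and the classic Hadoop 0.20/1.x property names (mapred.max.split.size, mapred.min.split.size, dfs.block.size); the exact property names, the MAHOUT_OPTS route (suggested by Sean below), and the destination path are assumptions about your setup, not a recipe:

# FileInputFormat sizes its splits roughly as
#   splitSize = max(mapred.min.split.size, min(mapred.max.split.size, blockSize))
# so capping the max split size, or rewriting the input with a smaller block
# size, yields more splits and therefore more mappers.

# Cap splits at ~32 MB for the job (passed through MAHOUT_OPTS, see below):
export MAHOUT_OPTS="-Dmapred.max.split.size=33554432"

# Or rewrite the input with a smaller HDFS block size; dfs.block.size only
# applies to files written after it is set (destination name is illustrative):
hadoop fs -D dfs.block.size=33554432 \
    -cp /user/<user>/stopword_filtered/search_terms.txt \
        /user/<user>/stopword_filtered/search_terms_32mb.txt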

Re: LDA on single node is much faster than 20 nodes

Posted by Sean Owen <sr...@gmail.com>.
I haven't used the CLI in ages but I believe there's an env variable like
"MAHOUT_OPTS" where you can set flags like -Dmapred.map.tasks=20.
