Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2012/11/02 16:30:23 UTC

Using multiple reducers with rowsimilarity job

Will I cause any problems by running bin/mahout rowsimilarity with a
"-Dmapred.reduce.tasks" parameter?

I'm used to Mahout jobs that support multiple reducers having a "-nr"
command line option.
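
For what it's worth, the job looks like it is launched through Hadoop's
ToolRunner, so generic -D options (placed before the job-specific flags)
should end up in the job configuration. A minimal driver sketch of what I
believe happens under the hood (package names assume a Mahout 0.7-era
layout, and the class name here is just for illustration):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

// Sketch only, not the actual bin/mahout launcher: RowSimilarityJob extends
// AbstractJob, which implements Hadoop's Tool interface, so ToolRunner and
// GenericOptionsParser strip -D properties (such as mapred.reduce.tasks)
// out of args before the job sees its own flags.
public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    // e.g. args = { "-Dmapred.reduce.tasks=32", "-i", "/in", "-o", "/out" }
    ToolRunner.run(new RowSimilarityJob(), args);
  }
}

If that is how it works, "-Dmapred.reduce.tasks" should simply be passed
through to Hadoop rather than needing a dedicated "-nr" style flag.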

Thanks for the help!

Re: Using multiple reducers with rowsimilarity job

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Matt,

Computing pairwise similarity is a quadratic problem. The runtime depends
not so much on the amount of data as on its distribution. If a few things
in your data cooccur with everything else, you will get quadratically sized
intermediate output. A single feature that appears in, say, 100,000 rows
contributes on the order of 100,000^2 / 2 = 5 billion row pairs to the
cooccurrence stage all by itself.

In the collaborative filtering code, users with an enormous number of
interactions are downsampled to avoid this. If you use the "raw"
rowsimilarity job, you might have to do this yourself.

You should have a look at your data and see whether this is the case.
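
A rough sketch of what doing this yourself could look like (illustration
only, this is not Mahout's built-in sampling code): cap the number of
non-zero entries per vector along the dense dimension before handing the
matrix to the job. For a raw rowsimilarity input the dense dimension is
usually the columns (features shared by very many rows), so the cap would
be applied to the column vectors, for example after transposing the matrix.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/**
 * Illustration only: keep at most maxNonZeros randomly chosen entries of a
 * sparse vector and drop the rest. Applying this to overly dense vectors
 * keeps the cooccurrence stage from producing quadratic output for them.
 */
public class Downsampler {

  static Vector downsample(Vector v, int maxNonZeros, Random random) {
    if (v.getNumNondefaultElements() <= maxNonZeros) {
      return v;
    }
    // Copy out the indices of the non-zero entries; the iterator may reuse
    // its Element object, so we must not hold on to the elements themselves.
    List<Integer> indices = new ArrayList<Integer>();
    Iterator<Vector.Element> nonZeros = v.iterateNonZero();
    while (nonZeros.hasNext()) {
      indices.add(nonZeros.next().index());
    }
    // Keep a uniform random subset of maxNonZeros entries.
    Collections.shuffle(indices, random);
    Vector sampled = new RandomAccessSparseVector(v.size(), maxNonZeros);
    for (int i = 0; i < maxNonZeros; i++) {
      int index = indices.get(i);
      sampled.setQuick(index, v.getQuick(index));
    }
    return sampled;
  }
}

Since the intermediate output grows with the square of a vector's non-zero
count, even a generous cap bounds the worst case sharply.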

Best,
Sebastian



On 05.11.2012 17:22, Matt Molek wrote:
> Having found a few mentions of running rowsimilarity with multiple
> reducers, I assume it's ok.
> 
> I'm having a problem with the RowSimilarityJob-CooccurrencesMapper-Reducer
> job, though. I'm running over a data set of ~5 million entries x ~3 million
> boolean features, where each entry has no more than 10 non-zeros. With 256
> mappers, ~95% of them finish within 10 minutes. The last 5% get stuck at
> random levels of completion, like 44.47%, and just sit there for ages,
> spilling more and more output but never advancing the completion counter.
> Eventually, after as much as 8 hours, they jump to 100%, merge their
> output, and finish.
> 
> It's usually the early map tasks that have trouble. Right now I'm sitting
> with all tasks done except mappers 0-4, which are stuck at various stages
> of completion.
> 
> Is there something about the ordering of the output of the
> RowSimilarityJob-VectorNormMapper-Reducer job that would consistently cause
> the early map tasks of the RowSimilarityJob-CooccurrencesMapper-Reducer job
> to take forever? Is there any tuning I can do to distribute this load more
> evenly, so that 5% of my mappers don't slow my job down so badly?
> 


Re: Using multiple reducers with rowsimilarity job

Posted by Matt Molek <mp...@gmail.com>.
Having found a few mentions of running rowsimilarity with multiple
reducers, I assume it's ok.

I'm having a problem with the RowSimilarityJob-CooccurrencesMapper-Reducer
job, though. I'm running over a data set of ~5 million entries x ~3 million
boolean features, where each entry has no more than 10 non-zeros. With 256
mappers, ~95% of them finish within 10 minutes. The last 5% get stuck at
random levels of completion, like 44.47%, and just sit there for ages,
spilling more and more output but never advancing the completion counter.
Eventually, after as much as 8 hours, they jump to 100%, merge their output,
and finish.

It's usually the early map tasks that have trouble. Right now I'm sitting
with all tasks done except mappers 0-4, which are stuck at various stages of
completion.

Is there something about the ordering of the output of the
RowSimilarityJob-VectorNormMapper-Reducer job that would consistently cause
the early map tasks of the RowSimilarityJob-CooccurrencesMapper-Reducer job
to take forever? Is there any tuning I can do to distribute this load more
evenly, so that 5% of my mappers don't slow my job down so badly?