Posted to user@mahout.apache.org by Ma...@htc.com on 2012/05/10 11:29:07 UTC

Extra low speed mahout distribution with hadoop

Hi, I am studying Mahout with the Mahout in Action book.
I have downloaded the Wikipedia links database and tried to run a
recommendation on it using Mahout and Hadoop.
I used the following command:
hadoop jar mahout-core-0.6-job.jar
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output --usersFile
input/users.txt --booleanData true -s SIMILARITY_LOGLIKELIHOOD

The command took about 4 hours to execute on my MacBook Pro. On the
same MacBook, the recommendation without Hadoop took about
2 minutes. So why is Mahout+Hadoop so slow?

Re: Extra low speed mahout distribution with hadoop

Posted by Sean Owen <sr...@gmail.com>.

If you need anything for "one user" and "quickly" then you don't want
to use Hadoop. :) It is extremely inefficient to try to use Hadoop
this way since by running it you are recomputing every single
similarity, then producing one recommendation. At best, use Hadoop to
precompute similarities, then use them to recommend in real-time.

This is really the right way, I think, to use Hadoop with
recommendations. You use Hadoop to compute, offline and periodically,
some underlying model. Then you load that into a server that can
quickly make any recommendation from it at run-time. This is the
architecture I'm building in the (Mahout-based) Myrrix recommender
engine (myrrix.com)
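
The offline/online split described above can be sketched roughly as
follows. This is a minimal, in-memory illustration, not Mahout or Myrrix
code: the function names, the toy data, and the choice of Tanimoto
similarity (instead of the log-likelihood similarity used in the original
command) are all assumptions made to keep the example short.

```python
from collections import defaultdict
from itertools import combinations


def precompute_item_similarities(prefs):
    """Offline step: Tanimoto similarity for every co-occurring item pair.

    prefs maps user id -> set of item ids (boolean preferences).
    In the architecture described above, this is the part Hadoop would run.
    """
    users_by_item = defaultdict(set)
    for user, items in prefs.items():
        for item in items:
            users_by_item[item].add(user)

    model = defaultdict(dict)
    for a, b in combinations(sorted(users_by_item), 2):
        both = len(users_by_item[a] & users_by_item[b])
        if both:
            either = len(users_by_item[a] | users_by_item[b])
            model[a][b] = model[b][a] = both / either
    return model


def recommend(model, user_items, how_many=3):
    """Online step: score unseen items by summed similarity to the user's items."""
    scores = defaultdict(float)
    for item in user_items:
        for other, similarity in model.get(item, {}).items():
            if other not in user_items:
                scores[other] += similarity
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:how_many]]


# Toy data, invented for illustration only.
PREFS = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c", "d"}, 4: {"a", "d"}}

MODEL = precompute_item_similarities(PREFS)  # slow, periodic, offline
TOP = recommend(MODEL, PREFS[1])             # fast, once per request
```

The expensive pairwise work happens once in precompute_item_similarities;
each call to recommend only reads the stored similarities for the items
that one user already has.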

Sean


Re: Extra low speed mahout distribution with hadoop

Posted by Maksim Areshkau <ma...@htc.com>.

On 10.05.12 12:31, "Sean Owen" <sr...@gmail.com> wrote:


>Hadoop just about always makes things slower, in terms of total
>resources needed. It adds a lot of overhead; such is the price of
>parallelism. My rule of thumb is that Hadoop-based algorithms will,
>all else equal, take 4x more CPU hours. But of course Hadoop lets you
>distribute.
>
>However, I doubt that's the total explanation here. What did you do in
>2 minutes? I can't believe even the Mahout non-distributed recommender
>would build its model and make recs for *all* users in that time.
>Really?
Nope. I just need a recommendation for one user, but I need this
recommendation quickly.
As input for Hadoop I used two files: input.txt (the Wikipedia
database) and users.txt (a file with the IDs of the users for whom I
need a recommendation).
So do I need a parameter to specify that I need a recommendation for
only one user?
>  Remember that RecommenderJob is computing all recommendations
>for all users. The non-distributed recommender doesn't do anything
>like that until you ask it.

Re: Extra low speed mahout distribution with hadoop

Posted by Sean Owen <sr...@gmail.com>.

Hadoop just about always makes things slower, in terms of total
resources needed. It adds a lot of overhead; such is the price of
parallelism. My rule of thumb is that Hadoop-based algorithms will,
all else equal, take 4x more CPU hours. But of course Hadoop lets you
distribute.
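
To make the rule of thumb concrete, here is a small back-of-the-envelope
calculation. It is an idealized sketch: it assumes the work splits
perfectly evenly across workers, which real jobs rarely achieve, and the
function name and the default factor of 4 are taken only from the rule of
thumb above, not from any measurement.

```python
def distributed_wall_clock_hours(single_machine_hours, workers, overhead_factor=4.0):
    """Estimate wall-clock time for a distributed run of the same computation.

    Assumes total CPU time grows by overhead_factor and then divides
    evenly across the workers (an optimistic simplification).
    """
    total_cpu_hours = single_machine_hours * overhead_factor
    return total_cpu_hours / workers


# A job that needs 1 CPU hour when run non-distributed:
ON_ONE_LAPTOP = distributed_wall_clock_hours(1.0, workers=1)   # 4x slower
BREAK_EVEN = distributed_wall_clock_hours(1.0, workers=4)      # same wall-clock time
ON_CLUSTER = distributed_wall_clock_hours(1.0, workers=16)     # 4x faster
```

Under these assumptions, a single machine pays the full overhead and gets
nothing back; the distributed version only wins on wall-clock time once
there are more than about four workers.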

However, I doubt that's the total explanation here. What did you do in
2 minutes? I can't believe even the Mahout non-distributed recommender
would build its model and make recs for *all* users in that time.
Really?  Remember that RecommenderJob is computing all recommendations
for all users. The non-distributed recommender doesn't do anything
like that until you ask it.
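
The difference between "all recommendations for all users" and "one user
when asked" can be illustrated by counting similarity evaluations. This is
a toy sketch with invented data and hypothetical function names; it does
not reflect how RecommenderJob or the non-distributed Mahout recommenders
are actually implemented, and it uses Tanimoto similarity for brevity.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) overlap of two user sets, for boolean preference data."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0


def index_by_item(prefs):
    """Invert user -> items into item -> users."""
    users_by_item = {}
    for user, items in prefs.items():
        for item in items:
            users_by_item.setdefault(item, set()).add(user)
    return users_by_item


def recommend_for_user(prefs, user):
    """On-demand path: score only the candidate items for one user.

    Returns (ranked items, number of similarity evaluations performed).
    """
    users_by_item = index_by_item(prefs)
    owned = prefs[user]
    scores, evaluations = {}, 0
    for candidate in users_by_item:
        if candidate in owned:
            continue
        total = 0.0
        for item in owned:
            total += tanimoto(users_by_item[candidate], users_by_item[item])
            evaluations += 1
        scores[candidate] = total
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked, evaluations


def recommend_for_all(prefs):
    """Batch path: the same work repeated for every user in the data set."""
    results, evaluations = {}, 0
    for user in prefs:
        results[user], used = recommend_for_user(prefs, user)
        evaluations += used
    return results, evaluations


# Toy data, invented for illustration only.
PREFS = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c", "d"}, 4: {"a", "d"}}

ONE_USER, ONE_COST = recommend_for_user(PREFS, 1)
ALL_USERS, ALL_COST = recommend_for_all(PREFS)
```

Even on four users the batch path does several times the work of a single
request; on a data set the size of the Wikipedia links dump the gap is
what separates minutes from hours.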
