Posted to mapreduce-dev@hadoop.apache.org by Amit Sangroya <sa...@gmail.com> on 2011/09/14 13:28:58 UTC

RecommenderJob Mahout Creating a data model

Hi all,

I am trying to run the example from
https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering
with the following command:

    bin/mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
      -Dmapred.input.dir=input -Dmapred.output.dir=output \
      --itemsFile itemfile --tempDir tempDir

The algorithm estimates the preference of a user towards an item which he/she
has not yet seen. Once an algorithm can predict preferences, it can also be
used for Top-N recommendation, where the task is to find the N items a
given user might like best. The page mentions that, given a DataModel, it can
produce recommendations.
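
As a rough illustration of the idea described above (a sketch, not Mahout's
actual implementation), a preference can be estimated as a similarity-weighted
average of the ratings the user has already given, and Top-N then ranks the
unseen items by that estimate. The toy ratings, the similarity values, and the
names `sim`, `predict` and `top_n` are all made up for this example.

```python
# Hypothetical toy data: ratings one user has given, and item-item similarities.
user_ratings = {"A": 5.0, "B": 3.0}
similarity = {
    ("A", "C"): 0.9, ("B", "C"): 0.1,
    ("A", "D"): 0.2, ("B", "D"): 0.8,
}


def sim(i, j):
    """Look up a symmetric item-item similarity; unknown pairs count as 0."""
    return similarity.get((i, j), similarity.get((j, i), 0.0))


def predict(user_ratings, item):
    """Estimate a preference as the similarity-weighted average of known ratings."""
    num = sum(sim(item, rated) * r for rated, r in user_ratings.items())
    den = sum(abs(sim(item, rated)) for rated in user_ratings)
    return num / den if den else 0.0


def top_n(user_ratings, candidates, n):
    """Rank unseen candidate items by estimated preference, best first."""
    unseen = [c for c in candidates if c not in user_ratings]
    scored = [(c, predict(user_ratings, c)) for c in unseen]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


print(top_n(user_ratings, ["A", "B", "C", "D"], 2))
```

Items the user has already rated ("A" and "B" here) are excluded, so only "C"
and "D" are ranked.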

The algorithm takes approx. 5 minutes to generate the top 5 recommendations
for one user on a 10-node Hadoop cluster. The input has been reduced to only
200 users from the "1 Million MovieLens Dataset" from Grouplens.org.

I have a few questions:

1) Is it possible to separate the data model building step from the step
that generates recommendations?

2) Can the model, once generated from the training data, be reused to
generate recommendations for a range of users?

3) To be specific, if I want to provide an online service that generates
recommendations for users, can I minimize the cost of the MapReduce
interactions on each request?

I am not a data mining expert. Please help me understand this better.


Thanks and Regards,
Amit

Re: RecommenderJob Mahout Creating a data model

Posted by Sean Owen <sr...@gmail.com>.
What do you mean by "isolate the data model building step"? You can
run or re-run any step you want in the chain.

So I guess the answer to 2 is "yes", if you mean computed item-item
similarities. But these will change slowly over time and need to be
recomputed sometimes.
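
To make the "computed item-item similarities" concrete, here is a minimal
sketch of building such a table once from all users' ratings. It uses plain
cosine similarity over co-rated items, which is an assumption chosen for
brevity and not necessarily the measure RecommenderJob is configured with; the
toy ratings and the name `item_similarities` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Compute cosine similarity for every pair of items that share a user.

    `ratings` maps user -> {item: rating}. The result maps (item_i, item_j),
    with item_i < item_j, to a similarity value.
    """
    dot = defaultdict(float)   # sum of rating products per co-rated item pair
    norm = defaultdict(float)  # sum of squared ratings per item
    for prefs in ratings.values():
        for item, r in prefs.items():
            norm[item] += r * r
        for (i, ri), (j, rj) in combinations(sorted(prefs.items()), 2):
            dot[(i, j)] += ri * rj
    return {
        (i, j): d / (sqrt(norm[i]) * sqrt(norm[j]))
        for (i, j), d in dot.items()
        if norm[i] and norm[j]  # skip items whose ratings are all zero
    }


# Hypothetical toy ratings: user -> {item: rating}.
ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 4.0},
}

model = item_similarities(ratings)  # built once, reusable for any user
print(model)
```

The resulting table does not depend on which user is being served, so it can
be kept and reused until enough new ratings arrive to justify recomputing it.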

MapReduce is never ever something that works in real-time, so if your
question 3 is whether it can answer real-time queries -- no. You would
always pre-compute your results and serve them up at runtime.
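
The "pre-compute and serve" pattern can be sketched as loading the batch
output into memory once and answering each request with a lookup. The line
format below (`userID<TAB>[itemID:score,...]`) is an assumption about what the
batch job writes, and `parse_recommendations`, `recommend` and the sample
lines are hypothetical names and data for illustration only.

```python
def parse_recommendations(lines):
    """Parse lines such as '1\t[101:4.8,102:3.4]' into {user_id: [(item_id, score), ...]}."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, _, payload = line.partition("\t")
        items = []
        for entry in payload.strip("[]").split(","):
            if not entry:
                continue  # empty list, e.g. '3\t[]'
            item, _, score = entry.partition(":")
            items.append((int(item), float(score)))
        table[int(user)] = items
    return table


# Hypothetical batch output; in practice these lines would be read from the
# job's output files after each (infrequent) batch run.
batch_output = ["1\t[101:4.8,102:3.4]", "2\t[103:4.1]"]
recommendations = parse_recommendations(batch_output)


def recommend(user_id, n=5):
    """Serve a request from the in-memory table; no MapReduce job is involved."""
    return recommendations.get(user_id, [])[:n]


print(recommend(1, n=1))
print(recommend(42))
```

Users missing from the batch output simply get an empty list here; a real
service would need some fallback for them.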

It sounds like you are running on a very tiny data set. All of the
time is spent in Hadoop overhead, like starting up workers. It's not
efficient or necessary to use Hadoop at this scale.

Sean

On Wed, Sep 14, 2011 at 3:36 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> This should probably be directed more toward the Mahout list than the Hadoop Map/reduce one.
>
> mahout-user@apache.org
>
> --Bobby Evans

Re: RecommenderJob Mahout Creating a data model

Posted by Sebastian Schelter <ss...@apache.org>.
Hello Amit,

I think it is best to start with you giving us more details about your
use case. How much data do you have? How many users? What kind of domain
does your system live in?

If you answer these questions first, I'm confident we'll figure out the
best way for you to use Mahout.

Mahout's recommender code supports lots of scenarios, ranging from
in-memory recommenders on a single machine for small data to massive
batch recommendation computation on Hadoop for datasets with tens of
millions of interactions.

We'll have to find out how much complexity you really have to take on.

--sebastian


On 14.09.2011 16:36, Robert Evans wrote:
> This should probably be directed more toward the Mahout list than the Hadoop Map/reduce one.
> 
> mahout-user@apache.org
> 
> --Bobby Evans

Re: RecommenderJob Mahout Creating a data model

Posted by Robert Evans <ev...@yahoo-inc.com>.
This should probably be directed more toward the Mahout list than the Hadoop Map/reduce one.

mahout-user@apache.org

--Bobby Evans

