Posted to user@mahout.apache.org by Kris Jack <mr...@gmail.com> on 2011/07/14 16:11:48 UTC

Understanding mahout's recommendation system parameters

Hello,

I'm trying to get a better understanding of the following 2 RecommenderJob
parameters:
1) --maxCooccurrencesPerItem (integer): Maximum number of cooccurrences
considered per item (100)
2) --maxSimilaritiesPerItem (integer): Maximum number of similarities
considered per item (100)

Could you please help me to understand these in terms of a recommender job
where we are trying to recommend items to users?

From what I see, maxCooccurrencesPerItem first gets used in job 4/12 in the
pipeline, the MaybePruneRowsMapper job.  Does maxCooccurrencesPerItem limit
the number of cooccurrences that are kept for that item?  Is this limit
within a single user's set of items or globally for all users?  For example,
if a user has 100 items then each item can be seen to cooccur with the 99
other items.  Taking all user libraries, however, assume that it cooccurs
with 1,000,000 other items.  Does maxCooccurrencesPerItem limit the number
of cooccurrences on a user item set basis or is this applied to the set of
items with which the item cooccurs with regard to all user libraries?  Also,
how is the selection made (most frequent or first found)?

maxSimilaritiesPerItem first gets used in job 7/12 in the pipeline,
EntriesToVectorsReducer.  Does this cap the number of rows that are compared
with one another?  Are the rows cooccurrence vectors of items for a given
user by this point in the process?

Thanks,
Kris

Re: Understanding mahout's recommendation system parameters

Posted by Kris Jack <mr...@gmail.com>.
Hello Sebastian,

Thanks very much for those explanations, very useful indeed.

For 1) --maxCooccurrencesPerItem, I understand that it's a cap on the number
of items that are considered for individual users.  This is very clear.  I
also agree with you that a more intelligent form of sampling could be useful
here.

For 2) --maxSimilaritiesPerItem, I'm not so sure that I follow, so I'd like
to confirm my understanding with you.  Imagine that we have item A, that has
been co-rated with items B, C and D.  As a user, I have rated items B, C and
D, and I'd like to predict my rating for item A.  If
--maxSimilaritiesPerItem is set to 1, then only my rating for one of B, C
and D will be taken into account?  Similarly, if --maxSimilaritiesPerItem is
set to 2, then values for 2 of B, C and D will be taken into account?  If
that's correct, how is the selection made (e.g. random, frequency of
co-rating)?

Thanks,
Kris





-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/

Re: Understanding mahout's recommendation system parameters

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Jack,

I'll try to answer your questions in as much detail as possible:

Regarding point 2) --maxSimilaritiesPerItem

RecommenderJob uses item-based collaborative filtering to compute the 
recommendations and is a parallelized implementation of the algorithm 
presented in [1]. The main idea is to use a "neighbourhood" of similar 
items that have already been rated by a user to estimate his/her 
preference towards an unknown item. These similar items are found by 
comparing the ratings of frequently co-rated items according to some 
similarity measure. The parameter --maxSimilaritiesPerItem lets you 
specify the number of similar items per item to consider when estimating 
preferences towards an unknown item. Usually a small number of similar 
items is sufficient; have a look at [1] for some numbers and experiments.
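
To make that concrete, here is a toy sketch in plain Java (not Mahout 
code; the class and method names are made up) of the kind of estimation 
described above: the preference towards an unknown item is predicted as 
a similarity-weighted average over the k most similar items the user 
has already rated, with k playing the role of --maxSimilaritiesPerItem:

```java
import java.util.*;

public class TopKPrediction {

    /** A similar item the user has already rated: its similarity to the
     *  target item and the user's rating for it. */
    static class Neighbour {
        final double similarity;
        final double rating;
        Neighbour(double similarity, double rating) {
            this.similarity = similarity;
            this.rating = rating;
        }
    }

    /** Similarity-weighted average over the k most similar rated items. */
    static double predict(List<Neighbour> neighbours, int k) {
        List<Neighbour> sorted = new ArrayList<>(neighbours);
        sorted.sort((a, b) -> Double.compare(b.similarity, a.similarity));
        double num = 0, den = 0;
        for (Neighbour n : sorted.subList(0, Math.min(k, sorted.size()))) {
            num += n.similarity * n.rating;
            den += Math.abs(n.similarity);
        }
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        // Items B, C, D co-rated with the unknown item A.
        List<Neighbour> rated = List.of(
                new Neighbour(0.9, 4.0),   // B
                new Neighbour(0.5, 2.0),   // C
                new Neighbour(0.1, 5.0));  // D
        System.out.println(predict(rated, 1)); // only B is considered
        System.out.println(predict(rated, 3)); // all three contribute
    }
}
```

With k = 1 only the single most similar rated item determines the 
estimate; larger k values blend in weaker neighbours.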

Regarding point 1) --maxCooccurrencesPerItem

In order to compute the item-item similarities, a naive approach would 
have to consider all possible pairs of items, which has quadratic 
complexity and obviously won't scale.

RowSimilarityJob, which is at the heart of both RecommenderJob and 
ItemSimilarityJob, ensures that only pairs of items that have been 
co-rated at least once are taken into consideration. This helps a lot 
in recommendation use cases, as most users have rated only a very small 
number of items.
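
As a toy illustration (plain Java, not the actual RowSimilarityJob 
code), enumerating candidate pairs per user shows why this helps: only 
pairs that some user has co-rated can appear at all, which is far fewer 
than all n*(n-1)/2 possible pairs when user histories are short:

```java
import java.util.*;

public class CoRatedPairs {

    /** Emit every unordered item pair that some user has co-rated. */
    static Set<String> coRatedPairs(Map<String, List<String>> itemsByUser) {
        Set<String> pairs = new HashSet<>();
        for (List<String> items : itemsByUser.values()) {
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String a = items.get(i), b = items.get(j);
                    // Canonical order so (A,B) and (B,A) collapse to one pair.
                    pairs.add(a.compareTo(b) < 0 ? a + "," + b : b + "," + a);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> data = Map.of(
                "user1", List.of("A", "B"),
                "user2", List.of("B", "C"));
        // Only 2 of the 3 possible pairs are co-rated: {A,B} and {B,C};
        // {A,C} never needs to be compared.
        System.out.println(coRatedPairs(data));
    }
}
```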

However, if you look at the distribution of the number of ratings per 
user or per item, it will usually follow a heavy-tailed distribution, 
which means that there is a small number of items ("topsellers") with an 
exorbitant number of ratings, as well as a small number of users 
("powerusers") that show the same behavior.

These powerusers and topsellers might slow down the similarity 
computation by orders of magnitude (as all pairs of items that have 
been co-rated have to be considered, which is still quadratic growth) 
without providing much additional insight. I think Ted wrote a mail to 
this list some time ago confirming this observation from his experience.

So we need some way to sample these ratings down. This is done in 
MaybePruneRowsMapper with a very simple heuristic controlled by 
--maxCooccurrencesPerItem: it only looks at the portion of data 
available to that single mapper instance and might throw away ratings 
for very frequently rated items.
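
A minimal sketch of that kind of capping heuristic (hypothetical code, 
the real MaybePruneRowsMapper differs in detail): within one mapper's 
slice of the data, once an item has been seen maxCooccurrencesPerItem 
times, further ratings for it are dropped:

```java
import java.util.*;

public class PruneSketch {

    /** Cap how many ratings per item survive within one mapper's slice
     *  of the data. Input is the item id of each rating, in arrival
     *  order; output is which positions are kept. */
    static List<Integer> keptPositions(List<String> itemOfRating, int maxPerItem) {
        Map<String, Integer> seen = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < itemOfRating.size(); i++) {
            int count = seen.merge(itemOfRating.get(i), 1, Integer::sum);
            if (count <= maxPerItem) {
                // First-seen ratings survive; later ones for a hot item are dropped.
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "topseller" appears 4 times, but only 2 of its ratings survive
        // with a cap of 2; the niche item is untouched.
        List<String> items = List.of(
                "topseller", "niche", "topseller", "topseller", "topseller");
        System.out.println(keptPositions(items, 2)); // [0, 1, 2]
    }
}
```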

I think this is a point where a lot of optimization is possible; Mahout 
should provide support for customizable sampling strategies here, like 
looking only at the x latest ratings of a user, for example.
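
Such a strategy could look roughly like this (purely hypothetical 
sketch, not a Mahout API): sort one user's history by timestamp and 
keep only the x most recent ratings:

```java
import java.util.*;

public class LatestRatingsSampler {

    /** A rating with a timestamp, from one user's history. */
    static class TimedRating {
        final String item;
        final long timestamp;
        TimedRating(String item, long timestamp) {
            this.item = item;
            this.timestamp = timestamp;
        }
    }

    /** Keep only the x most recent ratings of a user's history. */
    static List<TimedRating> latest(List<TimedRating> history, int x) {
        List<TimedRating> sorted = new ArrayList<>(history);
        sorted.sort((a, b) -> Long.compare(b.timestamp, a.timestamp)); // newest first
        return sorted.subList(0, Math.min(x, sorted.size()));
    }

    public static void main(String[] args) {
        List<TimedRating> history = List.of(
                new TimedRating("A", 100),
                new TimedRating("B", 300),
                new TimedRating("C", 200));
        for (TimedRating r : latest(history, 2)) {
            System.out.println(r.item); // prints B then C
        }
    }
}
```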


--sebastian

[1] Sarwar et al., "Item-Based Collaborative Filtering Recommendation 
Algorithms"
http://portal.acm.org/citation.cfm?id=372071

