Posted to user@mahout.apache.org by Chris Schilling <ch...@cellixis.com> on 2011/02/19 00:43:31 UTC

user-user recommendations

Hello again,

Very simple question here:  I am also testing the user-user cf in mahout.  So, once I define my user neighborhood, is it possible to select the recommendations from that based on the number of preferences per item rather than a weighted average?  Basically, I'd like to recommend the items with the most preferences.  It would be simple to implement, so I was curious if this was already possible.  I understand that in this case, the counts become dependent on the size of the neighborhood. This is something I'd want to use for testing.
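
Something like this rough, untested sketch is what I have in mind (it assumes the usual Taste wiring: a DataModel, a UserNeighborhood built from some UserSimilarity, and the target user's ID):

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;

    // Count how many neighbors expressed a preference for each item;
    // candidate items would then be ranked by that count.
    static FastByIDMap<Integer> countInNeighborhood(DataModel model,
        UserNeighborhood neighborhood, long userID) throws TasteException {
      long[] neighbors = neighborhood.getUserNeighborhood(userID);
      FastByIDMap<Integer> counts = new FastByIDMap<Integer>();
      for (long neighbor : neighbors) {
        PreferenceArray prefs = model.getPreferencesFromUser(neighbor);
        for (int i = 0; i < prefs.length(); i++) {
          long itemID = prefs.getItemID(i);
          Integer soFar = counts.get(itemID);
          counts.put(itemID, soFar == null ? 1 : soFar + 1);
        }
      }
      // Sort the entries descending by count and drop items the target user
      // has already rated to get the top-N list.
      return counts;
    }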

Thanks
Chris

Re: user-user recommendations

Posted by Chris Schilling <ch...@cellixis.com>.
Thanks Sean!

SVD is the next stop.  Thanks for all the help.  Been learning a lot the past few days!

Chris


On Feb 19, 2011, at 9:21 AM, Sean Owen wrote:

> Yes this is the essential problem with some similarity metrics like
> Pearson correlation. In its pure form, it takes no account of the size
> of the data set on which the calculation is based. (That's why the
> framework has a crude variation which you can invoke with
> Weighting.WEIGHTED, to factor this in.)
> 
> I think your proposal perhaps goes far the other way, completely
> favoring "count". But it's not crazy or anything and probably works
> reasonably in some data sets.
> 
> There are many ways you could modify these stock algorithms to account
> for the effects you have in mind. Most of what's in the framework is
> just the basic ideas that come from canonical books and papers.
> 
> Here's another idea to play with: instead of weighting an item's
> score by average similarity to the user's preferred items, weight by
> average minus standard deviation. This tends to penalize candidate
> items that are similar to only a few of the user's items, since there
> will be only a few data points and the standard deviation will be larger.
> 
> Matrix factorization / SVD-based approaches are deeper magic -- more
> complex, more computation, much harder math, but theoretically more
> powerful. I'd see how far you can get on a basic user-user approach
> (or item-item) as a baseline and then go dig into these.
> 
> 
> On Sat, Feb 19, 2011 at 12:02 PM, Chris Schilling <ch...@cellixis.com> wrote:
>> Hey Sean,
>> 
>> Thank you for the detailed reply.  Interesting points.  I think I have approached some of these points in my subsequent emails.
>> 
>> You bring up the case where all the users hate the same item.  What about the case where very few (a single?) similar users love a place?  In that case, is this really a better recommendation than the popular vote?  Where is the middle ground?  I think it's an interesting point.  I'll see how the SVD performs.
>> 


Re: user-user recommendations

Posted by Sean Owen <sr...@gmail.com>.
Yes this is the essential problem with some similarity metrics like
Pearson correlation. In its pure form, it takes no account of the size
of the data set on which the calculation is based. (That's why the
framework has a crude variation which you can invoke with
Weighting.WEIGHTED, to factor this in.)
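
For reference, the switch is made at the similarity constructor, roughly like this ('model' is your DataModel; exception handling omitted):

    import org.apache.mahout.cf.taste.common.Weighting;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // Plain Pearson ignores how many co-rated items the correlation is based on.
    UserSimilarity plain = new PearsonCorrelationSimilarity(model);
    // The WEIGHTED variant damps similarities computed from little overlap.
    UserSimilarity weighted = new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);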

I think your proposal perhaps goes far the other way, completely
favoring "count". But it's not crazy or anything and probably works
reasonably in some data sets.

There are many ways you could modify these stock algorithms to account
for the effects you have in mind. Most of what's in the framework is
just the basic ideas that come from canonical books and papers.

Here's another idea to play with: instead of weighting an item's
score by average similarity to the user's preferred items, weight by
average minus standard deviation. This tends to penalize candidate
items that are similar to only a few of the user's items, since there
will be only a few data points and the standard deviation will be larger.
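
As a rough sketch of that scoring, in plain Java and not tied to any particular class in the framework:

    // Score a candidate item by the mean of its similarities to the user's
    // items, minus their standard deviation; items backed by only a few
    // (or noisy) data points get pulled down automatically.
    static double meanMinusStdDev(double[] similarities) {
      int n = similarities.length;
      if (n == 0) {
        return 0.0;
      }
      double sum = 0.0;
      for (double s : similarities) {
        sum += s;
      }
      double mean = sum / n;
      double sqDiffSum = 0.0;
      for (double s : similarities) {
        sqDiffSum += (s - mean) * (s - mean);
      }
      return mean - Math.sqrt(sqDiffSum / n);
    }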

Matrix factorization / SVD-based approaches are deeper magic -- more
complex, more computation, much harder math, but theoretically more
powerful. I'd see how far you can get on a basic user-user approach
(or item-item) as a baseline and then go dig into these.


On Sat, Feb 19, 2011 at 12:02 PM, Chris Schilling <ch...@cellixis.com> wrote:
> Hey Sean,
>
> Thank you for the detailed reply.  Interesting points.  I think I have approached some of these points in my subsequent emails.
>
> You bring up the case where all the users hate the same item.  What about the case where very few (a single?) similar users love a place?  In that case, is this really a better recommendation than the popular vote?  Where is the middle ground?  I think it's an interesting point.  I'll see how the SVD performs.
>

Re: user-user recommendations

Posted by Chris Schilling <ch...@cellixis.com>.
Hey Sean,

Thank you for the detailed reply.  Interesting points.  I think I have approached some of these points in my subsequent emails. 

You bring up the case where all the users hate the same item.  What about the case where very few (a single?) similar users love a place?  In that case, is this really a better recommendation than the popular vote?  Where is the middle ground?  I think it's an interesting point.  I'll see how the SVD performs.


On Feb 18, 2011, at 11:20 PM, Sean Owen wrote:

> User-user similarity is based on these counts? That sounds a bit like
> the Tanimoto / Jaccard coefficient. See TanimotoCoefficientSimilarity.
> Yes, you can use that, though log-likelihood is probably a more
> sophisticated choice.
> 
> Recommending an item that occurs most in the neighborhood? Sure you
> can make it work that way. It probably works "OK" in practice though
> you can see possible problems with it. What if everyone in the
> neighborhood hates an item? This would recommend it highly. It's also
> throwing away the degree of similarity to the user who likes an item.
> 
> The conventional wisdom in recommenders is that you want to fight the
> tendency to always recommend well-known items. People probably already
> know about the well-known items even if they've not rated them yet. It
> also makes the recommendations less personalized in a sense -- the
> recommendation result approaches the one you'd get by just
> recommending the globally most-preferred items.
> 
> If your goal is to fight sparseness, start looking at SVD-based
> methods. This is really the point of SVDs, to "summarize" a very
> high-dimensional user-item matrix in a much lower-dimensional "user
> group" - "item group" matrix. Maybe you don't have enough information
> to recommend Bauhaus to Joan, a teenage goth, but the SVD lets you
> sort of draw conclusions like "gothy teens like Peter Murphy's
> albums". That is, the summary is much less sparse and so works better
> for recommendations involving users/items that otherwise have little
> connection to the rest of the matrix.
> 
> 
> On Sat, Feb 19, 2011 at 2:43 AM, Chris Schilling <ch...@cellixis.com> wrote:
>> Hello again,
>> 
>> Very simple question here:  I am also testing the user-user cf in mahout.  So, once I define my user neighborhood, is it possible to select the recommendations from that based on the number of preferences per item rather than a weighted average?  Basically, I'd like to recommend the items with the most preferences.  It would be simple to implement, so I was curious if this was already possible.  I understand that in this case, the counts become dependent on the size of the neighborhood. This is something I'd want to use for testing.
>> 
>> Thanks
>> Chris


Re: user-user recommendations

Posted by Sean Owen <sr...@gmail.com>.
User-user similarity is based on these counts? That sounds a bit like
the Tanimoto / Jaccard coefficient. See TanimotoCoefficientSimilarity.
Yes, you can use that, though log-likelihood is probably a more
sophisticated choice.
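
For reference, either one drops in as the UserSimilarity, roughly like this ('model' is your DataModel; exception handling omitted):

    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // Both ignore preference values and look only at which items overlap,
    // which matches a count-based notion of similarity.
    UserSimilarity tanimoto = new TanimotoCoefficientSimilarity(model);
    UserSimilarity loglikelihood = new LogLikelihoodSimilarity(model);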

Recommending an item that occurs most in the neighborhood? Sure you
can make it work that way. It probably works "OK" in practice though
you can see possible problems with it. What if everyone in the
neighborhood hates an item? This would recommend it highly. It's also
throwing away the degree of similarity to the user who likes an item.

The conventional wisdom in recommenders is that you want to fight the
tendency to always recommend well-known items. People probably already
know about the well-known items even if they've not rated them yet. It
also makes the recommendations less personalized in a sense -- the
recommendation result approaches the one you'd get by just
recommending the globally most-preferred items.

If your goal is to fight sparseness, start looking at SVD-based
methods. This is really the point of SVDs, to "summarize" a very
high-dimensional user-item matrix in a much lower-dimensional "user
group" - "item group" matrix. Maybe you don't have enough information
to recommend Bauhaus to Joan, a teenage goth, but the SVD lets you
sort of draw conclusions like "gothy teens like Peter Murphy's
albums". That is, the summary is much less sparse and so works better
for recommendations involving users/items that otherwise have little
connection to the rest of the matrix.
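
When you do get there, the wiring looks roughly like this in recent releases -- class names and parameter values here are from memory, so treat them as assumptions to check against the version you're running (exception handling omitted):

    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    // Factor the user-item matrix into 20 latent features; the lambda and
    // iteration count are regularization / convergence knobs to tune.
    ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.065, 15);
    Recommender recommender = new SVDRecommender(model, factorizer);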


On Sat, Feb 19, 2011 at 2:43 AM, Chris Schilling <ch...@cellixis.com> wrote:
> Hello again,
>
> Very simple question here:  I am also testing the user-user cf in mahout.  So, once I define my user neighborhood, is it possible to select the recommendations from that based on the number of preferences per item rather than a weighted average?  Basically, I'd like to recommend the items with the most preferences.  It would be simple to implement, so I was curious if this was already possible.  I understand that in this case, the counts become dependent on the size of the neighborhood. This is something I'd want to use for testing.
>
> Thanks
> Chris

Re: user-user recommendations (Implementation)

Posted by Chris Schilling <ch...@cellixis.com>.
Okay, 

I was wrong about the protected constructors in AbstractRecommender.  I am able to extend that class in my code without a problem.  Sorry for the noise there.  Perhaps it would still add to the modularity if there were a constructor that allowed passing a personalized Evaluator class.


On Feb 18, 2011, at 9:36 PM, Chris Schilling wrote:

> Hi again!
> 
> I was able to "hijack" the GenericUserBasedRecommender class in my code to produce recommendations based on the count rather than the weighted average.  In order to do this, I had to copy both the AbstractUserRecommender and the GUBR classes into my own code and then change the implementation of the Evaluator in GUBR.  I am reasonably new to Java, so there may have been a better way, but this seemed to be the quickest solution.  One reason I thought this was necessary is because the constructors in AbstractRecommender are protected, so it makes it difficult (impossible?) to extend this class outside of the impl.recommender package within Mahout.  I did not want to change the Mahout code and recompile.  
> 
> Now, unless I am wrong, I can see two ways of improving (?) the implementation of the Generic*BasedRecommenders.  One would be the addition of a constructor which allowed the passing of an Evaluator.  The other way would be to make the constructors/methods public in the AbstractRecommender.  Then, I can extend this class in my own code.  I understand that Mahout was built to be somewhat standalone.  However, there are cases when it would be nice to personalize the evaluation outside of Mahout.  So, maybe the ability to pass an implementation of Evaluator to the constructor is a better option.  
> 
> On a slightly different yet related note, I think a better metric than the count (for user-based) would be the sum of similarities between users who have rated the item.  Again, I think the weighted average of sum(rating*similarity)/sum(similarity) shows the problem noted in my previous posts. 
> 
> Again, any thoughts are appreciated :)
> 
> Thank you for this beautiful framework!  I have been really enjoying all the discussion and learning the intricacies of recommendation engines (and ML in general).
> Chris
> 
> On Feb 18, 2011, at 5:29 PM, Chris Schilling wrote:
> 
>> So, I've been thinking about this a bit more.
>> 
>> Take an example:  I have rated a very small number of items.  I am able to extract a neighborhood of similar users.  Now let's say there is a single user who has rated the same items with the same ratings, but this user is the only rater in my neighborhood who has rated an obscure item very highly.  Using a weighted average to predict my recommendations, this obscure item would rise to the top of the list.  In this case, it seems like items rated the most would be better recommendations.  
>> 
>> I was able to hijack the GenericUserBasedRecommender and change the calculation of the estimated preference to return the count rather than the weighted average.  In my case, this seems to return more intuitive results.  
>> 
>> Again this is related to the sparseness of the data, but I could see this type of thing occurring often. Any thoughts?
>> 
>> 
>> On Feb 18, 2011, at 3:43 PM, Chris Schilling wrote:
>> 
>>> Hello again,
>>> 
>>> Very simple question here:  I am also testing the user-user cf in mahout.  So, once I define my user neighborhood, is it possible to select the recommendations from that based on the number of preferences per item rather than a weighted average?  Basically, I'd like to recommend the items with the most preferences.  It would be simple to implement, so I was curious if this was already possible.  I understand that in this case, the counts become dependent on the size of the neighborhood. This is something I'd want to use for testing.
>>> 
>>> Thanks
>>> Chris
>> 
> 


Re: user-user recommendations (Implementation)

Posted by Chris Schilling <ch...@cellixis.com>.
Hi again!

I was able to "hijack" the GenericUserBasedRecommendation class in my code to produce recommendations based on the count rather than the weighted average.  In order to do this I had to copy both the AbstractUserRecommender and the GUBR classes into my own code and then change the implementation of the Evaluator in GUBR.  I am reasonably new to Java, so there may have been a better way, but this seemed to be the quickest solution.  One reason I thought this was necessary is because the constructors in AbstractRecommender are protected, so it makes it difficult(impossible?) to extend this class outside of the impl.recommender package within Mahout.  I did not want to change the Mahout code and recompile.  

Now, unless I am wrong, I can see two ways of improving (?) the implementation of the Generic*BasedRecommenders.  One would be the addition of a constructor which allowed the passing of an Evaluator.  The other way would be to make the constructors/methods public in the AbstractRecommender.  Then, I can extend this class in my own code.  I understand that Mahout was built to be somewhat standalone.  However, there are cases when it would be nice to personalize the evaluation outside of Mahout.  So, maybe the ability to pass an implementation of Evaluator to the constructor is a better option.  

On a slightly different yet related note, I think a better metric than the count (for user-based) would be the sum of similarities between users who have rated the item.  Again, I think the weighted average of sum(rating*similarity)/sum(similarity) shows the problem noted in my previous posts. 
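
To make the comparison concrete, here is a plain-Java sketch of the two scores, where 'ratings' and 'similarities' hold the values from the neighbors who rated the candidate item:

    // Weighted average: sum(rating * similarity) / sum(similarity).
    // One very similar neighbor can push an obscure item to the top.
    static double weightedAverage(double[] ratings, double[] similarities) {
      double numerator = 0.0;
      double denominator = 0.0;
      for (int i = 0; i < ratings.length; i++) {
        numerator += ratings[i] * similarities[i];
        denominator += similarities[i];
      }
      return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    // Sum of similarities: more raters and more similar raters both raise
    // the score, so support in the neighborhood is rewarded rather than
    // divided away.
    static double similaritySum(double[] similarities) {
      double sum = 0.0;
      for (double s : similarities) {
        sum += s;
      }
      return sum;
    }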

Again, any thoughts are appreciated :)

Thank you for this beautiful framework!  I have been really enjoying all the discussion and learning the intricacies of recommendation engines (and ML in general).
Chris

On Feb 18, 2011, at 5:29 PM, Chris Schilling wrote:

> So, I've been thinking about this a bit more.
> 
> Take an example:  I have rated a very small number of items.  I am able to extract a neighborhood of similar users.  Now let's say there is a single user who has rated the same items with the same ratings, but this user is the only rater in my neighborhood who has rated an obscure item very highly.  Using a weighted average to predict my recommendations, this obscure item would rise to the top of the list.  In this case, it seems like items rated the most would be better recommendations.  
> 
> I was able to hijack the GenericUserBasedRecommender and change the calculation of the estimated preference to return the count rather than the weighted average.  In my case, this seems to return more intuitive results.  
> 
> Again this is related to the sparseness of the data, but I could see this type of thing occurring often. Any thoughts?
> 
> 
> On Feb 18, 2011, at 3:43 PM, Chris Schilling wrote:
> 
>> Hello again,
>> 
>> Very simple question here:  I am also testing the user-user cf in mahout.  So, once I define my user neighborhood, is it possible to select the recommendations from that based on the number of preferences per item rather than a weighted average?  Basically, I'd like to recommend the items with the most preferences.  It would be simple to implement, so I was curious if this was already possible.  I understand that in this case, the counts become dependent on the size of the neighborhood. This is something I'd want to use for testing.
>> 
>> Thanks
>> Chris
> 


Re: user-user recommendations

Posted by Chris Schilling <ch...@cellixis.com>.
So, I've been thinking about this a bit more.

Take an example:  I have rated a very small number of items.  I am able to extract a neighborhood of similar users.  Now let's say there is a single user who has rated the same items with the same ratings, but this user is the only rater in my neighborhood who has rated an obscure item very highly.  Using a weighted average to predict my recommendations, this obscure item would rise to the top of the list.  In this case, it seems like items rated the most would be better recommendations.  

I was able to hijack the GenericUserBasedRecommender and change the calculation of the estimated preference to return the count rather than the weighted average.  In my case, this seems to return more intuitive results.  

Again this is related to the sparseness of the data, but I could see this type of thing occurring often. Any thoughts?


On Feb 18, 2011, at 3:43 PM, Chris Schilling wrote:

> Hello again,
> 
> Very simple question here:  I am also testing the user-user cf in mahout.  So, once I define my user neighborhood, is it possible to select the recommendations from that based on the number of preferences per item rather than a weighted average?  Basically, I'd like to recommend the items with the most preferences.  It would be simple to implement, so I was curious if this was already possible.  I understand that in this case, the counts become dependent on the size of the neighborhood. This is something I'd want to use for testing.
> 
> Thanks
> Chris