You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2008/09/26 19:52:56 UTC

Recommending when working with binary data sets

Hi,

I've been reading the chapter on recommendations in Programming Collective Intelligence and looking at Taste. The examples in PCI all assume scenarios where items to recommend have been rated by users on some scale. I understand how items can be recommended to users using item-based filtering and user-item ratings and why this is preferred over user-based filtering when the number of users is larger than the number of items.
But what if all I've got is item-item similarity (content-based) and there are no user-item ratings? Say I have a situation where people simply either consume content (e.g. read an article, watch a video...) or not consume it (don't read an article, don't watch the video...). In other words, I really have only yes/no or 1/0 or seen/not seen type "rating".

I can't really use Euclidean distance or Pearson correlation coefficient, can I?

What do people use in such scenarios? Would it make sense to use http://en.wikipedia.org/wiki/Jaccard_index for such cases?
... Ah, I do see javadoc in TanimotoCoefficientSimilarity saying exactly that, good.

But then my question is:
Doesn't the use of Jaccard/Tanimoto mean going back to the expensive user-user similarity computation?

That is, if I need to recommend items for user U1 don't I need to:
1) have user-user similarity pre-computed (and recomputed periodically)
2) find top N users U{2,3,4,...N} who are the most similar to U1
3) then for these top N most similar users find their "seen" items that U1 has not seen (possibly limit this to only recently seen items)
4) select top N items from 3) and recommend those to U1.

If so, then 1) is again expensive.
And what how would one go about selecting top N items from the list in this case other than ordering them by user-user similarity?

Of course, something is telling me I'm demonstrating that I don't yet have the full grasp of item-based filtering. I hope that's the case! :)

Thanks,
Otis

Re: Recommending when working with binary data sets

Posted by Grant Ingersoll <gs...@apache.org>.

Not sure I know the answer in terms of Taste, but did a little bit of  
digging (mind you, I'm no CF expert, but I'm learning thanks to Taste  
and Sean).

At any rate, came across:
Started at Wikipedia's page: http://en.wikipedia.org/wiki/Collaborative_filtering

which lead to http://en.wikipedia.org/wiki/Slope_One, which then has  
an interesting comment about Amazon's item-item approach, which, via  
Google Scholar leads to:

http://dsonline.computer.org/portal/site/dsonline/menuitem.9ed3d9924aeb0dcd82ccc6716bbe36ec/index.jsp?&pName=dso_level1&path=dsonline/2003_Archives/0301/d&file=wp1lind.xml&xsl=article.xsl&;jsessionid=LghY1grHgYJpBTLpWjX5NtvQwhH1Bkv9rpfXT4VnpVtDNVpfZ8n0!-1404507079

In particular, see the "How it Works" section.  Essentially, it  
describes how they build the item to item similarity matrix, which I  
believe is also what you need to do.

HTH,
Grant

On Sep 26, 2008, at 1:52 PM, Otis Gospodnetic wrote:

> Hi,
>
> I've been reading the chapter on recommendations in Programming  
> Collective Intelligence and looking at Taste.  The examples in PCI  
> all assume scenarios where items to recommend have been rated by  
> users on some scale.  I understand how items can be recommended to  
> users using item-based filtering and user-item ratings and why this  
> is preferred over user-based filtering when the number of users is  
> larger than the number of items.
> But what if all I've got is item-item similarity (content-based) and  
> there are no user-item ratings?  Say I have a situation where people  
> simply either consume content (e.g. read an article, watch a  
> video...) or not consume it (don't read an article, don't watch the  
> video...).  In other words, I really have only yes/no or 1/0 or seen/ 
> not seen type "rating".
>
> I can't really use Euclidean distance or Pearson correlation  
> coefficient, can I?
>
> What do people use in such scenarios?  Would it make sense to use http://en.wikipedia.org/wiki/Jaccard_index 
>  for such cases?
> ... Ah, I do see javadoc in TanimotoCoefficientSimilarity saying  
> exactly that, good.
>
> But then my question is:
> Doesn't the use of Jaccard/Tanimoto mean going back to the expensive  
> user-user similarity computation?
>
> That is, if I need to recommend items for user U1 don't I need to:
> 1) have user-user similarity pre-computed (and recomputed  
> periodically)
> 2) find top N users U{2,3,4,...N} who are the most similar to U1
> 3) then for these top N most similar users find their "seen" items  
> that U1 has not seen (possibly limit this to only recently seen items)
> 4) select top N items from 3) and recommend those to U1.
>
> If so, then 1) is again expensive.
> And what how would one go about selecting top N items from the list  
> in this case other than ordering them by user-user similarity?
>
> Of course, something is telling me I'm demonstrating that I don't  
> yet have the full grasp of item-based filtering.  I hope that's the  
> case! :)
>
> Thanks,
> Otis

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Recommending when working with binary data sets

Posted by Sean Owen <sr...@gmail.com>.

Sorry for the late reply -- I've been traveling.

On Fri, Sep 26, 2008 at 6:52 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> I've been reading the chapter on recommendations in Programming Collective Intelligence and looking at Taste.  The examples in PCI

(PS that is a really good book. Recommended -- highly recommended --
to everyone involved with Mahout. I kinda cross-checked what I had
done against the book and think it agrees. The book suggested more
good ideas, particularly the Tanimoto coefficient business.)

> I can't really use Euclidean distance or Pearson correlation coefficient, can I?

You could but it wouldn't make much sense. In the framework I do have
an implementation of Preference which is supposed to encapsulate a
binary value like this. Its existence means a 'yes' and as far as the
framework is concerned means the user expresses a '1.0' preference for
the item. That value doesn't really matter.

(and yes, it would be more efficient to not have such a simple dummy
implementation of Preference to represent this. I threw it in since it
fits cleanly in the framework. Get it right first -- then make it
fast. If there is interest in these areas then we start making more
customized versions of User and some of the algorithms that take
advantage of the fact that preferences are binary.)

> What do people use in such scenarios?  Would it make sense to use http://en.wikipedia.org/wiki/Jaccard_index for such cases?
> ... Ah, I do see javadoc in TanimotoCoefficientSimilarity saying exactly that, good.
>
> But then my question is:
> Doesn't the use of Jaccard/Tanimoto mean going back to the expensive user-user similarity computation?

TanimotoCoefficientSimilarity implements both UserSimilarity and
ItemSimilarity, so it can be plugged into either a user-based or
item-based recommender, which need a UserSimilarity or ItemSimilarity,
respectively. So, no, you aren't forced to user-based recommenders in
this context.