Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/08/20 17:20:59 UTC

Re: [Taste] Sanity Check and Questions

To kind of follow up on this, since Grant and I talked at length about
this today --

Yeah in the end he was able to create some basic examples that gave
reasonable results on this data set. Except for approaches based on
item-based recommenders. This data set highlighted a weakness of a
straightforward reading of the algorithm, and it might be interesting
to explain here for any interested parties.

In this data set it's clear what the 'right' answers are -- there are
a small group of fans of books about Lincoln and one expects more
Lincoln books to be recommended. Indeed the algorithms estimated high
ratings for such books on a scale of 1 to 5. But, a number of
seemingly unrelated books were topping the recommendation list with an
estimated rating of 5.

Why? Well, the core of the item-based recommender is the bit that
estimates a probable rating for some new item under consideration:

estimatePreference for item:
  for each preferredItem that the user expresses a preference for:
    compute similarity metric m between item and preferredItem
    if it is defined:
      store the preference value for preferredItem, weighted by
(multiplied by) similarity m
  return the weighted average over those stored preference values

Simple enough. But it's vulnerable to a not-uncommon scenario. Imagine
a user likes only Lincoln-related books. In the course of producing
recommendations, we estimate a preference value for, say, a cookbook.
Should be low right? This is a user who rates Lincoln books, just
about exclusively, and maybe rated a recent popular Lincoln biography
5 out of 5.

But because there are few users in the world who both cook and study
history, it's possible that given the data, there is no definable
similarity value between any of the Lincoln books this user likes, and
the cookbook. Fine, in this case the cookbook won't be recommended.

But imagine that, as it happens, a couple people like the cookbook and
this recent Lincoln biography. We compute the similarity and still
find it's low. Maybe 0.1. Still seems like this isn't a good
recommendation right?

Well, if that's the only Lincoln book in the preferences that has any
defined similarity to the cookbook, the above algorithm says we should
return the weighted average of those preferences, weighted by
similarity. That's: (0.1 * 5.0) / 0.1. Or 5.0. The process says the
estimated rating is 5, and indeed there is some justification for
that. But it flies in the face of intuition.
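
To make the arithmetic concrete, here is a minimal sketch of that
weighted-average estimate in plain Java -- not the actual Taste code, just the
logic above, with item names and the lone 0.1 similarity taken from this
example:

import java.util.HashMap;
import java.util.Map;

public class WeightedAverageEstimate {

  // Estimate a preference for an item as the similarity-weighted average of
  // the user's existing preferences. Preferences whose similarity to the item
  // is undefined are skipped, mirroring the pseudocode above. Returns NaN if
  // nothing overlaps at all.
  static double estimatePreference(Map<String, Double> userPrefs,
                                   Map<String, Double> similarityToItem) {
    double weightedSum = 0.0;
    double totalWeight = 0.0;
    for (Map.Entry<String, Double> pref : userPrefs.entrySet()) {
      Double sim = similarityToItem.get(pref.getKey());
      if (sim == null) {
        continue; // similarity undefined: this preference contributes nothing
      }
      weightedSum += sim * pref.getValue();
      totalWeight += sim;
    }
    return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
  }

  public static void main(String[] args) {
    Map<String, Double> userPrefs = new HashMap<String, Double>();
    userPrefs.put("recent Lincoln biography", 5.0);
    userPrefs.put("another Lincoln book", 4.0);

    // Only one of the user's books has any defined similarity to the cookbook.
    Map<String, Double> simToCookbook = new HashMap<String, Double>();
    simToCookbook.put("recent Lincoln biography", 0.1);

    // (0.1 * 5.0) / 0.1 = 5.0 -- a top estimate from one weak data point.
    System.out.println(estimatePreference(userPrefs, simToCookbook));
  }
}

A guard like the one in takeaway 1 below would go right where the NaN is
returned: also bail out, or damp the result, when only one similarity was
defined.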


Interesting. Takeaways?

1) The stock item-based recommender just has this weakness. I am open
to suggestion about alternate modes that could work around this. For
example: reject as a recommendation any item like this, where the
estimate would be based on a single data point.

2) Once again it seems that using preference values can actually get
one into trouble. Interesting.


On Thu, Jun 18, 2009 at 9:43 PM, Ted Dunning<te...@gmail.com> wrote:
> Grant,
>
> The data you described is pretty simple and should produce good results at
> all levels of overlap.  That it does not is definitely a problem.   In fact,
> I would recommend making the data harder to deal with by giving non-Lincoln
> items highly variable popularities and then making the groundlings rate
> items according to their popularity.  This will result in an apparent
> pattern where the inclusion of any number of non-Lincoln fans shows a
> liking for popular items.  The correct inference should,
> however, be that any neighbor group that has a large number of Lincoln fans
> seems to like popular items less than expected.
>
> For problems like this, I have had good luck using measures that were
> robust in the face of noise (aka small counts) and in the face of large
> external trends (aka the top-40 problem).  The simplest one that I know of
> is the generalized multinomial log-likelihood ratio
> <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> that
> you hear me nattering about so often.  LSI does a decent job of dealing
> with the top-40, but has trouble with small counts.  LDA and related
> probabilistic methods should work somewhat better than log-likelihood
> ratio, but are considerably more complex to implement.
>
> The key here is to compare counts within the local neighborhood to counts
> outside the neighborhood.  Things that are significantly different about the
> neighborhood relative to the rest of the world are candidates for
> recommendation.  Things to avoid when looking for interesting differences
> include:
>
> - correlation measures such as Pearson's R (based on a normal-distribution
> approximation and unscaled, so it suffers from both small count and top-40
> problems)
>
> - anomaly measures based simply on frequency ratios (very sensitive to small
> count problems, doesn't account for top-40 at all)
>
> What I would recommend for a nearest neighbor approach is to continue with
> the current neighbor retrieval, but switch to a log-likelihood ratio for
> generating recommendations.
>
> What I would recommend for a production system would be to scrap the nearest
> neighbor approach entirely and go to a cooccurrence matrix-based approach.
> This costs much less to compute at recommendation time and is very robust against
> both small counts and top-40 issues.
>
> On Thu, Jun 18, 2009 at 9:37 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Still seems a little funny. I am away from the code otherwise I would check
>> - forget if I ever implemented weighting in the standard correlation
>> similarity metrics. A pure correlation does not account for whether you
>> overlapped in 10 or 1000 items. This sort of weighting exists elsewhere but
>> I forget about here. It might help.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [Taste] Sanity Check and Questions

Posted by Sean Owen <sr...@gmail.com>.
On Thu, Aug 20, 2009 at 6:07 PM, Ted Dunning<te...@gmail.com> wrote:
> ordered one.  The idea that you can subtract ratings and get a sensible number
> is not actually correct.  In particular, when displaying recommended items,
> you typically only display items with high estimated values or values that

Completely agree. Initially I had no notion of an
'estimatePreference()' method which would actually try to estimate the
preference value in the right scale, etc. The algorithms would come up
with some number indicating how good a recommendation was -- but it
was not guaranteed to be an estimate of the actual preference. It was
just some value which was higher when recommendations were better.

Later I felt it useful for applications to have access to this
functionality. All the current algorithms really do operate by trying
to estimate the preference value, so it was actually pretty simple to
redesign a few things to produce and use actual preference estimates.

It is a restricting assumption, but does let the framework provide an
'estimatePreference()' function. This is useful in many use cases, and
also enables an evaluation framework. It also lets the framework return to the
caller some meaningful notion of how 'good' a recommendation is, instead of
just an ordered list.

Nevertheless I do agree there are drawbacks to making this assumption,
and the estimate isn't necessarily great.

Re: [Taste] Sanity Check and Questions

Posted by Ted Dunning <te...@gmail.com>.
I don't think that this is so much a problem of ratings (which I dislike in
practice as everybody must know by now).

Instead I see it as two parallel problems.  The first is a problem with small
counts and adventitious results based on coincidental pairings.  The second
is the treatment of ratings as a continuous scale rather than as an
ordered one.  The idea that you can subtract ratings and get a sensible number
is not actually correct.  In particular, when displaying recommended items,
you typically only display items with high estimated values or values that
are useful for learning.  This means that items with very low estimated
values never get displayed at all.  If you only get to display items with
estimated ratings of 4.5 and above, then the consequences of an error of
plus or minus 1/2 are very large for something with an estimate of
4.5, but an error of -2 for an item with a rating of 3 is completely
inconsequential.

The problem of small counts probably dominates, but it is good to not lose
sight of the other problem.  Dealing with low counts with ratings requires a
bit of thought and generally involves the complete rejection of techniques
that are based on the assumption of normal distribution.  Hence,
correlation, normal chi^2 tests, t-tests, LSI, squared errors and so on are
all out the window.  Welcome instead latent variable models based on
multinomials, Bayesian techniques and generalized log-likelihood ratios.
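
For anyone who wants to see that ratio in code form, here is a small sketch of
the 2x2 form of the log-likelihood ratio (the G^2 statistic) computed over
cooccurrence counts. The counts in main() are made up purely for illustration:

public class LogLikelihoodRatio {

  // 2x2 log-likelihood ratio (G^2) over a contingency table of cooccurrence:
  //   k11 = users who rated both items
  //   k12 = users who rated item A but not item B
  //   k21 = users who rated item B but not item A
  //   k22 = users who rated neither
  // G^2 = 2 * sum over cells of observed * ln(observed / expected),
  // where expected = rowSum * colSum / total.
  static double llr(long k11, long k12, long k21, long k22) {
    double total = k11 + k12 + k21 + k22;
    return 2.0 * (term(k11, (k11 + k12) * (k11 + k21) / total)
                + term(k12, (k11 + k12) * (k12 + k22) / total)
                + term(k21, (k21 + k22) * (k11 + k21) / total)
                + term(k22, (k21 + k22) * (k12 + k22) / total));
  }

  private static double term(long observed, double expected) {
    return observed == 0 ? 0.0 : observed * Math.log(observed / expected);
  }

  public static void main(String[] args) {
    // 3 users rated both the Lincoln biography and the cookbook, 7 rated only
    // the biography, 2 rated only the cookbook, 988 rated neither.
    System.out.println(llr(3, 7, 2, 988));
  }
}

A big value means the cooccurrence is surprising given how popular each item is
on its own; a small value means it is about what chance would produce.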

On Thu, Aug 20, 2009 at 8:20 AM, Sean Owen <sr...@gmail.com> wrote:

>
> Well, if that's the only Lincoln book in the preferences that has any
> defined similarity to the cookbook, the above algorithm says we should
> return the weighted average of those preferences, weighted by
> similarity. That's: (0.1 * 5.0) / 0.1. Or 5.0. The process says the
> estimated rating is 5, and indeed there is some justification for
> that. But it flies in the face of intuition.
>
>
> Interesting. Takeaways?
>
> 1) The stock item-based recommender just has this weakness. I am open
> to suggestion about alternate modes that could work around this. For
> example: reject as a recommendation any item like this, where the
> estimate would be based on a single data point.
>
> 2) Once again it seems that using preference values can actually get
> one into trouble. Interesting.




-- 
Ted Dunning, CTO
DeepDyve

Re: [Taste] Sanity Check and Questions

Posted by Ted Dunning <te...@gmail.com>.
If you are going with this approach, I think that the prior needs
considerably more weight (as in the equivalent of several ratings at a
below-average recommendation level).  That way, slight correlations will
result in an average or below-average recommendation level.

One of the worst things a recommendation engine can do is make ludicrous
recommendations.  That turns off users and it turns off decision makers even
faster.
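
As a rough sketch of what 'considerably more weight' could look like: treat
the prior as if it were several ratings at a below-average value, so that a
single weak similarity cannot drag the estimate far from that level. The
weights and values below are purely illustrative, not a recommendation of
specific numbers:

public class StrongPriorEstimate {

  // Weighted average of (similarity, rating) pairs, plus a prior that counts
  // as priorWeight worth of similarity at priorValue. With a heavy prior, one
  // weak 0.1 similarity to a 5-star book barely moves the estimate.
  static double estimate(double[] sims, double[] ratings,
                         double priorWeight, double priorValue) {
    double weightedSum = priorWeight * priorValue;
    double totalWeight = priorWeight;
    for (int i = 0; i < sims.length; i++) {
      weightedSum += sims[i] * ratings[i];
      totalWeight += sims[i];
    }
    return weightedSum / totalWeight;
  }

  public static void main(String[] args) {
    // One weak link (similarity 0.1) to a 5-star Lincoln biography.
    double[] sims = {0.1};
    double[] ratings = {5.0};
    // Prior equivalent to 3 fully-weighted ratings of 2.5:
    // (3*2.5 + 0.1*5.0) / (3 + 0.1), about 2.58, nowhere near the top of the list.
    System.out.println(estimate(sims, ratings, 3.0, 2.5));
  }
}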

On Fri, Aug 21, 2009 at 8:11 AM, Mark Desnoyer <md...@gmail.com> wrote:

> Well mathematically, right now, by omitting the unknown similarities, you
> are equivalently setting their similarities to zero and thus they drop off.
> Unless there's some other calculation I don't know about that....
>
> If you want a more theoretically sound version of what I'm trying to do
> (instead of my hacked up way), see section 2.2.1 in:
>
> http://research.microsoft.com/pubs/69656/tr-98-12.pdf
>
> And yes, if you have a default similarity, you will change the result for
> different entries. In the above example say there is another book, maybe
> about fishing that we also want to estimate the preference for. Say this
> book has a similarity of 0.1 with the Lincoln book and 0.1 with the France
> book. Without the prior, your estimate is:
>
> score(fishing) = (0.1*5 + 0.1*3) / (0.1 + 0.1) = 4
>
> With the prior, it's:
>
> score(fishing) = (0.1*5 + 0.1*3 + 0.01*1) / (0.1 + 0.1 + 0.01) = 3.86
>
> This is 96% of the original value, whereas for the cookbook example, it's
> 90%. They aren't moving in lockstep because the impact of the prior is
> different depending on which entries we have real data for.
>
> -Mark
>
> On Fri, Aug 21, 2009 at 5:02 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > The only piece of what you're saying that sort of doesn't click with
> > me is assuming a prior of 0, or some small value. Similarities range
> > from -1 to 1, so a value of 0 is a positive statement that there is
> > exactly no relationship between the ratings for both items. This is
> > different than having no opinion on it and it matters to the
> > subsequent calculations.
> >
> > But does your second suggestion really meaningfully change the result...
> > yeah I push the rating down from 5 to 4.5, but throwing in these other
> > small terms in the average does roughly the same to all the values? I
> > perhaps haven't thought this through. I understand the intuitions here
> > and think they are right.
> >
> >
> > One meta-issue here for the library is this: there's a 'standard'
> > item-based recommender algorithm out there, and we want to have that.
> > And we do. So I don't want to touch it -- perhaps add some options to
> > modify its behavior. So we're talking about maybe inventing a variant
> > algorithm... or three or four. That's good I guess, though not exactly
> > the remit I had in mind for the CF part. I was imagining it would
> > provide access to canonical approaches with perhaps some small
> > variants tacked on, or at least hooks to modify parts of the logic.
> >
> > Basically I also need to have a think about how to include variants
> > like this in a logical way.
> >
> >
> > On Thu, Aug 20, 2009 at 6:07 PM, Mark Desnoyer<md...@gmail.com> wrote:
> > > You could do it that way, but I don't think you're restricted to ignoring
> > > the rating values. For example, you could define similarity between item i
> > > and item j like (the normalization is probably incomplete, but this is the
> > > idea):
> > >
> > > similarity(i,j) = (prior + sum({x_ij})) / (count({x_ij}) + 1)
> > >
> > > where each x_ij is the similarity defined by a single user and could be
> > > based on their ratings. So I think the way you're thinking of it, x_ij = 1,
> > > but it could be a function of the ratings, say higher if the ratings are
> > > closer and lower if they are far apart.
> > >
> > > You can still do the weighted average, you just have more items to
> > > calculate. Say a user has rated the Lincoln book 5, a book on France 3, and
> > > a book on space travel 1. Assuming there is no data linking the France or
> > > space books to the cookbook, then their similarities would be the prior, or
> > > 0.01. Then, you'd calculate the score for the cookbook recommendation as:
> > >
> > > score(cookbook) = (5*0.1 + 3*0.01 + 1*0.01) / (0.1 + 0.01 + 0.01) = 4.5
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [Taste] Sanity Check and Questions

Posted by Mark Desnoyer <md...@gmail.com>.
Well mathematically, right now, by omitting the unknown similarities, you
are equivalently setting their similarities to zero and thus they drop off.
Unless there's some other calculation I don't know about that....

If you want a more theoretically sound version of what I'm trying to do
(instead of my hacked up way), see section 2.2.1 in:

http://research.microsoft.com/pubs/69656/tr-98-12.pdf

And yes, if you have a default similarity, you will change the result for
different entries. In the above example say there is another book, maybe
about fishing that we also want to estimate the preference for. Say this
book has a similarity of 0.1 with the Lincoln book and 0.1 with the France
book. Without the prior, your estimate is:

score(fishing) = (0.1*5 + 0.1*3) / (0.1 + 0.1) = 4

With the prior, it's:

score(fishing) = (0.1*5 + 0.1*3 + 0.01*1) / (0.1 + 0.1 + 0.01) = 3.86

This is 96% of the original value, whereas for the cookbook example, it's
90%. They aren't moving in lockstep because the impact of the prior is
different depending on which entries we have real data for.
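
For anyone following along, a few lines of Java reproduce these numbers (up to
rounding), with the similarities and ratings taken straight from the examples
above:

public class PriorSmoothingCheck {
  public static void main(String[] args) {
    double prior = 0.01;

    // Cookbook: only the Lincoln book (rated 5) has a real similarity of 0.1;
    // the France (3) and space travel (1) books fall back to the prior.
    double cookbook = (0.1 * 5 + prior * 3 + prior * 1) / (0.1 + prior + prior);

    // Fishing: real similarity 0.1 to both the Lincoln (5) and France (3)
    // books; only the space travel book (1) falls back to the prior.
    double fishing = (0.1 * 5 + 0.1 * 3 + prior * 1) / (0.1 + 0.1 + prior);
    double fishingNoPrior = (0.1 * 5 + 0.1 * 3) / (0.1 + 0.1);

    System.out.printf("cookbook = %.2f (%.0f%% of the no-prior 5.0)%n",
                      cookbook, 100 * cookbook / 5.0);
    System.out.printf("fishing  = %.2f (%.0f%% of the no-prior %.1f)%n",
                      fishing, 100 * fishing / fishingNoPrior, fishingNoPrior);
  }
}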

-Mark

On Fri, Aug 21, 2009 at 5:02 AM, Sean Owen <sr...@gmail.com> wrote:

> The only piece of what you're saying that sort of doesn't click with
> me is assuming a prior of 0, or some small value. Similarities range
> from -1 to 1, so a value of 0 is a positive statement that there is
> exactly no relationship between the ratings for both items. This is
> different than having no opinion on it and it matters to the
> subsequent calculations.
>
> But does your second suggestion really meaningfully change the result...
> yeah I push the rating down from 5 to 4.5, but throwing in these other
> small terms in the average does roughly the same to all the values? I
> perhaps haven't thought this through. I understand the intuitions here
> and think they are right.
>
>
> One meta-issue here for the library is this: there's a 'standard'
> item-based recommender algorithm out there, and we want to have that.
> And we do. So I don't want to touch it -- perhaps add some options to
> modify its behavior. So we're talking about maybe inventing a variant
> algorithm... or three or four. That's good I guess, though not exactly
> the remit I had in mind for the CF part. I was imagining it would
> provide access to canonical approaches with perhaps some small
> variants tacked on, or at least hooks to modify parts of the logic.
>
> Basically I also need to have a think about how to include variants
> like this in a logical way.
>
>
> On Thu, Aug 20, 2009 at 6:07 PM, Mark Desnoyer<md...@gmail.com> wrote:
> > You could do it that way, but I don't think you're restricted to ignoring
> > the rating values. For example, you could define similarity between item i
> > and item j like (the normalization is probably incomplete, but this is the
> > idea):
> >
> > similarity(i,j) = (prior + sum({x_ij})) / (count({x_ij}) + 1)
> >
> > where each x_ij is the similarity defined by a single user and could be
> > based on their ratings. So I think the way you're thinking of it, x_ij = 1,
> > but it could be a function of the ratings, say higher if the ratings are
> > closer and lower if they are far apart.
> >
> > You can still do the weighted average, you just have more items to
> > calculate. Say a user has rated the Lincoln book 5, a book on France 3, and
> > a book on space travel 1. Assuming there is no data linking the France or
> > space books to the cookbook, then their similarities would be the prior, or
> > 0.01. Then, you'd calculate the score for the cookbook recommendation as:
> >
> > score(cookbook) = (5*0.1 + 3*0.01 + 1*0.01) / (0.1 + 0.01 + 0.01) = 4.5
>

Re: [Taste] Sanity Check and Questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 21, 2009, at 5:02 AM, Sean Owen wrote:
>
> One meta-issue here for the library is this: there's a 'standard'
> item-based recommender algorithm out there, and we want to have that.
> And we do. So I don't want to touch it -- perhaps add some options to
> modify its behavior. So we're talking about maybe inventing a variant
> algorithm... or three or four. That's good I guess, though not exactly
> the remit I had in mind for the CF part. I was imagining it would
> provide access to canonical approaches with perhaps some small
> variants tacked on, or at least hooks to modify parts of the logic.
>
> Basically I also need to have a think about how to include variants
> like this in a logical way.

+1.  Theory and practice are two different things.  :-)  We run into  
this in Lucene often as well.  The goal is to create an API that has  
high out of the box performance, but gives people the proverbial rope  
to go hang.  Documentation can often help give people the insight into  
what they need to know to properly tune the libraries.  "Standard"  
collections like Netflix and GroupLens can be used to show "out of the  
box" performance, while still being careful of overtuning.  Over time,  
we will want the Wiki to evolve to capture best practices for Taste  
and Mahout with people contributing real world experience.  The code,  
of course, can evolve too.  It need not be perfect the first time.

Re: [Taste] Sanity Check and Questions

Posted by Sean Owen <sr...@gmail.com>.
The only piece of what you're saying that sort of doesn't click with
me is assuming a prior of 0, or some small value. Similarities range
from -1 to 1, so a value of 0 is a positive statement that there is
exactly no relationship between the ratings for both items. This is
different than having no opinion on it and it matters to the
subsequent calculations.

But does your second suggestion really meaningfully change the result...
yeah I push the rating down from 5 to 4.5, but throwing in these other
small terms in the average does roughly the same to all the values? I
perhaps haven't thought this through. I understand the intuitions here
and think they are right.


One meta-issue here for the library is this: there's a 'standard'
item-based recommender algorithm out there, and we want to have that.
And we do. So I don't want to touch it -- perhaps add some options to
modify its behavior. So we're talking about maybe inventing a variant
algorithm... or three or four. That's good I guess, though not exactly
the remit I had in mind for the CF part. I was imagining it would
provide access to canonical approaches with perhaps some small
variants tacked on, or at least hooks to modify parts of the logic.

Basically I also need to have a think about how to include variants
like this in a logical way.


On Thu, Aug 20, 2009 at 6:07 PM, Mark Desnoyer<md...@gmail.com> wrote:
> You could do it that way, but I don't think you're restricted to ignoring
> the rating values. For example, you could define similarity between item i
> and item j like (the normalization is probably incomplete, but this is the
> idea):
>
> similarity(i,j) = (prior + sum({x_ij})) / (count({x_ij}) + 1)
>
> where each x_ij is the similarity defined by a single user and could be
> based on their ratings. So I think the way you're thinking of it, x_ij = 1,
> but it could be a function of the ratings, say higher if the ratings are
> closer and lower if they are far apart.
>
> You can still do the weighted average, you just have more items to
> calculate. Say a user has rated the Lincoln book 5, a book on France 3, and a
> book on space travel 1. Assuming there is no data linking the France or
> space books to the cookbook, then their similarities would be the prior, or
> 0.01. Then, you'd calculate the score for the cookbook recommendation as:
>
> score(cookbook) = (5*0.1 + 3*0.01 + 1*0.01) / (0.1 + 0.01 + 0.01) = 4.5

Re: [Taste] Sanity Check and Questions

Posted by Mark Desnoyer <md...@gmail.com>.
On Thu, Aug 20, 2009 at 12:07 PM, Sean Owen <sr...@gmail.com> wrote:

> On Thu, Aug 20, 2009 at 4:55 PM, Mark Desnoyer<md...@gmail.com> wrote:
> > What about defining a small prior similarity value between all items? In
> > this case, the Lincoln book and the cookbook would start with some small
> > similarity like 0.01 and as more users connect the books, this value gets
> > swamped with the true value. It's the same concept that's used in Beta or
> > Dirichlet distributions.
>
> The core of what you're suggesting, I think, is that the similarity
> value increases as the number of users that are connected to both
> items increases? And then it doesn't even depend on the rating values.
> Yes, actually I think this works well (and Ted would agree, I
> believe.) Or, you could say that indeed this attacks the very problem
> highlighted by this scenario, that the stock algorithm takes no
> account of this number.


You could do it that way, but I don't think you're restricted to ignoring
the rating values. For example, you could define similarity between item i
and item j like (the normalization is probably incomplete, but this is the
idea):

similarity(i,j) = (prior + sum({x_ij})) / (count({x_ij}) + 1)

where each x_ij is the similarity defined by a single user and could be
based on their ratings. So I think the way you're thinking of it, x_ij = 1,
but it could be a function of the ratings, say higher if the ratings are
closer and lower if they are far apart.
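
As a sketch of that smoothed similarity, here it is in Java with one plausible
(and entirely made-up) choice of x_ij: 1 minus the normalized rating gap on a
1-5 scale, so identical ratings contribute 1 and opposite ratings contribute 0.
The normalization is as hand-wavy as above:

public class PriorSmoothedSimilarity {

  // similarity(i,j) = (prior + sum(x_ij)) / (count(x_ij) + 1), where each x_ij
  // comes from one user who rated both items. Here x_ij = 1 - |r_i - r_j| / 4.
  static double similarity(double prior, double[] ratingsI, double[] ratingsJ) {
    double sum = prior;
    for (int u = 0; u < ratingsI.length; u++) {
      sum += 1.0 - Math.abs(ratingsI[u] - ratingsJ[u]) / 4.0;
    }
    return sum / (ratingsI.length + 1);
  }

  public static void main(String[] args) {
    // Two users rated both the Lincoln biography and the cookbook.
    double[] lincolnRatings = {5.0, 4.0};
    double[] cookbookRatings = {2.0, 5.0};
    // (0.01 + 0.25 + 0.75) / 3, about 0.34: the +1 in the denominator pulls a
    // thinly-supported similarity toward the small prior.
    System.out.println(similarity(0.01, lincolnRatings, cookbookRatings));
  }
}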


>
>
> > Anyway, in the case of this algorithm, if there is no user data between
> > Lincoln books and the cookbook, then the resulting preference would just be
> > the average of all the user's previous ratings. If there is some weak
>
> That's a variant, to fall back on the average of the user's ratings. I
> personally don't like it -- would rather just disqualify the items
> from recommendations. But it's most certainly plausible. Assuming the
> top recommendations have values that are far from 'average', and
> that's reasonable, it won't matter whether you reject these items or
> give them a middling score, which almost certainly puts them out of
> the top recommendations.
>
>
> > similarity, say 0.1 with a rating of 5, then you'd skew the resulting
> > preference score higher, but it won't go all the way to 5.0. How much it
> > skews is controlled by the strength of prior relative to the similarity from
> > the data.
>
> I think you're sort of suggesting to not normalize for the weights in
> the weighted average? Right now indeed it sort of does this --
> multiplies the 5 by 0.1. But in the end it divides through by the 0.1.
> You could skip that division; then the results aren't really estimated
> preferences since they won't necessarily map into the 1-5 range in
> this example, for instance.


You can still do the weighted average, you just have more items to
calculate. Say a user has rated the Lincoln book 5, a book on France 3, and a
book on space travel 1. Assuming there is no data linking the France or
space books to the cookbook, then their similarities would be the prior, or
0.01. Then, you'd calculate the score for the cookbook recommendation as:

score(cookbook) = (5*0.1 + 3*0.01 + 1*0.01) / (0.1 + 0.01 + 0.01) = 4.5


>
> But then sure you just map the results into this range. Yeah, this
> sort of transform is what I stuck onto the Pearson correlation to
> account for a similar phenomenon. I'll look at adapting that, which is
> I think roughly what you are describing.
>

Re: [Taste] Sanity Check and Questions

Posted by Sean Owen <sr...@gmail.com>.
On Thu, Aug 20, 2009 at 4:55 PM, Mark Desnoyer<md...@gmail.com> wrote:
> What about defining a small prior similarity value between all items? In
> this case, the Lincoln book and the cookbook would start with some small
> similarity like 0.01 and as more users connect the books, this value gets
> swamped with the true value. It's the same concept that's used in Beta or
> Dirichlet distributions.

The core of what you're suggesting, I think, is that the similarity
value increases as the number of users that are connected to both
items increases? And then it doesn't even depend on the rating values.
Yes, actually I think this works well (and Ted would agree, I
believe.) Or, you could say that indeed this attacks the very problem
highlighted by this scenario, that the stock algorithm takes no
account of this number.


> Anyway, in the case of this algorithm, if there is no user data between
> Lincoln books and the cookbook, then the resulting preference would just be
> the average of all the user's previous ratings. If there is some weak

That's a variant, to fall back on the average of the user's ratings. I
personally don't like it -- would rather just disqualify the items
from recommendations. But it's most certainly plausible. Assuming the
top recommendations have values that are far from 'average', and
that's reasonable, it won't matter whether you reject these items or
give them a middling score, which almost certainly puts them out of
the top recommendations.


> similarity, say 0.1 with a rating of 5, then you'd skew the resulting
> preference score higher, but it won't go all the way to 5.0. How much it
> skews is controlled by the strength of prior relative to the similarity from
> the data.

I think you're sort of suggesting to not normalize for the weights in
the weighted average? Right now indeed it sort of does this --
multiplies the 5 by 0.1. But in the end it divides through by the 0.1.
You could skip that division; then the results aren't really estimated
preferences since they won't necessarily map into the 1-5 range in
this example, for instance.

But then sure you just map the results into this range. Yeah, this
sort of transform is what I stuck onto the Pearson correlation to
account for a similar phenomenon. I'll look at adapting that, which is
I think roughly what you are describing.
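
For reference, the classic form of that overlap-count damping is the
'significance weighting' of Herlocker et al.: scale the correlation down when
it is based on only a few co-rated items. This sketch shows the general idea
only, and is not necessarily the exact transform used in Taste:

public class SignificanceWeighting {

  // Damp a Pearson correlation that rests on few co-rated items: below the
  // cutoff, scale it by n / cutoff; at or above the cutoff, leave it alone.
  // Herlocker et al. used a cutoff of 50, but it is really a tuning knob.
  static double dampedSimilarity(double pearson, int numCoRatedItems, int cutoff) {
    return pearson * Math.min(numCoRatedItems, cutoff) / (double) cutoff;
  }

  public static void main(String[] args) {
    // A perfect correlation from only 2 co-rated items is heavily damped...
    System.out.println(dampedSimilarity(1.0, 2, 50));   // 0.04
    // ...while the same correlation over 80 items is left as-is.
    System.out.println(dampedSimilarity(1.0, 80, 50));  // 1.0
  }
}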

Re: [Taste] Sanity Check and Questions

Posted by Mark Desnoyer <md...@gmail.com>.
What about defining a small prior similarity value between all items? In
this case, the Lincoln book and the cookbook would start with some small
similarity like 0.01 and as more users connect the books, this value gets
swamped with the true value. It's the same concept that's used in Beta or
Dirichlet distributions.

Anyway, in the case of this algorithm, if there is no user data between
Lincoln books and the cookbook, then the resulting preference would just be
the average of all the user's previous ratings. If there is some weak
similarity, say 0.1 with a rating of 5, then you'd skew the resulting
preference score higher, but it won't go all the way to 5.0. How much it
skews is controlled by the strength of prior relative to the similarity from
the data.

-Mark

On Thu, Aug 20, 2009 at 11:20 AM, Sean Owen <sr...@gmail.com> wrote:

> To kind of follow up on this, since Grant and I talked at length about
> this today --
>
> Yeah in the end he was able to create some basic examples that gave
> reasonable results on this data set. Except for approaches based on
> item-based recommenders. This data set highlighted a weakness of a
> straightforward reading of the algorithm, and it might be interesting
> to explain here for any interested parties.
>
> In this data set it's clear what the 'right' answers are -- there are
> a small group of fans of books about Lincoln and one expects more
> Lincoln books to be recommended. Indeed the algorithms estimated high
> ratings for such books on a scale of 1 to 5. But, a number of
> seemingly unrelated books were topping the recommendation list with an
> estimated rating of 5.
>
> Why? Well, the core of the item-based recommender is the bit that
> estimates a probable rating for some new item under consideration:
>
> estimatePreference for item:
>  for each preferredItem that the user expresses a preference for:
>    compute similarity metric m between item and preferredItem
>    if it is defined:
>      store the preference value for preferredItem, weighted by
> (multiplied by) similarity m
>  return the weighted average over those stored preference values
>
> Simple enough. But it's vulnerable to a not-uncommon scenario. Imagine
> a user likes only Lincoln-related books. In the course of producing
> recommendations, we estimate a preference value for, say, a cookbook.
> Should be low right? This is a user who rates Lincoln books, just
> about exclusively, and maybe rated a recent popular Lincoln biography
> 5 out of 5.
>
> But because there are few users in the world who both cook and study
> history, it's possible that given the data, there is no definable
> similarity value between any of the Lincoln books this user likes, and
> the cookbook. Fine, in this case the cookbook won't be recommended.
>
> But imagine that, as it happens, a couple people like the cookbook and
> this recent Lincoln biography. We compute the similarity and still
> find it's low. Maybe 0.1. Still seems like this isn't a good
> recommendation right?
>
> Well, if that's the only Lincoln book in the preferences that has any
> defined similarity to the cookbook, the above algorithm says we should
> return the weighted average of those preferences, weighted by
> similarity. That's: (0.1 * 5.0) / 0.1. Or 5.0. The process says the
> estimated rating is 5, and indeed there is some justification for
> that. But it flies in the face of intuition.
>
>
> Interesting. Takeaways?
>
> 1) The stock item-based recommender just has this weakness. I am open
> to suggestion about alternate modes that could work around this. For
> example: reject as a recommendation any item like this, where the
> estimate would be based on a single data point.
>
> 2) Once again it seems that using preference values can actually get
> one into trouble. Interesting.
>
>
> On Thu, Jun 18, 2009 at 9:43 PM, Ted Dunning<te...@gmail.com> wrote:
> > Grant,
> >
> > The data you described is pretty simple and should produce good results at
> > all levels of overlap.  That it does not is definitely a problem.   In fact,
> > I would recommend making the data harder to deal with by giving non-Lincoln
> > items highly variable popularities and then making the groundlings rate
> > items according to their popularity.  This will result in an apparent
> > pattern where the inclusion of any number of non-Lincoln fans shows a
> > liking for popular items.  The correct inference should, however, be that
> > any neighbor group that has a large number of Lincoln fans seems to like
> > popular items less than expected.
> >
> > For problems like this, I have had good luck using measures that were
> > robust in the face of noise (aka small counts) and in the face of large
> > external trends (aka the top-40 problem).  The simplest one that I know of
> > is the generalized multinomial log-likelihood ratio
> > <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> that
> > you hear me nattering about so often.  LSI does a decent job of dealing
> > with the top-40, but has trouble with small counts.  LDA and related
> > probabilistic methods should work somewhat better than log-likelihood
> > ratio, but are considerably more complex to implement.
> >
> > The key here is to compare counts within the local neighborhood to counts
> > outside the neighborhood.  Things that are significantly different about the
> > neighborhood relative to the rest of the world are candidates for
> > recommendation.  Things to avoid when looking for interesting differences
> > include:
> >
> > - correlation measures such as Pearson's R (based on a normal-distribution
> > approximation and unscaled, so it suffers from both small count and top-40
> > problems)
> >
> > - anomaly measures based simply on frequency ratios (very sensitive to small
> > count problems, doesn't account for top-40 at all)
> >
> > What I would recommend for a nearest neighbor approach is to continue with
> > the current neighbor retrieval, but switch to a log-likelihood ratio for
> > generating recommendations.
> >
> > What I would recommend for a production system would be to scrap the nearest
> > neighbor approach entirely and go to a cooccurrence matrix-based approach.
> > This costs much less to compute at recommendation time and is very robust
> > against both small counts and top-40 issues.
> >
> > On Thu, Jun 18, 2009 at 9:37 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> Still seems a little funny. I am away from the code otherwise I would check
> >> - forget if I ever implemented weighting in the standard correlation
> >> similarity metrics. A pure correlation does not account for whether you
> >> overlapped in 10 or 1000 items. This sort of weighting exists elsewhere but
> >> I forget about here. It might help.
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>