Posted to user@mahout.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2010/12/27 15:54:16 UTC

Evaluating recommendations through user observation

Hi,

I was wondering how people evaluate the quality of recommendations other than 
with RMSE and the like from the eval package.
For example, what are some good ways to measure/evaluate the quality of 
recommendations based simply on observing how users interact with them?
Here are 2 ideas.

* If you have a mechanism to capture the user's rating of the watched item, that 
gives you (in)direct feedback about the quality of the recommendation.  When 
evaluating and comparing, you probably also want to take into account the 
ordinal of the recommended item in the list of recommended items.  If a person 
chooses the 1st recommendation and gives it a score of 10 (best), that's different 
from when a person chooses the 7th recommendation and gives it a score of 10.  Or 
if a person chooses the 1st recommendation and gives it a rating of 1.0 (worst) vs. 
choosing the 10th recommendation and rating it 1.0.

* Even if you don't have a mechanism to capture rating feedback from viewers, 
you can still evaluate and compare, purely by looking at the ordinals of the 
items selected from the recommendations.  If a person chooses something closer to 
"the top" of the recommendation list, the recommendations can be considered 
better than if the user chooses something closer to "the bottom".  This idea is 
similar to MRR in search - http://en.wikipedia.org/wiki/Mean_reciprocal_rank 
(a small sketch of this follows the list below).

* The above ideas assume recommendations are not shuffled, i.e. that their 
displayed order reflects their actual recommendation-score order.
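
To make the second idea more concrete, here is a minimal sketch of computing MRR from
observed picks.  It assumes you log, for each recommendation list shown, the 1-based
position of the item the user actually selected (0 if nothing was selected); the class
name and the log format are made up for illustration, only the reciprocal-rank
arithmetic is the point:

import java.util.Arrays;
import java.util.List;

// Mean reciprocal rank over observed picks from recommendation lists.
public class MeanReciprocalRank {

  // clickedPositions: one entry per impression; the 1-based rank of the item the
  // user chose from the displayed list, or 0 if the user chose nothing.
  public static double mrr(List<Integer> clickedPositions) {
    double sum = 0.0;
    for (int position : clickedPositions) {
      if (position > 0) {
        sum += 1.0 / position;   // picking rec #1 scores 1.0, rec #7 scores ~0.14
      }                          // no pick contributes 0
    }
    return clickedPositions.isEmpty() ? 0.0 : sum / clickedPositions.size();
  }

  public static void main(String[] args) {
    // three impressions: the user picked rec 1, then rec 7, then nothing
    System.out.println(mrr(Arrays.asList(1, 7, 0)));   // (1 + 1/7 + 0) / 3 ≈ 0.38
  }
}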

I'm wondering:
A) whether these ways of measuring/evaluating the quality of recommendations are 
good/bad/flawed
B) whether there are other, better ways of doing this

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Evaluating recommendations through user observation

Posted by Lance Norskog <go...@gmail.com>.
Here's a better way to describe my pov: there is a list of use cases
you would like to implement with your recommender, and some of these
are about the psychology and actions of when and why people push the
button. Then, there is a list of features available in the (medium
rich) recommender class suite. So these create your classic matrix of
use cases vs. features.

The tools are composable. The contract around the tool APIs is
somewhat loose, and different tools have different interpretations.
The features on the side of the matrix are often non-intuitive
combinations of tools rather than individual tools.

There is a learning curve here, and I would like it to be other than
"Ask Ted". This paper is really helpful. This paper by some of the
same crew is about explaining the recommendation scores to the user:

http://www.grouplens.org/papers/pdf/explain-CSCW.pdf



On Tue, Dec 28, 2010 at 6:26 AM, Alan Said <Al...@dai-labor.de> wrote:
> There's a very nice paper by Herlocker et al. - "Evaluating Collaborative Filtering Recommender Systems" which describes different aspects of evaluation. Recommended reading if you're interested in the topic.
>
> PDF available here:
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.5270&rep=rep1&type=pdf
>



-- 
Lance Norskog
goksron@gmail.com

RE: Evaluating recommendations through user observation

Posted by Alan Said <Al...@dai-labor.de>.
There's a very nice paper by Herlocker et al. - "Evaluating Collaborative Filtering Recommender Systems" which describes different aspects of evaluation. Recommended reading if you're interested in the topic.

PDF available here: 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.5270&rep=rep1&type=pdf

-- 
***************************************
M.Sc.(Eng.) Alan Said
Competence Center Information Retrieval & Machine Learning 
Technische Universität Berlin / DAI-Lab 
Sekr. TEL 14 Ernst-Reuter-Platz 7
10587 Berlin / Germany
Phone:  0049 - 30 - 314 74072
Fax:    0049 - 30 - 314 74003
E-mail: alan.said@dai-lab.de
http://www.dai-labor.de
***************************************



Re: Evaluating recommendations through user observation

Posted by Sean Owen <sr...@gmail.com>.
If the general point is that user behavior is an incomplete, indirect,
and sometimes erroneous expression of what they like, yes I agree.
Sometimes users don't even know what they like (hence recommenders).
That's a meta-point, I think, and it's an issue both for recommending and for
evaluating the recommender against other user signals. No, the framework
has nothing to say about translating user signals into ratings, if
that's what you mean.

But part of the question was what to do with whatever imperfect
translation to ratings one has, so I take that as a given. I don't
know of any special secrets here. I tend to think recommendations are
a coarse sort of output; I wouldn't read too much into whether rec 1
or rec 6 was picked; the fact that one was picked as "good" at all is
about all that's significant. RMSE and other metrics are fine, though
they still suffer from the fact that their input (user ratings) is noisy.


A crude test which you can run in the lab is to see whether users' future
behavior agrees with the recommendations. For this, you don't
need future user data; you can just hold out the most recent n days of
data and train on the rest. For situations where you have ratings, the
code does support running RMSE tests and such.
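
For the ratings case, a rough sketch of that kind of hold-out test with the Taste
evaluator classes might look like the following. The file name and the particular
similarity/neighborhood choices are just placeholders, and note that the built-in
evaluator holds out a random slice of each user's preferences rather than literally
the most recent n days:

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class HoldOutRmseExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));  // userID,itemID,rating per line
    RecommenderBuilder builder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(20, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };
    RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
    // train on 90% of each user's preferences, score against the held-out 10%
    double rmse = evaluator.evaluate(builder, null, model, 0.9, 1.0);
    System.out.println("RMSE: " + rmse);
  }
}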

When you don't have ratings, it'll also help you do
precision/recall-style tests. This is a little problematic, as you may
have really good top-10 recommendations but simply never observe the
user interacting with them. Precision and recall will always be really
low -- maybe useful as a relative comparison of two implementations,
but not much more. Another sort of test, which is not in the code, is to
look at the n days of real user activity in this situation and see how
strongly the recommender would have rated those items for the user. The
higher the better. That too is a useful sort of relative comparison
of implementations in the lab.
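
And a similarly rough sketch of the precision/recall side, here with an item-based
recommender over click-style data (file name and similarity choice are again just
placeholders). The last-n-days variant I mentioned isn't in the eval package, but it
is essentially just calling recommender.estimatePreference(userID, itemID) for each
held-out interaction and looking at how high those estimates come out:

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class PrecisionRecallExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("interactions.csv"));
    RecommenderBuilder builder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        return new GenericItemBasedRecommender(dataModel, new LogLikelihoodSimilarity(dataModel));
      }
    };
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = evaluator.evaluate(
        builder, null, model, null,
        10,                                                   // size of the recommendation list ("at 10")
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,  // let it pick a relevance threshold per user
        1.0);                                                 // evaluate all users
    System.out.println("precision@10 = " + stats.getPrecision()
        + ", recall@10 = " + stats.getRecall());
  }
}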

I think the best measure is the broadest and most direct one. You put
in recs for a reason, to increase clicks/conversions/engagement over
some baseline. Do recommendations improve that metric when put in
place vs. when not shown? A/B testing is the way to go, on
clicks/conversions or whatever. This is harder, since you have to
deploy it in the field over some days and measure the difference,
but it's perhaps the best way.
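
The measurement side of that is simple enough to sketch. The counts below are made up,
and the significance check is just a plain two-proportion z-test on click-through rate
between the bucket that saw recommendations and the bucket that didn't:

// Crude A/B comparison of click-through rate, recommendations shown vs. not shown.
public class AbClickThroughExample {

  // two-proportion z-statistic; |z| above roughly 1.96 is significant at the 5% level
  static double zScore(long clicksA, long viewsA, long clicksB, long viewsB) {
    double pA = (double) clicksA / viewsA;
    double pB = (double) clicksB / viewsB;
    double pooled = (double) (clicksA + clicksB) / (viewsA + viewsB);
    double se = Math.sqrt(pooled * (1.0 - pooled) * (1.0 / viewsA + 1.0 / viewsB));
    return (pA - pB) / se;
  }

  public static void main(String[] args) {
    long clicksA = 1300, viewsA = 50000;   // bucket A: recommendations shown
    long clicksB = 1100, viewsB = 50000;   // bucket B: baseline (no recs, or most-popular items)
    System.out.printf("CTR A = %.4f, CTR B = %.4f, z = %.2f%n",
        (double) clicksA / viewsA, (double) clicksB / viewsB,
        zScore(clicksA, viewsA, clicksB, viewsB));
  }
}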

On Mon, Dec 27, 2010 at 11:37 PM, Lance Norskog <go...@gmail.com> wrote:
> Different people watch different numbers of movies. They also rate
> some but not all. Their recommendations may be in one or a few
> clusters (other clustering can be genre, which day of the week is the
> rating, on and on) or may be scattered all over genres (Harry Potter &
> British comedy & European soft-core 70's porn). Evaluating the worth
> of user X's ratings is also important. If you want to interpret the
> ratings in an absolute number system, you want to map the incoming
> ratings because they may average at 7.
>
> The code in Mahout doesn't address these issues.
>

Re: Evaluating recommendations through user observation

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Dec 27, 2010 at 4:24 PM, Sebastian Schelter <ss...@apache.org> wrote:

> From my experience the best insights are found by A/B testing
> different algorithms against live users and measuring relevant actions
> you want to see triggered by your recommender system (the number of
> recommended items put into a shopping cart for example).
>

Amen to this.  I only addressed offline evaluation, but online evaluation
is far better if you have sufficient traffic.  Generally, offline testing is
only usable for weeding out totally useless options; A/B testing is required
for a more realistic assessment.


Re: Evaluating recommendations through user observation

Posted by Sebastian Schelter <ss...@apache.org>.
From my experience the best insights are found by A/B testing
different algorithms against live users and measuring relevant actions
you want to see triggered by your recommender system (the number of
recommended items put into a shopping cart for example).

The paper "Google News Personalization: Scalable Online Collaborative
Filtering" ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf
) has a chapter about how the guys there evaluated their newly built
recommender system; maybe that gives us some more ideas.

--sebastian


Re: Evaluating recommendations through user observation

Posted by Ted Dunning <te...@gmail.com>.
Actually, the Mahout code does address some of these issues:

On Mon, Dec 27, 2010 at 3:37 PM, Lance Norskog <go...@gmail.com> wrote:

> Different people watch different numbers of movies.


This is no problem, except that with fewer movies rated (or watched, if you
are using implicit feedback) the results are less certain.

> They also rate
> some but not all.


Again, not a problem.


> Their recommendations may be in one or a few
> clusters (other clustering can be genre, which day of the week is the
> rating, on and on) or may be scattered all over genres (Harry Potter &
> British comedy & European soft-core 70's porn).


This isn't a problem except insofar as recommendations are a portfolio in
which getting non-zero click-through on a set of recommendations is
typically what you want, but most recommendation systems optimize the
expected number of clicks.  This isn't the same thing because clicks can
correlate and it helps to hedge your bets by increasing the diversity of the
recommended set.  This is usually handled in an ad hoc fashion.


> Evaluating the worth
> of user X's ratings is also important.


Not sure what you mean by this.  There are effectively several options for
this in Mahout.


> If you want to interpret the
> ratings in an absolute number system, you want to map the incoming
> ratings because they may average at 7.
>

Not sure what you mean by this.  If you have ratings limited to a particular
range, then the average can't be outside that range.  You may indeed want to
subtract the user mean rating for each user before building the rec data and
add back the mean for the user being recommended.  Item means may be treated
the same way.  This is equivalent to subtracting a rank-1 approximation of
the ratings that is derived using SVD.
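
A minimal sketch of that user-mean normalization, outside of any particular Mahout
class (the Rating holder and the in-memory maps are just for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Subtract each user's mean rating before building the rec data,
// add the mean back when estimating for that user.
public class UserMeanCenteringExample {

  static final class Rating {
    final long userID; final long itemID; final double value;
    Rating(long userID, long itemID, double value) {
      this.userID = userID; this.itemID = itemID; this.value = value;
    }
  }

  static Map<Long, Double> userMeans(List<Rating> ratings) {
    Map<Long, double[]> acc = new HashMap<Long, double[]>();   // userID -> {sum, count}
    for (Rating r : ratings) {
      double[] a = acc.get(r.userID);
      if (a == null) { a = new double[2]; acc.put(r.userID, a); }
      a[0] += r.value; a[1]++;
    }
    Map<Long, Double> means = new HashMap<Long, Double>();
    for (Map.Entry<Long, double[]> e : acc.entrySet()) {
      means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
    }
    return means;
  }

  // center each rating before training ...
  static double center(Rating r, Map<Long, Double> means) {
    return r.value - means.get(r.userID);
  }

  // ... and add the user's mean back onto the model's centered estimate afterwards
  static double uncenter(long userID, double centeredEstimate, Map<Long, Double> means) {
    return centeredEstimate + means.get(userID);
  }
}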


>
> The code in Mahout doesn't address these issues.
>

Hmmm... I think it does.  Perhaps Sean can comment.

Moving to Otis' comments:


>
> On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
> > Hi,
> >
> > I was wondering how people evaluate the quality of recommendations other
> than
> > RMSE and such in eval package.
>

Off-line evaluation is difficult.  Your suggestion of MRR and related
measures is reasonable, but I prefer to count every presentation on the
first page as equivalent.

The real problem is that historical data will only include presentations of
items from a single recommendation system.  That means that any new system
that brings in new recommendations is at a disadvantage, at least in terms of
error bars around the estimated click-through rate.

Another option is to compute grouped AUC for clicked items relative to
unclicked items.  To do this, iterate over users with clicks.  Pick a random
clicked item and a random unclicked item.  Score 1 if clicked item has
higher score, 0 otherwise.  Ties can be broken at random, but I prefer to
score 0 or 0.5 for them.  Average score near 1 is awesome.
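
A sketch of that grouped AUC calculation over logged impressions; the in-memory
representation (a recommender score plus a clicked flag per impression, grouped by
user) is just for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Grouped AUC: for each user with at least one clicked and one unclicked impression,
// sample one of each and score 1 if the clicked item had the higher recommender score,
// 0.5 on a tie, 0 otherwise. Average over users; near 1.0 is awesome.
public class GroupedAucExample {

  // each double[] is {recommenderScore, clicked ? 1 : 0}
  static double groupedAuc(Map<Long, List<double[]>> impressionsByUser, Random rng) {
    double total = 0.0;
    int users = 0;
    for (List<double[]> impressions : impressionsByUser.values()) {
      List<double[]> clicked = new ArrayList<double[]>();
      List<double[]> unclicked = new ArrayList<double[]>();
      for (double[] imp : impressions) {
        (imp[1] > 0.0 ? clicked : unclicked).add(imp);
      }
      if (clicked.isEmpty() || unclicked.isEmpty()) {
        continue;   // only users with both kinds of impressions contribute
      }
      double c = clicked.get(rng.nextInt(clicked.size()))[0];
      double u = unclicked.get(rng.nextInt(unclicked.size()))[0];
      total += c > u ? 1.0 : (c == u ? 0.5 : 0.0);
      users++;
    }
    return users == 0 ? Double.NaN : total / users;
  }
}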

I don't find it all that helpful to use the exact rank.  Rather, I like to
group all impressions that are shown in the same screenful together and then
ignore second and later pages.  I also prefer to measure changes in behavior
that has business value rather than just ratings.

Re: Evaluating recommendations through user observation

Posted by Lance Norskog <go...@gmail.com>.
Different people watch different numbers of movies. They also rate
some but not all. Their recommendations may be in one or a few
clusters (other clustering can be genre, which day of the week is the
rating, on and on) or may be scattered all over genres (Harry Potter &
British comedy & European soft-core 70's porn). Evaluating the worth
of user X's ratings is also important. If you want to interpret the
ratings in an absolute number system, you want to map the incoming
ratings because they may average at 7.

The code in Mahout doesn't address these issues.




-- 
Lance Norskog
goksron@gmail.com