Posted to user@mahout.apache.org by Peter Harrington <pe...@gmail.com> on 2011/04/26 02:28:58 UTC

How to evaluate a recommender with binary ratings?

Does anyone have a suggestion for how to evaluate a recommendation engine
that uses a binary rating system?
Usually the estimated scores (similarity score * rating, summed over the
other rated items) are normalized by dividing by the sum of the similarities
of those rated items.  If I do this for a binary rating system I get 1.0 for
every item.
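
To make that concrete, the estimate I mean is roughly

    estimate(u, i) = sum_j sim(i, j) * r(u, j) / sum_j sim(i, j)

where j runs over the items user u has already rated.  With binary data
every r(u, j) is 1, so the numerator equals the denominator and the
estimate collapses to 1.0 for every candidate item.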

Is there another normalization I can do to get a number between 0 and 1.0?
Should I just use precision and recall?

Thanks for the help,
Peter Harrington

Re: How to evaluate a recommender with binary ratings?

Posted by Ted Dunning <te...@gmail.com>.
Ahh...

In that case, AUC and log-likelihood (for probability outputs) are the
natural measures of quality.  Precision at 20 or comparable measures are
also very helpful.

If you can deploy the system on a subset of data, then recommendation
click-through rate is the most realistic measure.  Hopefully you can then
relate that back to an off-line measure.
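
For the AUC part, here is a self-contained sketch, assuming you already
have a real-valued recommender score and the actual held-out outcome for
each (user, item) pair.  This is plain Java, not any Mahout API, and the
names are made up:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class AucSketch {

  /**
   * AUC via the Mann-Whitney rank-sum formulation: the probability that a
   * randomly chosen positive is scored above a randomly chosen negative.
   * Each entry is {score, label}, where label is 1.0 if the user actually
   * interacted with the held-out item and 0.0 otherwise.
   */
  static double auc(List<double[]> scoredHeldOut) {
    List<double[]> sorted = new ArrayList<double[]>(scoredHeldOut);
    Collections.sort(sorted, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(a[0], b[0]);   // ascending by score
      }
    });
    double rankSum = 0.0;
    long positives = 0;
    long negatives = 0;
    for (int i = 0; i < sorted.size(); i++) {
      if (sorted.get(i)[1] > 0.0) {
        rankSum += i + 1;                    // 1-based rank of this positive
        positives++;
      } else {
        negatives++;
      }
    }
    // NaN if either class is empty; ties are not given special treatment.
    return (rankSum - positives * (positives + 1) / 2.0)
        / (positives * negatives);
  }
}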

Re: How to evaluate a recommender with binary ratings?

Posted by Peter Harrington <pe...@gmail.com>.
Ted,
Thanks for the quick response.  Perhaps I used the wrong terminology, but a
recommender that uses binary data is nothing new.  For example, a news web
site might want to recommend stories based on your past viewing behavior:
you either viewed an article or you didn't.  Chapter 6 of Mahout in Action
uses the Wikipedia snapshot, where a link either exists or it doesn't, and
recommendations are made on that binary dataset.
The recommender itself is not generating a 1 or a 0; only the input data is
binary.

Thanks again, I will probably go with precision.  What do you think about
coverage?
Peter

Re: How to evaluate a recommender with binary ratings?

Posted by Sean Owen <sr...@gmail.com>.
Peter (/Ted),

Yes, this is all handled in the framework already.  You would never
directly use the recommenders intended for data sets with ratings, as most
of them don't make sense when every rating is 1.0.  You would instead use,
for example, GenericBooleanPrefItemBasedRecommender, a variant of
GenericItemBasedRecommender which redefines estimatePreference() so that it
still returns a useful value.

There is already GenericRecommenderIRStatsEvaluator which runs precision,
recall, f-score and NDCG stats on a recommender. These are meaningful even
without ratings, though of course things like RMSE aren't anymore. (This is
all in Mahout in Action too, yes.)
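
A sketch of running that evaluation (the file name and the at-10 / 70%
settings are only placeholders):

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class IRStatsSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("links.csv"));
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    RecommenderBuilder builder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        return new GenericBooleanPrefItemBasedRecommender(
            dataModel, new LogLikelihoodSimilarity(dataModel));
      }
    };
    // Precision/recall "at 10", letting the evaluator choose the relevance
    // threshold, evaluated over 70% of the users.
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.7);
    System.out.println("precision@10 = " + stats.getPrecision());
    System.out.println("recall@10    = " + stats.getRecall());
    System.out.println("F1           = " + stats.getF1Measure());
    System.out.println("NDCG         = " + stats.getNormalizedDiscountedCumulativeGain());
    // Reach (fraction of users for whom any recommendation could be made)
    // is about the closest built-in answer to the coverage question.
    System.out.println("reach        = " + stats.getReach());
  }
}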

The output of a recommender or similarity metric isn't a probability in
general, so AUC can't be applied in every case, and it isn't implemented in
the framework.  For the particular case of LogLikelihoodSimilarity, though,
you could put that together yourself.

Re: How to evaluate a recommender with binary ratings?

Posted by Ted Dunning <te...@gmail.com>.
If the recommender will only produce binary output scores and you have
actual held-out user data, then you can still compute AUC.  If you want to
compute log-likelihood, you need to estimate probabilities p_1 and p_2 that
represent what the recommender *should* have said when it actually said 0
or 1.  You can fit these to give the optimum log-likelihood on one held-out
set and then get a real value for log-likelihood on another held-out set.
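
A sketch of that calibration, assuming you have the recommender's binary
output and the actual held-out outcome for each pair; names are made up,
and p[1] / p[0] play the role of p_1 and p_2:

public class BinaryLogLikelihoodSketch {

  /**
   * p[k] = estimated probability that an item the recommender scored k
   * (k = 0 or 1) was actually relevant, fitted on the first held-out set.
   * Laplace smoothing keeps the probabilities away from exactly 0 and 1.
   */
  static double[] calibrate(int[] said, boolean[] actual) {
    double[] positives = new double[2];
    double[] counts = new double[2];
    for (int i = 0; i < said.length; i++) {
      counts[said[i]]++;
      if (actual[i]) {
        positives[said[i]]++;
      }
    }
    return new double[] { (positives[0] + 1.0) / (counts[0] + 2.0),
                          (positives[1] + 1.0) / (counts[1] + 2.0) };
  }

  /** Average log-likelihood of a second held-out set under the fitted p. */
  static double logLikelihood(double[] p, int[] said, boolean[] actual) {
    double sum = 0.0;
    for (int i = 0; i < said.length; i++) {
      double prob = actual[i] ? p[said[i]] : 1.0 - p[said[i]];
      sum += Math.log(prob);
    }
    return sum / said.length;
  }
}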

Precision, recall, and false-positive rate are also potentially useful.

If the engine has an internal threshold knob, you can build ROC curves and
estimate AUC using averaging.

But the question remains: why would you use such a recommendation engine?
