Posted to user@mahout.apache.org by Rafal Lukawiecki <ra...@projectbotticelli.com> on 2013/08/08 18:49:11 UTC

Evaluating Precision and Recall of Various Similarity Metrics

I'd like to compare the accuracy, precision and recall of various vector similarity measures with regards to our data sets. Ideally, I'd like to do that for RecommenderJob, including CooccurrenceCount. However, I don't think RecommenderJob supports calculation of the performance metrics.

Alternatively, I could use the evaluator logic in the non-Hadoop-based Item-based recommenders, but they do not seem to support the option of using CooccurrenceCount as a measure, or am I wrong? 

Reading archived conversations from here, I can see others have asked a similar question in 2011 (http://comments.gmane.org/gmane.comp.apache.mahout.user/9758) but there seems no clear guidance. Also, I am unsure if it is valid to split the data set into training/testing that way, as testing users' key characteristic is the items they have preferred—and there is no "model" to fit them to, so to speak, or they would become anonymous users if we stripped their preferences. Am I right in thinking that I could test RecommenderJob by feeding X random preferences of a user, having hidden the remainder of their preferences, and see if the hidden items/preferences would become their recommendations? However, that approach would change what a user "likes" (by hiding their preferences for testing purposes) and I'd be concerned about the value of the recommendation. Am I in a loop? Is there a way to somehow tap into the recommendation to get an accuracy metric out?

Did anyone, perhaps, share a method or a script (R, Python, Java) for evaluating RecommenderJob results?
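
To be concrete, this is roughly the kind of check I have in mind. It is only a rough sketch of my own: it assumes I have already hidden some preferences into a heldout.csv file (userID,itemID per line), and it assumes RecommenderJob's textual output looks like userID<TAB>[itemID:score,itemID:score,...]; the file names and the HoldoutEval class are placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch only: "heldout.csv" holds the preferences hidden before running
// RecommenderJob (userID,itemID per line), and "recs.txt" is the concatenated
// job output, assumed to look like "userID<TAB>[itemID:score,itemID:score,...]".
public class HoldoutEval {

  public static void main(String[] args) throws IOException {
    Map<Long, Set<Long>> hiddenByUser = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get("heldout.csv"))) {
      String[] f = line.split(",");
      hiddenByUser.computeIfAbsent(Long.parseLong(f[0]), u -> new HashSet<>())
                  .add(Long.parseLong(f[1]));
    }

    int k = 10;
    double precisionSum = 0;
    double recallSum = 0;
    int users = 0;

    for (String line : Files.readAllLines(Paths.get("recs.txt"))) {
      String[] parts = line.split("\t");
      if (parts.length < 2) continue;
      Set<Long> hidden = hiddenByUser.get(Long.parseLong(parts[0]));
      if (hidden == null || hidden.isEmpty()) continue;

      // strip the brackets and take the first k "item:score" pairs
      String body = parts[1].replace("[", "").replace("]", "").trim();
      if (body.isEmpty()) continue;
      String[] recs = body.split(",");
      int cutoff = Math.min(k, recs.length);
      int hits = 0;
      for (int i = 0; i < cutoff; i++) {
        long itemId = Long.parseLong(recs[i].split(":")[0]);
        if (hidden.contains(itemId)) hits++;
      }
      precisionSum += hits / (double) cutoff;
      recallSum += hits / (double) hidden.size();
      users++;
    }

    System.out.printf("precision@%d=%.4f recall@%d=%.4f over %d users%n",
        k, precisionSum / users, k, recallSum / users, users);
  }
}

Precision here is measured against the truncated top-k list and recall against each user's hidden set, so users with only one or two hidden preferences make the per-user recall very coarse.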

Many thanks,
Rafal Lukawiecki

Re: Evaluating Precision and Recall of Various Similarity Metrics

Posted by Ted Dunning <te...@gmail.com>.
Rafal,

The major problems with these sorts of metrics with recommendations include

a) Different algorithms pull up different data and you don't have any
deeply scored reference data. The problem is similar to search, except
without test collections. There are some partial solutions to this.
b) Recommendations are typically very strongly dependent on feedback from
data that they themselves sample. This means, for instance, that a system
with dithering will often outperform the same system without dithering.
Dithering is a form of noise added to the result of a recommender, so,
judged on any single ranked list, its quality logically has to be worse
than the system without. The system with dithering performs much better
over time, however, because it gathers broader information and thus
learns about things that the version without dithering would never find.

Problem (b) is the strongly limiting case because dithering can make a
bigger change than almost any reasonable algorithmic choice.  Sadly,
problem (a) is the one attacked in most academic research.
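
For illustration, here is a minimal sketch of one way to implement dithering. The log(rank) plus Gaussian noise form below is one common parameterisation, not the only one; epsilon controls how aggressively the tail of the list gets shuffled.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Minimal sketch of one common dithering scheme: re-rank a top-N list by
// log(rank) + Gaussian noise, so the head of the list is mostly preserved
// while the tail is shuffled enough to gather fresh feedback. The noise
// scale sqrt(log(epsilon)) is one conventional choice, not the only one.
public final class Dithering {

  public static <T> List<T> dither(List<T> ranked, double epsilon, Random rng) {
    double sigma = Math.sqrt(Math.log(epsilon));
    int n = ranked.size();
    double[] keys = new double[n];
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      keys[i] = Math.log(i + 1.0) + sigma * rng.nextGaussian();
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> keys[i]));
    List<T> dithered = new ArrayList<>(n);
    for (int i : order) {
      dithered.add(ranked.get(i));
    }
    return dithered;
  }

  public static void main(String[] args) {
    List<String> top = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
    // head stays mostly intact, tail is shuffled more aggressively
    System.out.println(dither(top, 2.0, new Random()));
  }
}

With epsilon close to 1 the list is nearly untouched; larger values trade more short-term precision for broader exploration.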




On Thu, Aug 8, 2013 at 10:34 AM, Rafal Lukawiecki <
rafal@projectbotticelli.com> wrote:

> Hi Sebastian, thank you for your suggestions, including the idea of
> considering other similarity measures like the log-likelihood ratio. I
> still hope to do a comparison of all of the available ones on our data.
> I realise the importance (and also some limitations) of A/B testing in
> production, but having a broader way to test the recommender would have
> been useful.
>
> I suppose I am used to looking at lift/profit charts, cross-validation,
> RMSE, and similar metrics of accuracy and reliability when working with
> data mining models such as decision trees or clustering, and also when
> evaluating association rules, where I'd hope that the model correctly
> predicts basket completions. I am curious whether there is anything
> along this line of thinking for evaluating recommenders that do not
> expose explicit models.
>
> Many thanks, very much indeed, for all your replies.
>
> Rafal
>
> On 8 Aug 2013, at 17:58, Sebastian Schelter <ss...@apache.org>
>  wrote:
>
> Hi Rafal,
>
> you are right, unfortunately there is no tooling available for doing
> holdout tests with RecommenderJob. It would be an awesome contribution to
> Mahout though.
>
> Ideally, you would want to split your dataset in a way that retains some
> portion of the interactions of each user and then see how many of the
> held-out interactions you can reproduce. You should be aware that this is
> basically a test of how well a recommender can reproduce what already
> happened. If you get recommendations for items that are not in your
> held-out data, this does not automatically mean that they are wrong. They
> might be very interesting things that the user simply hasn't had a chance
> to look at yet. The real "performance" of a recommender can only be found
> via extensive A/B testing in production systems.
>
> Btw, I would strongly recommend that you use a more sophisticated
> similarity than cooccurrence count, e.g. the log-likelihood ratio.
>
> Best,
> Sebastian
>
>
> 2013/8/8 Rafal Lukawiecki <ra...@projectbotticelli.com>
>
> > I'd like to compare the accuracy, precision and recall of various vector
> > similarity measures with regards to our data sets. Ideally, I'd like to
> do
> > that for RecommenderJob, including CooccurrenceCount. However, I don't
> > think RecommenderJob supports calculation of the performance metrics.
> >
> > Alternatively, I could use the evaluator logic in the non-Hadoop-based
> > Item-based recommenders, but they do not seem to support the option of
> > using CooccurrenceCount as a measure, or am I wrong?
> >
> > Reading archived conversations from here, I can see others have asked a
> > similar question in 2011 (
> > http://comments.gmane.org/gmane.comp.apache.mahout.user/9758) but there
> > seems no clear guidance. Also, I am unsure if it is valid to split the
> data
> > set into training/testing that way, as testing users' key characteristic
> is
> > the items they have preferred—and there is no "model" to fit them to, so
> to
> > speak, or they would become anonymous users if we stripped their
> > preferences. Am I right in thinking that I could test RecommenderJob by
> > feeding X random preferences of a user, having hidden the remainder of
> > their preferences, and see if the hidden items/preferences would become
> > their recommendations? However, that approach would change what a user
> > "likes" (by hiding their preferences for testing purposes) and I'd be
> > concerned about the value of the recommendation. Am I in a loop? Is
> there a
> > way to somehow tap into the recommendation to get an accuracy metric out?
> >
> > Did anyone, perhaps, share a method or a script (R, Python, Java) for
> > evaluating RecommenderJob results?
> >
> > Many thanks,
> > Rafal Lukawiecki
> >
>
>
>

Re: Evaluating Precision and Recall of Various Similarity Metrics

Posted by Rafal Lukawiecki <ra...@projectbotticelli.com>.
Hi Sebastian, thank you for your suggestions, including the idea of considering other similarity measures like the log-likelihood ratio. I still hope to do a comparison of all of the available ones on our data. I realise the importance (and also some limitations) of A/B testing in production, but having a broader way to test the recommender would have been useful.

I suppose I am used to looking at lift/profit charts, cross-validation, RMSE, and similar metrics of accuracy and reliability when working with data mining models such as decision trees or clustering, and also when evaluating association rules, where I'd hope that the model correctly predicts basket completions. I am curious whether there is anything along this line of thinking for evaluating recommenders that do not expose explicit models.
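
The closest analogue I have found so far is the hold-out evaluator in the non-Hadoop (Taste) code path, which does support this style of test. A minimal sketch, assuming a userID,itemID,preference file called prefs.csv; the file name and the choice of Tanimoto similarity are placeholders of mine, and this evaluates the in-memory item-based recommender rather than RecommenderJob.

import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

// Hold-out RMSE for an in-memory item-based recommender; "prefs.csv" is a
// placeholder for a userID,itemID,preference file.
public class RmseHoldout {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));

    RecommenderBuilder builder = dataModel ->
        new GenericItemBasedRecommender(dataModel,
            new TanimotoCoefficientSimilarity(dataModel));  // swap in other similarities here

    RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
    // train on 90% of each user's preferences, evaluate on the rest, using all users
    double rmse = evaluator.evaluate(builder, null, model, 0.9, 1.0);
    System.out.println("RMSE = " + rmse);
  }
}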

Many thanks, very much indeed, for all your replies.

Rafal
  
On 8 Aug 2013, at 17:58, Sebastian Schelter <ss...@apache.org>
 wrote:

Hi Rafal,

you are right, unfortunately there is no tooling available for doing
holdout tests with RecommenderJob. It would be an awesome contribution to
Mahout though.

Ideally, you would want to split your dataset in a way that retains some
portion of the interactions of each user and then see how many of the
held-out interactions you can reproduce. You should be aware that this is
basically a test of how well a recommender can reproduce what already
happened. If you get recommendations for items that are not in your
held-out data, this does not automatically mean that they are wrong. They
might be very interesting things that the user simply hasn't had a chance
to look at yet. The real "performance" of a recommender can only be found
via extensive A/B testing in production systems.

Btw, I would strongly recommend that you use a more sophisticated
similarity than cooccurrence count, e.g. the log-likelihood ratio.

Best,
Sebastian


2013/8/8 Rafal Lukawiecki <ra...@projectbotticelli.com>

> I'd like to compare the accuracy, precision and recall of various vector
> similarity measures with regards to our data sets. Ideally, I'd like to do
> that for RecommenderJob, including CooccurrenceCount. However, I don't
> think RecommenderJob supports calculation of the performance metrics.
> 
> Alternatively, I could use the evaluator logic in the non-Hadoop-based
> Item-based recommenders, but they do not seem to support the option of
> using CooccurrenceCount as a measure, or am I wrong?
> 
> Reading archived conversations from here, I can see others have asked a
> similar question in 2011 (
> http://comments.gmane.org/gmane.comp.apache.mahout.user/9758) but there
> seems no clear guidance. Also, I am unsure if it is valid to split the data
> set into training/testing that way, as testing users' key characteristic is
> the items they have preferred—and there is no "model" to fit them to, so to
> speak, or they would become anonymous users if we stripped their
> preferences. Am I right in thinking that I could test RecommenderJob by
> feeding X random preferences of a user, having hidden the remainder of
> their preferences, and see if the hidden items/preferences would become
> their recommendations? However, that approach would change what a user
> "likes" (by hiding their preferences for testing purposes) and I'd be
> concerned about the value of the recommendation. Am I in a loop? Is there a
> way to somehow tap into the recommendation to get an accuracy metric out?
> 
> Did anyone, perhaps, share a method or a script (R, Python, Java) for
> evaluating RecommenderJob results?
> 
> Many thanks,
> Rafal Lukawiecki
> 



Re: Evaluating Precision and Recall of Various Similarity Metrics

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Rafal,

you are right, unfortunately there is no tooling available for doing
holdout tests with RecommenderJob. It would be an awesome contribution to
Mahout though.

Ideally, you would want to split your dataset in a way that retains some
portion of the interactions of each user and then see how many of the
held-out interactions you can reproduce. You should be aware that this is
basically a test of how well a recommender can reproduce what already
happened. If you get recommendations for items that are not in your
held-out data, this does not automatically mean that they are wrong. They
might be very interesting things that the user simply hasn't had a chance
to look at yet. The real "performance" of a recommender can only be found
via extensive A/B testing in production systems.

Btw, I would strongly recommend that you use a more sophisticated
similarity than cooccurrence count, e.g. the log-likelihood ratio.
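
For the non-distributed code path you can already measure precision and recall at N with the Taste evaluator. A minimal sketch, assuming a userID,itemID[,preference] file called prefs.csv (the file name is a placeholder):

import java.io.File;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

// Precision/recall at 10 for an in-memory item-based recommender using the
// log-likelihood similarity; "prefs.csv" is a placeholder for a
// userID,itemID[,preference] file.
public class LlrPrecisionRecall {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));

    RecommenderBuilder builder = dataModel ->
        new GenericItemBasedRecommender(dataModel,
            new LogLikelihoodSimilarity(dataModel));

    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    // precision/recall at 10, letting the evaluator pick the relevance
    // threshold, evaluated over all users
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);

    System.out.println("precision@10 = " + stats.getPrecision());
    System.out.println("recall@10    = " + stats.getRecall());
  }
}

Keep in mind the caveat above: items outside the held-out set count against precision even when they would have been perfectly good recommendations.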

Best,
Sebastian


2013/8/8 Rafal Lukawiecki <ra...@projectbotticelli.com>

> I'd like to compare the accuracy, precision and recall of various vector
> similarity measures with regards to our data sets. Ideally, I'd like to do
> that for RecommenderJob, including CooccurrenceCount. However, I don't
> think RecommenderJob supports calculation of the performance metrics.
>
> Alternatively, I could use the evaluator logic in the non-Hadoop-based
> Item-based recommenders, but they do not seem to support the option of
> using CooccurrenceCount as a measure, or am I wrong?
>
> Reading archived conversations from here, I can see others have asked a
> similar question in 2011 (
> http://comments.gmane.org/gmane.comp.apache.mahout.user/9758) but there
> seems no clear guidance. Also, I am unsure if it is valid to split the data
> set into training/testing that way, as testing users' key characteristic is
> the items they have preferred—and there is no "model" to fit them to, so to
> speak, or they would become anonymous users if we stripped their
> preferences. Am I right in thinking that I could test RecommenderJob by
> feeding X random preferences of a user, having hidden the remainder of
> their preferences, and see if the hidden items/preferences would become
> their recommendations? However, that approach would change what a user
> "likes" (by hiding their preferences for testing purposes) and I'd be
> concerned about the value of the recommendation. Am I in a loop? Is there a
> way to somehow tap into the recommendation to get an accuracy metric out?
>
> Did anyone, perhaps, share a method or a script (R, Python, Java) for
> evaluating RecommenderJob results?
>
> Many thanks,
> Rafal Lukawiecki
>