Posted to user@mahout.apache.org by MT <ma...@telecom-bretagne.eu> on 2011/07/25 11:05:12 UTC

Mahout Binary Recommender Evaluator

I'm working on a fairly common kind of dataset that includes the user id,
item id, and timestamp (the moment the user bought the item). As there are
no preference values, I needed a binary item-based recommender, which I
found in Mahout (GenericBooleanPrefItemBasedRecommender with the Tanimoto
coefficient). Following the recommender documentation, I tried to evaluate
it with GenericRecommenderIRStatsEvaluator, but I ran into a few problems.
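
For reference, this is roughly how I set the recommender up. A minimal
sketch only: the file name and CSV layout are placeholders for my data, and
the enclosing class and exception handling are omitted.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// user_id,item_id[,timestamp] -- preference values are implicitly 1.0
DataModel model = new FileDataModel(new File("purchases.csv"));
ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
Recommender recommender =
    new GenericBooleanPrefItemBasedRecommender(model, similarity);
// e.g. 10 recommendations for user 42
List<RecommendedItem> recommendations = recommender.recommend(42L, 10);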

In fact, correct me if I'm wrong, but it seems to me the evaluator will
invariably give the same value for precision and recall. Since all items are
rated with the binary value 1.0, we give the evaluator a relevance threshold
lower than 1, so for each user "at" items are considered relevant and removed
from the user's preferences, and "at" recommendations are then computed.
Precision and recall are then computed from the two sets, relevant and
retrieved items, which (unless, I guess, the recommender cannot produce "at"
items) leads to precision and recall being equal.
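
To spell out the arithmetic, with hits = |relevant ∩ retrieved|:

    precision = hits / |retrieved|
    recall    = hits / |relevant|

Since the evaluator holds out |relevant| = at items and asks for
|retrieved| = at recommendations, both reduce to hits / at (e.g. at = 5 with
one hit gives precision = recall = 0.2).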

Results are still useful though, since a value of 0.2 for precision tells us
that among the "at" recommended items, 20% were actually bought by the user.
Although one can wonder whether those items are the best recommendations, the
least we can say is that they somehow correspond to the user's preferences.

However, I had a few ideas to give more meaning to precision and recall that
I wanted to share, to get some advice before implementing them.

I read this topic and I fully understand that IRStatsEvaluator is different
from classic evaluators (those giving the MAE, for example), but I feel that
it makes sense to have a parameter trainingPercentage that divides each
user's preferences into two subsets of items. The first (typically 20%) would
be the relevant items, to be predicted using the second subset. At the moment
this split is defined by "at", which often results in equal numbers of items
in the relevant and retrieved subsets. The "at" value would still be a
parameter defining the number of items retrieved. The evaluator could then be
run while varying these two parameters to find the best compromise between
precision and recall. A rough sketch of what I have in mind follows.
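
Purely to illustrate the idea, something along these lines. This is a
hypothetical helper, not existing Mahout API, and trainingPercentage is my
own name:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

// Hold out a fraction of one user's items as the "relevant" (test) set;
// the remaining items would stay in the training DataModel.
static FastIDSet relevantItemsFor(DataModel model, long userID,
    double trainingPercentage) throws TasteException {
  PreferenceArray prefs = model.getPreferencesFromUser(userID);
  int numRelevant =
      Math.max(1, (int) Math.round((1.0 - trainingPercentage) * prefs.length()));
  FastIDSet relevant = new FastIDSet(numRelevant);
  for (int i = 0; i < numRelevant; i++) {
    // taken in order here for brevity; ideally sampled at random
    relevant.add(prefs.getItemID(i));
  }
  return relevant;
}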

Furthermore, when the dataset contains a timestamp for each purchase, would
it not be logical to use the last items bought by the user as the test set?
The evaluator would then mirror what happens in real use.
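
Again only a sketch of the idea, assuming the DataModel carries timestamps
(as a FileDataModel with a timestamp column does); this is not existing
evaluator code, and the imports are the same Taste classes as above plus
java.util:

// Use the user's most recent purchases as the held-out "relevant" set.
static FastIDSet mostRecentItemsFor(final DataModel model, final long userID,
    int howMany) throws TasteException {
  PreferenceArray prefs = model.getPreferencesFromUser(userID);
  List<Long> itemIDs = new ArrayList<Long>();
  for (int i = 0; i < prefs.length(); i++) {
    itemIDs.add(prefs.getItemID(i));
  }
  // newest purchases first; getPreferenceTime returns the stored timestamp
  Collections.sort(itemIDs, new Comparator<Long>() {
    public int compare(Long a, Long b) {
      try {
        return model.getPreferenceTime(userID, b)
            .compareTo(model.getPreferenceTime(userID, a));
      } catch (TasteException e) {
        throw new IllegalStateException(e);
      }
    }
  });
  FastIDSet relevant = new FastIDSet(howMany);
  for (int i = 0; i < Math.min(howMany, itemIDs.size()); i++) {
    relevant.add(itemIDs.get(i));
  }
  return relevant;
}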

Finally, I believe the documentation page has some mistakes in the last code
excerpt:

evaluator.evaluate(builder, myModel, null, 3,
RecommenderIRStatusEvaluator.CHOOSE_THRESHOLD,
        &sect;1.0);

should be 
evaluator.evaluate(builder, null, myModel, null, 3,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
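
For completeness, here is how I am calling it on my boolean data. Just a
sketch of my own usage: the file name, the "at" value of 3, and the
DataModelBuilder (which keeps the training split boolean) are my choices,
not something taken from the documentation. Imports come from the
org.apache.mahout.cf.taste packages (eval, impl.eval, impl.common,
impl.model) plus the classes shown above.

RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) throws TasteException {
    ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
    return new GenericBooleanPrefItemBasedRecommender(model, similarity);
  }
};

// Wrap the training split in a boolean data model instead of passing null
DataModelBuilder modelBuilder = new DataModelBuilder() {
  public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
    return new GenericBooleanPrefDataModel(
        GenericBooleanPrefDataModel.toDataMap(trainingData, true));
  }
};

DataModel myModel = new FileDataModel(new File("purchases.csv"));
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
IRStatistics stats = evaluator.evaluate(builder, modelBuilder, myModel, null, 3,
    GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
System.out.println(stats.getPrecision() + " / " + stats.getRecall());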



Thanks for your help!


--
View this message in context: http://lucene.472066.n3.nabble.com/Mahout-Binary-Recommender-Evaluator-tp3196925p3196925.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Mahout Binary Recommender Evaluator

Posted by Marko Ciric <ci...@gmail.com>.
Correction: I didn't mean to re-implement the existing functionality, but
there should be an easy way to connect AUC with the Taste evaluators.

On 28 July 2011 12:57, Marko Ciric <ci...@gmail.com> wrote:

> I think it wouldn't be a big problem to reimplement it, though it would
> have to have a sort of wrapper around the existing evaluator (it would
> require to evaluate multiple times for a single user, having different
> relevant items I think).
>
>
> On 26 July 2011 01:35, Ted Dunning <te...@gmail.com> wrote:
>
>> Well, we do have numerous ways to compute AUC.  I don't think that they
>> are
>> integrated into the recommendation evaluation framework yet.  Would you
>> like
>> to take on the application of suitable glue?
>>
>>
>> On Mon, Jul 25, 2011 at 1:00 PM, Marko Ciric <ci...@gmail.com>
>> wrote:
>>
>> > Is there a plan to include this in Mahout in some future release?
>> >
>> > On 25 July 2011 17:20, Ted Dunning <te...@gmail.com> wrote:
>> >
>> > > That would allow you to compute AUC which might be useful.  AUC is the
>> > > probability that a relevant (purchased) item is ranked higher than a
>> > > non-relevant item.
>> > >
>> > > On Mon, Jul 25, 2011 at 3:16 AM, Marko Ciric <ci...@gmail.com>
>> > > wrote:
>> > >
>> > > > The better way to do it is to implement an evaluator which accepts
>> the
>> > > > collection of items that are relevant.
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > --
>> > Marko Ćirić
>> > ciric.marko@gmail.com
>> >
>>
>
>
>
> --
> --
> Marko Ćirić
> ciric.marko@gmail.com
>
>


-- 
--
Marko Ćirić
ciric.marko@gmail.com

Re: Mahout Binary Recommender Evaluator

Posted by Marko Ciric <ci...@gmail.com>.
I think it wouldn't be a big problem to reimplement it, though it would have
to be a sort of wrapper around the existing evaluator (it would require
evaluating multiple times for a single user, with different relevant items
each time, I think).

On 26 July 2011 01:35, Ted Dunning <te...@gmail.com> wrote:

> Well, we do have numerous ways to compute AUC.  I don't think that they are
> integrated into the recommendation evaluation framework yet.  Would you
> like
> to take on the application of suitable glue?
>
>
> On Mon, Jul 25, 2011 at 1:00 PM, Marko Ciric <ci...@gmail.com>
> wrote:
>
> > Is there a plan to include this in Mahout in some future release?
> >
> > On 25 July 2011 17:20, Ted Dunning <te...@gmail.com> wrote:
> >
> > > That would allow you to compute AUC which might be useful.  AUC is the
> > > probability that a relevant (purchased) item is ranked higher than a
> > > non-relevant item.
> > >
> > > On Mon, Jul 25, 2011 at 3:16 AM, Marko Ciric <ci...@gmail.com>
> > > wrote:
> > >
> > > > The better way to do it is to implement an evaluator which accepts
> the
> > > > collection of items that are relevant.
> > > >
> > >
> >
> >
> >
> > --
> > --
> > Marko Ćirić
> > ciric.marko@gmail.com
> >
>



-- 
--
Marko Ćirić
ciric.marko@gmail.com

Re: Mahout Binary Recommender Evaluator

Posted by Ted Dunning <te...@gmail.com>.
Well, we do have numerous ways to compute AUC.  I don't think that they are
integrated into the recommendation evaluation framework yet.  Would you like
to take on the application of suitable glue?


On Mon, Jul 25, 2011 at 1:00 PM, Marko Ciric <ci...@gmail.com> wrote:

> Is there a plan to include this in Mahout in some future release?
>
> On 25 July 2011 17:20, Ted Dunning <te...@gmail.com> wrote:
>
> > That would allow you to compute AUC which might be useful.  AUC is the
> > probability that a relevant (purchased) item is ranked higher than a
> > non-relevant item.
> >
> > On Mon, Jul 25, 2011 at 3:16 AM, Marko Ciric <ci...@gmail.com>
> > wrote:
> >
> > > The better way to do it is to implement an evaluator which accepts the
> > > collection of items that are relevant.
> > >
> >
>
>
>
> --
> --
> Marko Ćirić
> ciric.marko@gmail.com
>

Re: Mahout Binary Recommender Evaluator

Posted by Marko Ciric <ci...@gmail.com>.
Is there a plan to include this in Mahout in some future release?

On 25 July 2011 17:20, Ted Dunning <te...@gmail.com> wrote:

> That would allow you to compute AUC which might be useful.  AUC is the
> probability that a relevant (purchased) item is ranked higher than a
> non-relevant item.
>
> On Mon, Jul 25, 2011 at 3:16 AM, Marko Ciric <ci...@gmail.com>
> wrote:
>
> > The better way to do it is to implement an evaluator which accepts the
> > collection of items that are relevant.
> >
>



-- 
--
Marko Ćirić
ciric.marko@gmail.com

Re: Mahout Binary Recommender Evaluator

Posted by Ted Dunning <te...@gmail.com>.
That would allow you to compute AUC which might be useful.  AUC is the
probability that a relevant (purchased) item is ranked higher than a
non-relevant item.
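
Roughly, the glue could look something like this, using the Auc helper from
org.apache.mahout.classifier.evaluation. The heldOut/candidates split and
the aucForUser method are hypothetical, not an existing evaluator:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.classifier.evaluation.Auc;

// heldOut: items the user actually bought, hidden from training.
// candidates: items the user never bought (e.g. a random sample).
static double aucForUser(Recommender recommender, long userID,
    FastIDSet heldOut, FastIDSet candidates) throws TasteException {
  Auc auc = new Auc();
  LongPrimitiveIterator relevant = heldOut.iterator();
  while (relevant.hasNext()) {
    auc.add(1, recommender.estimatePreference(userID, relevant.nextLong()));
  }
  LongPrimitiveIterator others = candidates.iterator();
  while (others.hasNext()) {
    auc.add(0, recommender.estimatePreference(userID, others.nextLong()));
  }
  // estimate of P(purchased item scores higher than a non-purchased one)
  return auc.auc();
}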

On Mon, Jul 25, 2011 at 3:16 AM, Marko Ciric <ci...@gmail.com> wrote:

> The better way to do it is to implement an evaluator which accepts the
> collection of items that are relevant.
>

Re: Mahout Binary Recommender Evaluator

Posted by Marko Ciric <ci...@gmail.com>.
Hi,

First of all, it's rather easy to change the evaluator so that it does not
remove all the items (which is what happens when working with a boolean
preference data set). The easiest implementation would be to use the
relevanceThreshold argument as a percentage of the whole user's preference
data set. For example, if it is 0.4, you remove 40% of the items that are
preferred, and so on.

The better way to do it is to implement an evaluator which accepts the
collection of items that are relevant.


On 25 July 2011 11:55, Sean Owen <sr...@gmail.com> wrote:

> On Mon, Jul 25, 2011 at 10:05 AM, MT <mael.thomas@telecom-bretagne.eu
> >wrote:
> >
> >
> > In fact, correct me if I'm wrong, but to me the evaluator will invariably
> > give us the same value for precision and recall. Since the items are all
> > rated with the binary 1.0 value, we give the recommender a threshold
> lower
> > than 1, thus for each user at items are considered relevant and removed
> > from
> > the user's preferences to compute at recommendations. Precision and
> recall
> > are then computed with the two sets : relevant and retrieved items. Which
> > leads (I guess unless the recommender cannot compute at items) to
> precision
> > and recall being equal.
> >
>
> I think that's right in this case, where there are no ratings. It's pretty
> artificial to define 'relevant' here based on ratings!
> This isn't true if you have ratings.
>
>
> >
> > Results are still useful though, since a value of 0.2 for precision tells
> > us
> > that among the at recommended items, 20% were effectively bought by the
> > user. Although one can wonder if those items are the best
> recommendations,
> > the least we can say is that it somehow corresponds to the user's
> > preferences.
> >
>
> Right.
>
>
> I read this topic and I fully understand that IRStatsEvaluator is different
> > from classic evaluators (giving the MAE for example), but I feel that it
> > makes sense to have a parameter trainingPercentage that divides users'
> > preferences in two subsets of items. The first (typically 20%) are
> > considered as relevant items, which are to be predicted using the second
> > subset. This task is at the moment defined by at, resulting in often
> equal
> > numbers of items in the relevant and retrieved subset. This at value
> would
> > still be a parameter used to define the number of items retrieved. The
> > evaluator could then be run varying these two parameters to find the best
> > compromise between precision and recall.
> >
>
> I think it already has this parameter? it already accepts an "at" value. Is
> this what you mean? maybe an example or patch would clarify.
>
>
> >
> > Furthermore, should the dataset contain a timestamp for each purchase,
> > would
> > it not be logical to set the test set as the last items bought by the user
> ?
> > The evaluator would then follow what happens in real calculations.
> >
>
> Yes that sounds like a great improvement. The only difficulty is including
> it in a clean way. Up for a patch?
>
>
>
> >
> > Finally, I believe the documentation page has some mistakes in the last
> code
> > excerpt :
> >
> > evaluator.evaluate(builder, myModel, null, 3,
> > RecommenderIRStatusEvaluator.CHOOSE_THRESHOLD,
> >        &sect;1.0);
> >
> > should be
> > evaluator.evaluate(builder, null, myModel, null, 3,
> > GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
> >
> >
> > OK will look at that.
>



-- 
--
Marko Ćirić
ciric.marko@gmail.com

Re: Mahout Binary Recommender Evaluator

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Jul 25, 2011 at 10:05 AM, MT <ma...@telecom-bretagne.eu> wrote:
>
>
> In fact, correct me if I'm wrong, but to me the evaluator will invariably
> give us the same value for precision and recall. Since the items are all
> rated with the binary 1.0 value, we give the recommender a threshold lower
> than 1, thus for each user at items are considered relevant and removed
> from
> the user's preferences to compute at recommendations. Precision and recall
> are then computed with the two sets : relevant and retrieved items. Which
> leads (I guess unless the recommender cannot compute at items) to precision
> and recall being equal.
>

I think that's right in this case, where there are no ratings. It's pretty
artificial to define 'relevant' here based on ratings!
This isn't true if you have ratings.


>
> Results are still useful though, since a value of 0.2 for precision tells
> us
> that among the at recommended items, 20% were effectively bought by the
> user. Although one can wonder if those items are the best recommendations,
> the least we can say is that it somehow corresponds to the user's
> preferences.
>

Right.


I read this topic and I fully understand that IRStatsEvaluator is different
> from classic evaluators (giving the MAE for example), but I feel that it
> makes sense to have a parameter trainingPercentage that divides users'
> preferences in two subsets of items. The first (typically 20%) are
> considered as relevant items, which are to be predicted using the second
> subset. This task is at the moment defined by at, resulting in often equal
> numbers of items in the relevant and retrieved subset. This at value would
> still be a parameter used to define the number of items retrieved. The
> evaluator could then be run varying these two parameters to find the best
> compromise between precision and recall.
>

I think it already has this parameter? It already accepts an "at" value. Is
this what you mean? Maybe an example or patch would clarify.


>
> Furthermore, should the dataset contain a timestamp for each purchase,
> would
> it not be logical to set the test set as the last items bought by the user?
> The evaluator would then follow what happens in real calculations.
>

Yes that sounds like a great improvement. The only difficulty is including
it in a clean way. Up for a patch?



>
> Finally, I believe the documentation page has some mistakes in the last code
> excerpt :
>
> evaluator.evaluate(builder, myModel, null, 3,
> RecommenderIRStatusEvaluator.CHOOSE_THRESHOLD,
>        &sect;1.0);
>
> should be
> evaluator.evaluate(builder, null, myModel, null, 3,
> GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
>
>
OK will look at that.