Posted to user@mahout.apache.org by Ahmet Ylmaz <ah...@yahoo.com> on 2013/02/16 19:30:00 UTC

Problems with Mahout's RecommenderIRStatsEvaluator

Hi,

I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I think that there are two important problems here.

According to my understanding, the experimental protocol used in this code is something like this:

It takes away a certain percentage of users as test users.
For each test user it builds a training set consisting of the ratings given by all other users, plus the ratings of the test user which are below the relevanceThreshold.
It then builds a model, makes a recommendation to the test user, and finds the intersection between this recommendation list and the items which are rated above the relevanceThreshold by the test user.
It then calculates the precision and recall in the usual way.
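
In other words, as far as I can tell the per-user loop looks roughly like the sketch below. This is only my reading of the code, written against made-up interfaces (TopNRecommender, ModelBuilder) rather than Mahout's actual classes, so please treat it as an illustration of the protocol, not as the implementation:

    import java.util.*;

    // Hypothetical stand-ins for "a trained model" and "something that trains one".
    interface TopNRecommender {
      List<Long> recommend(long userID, int howMany);
    }
    interface ModelBuilder {
      TopNRecommender build(Map<Long, Map<Long, Double>> trainingRatings);
    }

    final class PerUserIRSketch {
      // ratings: userID -> (itemID -> rating value)
      static double[] evaluate(Map<Long, Map<Long, Double>> ratings, ModelBuilder builder,
                               double relevanceThreshold, int at, double testUserFraction) {
        Random random = new Random(42);
        double precisionSum = 0.0, recallSum = 0.0;
        int evaluated = 0;
        for (Map.Entry<Long, Map<Long, Double>> user : ratings.entrySet()) {
          if (random.nextDouble() >= testUserFraction) {
            continue;                                    // not sampled as a test user
          }
          long testUser = user.getKey();
          Set<Long> relevant = new HashSet<>();          // items rated at/above the threshold
          Map<Long, Double> keptPrefs = new HashMap<>(); // only the low-rated prefs stay in training
          for (Map.Entry<Long, Double> pref : user.getValue().entrySet()) {
            if (pref.getValue() >= relevanceThreshold) {
              relevant.add(pref.getKey());
            } else {
              keptPrefs.put(pref.getKey(), pref.getValue());
            }
          }
          if (relevant.isEmpty()) {
            continue;                                    // nothing to look for in the top-N list
          }
          Map<Long, Map<Long, Double>> training = new HashMap<>(ratings);
          training.put(testUser, keptPrefs);             // everyone else's ratings are untouched
          TopNRecommender model = builder.build(training);   // a fresh model per test user
          List<Long> topN = model.recommend(testUser, at);
          long hits = topN.stream().filter(relevant::contains).count();
          precisionSum += topN.isEmpty() ? 0.0 : (double) hits / topN.size();
          recallSum += (double) hits / relevant.size();
          evaluated++;
        }
        if (evaluated == 0) {
          return new double[] { Double.NaN, Double.NaN };
        }
        return new double[] { precisionSum / evaluated, recallSum / evaluated };
      }
    }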

Problems:
1. (mild) It builds a model for every test user, which can take a lot of time.

2. (severe) Only the ratings (of the test user) which are below the relevanceThreshold are put into the training set. This means that the algorithm only knows the preferences of the test user about the items which s/he doesn't like. This is not a good representation of user ratings.

Moreover, when I ran this evaluator on the MovieLens 1M data, the precision and recall turned out to be, respectively,

0.011534185658699288
0.007905982905982885

and the run took about 13 minutes on my Intel Core i3. (I used user-based recommendation with k=2.)
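
For concreteness, the evaluation I ran looks roughly like the sketch below against the Taste API. The Pearson similarity, the data-file path, "at = 10", and CHOOSE_THRESHOLD are illustrative choices rather than my exact configuration, and as far as I know FileDataModel does not parse the MovieLens "::" delimiter, so the ratings file has to be converted to comma-separated form first:

    import java.io.File;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.IRStatistics;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class IRStatsExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,rating (converted from the MovieLens "::" format)
        DataModel model = new FileDataModel(new File("ratings.csv"));

        RecommenderBuilder builder = new RecommenderBuilder() {
          @Override
          public Recommender buildRecommender(DataModel trainingModel) throws TasteException {
            UserSimilarity similarity = new PearsonCorrelationSimilarity(trainingModel);
            UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(2, similarity, trainingModel);   // k = 2
            return new GenericUserBasedRecommender(trainingModel, neighborhood, similarity);
          }
        };

        RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
        // at = 10 recommendations per user; let the evaluator pick the relevance
        // threshold per user; consider every user as a test candidate (1.0)
        IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
            GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);

        System.out.println("precision = " + stats.getPrecision());
        System.out.println("recall    = " + stats.getRecall());
      }
    }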


Although I know that it is not OK to judge the performance of a recommendation algorithm by looking at these absolute precision and recall values, these numbers still seem too low to me, which might be the result of the second problem I mentioned above.

Am I missing something?

Thanks
Ahmet

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Ahmet Ylmaz <ah...@yahoo.com>.
Thanks for the replies.




________________________________
 From: Sean Owen <sr...@gmail.com>
To: Mahout User List <us...@mahout.apache.org> 
Sent: Saturday, February 16, 2013 11:34 PM
Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
 
I understand the idea, but this boils down to the current implementation,
plus going back and throwing out some additional training data that is
lower rated -- it's in neither test nor training. Anything's possible, but I
do not imagine this is a helpful practice in general.


On Sat, Feb 16, 2013 at 10:29 PM, Tevfik Aytekin
<te...@gmail.com>wrote:

> I'm suggesting the second one. In that way the test user's ratings in
> the training set will compose of both low and high rated items, that
> prevents the problem pointed out by Ahmet.
>
> On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen <sr...@gmail.com> wrote:
> > If you're suggesting that you hold out only high-rated items, and then
> > sample them, then that's what is done already in the code, except without
> > the sampling. The sampling doesn't buy anything that I can see.
> >
> > If you're suggesting holding out a random subset and then throwing away
> the
> > held-out items with low rating, then it's also the same idea, except
> you're
> > randomly throwing away some lower-rated data from both test and train. I
> > don't see what that helps either.
> >
> >
> > On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin <
> tevfik.aytekin@gmail.com>wrote:
> >
> >> What I mean is you can choose ratings randomly and try to recommend
> >> the ones above  the threshold
> >>
> >>
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
I understand the idea, but this boils down to the current implementation,
plus going back and throwing out some additional training data that is
lower rated -- it's in neither test nor training. Anything's possible, but I
do not imagine this is a helpful practice in general.


On Sat, Feb 16, 2013 at 10:29 PM, Tevfik Aytekin
<te...@gmail.com>wrote:

> I'm suggesting the second one. In that way the test user's ratings in
> the training set will compose of both low and high rated items, that
> prevents the problem pointed out by Ahmet.
>
> On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen <sr...@gmail.com> wrote:
> > If you're suggesting that you hold out only high-rated items, and then
> > sample them, then that's what is done already in the code, except without
> > the sampling. The sampling doesn't buy anything that I can see.
> >
> > If you're suggesting holding out a random subset and then throwing away
> the
> > held-out items with low rating, then it's also the same idea, except
> you're
> > randomly throwing away some lower-rated data from both test and train. I
> > don't see what that helps either.
> >
> >
> > On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin <
> tevfik.aytekin@gmail.com>wrote:
> >
> >> What I mean is you can choose ratings randomly and try to recommend
> >> the ones above  the threshold
> >>
> >>
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Tevfik Aytekin <te...@gmail.com>.
I'm suggesting the second one. In that way the test user's ratings in the training set will consist of both low- and high-rated items, which prevents the problem pointed out by Ahmet.

On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen <sr...@gmail.com> wrote:
> If you're suggesting that you hold out only high-rated items, and then
> sample them, then that's what is done already in the code, except without
> the sampling. The sampling doesn't buy anything that I can see.
>
> If you're suggesting holding out a random subset and then throwing away the
> held-out items with low rating, then it's also the same idea, except you're
> randomly throwing away some lower-rated data from both test and train. I
> don't see what that helps either.
>
>
> On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin <te...@gmail.com>wrote:
>
>> What I mean is you can choose ratings randomly and try to recommend
>> the ones above  the threshold
>>
>>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
If you're suggesting that you hold out only high-rated items, and then
sample them, then that's what is done already in the code, except without
the sampling. The sampling doesn't buy anything that I can see.

If you're suggesting holding out a random subset and then throwing away the
held-out items with low rating, then it's also the same idea, except you're
randomly throwing away some lower-rated data from both test and train. I
don't see what that helps either.


On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin <te...@gmail.com>wrote:

> What I mean is you can choose ratings randomly and try to recommend
> the ones above  the threshold
>
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Pat Ferrel <pa...@gmail.com>.
Time splits are fine but may contain anomalies that bias the data. If you are going to compare two recommenders based on time splits, make sure the data is exactly the same for each recommender. One time split we did to create a 90-10 training to test set had a split date of 12/24!  Some form of random hold-out will be less prone to time based systematic variation like seasonality, holidays, day of week, and the like. Stay with the same data when comparing and at least the tests will vary together. 

We still use time based splits, partly for the reasons Ted mentions but knowing the limitations is always good.
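
For the record, a time split is just a deterministic cut on timestamps, something like the sketch below (the Rating fields and the helper are assumed, not code from any of our projects). Because the cut is deterministic, both recommenders you are comparing see exactly the same train and test data:

    import java.util.*;

    final class Rating {
      final long userID, itemID, timestamp;
      final double value;
      Rating(long userID, long itemID, double value, long timestamp) {
        this.userID = userID; this.itemID = itemID; this.value = value; this.timestamp = timestamp;
      }
    }

    final class TimeSplit {
      // Returns {train, test}: the oldest trainFraction of the ratings go to training.
      static List<List<Rating>> split(List<Rating> all, double trainFraction) {
        List<Rating> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingLong((Rating r) -> r.timestamp));   // oldest first
        int cut = (int) Math.round(sorted.size() * trainFraction);          // e.g. 0.9 -> 90/10 split
        return Arrays.asList(new ArrayList<>(sorted.subList(0, cut)),
                             new ArrayList<>(sorted.subList(cut, sorted.size())));
      }
    }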

On Feb 16, 2013, at 3:12 PM, Ted Dunning <te...@gmail.com> wrote:

There are a variety of common time based effects which make time splits best in many practical cases.  Having the training data all be from the past emulates this better than random splits. 

For one thing, you can have the same user under different names in training and test.  For another thing, in real life you get data from the past of the user under consideration. As a third consideration, topical events can influence all users in common.  

These all mean that random training splits can have very large error in estimated performance. 

Sent from my iPhone

On Feb 16, 2013, at 1:41 PM, Tevfik Aytekin <te...@gmail.com> wrote:

> What I mean is you can choose ratings randomly and try to recommend
> the ones above  the threshold
> 
> On Sat, Feb 16, 2013 at 10:32 PM, Sean Owen <sr...@gmail.com> wrote:
>> Sure, if you were predicting ratings for one movie given a set of ratings
>> for that movie and the ratings for many other movies. That isn't what the
>> recommender problem is. Here, the problem is to list N movies most likely
>> to be top-rated. The precision-recall test is, in turn, a test of top N
>> results, not a test over prediction accuracy. We aren't talking about RMSE
>> here or even any particular means of generating top N recommendations. You
>> don't even have to predict ratings to make a top N list.
>> 
>> 
>> On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin <te...@gmail.com>wrote:
>> 
>>> No, rating prediction is clearly a supervised ML problem
>>> 
>>> On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen <sr...@gmail.com> wrote:
>>>> This is a good answer for evaluation of supervised ML, but, this is
>>>> unsupervised. Choosing randomly is choosing the 'right answers' randomly,
>>>> and that's plainly problematic.
>>>> 
>>>> 
>>>> On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <
>>> tevfik.aytekin@gmail.com>wrote:
>>>> 
>>>>> I think, it is better to choose ratings of the test user in a random
>>>>> fashion.
>>>>> 
>>>>> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
>>>>>> Yes. But: the test sample is small. Using 40% of your data to test is
>>>>>> probably quite too much.
>>>>>> 
>>>>>> My point is that it may be the least-bad thing to do. What test are
>>> you
>>>>>> proposing instead, and why is it coherent with what you're testing?
>>>>>> 
>>>>> 
>>> 


Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Ted Dunning <te...@gmail.com>.
There are a variety of common time based effects which make time splits best in many practical cases.  Having the training data all be from the past emulates this better than random splits. 

For one thing, you can have the same user under different names in training and test.  For another thing, in real life you get data from the past of the user under consideration. As a third consideration, topical events can influence all users in common.  

These all mean that random training splits can have very large error in estimated performance. 

Sent from my iPhone

On Feb 16, 2013, at 1:41 PM, Tevfik Aytekin <te...@gmail.com> wrote:

> What I mean is you can choose ratings randomly and try to recommend
> the ones above  the threshold
> 
> On Sat, Feb 16, 2013 at 10:32 PM, Sean Owen <sr...@gmail.com> wrote:
>> Sure, if you were predicting ratings for one movie given a set of ratings
>> for that movie and the ratings for many other movies. That isn't what the
>> recommender problem is. Here, the problem is to list N movies most likely
>> to be top-rated. The precision-recall test is, in turn, a test of top N
>> results, not a test over prediction accuracy. We aren't talking about RMSE
>> here or even any particular means of generating top N recommendations. You
>> don't even have to predict ratings to make a top N list.
>> 
>> 
>> On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin <te...@gmail.com>wrote:
>> 
>>> No, rating prediction is clearly a supervised ML problem
>>> 
>>> On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen <sr...@gmail.com> wrote:
>>>> This is a good answer for evaluation of supervised ML, but, this is
>>>> unsupervised. Choosing randomly is choosing the 'right answers' randomly,
>>>> and that's plainly problematic.
>>>> 
>>>> 
>>>> On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <
>>> tevfik.aytekin@gmail.com>wrote:
>>>> 
>>>>> I think, it is better to choose ratings of the test user in a random
>>>>> fashion.
>>>>> 
>>>>> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
>>>>>> Yes. But: the test sample is small. Using 40% of your data to test is
>>>>>> probably quite too much.
>>>>>> 
>>>>>> My point is that it may be the least-bad thing to do. What test are
>>> you
>>>>>> proposing instead, and why is it coherent with what you're testing?
>>>>>> 
>>>>> 
>>> 

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Tevfik Aytekin <te...@gmail.com>.
What I mean is you can choose ratings randomly and try to recommend
the ones above  the threshold

On Sat, Feb 16, 2013 at 10:32 PM, Sean Owen <sr...@gmail.com> wrote:
> Sure, if you were predicting ratings for one movie given a set of ratings
> for that movie and the ratings for many other movies. That isn't what the
> recommender problem is. Here, the problem is to list N movies most likely
> to be top-rated. The precision-recall test is, in turn, a test of top N
> results, not a test over prediction accuracy. We aren't talking about RMSE
> here or even any particular means of generating top N recommendations. You
> don't even have to predict ratings to make a top N list.
>
>
> On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin <te...@gmail.com>wrote:
>
>> No, rating prediction is clearly a supervised ML problem
>>
>> On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen <sr...@gmail.com> wrote:
>> > This is a good answer for evaluation of supervised ML, but, this is
>> > unsupervised. Choosing randomly is choosing the 'right answers' randomly,
>> > and that's plainly problematic.
>> >
>> >
>> > On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <
>> tevfik.aytekin@gmail.com>wrote:
>> >
>> >> I think, it is better to choose ratings of the test user in a random
>> >> fashion.
>> >>
>> >> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
>> >> > Yes. But: the test sample is small. Using 40% of your data to test is
>> >> > probably quite too much.
>> >> >
>> >> > My point is that it may be the least-bad thing to do. What test are
>> you
>> >> > proposing instead, and why is it coherent with what you're testing?
>> >> >
>> >>
>>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Osman Başkaya <os...@computer.org>.
Correction:

- Are you saying that this job is unsupervised since no user can rate all
of the movies. For this reason, we won't be sure that our predicted top-N
list contains no relevant item because it can be possible that our top-N
recommendation list has relevant movie(s) which hasn't rated by the user *
yet* as relevant. By using this evaluation procedure we miss them.

+ Are you saying that this job is unsupervised since no user can rate all of the movies? For this reason, we can't be sure that our predicted top-N list contains no relevant items, because it is possible that our top-N recommendation list has relevant movies which haven't been rated as relevant by the user *yet*. By using this evaluation procedure we may miss the evaluated relevant item, because the top-N list is full of prospective relevant items which haven't been rated by the user yet, and the relevant item under evaluation can end up outside the list.

Sorry for the inconvenience.

On Sun, Feb 17, 2013 at 1:56 PM, Osman Başkaya
<os...@computer.org>wrote:

> I am sorry to extend the unsupervised/supervised discussion which is not
> the main question here but I need to ask.
>
> Sean, I don't understand your last answer. Let's assume our rating scale
> is from 1 to 5. We can say that those movies which a particular user rates
> as 5 are relevant for him/her. 5 is just a number, we can use *relevance
> threshold *like you did and we can follow the method described in Cremonesi
> et al. Performance of Recommender Algorithms on Top-N Recommendation Tasks<http://goo.gl/pejO7>(
> *2. Testing Methodology - p.2*).
>
> Are you saying that this job is unsupervised since no user can rate all of
> the movies. For this reason, we won't be sure that our predicted top-N list
> contains no relevant item because it can be possible that our top-N
> recommendation list has relevant movie(s) which hasn't rated by the user *
> yet* as relevant. By using this evaluation procedure we miss them.
>
> In short, The following assumption can be problematic:
>
> We randomly select 1000 additional items unrated by
>> user u. We may assume that most of them will not be
>> of interest to user u.
>
>
> Although bigger N values overcomes this problem mostly, still it does not
> seem totally supervised.
>
>
> On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> The very question at hand is how to label the data as "relevant" and "not
>> relevant" results. The question exists because this is not given, which is
>> why I would not call this a supervised problem. That may just be
>> semantics,
>> but the point I wanted to make is that the reasons choosing a random
>> training set are correct for a supervised learning problem are not reasons
>> to determine the labels randomly from among the given data. It is a good
>> idea if you're doing, say, logistic regression. It's not the best way
>> here.
>> This also seems to reflect the difference between whatever you want to
>> call
>> this and your garden variety supervised learning problem.
>>
>> On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > Sean
>> >
>> > I think it is still a supervised learning problem in that there is a
>> > labelled training data set and an unlabeled test data set.
>> >
>> > Learning a ranking doesn't change the basic dichotomy between supervised
>> > and unsupervised.  It just changes the desired figure of merit.
>> >
>>
>
>
>
> --
> Osman Başkaya
> Koc University
> MS Student | Computer Science and Engineering
>



-- 
Osman Başkaya
Koc University
MS Student | Computer Science and Engineering

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
I agree with that explanation. Is it "why" it's unsupervised? Well, I think of recommendation in the context of things like dimension reduction, which
are just structure-finding exercises. Often the input has no positive or
negative label (a click); everything is 'positive'. If you're predicting
anything, it's not one target, but many targets, one per item, as if you
have many small supervised problems.

Whatever that is called -- I was just saying that it's not a simple
supervised problem, and so it's not necessarily true that the things you do
when testing that kind of thing apply here.

Viewed through the supervised lens, I suppose you could say that this
process only ever predicts the positive class, and that's different. In
fact it is not classifying given test examples at all... it's like it is
telling you which of many classifiers (items) would be most likely to
return the positive class

On Sun, Feb 17, 2013 at 11:56 AM, Osman Başkaya
<os...@computer.org>wrote:

> I am sorry to extend the unsupervised/supervised discussion which is not
> the main question here but I need to ask.
>
> Sean, I don't understand your last answer. Let's assume our rating scale is
> from 1 to 5. We can say that those movies which a particular user rates as
> 5 are relevant for him/her. 5 is just a number, we can use *relevance
> threshold *like you did and we can follow the method described in Cremonesi
> et al. Performance of Recommender Algorithms on Top-N Recommendation
> Tasks<http://goo.gl/pejO7>(
> *2. Testing Methodology - p.2*).
>
> Are you saying that this job is unsupervised since no user can rate all of
> the movies. For this reason, we won't be sure that our predicted top-N list
> contains no relevant item because it can be possible that our top-N
> recommendation list has relevant movie(s) which hasn't rated by the user *
> yet* as relevant. By using this evaluation procedure we miss them.
>
> In short, The following assumption can be problematic:
>
> We randomly select 1000 additional items unrated by
> > user u. We may assume that most of them will not be
> > of interest to user u.
>
>
> Although bigger N values overcomes this problem mostly, still it does not
> seem totally supervised.
>
>
> On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > The very question at hand is how to label the data as "relevant" and "not
> > relevant" results. The question exists because this is not given, which
> is
> > why I would not call this a supervised problem. That may just be
> semantics,
> > but the point I wanted to make is that the reasons choosing a random
> > training set are correct for a supervised learning problem are not
> reasons
> > to determine the labels randomly from among the given data. It is a good
> > idea if you're doing, say, logistic regression. It's not the best way
> here.
> > This also seems to reflect the difference between whatever you want to
> call
> > this and your garden variety supervised learning problem.
> >
> > On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Sean
> > >
> > > I think it is still a supervised learning problem in that there is a
> > > labelled training data set and an unlabeled test data set.
> > >
> > > Learning a ranking doesn't change the basic dichotomy between
> supervised
> > > and unsupervised.  It just changes the desired figure of merit.
> > >
> >
>
>
>
> --
> Osman Başkaya
> Koc University
> MS Student | Computer Science and Engineering
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Osman Başkaya <os...@computer.org>.
I am sorry to extend the unsupervised/supervised discussion, which is not the main question here, but I need to ask.

Sean, I don't understand your last answer. Let's assume our rating scale is from 1 to 5. We can say that those movies which a particular user rates as 5 are relevant for him/her. 5 is just a number, we can use *relevance threshold* like you did and we can follow the method described in Cremonesi et al., "Performance of Recommender Algorithms on Top-N Recommendation Tasks" <http://goo.gl/pejO7> (*2. Testing Methodology - p.2*).
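
My reading of that testing methodology, in rough sketch form (the Scorer interface, the 1,000-item sample size, and the precision/recall formulas below are my paraphrase of the paper, not Mahout code):

    import java.util.*;

    // Stand-in for anything that can produce a preference score for (user, item).
    interface Scorer {
      double score(long userID, long itemID);
    }

    final class CremonesiStyleTest {
      // For each held-out 5-star rating (u, i): rank i against `sampleSize` random items
      // that u has not rated; count a hit if i lands in the top N of that ranking.
      static double recallAtN(List<long[]> heldOutFiveStars,        // {userID, itemID} pairs
                              Map<Long, Set<Long>> ratedByUser,
                              List<Long> allItems,
                              Scorer scorer, int n, int sampleSize, long seed) {
        Random random = new Random(seed);
        int hits = 0;
        for (long[] pair : heldOutFiveStars) {
          long u = pair[0], i = pair[1];
          Set<Long> candidates = new LinkedHashSet<>();
          candidates.add(i);
          while (candidates.size() < sampleSize + 1) {
            long item = allItems.get(random.nextInt(allItems.size()));
            if (!ratedByUser.get(u).contains(item)) {
              candidates.add(item);                     // unrated by u, assumed "not relevant"
            }
          }
          List<Long> ranked = new ArrayList<>(candidates);
          ranked.sort((a, b) -> Double.compare(scorer.score(u, b), scorer.score(u, a)));
          if (ranked.subList(0, n).contains(i)) {
            hits++;
          }
        }
        // recall@N = hits / |test set|; the paper then takes precision@N = recall@N / N.
        return (double) hits / heldOutFiveStars.size();
      }
    }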

Are you saying that this job is unsupervised since no user can rate all of the movies? For this reason, we can't be sure that our predicted top-N list contains no relevant item, because it is possible that our top-N recommendation list has relevant movie(s) which haven't been rated as relevant by the user *yet*. By using this evaluation procedure we miss them.

In short, The following assumption can be problematic:

We randomly select 1000 additional items unrated by
> user u. We may assume that most of them will not be
> of interest to user u.


Although bigger N values mostly overcome this problem, it still does not seem totally supervised.


On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen <sr...@gmail.com> wrote:

> The very question at hand is how to label the data as "relevant" and "not
> relevant" results. The question exists because this is not given, which is
> why I would not call this a supervised problem. That may just be semantics,
> but the point I wanted to make is that the reasons choosing a random
> training set are correct for a supervised learning problem are not reasons
> to determine the labels randomly from among the given data. It is a good
> idea if you're doing, say, logistic regression. It's not the best way here.
> This also seems to reflect the difference between whatever you want to call
> this and your garden variety supervised learning problem.
>
> On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Sean
> >
> > I think it is still a supervised learning problem in that there is a
> > labelled training data set and an unlabeled test data set.
> >
> > Learning a ranking doesn't change the basic dichotomy between supervised
> > and unsupervised.  It just changes the desired figure of merit.
> >
>



-- 
Osman Başkaya
Koc University
MS Student | Computer Science and Engineering

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
The very question at hand is how to label the data as "relevant" and "not
relevant" results. The question exists because this is not given, which is
why I would not call this a supervised problem. That may just be semantics,
but the point I wanted to make is that the reasons why choosing a random training set is correct for a supervised learning problem are not reasons to determine the labels randomly from among the given data. It is a good
idea if you're doing, say, logistic regression. It's not the best way here.
This also seems to reflect the difference between whatever you want to call
this and your garden variety supervised learning problem.

On Sat, Feb 16, 2013 at 11:15 PM, Ted Dunning <te...@gmail.com> wrote:

> Sean
>
> I think it is still a supervised learning problem in that there is a
> labelled training data set and an unlabeled test data set.
>
> Learning a ranking doesn't change the basic dichotomy between supervised
> and unsupervised.  It just changes the desired figure of merit.
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Ted Dunning <te...@gmail.com>.
Sean

I think it is still a supervised learning problem in that there is a labelled training data set and an unlabeled test data set. 

Learning a ranking doesn't change the basic dichotomy between supervised and unsupervised.  It just changes the desired figure of merit. 

Sent from my iPhone

On Feb 16, 2013, at 1:32 PM, Sean Owen <sr...@gmail.com> wrote:

> Sure, if you were predicting ratings for one movie given a set of ratings
> for that movie and the ratings for many other movies. That isn't what the
> recommender problem is. Here, the problem is to list N movies most likely
> to be top-rated. The precision-recall test is, in turn, a test of top N
> results, not a test over prediction accuracy. We aren't talking about RMSE
> here or even any particular means of generating top N recommendations. You
> don't even have to predict ratings to make a top N list.
> 
> 
> On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin <te...@gmail.com>wrote:
> 
>> No, rating prediction is clearly a supervised ML problem
>> 
>> On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen <sr...@gmail.com> wrote:
>>> This is a good answer for evaluation of supervised ML, but, this is
>>> unsupervised. Choosing randomly is choosing the 'right answers' randomly,
>>> and that's plainly problematic.
>>> 
>>> 
>>> On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <
>> tevfik.aytekin@gmail.com>wrote:
>>> 
>>>> I think, it is better to choose ratings of the test user in a random
>>>> fashion.
>>>> 
>>>> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
>>>>> Yes. But: the test sample is small. Using 40% of your data to test is
>>>>> probably quite too much.
>>>>> 
>>>>> My point is that it may be the least-bad thing to do. What test are
>> you
>>>>> proposing instead, and why is it coherent with what you're testing?
>>>>> 
>>>> 
>> 

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
Sure, if you were predicting ratings for one movie given a set of ratings
for that movie and the ratings for many other movies. That isn't what the
recommender problem is. Here, the problem is to list N movies most likely
to be top-rated. The precision-recall test is, in turn, a test of top N
results, not a test over prediction accuracy. We aren't talking about RMSE
here or even any particular means of generating top N recommendations. You
don't even have to predict ratings to make a top N list.


On Sat, Feb 16, 2013 at 9:28 PM, Tevfik Aytekin <te...@gmail.com>wrote:

> No, rating prediction is clearly a supervised ML problem
>
> On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen <sr...@gmail.com> wrote:
> > This is a good answer for evaluation of supervised ML, but, this is
> > unsupervised. Choosing randomly is choosing the 'right answers' randomly,
> > and that's plainly problematic.
> >
> >
> > On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <
> tevfik.aytekin@gmail.com>wrote:
> >
> >> I think, it is better to choose ratings of the test user in a random
> >> fashion.
> >>
> >> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
> >> > Yes. But: the test sample is small. Using 40% of your data to test is
> >> > probably quite too much.
> >> >
> >> > My point is that it may be the least-bad thing to do. What test are
> you
> >> > proposing instead, and why is it coherent with what you're testing?
> >> >
> >>
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Tevfik Aytekin <te...@gmail.com>.
No, rating prediction is clearly a supervised ML problem

On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen <sr...@gmail.com> wrote:
> This is a good answer for evaluation of supervised ML, but, this is
> unsupervised. Choosing randomly is choosing the 'right answers' randomly,
> and that's plainly problematic.
>
>
> On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <te...@gmail.com>wrote:
>
>> I think, it is better to choose ratings of the test user in a random
>> fashion.
>>
>> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
>> > Yes. But: the test sample is small. Using 40% of your data to test is
>> > probably quite too much.
>> >
>> > My point is that it may be the least-bad thing to do. What test are you
>> > proposing instead, and why is it coherent with what you're testing?
>> >
>>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
This is a good answer for evaluation of supervised ML, but, this is
unsupervised. Choosing randomly is choosing the 'right answers' randomly,
and that's plainly problematic.


On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin <te...@gmail.com>wrote:

> I think, it is better to choose ratings of the test user in a random
> fashion.
>
> On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
> > Yes. But: the test sample is small. Using 40% of your data to test is
> > probably quite too much.
> >
> > My point is that it may be the least-bad thing to do. What test are you
> > proposing instead, and why is it coherent with what you're testing?
> >
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Tevfik Aytekin <te...@gmail.com>.
I think it is better to choose the ratings of the test user in a random fashion.

On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen <sr...@gmail.com> wrote:
> Yes. But: the test sample is small. Using 40% of your data to test is
> probably quite too much.
>
> My point is that it may be the least-bad thing to do. What test are you
> proposing instead, and why is it coherent with what you're testing?
>
>
>
>
> On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz <ah...@yahoo.com>wrote:
>
>> But modeling a user only by his/her low ratings can be problematic since
>> people generally are more precise (I believe) in their high ratings.
>> Another problem is that recommender algorithms in general first mean
>> normalize the ratings for each user. Suppose that we have the following
>> ratings of 3 people (A, B, and C) on 5 items.
>>
>> A's ratings: 1 2 3 4 5
>> B's ratings: 1 3 5 2 4
>> C's ratings: 1 2 3 4 5
>>
>>
>> Suppose that A is the test user. Now if we put only the low ratings of A
>> (1, 2, and 3) into the training set and mean normalize the ratings then A
>> will be
>> more similar to B than C, which is not true.
>>
>>
>>
>>
>> ________________________________
>>  From: Sean Owen <sr...@gmail.com>
>> To: Mahout User List <us...@mahout.apache.org>; Ahmet Ylmaz <
>> ahmetyilmazefendi@yahoo.com>
>> Sent: Saturday, February 16, 2013 8:41 PM
>> Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
>>
>> No, this is not a problem.
>>
>> Yes it builds a model for each user, which takes a long time. It's
>> accurate, but time-consuming. It's meant for small data. You could rewrite
>> your own test to hold out data for all test users at once. That's what I
>> did when I rewrote a lot of this just because it was more useful to have
>> larger tests.
>>
>> There are several ways to choose the test data. One common way is by time,
>> but there is no time information here by default. The problem is that, for
>> example, recent ratings may be low -- or at least not high ratings. But the
>> evaluation is of course asking the recommender for items that are predicted
>> to be highly rated. Random selection has the same problem. Choosing by
>> rating at least makes the test coherent.
>>
>> It does bias the training set, but, the test set is supposed to be small.
>>
>> There is no way to actually know, a priori, what the top recommendations
>> are. You have no information to evaluate most recommendations. This makes a
>> precision/recall test fairly uninformative in practice. Still, it's better
>> than nothing and commonly understood.
>>
>> While precision/recall won't be high on tests like this, because of this, I
>> don't get these values for movielens data on any normal algo, but, you may
>> be, if choosing an algorithm or parameters that don't work well.
>>
>>
>>
>>
>> On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ahmetyilmazefendi@yahoo.com
>> >wrote:
>>
>> > Hi,
>> >
>> > I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
>> > code. I think that there are two important problems here.
>> >
>> > According to my understanding the experimental protocol used in this code
>> > is something like this:
>> >
>> > It takes away a certain percentage of users as test users.
>> > For
>> >  each test user it builds a training set consisting of ratings given by
>> > all other users + the ratings of the test user which are below the
>> > relevanceThreshold.
>> > It then builds a model and makes a
>> > recommendation to the test user and finds the intersection between this
>> > recommendation list and the items which are rated above the
>> > relevanceThreshold by the test user.
>> > It then calculates the precision and recall in the usual way.
>> >
>> > Problems:
>> > 1. (mild) It builds a model for every test user which can take a lot of
>> > time.
>> >
>> > 2. (severe) Only the ratings (of the test user) which are below the
>> > relevanceThreshold are put into the training set. This means that the
>> > algorithm
>> > only knows the preferences of the test user about the items which s/he
>> > doesn't like. This is not a good representation of user ratings.
>> >
>> > Moreover when I run this evaluator on movielens 1m data, the precision
>> and
>> > recall turned out to be, respectively,
>> >
>> > 0.011534185658699288
>> > 0.007905982905982885
>> >
>> > and the run took about 13 minutes on my intel core i3. (I used user based
>> > recommendation with k=2)
>> >
>> >
>> > Although I know that it is not ok to judge the performance of a
>> > recommendation algorithm by looking at these absolute precision and
>> recall
>> > values, still these numbers seems to me too low which might be the result
>> > of the second problem I mentioned above.
>> >
>> > Am I missing something?
>> >
>> > Thanks
>> > Ahmet
>> >
>>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
Yes. But: the test sample is small. Using 40% of your data to test is probably far too much.

My point is that it may be the least-bad thing to do. What test are you
proposing instead, and why is it coherent with what you're testing?




On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz <ah...@yahoo.com>wrote:

> But modeling a user only by his/her low ratings can be problematic since
> people generally are more precise (I believe) in their high ratings.
> Another problem is that recommender algorithms in general first mean
> normalize the ratings for each user. Suppose that we have the following
> ratings of 3 people (A, B, and C) on 5 items.
>
> A's ratings: 1 2 3 4 5
> B's ratings: 1 3 5 2 4
> C's ratings: 1 2 3 4 5
>
>
> Suppose that A is the test user. Now if we put only the low ratings of A
> (1, 2, and 3) into the training set and mean normalize the ratings then A
> will be
> more similar to B than C, which is not true.
>
>
>
>
> ________________________________
>  From: Sean Owen <sr...@gmail.com>
> To: Mahout User List <us...@mahout.apache.org>; Ahmet Ylmaz <
> ahmetyilmazefendi@yahoo.com>
> Sent: Saturday, February 16, 2013 8:41 PM
> Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
>
> No, this is not a problem.
>
> Yes it builds a model for each user, which takes a long time. It's
> accurate, but time-consuming. It's meant for small data. You could rewrite
> your own test to hold out data for all test users at once. That's what I
> did when I rewrote a lot of this just because it was more useful to have
> larger tests.
>
> There are several ways to choose the test data. One common way is by time,
> but there is no time information here by default. The problem is that, for
> example, recent ratings may be low -- or at least not high ratings. But the
> evaluation is of course asking the recommender for items that are predicted
> to be highly rated. Random selection has the same problem. Choosing by
> rating at least makes the test coherent.
>
> It does bias the training set, but, the test set is supposed to be small.
>
> There is no way to actually know, a priori, what the top recommendations
> are. You have no information to evaluate most recommendations. This makes a
> precision/recall test fairly uninformative in practice. Still, it's better
> than nothing and commonly understood.
>
> While precision/recall won't be high on tests like this, because of this, I
> don't get these values for movielens data on any normal algo, but, you may
> be, if choosing an algorithm or parameters that don't work well.
>
>
>
>
> On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ahmetyilmazefendi@yahoo.com
> >wrote:
>
> > Hi,
> >
> > I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> > code. I think that there are two important problems here.
> >
> > According to my understanding the experimental protocol used in this code
> > is something like this:
> >
> > It takes away a certain percentage of users as test users.
> > For
> >  each test user it builds a training set consisting of ratings given by
> > all other users + the ratings of the test user which are below the
> > relevanceThreshold.
> > It then builds a model and makes a
> > recommendation to the test user and finds the intersection between this
> > recommendation list and the items which are rated above the
> > relevanceThreshold by the test user.
> > It then calculates the precision and recall in the usual way.
> >
> > Problems:
> > 1. (mild) It builds a model for every test user which can take a lot of
> > time.
> >
> > 2. (severe) Only the ratings (of the test user) which are below the
> > relevanceThreshold are put into the training set. This means that the
> > algorithm
> > only knows the preferences of the test user about the items which s/he
> > doesn't like. This is not a good representation of user ratings.
> >
> > Moreover when I run this evaluator on movielens 1m data, the precision
> and
> > recall turned out to be, respectively,
> >
> > 0.011534185658699288
> > 0.007905982905982885
> >
> > and the run took about 13 minutes on my intel core i3. (I used user based
> > recommendation with k=2)
> >
> >
> > Although I know that it is not ok to judge the performance of a
> > recommendation algorithm by looking at these absolute precision and
> recall
> > values, still these numbers seems to me too low which might be the result
> > of the second problem I mentioned above.
> >
> > Am I missing something?
> >
> > Thanks
> > Ahmet
> >
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Ahmet Ylmaz <ah...@yahoo.com>.
But modeling a user only by his/her low ratings can be problematic, since people generally are more precise (I believe) in their high ratings.
Another problem is that recommender algorithms in general first mean-normalize the ratings for each user. Suppose that we have the following ratings of 3 people (A, B, and C) on 5 items.

A's ratings: 1 2 3 4 5
B's ratings: 1 3 5 2 4
C's ratings: 1 2 3 4 5


Suppose that A is the test user. Now if we put only the low ratings of A (1, 2, and 3) into the training set and mean-normalize the ratings, then A will be more similar to B than to C, which is not true.
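
A rough numeric check of this claim, using my own mean-centering convention (center each user by his/her mean rating and take the cosine over co-rated items; I am not claiming this is exactly what any particular Mahout similarity does):

    public class CenteringExample {
      // mean-centered cosine over the items both users have rated
      static double sim(double[] u, double[] v) {
        double mu = mean(u), mv = mean(v), dot = 0, nu = 0, nv = 0;
        for (int i = 0; i < u.length; i++) {
          if (Double.isNaN(u[i]) || Double.isNaN(v[i])) continue;   // skip non-co-rated items
          double a = u[i] - mu, b = v[i] - mv;
          dot += a * b; nu += a * a; nv += b * b;
        }
        return dot / Math.sqrt(nu * nv);
      }
      static double mean(double[] x) {
        double s = 0; int n = 0;
        for (double v : x) { if (!Double.isNaN(v)) { s += v; n++; } }
        return s / n;
      }
      public static void main(String[] args) {
        double NA = Double.NaN;
        double[] aTrain = {1, 2, 3, NA, NA};   // only A's low ratings survive into training
        double[] b      = {1, 3, 5, 2, 4};
        double[] c      = {1, 2, 3, 4, 5};     // identical to A's full profile
        System.out.println("sim(A,B) = " + sim(aTrain, b));  // 1.0
        System.out.println("sim(A,C) = " + sim(aTrain, c));  // ~0.63
      }
    }

So with only the low ratings kept, A ends up looking like a perfect match for B and only a partial match for C, even though A's full profile is exactly C's.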




________________________________
 From: Sean Owen <sr...@gmail.com>
To: Mahout User List <us...@mahout.apache.org>; Ahmet Ylmaz <ah...@yahoo.com> 
Sent: Saturday, February 16, 2013 8:41 PM
Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
 
No, this is not a problem.

Yes it builds a model for each user, which takes a long time. It's
accurate, but time-consuming. It's meant for small data. You could rewrite
your own test to hold out data for all test users at once. That's what I
did when I rewrote a lot of this just because it was more useful to have
larger tests.

There are several ways to choose the test data. One common way is by time,
but there is no time information here by default. The problem is that, for
example, recent ratings may be low -- or at least not high ratings. But the
evaluation is of course asking the recommender for items that are predicted
to be highly rated. Random selection has the same problem. Choosing by
rating at least makes the test coherent.

It does bias the training set, but, the test set is supposed to be small.

There is no way to actually know, a priori, what the top recommendations
are. You have no information to evaluate most recommendations. This makes a
precision/recall test fairly uninformative in practice. Still, it's better
than nothing and commonly understood.

While precision/recall won't be high on tests like this, because of this, I don't get values as low as these for the MovieLens data with any normal algorithm -- but you may, if you choose an algorithm or parameters that don't work well.




On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ah...@yahoo.com>wrote:

> Hi,
>
> I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> code. I think that there are two important problems here.
>
> According to my understanding the experimental protocol used in this code
> is something like this:
>
> It takes away a certain percentage of users as test users.
> For
>  each test user it builds a training set consisting of ratings given by
> all other users + the ratings of the test user which are below the
> relevanceThreshold.
> It then builds a model and makes a
> recommendation to the test user and finds the intersection between this
> recommendation list and the items which are rated above the
> relevanceThreshold by the test user.
> It then calculates the precision and recall in the usual way.
>
> Problems:
> 1. (mild) It builds a model for every test user which can take a lot of
> time.
>
> 2. (severe) Only the ratings (of the test user) which are below the
> relevanceThreshold are put into the training set. This means that the
> algorithm
> only knows the preferences of the test user about the items which s/he
> doesn't like. This is not a good representation of user ratings.
>
> Moreover when I run this evaluator on movielens 1m data, the precision and
> recall turned out to be, respectively,
>
> 0.011534185658699288
> 0.007905982905982885
>
> and the run took about 13 minutes on my intel core i3. (I used user based
> recommendation with k=2)
>
>
> Although I know that it is not ok to judge the performance of a
> recommendation algorithm by looking at these absolute precision and recall
> values, still these numbers seems to me too low which might be the result
> of the second problem I mentioned above.
>
> Am I missing something?
>
> Thanks
> Ahmet
>

Re: Problems with Mahout's RecommenderIRStatsEvaluator

Posted by Sean Owen <sr...@gmail.com>.
No, this is not a problem.

Yes it builds a model for each user, which takes a long time. It's
accurate, but time-consuming. It's meant for small data. You could rewrite
your own test to hold out data for all test users at once. That's what I
did when I rewrote a lot of this just because it was more useful to have
larger tests.
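
In rough shape, that batch variant looks something like the sketch below, with invented interface names -- a sketch of the idea, not the code I actually wrote:

    import java.util.*;

    // Invented stand-ins: a trained model, and something that trains one from ratings.
    interface TopN {
      List<Long> recommend(long userID, int howMany);
    }
    interface Trainer {
      TopN train(Map<Long, Map<Long, Double>> trainingRatings);
    }

    final class BatchIRSketch {
      static double[] evaluate(Map<Long, Map<Long, Double>> ratings, Set<Long> testUsers,
                               Trainer trainer, double relevanceThreshold, int at) {
        // 1. Remove every test user's relevant (high-rated) items from training, up front.
        //    Shallow copy is fine: only the test users' maps are replaced.
        Map<Long, Map<Long, Double>> training = new HashMap<>(ratings);
        Map<Long, Set<Long>> heldOut = new HashMap<>();
        for (long u : testUsers) {
          Map<Long, Double> kept = new HashMap<>();
          Set<Long> relevant = new HashSet<>();
          for (Map.Entry<Long, Double> pref : ratings.get(u).entrySet()) {
            if (pref.getValue() >= relevanceThreshold) relevant.add(pref.getKey());
            else kept.put(pref.getKey(), pref.getValue());
          }
          training.put(u, kept);
          heldOut.put(u, relevant);
        }
        // 2. Build ONE model over the reduced data.
        TopN model = trainer.train(training);
        // 3. Score every test user against that single model.
        double precisionSum = 0.0, recallSum = 0.0;
        int evaluated = 0;
        for (long u : testUsers) {
          Set<Long> relevant = heldOut.get(u);
          if (relevant.isEmpty()) continue;
          List<Long> topN = model.recommend(u, at);
          long hits = topN.stream().filter(relevant::contains).count();
          precisionSum += topN.isEmpty() ? 0.0 : (double) hits / topN.size();
          recallSum += (double) hits / relevant.size();
          evaluated++;
        }
        return evaluated == 0 ? new double[] { Double.NaN, Double.NaN }
                              : new double[] { precisionSum / evaluated, recallSum / evaluated };
      }
    }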

There are several ways to choose the test data. One common way is by time,
but there is no time information here by default. The problem is that, for
example, recent ratings may be low -- or at least not high ratings. But the
evaluation is of course asking the recommender for items that are predicted
to be highly rated. Random selection has the same problem. Choosing by
rating at least makes the test coherent.

It does bias the training set, but, the test set is supposed to be small.

There is no way to actually know, a priori, what the top recommendations
are. You have no information to evaluate most recommendations. This makes a
precision/recall test fairly uninformative in practice. Still, it's better
than nothing and commonly understood.

While precision/recall won't be high on tests like this, because of this, I don't get values as low as these for the MovieLens data with any normal algorithm -- but you may, if you choose an algorithm or parameters that don't work well.




On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz <ah...@yahoo.com>wrote:

> Hi,
>
> I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
> code. I think that there are two important problems here.
>
> According to my understanding the experimental protocol used in this code
> is something like this:
>
> It takes away a certain percentage of users as test users.
> For
>  each test user it builds a training set consisting of ratings given by
> all other users + the ratings of the test user which are below the
> relevanceThreshold.
> It then builds a model and makes a
> recommendation to the test user and finds the intersection between this
> recommendation list and the items which are rated above the
> relevanceThreshold by the test user.
> It then calculates the precision and recall in the usual way.
>
> Problems:
> 1. (mild) It builds a model for every test user which can take a lot of
> time.
>
> 2. (severe) Only the ratings (of the test user) which are below the
> relevanceThreshold are put into the training set. This means that the
> algorithm
> only knows the preferences of the test user about the items which s/he
> doesn't like. This is not a good representation of user ratings.
>
> Moreover when I run this evaluator on movielens 1m data, the precision and
> recall turned out to be, respectively,
>
> 0.011534185658699288
> 0.007905982905982885
>
> and the run took about 13 minutes on my intel core i3. (I used user based
> recommendation with k=2)
>
>
> Although I know that it is not ok to judge the performance of a
> recommendation algorithm by looking at these absolute precision and recall
> values, still these numbers seems to me too low which might be the result
> of the second problem I mentioned above.
>
> Am I missing something?
>
> Thanks
> Ahmet
>