Posted to user@mahout.apache.org by Jonathan Hodges <ho...@gmail.com> on 2012/08/26 16:47:54 UTC

Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Hi,

We have been tasked with producing video recommendations for our users. We
get about 100 million video views per month and track users and the videos
they watch, but currently we don't collect rating values or preferences.
Later we plan on using implicit data like the percentage of a video watched
to surmise preferences, but for the first release we are stuck with Boolean
viewing data. To that end we started by using Mahout's distributed
RecommenderJob with the LoglikelihoodSimilarity algorithm to generate 50
video recommendations for each user. We would like to gauge how well we are
doing by measuring the precision and recall of these recommendations
offline. We know we should divide the viewing data into training and test
data, but we are not really sure what steps to take next. For the
non-distributed approach we would leverage IRStatistics to get the precision
and recall values, but it seems there isn't as simple a solution within the
Mahout framework for the Hadoop-based calculations.

Can someone please share/suggest their techniques for evaluating
recommendation accuracy with Mahout’s Hadoop-based distributed algorithms?

Thanks in advance,

Jonathan

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Ted Dunning <te...@gmail.com>.
Obviously, you also need to refer to the scores of other items.

One handy stat is AUC, which you can compute by averaging to get the
probability that a relevant (viewed) item has a higher recommendation score
than a non-relevant (not viewed) item.
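
As a rough sketch of that averaging (plain R, in the spirit of the other R
examples in this thread; the score vectors below are made-up placeholders
for whatever your recommender actually outputs):

# Scores the model gives to held-out viewed items and to a sample of
# non-viewed items for the same user.
viewed_scores   <- c(0.91, 0.74, 0.66, 0.80)
unviewed_scores <- c(0.55, 0.62, 0.31, 0.48, 0.70)

# Compare every viewed score against every non-viewed score and average;
# ties count as half a win, the usual AUC convention.
wins <- outer(viewed_scores, unviewed_scores, ">")
ties <- outer(viewed_scores, unviewed_scores, "==")
auc  <- mean(wins + 0.5 * ties)
auc

In practice you would average this per-user estimate over a sample of users.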

On Sun, Aug 26, 2012 at 5:55 PM, Sean Owen <sr...@gmail.com> wrote:

> There's another approach I've been playing with, which works when the
> recommender produces some score for each rec, not just a ranked list.
> You can train on data up to a certain point in time, then have the
> recommender score the observations that really happened after that
> point. Ideally it should produce a high score for things that really
> were observed next.
>
>

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Sean Owen <sr...@gmail.com>.
Sure, this is more or less what you are doing with a matrix
factorization approach, finding underlying features to explain the
ratings.

If the problem you're getting at is the performance issue that can
come up with an approach that scales with the number of user ratings
in the data -- yes, matrix factorization doesn't really have that issue,
since the data are projected into the same low-dimensional space anyway.

I don't know if this is a problem either way for eval though...
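
As a rough illustration of the low-dimensional point (not Mahout code; the
factor vectors below are made up): once users and items live in the same
k-dimensional factor space, scoring any candidate item is a single dot
product whose cost depends only on k, not on how many interactions the user
has.

# Hypothetical k = 3 factors from some matrix factorization.
user_factors <- c(0.9, -0.2, 0.4)
item_factors <- matrix(c( 0.8, 0.1,  0.3,
                         -0.5, 0.7,  0.2,
                          0.4, 0.4, -0.1),
                       nrow = 3, byrow = TRUE)

# One dot product per candidate item, regardless of the user's history size.
scores <- item_factors %*% user_factors
scores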

On Sun, Aug 26, 2012 at 10:30 PM, Lance Norskog <go...@gmail.com> wrote:
> About the 'user who watches too many movies' problem: is it worth
> recasting the item list by genre? That is, he watched one out of five
> available movies, but they were 90% Sci-Fi and Westerns. (Definitely
> male :) Is it worth recasting the item counts as generic votes for
> Sci-Fi and Westerns?
>
> On Sun, Aug 26, 2012 at 5:17 PM, Jonathan Hodges <ho...@gmail.com> wrote:
>> Thanks for your thorough response.  It is really helpful as we are new to
>> Mahout and recommendations in general.  The approach you mention about
>> training on data up to a certain point in time and having the recommender
>> score the next actual observations is very interesting.  This would seem to
>> work well with our Boolean dataset.  We will give this a try.
>>
>>
>> Thanks again for the help.
>>
>>
>> -Jonathan
>>
>>
>> On Sun, Aug 26, 2012 at 3:55 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> Most watched by that particular user.
>>>
>>> The issue is that the recommender is trying to answer, "of all items
>>> the user has not interacted with, which is the user most likely to
>>> interact with?" So the 'right answers' to the quiz it gets ought to be
>>> answers to this question. That is why the test data ought to be what
>>> appears to be the most interacted / preferred items.
>>>
>>> For example, if you watched 10 Star Trek episodes, then 1 episode of
>>> the Simpsons, and then held out the Simpsons episode -- the recommender
>>> is almost surely not going to predict it, not above more Star Trek.
>>> That seems like correct behavior, but would be scored badly by a
>>> simple precision test.
>>>
>>> There are two downsides to this approach. First, removing well-liked
>>> items from the training set may meaningfully skew a user's
>>> recommendations. It's not such a big issue if the test set is small --
>>> and it should be.
>>>
>>> The second is that by taking out data this way you end up with a
>>> training set which never really existed at one point in time. That
>>> also could be a source of bias.
>>>
>>> Using recent data points tends to avoid both of these problems -- but
>>> then has the problem above.
>>>
>>>
>>> There's another approach I've been playing with, which works when the
>>> recommender produces some score for each rec, not just a ranked list.
>>> You can train on data up to a certain point in time, then have the
>>> recommender score the observations that really happened after that
>>> point. Ideally it should produce a high score for things that really
>>> were observed next.
>>>
>>> This isn't implemented in Mahout but you do get a score with recs even
>>> without ratings.
>>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Lance Norskog <go...@gmail.com>.
About the 'user who watches too many movies' problem: is it worth
recasting the item list by genre? That is, he watched one out of five
available movies, but they were 90% Sci-Fi and Westerns. (Definitely
male :) Is it worth recasting the item counts as generic votes for
Sci-Fi and Westerns?

On Sun, Aug 26, 2012 at 5:17 PM, Jonathan Hodges <ho...@gmail.com> wrote:
> Thanks for your thorough response.  It is really helpful as we are new to
> Mahout and recommendations in general.  The approach you mention about
> training on data up to a certain point in time and having the recommender
> score the next actual observations is very interesting.  This would seem to
> work well with our Boolean dataset.  We will give this a try.
>
>
> Thanks again for the help.
>
>
> -Jonathan
>
>
> On Sun, Aug 26, 2012 at 3:55 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> Most watched by that particular user.
>>
>> The issue is that the recommender is trying to answer, "of all items
>> the user has not interacted with, which is the user most likely to
>> interact with?" So the 'right answers' to the quiz it gets ought to be
>> answers to this question. That is why the test data ought to be what
>> appears to be the most interacted / preferred items.
>>
>> For example, if you watched 10 Star Trek episodes, then 1 episode of
>> the Simpsons, and then held out the Simpsons episode -- the recommender
>> is almost surely not going to predict it, not above more Star Trek.
>> That seems like correct behavior, but would be scored badly by a
>> simple precision test.
>>
>> There are two downsides to this approach. First, removing well-liked
>> items from the training set may meaningfully skew a user's
>> recommendations. It's not such a big issue if the test set is small --
>> and it should be.
>>
>> The second is that by taking out data this way you end up with a
>> training set which never really existed at one point in time. That
>> also could be a source of bias.
>>
>> Using recent data points tends to avoid both of these problems -- but
>> then has the problem above.
>>
>>
>> There's another approach I've been playing with, which works when the
>> recommender produces some score for each rec, not just a ranked list.
>> You can train on data up to a certain point in time, then have the
>> recommender score the observations that really happened after that
>> point. Ideally it should produce a high score for things that really
>> were observed next.
>>
>> This isn't implemented in Mahout but you do get a score with recs even
>> without ratings.
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Jonathan Hodges <ho...@gmail.com>.
Thanks for your thorough response.  It is really helpful as we are new to
Mahout and recommendations in general.  The approach you mention about
training on data up to a certain point in time and having the recommender
score the next actual observations is very interesting.  This would seem to
work well with our Boolean dataset.  We will give this a try.


Thanks again for the help.


-Jonathan


On Sun, Aug 26, 2012 at 3:55 PM, Sean Owen <sr...@gmail.com> wrote:

> Most watched by that particular user.
>
> The issue is that the recommender is trying to answer, "of all items
> the user has not interacted with, which is the user most likely to
> interact with?" So the 'right answers' to the quiz it gets ought to be
> answers to this question. That is why the test data ought to be what
> appears to be the most interacted / preferred items.
>
> For example, if you watched 10 Star Trek episodes, then 1 episode of
> the Simpsons, and then held out the Simpsons episode -- the recommender
> is almost surely not going to predict it, not above more Star Trek.
> That seems like correct behavior, but would be scored badly by a
> simple precision test.
>
> There are two downsides to this approach. First, removing well-liked
> items from the training set may meaningfully skew a user's
> recommendations. It's not such a big issue if the test set is small --
> and it should be.
>
> The second is that by taking out data this way you end up with a
> training set which never really existed at one point in time. That
> also could be a source of bias.
>
> Using recent data points tends to avoid both of these problems -- but
> then has the problem above.
>
>
> There's another approach I've been playing with, which works when the
> recommender produces some score for each rec, not just a ranked list.
> You can train on data up to a certain point in time, then have the
> recommender score the observations that really happened after that
> point. Ideally it should produce a high score for things that really
> were observed next.
>
> This isn't implemented in Mahout but you do get a score with recs even
> without ratings.
>

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Paulo Villegas <pa...@gmail.com>.
On 08/05/13 14:00, Vikas Kapur wrote:
> Hi,
> I calculated precision using the approach below, but I am getting strange
> results.
>
> I tried to evaluate two algorithms with the RMSE and precision@5 metrics.
> I found that Algo1 has both a lower RMSE and a lower precision value than
> Algo2. Isn't that strange? If Algo1 has a lower RMSE, shouldn't it have
> higher precision?
>

Not at all. RMSE and precision should show a certain degree of
correlation, yes (after all, both are measures of fitness), but they may
differ substantially, and in many cases they will.

RMSE computes the error in preference prediction for unknown items
('unknown' meaning held out of the training set), so it measures the
distance between what the engine thinks the user will rate and the actual
user rating (depending on how you define 'rating' this may make more or
less sense; for implicit datasets all you have is 0 and 1).

Precision@5, by contrast, is concerned only with the relative order of the
results. It doesn't matter whether the distance in the preference estimate
for 'good' items (items in the test set) is larger or smaller, as long as
they reach the top 5 positions. If you like, it's a more 'global' measure,
checking whether the final list delivered is right or not, while RMSE
computes algorithm quality item by item.

Neither is worse or better in a universal sense; it depends on the
context. For top-N problems, I tend to think that precision@N works better
as a figure of merit, but you can find situations in which it does not.
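
A small made-up example of how the two metrics can disagree (R again; the
numbers are invented purely for illustration): RMSE looks at the size of
every prediction error, while precision@5 only asks whether the relevant
items land in the top 5.

# Hypothetical predicted and actual (held-out) preferences for one user.
predicted <- c(a = 4.9, b = 4.7, c = 4.5, d = 4.4, e = 4.2, f = 3.0, g = 2.5)
actual    <- c(a = 3.0, b = 3.5, c = 4.0, d = 3.0, e = 3.5, f = 5.0, g = 5.0)

# RMSE: magnitude of the per-item errors.
rmse <- sqrt(mean((predicted - actual)^2))

# Precision@5: do the truly relevant items (say, actual >= 4.5) make the
# top 5 of the predicted ranking? Only the order matters.
top5      <- names(sort(predicted, decreasing = TRUE))[1:5]
relevant  <- names(actual)[actual >= 4.5]
prec_at_5 <- length(intersect(top5, relevant)) / 5

rmse; prec_at_5

Here precision@5 is zero because the two items the user actually loved never
reach the top 5, even though RMSE is averaged over all seven predictions.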





Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Vikas Kapur <vi...@gmail.com>.
Hi,
I calculated precision using the approach below, but I am getting strange
results.

I tried to evaluate two algorithms with the RMSE and precision@5 metrics.
I found that Algo1 has both a lower RMSE and a lower precision value than
Algo2. Isn't that strange? If Algo1 has a lower RMSE, shouldn't it have
higher precision?


Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Jonathan Hodges <ho...@gmail.com>.
Thanks to you again, Ted. These are some great suggestions for helping us
newbies out.



On Mon, Aug 27, 2012 at 11:28 PM, Ted Dunning <te...@gmail.com> wrote:

> In another forum, I responded to this question this way:
>
> One short answer is that you only need enough test data to drive the
> > accuracy of your PR estimates to the point you need them. That isn't all
> > that much data so the sequential version should do rather well.
> > The gold standard, of course, is actual user behavior. Especially when
> you
> > are starting out, views are going to be entirely driven by your other
> > discovery mechanisms such as search. This means that maximizing recall
> > precision is going to drive your recommender to replicate your current
> > discovery patterns which isn't really what you want.
> > Regarding your use of raw views, you will have problems if your videos
> > have lots of misleading meta-data since users will click on things that
> > they don't really want to watch. This is a key user satisfaction issue,
> of
> > course.
> > You should also consider dithering in your system for lots of reasons.
> > Also, make sure you have alternative discovery mechanisms. A "recently
> > added" page is really helpful for this.
>
>
> And then added this about dithering:
>
> All clicks are implicit data and you can use boolean methods on any or all
> > of them. Nothing in these kinds of data prevents you from using LLR
> methods
> > or matrix factorization methods.
> > For dithering, what I do is set a synthetic score that looks like
> > exp(-rank). Then I add random noise to this that is exponentially
> > distributed (aka -log(random()) ). I scale the noise as small as I would
> > like. This method means that the top items generally mix with just the
> top
> > and deeper items mix with much deeper items.
> > You can experiment with this using the following R commands (with sample
> > output):
> >
>
>
> > order(-exp(-(0:99)/4) + rexp(100, rate=10))
> > [1] 2 1 4 3 6 8 5 10 7 12 29 11 26 21 70 86 79 52
> > [19] 14 68 17 83 44 72 30 89 35 34 84 39 74 100 73 87 78 56
> > [37] 15 66 46 40 9 95 96 67 16 49 80 90 53 32 27 48 37 76
> > [55] 77 91 88 62 98 51 19 50 93 99 23 28 65 33 25 54 71 97
> > [73] 43 57 18 92 94 45 22 38 81 75 85 13 20 82 41 42 58 64
> > [91] 60 59 61 69 47 55 31 24 36 63
> > > order(-exp(-(0:99)/4) + rexp(100, rate=10))
> > [1] 1 2 3 4 5 6 9 12 7 10 15 23 78 72 16 60 95 68
> > [19] 24 65 90 94 55 22 40 21 17 47 39 71 59 66 79 88 97 56
> > [37] 26 99 74 41 44 45 50 70 49 75 62 31 84 51 11 33 91 19
> > [55] 61 28 77 18 52 54 48 43 87 25 35 38 30 73 27 89 53 8
> > [73] 82 83 93 57 13 36 69 29 98 63 76 85 64 37 96 46 81 67
> > [91] 92 20 80 42 58 34 86 32 14 100
> > >
> >
>
>
> As you can see, the top items stay near the top, but mixing down deeper is
> > quite strong.
>
>
>
> You can use uniform noise to get kind of a different effect.
>

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Ted Dunning <te...@gmail.com>.
In another forum, I responded to this question this way:

One short answer is that you only need enough test data to drive the
> accuracy of your PR estimates to the point you need them. That isn't all
> that much data so the sequential version should do rather well.
> The gold standard, of course, is actual user behavior. Especially when you
> are starting out, views are going to be entirely driven by your other
> discovery mechanisms such as search. This means that maximizing recall and
> precision is going to drive your recommender to replicate your current
> discovery patterns, which isn't really what you want.
> Regarding your use of raw views, you will have problems if your videos
> have lots of misleading meta-data since users will click on things that
> they don't really want to watch. This is a key user satisfaction issue, of
> course.
> You should also consider dithering in your system for lots of reasons.
> Also, make sure you have alternative discovery mechanisms. A "recently
> added" page is really helpful for this.


And then added this about dithering:

All clicks are implicit data and you can use boolean methods on any or all
> of them. Nothing in these kinds of data prevents you from using LLR methods
> or matrix factorization methods.
> For dithering, what I do is set a synthetic score that looks like
> exp(-rank). Then I add random noise to this that is exponentially
> distributed (aka -log(random())). I scale the noise as small as I would
> like. This method means that the top items generally mix with just the top
> and deeper items mix with much deeper items.
> You can experiment with this using the following R commands (with sample
> output):
>


> order(-exp(-(0:99)/4) + rexp(100, rate=10))
> [1] 2 1 4 3 6 8 5 10 7 12 29 11 26 21 70 86 79 52
> [19] 14 68 17 83 44 72 30 89 35 34 84 39 74 100 73 87 78 56
> [37] 15 66 46 40 9 95 96 67 16 49 80 90 53 32 27 48 37 76
> [55] 77 91 88 62 98 51 19 50 93 99 23 28 65 33 25 54 71 97
> [73] 43 57 18 92 94 45 22 38 81 75 85 13 20 82 41 42 58 64
> [91] 60 59 61 69 47 55 31 24 36 63
> > order(-exp(-(0:99)/4) + rexp(100, rate=10))
> [1] 1 2 3 4 5 6 9 12 7 10 15 23 78 72 16 60 95 68
> [19] 24 65 90 94 55 22 40 21 17 47 39 71 59 66 79 88 97 56
> [37] 26 99 74 41 44 45 50 70 49 75 62 31 84 51 11 33 91 19
> [55] 61 28 77 18 52 54 48 43 87 25 35 38 30 73 27 89 53 8
> [73] 82 83 93 57 13 36 69 29 98 63 76 85 64 37 96 46 81 67
> [91] 92 20 80 42 58 34 86 32 14 100
> >
>


As you can see, the top items stay near the top, but mixing down deeper is
> quite strong.



You can use uniform noise to get a somewhat different effect.
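
To apply the same trick to an actual recommendation list, a small sketch
(the item ids below are made up, and the decay and noise scale are knobs
you would tune):

# 'recs' is a hypothetical ranked list of item ids, best first.
recs <- paste0("video_", 1:20)

dither <- function(ids, scale = 0.1) {
  position <- seq_along(ids)
  # synthetic exp(-rank) score plus scaled, exponentially distributed noise
  score <- exp(-position / 4) + scale * rexp(length(ids))
  ids[order(score, decreasing = TRUE)]
}

dither(recs)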

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Sean Owen <sr...@gmail.com>.
Most watched by that particular user.

The issue is that the recommender is trying to answer, "of all items
the user has not interacted with, which is the user most likely to
interact with?" So the 'right answers' to the quiz it gets ought to be
answers to this question. That is why the test data ought to be what
appears to be the most interacted / preferred items.

For example, if you watched 10 Star Trek episodes, then 1 episode of
the Simpsons, and then held out the Simpsons episode -- the recommender
is almost surely not going to predict it, not above more Star Trek.
That seems like correct behavior, but would be scored badly by a
simple precision test.

There are two downsides to this approach. First, removing well-liked
items from the training set may meaningfully skew a user's
recommendations. It's not such a big issue if the test set is small --
and it should be.

The second is that by taking out data this way you end up with a
training set which never really existed at one point in time. That
also could be a source of bias.

Using recent data points tends to avoid both of these problems -- but
then has the problem above.


There's another approach I've been playing with, which works when the
recommender produces some score for each rec, not just a ranked list.
You can train on data up to a certain point in time, then have the
recommender score the observations that really happened after that
point. Ideally it should produce a high score for things that really
were observed next.

This isn't implemented in Mahout but you do get a score with recs even
without ratings.
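
A sketch of that evaluation, with everything here hypothetical except the
idea itself: train on interactions before some cutoff time, then look at
the scores the retrained model assigns to the items each user actually
interacted with afterwards.

# Hypothetical interaction log: user, item, timestamp.
events <- data.frame(
  user = c(1, 1, 1, 2, 2, 2),
  item = c("a", "b", "c", "a", "d", "e"),
  t    = c(10, 20, 95, 15, 30, 99)
)
cutoff <- 90

train <- events[events$t <= cutoff, ]   # feed this to the recommender
test  <- events[events$t >  cutoff, ]   # what really happened next

# score(user, item) stands in for whatever score your trained model
# produces; here it is just a random placeholder.
score <- function(user, item) runif(length(item))

# Average score given to the future observations; compare it against the
# average score of randomly chosen unobserved items as a baseline.
mean(score(test$user, test$item))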

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Jonathan Hodges <ho...@gmail.com>.
Thanks for the responses.


@Sean we will follow your approach for calculating precision and recall.
When you suggest using the videos watched most often as the most-preferred
items for a user, do you mean most watched by that particular user or
overall? We
might also infer a user’s most-preferred items by grouping by show if there
are multiple video views from the same show.  We will also take a look at
the other measures you mention like F1 score, discounted cumulative gain,
and mean average precision.


@Sebastian thanks for your suggestion.  We do have some temporal data, like
the time of day a video is viewed or when it is viewed relative to a new
episode.  We will inspect this relationship in more detail.


@Lance I hear you on the behavioral data. Our first release on just
Boolean data will definitely leave plenty of room for improvement. For
future releases we plan on capturing more implicit viewing data, like the
start and stop time of the video or how many videos are watched in a
session, so we can incorporate more advanced techniques like matrix
factorization via Alternating Least Squares.




On Sun, Aug 26, 2012 at 1:23 PM, Lance Norskog <go...@gmail.com> wrote:

> Behavior is a much better basis for recommendation than user ratings.
> Raters are self-selected. If you can track whether someone actually
> watches the video, that is gold data.
>
> There is a meme in machine learning that signal drives out noise as
> data increases, so you can get away with simpler algorithms as you get
> more data.
>
> On Sun, Aug 26, 2012 at 8:38 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
> > If you have temporal information, you should use it to split the data.
> > Try to predict later interactions from older ones.
> >
> > On 26.08.2012 17:04, Sean Owen wrote:
> >>
> >> It's the same idea, but yes you'd have to re-implement it for Hadoop.
> >>
> >> Randomly select a subset of users. Identify a small number of
> >> most-preferred items for that user -- perhaps the video(s) watched
> >> most often. Hold these data points out as a test set. Run your process
> >> on all the rest.
> >>
> >> Make recommendations for the selected users. You then just see how
> >> many in the list were among the test data you held out. The percentage
> >> of recs that were in the test list is precision, and the percent of
> >> the test list in the recs is recall.
> >>
> >> Precision and recall are not good tests, but among the only ones you
> >> can carry out in the lab. Slightly better are variations on these two
> >> metrics, like F1 measure and normalized discounted cumulative gain.
> >> Also look up mean average precision.
> >>
> >> On Sun, Aug 26, 2012 at 10:47 AM, Jonathan Hodges <ho...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > We have been tasked with producing video recommendations for our
> >> > users. We get about 100 million video views per month and track users
> >> > and the videos they watch, but currently we don't collect rating
> >> > values or preferences. Later we plan on using implicit data like the
> >> > percentage of a video watched to surmise preferences, but for the
> >> > first release we are stuck with Boolean viewing data. To that end we
> >> > started by using Mahout's distributed RecommenderJob with the
> >> > LoglikelihoodSimilarity algorithm to generate 50 video recommendations
> >> > for each user. We would like to gauge how well we are doing by
> >> > measuring the precision and recall of these recommendations offline.
> >> > We know we should divide the viewing data into training and test data,
> >> > but we are not really sure what steps to take next. For the
> >> > non-distributed approach we would leverage IRStatistics to get the
> >> > precision and recall values, but it seems there isn't as simple a
> >> > solution within the Mahout framework for the Hadoop-based calculations.
> >> >
> >> > Can someone please share/suggest their techniques for evaluating
> >> > recommendation accuracy with Mahout's Hadoop-based distributed
> >> > algorithms?
> >> >
> >> > Thanks in advance,
> >> >
> >> > Jonathan
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Lance Norskog <go...@gmail.com>.
Behavior is a much better basis for recommendation than user ratings.
Raters are self-selected. If you can track whether someone actually
watches the video, that is gold data.

There is a meme in machine learning that signal drives out noise as
data increases, so you can get away with simpler algorithms as you get
more data.

On Sun, Aug 26, 2012 at 8:38 AM, Sebastian Schelter <ss...@apache.org> wrote:
> If you have temporal information, you should use it to split the data.
> Try to predict later interactions from older ones.
>
> On 26.08.2012 17:04, Sean Owen wrote:
>>
>> It's the same idea, but yes you'd have to re-implement it for Hadoop.
>>
>> Randomly select a subset of users. Identify a small number of
>> most-preferred items for that user -- perhaps the video(s) watched
>> most often. Hold these data points out as a test set. Run your process
>> on all the rest.
>>
>> Make recommendations for the selected users. You then just see how
>> many in the list were among the test data you held out. The percentage
>> of recs that were in the test list is precision, and the percent of
>> the test list in the recs is recall.
>>
>> Precision and recall are not good tests, but among the only ones you
>> can carry out in the lab. Slightly better are variations on these two
>> metrics, like F1 measure and normalized discounted cumulative gain.
>> Also look up mean average precision.
>>
>> On Sun, Aug 26, 2012 at 10:47 AM, Jonathan Hodges <ho...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > We have been tasked with producing video recommendations for our users.
>> > We get about 100 million video views per month and track users and the
>> > videos they watch, but currently we don't collect rating values or
>> > preferences. Later we plan on using implicit data like the percentage of
>> > a video watched to surmise preferences, but for the first release we are
>> > stuck with Boolean viewing data. To that end we started by using
>> > Mahout's distributed RecommenderJob with the LoglikelihoodSimilarity
>> > algorithm to generate 50 video recommendations for each user. We would
>> > like to gauge how well we are doing by measuring the precision and
>> > recall of these recommendations offline. We know we should divide the
>> > viewing data into training and test data, but we are not really sure
>> > what steps to take next. For the non-distributed approach we would
>> > leverage IRStatistics to get the precision and recall values, but it
>> > seems there isn't as simple a solution within the Mahout framework for
>> > the Hadoop-based calculations.
>> >
>> > Can someone please share/suggest their techniques for evaluating
>> > recommendation accuracy with Mahout's Hadoop-based distributed
>> > algorithms?
>> >
>> > Thanks in advance,
>> >
>> > Jonathan
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Sebastian Schelter <ss...@apache.org>.
If you have temporal information, you should use it to split the data.
Try to predict later interactions from older ones.
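
For instance, a minimal sketch of a temporal split (the interaction log is
made up): hold out each user's most recent interaction as test data and
train on everything earlier.

# Hypothetical log of (user, item, timestamp) rows.
views <- data.frame(
  user = c(1, 1, 1, 2, 2),
  item = c("a", "b", "c", "d", "e"),
  t    = c(1, 2, 3, 1, 2)
)

# Per user, the latest interaction becomes test data; the rest is training.
latest <- ave(views$t, views$user, FUN = max)
test   <- views[views$t == latest, ]
train  <- views[views$t <  latest, ]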

On 26.08.2012 17:04, Sean Owen wrote:
>
> It's the same idea, but yes you'd have to re-implement it for Hadoop.
>
> Randomly select a subset of users. Identify a small number of
> most-preferred items for that user -- perhaps the video(s) watched
> most often. Hold these data points out as a test set. Run your process
> on all the rest.
>
> Make recommendations for the selected users. You then just see how
> many in the list were among the test data you held out. The percentage
> of recs that were in the test list is precision, and the percent of
> the test list in the recs is recall.
>
> Precision and recall are not good tests, but among the only ones you
> can carry out in the lab. Slightly better are variations on these two
> metrics, like F1 measure and normalized discounted cumulative gain.
> Also look up mean average precision.
>
> On Sun, Aug 26, 2012 at 10:47 AM, Jonathan Hodges <ho...@gmail.com>
> wrote:
> > Hi,
> >
> > We have been tasked with producing video recommendations for our users.
> > We get about 100 million video views per month and track users and the
> > videos they watch, but currently we don't collect rating values or
> > preferences. Later we plan on using implicit data like the percentage of
> > a video watched to surmise preferences, but for the first release we are
> > stuck with Boolean viewing data. To that end we started by using
> > Mahout's distributed RecommenderJob with the LoglikelihoodSimilarity
> > algorithm to generate 50 video recommendations for each user. We would
> > like to gauge how well we are doing by measuring the precision and
> > recall of these recommendations offline. We know we should divide the
> > viewing data into training and test data, but we are not really sure
> > what steps to take next. For the non-distributed approach we would
> > leverage IRStatistics to get the precision and recall values, but it
> > seems there isn't as simple a solution within the Mahout framework for
> > the Hadoop-based calculations.
> >
> > Can someone please share/suggest their techniques for evaluating
> > recommendation accuracy with Mahout's Hadoop-based distributed
> > algorithms?
> >
> > Thanks in advance,
> >
> > Jonathan
>

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

Posted by Sean Owen <sr...@gmail.com>.
It's the same idea, but yes you'd have to re-implement it for Hadoop.

Randomly select a subset of users. Identify a small number of
most-preferred items for each of those users -- perhaps the video(s) watched
most often. Hold these data points out as a test set. Run your process
on all the rest.

Make recommendations for the selected users. You then just see how
many in the list were among the test data you held out. The percentage
of recs that were in the test list is precision, and the percent of
the test list in the recs is recall.

Precision and recall are not good tests, but they are among the only ones
you can carry out in the lab. Slightly better are variations on these two
metrics, like the F1 measure and normalized discounted cumulative gain.
Also look up mean average precision.
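
For the counting step itself, a minimal sketch (the id vectors are made up)
of how precision and recall fall out for one user once you have the
recommended list and the held-out test items:

# 50 recommendations produced for this user, and the held-out items.
recs     <- paste0("video_", 1:50)
held_out <- c("video_3", "video_17", "video_999")

hits      <- length(intersect(recs, held_out))
precision <- hits / length(recs)       # fraction of recs that were held-out items
recall    <- hits / length(held_out)   # fraction of held-out items recommended

precision; recall

Averaging these per-user numbers over the sampled users gives the overall
precision and recall, essentially what IRStatistics would report in the
non-distributed case.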

On Sun, Aug 26, 2012 at 10:47 AM, Jonathan Hodges <ho...@gmail.com> wrote:
> Hi,
>
> We have been tasked with producing video recommendations for our users. We
> get about 100 million video views per month and track users and the videos
> they watch, but currently we don't collect rating values or preferences.
> Later we plan on using implicit data like the percentage of a video watched
> to surmise preferences, but for the first release we are stuck with Boolean
> viewing data. To that end we started by using Mahout's distributed
> RecommenderJob with the LoglikelihoodSimilarity algorithm to generate 50
> video recommendations for each user. We would like to gauge how well we are
> doing by measuring the precision and recall of these recommendations
> offline. We know we should divide the viewing data into training and test
> data, but we are not really sure what steps to take next. For the
> non-distributed approach we would leverage IRStatistics to get the
> precision and recall values, but it seems there isn't as simple a solution
> within the Mahout framework for the Hadoop-based calculations.
>
> Can someone please share/suggest their techniques for evaluating
> recommendation accuracy with Mahout’s Hadoop-based distributed algorithms?
>
> Thanks in advance,
>
> Jonathan