Posted to user@mahout.apache.org by Evgeny Karataev <ka...@gmail.com> on 2012/11/26 20:10:44 UTC

Recommender's formula

Hello,

I've read the Mahout in Action book; then this paper, "Case Study
Evaluation of Mahout as a Recommender Platform" (
http://ir.ii.uam.es/rue2012/papers/rue2012-seminario.pdf); and then this
comment by Sean Owen (
http://mail-archives.apache.org/mod_mbox/mahout-user/201210.mbox/%3CCAEccTyzRzhRzUi9FGCPhPqa01bei=wYCtX2kewOcpfvU37PPGw@mail.gmail.com%3E)
and now I am confused about which formula is used for user-based (and
item-based) recommendations. What paper is it based on?

Does it use mean centering as in the formula in Resnick's paper (
http://dl.acm.org/citation.cfm?id=192905) or formula 4.15 in "A
Comprehensive Survey of Neighborhood-based Recommendation Methods" (
http://www.springerlink.com/content/n3jq77686228781n/)? Or are the
authors of "Case Study Evaluation of Mahout as a Recommender Platform"
right that it computes recommendations similarly to formula 4.12 in "A
Comprehensive Survey of Neighborhood-based Recommendation Methods"?


Following the algorithm in the Mahout in Action book, it does not seem
like it uses mean centering. However, in the section about cosine
similarity, the authors state that the input is mean centered.


Thank you.

-- 
Best Regards,
Evgeny Karataev

Re: Recommender's formula

Posted by Sean Owen <sr...@gmail.com>.
Both are implemented in the project, and both are mentioned in a lot
of papers. Pearson correlation and cosine similarity aren't specific
to recommenders. Mean centering is generally a good idea and is what I
would recommend, but it's your choice. Both are options, so there is
no single version implemented in the project. You could use either.
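
For example, a minimal sketch of wiring up either choice through the
Taste API (the ratings file name and the neighborhood size here are
arbitrary placeholders, not anything the project prescribes):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimilarityChoiceSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Pearson correlation: the math behaves as if the input were centered.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Or cosine over the raw, uncentered data:
    // UserSimilarity similarity = new UncenteredCosineSimilarity(model);
    Recommender recommender = new GenericUserBasedRecommender(
        model, new NearestNUserNeighborhood(10, similarity, model), similarity);
    System.out.println(recommender.recommend(1L, 3));
  }
}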

On Mon, Nov 26, 2012 at 7:10 PM, Evgeny Karataev
<ka...@gmail.com> wrote:
> Hello,
>
> I've read the Mahout in Action book; then this paper, "Case Study
> Evaluation of Mahout as a Recommender Platform" (
> http://ir.ii.uam.es/rue2012/papers/rue2012-seminario.pdf); and then this
> comment by Sean Owen (
> http://mail-archives.apache.org/mod_mbox/mahout-user/201210.mbox/%3CCAEccTyzRzhRzUi9FGCPhPqa01bei=wYCtX2kewOcpfvU37PPGw@mail.gmail.com%3E)
> and now I am confused about which formula is used for user-based (and
> item-based) recommendations. What paper is it based on?
>
> Does it use mean centering as in the formula in Resnick's paper (
> http://dl.acm.org/citation.cfm?id=192905) or formula 4.15 in "A
> Comprehensive Survey of Neighborhood-based Recommendation Methods" (
> http://www.springerlink.com/content/n3jq77686228781n/)? Or are the
> authors of "Case Study Evaluation of Mahout as a Recommender Platform"
> right that it computes recommendations similarly to formula 4.12 in "A
> Comprehensive Survey of Neighborhood-based Recommendation Methods"?
>
>
> Following the algorithm in the Mahout in Action book, it does not seem
> like it uses mean centering. However, in the section about cosine
> similarity, the authors state that the input is mean centered.
>
>
> Thank you.
>
> --
> Best Regards,
> Evgeny Karataev

Re: Recommender's formula

Posted by Sean Owen <sr...@gmail.com>.
Right, it doesn't do that. This isn't part of the similarity metric.
It's a decent idea, the only drawback being that you have to keep the
means around. The effect is small at scale. But yes, it would probably
be a nice additional feature.

On Mon, Nov 26, 2012 at 9:32 PM, Paulo Villegas <pa...@tid.es> wrote:
> But once you have similarities computed, then you go on and use them to
> predict the rating for unknown items. It's in this rating prediction
> that mean centering (or, more generally, rating normalization) is not
> done and could be.
>

Re: Recommender's formula

Posted by Paulo Villegas <pa...@tid.es>.
>
> and the formula looks almost exactly like formula 4.12 in "A Comprehensive
> Survey of Neighborhood-based Recommendation Methods" (
> http://www.springerlink.com/content/n3jq77686228781n/); however, the
> difference is that you divide the weighted preference by totalSimilarity:
>
> ...
> // Weights can be negative!
> preference += theSimilarity * preferencesFromUser.getValue(i);
> totalSimilarity += theSimilarity;
> ...
> float estimate = (float) (preference / totalSimilarity);
> ...
>
> In contrast, in other papers the denominator is the sum of the absolute
> values of the similarities.
>
> If I am not mistaken and as the comment in the code states, weights
> (similarities) could be negative. And actually they might sum up to 0.
> Then you would divide preference by 0. What would be the estimate in
> that case?

They can be negative for certain similarity metrics, most notably
Pearson (which is signed: negative similarities express negative
correlations); other similarity metrics are strictly non-negative and
therefore do not present that problem.

The case you mention (the total weight summing to zero) is theoretically
possible with Pearson, but would be very rare in practice.

Nonetheless, IMHO even if we disregard that case, it would still be
beneficial to take the absolute value, because otherwise positive and
negative similarities partially cancel each other in the denominator,
yielding a normalization factor for the final rating that is too small.
The consequence is an abnormal rating prediction (too high; it ends up
capped to the maximum) and degraded performance (though this depends on
the metric used).

Again, this would only happen with signed similarity metrics.
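
(For illustration, a minimal sketch of the absolute-value variant with a
guard for the zero-total case; the method and variable names are mine,
not Mahout's actual code:)

  /**
   * Weighted rating estimate normalized by the sum of absolute
   * similarities, so signed similarities cannot cancel in the denominator.
   */
  static float estimate(float[] similarities, float[] ratings) {
    float preference = 0.0f;
    float totalAbsSimilarity = 0.0f;
    for (int i = 0; i < similarities.length; i++) {
      preference += similarities[i] * ratings[i];
      totalAbsSimilarity += Math.abs(similarities[i]);
    }
    // If every similarity is zero, there is no basis for an estimate.
    return totalAbsSimilarity == 0.0f ? Float.NaN
                                      : preference / totalAbsSimilarity;
  }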

Paulo

>
>
>
>
> On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas <pa...@tid.es> wrote:
>
>>> What do you mean here? You never need to actually subtract the mean
>>> from the data. The similarity metric's math is just adjusted to work
>>> as if it were. So no, there is no idea of adding back a mean. I don't
>>> think anything has been left unimplemented.
>>
>> No, not about the similarity metric, as I said, the computation of the
>> similarity metric *is* centred (or can be, the code has that option).
>>
>> But once you have similarities computed, then you go on and use them to
>> predict the rating for unknown items. It's in this rating prediction
>> that mean centering (or, more generally, rating normalization) is not
>> done and could be.
>>
>> The papers mentioned in the original post explain it, I just searched
>> around and found another one that also mentions it:
>>
>> "An Empirical Analysis of Design Choices in Neighborhood-Based
>> Collaborative Filtering Algorithms"
>>
>> (googling it will give you a PDF right away). The rating prediction is
>> Equation 1, and there you can see what I mean by mean centering in the
>> prediction.
>>
>> Basically, you use the similarities you have already computed as weights
>> for the averaging sum that creates the prediction, but those weights do
>> not multiply the bare ratings for the other items, but their deviation
>> from each user's average rating (equation 1 is for user-based).
>>
>> The rationale is that each user's scale is different, and tends to
>> cluster ratings around a different mean. By subtracting that mean, we
>> get into the equation only the user's perceived difference between that
>> item and her average opinion, and factor out the user's mean opinion
>> (which would introduce some bias). Then we add back to the result the
>> average rating of the target user, which restores the normal range for
>> the prediction, but this time using the target user's own bias. This
>> helps to achieve predictions more in line with the target user's own scale.
>>
>> The same paper explains it later on (more eloquently than me :-) in
>> section 7.1, in the more general context of rating normalization
>> (proposing also z-score as a more elaborate choice, and evaluating
>> results).
>>
>> Paulo
>>
>>
>> On 26/11/12 21:51, Sean Owen wrote:
>>
>>>
>>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <pa...@tid.es> wrote:
>>>
>>>> The thing is, in an Item- or User- based neighborhood recommender,
>>>> there's more than one thing that can be centered :-)
>>>>
>>>> What those papers talk about (from memory, it's been a while since I
>>>> last read them, and I don't have them at hand now) is centering the
>>>> preference around the user's (or item's) average before entering it
>>>> in the neighborhood formula, and then moving it back to its usual
>>>> range by adding back the average preference (this time for the target
>>>> item or user).
>>>>
>>>> This is something that the code in Mahout does not currently do. You can
>>>> check for yourself, the formula is pretty straightforward:
>>>>
>>>
>>
>>
>
>
>



Re: Recommender's formula

Posted by Paulo Villegas <pa...@tid.es>.
Note that if you do implement mean centering, it solves that
interpretation issue. A prediction of -3 then means "a prediction 3
below the user's mean", so it's still valid on the 1-5 scale (when you
add the user's mean back, it returns to the scale, though it may need
capping).

But you're right that implementing it requires carrying rating
means around. I did that, augmenting the DataModel with the needed data
(basically a bunch of RunningAverage objects), but the result wasn't
pretty :-), so I did not submit it as a patch.
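
(For reference, a minimal sketch of that add-back-and-cap step; the
method name and the 1-5 bounds are just illustrative:)

  static float toRatingScale(float userMean, float centeredPrediction) {
    float raw = userMean + centeredPrediction; // e.g. 4.2 + (-3) = 1.2
    return Math.max(1.0f, Math.min(5.0f, raw)); // cap into the 1-5 scale
  }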



> This is a good discussion of the issue.
>
> https://issues.apache.org/jira/browse/MAHOUT-898
>
> Negative weights are problematic. I think taking the absolute value
> gives slightly less explainable results, but that's up to taste. For
> example, a rating of 3, weighted by -4, results in a prediction of -3
> (with a single neighbor: 3 * -4 / |-4| = -3). It's not clear that -3
> represents "the opposite of 3", and it doesn't in a 1-5 rating scale,
> for example. Really, negative weights are votes to be infinitely far
> from a value, and that is weird. Don't do it.
>
> On Mon, Nov 26, 2012 at 9:51 PM, Evgeny Karataev
> <ka...@gmail.com> wrote:
>> Thank you Sean and Paulo.
>>
>> Paulo, I guess in my original email I meant what you said in your last
>> email (about rating normalization). So that part is not done.
>>
>> I've looked at the code
>> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230
>>
>> and the formula looks almost exactly like formula 4.12 in "A Comprehensive
>> Survey of Neighborhood-based Recommendation Methods" (
>> http://www.springerlink.com/content/n3jq77686228781n/); however, the
>> difference is that you divide the weighted preference by totalSimilarity:
>>
>> ...
>> // Weights can be negative!
>> preference += theSimilarity * preferencesFromUser.getValue(i);
>> totalSimilarity += theSimilarity;
>> ...
>> float estimate = (float) (preference / totalSimilarity);
>> ...
>>
>> In contrast, in other papers the denominator is the sum of the absolute
>> values of the similarities.
>>
>> If I am not mistaken and as the comment in the code states, weights
>> (similarities) could be negative. And actually they might sum up to 0.
>> Then you would divide preference by 0. What would be the estimate in
>> that case?
>>
>>
>>
>>
>> On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas <pa...@tid.es> wrote:
>>
>>>> What do you mean here? You never need to actually subtract the mean
>>>> from the data. The similarity metric's math is just adjusted to work
>>>> as if it were. So no, there is no idea of adding back a mean. I don't
>>>> think anything has been left unimplemented.
>>>
>>> No, not about the similarity metric, as I said, the computation of the
>>> similarity metric *is* centred (or can be, the code has that option).
>>>
>>> But once you have similarities computed, then you go on and use them to
>>> predict the rating for unknown items. It's in this rating prediction
>>> that mean centering (or, more generally, rating normalization) is not
>>> done and could be.
>>>
>>> The papers mentioned in the original post explain it, I just searched
>>> around and found another one that also mentions it:
>>>
>>> "An Empirical Analysis of Design Choices in Neighborhood-Based
>>> Collaborative Filtering Algorithms"
>>>
>>> (googling it will give you a PDF right away). The rating prediction is
>>> Equation 1, and there you can see what I mean by mean centering in the
>>> prediction.
>>>
>>> Basically, you use the similarities you have already computed as weights
>>> for the averaging sum that creates the prediction, but those weights do
>>> not multiply the bare ratings for the other items, but their deviation
>>> from each user's average rating (equation 1 is for user-based).
>>>
>>> The rationale is that each user's scale is different, and tends to
>>> cluster ratings around a different mean. By subtracting that mean, we
>>> get into the equation only the user's perceived difference between that
>>> item and her average opinion, and factor out the user's mean opinion
>>> (which would introduce some bias). Then we add back to the result the
>>> average rating of the target user, which restores the normal range for
>>> the prediction, but this time using the target user's own bias. This
>>> helps to achieve predictions more in line with the target user's own scale.
>>>
>>> The same paper explains it later on (more eloquently than me :-) in
>>> section 7.1, in the more general context of rating normalization
>>> (proposing also z-score as a more elaborate choice, and evaluating
>>> results).
>>>
>>> Paulo
>>>
>>>
>>> On 26/11/12 21:51, Sean Owen wrote:
>>>
>>>>
>>>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <pa...@tid.es> wrote:
>>>>
>>>>> The thing is, in an Item- or User- based neighborhood recommender,
>>>>> there's more than one thing that can be centered :-)
>>>>>
>>>>> What those papers talk about (from memory, it's been a while since I
>>>>> last read them, and I don't have them at hand now) is centering the
>>>>> preference around the user's (or item's) average before entering it
>>>>> in the neighborhood formula, and then moving it back to its usual
>>>>> range by adding back the average preference (this time for the target
>>>>> item or user).
>>>>>
>>>>> This is something that the code in Mahout does not currently do. You can
>>>>> check for yourself, the formula is pretty straightforward:
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Evgeny Karataev



Re: Recommender's formula

Posted by Sean Owen <sr...@gmail.com>.
This is a good discussion of the issue.

https://issues.apache.org/jira/browse/MAHOUT-898

Negative weights are problematic. I think taking the absolute value
gives slightly less explainable results, but that's up to taste. For
example, a rating of 3, weighted by -4, results in a prediction of -3
(with a single neighbor: 3 * -4 / |-4| = -3). It's not clear that -3
represents "the opposite of 3", and it doesn't in a 1-5 rating scale,
for example. Really, negative weights are votes to be infinitely far
from a value, and that is weird. Don't do it.

On Mon, Nov 26, 2012 at 9:51 PM, Evgeny Karataev
<ka...@gmail.com> wrote:
> Thank you Sean and Paulo.
>
> Paulo, I guess in my original email I meant what you said in your last
> email (about rating normalization). So that part is not done.
>
> I've looked at the code
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230
>
> and the formula looks almost exactly like formula 4.12 in "A Comprehensive
> Survey of Neighborhood-based Recommendation Methods" (
> http://www.springerlink.com/content/n3jq77686228781n/); however, the
> difference is that you divide the weighted preference by totalSimilarity:
>
> ...
> // Weights can be negative!
> preference += theSimilarity * preferencesFromUser.getValue(i);
> totalSimilarity += theSimilarity;
> ...
> float estimate = (float) (preference / totalSimilarity);
> ...
>
> In contrast, in other papers the denominator is the sum of the absolute
> values of the similarities.
>
> If I am not mistaken and as the comment in the code states, weights
> (similarities) could be negative. And actually they might sum up to 0.
> Then you would divide preference by 0. What would be the estimate in
> that case?
>
>
>
>
> On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas <pa...@tid.es> wrote:
>
>> > What do you mean here? You never need to actually subtract the mean
>> > from the data. The similarity metric's math is just adjusted to work
>> > as if it were. So no, there is no idea of adding back a mean. I don't
>> > think anything has been left unimplemented.
>>
>> No, not about the similarity metric, as I said, the computation of the
>> similarity metric *is* centred (or can be, the code has that option).
>>
>> But once you have similarities computed, then you go on and use them to
>> predict the rating for unknown items. It's in this rating prediction
>> that mean centering (or, more generally, rating normalization) is not
>> done and could be.
>>
>> The papers mentioned in the original post explain it, I just searched
>> around and found another one that also mentions it:
>>
>> "An Empirical Analysis of Design Choices in Neighborhood-Based
>> Collaborative Filtering Algorithms"
>>
>> (googling it will give you a PDF right away). The rating prediction is
>> Equation 1, and there you can see what I mean by mean centering in the
>> prediction.
>>
>> Basically, you use the similarities you have already computed as weights
>> for the averaging sum that creates the prediction, but those weights do
>> not multiply the bare ratings for the other items, but their deviation
>> from each user's average rating (equation 1 is for user-based).
>>
>> The rationale is that each user's scale is different, and tends to
>> cluster ratings around a different mean. By subtracting that mean, we
>> get into the equation only the user's perceived difference between that
>> item and her average opinion, and factor out the user's mean opinion
>> (which would introduce some bias). Then we add back to the result the
>> average rating of the target user, which restores the normal range for
>> the prediction, but this time using the target user's own bias. This
>> helps to achieve predictions more in line with the target user's own scale.
>>
>> The same paper explains it later on (more eloquently than me :-) in
>> section 7.1, in the more general context of rating normalization
>> (proposing also z-score as a more elaborate choice, and evaluating
>> results).
>>
>> Paulo
>>
>>
>> On 26/11/12 21:51, Sean Owen wrote:
>>
>>>
>>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <pa...@tid.es> wrote:
>>>
>>>> The thing is, in an Item- or User- based neighborhood recommender,
>>>> there's more than one thing that can be centered :-)
>>>>
>>>> What those papers talk about (from memory, it's been a while since I
>>>> last read them, and I don't have them at hand now) is centering the
>>>> preference around the user's (or item's) average before entering it
>>>> in the neighborhood formula, and then moving it back to its usual
>>>> range by adding back the average preference (this time for the target
>>>> item or user).
>>>>
>>>> This is something that the code in Mahout does not currently do. You can
>>>> check for yourself, the formula is pretty straightforward:
>>>>
>>>
>>
>>
>
>
>
> --
> Best Regards,
> Evgeny Karataev

Re: Recommender's formula

Posted by Evgeny Karataev <ka...@gmail.com>.
Thank you Sean and Paulo.

Paulo, I guess in my original email I meant what you said in your last
email (about rating normalization). So that part is not done.

I've looked at the code
https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230

and the formula looks almost exactly like formula 4.12 in "A Comprehensive
Survey of Neighborhood-based Recommendation Methods" (
http://www.springerlink.com/content/n3jq77686228781n/); however, the
difference is that you divide the weighted preference by totalSimilarity:

...
// Weights can be negative!
preference += theSimilarity * preferencesFromUser.getValue(i);
totalSimilarity += theSimilarity;
...
float estimate = (float) (preference / totalSimilarity);
...

In contrast, in other papers the denominator is the sum of the absolute
values of the similarities.

If I am not mistaken and as the comment in the code states, weights
(similarities) could be negative. And actually they might sum up to 0.
Then you would divide preference by 0. What would be the estimate in
that case?
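
(For what it's worth, float division by zero in Java never throws, so
with made-up values the arithmetic would yield:)

  float preference = 2.5f;   // made-up value
  float totalSimilarity = 0.0f;
  System.out.println(preference / totalSimilarity); // prints Infinity
  System.out.println(0.0f / totalSimilarity);       // prints NaN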




On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas <pa...@tid.es> wrote:

> > What do you mean here? You never need to actually subtract the mean
> > from the data. The similarity metric's math is just adjusted to work
> > as if it were. So no, there is no idea of adding back a mean. I don't
> > think anything has been left unimplemented.
>
> No, not about the similarity metric, as I said, the computation of the
> similarity metric *is* centred (or can be, the code has that option).
>
> But once you have similarities computed, then you go on and use them to
> predict the rating for unknown items. It's in this rating prediction
> that mean centering (or, more generally, rating normalization) is not
> done and could be.
>
> The papers mentioned in the original post explain it, I just searched
> around and found another one that also mentions it:
>
> "An Empirical Analysis of Design Choices in Neighborhood-Based
> Collaborative Filtering Algorithms"
>
> (googling it will give you a PDF right away). The rating prediction is
> Equation 1, and there you can see what I mean by mean centering in the
> prediction.
>
> Basically, you use the similarities you have already computed as weights
> for the averaging sum that creates the prediction, but those weights do
> not multiply the bare ratings for the other items, but their deviation
> from each user's average rating (equation 1 is for user-based).
>
> The rationale is that each user's scale is different, and tends to
> cluster ratings around a different mean. By subtracting that mean, we
> get into the equation only the user's perceived difference between that
> item and her average opinion, and factor out the user's mean opinion
> (which would introduce some bias). Then we add back to the result the
> average rating of the target user, which restores the normal range for
> the prediction, but this time using the target user's own bias. This
> helps to achieve predictions more in line with the target user's own scale.
>
> The same paper explains it later on (more eloquently than me :-) in
> section 7.1, in the more general context of rating normalization
> (proposing also z-score as a more elaborate choice, and evaluating
> results).
>
> Paulo
>
>
> On 26/11/12 21:51, Sean Owen wrote:
>
>>
>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <pa...@tid.es> wrote:
>>
>>> The thing is, in an Item- or User- based neighborhood recommender,
>>> there's more than one thing that can be centered :-)
>>>
>>> What those papers talk about (from memory, it's been a while since I
>>> last read them, and I don't have them at hand now) is centering the
>>> preference around the user's (or item's) average before entering it
>>> in the neighborhood formula, and then moving it back to its usual
>>> range by adding back the average preference (this time for the target
>>> item or user).
>>>
>>> This is something that the code in Mahout does not currently do. You can
>>> check for yourself, the formula is pretty straightforward:
>>>
>>
>
>



-- 
Best Regards,
Evgeny Karataev

Re: Recommender's formula

Posted by Paulo Villegas <pa...@tid.es>.
 > What do you mean here? You never need to actually subtract the mean
 > from the data. The similarity metric's math is just adjusted to work
 > as if it were. So no, there is no idea of adding back a mean. I don't
 > think anything has been left unimplemented.

No, not about the similarity metric, as I said, the computation of the
similarity metric *is* centred (or can be, the code has that option).

But once you have similarities computed, then you go on and use them to
predict the rating for unknown items. It's in this rating prediction
that mean centering (or, more generally, rating normalization) is not
done and could be.

The papers mentioned in the original post explain it, I just searched
around and found another one that also mentions it:

"An Empirical Analysis of Design Choices in Neighborhood-Based
Collaborative Filtering Algorithms"

(googling it will give you a PDF right away). The rating prediction is
Equation 1, and there you can see what I mean by mean centering in the
prediction.

Basically, you use the similarities you have already computed as weights
for the averaging sum that creates the prediction, but those weights do
not multiply the bare ratings for the other items, but their deviation
from each user's average rating (equation 1 is for user-based).

The rationale is that each user's scale is different, and tends to
cluster ratings around a different mean. By subtracting that mean, we
get into the equation only the user's perceived difference between that
item and her average opinion, and factor out the user's mean opinion
(which would introduce some bias). Then we add back to the result the
average rating of the target user, which restores the normal range for
the prediction, but this time using the target user's own bias. This
helps to achieve predictions more in line with the target user's own scale.

The same paper explains it later on (more eloquently than me :-) in
section 7.1, in the more general context of rating normalization
(proposing also z-score as a more elaborate choice, and evaluating results).
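
(To make that prediction rule concrete, a minimal sketch of it in Java;
the method and parameter names are just illustrative, not Mahout's API:)

  static float predict(float targetUserMean,
                       float[] neighborSimilarities,
                       float[] neighborRatings,
                       float[] neighborMeans) {
    float numerator = 0.0f;
    float denominator = 0.0f;
    for (int v = 0; v < neighborSimilarities.length; v++) {
      // Weight each neighbor's *deviation* from their own mean rating.
      numerator += neighborSimilarities[v]
          * (neighborRatings[v] - neighborMeans[v]);
      denominator += Math.abs(neighborSimilarities[v]);
    }
    // Add the target user's own mean back to restore the rating scale.
    return denominator == 0.0f ? Float.NaN
                               : targetUserMean + numerator / denominator;
  }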

Paulo

On 26/11/12 21:51, Sean Owen wrote:
>
> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <pa...@tid.es> wrote:
>> The thing is, in an Item- or User- based neighborhood recommender,
>> there's more than one thing that can be centered :-)
>>
>> What those papers talk about (from memory, it's been a while since I
>> last read them, and I don't have them at hand now) is centering the
>> preference around the user's (or item's) average before entering it
>> in the neighborhood formula, and then moving it back to its usual
>> range by adding back the average preference (this time for the target
>> item or user).
>>
>> This is something that the code in Mahout does not currently do. You can
>> check for yourself, the formula is pretty straightforward:



Re: Recommender's formula

Posted by Sean Owen <sr...@gmail.com>.
What do you mean here? You never need to actually subtract the mean
from the data. The similarity metric's math is just adjusted to work
as if it were. So no, there is no idea of adding back a mean. I don't
think anything has been left unimplemented.

On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <pa...@tid.es> wrote:
> The thing is, in an Item- or User- based neighborhood recommender,
> there's more than one thing that can be centered :-)
>
> What those papers talk about (from memory, it's been a while since I
> last read them, and I don't have them at hand now) is centering the
> preference around the user's (or item's) average before entering it
> in the neighborhood formula, and then moving it back to its usual
> range by adding back the average preference (this time for the target
> item or user).
>
> This is something that the code in Mahout does not currently do. You can
> check for yourself, the formula is pretty straightforward:

Re: Recommender's formula

Posted by Paulo Villegas <pa...@tid.es>.
The thing is, in an Item- or User- based neighborhood recommender,
there's more than one thing that can be centered :-)

What those papers talk about (from memory, it's been a while since I
last read them, and I don't have them at hand now) is centering the
preference around the user's (or item's) average before entering it
in the neighborhood formula, and then moving it back to its usual
range by adding back the average preference (this time for the target
item or user).

This is something that the code in Mahout does not currently do. You can
check for yourself, the formula is pretty straightforward:

https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230

Now, what the Mahout code does is to center preference data when
computing user & item similarities (the ones that will later go into the
final recommender equation mentioned above). Or *can* center, since it's
an optional feature of the similarity metric. You can configure whether
it applies: for instance, it's activated for PearsonCorrelation (the
most "typical" similarity), but in general terms any similarity metric
inheriting from AbstractSimilarity can use centering. Again, check the code:

https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/AbstractSimilarity.java#L134
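
(Conceptually, the centering done inside the similarity computation
amounts to the following; this is a sketch of the math only, not the
actual Mahout code:)

  // Pearson correlation = cosine similarity over mean-centered vectors.
  static double pearson(double[] x, double[] y) {
    double meanX = 0.0, meanY = 0.0;
    for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
    meanX /= x.length;
    meanY /= y.length;
    double dot = 0.0, normX = 0.0, normY = 0.0;
    for (int i = 0; i < x.length; i++) {
      double cx = x[i] - meanX, cy = y[i] - meanY; // center each vector
      dot += cx * cy;
      normX += cx * cx;
      normY += cy * cy;
    }
    return dot / Math.sqrt(normX * normY); // NaN for constant vectors
  }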


So, in summary, Mahout does one of the centerings, but not the other.
Which is best depends somewhat on the use case and the dataset features;
if I were to give a global opinion, I'd say when in doubt, do both:
centering mostly helps and rarely hurts, as do other kinds of
regularization, such as Bayesian-like estimation. But of course YMMV.

Regards

Paulo


On 26/11/12 20:10, Evgeny Karataev wrote:
> Hello,
>
> I've read the Mahout in Action book; then this paper, "Case Study
> Evaluation of Mahout as a Recommender Platform" (
> http://ir.ii.uam.es/rue2012/papers/rue2012-seminario.pdf); and then this
> comment by Sean Owen (
> http://mail-archives.apache.org/mod_mbox/mahout-user/201210.mbox/%3CCAEccTyzRzhRzUi9FGCPhPqa01bei=wYCtX2kewOcpfvU37PPGw@mail.gmail.com%3E)
> and now I am confused about which formula is used for user-based (and
> item-based) recommendations. What paper is it based on?
>
> Does it use mean centering as in the formula in Resnick's paper (
> http://dl.acm.org/citation.cfm?id=192905) or formula 4.15 in "A
> Comprehensive Survey of Neighborhood-based Recommendation Methods" (
> http://www.springerlink.com/content/n3jq77686228781n/)? Or are the
> authors of "Case Study Evaluation of Mahout as a Recommender Platform"
> right that it computes recommendations similarly to formula 4.12 in "A
> Comprehensive Survey of Neighborhood-based Recommendation Methods"?
>
>
> Following the algorithm in the Mahout in Action book, it does not seem
> like it uses mean centering. However, in the section about cosine
> similarity, the authors state that the input is mean centered.
>
>
> Thank you.
>

