You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/04/27 18:34:32 UTC

[jira] Created: (MAHOUT-387) Cosine item similarity implementation

Cosine item similarity implementation
-------------------------------------

                 Key: MAHOUT-387
                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter
         Attachments: MAHOUT-387.patch

I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Created: (MAHOUT-387) Cosine item similarity implementation

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

 From Mahout In Action:

You may be searching for something like “CosineMeasureSimilarity” in 
Mahout. You’ve    actually    already    found    it    but    under    
an    unexpected    name: PearsonCorrelationSimilarity. The cosine 
measure similarity and Pearson correlation aren’t the same thing, but, 
if you bother to work out the math, they actually reduce to the same 
computation when the two series of input values each have a mean of 0 
(“centered”).

Jeff

On 4/27/10 9:34 AM, Sebastian Schelter (JIRA) wrote:
> Cosine item similarity implementation
> -------------------------------------
>
>                   Key: MAHOUT-387
>                   URL: https://issues.apache.org/jira/browse/MAHOUT-387
>               Project: Mahout
>            Issue Type: New Feature
>            Components: Collaborative Filtering
>              Reporter: Sebastian Schelter
>           Attachments: MAHOUT-387.patch
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.
>
>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sebastian Schelter <se...@zalando.de>.

Hi Sean,

then I'll spare benchmarking the patch you sent, I'm glad you like the
original version of the code.

Sebastian

Btw: our discussion here also helps me fill the pages of my diploma
thesis with interesting stuff ;)



Sean Owen schrieb:
> Nah, scratch that too. The simple version of this idea doesn't scale,
> and I was unable to get the current version to run at all
> significantly differently in speed. It's just good as-is.
>
> Now there is a non-distributed similarity implementation that matches
> what this does, which was the original question.
>
> Sean
>
> On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen <sr...@gmail.com> wrote:
>   
>> Actually scratch that patch I sent over. I see the trick now that
>> makes the existing approach quite good. I think I can make a version
>> that preserves that trick and still streamlines the processing. I will
>> benchmark and report back if successful.
>>
>> On Wed, Apr 28, 2010 at 3:20 PM, Sean Owen <sr...@gmail.com> wrote:
>>     
>>> Sorry, typo, that's what I meant. yes the difference isn't *that* large!
>>> It may be worse in practice since you have a few users with very many prefs.
>>> It may also be beneficial to simply have one fewer phase and throw
>>> around less data. I will also try to benchmark since really that's the
>>> only way to know.
>>>
>>>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sean Owen <sr...@gmail.com>.

Nah, scratch that too. The simple version of this idea doesn't scale,
and I was unable to get the current version to run at all
significantly differently in speed. It's just good as-is.

Now there is a non-distributed similarity implementation that matches
what this does, which was the original question.

Sean

On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen <sr...@gmail.com> wrote:
> Actually scratch that patch I sent over. I see the trick now that
> makes the existing approach quite good. I think I can make a version
> that preserves that trick and still streamlines the processing. I will
> benchmark and report back if successful.
>
> On Wed, Apr 28, 2010 at 3:20 PM, Sean Owen <sr...@gmail.com> wrote:
>> Sorry, typo, that's what I meant. yes the difference isn't *that* large!
>> It may be worse in practice since you have a few users with very many prefs.
>> It may also be beneficial to simply have one fewer phase and throw
>> around less data. I will also try to benchmark since really that's the
>> only way to know.
>>
>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sean Owen <sr...@gmail.com>.

Actually scratch that patch I sent over. I see the trick now that
makes the existing approach quite good. I think I can make a version
that preserves that trick and still streamlines the processing. I will
benchmark and report back if successful.

On Wed, Apr 28, 2010 at 3:20 PM, Sean Owen <sr...@gmail.com> wrote:
> Sorry, typo, that's what I meant. yes the difference isn't *that* large!
> It may be worse in practice since you have a few users with very many prefs.
> It may also be beneficial to simply have one fewer phase and throw
> around less data. I will also try to benchmark since really that's the
> only way to know.
>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sean Owen <sr...@gmail.com>.

Sorry, typo, that's what I meant. yes the difference isn't *that* large!
It may be worse in practice since you have a few users with very many prefs.
It may also be beneficial to simply have one fewer phase and throw
around less data. I will also try to benchmark since really that's the
only way to know.

On Wed, Apr 28, 2010 at 3:01 PM, Sebastian Schelter
<se...@zalando.de> wrote:
> Hi Sean,
>
> "Say I have P prefs, U users and I items.
> Assume P/I prefs per item. The bottleneck of this stage is outputting
> I2 vectors of size P/I, or about P*I output.
>
> What you're doing now bottlenecks in the last bit where you output for
> each user some data for every pair of items. That's coarsely U *
> (P/U)2 = U*P2 output. I think that P2 is killer."
>
> There's an error in the above math as U * (P/U)2 = P2 / U (not UP2)
> which makes me guess that there won't be any improvements by mapping out
> all item-item pairs in the beginning.
> I showed the calculations to a mathematician and he was also of the
> opinion that there wouldn't be any improvements. You'll find our
> thoughts below, please check that we did not oversee something. I will
> run the job with your patch in the next days to get us a real-life
> comparison.
>
> Regards,
> Sebastian
>
> -------------------------------------
> Calculated with some example numbers
>
> U = 1,000,000 users
> I = 50,000 items
> P = 50,000,000 preferences
>
> P*I = 250 billion
> P2/U = 2,5 billion
>
> -------------------------------------
> Or stated as inequality:
>
> P2/U > PI
> P/U > I
> P > UI <-- obviously not true
>
>
> Sean Owen schrieb:
>> Well it's not hard to add something like UncenteredCosineSimilarity
>> for sure, I don't mind. It's actually a matter of configuring the
>> superclass to center or not.
>>
>> But it's also easy to center the data in the M/R. I agree it makes
>> little difference in your case, and the effect is subtle. I can add
>> centering, to at least have it in the code for consistency -- in
>> addition to adding the implementation above for completeness.
>>
>> While I'm at it I think we might be able to simplify this item-item
>> computation. A straightforward alternative is something like this:
>>
>> 1. Compute item vectors (using Vector output, ideally) in one M/R
>> 2M. For each item-item pair output both vectors
>> 2R. With both Vectors in hand easily compute the cosine measure
>>
>> It's quite simple, and lends itself to dropping in a different reducer
>> stage to implement different similarities, which is great.
>>
>> The question is performance. Say I have P prefs, U users and I items.
>> Assume P/I prefs per item. The bottleneck of this stage is outputting
>> I^2 vectors of size P/I, or about P*I output.
>>
>> What you're doing now bottlenecks in the last bit where you output for
>> each user some data for every pair of items. That's coarsely U *
>> (P/U)^2 = U*P^2 output. I think that P^2 is killer.
>>
>> Double-check my thinking but if you like it I think I can
>> significantly simplify this job. I can maybe make a patch for you to
>> try.
>>
>> On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
>> <se...@zalando.de> wrote:
>>
>>> However, I'm a little bit confused now, let me explain why I thought
>>> that this additional similarity implementation would be necessary: Three
>>> weeks ago, I submitted a patch computing the cosine item similarities
>>> via map-reduce (MAHOUT-362), which is how I currently fill my database
>>> table of precomputed item-item-similarities. Some days ago I needed to
>>> precompute lots of recommendations offline via
>>> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
>>> to connect my map-reduce code to the database. So I was in need of a
>>> similarity implementation that would give me the same results on the fly
>>> from a FileDataModel, which is how I came to create the
>>> CosineItemSimilarity implementation. So my point here would be that if
>>> the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
>>> center the data, a non-distributed implementation of that way of
>>> computing the similarity should be available too, or the code in
>>> org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
>>> center the data.
>>>
>>> You stated that centering the data "adjusts for a user's tendency to
>>> rate high or low on average" which is certainly necessary when you
>>> collect explicit ratings from your users. However in my usecase (a big
>>> online-shopping platform) I unfortunately do not have explicit ratings
>>> from the users, so I chose to interpret certain actions as ratings (I
>>> recall this is called implicit ratings in the literature), e.g. a user
>>> putting an item into their shopping basket as a rating of 3 or a user
>>> purchasing an item as a rating of 5, like suggested in the "Making
>>> Recommendations" chapter" of "Programming Collective Intelligence" by T.
>>> Searagan. As far as I can judge (to be honest my mathematical knowledge
>>> is kinda limited) there are no different interpretations of the rating
>>> scala here as the values are fixed, so I thought that a centering of the
>>> data would not be necessary.
>>>
>>> Regards,
>>> Sebastian
>>>
>>> Sean Owen (JIRA) schrieb:
>>>
>>>>      [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>>
>>>> Sean Owen updated MAHOUT-387:
>>>> -----------------------------
>>>>
>>>>            Status: Resolved  (was: Patch Available)
>>>>          Assignee: Sean Owen
>>>>     Fix Version/s: 0.3
>>>>        Resolution: Won't Fix
>>>>
>>>> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to just look at this form of the sample correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png ... and note that sum(xi) = sum (yi) = 0 when the mean of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the vector sizes. This is just the cosine of the angle between x and y.
>>>>
>>>> One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler, and more consistent with Pearson (well, it makes it identical!). This has a good treatment:
>>>> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation
>>>>
>>>> Only for this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why to not center if there's an argument for it.
>>>>
>>>>
>>>>
>>>>> Cosine item similarity implementation
>>>>> -------------------------------------
>>>>>
>>>>>                 Key: MAHOUT-387
>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>>>>>             Project: Mahout
>>>>>          Issue Type: New Feature
>>>>>          Components: Collaborative Filtering
>>>>>            Reporter: Sebastian Schelter
>>>>>            Assignee: Sean Owen
>>>>>             Fix For: 0.3
>>>>>
>>>>>         Attachments: MAHOUT-387.patch
>>>>>
>>>>>
>>>>> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.
>>>>>
>>>>>
>>>>
>>>
>
>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sebastian Schelter <se...@zalando.de>.

Hi Sean,

"Say I have P prefs, U users and I items.
Assume P/I prefs per item. The bottleneck of this stage is outputting
I2 vectors of size P/I, or about P*I output.

What you're doing now bottlenecks in the last bit where you output for
each user some data for every pair of items. That's coarsely U *
(P/U)2 = U*P2 output. I think that P2 is killer."

There's an error in the above math as U * (P/U)2 = P2 / U (not UP2)
which makes me guess that there won't be any improvements by mapping out
all item-item pairs in the beginning.
I showed the calculations to a mathematician and he was also of the
opinion that there wouldn't be any improvements. You'll find our
thoughts below, please check that we did not oversee something. I will
run the job with your patch in the next days to get us a real-life
comparison.

Regards,
Sebastian

-------------------------------------
Calculated with some example numbers

U = 1,000,000 users
I = 50,000 items
P = 50,000,000 preferences

P*I = 250 billion
P2/U = 2,5 billion

-------------------------------------
Or stated as inequality:

P2/U > PI
P/U > I
P > UI <-- obviously not true


Sean Owen schrieb:
> Well it's not hard to add something like UncenteredCosineSimilarity
> for sure, I don't mind. It's actually a matter of configuring the
> superclass to center or not.
>
> But it's also easy to center the data in the M/R. I agree it makes
> little difference in your case, and the effect is subtle. I can add
> centering, to at least have it in the code for consistency -- in
> addition to adding the implementation above for completeness.
>
> While I'm at it I think we might be able to simplify this item-item
> computation. A straightforward alternative is something like this:
>
> 1. Compute item vectors (using Vector output, ideally) in one M/R
> 2M. For each item-item pair output both vectors
> 2R. With both Vectors in hand easily compute the cosine measure
>
> It's quite simple, and lends itself to dropping in a different reducer
> stage to implement different similarities, which is great.
>
> The question is performance. Say I have P prefs, U users and I items.
> Assume P/I prefs per item. The bottleneck of this stage is outputting
> I^2 vectors of size P/I, or about P*I output.
>
> What you're doing now bottlenecks in the last bit where you output for
> each user some data for every pair of items. That's coarsely U *
> (P/U)^2 = U*P^2 output. I think that P^2 is killer.
>
> Double-check my thinking but if you like it I think I can
> significantly simplify this job. I can maybe make a patch for you to
> try.
>
> On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
> <se...@zalando.de> wrote:
>   
>> However, I'm a little bit confused now, let me explain why I thought
>> that this additional similarity implementation would be necessary: Three
>> weeks ago, I submitted a patch computing the cosine item similarities
>> via map-reduce (MAHOUT-362), which is how I currently fill my database
>> table of precomputed item-item-similarities. Some days ago I needed to
>> precompute lots of recommendations offline via
>> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
>> to connect my map-reduce code to the database. So I was in need of a
>> similarity implementation that would give me the same results on the fly
>> from a FileDataModel, which is how I came to create the
>> CosineItemSimilarity implementation. So my point here would be that if
>> the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
>> center the data, a non-distributed implementation of that way of
>> computing the similarity should be available too, or the code in
>> org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
>> center the data.
>>
>> You stated that centering the data "adjusts for a user's tendency to
>> rate high or low on average" which is certainly necessary when you
>> collect explicit ratings from your users. However in my usecase (a big
>> online-shopping platform) I unfortunately do not have explicit ratings
>> from the users, so I chose to interpret certain actions as ratings (I
>> recall this is called implicit ratings in the literature), e.g. a user
>> putting an item into their shopping basket as a rating of 3 or a user
>> purchasing an item as a rating of 5, like suggested in the "Making
>> Recommendations" chapter" of "Programming Collective Intelligence" by T.
>> Searagan. As far as I can judge (to be honest my mathematical knowledge
>> is kinda limited) there are no different interpretations of the rating
>> scala here as the values are fixed, so I thought that a centering of the
>> data would not be necessary.
>>
>> Regards,
>> Sebastian
>>
>> Sean Owen (JIRA) schrieb:
>>     
>>>      [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Sean Owen updated MAHOUT-387:
>>> -----------------------------
>>>
>>>            Status: Resolved  (was: Patch Available)
>>>          Assignee: Sean Owen
>>>     Fix Version/s: 0.3
>>>        Resolution: Won't Fix
>>>
>>> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to just look at this form of the sample correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png ... and note that sum(xi) = sum (yi) = 0 when the mean of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the vector sizes. This is just the cosine of the angle between x and y.
>>>
>>> One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler, and more consistent with Pearson (well, it makes it identical!). This has a good treatment:
>>> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation
>>>
>>> Only for this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why to not center if there's an argument for it.
>>>
>>>
>>>       
>>>> Cosine item similarity implementation
>>>> -------------------------------------
>>>>
>>>>                 Key: MAHOUT-387
>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>>>>             Project: Mahout
>>>>          Issue Type: New Feature
>>>>          Components: Collaborative Filtering
>>>>            Reporter: Sebastian Schelter
>>>>            Assignee: Sean Owen
>>>>             Fix For: 0.3
>>>>
>>>>         Attachments: MAHOUT-387.patch
>>>>
>>>>
>>>> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.
>>>>
>>>>         
>>>       
>>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sean Owen <sr...@gmail.com>.

Well it's not hard to add something like UncenteredCosineSimilarity
for sure, I don't mind. It's actually a matter of configuring the
superclass to center or not.

But it's also easy to center the data in the M/R. I agree it makes
little difference in your case, and the effect is subtle. I can add
centering, to at least have it in the code for consistency -- in
addition to adding the implementation above for completeness.

While I'm at it I think we might be able to simplify this item-item
computation. A straightforward alternative is something like this:

1. Compute item vectors (using Vector output, ideally) in one M/R
2M. For each item-item pair output both vectors
2R. With both Vectors in hand easily compute the cosine measure

It's quite simple, and lends itself to dropping in a different reducer
stage to implement different similarities, which is great.

The question is performance. Say I have P prefs, U users and I items.
Assume P/I prefs per item. The bottleneck of this stage is outputting
I^2 vectors of size P/I, or about P*I output.

What you're doing now bottlenecks in the last bit where you output for
each user some data for every pair of items. That's coarsely U *
(P/U)^2 = U*P^2 output. I think that P^2 is killer.

Double-check my thinking but if you like it I think I can
significantly simplify this job. I can maybe make a patch for you to
try.

On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
<se...@zalando.de> wrote:
> However, I'm a little bit confused now, let me explain why I thought
> that this additional similarity implementation would be necessary: Three
> weeks ago, I submitted a patch computing the cosine item similarities
> via map-reduce (MAHOUT-362), which is how I currently fill my database
> table of precomputed item-item-similarities. Some days ago I needed to
> precompute lots of recommendations offline via
> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
> to connect my map-reduce code to the database. So I was in need of a
> similarity implementation that would give me the same results on the fly
> from a FileDataModel, which is how I came to create the
> CosineItemSimilarity implementation. So my point here would be that if
> the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
> center the data, a non-distributed implementation of that way of
> computing the similarity should be available too, or the code in
> org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
> center the data.
>
> You stated that centering the data "adjusts for a user's tendency to
> rate high or low on average" which is certainly necessary when you
> collect explicit ratings from your users. However in my usecase (a big
> online-shopping platform) I unfortunately do not have explicit ratings
> from the users, so I chose to interpret certain actions as ratings (I
> recall this is called implicit ratings in the literature), e.g. a user
> putting an item into their shopping basket as a rating of 3 or a user
> purchasing an item as a rating of 5, like suggested in the "Making
> Recommendations" chapter" of "Programming Collective Intelligence" by T.
> Searagan. As far as I can judge (to be honest my mathematical knowledge
> is kinda limited) there are no different interpretations of the rating
> scala here as the values are fixed, so I thought that a centering of the
> data would not be necessary.
>
> Regards,
> Sebastian
>
> Sean Owen (JIRA) schrieb:
>>      [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Sean Owen updated MAHOUT-387:
>> -----------------------------
>>
>>            Status: Resolved  (was: Patch Available)
>>          Assignee: Sean Owen
>>     Fix Version/s: 0.3
>>        Resolution: Won't Fix
>>
>> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to just look at this form of the sample correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png ... and note that sum(xi) = sum (yi) = 0 when the mean of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the vector sizes. This is just the cosine of the angle between x and y.
>>
>> One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler, and more consistent with Pearson (well, it makes it identical!). This has a good treatment:
>> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation
>>
>> Only for this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why to not center if there's an argument for it.
>>
>>
>>> Cosine item similarity implementation
>>> -------------------------------------
>>>
>>>                 Key: MAHOUT-387
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>>>             Project: Mahout
>>>          Issue Type: New Feature
>>>          Components: Collaborative Filtering
>>>            Reporter: Sebastian Schelter
>>>            Assignee: Sean Owen
>>>             Fix For: 0.3
>>>
>>>         Attachments: MAHOUT-387.patch
>>>
>>>
>>> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.
>>>
>>
>>
>
>

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by Sebastian Schelter <se...@zalando.de>.

Hi Sean and Jeff,

I looked at the formulas and I see your point that the computation is
the same for input series with a mean of zero, thank you for the
detailed feedback on this.

However, I'm a little bit confused now, let me explain why I thought
that this additional similarity implementation would be necessary: Three
weeks ago, I submitted a patch computing the cosine item similarities
via map-reduce (MAHOUT-362), which is how I currently fill my database
table of precomputed item-item-similarities. Some days ago I needed to
precompute lots of recommendations offline via
org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
to connect my map-reduce code to the database. So I was in need of a
similarity implementation that would give me the same results on the fly
from a FileDataModel, which is how I came to create the
CosineItemSimilarity implementation. So my point here would be that if
the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
center the data, a non-distributed implementation of that way of
computing the similarity should be available too, or the code in
org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
center the data.

You stated that centering the data "adjusts for a user's tendency to
rate high or low on average" which is certainly necessary when you
collect explicit ratings from your users. However in my usecase (a big
online-shopping platform) I unfortunately do not have explicit ratings
from the users, so I chose to interpret certain actions as ratings (I
recall this is called implicit ratings in the literature), e.g. a user
putting an item into their shopping basket as a rating of 3 or a user
purchasing an item as a rating of 5, like suggested in the "Making
Recommendations" chapter" of "Programming Collective Intelligence" by T.
Searagan. As far as I can judge (to be honest my mathematical knowledge
is kinda limited) there are no different interpretations of the rating
scala here as the values are fixed, so I thought that a centering of the
data would not be necessary.

Regards,
Sebastian

Sean Owen (JIRA) schrieb:
>      [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Sean Owen updated MAHOUT-387:
> -----------------------------
>
>            Status: Resolved  (was: Patch Available)
>          Assignee: Sean Owen
>     Fix Version/s: 0.3
>        Resolution: Won't Fix
>
> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to just look at this form of the sample correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png ... and note that sum(xi) = sum (yi) = 0 when the mean of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the vector sizes. This is just the cosine of the angle between x and y.
>
> One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler, and more consistent with Pearson (well, it makes it identical!). This has a good treatment:
> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation
>
> Only for this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why to not center if there's an argument for it.
>
>   
>> Cosine item similarity implementation
>> -------------------------------------
>>
>>                 Key: MAHOUT-387
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Collaborative Filtering
>>            Reporter: Sebastian Schelter
>>            Assignee: Sean Owen
>>             Fix For: 0.3
>>
>>         Attachments: MAHOUT-387.patch
>>
>>
>> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.
>>     
>
>

[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-387:
-----------------------------

           Status: Resolved  (was: Patch Available)
         Assignee: Sean Owen
    Fix Version/s: 0.3
       Resolution: Won't Fix

Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to just look at this form of the sample correlation: http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png ... and note that sum(xi) = sum (yi) = 0 when the mean of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the vector sizes. This is just the cosine of the angle between x and y.

One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler, and more consistent with Pearson (well, it makes it identical!). This has a good treatment:
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation

Only for this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why to not center if there's an argument for it.

> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-387:
--------------------------------------

    Status: Patch Available  (was: Open)

> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (MAHOUT-387) Cosine item similarity implementation

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reopened MAHOUT-387:
------------------------------


> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sean Owen
>             Fix For: 0.4
>
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-387:
--------------------------------------

    Attachment: MAHOUT-387.patch

> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (MAHOUT-387) Cosine item similarity implementation

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-387.
------------------------------

    Fix Version/s: 0.4
                       (was: 0.3)
       Resolution: Fixed

> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-387:
-----------------------------

    Priority: Minor  (was: Major)

Nevermind I'll add UncenteredCosineSimilarity, can't hurt

> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, I couldn't find an implementation (did I overlook it maybe?) so I created my own. I want to share it here, in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.