You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/05/04 17:17:09 UTC

[jira] Created: (MAHOUT-389) UncenteredCosineSimilarity

UncenteredCosineSimilarity 
---------------------------

                 Key: MAHOUT-389
                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter


org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.

This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-389:
-----------------------------

    Issue Type: Improvement  (was: Bug)
      Priority: Minor  (was: Major)

I should add you could emulate the behavior you may desire in the non-distributed implementation with a combination of the UncenteredCosineSimilarity, and a PreferenceInferrer which always infers 0.

I don't see a simple way to make the distributed version emulate the non-distributed version.

While I think the non-distributed version has preferable semantics, neither is really "wrong". I suggesting leaving this as more of an improvement idea to figure out a way to unify them, by modifying the distributed version.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864835#action_12864835 ] 

Sebastian Schelter commented on MAHOUT-389:
-------------------------------------------

I'm glad you like it, refactor it as you wish! 

This patch should make it possible for mahout users to use their own customized similarity implementation for the distributed computation

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389-3.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864334#action_12864334 ] 

Sean Owen commented on MAHOUT-389:
----------------------------------

De-emphasizing common items is often desirable, though achieving it this way may be a case of mixing two things that can be separable. That is you can more directly de-emphasize with, for example, the provided InverseUserFrequency transformation.

But doubtless there are times when this version of the computation is better, even best. For example when your rating is really the number of times a user has viewed an item, then it's absolutely correct to assume missing items are rated 0, because they really are 0, and so adding that information is even better. You could also get this effect with a PreferenceInferrer. (See I've thought of everything.)

But I don't really mind adding more variations on the similarity computation, to give people options. As long as it fits in nicely, doesn't really complicate the code or confuse readers, and doesn't have any material performance impact, why not? I will look at the patch a bit later and see what you're up to.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389-3.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-389:
--------------------------------------

    Attachment: MAHOUT-389.patch

Unit test attached.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-389:
--------------------------------------

    Attachment: MAHOUT-389-2.patch

There might be cases where it makes sense to look not only at co-ratings, e.g. imagine you have 3 products: A, B and C

Let's say the pairs A,B and A,C have the same co-ratings (the same users bought them), but B is a topseller, which is bought by lots of people and C is a niche product, which only sells rarely.
A cosine which includes the zero assumption would decrease the value for the topseller and prefer the niche product, which might be a good thing depending on your use case.

But I definitely see your point here that the assumption is generally not holding and I also think that the distributed version should be modified. 

I attached a patch with a first proposal how this could be managed.

I tried to refactor the similarity computation out of the map-reduce code and make it possible to implement different similarity functions that have to follow this scheme:

 * in a early stage of the process, the similarity implementation can compute a weight (a single double) for each item-vector
 * in the end, it is given all co-ratings and the previously computed weights for each item-pair that has at least one co-rating

That should be sufficient to compute centered pearson-correlation as well as cosine or tanimoto coefficients.

I hope it's understandable what I'm trying to propose here, taking a look at org.apache.mahout.cf.taste.hadoop.similarity.DistributedSimilarity together with DistributedPearsonCorrelationSimilarity and DistributedUncenteredZeroAssumingCosineSimilarity will hopefully help to get a clearer picture. These implementations are merely for demonstration purposes, they could be merged with the already existing non-distributed implementations in case you like the approach described here.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-389:
--------------------------------------

    Attachment: MAHOUT-389-3.patch

MAHOUT-389-2.patch had some debug code not removed, sorry, this patch should be clean


> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389-3.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864812#action_12864812 ] 

Sean Owen commented on MAHOUT-389:
----------------------------------

I'm happy to commit this since it looks fine, suits your purposes as an active user, and remains reusable. I might make minor changes like not making CoRating an inner class of an interface, and just making it Writable rather than separate the Writable wrapper.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Priority: Minor
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389-3.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-389.
------------------------------

         Assignee: Sean Owen
    Fix Version/s: 0.4
       Resolution: Fixed

Committed patch #3 with some largely cosmetic style changes

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: MAHOUT-389-2.patch, MAHOUT-389-3.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-389) UncenteredCosineSimilarity

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863861#action_12863861 ] 

Sean Owen commented on MAHOUT-389:
----------------------------------

The issue with that is you are implicitly assuming that missing ratings are 0, and that's not a good assumption. For example, if your ratings are on a scale of 1 to 5, you are assuming that all unknown items are utterly hated.

It is not ignoring zeroes; it is ignoring missing ratings.

I think the distributed computation should act the same way for similar reasons.

> UncenteredCosineSimilarity 
> ---------------------------
>
>                 Key: MAHOUT-389
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-389
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine distance between those components of the vectors where both vectors have a value greater zero.
> This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.