Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/05/09 09:40:48 UTC

[jira] Created: (MAHOUT-393) Distributed item similarity functions

Distributed item similarity functions
-------------------------------------

                 Key: MAHOUT-393
                 URL: https://issues.apache.org/jira/browse/MAHOUT-393
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter


To complete the work started in MAHOUT-389, I've created a distributed version of every item similarity function that is already available in non-distributed form. An additional M/R job was necessary to compute the total number of users, which some similarity functions need (LogLikelihoodSimilarity, for example).
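
As a rough illustration of why a log-likelihood based similarity needs the total user count, here is a minimal sketch of the standard log-likelihood ratio over a 2x2 contingency table of users. The class and method names (LlrSketch, itemLlr, and so on) are hypothetical and are not taken from the patch; how Mahout maps the ratio to a similarity value is not shown.

```java
public class LlrSketch {

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy term: xLogX(sum of counts) minus the sum of xLogX(count).
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio (G statistic) of a 2x2 contingency table.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Builds the table for two items A and B. The last cell, "users who
    // touched neither item", can only be filled in if numUsers is known.
    static double itemLlr(long usersWithBoth, long usersWithA, long usersWithB, long numUsers) {
        long k11 = usersWithBoth;
        long k12 = usersWithA - usersWithBoth;
        long k21 = usersWithB - usersWithBoth;
        long k22 = numUsers - usersWithA - usersWithB + usersWithBoth;
        return logLikelihoodRatio(k11, k12, k21, k22);
    }

    public static void main(String[] args) {
        // Same co-occurrence counts, different total number of users.
        System.out.println(itemLlr(10, 20, 30, 100));
        System.out.println(itemLlr(10, 20, 30, 10000));
    }
}
```

With identical co-occurrence counts, the ratio changes as numUsers changes, which is why the count has to be computed before the similarity can be.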

There is still some optimization potential in the code, as not every similarity function needs all the information that is currently extracted (the number of users, for example), but the optimization would have made the code much less readable, so I did not do any work on that.

I hope you consider this a useful addition.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-393) Distributed item similarity functions

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865600#action_12865600 ] 

Sebastian Schelter commented on MAHOUT-393:
-------------------------------------------

org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest.testCompleteJob() explicitly tests for the number of users, and it worked with your changes, so it seems to be good as it is.

[jira] Resolved: (MAHOUT-393) Distributed item similarity functions

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-393.
------------------------------

         Assignee: Sean Owen
    Fix Version/s: 0.4
       Resolution: Fixed

Done, I committed with only two substantive tweaks:

- I had switched over to VLongWritable from LongWritable. Most IDs in use need far fewer than 8 bytes, so variable-length coding saves a lot of space.
- CountUsersKeyWritable didn't define equals() and hashCode() non-trivially, and was inconsistent with compareTo(). Am I missing something here?
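
To make the first point concrete, here is a minimal sketch of a generic base-128 variable-length encoding for non-negative IDs. It is an assumption-laden illustration only: the class name VarLongSketch is hypothetical, and Hadoop's actual VLongWritable wire format differs in detail from this scheme.

```java
import java.io.ByteArrayOutputStream;

public class VarLongSketch {

    // Encodes a non-negative long using 7 payload bits per byte.
    // A set high bit means "more bytes follow".
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative IDs only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Reverses encode(): reassembles the 7-bit groups, lowest group first.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] ids = {42L, 300L, 1_000_000L, Long.MAX_VALUE};
        for (long id : ids) {
            System.out.println(id + " -> " + encode(id).length
                + " byte(s); a fixed-width long always takes 8");
        }
    }
}
```

Small IDs shrink to one to three bytes, while very large values can take slightly more than 8, so the saving depends on the IDs being small in practice.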

[jira] Commented: (MAHOUT-393) Distributed item similarity functions

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865599#action_12865599 ] 

Sean Owen commented on MAHOUT-393:
----------------------------------

Unless I missed something and the unit tests don't manage to catch it, yes, I think this is important. The way it was defined, the objects had an ordering but were all equal. So compareTo() would return nonzero for objects that are equal according to equals(), and that could cause problems, if not now then some day. (It may be that the value of equals() is never used.) Anyway, it seems we're all set here.
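
For reference, a minimal sketch of a key whose equals(), hashCode() and compareTo() agree with one another. The Key class and its two fields are hypothetical and are not the actual layout of CountUsersKeyWritable; the point is only that compareTo() returns 0 exactly when equals() returns true.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class ConsistentKeySketch {

    // Hypothetical composite key; the field names are illustrative only.
    static final class Key implements Comparable<Key> {
        final long userId;
        final long itemId;

        Key(long userId, long itemId) {
            this.userId = userId;
            this.itemId = itemId;
        }

        // Orders by userId first, then by itemId.
        @Override
        public int compareTo(Key other) {
            int byUser = Long.compare(userId, other.userId);
            return byUser != 0 ? byUser : Long.compare(itemId, other.itemId);
        }

        // Equal exactly when compareTo() would return 0.
        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return userId == other.userId && itemId == other.itemId;
        }

        // Built from the same fields that equals() compares.
        @Override
        public int hashCode() {
            return 31 * Long.hashCode(userId) + Long.hashCode(itemId);
        }
    }

    public static void main(String[] args) {
        Set<Key> sorted = new TreeSet<>();
        Set<Key> hashed = new HashSet<>();
        Key[] keys = {new Key(1, 7), new Key(1, 8), new Key(1, 7)};
        for (Key k : keys) {
            sorted.add(k);
            hashed.add(k);
        }
        // With consistent definitions, both collections agree on what a duplicate is.
        System.out.println(sorted.size() + " " + hashed.size());
    }
}
```

If equals() treated every key as equal while compareTo() still ordered them, a sorted collection and a hash-based one would disagree about duplicates, which is the kind of latent problem described above.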

[jira] Commented: (MAHOUT-393) Distributed item similarity functions

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865598#action_12865598 ] 

Sebastian Schelter commented on MAHOUT-393:
-------------------------------------------

I thought those definitions of equals() and hashCode() were necessary for the Secondary Sort to work, but obviously they aren't :)

[jira] Updated: (MAHOUT-393) Distributed item similarity functions

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-393:
--------------------------------------

    Attachment: MAHOUT-393.patch
