Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/08/08 11:40:16 UTC

[jira] Created: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob
-------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-460
                 URL: https://issues.apache.org/jira/browse/MAHOUT-460
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter


Because "cooccurrence algorithms ... scale in the square of the number of occurrences [of the] most popular item" (as Ted wrote in a recent mail), we should offer a parameter for ItemSimilarityJob that limits the number of preferences considered per item. RecommenderJob already has such an option.
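
To make the quadratic growth concrete, here is a minimal sketch that counts unordered pairs among n co-occurring entries, with and without a cap. The class name, the method names and the sample counts are made up for illustration; this is not code from ItemSimilarityJob.

```java
/** Illustrative pair counting; hypothetical names, not Mahout code. */
final class CooccurrencePairCount {

  /** Unordered pairs among n co-occurring entries: n * (n - 1) / 2. */
  static long pairs(long n) {
    return n * (n - 1) / 2;
  }

  /** Total pairs over all groups when at most cap entries per group are considered. */
  static long totalPairs(long[] occurrenceCounts, long cap) {
    long total = 0;
    for (long count : occurrenceCounts) {
      total += pairs(Math.min(count, cap));
    }
    return total;
  }

  public static void main(String[] args) {
    // Two small groups and one very popular one (made-up numbers).
    long[] counts = {3, 5, 10_000};
    System.out.println(totalPairs(counts, Long.MAX_VALUE)); // no cap
    System.out.println(totalPairs(counts, 100));            // cap of 100
  }
}
```

With these sample counts, the single popular entry accounts for almost all of the pairs, and the cap removes most of that work.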

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899799#action_12899799 ] 

Hudson commented on MAHOUT-460:
-------------------------------

Integrated in Mahout-Quality #200 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/200/])
    MAHOUT-460 Add maxPreferencesPerItemConsidered option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob


[jira] Updated: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-460:
--------------------------------------

    Attachment: MAHOUT-460.patch


[jira] Commented: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899745#action_12899745 ] 

Sebastian Schelter commented on MAHOUT-460:
-------------------------------------------

Cleaned up the patch. 


[jira] Commented: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898727#action_12898727 ] 

Sebastian Schelter commented on MAHOUT-460:
-------------------------------------------

Patch attached, which fixes a big misunderstanding in the existing code. I had created MaybePruneRowsMapper from Sean's old UserVectorToCooccurrenceMapper. Its main use should have been to limit the number of cooccurrences per item in the RecommenderJob. Unfortunately, it was applied to the item-user matrix (the item vectors) instead of the user-item matrix (the user vectors), which is now corrected.

Please note that the approach taken here is only a heuristic: each mapper instance tries to limit the number of cooccurrences on its own, if I understand the code correctly.

I introduced a new job argument "maxCooccurrencesPerItem" with a default of 100.
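
Why per-mapper limiting is only a heuristic can be sketched as follows. This is a simplified, hypothetical counter and not the actual MaybePruneRowsMapper code: the class, its methods and the item ID 42 are assumptions for illustration. Each instance only knows what it has seen itself, so the number of entries kept globally can reach the cap times the number of mapper instances.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-mapper counter; not the actual MaybePruneRowsMapper code. */
final class LocalOccurrenceCap {

  private final int maxPerItem;
  private final Map<Long, Integer> seen = new HashMap<>();

  LocalOccurrenceCap(int maxPerItem) {
    this.maxPerItem = maxPerItem;
  }

  /** Returns true while this instance has admitted fewer than maxPerItem entries for itemID. */
  boolean admit(long itemID) {
    int count = seen.getOrDefault(itemID, 0);
    if (count >= maxPerItem) {
      return false;
    }
    seen.put(itemID, count + 1);
    return true;
  }

  /** Entries kept when each of several independent instances sees perMapper occurrences of one item. */
  static int keptAcrossMappers(int mappers, int perMapper, int maxPerItem) {
    int kept = 0;
    for (int m = 0; m < mappers; m++) {
      // A fresh counter per simulated mapper: no state is shared between them.
      LocalOccurrenceCap cap = new LocalOccurrenceCap(maxPerItem);
      for (int i = 0; i < perMapper; i++) {
        if (cap.admit(42L)) {
          kept++;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    System.out.println(keptAcrossMappers(1, 500, 100)); // one mapper
    System.out.println(keptAcrossMappers(4, 500, 100)); // four mappers
  }
}
```

With a single simulated mapper the cap of 100 holds exactly; with four of them, up to 400 occurrences of the same item get through.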


[jira] Updated: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-460:
--------------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.4
       Resolution: Fixed


[jira] Updated: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-460:
--------------------------------------

    Attachment: MAHOUT-460-2.patch


[jira] Issue Comment Edited: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898726#action_12898726 ] 

Sebastian Schelter edited comment on MAHOUT-460 at 8/15/10 2:30 PM:
--------------------------------------------------------------------

The goal of this issue is to introduce a limit on the number of cooccurrences per item, so that the runtime of ItemSimilarityJob and RecommenderJob is linear in the size of the input and does not depend on the maximum number of preferences per item.

This should be OK because you don't really learn anything new about an item after seeing a certain number of preferences, so it should be sufficient to look at no more than a fixed number of them per item.

      was (Author: ssc):
    The goal of this issue is to introduce a limititation onto the number of cooccurrences per item to make the runtime of ItemSimilarityJob and RecommenderJob runtime linear to the size of the input and not dependent on the maximum number of preferences per item.

This should be OK to do because you don't really learn anything new about an item after seeing a certain number of preferences and thus it should be sufficient to look at a fixed number of them at maximum per item
  

[jira] Commented: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898751#action_12898751 ] 

Sean Owen commented on MAHOUT-460:
----------------------------------

I have some general comments and then believe you are welcome to commit.

- Your IDE seems to be reordering imports. I'd leave them as they are, since their ordering is reasonably standard across the code.
- Some of the changes also seem to be changes in whitespace indentation. It should be 2 spaces per unit of indentation everywhere; for instance, see MaybePruneRowsMapper.countSeen().
- MathHelper: I wouldn't concatenate a string together with '+' and then append it to a StringBuffer. Append each piece separately to take advantage of the buffer.
- Also, we should all use StringBuilder, not StringBuffer.
- ToItemVectorsReducer: attach the Apache copyright header?
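
The two string-building points above might look like this in practice. This is a sketch with a hypothetical helper: the MathHelper method under review is not shown in this thread, so the class, the method name and the output format here are assumptions. StringBuilder is the unsynchronized counterpart of StringBuffer, and appending each piece avoids building a temporary String first.

```java
/** Hypothetical formatting helper; MathHelper's real method is not shown in this thread. */
final class AppendStyle {

  /** Formats values as index:value pairs, appending each piece separately. */
  static String describe(double[] values) {
    StringBuilder out = new StringBuilder("[");
    for (int i = 0; i < values.length; i++) {
      if (i > 0) {
        out.append(", ");
      }
      // Avoid out.append(i + ":" + values[i]), which concatenates into a temporary String.
      out.append(i).append(':').append(values[i]);
    }
    return out.append(']').toString();
  }

  public static void main(String[] args) {
    System.out.println(describe(new double[] {1.0, 2.5}));
  }
}
```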


[jira] Commented: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898726#action_12898726 ] 

Sebastian Schelter commented on MAHOUT-460:
-------------------------------------------

The goal of this issue is to introduce a limit on the number of cooccurrences per item, so that the runtime of ItemSimilarityJob and RecommenderJob is linear in the size of the input and does not depend on the maximum number of preferences per item.

This should be OK because you don't really learn anything new about an item after seeing a certain number of preferences, so it should be sufficient to look at no more than a fixed number of them per item.
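
One way to see why a fixed cap gives a linear bound, under a cost model that is an assumption here and not something stated in the issue: suppose the work for item i grows with the square of the number of its preferences that are considered. Let n_i be the number of preferences for item i, N the total number of preferences (the input size), and k the cap. Then:

```latex
\sum_i \min(n_i, k)^2 \;\le\; \sum_i k \cdot \min(n_i, k) \;\le\; k \sum_i n_i \;=\; kN
```

For a fixed k this is linear in N. Without the cap, the same model gives \sum_i n_i^2, which can approach N^2 when a single item holds most of the preferences.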


[jira] Updated: (MAHOUT-460) Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-460:
--------------------------------------

    Status: Patch Available  (was: Open)
