Posted to dev@mahout.apache.org by "Sebastian Schelter (Created) (JIRA)" <ji...@apache.org> on 2011/12/04 09:57:39 UTC

[jira] [Created] (MAHOUT-914) Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation

Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation
---------------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-914
                 URL: https://issues.apache.org/jira/browse/MAHOUT-914
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
    Affects Versions: 0.6
            Reporter: Sebastian Schelter
            Assignee: Sebastian Schelter
         Attachments: downsampling.png

The distributed item similarity computation applies a so-called 'interaction-cut': it selectively downsamples 'power users' in org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper. This is done because the users with the most interactions usually dominate the runtime without adding much to the quality, as users with an enormous number of interactions are very often crawlers or people sharing an account.

Mahout should have an exact counterpart of this strategy for the non-distributed code.

I also attach a figure that shows experiments with this strategy on the MovieLens 1M dataset. The dataset was split into a 90% training set and a 10% test set. An interaction cut of size k was applied and the prediction quality was measured using mean absolute error. The prediction on the unsampled dataset corresponds to using k = 1000, as this is the maximum number of interactions per user. We see that for k > 300 the error seems to converge, and we get a quality that sufficiently replicates the unsampled quality.
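
To make the idea concrete, here is a minimal, non-distributed sketch of an interaction cut. It is not the code in ToItemVectorsMapper and not the attached patch: the class InteractionCut, the method apply and the plain Map/List representation of the interactions are all hypothetical names chosen for illustration, and it assumes that a uniform random sample without replacement of k items is an acceptable way to downsample a power user.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper for illustration only; not part of Mahout. */
final class InteractionCut {

  private InteractionCut() {}

  /**
   * Returns a copy of the input in which every user keeps at most k items.
   * Users with k or fewer interactions are copied unchanged.
   */
  static Map<Long, List<Long>> apply(Map<Long, List<Long>> itemsByUser, int k, Random random) {
    if (k < 0) {
      throw new IllegalArgumentException("k must be non-negative");
    }
    Map<Long, List<Long>> result = new HashMap<>();
    for (Map.Entry<Long, List<Long>> entry : itemsByUser.entrySet()) {
      // Work on a copy so the caller's lists are never modified.
      List<Long> items = new ArrayList<>(entry.getValue());
      if (items.size() > k) {
        // Power user: shuffle, then keep the first k items,
        // i.e. a uniform sample without replacement.
        Collections.shuffle(items, random);
        items = new ArrayList<>(items.subList(0, k));
      }
      result.put(entry.getKey(), items);
    }
    return result;
  }
}
```

With k = 300, a user with 1000 interactions would keep 300 randomly chosen ones, while a user with 50 interactions would be left untouched. Passing the Random in explicitly keeps experiments like the one above repeatable.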

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-914) Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation

Posted by "Sebastian Schelter (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-914:
--------------------------------------

    Attachment: downsampling.png
    

[jira] [Updated] (MAHOUT-914) Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation

Posted by "Sebastian Schelter (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-914:
--------------------------------------

    Attachment: MAHOUT-914.patch
    

[jira] [Updated] (MAHOUT-914) Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation

Posted by "Sebastian Schelter (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-914:
--------------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (MAHOUT-914) Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation

Posted by "Sebastian Schelter (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-914:
--------------------------------------

    Resolution: Duplicate
        Status: Resolved  (was: Patch Available)

Will be included in MAHOUT-910.
                

[jira] [Commented] (MAHOUT-914) Provide a non-distributed counterpart of the sampling which is applied in the distributed item similarity computation

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164484#comment-13164484 ] 

Hudson commented on MAHOUT-914:
-------------------------------

Integrated in Mahout-Quality #1234 (See [https://builds.apache.org/job/Mahout-Quality/1234/])
    MAHOUT-910 merge ideas from MAHOUT-914, better docs, new no-limit arg, different defaults from Sebastian

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211439
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java

                