Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2011/09/08 14:34:08 UTC

[jira] [Commented] (MAHOUT-767) Improve RowSimilarityJob performance

    [ https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100265#comment-13100265 ] 

Sebastian Schelter commented on MAHOUT-767:
-------------------------------------------

Here is a summary of my work so far (a new patch is coming):


We should only support algebraic similarity measures, which allows us to use a combiner in the most crucial phase. Furthermore, we will use the stripes pattern for in-mapper combination of cooccurrences to avoid emitting lots of cooccurrence pair objects.
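
To make the stripes idea concrete, here is a minimal sketch in plain Java, without any Hadoop or Mahout types. The class and method names (CooccurrenceStripes, stripesFor, mergeInto) are made up for illustration and are not taken from the patch. It assumes the similarity measure is built from dot products, which are algebraic because a dot product is a sum of partial products that can be added up in any order, in the mapper, a combiner, or the reducer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Mahout's RowSimilarityJob code.
final class CooccurrenceStripes {

  // "Map" side: for one column of the matrix (e.g. all item values of one user),
  // build one stripe per item that holds the partial products with every
  // higher-numbered item, instead of one object per cooccurring pair.
  static Map<Integer, Map<Integer, Double>> stripesFor(Map<Integer, Double> column) {
    TreeMap<Integer, Double> sorted = new TreeMap<>(column);
    Map<Integer, Map<Integer, Double>> stripes = new HashMap<>();
    for (Map.Entry<Integer, Double> a : sorted.entrySet()) {
      Map<Integer, Double> stripe = new HashMap<>();
      // Only pair with strictly larger keys so each pair is counted once.
      for (Map.Entry<Integer, Double> b : sorted.tailMap(a.getKey(), false).entrySet()) {
        stripe.put(b.getKey(), a.getValue() * b.getValue());
      }
      if (!stripe.isEmpty()) {
        stripes.put(a.getKey(), stripe);
      }
    }
    return stripes;
  }

  // "Combine"/"reduce" side: stripes with the same key are merged by
  // element-wise addition, which yields the dot products of the rows.
  static void mergeInto(Map<Integer, Map<Integer, Double>> accumulated,
                        Map<Integer, Map<Integer, Double>> stripes) {
    for (Map.Entry<Integer, Map<Integer, Double>> e : stripes.entrySet()) {
      Map<Integer, Double> target =
          accumulated.computeIfAbsent(e.getKey(), k -> new HashMap<>());
      for (Map.Entry<Integer, Double> cell : e.getValue().entrySet()) {
        target.merge(cell.getKey(), cell.getValue(), Double::sum);
      }
    }
  }
}
```

Because merging is just addition, the same mergeInto logic could serve inside the mapper, in a combiner, and in the reducer. A measure such as cosine would additionally need the row norms, which are sums as well.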

This issue also touches ItemSimilarityJob and RecommenderJob as they use RowSimilarityJob internally. We will introduce a new job responsible for preparing the input data for these jobs.

As the numbers of ratings per user and per item usually follow power-law distributions, appropriate down-sampling is crucial for the performance of these jobs: their runtime is dominated by the user with the largest number of interactions. We should remove the old "maxCooccurrencesPerItem" heuristic, as its result depends on the number of mappers that are run and on the ordering of the input data. A simple random down-sampling of users having a number of ratings above a threshold should work better.
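
One possible way to do such a down-sampling is sketched below, again in plain Java. It uses reservoir sampling, which is my choice for the illustration and not necessarily what the patch will do. The names InteractionSampler, downsample and maxInteractions are hypothetical. Users at or below the threshold keep all their interactions. Users above it keep a uniformly random subset of exactly maxInteractions entries, independent of input ordering in distribution and of how many mappers run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; names and approach are assumptions.
final class InteractionSampler {

  // Returns at most maxInteractions elements (maxInteractions must be >= 0).
  // If the input has more, every element is kept with equal probability
  // (reservoir sampling, "Algorithm R"), in a single pass over the input.
  static <T> List<T> downsample(Iterable<T> interactions, int maxInteractions, Random random) {
    List<T> reservoir = new ArrayList<>(maxInteractions);
    int seen = 0;
    for (T interaction : interactions) {
      seen++;
      if (reservoir.size() < maxInteractions) {
        // Below the threshold: keep everything.
        reservoir.add(interaction);
      } else {
        // Above the threshold: replace a random slot with
        // probability maxInteractions / seen.
        int slot = random.nextInt(seen); // uniform in [0, seen)
        if (slot < maxInteractions) {
          reservoir.set(slot, interaction);
        }
      }
    }
    return reservoir;
  }
}
```

The work per user is then bounded by the threshold, so a single very active user can no longer dominate the runtime of the pairwise step.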

> Improve RowSimilarityJob performance
> ------------------------------------
>
>                 Key: MAHOUT-767
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-767
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Grant Ingersoll
>            Assignee: Sebastian Schelter
>             Fix For: 0.6
>
>         Attachments: MAHOUT-767.patch
>
>
> (See http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
> Currently, the RowSimilarityJob defers the calculation of the similarity metric until the reduce phase, while emitting many Cooccurrence objects.  For similarity metrics that are algebraic (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be able to do much of the computation during the Mapper part of this phase and also take advantage of a Combiner.  
> We should use a marker interface to know whether a similarity metric is algebraic and then make use of an appropriate Mapper implementation, otherwise we can fall back on our existing implementation.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira