Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/09/22 09:14:34 UTC

[jira] Resolved: (MAHOUT-467) Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance

     [ https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-467.
------------------------------

         Assignee: Sean Owen
    Fix Version/s:     (was: 0.4)
       Resolution: Not A Problem

> Change Iterable<Cooccurrence> in  org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer  to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-467
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Han Hui Wen 
>            Assignee: Sean Owen
>
> In class AbstractDistributedVectorSimilarity:
>   protected int countElements(Iterator<?> iterator) {
>     int count = 0;
>     while (iterator.hasNext()) {
>       count++;
>       iterator.next();
>     }
>     return count;
>   }
> The method countElements is called repeatedly, but it performs poorly.
> If the iterator has a million elements, we have to iterate a million times just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity 
> public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {
>   @Override
>   protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
>       double weightOfVectorB, int numberOfColumns) {
>     return countElements(cooccurrences);
>   }
> }
> One item may be liked by many people; in our system, a single item may be liked by a hundred thousand users.
> Here doComputeResult just returns the count of elements in cooccurrences, but it has to iterate a hundred thousand times to do so.
> If we used a List or array type, we could get the result in one call, because the size is already set when the system constructs the List or array.
> 2)  DistributedLoglikelihoodVectorSimilarity
> 3)  DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity;
> it took 4.5 hours to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.
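
As a rough illustration of the trade-off the description raises, the sketch below compares counting through an Iterator with copying the values into a List and calling size(). It is not code from Mahout: the class name CountElementsSketch and the helper materialize are hypothetical, and only countElements mirrors the method quoted above. Note that building the List still requires one full pass over the values and holds all of them in memory, so the O(1) size() call only pays off if something else has already built the list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Mahout.
public class CountElementsSketch {

  // Mirrors the quoted method: walks the iterator once.
  // O(n) time, O(1) extra memory.
  public static int countElements(Iterator<?> iterator) {
    int count = 0;
    while (iterator.hasNext()) {
      iterator.next();
      count++;
    }
    return count;
  }

  // Hypothetical alternative: copy every value into a List.
  // size() is then O(1), but the copy itself costs O(n) time and O(n) memory.
  public static <T> List<T> materialize(Iterable<T> values) {
    List<T> list = new ArrayList<>();
    for (T value : values) {
      list.add(value);
    }
    return list;
  }

  public static void main(String[] args) {
    List<Integer> source = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      source.add(i);
    }

    int counted = countElements(source.iterator());
    List<Integer> copy = materialize(source);

    // Both approaches report the same count.
    System.out.println(counted + " " + copy.size()); // prints "100000 100000"
  }
}
```

Either way the values are traversed at least once; the List variant merely moves that traversal to construction time.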

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.