Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2010/08/12 16:35:17 UTC

[jira] Commented: (MAHOUT-467) Change Iterable in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance

    [ https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897771#action_12897771 ] 

Sebastian Schelter commented on MAHOUT-467:
-------------------------------------------

To model the millions of cooccurrences as a list or an array, we would have to load them all into memory at once.
We can't do this, because the scalability of the whole job would then be limited by the amount of RAM available on the worker machines.
IIRC, Mahout's goal is that its distributed jobs run in O(n) with respect to the size of the input data and in O(1) with respect to the amount of memory needed.
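
To illustrate the trade-off, here is a minimal sketch and not Mahout code: the class StreamingCountSketch and the helper lazyValues are hypothetical stand-ins for the reducer's value stream. Both variants visit every element once, so both are O(n) in time; the difference is that the list variant also holds all n elements in memory before size() can be called.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch; none of these names exist in Mahout.
final class StreamingCountSketch {

  /** Stand-in for a reducer's values: yields n elements lazily, never storing them. */
  static Iterable<Integer> lazyValues(final int n) {
    return () -> new Iterator<Integer>() {
      private int next = 0;

      @Override
      public boolean hasNext() {
        return next < n;
      }

      @Override
      public Integer next() {
        if (next >= n) {
          throw new NoSuchElementException();
        }
        return next++;
      }
    };
  }

  /** Streaming count: O(n) time, O(1) additional memory. */
  static int countStreaming(Iterable<?> values) {
    int count = 0;
    for (Object ignored : values) {
      count++;
    }
    return count;
  }

  /** List-based count: size() is O(1), but filling the list costs O(n) time and O(n) memory. */
  static int countMaterialized(Iterable<?> values) {
    List<Object> buffer = new ArrayList<>();
    for (Object value : values) {
      buffer.add(value);
    }
    return buffer.size();
  }

  public static void main(String[] args) {
    int n = 1_000_000;
    System.out.println(countStreaming(lazyValues(n)));
    System.out.println(countMaterialized(lazyValues(n)));
  }
}
```

In this sketch the list does not remove the pass over the data; it only moves that pass to the point where the list is built, and adds the memory cost on top.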


> Change Iterable<Cooccurrence> in  org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer  to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-467
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Hui Wen Han
>             Fix For: 0.4
>
>
> In class AbstractDistributedVectorSimilarity:
>   protected int countElements(Iterator<?> iterator) {
>     int count = 0;
>     while (iterator.hasNext()) {
>       count++;
>       iterator.next();
>     }
>     return count;
>   }
> The method countElements is called constantly, but it performs badly.
> If the iterator has a million elements, we have to iterate a million times just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity 
> public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {
>   @Override
>   protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
>       double weightOfVectorB, int numberOfColumns) {
>     return countElements(cooccurrences);
>   }
> }
> One item may be liked by many people; in our system, one item may be liked by a hundred thousand people.
> Here doComputeResult just returns the number of elements in cooccurrences, but it has to iterate a hundred thousand times.
> If we used a List or an array, we could get the result in one call, because the size is already set when the system constructs the List or array.
> 2)  DistributedLoglikelihoodVectorSimilarity
> 3)  DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity;
> it took 4.5 hours to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.
