Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/09/22 09:14:34 UTC
[jira] Resolved: (MAHOUT-467) Change Iterable<Cooccurrence> in
org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer
to list or array to improve the performance
[ https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved MAHOUT-467.
------------------------------
Assignee: Sean Owen
Fix Version/s: (was: 0.4)
Resolution: Not A Problem
> Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-467
> URL: https://issues.apache.org/jira/browse/MAHOUT-467
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Han Hui Wen
> Assignee: Sean Owen
>
> In class AbstractDistributedVectorSimilarity:
> protected int countElements(Iterator<?> iterator) {
>   int count = 0;
>   while (iterator.hasNext()) {
>     count++;
>     iterator.next();
>   }
>   return count;
> }
> The method countElements is called repeatedly, but it performs poorly.
> If the iterator has a million elements, we have to iterate a million times just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity
> public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {
>   @Override
>   protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
>       double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
>     return countElements(cooccurrences);
>   }
> }
> One item may be liked by many people; in our system, a single item may be liked by a hundred thousand people.
> Here doComputeResult just returns the count of elements in cooccurrences, but it has to iterate a hundred thousand times to do so.
> If we used a List or array type, we could get the result in one call, because the size is already set when the List or array is constructed.
> 2) DistributedLoglikelihoodVectorSimilarity
> 3) DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity;
> the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer step took 4.5 hours to run.
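
For illustration, the trade-off described in the report could be sketched roughly as follows. This is a minimal sketch, not Mahout code: the class name CountElementsSketch and its countElements helper are hypothetical. It assumes the caller may pass either an in-memory Collection (whose size is known up front) or a streaming Iterable (which has to be walked). In a Hadoop reducer the values are generally streamed rather than held in memory, so the fast path may not apply there.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class CountElementsSketch {

  // Hypothetical helper (not part of Mahout): returns the number of elements,
  // using size() when the input is an in-memory Collection and falling back
  // to a full pass over the iterator otherwise.
  static int countElements(Iterable<?> iterable) {
    if (iterable instanceof Collection) {
      // Fast path: the collection already tracks its own size.
      return ((Collection<?>) iterable).size();
    }
    // Slow path: one step per element, as in the method quoted in the report.
    int count = 0;
    for (Iterator<?> it = iterable.iterator(); it.hasNext(); it.next()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<String> inMemory = Arrays.asList("a", "b", "c");
    // An Iterable that exposes only an iterator, standing in for streamed input.
    Iterable<String> streamed = inMemory::iterator;

    System.out.println(countElements(inMemory));
    System.out.println(countElements(streamed));
  }
}
```

Both calls return the same count; only the first avoids the per-element loop. Note that the fast path requires all elements to be materialized first, which has its own memory cost for very large rows.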