You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sean Owen (Updated) (JIRA)" <ji...@apache.org> on 2011/09/30 14:25:45 UTC

[jira] [Updated] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

     [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-823:
-----------------------------

    Attachment: MAHOUT-823.patch
    
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira