You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Eugene Kirpichov (Created) (JIRA)" <ji...@apache.org> on 2011/09/30 13:21:45 UTC

[jira] [Created] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
-----------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-823
                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
             Project: Mahout
          Issue Type: Improvement
          Components: Math
            Reporter: Eugene Kirpichov


http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172

The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).

This can be fixed by adding this code before line 189.

{code}
if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
  return x.dot(this);
}
{code}

An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.

In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.

So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).

So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118015#comment-13118015 ] 

Sean Owen commented on MAHOUT-823:
----------------------------------

I think it's a good idea, and can be expanded. Am I right that the goal is to compute the dot in the way that will avoid more lookups (in the other vector), and/or avoid looking up in the vector with the slower get() operation?

I imagine that avoiding lookups is higher priority, so the rule you suggest is something we always want to do. So, for example, let's say you're dotting a SequentialAccessSparseVector with 100,000 non-default elements, with a RandomAccessSparseVector of 1,000 non-default elements. You'd want random-dot-sequential, right? Even though it'll take longer to look up in the SequentialAccessSparseVector with binary search, it's saving 100x lookups.

But we now have a rule in the code that prioritizes dotting sequential-dot-random. I think we'd just remove that then right?

And then in SequentialAccessSparseVector... I can imagine keeping the special-case of sequential-dot-sequential... but somehow it also seems overwhelmed by avoiding lookups. And, this special-case ignores DenseVectors, which should be similarly treated. So, can it go too?

And then DenseVector's special case for DenseVector seems not that useful.

And then we get to the point where all dot() implementations are the same, and include the helpful rule above. This seems happy and tidy -- see attached counter-proposal patch, just for discussion. Am I missing something?
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118721#comment-13118721 ] 

Sean Owen commented on MAHOUT-823:
----------------------------------

I did the homework, and came up with the following result. The substance of the patch is now this (as well as removing other dot() implementations):

{code}
    // Crude rule of thumb: when a sequential-access vector, with O(log n) lookups, has about
    // 2^n elements, its lookups take longer than a dense / random access vector (with O(1) lookups) by
    // about a factor of (0.71n - 12.3). This holds pretty well from n=19 up to at least n=23 according to my tests;
    // below that lookups are so fast that this difference is near zero.

    int thisNumNonDefault = getNumNondefaultElements();
    int thatNumNonDefault = x.getNumNondefaultElements();
    // Default: dot from smaller vector to larger vector
    boolean reverseDot = thatNumNonDefault < thisNumNonDefault;

    // But, see if we should override that -- is exactly one of them sequential access and so slower to lookup in?
    if (isSequentialAccess() != x.isSequentialAccess()) {
      double log2ThisSize = Math.log(thisNumNonDefault) / LOG2;
      double log2ThatSize = Math.log(thatNumNonDefault) / LOG2;
      // Only override when the O(log n) factor seems big enough to care about:
      if (log2ThisSize >= 19.0 && log2ThatSize >= 19.0) {
        double dotCost = thisNumNonDefault;
        if (x.isSequentialAccess()) {
          dotCost *= 0.71 * log2ThatSize - 12.3;
        }
        double reverseDotCost = thatNumNonDefault;
        if (isSequentialAccess()) {
          reverseDotCost *= 0.71 * log2ThisSize - 12.3;
        }
        reverseDot = reverseDotCost < dotCost;
      }
    }

    if (reverseDot) {
      return x.dot(this);
    }
{code}

Thoughts?
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Eugene Kirpichov (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118984#comment-13118984 ] 

Eugene Kirpichov commented on MAHOUT-823:
-----------------------------------------

Yay, great! Thanks!
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Eugene Kirpichov (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118049#comment-13118049 ] 

Eugene Kirpichov commented on MAHOUT-823:
-----------------------------------------

Ted, since the code change required is so small, I simply included it into the task description. Is there any particular reason to convert that into a patch (e.g. the necessity to follow a formal procedure)?

Sean, right, the goal is to avoid doing many lookups; I actually only had in mind the trivial case of dotting two RandomAccess vectors, but your comment got me thinking on what to do in the general case.

Random-random: Smallest leads (what's proposed)
Seq-seq: Merge (what's implemented)
Random-seq:
If random leads, number of operations is O((num-nonzero-in-r)^2) because there's this many lookups into the sequential vector, each taking linear time.
If sequential leads (what's implemented), number of operations is O(num nonzero in sequential).

It's difficult to decide which to use, given that we don't know the constant factors, but at least having the sequential lead will never have quadratic behavior, so I suggest to leave it as is.

---
Bottom line:
I would suggest to implement just the proposed change.
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-823:
-----------------------------

        Fix Version/s: 0.6
             Assignee: Sean Owen
               Labels: dot dot-product vector  (was: )
    Affects Version/s: 0.5
               Status: Patch Available  (was: Open)
    
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: vector, dot, dot-product
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118857#comment-13118857 ] 

Hudson commented on MAHOUT-823:
-------------------------------

Integrated in Mahout-Quality #1076 (See [https://builds.apache.org/job/Mahout-Quality/1076/])
    MAHOUT-823 standardize, optimize vector dot product

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177975
Files : 
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/VectorView.java

                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Robin Anil (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118626#comment-13118626 ] 

Robin Anil commented on MAHOUT-823:
-----------------------------------

There is  a vector benchmarks class in utils or examples? Use that, it compares various combinations of operations of various types of vectors
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-823:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Hearing essential consensus on this approach, and as tests pass, and it simplifies and speeds things, I committed. We can tweak it later.
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-823:
-----------------------------

    Attachment: MAHOUT-823.patch
    
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Eugene Kirpichov (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118474#comment-13118474 ] 

Eugene Kirpichov commented on MAHOUT-823:
-----------------------------------------

Sean, you're right - I was somehow blindly assuming that the sequential vector is O( n ). Indeed benchmarks need to be done. If there'd be a way to make the code simpler, sometimes much faster than the current version, and always not much slower (probably getting "always not slower" would be harder, as the constant factors measured in benchmarks would differ across particular deployments) - that would be great.
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118296#comment-13118296 ] 

Sean Owen commented on MAHOUT-823:
----------------------------------

What's the quadratic case? SequentialAccessSparseVector is O(log n) for lookups, not O(n). That's still worse than O(1), for a hash-based RandomAccessSparseVector or array-backed DenseVector, but the real-world difference, I assume, is a small-ish constant factor. Dunno, realistically looking at 20-ish comparisons in a big vector versus 4-5? It's still probably a 'win' to lead with the smaller vector if it has, say, 5x fewer entries.

I must say I'm in love with simplifying this and getting rid of 'instanceof' code here, which is already incomplete and not optimal in most cases. Why don't I run some benchmarks to get some concept of the appropriate constant factors, then build that in to my patch? Am I still missing something?
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118011#comment-13118011 ] 

Ted Dunning commented on MAHOUT-823:
------------------------------------

Great idea, do you have a patch?
                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Eugene Kirpichov
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-823) RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118065#comment-13118065 ] 

Ted Dunning commented on MAHOUT-823:
------------------------------------

Eugene, 

I was just suggesting the patch to avoid errors.  I would normally have created the patch and had you review it to be sure it was what you meant, but I am packing for my flight tomorrow and am too lazy.

Your suggested policies look fine.  Sequential-random could be handled as random leading, but I would sort the indexes from the random to avoid the quadratic behavior and just require a big ratio before considering the possibility.  Even then, it seems quite possible that the cost will be the same as the sequential leading since we have to skip through the list to the right spots anyway.  Thus, the cost is liable to be O(size of sequential up to last element of random) in any case.

                
> RandomAccessSparseVector.dot with another non-sequential vector can be extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-823
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-823
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Eugene Kirpichov
>            Assignee: Sean Owen
>              Labels: dot, dot-product, vector
>             Fix For: 0.6
>
>         Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
>   return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text retrieval problems), the centroids often have a huge number of non-zero components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for computing the distance, instead of point.dot(centroid), you end up with orders of magnitude worse performance (which is what we actually observed - the clustering time was a couple of minutes with this fix and over an hour without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case but didn't notice it will suddenly have a dramatic performance increase :)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira