Posted to dev@mahout.apache.org by "Shashikant Kore (JIRA)" <ji...@apache.org> on 2009/05/22 13:29:45 UTC

[jira] Created: (MAHOUT-121) Speed up distance calculations for sparse vectors

Speed up distance calculations for sparse vectors
-------------------------------------------------

                 Key: MAHOUT-121
                 URL: https://issues.apache.org/jira/browse/MAHOUT-121
             Project: Mahout
          Issue Type: Improvement
          Components: Matrix
            Reporter: Shashikant Kore


From my mail to the Mahout mailing list.

I am working on clustering a dataset which has thousands of sparse vectors. The complete dataset has a few tens of thousands of feature items, but each vector has only a couple of hundred feature items. For this, there is an optimization in distance calculation, a link to which I found in the archives of the Mahout mailing list.

http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
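
The gist of it (a rough sketch of the idea from that post, not the exact patch; the
method shape below is only illustrative) is to precompute the squared length of each
centroid once and then touch only the non-zero entries of the document vector:

    // ||x - c||^2 = ||c||^2 + sum over non-zero x_i of ((x_i - c_i)^2 - c_i^2)
    static double distanceSquared(double centroidLengthSquared, Vector centroid, Vector doc) {
      double result = centroidLengthSquared;
      Iterator<Vector.Element> it = doc.iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        double c = centroid.getQuick(e.index());
        double delta = e.get() - c;
        // subtract c^2 because that term was already counted in ||c||^2
        result += (delta * delta) - (c * c);
      }
      return result;
    }

Since a document vector has only a couple of hundred non-zero entries, this avoids
iterating over the full cardinality of the space for every distance call.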

I tried out this optimization.  The test setup had 2000 document vectors with a few hundred items each.  I ran canopy generation with Euclidean distance and t1, t2 values of 250 and 200.
 
Current Canopy Generation: 28 min 15 sec.
Canopy Generation with distance optimization: 1 min 38 sec.

I know from experience that using Integer and Double objects instead of primitives is computationally expensive. I changed the sparse vector implementation to use primitive collections from Trove [ http://trove4j.sourceforge.net/ ].
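
For illustration only (this is not the patch itself; the class and package names below
are from Trove 2.x and may differ between versions), the difference is roughly:

    // Boxed map: every put/get creates or unwraps Integer/Double objects.
    java.util.Map<Integer, Double> boxed = new java.util.HashMap<Integer, Double>();
    boxed.put(42, 0.5);          // autoboxes both key and value
    double v1 = boxed.get(42);   // unboxes the value again

    // Trove primitive map: keys and values stay primitive, no per-entry objects.
    gnu.trove.TIntDoubleHashMap primitive = new gnu.trove.TIntDoubleHashMap();
    primitive.put(42, 0.5);
    double v2 = primitive.get(42);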

Distance optimization with Trove: 59 sec
Current canopy generation with Trove: 21 min 55 sec

To sum up, these two optimizations reduced cluster generation time by about 97%.

Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.  

Licensing of Trove seems to be an issue which needs to be addressed.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 24, 2009, at 7:20 PM, Grant Ingersoll wrote:

>
> On Jun 24, 2009, at 6:53 PM, Ted Dunning wrote:
>
>> Grant,
>>
>> This optimization should have made a large difference.  Did it?
>
> Yes.  Still quantifying, but very promising.  Still having a hard  
> time finding good t1, t2 values for the simple tests I am running  
> clustering Wikipedia data, so that is clouding things.  It seems no  
> matter what I pick, I get one vector per canopy.  Obviously,  
> something is wrong, but I don't know what. Sigh.  Of course, it  
> could be the fact the docs I'm clustering aren't related, I guess.   
> I'm only doing the first 1000 from a dump.  I'll try a bigger  
> version now.

Switching to the CosineDistanceMeasure has improved things.  Sorry for the thread hijack.

>
> All the tests pass with the changes, though, and I had the same  
> problem before.
>
>>
>> The triangle inequality trick should help by a factor of two or  
>> more as
>> well.
>
>


Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
If the distance being used is a real metric, it would satisfy the triangle
inequality (that is basically the definition of metric).  For a given point
x relative to two centroids c1 and c2, this means that

         r(c1, c2) <= r(x, c1) + r(x, c2)

or

         r(c1, c2) - r(x, c1) <= r(x, c2)

This means that if r(c1, c2) > 2 r(x, c1), you don't have to consider
r(x, c2), since r(x, c1) < r(x, c2).

Moreover, since many points stay with the same centroid from one iteration
to the next, especially in later iterations, you can compute the distance to
the previous centroid first and avoid computing distances to several other
centroids.

A related optimization is to not recompute distances to clusters that
haven't moved.  Typically, this would only apply to the closest cluster to
save memory, but that is a late stage decision.
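
As a rough sketch (nothing like this is in the current patch; the pairwise
centroid-distance table and the getId() accessor are assumptions of the sketch),
the assignment loop could look like:

    Cluster assign(Vector x, List<Cluster> clusters, DistanceMeasure measure,
                   double[][] centroidDist, Cluster previous) {
      // Start from the centroid the point belonged to in the last iteration.
      Cluster best = (previous != null) ? previous : clusters.get(0);
      double bestDist = measure.distance(x, best.getCenter());
      for (Cluster candidate : clusters) {
        if (candidate == best) {
          continue;
        }
        // Triangle inequality: r(x, c2) >= r(c1, c2) - r(x, c1), so if
        // r(c1, c2) >= 2 r(x, c1) the candidate cannot be closer; skip it.
        if (centroidDist[best.getId()][candidate.getId()] >= 2.0 * bestDist) {
          continue;
        }
        double d = measure.distance(x, candidate.getCenter());
        if (d < bestDist) {
          best = candidate;
          bestDist = d;
        }
      }
      return best;
    }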

In high dimensional spaces with completely symmetric distributions of
points, it is pretty easy to see that the triangle optimization will not
help that much.  BUT, one essentially never has completely symmetrical
distributions, and with text you have radically non-symmetric distributions
due to the sparsity of the document vectors.  So I am very curious how these
optimizations will work for the text clustering you are doing.

On Thu, Jun 25, 2009 at 1:04 PM, Grant Ingersoll <gs...@apache.org>wrote:

> 3) use triangle inequality to limit number of times we have to compute
>> distance (probably 2x win, but I am curious)
>>
>
> Reference?  I don't believe this is implemented, but I might not be
> understanding.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 24, 2009, at 10:11 PM, Ted Dunning wrote:

> To clarify, the optimizations I know about that are pending are:
>
> 1) better hash table that doesn't box/unbox (clear win)

This is committed already.

Seems like we should also remove the square root calculation for
Euclidean, right? We don't actually care about exact distances, since
the computation is relative.


>
> 2) better centroid distance computation that uses sparsity of document
> vector to minimize the L2 norm computation time (this is what I am  
> curious
> about)

Yes, this is in the patch and is notably faster, in my subjective  
testing, which is backed by Shashi.  I'm going to commit soon.  Just  
wanted a final discussion on whether we should remove the sqrt calc.  
from EuclideanDistance.  Seems like we should since it's just used for  
comparison purposes.


>
> 3) use triangle inequality to limit number of times we have to compute
> distance (probably 2x win, but I am curious)

Reference?  I don't believe this is implemented, but I might not be  
understanding.

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
To clarify, the optimizations I know about that are pending are:

1) better hash table that doesn't box/unbox (clear win)

2) better centroid distance computation that uses sparsity of document
vector to minimize the L2 norm computation time (this is what I am curious
about)

3) use triangle inequality to limit number of times we have to compute
distance (probably 2x win, but I am curious)

I realized that my question was ambiguous after I asked.

On Wed, Jun 24, 2009 at 4:20 PM, Grant Ingersoll <gs...@apache.org>wrote:

> Still quantifying, but very promising




-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 24, 2009, at 6:53 PM, Ted Dunning wrote:

> Grant,
>
> This optimization should have made a large difference.  Did it?

Yes.  Still quantifying, but very promising.  Still having a hard time  
finding good t1, t2 values for the simple tests I am running  
clustering Wikipedia data, so that is clouding things.  It seems no  
matter what I pick, I get one vector per canopy.  Obviously, something  
is wrong, but I don't know what. Sigh.  Of course, it could be the  
fact the docs I'm clustering aren't related, I guess.  I'm only doing  
the first 1000 from a dump.  I'll try a bigger version now.

All the tests pass with the changes, though, and I had the same  
problem before.

>
> The triangle inequality trick should help by a factor of two or more  
> as
> well.



Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
Grant,

This optimization should have made a large difference.  Did it?

The triangle inequality trick should help by a factor of two or more as
well.

On Wed, Jun 24, 2009 at 3:05 PM, Grant Ingersoll (JIRA) <ji...@apache.org>wrote:

> Updated to trunk and incorporates Shashi's distance work.

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 25, 2009, at 4:10 PM, Grant Ingersoll wrote:

>
> On Jun 25, 2009, at 3:41 PM, Ted Dunning wrote:
>
>> That is one option.
>>
>> Another is to have a separate distance measure that doesn't have  
>> the square
>> root.
>>
>
> Yeah, but that implies that we have one interface for all the other  
> DistanceMeasures and one for Euclidean, or, I suppose we could have  
> a PseudoEuclideanDM class that doesn't do the square root from which  
> the Euclidean merely wraps and gives back the sqrt.  Actually, I  
> like that solution, except the name part of it...
>


I'm going to go with the base class approach.   Seems like that should  
work.  Committing shortly.
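
Roughly this shape (a sketch of the idea only, not necessarily the exact code that
gets committed; it assumes the Vector.getDistanceSquared() call from Shashi's patch
and omits the configuration methods of DistanceMeasure):

    // Base class: squared Euclidean distance, no sqrt, fine for comparisons.
    public class SquaredEuclideanDistanceMeasure implements DistanceMeasure {
      public double distance(Vector v1, Vector v2) {
        return v2.getDistanceSquared(v1);
      }
      // other DistanceMeasure methods omitted
    }

    // Subclass: true Euclidean distance, just wraps the squared value in a sqrt.
    public class EuclideanDistanceMeasure extends SquaredEuclideanDistanceMeasure {
      public double distance(Vector v1, Vector v2) {
        return Math.sqrt(super.distance(v1, v2));
      }
    }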

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1

Ted Dunning wrote:
> That is what I meant.
>
> As for name, how about SquaredEuclideanDistance?
>
> On Thu, Jun 25, 2009 at 1:10 PM, Grant Ingersoll <gs...@apache.org>wrote:
>
>   
>> I suppose we could have a PseudoEuclideanDM class that doesn't do the
>> square root from which the Euclidean merely wraps and gives back the sqrt.
>>  Actually, I like that solution, except the name part of it
>>     
>
>   


Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
That is what I meant.

As for name, how about SquaredEuclideanDistance?

On Thu, Jun 25, 2009 at 1:10 PM, Grant Ingersoll <gs...@apache.org>wrote:

> I suppose we could have a PseudoEuclideanDM class that doesn't do the
> square root from which the Euclidean merely wraps and gives back the sqrt.
>  Actually, I like that solution, except the name part of it

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 25, 2009, at 3:41 PM, Ted Dunning wrote:

> That is one option.
>
> Another is to have a separate distance measure that doesn't have the  
> square
> root.
>

Yeah, but that implies that we have one interface for all the other  
DistanceMeasures and one for Euclidean, or, I suppose we could have a  
PseudoEuclideanDM class that doesn't do the square root from which the  
Euclidean merely wraps and gives back the sqrt.  Actually, I like that  
solution, except the name part of it...


> On Thu, Jun 25, 2009 at 12:40 PM, Grant Ingersoll (JIRA) <jira@apache.org 
> >wrote:
>
>>
>>   [
>> https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724228 
>> #action_12724228]
>>
>> Grant Ingersoll commented on MAHOUT-121:
>> ----------------------------------------
>>
>> BTW, any objection to me removing the sqrt calculation from
>> EuclideanDistanceMeasure?
>>
>>


Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
That is one option.

Another is to have a separate distance measure that doesn't have the square
root.

On Thu, Jun 25, 2009 at 12:40 PM, Grant Ingersoll (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724228#action_12724228]
>
> Grant Ingersoll commented on MAHOUT-121:
> ----------------------------------------
>
> BTW, any objection to me removing the sqrt calculation from
> EuclideanDistanceMeasure?
>
>

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
Shashikant,

How does k-means treat your data?

On Tue, Jun 16, 2009 at 10:12 PM, Shashikant Kore (JIRA) <ji...@apache.org>wrote:

> This is voodoo for me. The dataset I am working with has a window of 0.05
> in which the result changes from 0 canopies to 3,000 canopies.
>

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
No.  You didn't miss anything.

The point of EDM is to be strictly correct relative to the definition, to
not confuse people and yet still provide a pointer to SEDM.  The point of
SEDM is for speed.  This is all just as you said.

On Fri, Jun 26, 2009 at 6:29 AM, Shashikant Kore (JIRA) <ji...@apache.org>wrote:

> Didn't get your question. I see, EuclideanDistanceMeasure still returns the
> sqrt value and other calculations are moved to
> SquaredEuclideanDistanceMeasure. If someone is interested in only relative
> measure, using SquaredEuclideanDistanceMeasure is a faster solution.  The
> distance value returned by EuclideanDistanceMeasure still continues to be
> the same. Am I missing something here?
>

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
Not irrelevant at all.  It is a very good point.

It means that the strict triangle optimization cannot be applied with the
no-square-root optimization.

There is a stricter bound that you can use with squared distances (r_12^2 -
3 r_x1^2 as opposed to r_12 - r_x1) but using that spoils the generality of
the k-means algorithm.  We could also push a method into some distances
using an additional interface something like HasTriangleBound and allow the
distance to compute the bound, but this is getting pretty far out in the
weeds.
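
For what it's worth, a pure sketch of that interface (the method shape here is
entirely made up):

    public interface HasTriangleBound {
      // Lower bound on d(x, c2) given d(c1, c2) and d(x, c1), expressed in the
      // same units (plain or squared) that this measure reports.
      double lowerBound(double centroidToCentroid, double pointToCentroid);
    }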

On Fri, Jun 26, 2009 at 6:35 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724520#action_12724520]
>
> Sean Owen commented on MAHOUT-121:
> ----------------------------------
>
> This may be irrelevant -- I haven't thought it through -- but since someone
> mentioned using the triangle inequality to optimize some stuff earlier, I
> wonder if it is a problem that a squared-distance measure no longer
> satisfies this inequality? That is, it is not true that the square of one
> side is less than the sum of the squares of the other two sides.
>
>

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Sean Owen <sr...@gmail.com>.
Oops somehow lost my last sentence:

Meant to say that the unit tests pass. OK to submit?

On Jun 17, 2009 2:58 PM, "Sean Owen (JIRA)" <ji...@apache.org> wrote:

[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issue... ]
Sean Owen updated MAHOUT-121:
-----------------------------

   Attachment: MAHOUT-121.patch

Here's my latest patch, which incorporates a unit test, further refinements,
as well as related changes we discussed on a separate thread. In fact, most
of the diff is due to those other changes. The core of the change involves
SparseVector and OrderedIntDoubleMapping.

Is this enough of a good start to commit, and move forward with? Seems like
we have evidence it gives a significant performance boost. There was a
question of correctness


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724067#action_12724067 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

I'm getting inconsistent results. 

In one case, the current code without distance optimization generated canopies in 7m 23s and the distance-optimized code finished in 3m 13sec. 

I modified the document vectors to include only terms which meet certain criteria (minimum # of docs, max % of docs). Now, the optimized code runs in 1m 23sec and the non-optimized code runs in 1m 45sec.

Grant, have you seen this kind of behaviour?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740382#action_12740382 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Nicolas,

I don't get the getStd() method now.  You moved the functionality from computeCentroid, but now getStd is never called, except in DisplayKMeans, which isn't part of the main execution path.  I also am not following the commenting out of the check to see if the two vectors are the same size, but then again, my eyes are glazing over in need of sleep.



[jira] Issue Comment Edited: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740589#action_12740589 ] 

Sean Owen edited comment on MAHOUT-121 at 8/7/09 8:38 AM:
----------------------------------------------------------

Are we talking about the same patch? Sections like this?

{noformat}
+    double distance = 0.0;
+    Vector clusterCenter = null;
     for (Cluster cluster : clusters) {
-      double distance = measure.distance(point, cluster.getCenter());
-      if (nearestCluster == null || distance < nearestDistance) {
+      clusterCenter = cluster.getCenter();
+      distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
+      if (distance < nearestDistance) {
         nearestCluster = cluster;
         nearestDistance = distance;
       }
{noformat}

The question here is indeed whether to declare distance and clusterCenter inside or outside the loop.

While I completely agree with your statements, they do not seem to pertain to this.

I think we're not talking about the same thing, since you pick up on the point of creating Strings in a loop -- which both my examples did. That is not a difference and not germane to the point I have in mind.

I am further making the point that there is not a tradeoff between readability and speed here -- putting the declaration outside the loop actually makes both worse. I am quite sure this should be reversed.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724075#action_12724075 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

I haven't verified this, but intuitively, it doesn't seem all that unreasonable.  Presumably your vectors are a lot more sparse now.

BTW, I'll add the min #docs, % as options to the Driver.



Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
I just asked Luc M.   Should have an answer shortly.

On Wed, Jun 10, 2009 at 5:30 AM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Jun 9, 2009, at 5:46 PM, Benson Margulies (JIRA) wrote:
>
>
>> -----------------------------------------
>>
>> I could make you a fast sparse vector, but I thought you wanted to wait
>> for MTJ?
>>
>
> I guess the question becomes "when is that going to happen?"  I don't
> follow Commons Math, so does anyone else know?
>

Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 9, 2009, at 5:46 PM, Benson Margulies (JIRA) wrote:

>
> -----------------------------------------
>
> I could make you a fast sparse vector, but I thought you wanted to  
> wait for MTJ?

I guess the question becomes "when is that going to happen?"  I don't  
follow Commons Math, so does anyone else know?

[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717832#action_12717832 ] 

Benson Margulies commented on MAHOUT-121:
-----------------------------------------

I could make you a fast sparse vector, but I thought you wanted to wait for MTJ?



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720469#action_12720469 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

I've created some vectors from Wikipedia and can upload them.  I just started profiling Canopy clustering of 1000 Wikipedia docs, and so far 87% of the time (after the 5% map phase) is spent in getQuick(), 80% of which is in Integer.valueOf().  I can run it longer, but I think it's safe to say it's the bottleneck.

I'll try the patch and I'll upload the file and the params for running.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744992#action_12744992 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Exactly, sounds like this is ultimately the wrong choice given the usage profile. Delete the class and switch to Colt then?



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740694#action_12740694 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

{quote}
Where in the patch do you move an object allocation from inside the loop to outside? I do not see this. 
{quote}

Here:
{code:title=getDistanceSquared() in SparseVector.java - Original |borderStyle=solid}
    double result = 0.0;
    Iterator<Vector.Element> iter = iterateNonZero();
    while (iter.hasNext()) {
      Vector.Element elt = iter.next();
      double centroidValue = v.getQuick(elt.index());
      double delta = elt.get() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue);
    }
    return result;
{code}

{code:title=getDistanceSquared() in SparseVector.java - Patched |borderStyle=solid}
  ... 
    double result = 0.0;
    Iterator<Vector.Element> iter = iterateNonZero();
    Vector.Element elt = null;
    double centroidValue = 0.0;
    double delta = 0.0;
    while (iter.hasNext()) {
      elt = iter.next();
      centroidValue = v.getQuick(elt.index());
      delta = elt.get() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue);
    }
...
{code}

And here:

{code:title=emitPointToNearestCluster() in Cluster.java - Patched |borderStyle=solid}
 ...
    double distance = 0.0;
    Vector clusterCenter = null;
     for (Cluster cluster : clusters) {
      clusterCenter = cluster.getCenter();
      distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
      if (distance < nearestDistance) {
         nearestCluster = cluster;
         nearestDistance = distance;
       }
...
{code}

{code:title=emitPointToNearestCluster() in Cluster.java - From Grant patch |borderStyle=solid}
     Cluster nearestCluster = null;
     double nearestDistance = Double.MAX_VALUE;
     for (Cluster cluster : clusters) {
      Vector clusterCenter = cluster.getCenter();
      double distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
       if (nearestCluster == null || distance < nearestDistance) {
         nearestCluster = cluster;
         nearestDistance = distance;
{code}

And in outputPointWithClusterInfo(), which is actually the same bit of code as this last one.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719118#action_12719118 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Also, seems like we could split out the original two issues that Shashikant brought up, right?



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Rob Eden (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745082#action_12745082 ] 

Rob Eden commented on MAHOUT-121:
---------------------------------

Hi guys. I'm the lead developer for the Trove project. Shashi mentioned the problem you're having with Trove's license. I appreciate the interest and would like to accommodate your usage. I'm going to speak to the original developer about dual-licensing.

My specific question is, would the MPL be an acceptable license for usage or does it have to be APL? (Not implying that I necessarily have a problem with APL... just checking options.)



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719462#action_12719462 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

+1 to Grant's suggestion that we split the two issues.

Sean, doubling the vector size looks aggressive. In a recent experiment, I ran into trouble due to such aggressive expansion.  We could optimize memory usage and SparseVector initialization for input document vectors by exposing some low-level APIs. When a document vector is read, we already know the number of elements in it, so it need not go through the iterative set() method; that avoids the repeated movement of array elements. FastIntDouble can be initialized from two parallel arrays created by reading the sparse vector formatted string.
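
Purely as a sketch (the constructor and field names below are hypothetical; it
assumes OrderedIntDoubleMapping can be built directly from already-sorted parallel
arrays), that could look like:

    // Hypothetical low-level constructor: no per-element set() calls, no resizing.
    public SparseVector(String name, int cardinality, int[] indices, double[] values) {
      super(name);
      this.cardinality = cardinality;
      this.values = new OrderedIntDoubleMapping(indices, values, indices.length);
    }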

I will try out your patch and report performance. 



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720123#action_12720123 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Apologies for posting incorrect results in previous comments.  I had applied Sean's patch to the code which already had the distance calculation optimization.

When I applied this patch to the trunk, the run took 2 hr 20 min to execute. I couldn't complete the run of the unpatched trunk code as there was trouble with my machine, but it hadn't finished after 4 hours.

I will re-run and post the correct results. 




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740646#action_12740646 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

Jira is hardly the place to discuss these (almost moral) issues, as they tend to be a matter of personal choice most of the time. But I feel the urge to reply: some of you have really helped me out here at work with my project.

{quote}
Are we talking about the same patch? sections like this? 
{quote}

Yes, we are. And while we are taking a look at that quoted piece of code, I shall mention that the declaration of distance and clusterCenter outside the loop isn't the only thing being changed in the patch. There's also the elimination of an unnecessary boolean evaluation in the if statement.

{quote}
I think we're not talking about the same thing, since you pick up on the point of creating Strings in a loop - which both my examples did. 
{quote}

We may not be. Perhaps my String example wasn't the happiest of them all - but it certainly IS showing how costly it is in Java to just instantiate and initialize an object. Furthermore, I thought I gave a counterargument to what you were implying with those examples.

{quote}
I am further making the point that there is not a tradeoff between readability and speed here - putting the declaration outside the loop actually makes both worse.
{quote}

I can tell none of my propositions were convincing enough. Maybe someone else's? 
http://weblogs.java.net/blog/ddevore/archive/2006/08/declare_variabl_1.html



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-121:
-----------------------------------

    Attachment: doc-vector-4k

3889 Document vectors to test performance



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721646#action_12721646 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Since I am not hearing objections, and cognizant that people are waiting on this, I am going to commit. If there are issues we can roll back or tweak from there.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744973#action_12744973 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

OrderedIntDoubleMapping, the primitive int-to-double mapping used in SparseVector, is not fast enough. Compared to Trove, it is an order of magnitude slower.

How can we improve the performance? 
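
For illustration, here is a minimal sketch of the kind of open-addressed primitive int-to-double map that Trove provides - hypothetical code, neither Trove's nor Mahout's - showing how amortized O(1) get/set is possible without boxing and without the sorted-array copies that OrderedIntDoubleMapping does:

{code}
// Illustrative open-addressed int->double map with linear probing.
// Hypothetical sketch, not Trove's or Mahout's implementation.
// Keys are assumed non-negative (feature indices), so -1 marks an empty slot.
public final class IntDoubleOpenHashMap {
  private static final int EMPTY = -1;
  private int[] keys;
  private double[] values;
  private int size;

  public IntDoubleOpenHashMap(int expectedSize) {
    int capacity = 8;
    while (capacity < expectedSize * 2) {
      capacity <<= 1;                              // power-of-two capacity, load factor <= 0.5
    }
    keys = new int[capacity];
    values = new double[capacity];
    java.util.Arrays.fill(keys, EMPTY);
  }

  private int slotFor(int key) {
    int mask = keys.length - 1;
    int slot = (key * 0x9E3779B1) & mask;          // cheap multiplicative hash
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;                    // linear probing
    }
    return slot;
  }

  public double get(int key) {                     // amortized O(1), no boxing
    int slot = slotFor(key);
    return keys[slot] == key ? values[slot] : 0.0; // 0.0 for absent (sparse) entries
  }

  public void set(int key, double value) {         // amortized O(1), no array shifting
    int slot = slotFor(key);
    if (keys[slot] == EMPTY) {
      if ((size + 1) * 2 > keys.length) {
        rehash();
        slot = slotFor(key);                       // recompute in the resized table
      }
      keys[slot] = key;
      size++;
    }
    values[slot] = value;
  }

  private void rehash() {
    int[] oldKeys = keys;
    double[] oldValues = values;
    keys = new int[oldKeys.length * 2];
    values = new double[oldKeys.length * 2];
    java.util.Arrays.fill(keys, EMPTY);
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] != EMPTY) {
        int slot = slotFor(oldKeys[i]);
        keys[slot] = oldKeys[i];
        values[slot] = oldValues[i];
      }
    }
  }
}
{code}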



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745118#action_12745118 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Right, we wouldn't be able to bundle it (either as a GPL or an MPL dependency), which is a bit of a downer for Mahout.  What we would likely need to do is provide an abstraction layer that falls back to IntDoubleMapping when Trove is not present on the classpath.

Either that, or we do the work to improve IntDoubleMapping.
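
One possible shape for such a layer is sketched below. The probed class name is only an assumption about Trove's packaging and would need to be verified against the actual jar:

{code}
// Sketch of a classpath probe so a factory can pick a Trove-backed mapping when
// Trove is available and fall back to the bundled implementation otherwise.
public final class TroveSupport {

  /** True when a Trove primitive map class can be loaded from the classpath. */
  public static final boolean AVAILABLE = detect();

  private static boolean detect() {
    try {
      // The class name is an assumption, not verified against a Trove release.
      Class.forName("gnu.trove.TIntDoubleHashMap");
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  private TroveSupport() {
  }
}
{code}

A SparseVector factory could then consult TroveSupport.AVAILABLE and instantiate the Trove-backed mapping reflectively, so the core jar never references Trove classes directly and the dependency stays optional.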



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723732#action_12723732 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Note also that, for the snapshot I uploaded, I was running with MAHOUT-139.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744995#action_12744995 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Trying it out...




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741068#action_12741068 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

bq. The floating point comparision (lengthSquared != -1) is not correct. It should be (lengthSquared < 0), if at all we want to remove the Double object.

Actually, this should be >= 0, since we return lengthSquared in that case (before, we were returning it when it was not null).
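
A minimal sketch of the cached-length logic under discussion (illustrative only, not the actual SparseVector code):

{code}
// Sketch of lazily caching the squared length of a vector, using a negative
// sentinel meaning "not computed yet" instead of a nullable Double.
final class CachedLengthVector {
  private final double[] nonZeroValues;   // values of the non-zero entries
  private double lengthSquared = -1.0;    // sentinel: not computed yet

  CachedLengthVector(double[] nonZeroValues) {
    this.nonZeroValues = nonZeroValues;
  }

  double getLengthSquared() {
    if (lengthSquared >= 0.0) {           // the ">= 0" test described above
      return lengthSquared;
    }
    double sum = 0.0;
    for (double v : nonZeroValues) {
      sum += v * v;
    }
    lengthSquared = sum;
    return sum;
  }
}
{code}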



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740589#action_12740589 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Are we talking about the same patch? sections like this?

+    double distance = 0.0;
+    Vector clusterCenter = null;
     for (Cluster cluster : clusters) {
-      double distance = measure.distance(point, cluster.getCenter());
-      if (nearestCluster == null || distance < nearestDistance) {
+      clusterCenter = cluster.getCenter();
+      distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
+      if (distance < nearestDistance) {
         nearestCluster = cluster;
         nearestDistance = distance;
       }

The question here is indeed whether to declare distance and clusterCenter inside or outside the loop.

While I completely agree with your statements, they do not seem to pertain to this.

I think we're not talking about the same thing, since you pick up on the point of creating Strings in a loop -- which both my examples did. That is not a difference and not germane to the point I have in mind.

I am further making the point that there is not a tradeoff between readability and speed here -- putting the declaration outside the loop actually makes both worse. I am quite sure this should be reversed.
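
For clarity, the reversed form being argued for would look roughly like this (same identifiers as the quoted hunk; sketch only, not a patch):

{code}
for (Cluster cluster : clusters) {
  // declared where they are used; no extra allocation either way, and the
  // references go out of scope as soon as the loop ends
  Vector clusterCenter = cluster.getCenter();
  double distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
  if (distance < nearestDistance) {
    nearestCluster = cluster;
    nearestDistance = distance;
  }
}
{code}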



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719546#action_12719546 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Sean, 

Your patch has definitely improved the performance. For my test data set, it took 8m 34s to generate canopies. The trunk code has been running for 40 minutes and still only 85% of the Mapper is complete. I increased the default size to 1024 and it took 5m 21s.

But this is still higher than Trove's 3m 34s. The array copy on each addition of a new element to a vector seems to be the culprit; think of centroid calculation, where a vector has thousands of feature items. This can only be solved with a hash implementation like Trove, which has amortized O(1) insertion and lookup.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740527#action_12740527 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

On the point of variable declaration -- the location of the declaration has no impact on performance. It is not somehow 'declared' multiple times if within a loop. Pulling it out of the loop could even be very slightly slower, if it means you are forced to assign it a dummy value at initialization which is never used. There's one more benefit when referencing objects: an Object reference that exists, say, in a loop -- the reference goes out of scope when the loop finishes and its referent is immediately GC-able. If it's declared outside the loop, the reference exists until the larger containing block finishes. Unless you set it to null, but, that's more maintenance. It's a waste if the referent is large and not actually used anymore. This has actually bitten me in the past.

So in general, in Java, declare variables as deep and as late as possible. 

Agree with Shashikant that while using double instead of Double is nice, using -1 (really, use -1.0 -- double literals belong with double values) as a signal value probably isn't right here. Just say < 0.0.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740666#action_12740666 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

The only thing I'm missing at this point is the earlier question about the removal of the size comparison; once that's resolved, I am ready to commit.
{quote}
What about the commenting out of:
{code}
if (centroid.size() != v.size()) {
{code}
It seems like if we are going to do this in this distance measure, we should do it in the other distance measure. We could push this out a layer via documentation and require the caller to do the check, if required. Otherwise, do we have some guarantee that the centroid and the vector are the same cardinality in all cases of calls to this method? I think we do explicitly for the Cluster calls, but that isn't the same.
{quote}





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740854#action_12740854 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Nope, you have not avoided an object allocation in either case. You are simply moving around declarations of object references. (Does it help to note that objects are never allocated on the stack as in C++?) Therefore, I think the prior arguments apply. This very slightly hurts performance and readability and should be reversed.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740568#action_12740568 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

Thanks for the feedback, people. There are some things I'd like to point out:

{quote}
??I don't get the getStd() method now. You moved the functionality from computeCentroid, but now getStd is never called.??
{quote}

Please, feel free to contradict me here - that was the whole point: getStd() is NEVER called. I searched through the entire mahout-core project and found no call to it. So why add functionality where it does not belong and, moreover, is not needed? If there's a computeCentroid() method, then let it compute a centroid... not estimate some deviation (which isn't actually used anyway) hundreds of times iteratively. If there's a real need for that particular value, then create a method that calculates it. As there was no such method, I thought it was appropriate to use getStd().

{quote}
??I also am not following the commenting out of the check to see if the two vectors are the same size??
{quote}

Feel free to uncomment those lines and leave them as they once were. I realize that a size check is a needed and healthy thing, but then again, if the purpose of the "optimized" version of distance() is performance gain, perhaps we should consider dropping it (there are a couple of methods in the project that already do that - check getDistanceSquared() in SparseVector).

{quote}
??Declaring variables out of the loop looks like an optimization, but I suppose, compiler will do that automatically. Even if it doesn't, the performance penalty is not so significant to sacrifice the code readability (which is arguable.)??
{quote}

I have had this discussion dozens of times with my colleagues at the university and partners at work. Though I can accept that the root of the issue at hand is kind of a myth (whether declaration inside a loop impacts performance negatively or not), instantiating thousands of unnecessary objects IS costly in Java - not to mention that it piles up garbage. At the end of the day, avoiding this kind of code is, more often than not, a healthy thing to do - even if it means a decrease in readability.

{quote}
??I ran a simple test with million loops over float multiplication & runs in 10ms. Moving the declarations outside the loop didn't change the results.??
{quote}

floats are a primitive type. Want a simple test? Use String, which also happens to be immutable. Something as straightforward as

{{String s = new String("sample-string");}}
{{for (int i = 0; i < 250000; i++){s += "anotherString";}}}

will take several minutes to complete, thanks to the cost of instantiating a new String with a new value on each iteration. Now replace String with StringBuilder and += with append, and voilà: no more per-iteration String instantiations, same result, a few seconds.
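
For the record, the StringBuilder version of that same test would be (a sketch, using the same arbitrary constants as above):

{code}
// One buffer appended to in place, instead of allocating a new, ever-larger
// String on every iteration.
StringBuilder sb = new StringBuilder("sample-string");
for (int i = 0; i < 250000; i++) {
  sb.append("anotherString");
}
String s = sb.toString();   // a single String created at the end
{code}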

{quote}
??The floating point comparision (lengthSquared != -1) is not correct. It should be (lengthSquared < 0), if at all we want to remove the Double object.??
{quote}

Agree. You are completely right.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744991#action_12744991 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------


We need a mix of iteration and map get/set, but the map operations are dominant. Hence I tested with get/set operations. Getting an iterator is, in Colt's case, less than 2x slower.

1. Vector addition: 
Get iterator on one vector.  
Iterate on the indices to get values from that vector.   
For the same indices, get values from current vector.
Set the result in the current vector. 

In the current implementation, "get" is an O(log n) operation, and "set" involves an array copy, which makes building a vector by repeated insertion quadratic (I think).

2. SparseVector getDistanceSquared()
Operational complexity is similar to that of vector addition, except there is no "set" operation.
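
To make the complexity argument concrete, here is a sketch over a sorted parallel-array representation (illustrative only, not the actual Mahout classes): "get" is a binary search, hence O(log n), and getDistanceSquared() iterates one vector's non-zero entries while doing such a "get" on the other.

{code}
// Illustrative sparse representation: sorted index array plus parallel values.
final class SortedSparse {
  final int[] indices;     // sorted, one entry per non-zero element
  final double[] values;

  SortedSparse(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  double get(int index) {                          // O(log n) binary search
    int pos = java.util.Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  // Sum of squared differences over the union of non-zero positions.
  static double distanceSquared(SortedSparse a, SortedSparse b) {
    double result = 0.0;
    for (int i = 0; i < a.indices.length; i++) {        // iterate a's non-zeros
      double diff = a.values[i] - b.get(a.indices[i]);  // O(log n) lookup in b
      result += diff * diff;
    }
    for (int i = 0; i < b.indices.length; i++) {        // entries present only in b
      if (java.util.Arrays.binarySearch(a.indices, b.indices[i]) < 0) {
        result += b.values[i] * b.values[i];
      }
    }
    return result;
  }
}
{code}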





[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-121:
-----------------------------------

    Attachment: MAHOUT-121.patch

Updated to trunk.  Did some profiling and the bottleneck is gone, or at least moved to other places.

Haven't actually evaluated correctness, but...



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720810#action_12720810 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

bq. I see, i did not even know this utils/ directory was added. I'll get it into my client and update it accordingly. 

Just this morning...  Can you tell I've got some time for Mahout lately?  ;-)

bq. To avoid confusion... I suppose let me keep the 'official' version of the patch for now since I need to make some fixes. It is still substantially identical to the last couple that have been posted, so, reviewing those is valid for anyone who cares.

All yours, but just so you know, this is a really hot topic for me and I've got a deadline associated with it, so I can help if need be.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740580#action_12740580 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

And finally to Grant's point about compilers and loop unrolling, and I may be stating things people already know:

- This isn't an example of unrolling, is it?
- javac will never perform optimizations like this, or really of any kind, given that it is bound to output certain bytecode for a certain input program. So I tend to code efficiently from the get-go in Java.
- (But ProGuard can, and I love ProGuard.)
- But the JIT compiler might at runtime. The theory, and I like it, is that this half of the compilation is better done by a process on the platform where the target code will run. For instance, it can inline some method calls even from non-final classes, even though Java method invocation is always polymorphic, if it detects that there don't happen to be any subclasses loaded in the JVM at runtime. That would not be possible at compile time.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720693#action_12720693 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

FYI, Sean, can you generate the patch from within the trunk directory? That will make applying it easier for everyone else.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736460#action_12736460 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Grant,

The latest patch brings back the changes from my first patch. This can go in.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719511#action_12719511 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Agree with you on doubling the vector size; we can do something more conservative, like growing by 20%. Yes, it is also easy to add constructors that take arrays. You could also simply set the size in the constructor; the inserts then don't involve any array shuffling.

If this proves a good win for performance, I will work more on refining the change and expanding it to SparseMatrix.
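A minimal sketch of that growth strategy (illustrative names, not the actual Mahout class), assuming the parallel-array layout discussed here: a sorted int[] of indices plus a double[] of values, sized in the constructor and grown by roughly 20% only when full.

{code:java}
// Illustrative sketch only. Assumes a parallel-array layout with an
// initial capacity passed to the constructor and ~20% growth when full.
public final class IntDoubleMappingSketch {

  private int[] indices;
  private double[] values;
  private int numMappings;

  public IntDoubleMappingSketch(int initialCapacity) {
    indices = new int[initialCapacity];
    values = new double[initialCapacity];
    numMappings = 0;
  }

  // Grow by ~20% (at least one slot) only when the arrays are full.
  private void growIfNeeded() {
    if (numMappings == indices.length) {
      int newCapacity = indices.length + Math.max(1, indices.length / 5);
      int[] newIndices = new int[newCapacity];
      double[] newValues = new double[newCapacity];
      System.arraycopy(indices, 0, newIndices, 0, numMappings);
      System.arraycopy(values, 0, newValues, 0, numMappings);
      indices = newIndices;
      values = newValues;
    }
  }

  // Append, assuming callers supply indices in increasing order; a general
  // set() would binary-search for the slot and shift, as discussed above.
  public void append(int index, double value) {
    growIfNeeded();
    indices[numMappings] = index;
    values[numMappings] = value;
    numMappings++;
  }
}
{code}

If the caller already knows the final size (for example, when copying from arrays passed to a constructor), the grow path is never hit at all.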



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740587#action_12740587 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

{quote}
The point illustrated by the String loop example has nothing to do with how variables declared, and everything to do with the difference between String and StringBuilder. It doesn't seem to address the point previously raised.
{quote}

Not quite right. The difference between String and StringBuilder IS EXACTLY the difference between instantiating thousands of objects and re-using just one, which is, I believe, the matter at hand here.

{quote}
In fact the first is ever so slightly worse since it sets s to null, but the value is unused. But it is worse for another reason: s continues to point to "anotherString: 249999" after the loop terminates, which is also pointless.
{quote}

If you create new Strings in a loop, then you'll have as many objects as iterations pointing to "anotherString: 0", "anotherString: 1", ..., "anotherString: 121410", and so on, waiting to be garbage-collected - which may not even happen in the short term. Even more pointless, following your logic.

{quote}
Hence I would undo that part of the patch unless there is another purpose to it I missed.
{quote}

Perhaps someone could run a profiler with and without the latest patch? I tend to think the gain in execution speed would not be significant, if any at all, as some of you have stated. However, unless code readability is a priority, I see no harm in changing something that can only help performance.

{quote}
This isn't an example of unrolling is it?
{quote}

That's right. It is not. It is about the cost of instantiation vs. reuse of short-lived objects.
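To make the instantiation-vs-reuse point concrete, here is a minimal, self-contained illustration (not code from the patch): the first loop allocates a fresh String on every iteration, the second reuses a single StringBuilder buffer.

{code:java}
// Minimal illustration of instantiation vs. reuse; not Mahout code.
public final class ReuseDemo {
  public static void main(String[] args) {
    int n = 250000;

    // Allocates (and immediately discards) a new String every iteration.
    String s = null;
    for (int i = 0; i < n; i++) {
      s = "anotherString: " + i;
    }

    // Reuses one StringBuilder buffer across all iterations.
    StringBuilder sb = new StringBuilder(32);
    for (int i = 0; i < n; i++) {
      sb.setLength(0);
      sb.append("anotherString: ").append(i);
    }

    System.out.println(s + " / " + sb);
  }
}
{code}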



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719094#action_12719094 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Good point on the overflow, even if it would only happen after the vector had over a billion non-zero elements, taking 12GB+ of memory! Arrays.binarySearch() has this issue, so it's perhaps best to just keep a fixed version of binary search.

Yes, there needs to be a way to iterate over the elements directly and quickly. I'm thinking it best to just return the arrays, if we're going for speed.

Would it be useful to take a shot at rewriting SparseVector to use this?
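For reference, a hedged sketch of an overflow-safe binary search over the sorted index array (assuming the parallel-array layout above; this is not the patch itself). Computing the midpoint with an unsigned shift keeps (low + high) from overflowing into a negative index.

{code:java}
// Sketch only. Overflow-safe binary search over the first 'size' entries of
// a sorted array. Returns the position of 'key', or -(insertionPoint + 1) if
// it is absent, following the java.util.Arrays.binarySearch contract.
final class SearchSketch {
  static int binarySearch(int[] sorted, int size, int key) {
    int low = 0;
    int high = size - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;   // unsigned shift: no int overflow
      int midVal = sorted[mid];
      if (midVal < key) {
        low = mid + 1;
      } else if (midVal > key) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);
  }
}
{code}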



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720798#action_12720798 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

I see, I did not even know this utils/ directory was added. I'll get it into my client and update it accordingly.

I didn't see a failure in that test, but I do see the problem -- it uses a random number generator without a seed. I will add a seed, and see if I can find one that induces a failure.

To avoid confusion... I suppose let me keep the 'official' version of the patch for now, since I need to make some fixes. It is still substantially identical to the last couple that have been posted, so reviewing those is still valid for anyone who cares.
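A minimal sketch of the seeding idea (illustrative only; the real test class and seed value are whatever ends up in the patch): with a fixed seed the randomly generated data is reproducible, so a failing case can be replayed and then kept as a deterministic test.

{code:java}
import java.util.Random;

// Illustrative only: a fixed seed makes the "random" test data reproducible.
final class SeededDataDemo {
  static double[] randomVector(long seed, int size) {
    Random random = new Random(seed);   // deterministic for a given seed
    double[] values = new double[size];
    for (int i = 0; i < size; i++) {
      values[i] = random.nextDouble();
    }
    return values;
  }

  public static void main(String[] args) {
    // The same seed always yields the same vector, so a failure can be replayed.
    double[] a = randomVector(42L, 5);
    double[] b = randomVector(42L, 5);
    System.out.println(java.util.Arrays.equals(a, b));   // prints: true
  }
}
{code}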



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724520#action_12724520 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

This may be irrelevant -- I haven't thought it through -- but since someone mentioned using the triangle inequality to optimize some things earlier, I wonder if it is a problem that a squared-distance measure no longer satisfies this inequality? That is, it is not generally true that the square of one side is at most the sum of the squares of the other two sides.
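A quick illustrative counterexample (not from the issue): for the collinear points 0, 1, and 2, plain Euclidean distance satisfies the triangle inequality but squared distance does not, so any pruning that relies on the inequality has to work with the unsquared distances.

{code:java}
// Squared Euclidean distance violates the triangle inequality. For a=0, b=1, c=2:
//   d(a,c)  = 2 <= d(a,b)  + d(b,c)  = 1 + 1 = 2   (holds)
//   d2(a,c) = 4 >  d2(a,b) + d2(b,c) = 1 + 1 = 2   (fails)
public final class TriangleInequalityDemo {
  public static void main(String[] args) {
    double a = 0.0, b = 1.0, c = 2.0;
    double dAC = Math.abs(a - c), dAB = Math.abs(a - b), dBC = Math.abs(b - c);
    System.out.println("distance: " + (dAC <= dAB + dBC));                    // true
    System.out.println("squared:  " + (dAC * dAC <= dAB * dAB + dBC * dBC));  // false
  }
}
{code}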



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739835#action_12739835 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

I'm terribly sorry. Work has been really busy these past few days, and I didn't have a second to take a look at this. Tomorrow, I promise, I'll post a new patch.



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-121:
-----------------------------

    Attachment: MAHOUT-121.patch

One more go-round. This fixes a bug that was indeed revealed by the vector test in certain randomly generated situations. I found a seed that triggered it, so I was able to make the test deterministic and also to fix it. utils/ is OK now. I will wait a beat to see if anyone has comments -- I'm aware this is holding up other changes, though, so I will try to get this sorted tonight.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740814#action_12740814 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

I committed an in-between version of MAHOUT-121-new-distance-optimization and MAHOUT-121-cluster-distance.patch. The former didn't pass tests because it did not check whether the nearestCluster was null. I did, however, move that null check to the second part of the || clause, as it is the less frequent case and is likely to be short-circuited by the distance check.

Committed revision 802283.
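A hedged sketch of the ordering idea (names are illustrative, not the committed code): with ||, the right-hand operand is evaluated only when the left-hand one is false, so putting the usually decisive distance comparison first means the null check rarely runs.

{code:java}
// Illustrative only -- not the committed KMeans code. The null check on the
// right of '||' is evaluated only when the distance test on the left is false,
// which is the rarer branch once a nearest candidate has been found.
final class NearestClusterSketch {
  static int nearestIndex(double[] distances) {
    Integer nearest = null;               // stands in for "no nearest cluster yet"
    double nearestDistance = Double.MAX_VALUE;
    for (int i = 0; i < distances.length; i++) {
      if (distances[i] < nearestDistance || nearest == null) {
        nearest = i;
        nearestDistance = distances[i];
      }
    }
    return nearest == null ? -1 : nearest;
  }
}
{code}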



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-121:
-----------------------------

    Attachment: Mahout1211.patch

Here is the next patch, which switches SparseVector to use this new class. Thoughts?

Maybe the two classes should just be merged. Then SparseMatrix should be changed too.

I haven't measured the performance change, but I imagine it will be significant.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723724#action_12723724 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Hi Shashi,

Any chance you can bring the optimization (distance calculation stuff) part of this up to date with trunk (the primitive stuff is all in)?

Thanks,
Grant



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723980#action_12723980 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Grant, 

I was trying to verify the patch, but for some reason my Hadoop setup refuses to run, failing with an "Error reading task output" message. I couldn't get it working after a lot of effort, and will try again. If you could post the performance numbers before and after the distance calculation improvement, it would help verify whether they are in line with expectations.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724228#action_12724228 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

BTW, any objection to me removing the sqrt calculation from EuclideanDistanceMeasure?
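For context, a sketch of why dropping the sqrt is tempting (illustrative, not the actual EuclideanDistanceMeasure): sqrt is monotonic, so comparing squared distances picks the same nearest vector; the caveat is that thresholds such as t1/t2, and any triangle-inequality pruning, would then have to be expressed in squared units.

{code:java}
// Illustrative only. Because sqrt is monotonic, comparing squared distances
// selects the same nearest vector as comparing true Euclidean distances.
final class SquaredDistanceSketch {
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;                     // Math.sqrt(sum) would be the true distance
  }

  static int nearest(double[][] centroids, double[] point) {
    int best = -1;
    double bestSquared = Double.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      double d2 = squaredDistance(centroids[i], point);
      if (d2 < bestSquared) {       // same winner as if sqrt were applied
        bestSquared = d2;
        best = i;
      }
    }
    return best;
  }
}
{code}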



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740519#action_12740519 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

bq. Declaring variables out of the loop looks like an optimization, but I suppose, compiler will do that automatically

This appears to be moving it right out of the computation altogether, though, as getStd (we need a better name for that!) isn't even in the computation anymore, AFAICT.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720839#action_12720839 ] 

Jeff Eastman commented on MAHOUT-121:
-------------------------------------

The patch fails to install. More hand-editing?

jeff-eastmans-mac-pro:Mahout jeff$ patch -p0 < MAHOUT-121\(2\).patch 
patching file core/src/test/java/org/apache/mahout/matrix/TestOrderedIntDoubleMapping.java
patching file core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java
Hunk #1 FAILED at 242.
Hunk #2 FAILED at 259.
Hunk #3 FAILED at 274.
Hunk #4 FAILED at 294.
Hunk #5 FAILED at 416.
5 out of 5 hunks FAILED -- saving rejects to file core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java.rej
patching file core/src/test/java/org/apache/mahout/matrix/VectorTest.java
patching file core/src/test/java/org/apache/mahout/matrix/TestDenseVector.java
patching file core/src/main/java/org/apache/mahout/clustering/dirichlet/models/NormalModel.java
patching file core/src/main/java/org/apache/mahout/matrix/VectorView.java
patching file core/src/main/java/org/apache/mahout/clustering/dirichlet/models/AsymmetricSampledNormalModel.java
patching file core/src/main/java/org/apache/mahout/utils/ManhattanDistanceMeasure.java
patching file core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java
can't find file to patch at input line 598
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.java
|===================================================================
|--- trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.java	(revision 780473)
|+++ trunk/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.java	Tue Jun 16 20:58:30 BST 2009
--------------------------
File to patch: ^C




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717812#action_12717812 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

I've been doing some profiling, and we do indeed need a solution to the autoboxing stuff. getQuick() and setQuick() are (ironically, given their names) the big bottlenecks due to constant boxing.
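To illustrate the boxing cost being described (a sketch under assumptions, not the Mahout SparseVector): backing element access with a Map<Integer, Double> boxes the int key and double value on nearly every call, whereas a primitive parallel-array layout does the same work with no per-element object churn.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative comparison only -- not the Mahout SparseVector.
final class BoxingDemo {
  // Boxed: iterating unboxes a Double per element; puts box both key and value.
  static double sumBoxed(Map<Integer, Double> map) {
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : map.entrySet()) {
      sum += e.getValue();
    }
    return sum;
  }

  // Primitive: the same data as a plain array, with no boxing at all.
  static double sumPrimitive(double[] values, int size) {
    double sum = 0.0;
    for (int i = 0; i < size; i++) {
      sum += values[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<Integer, Double> boxed = new HashMap<Integer, Double>();
    double[] values = new double[1000];
    for (int i = 0; i < 1000; i++) {
      boxed.put(i, (double) i);          // boxes both the key and the value
      values[i] = i;
    }
    System.out.println(sumBoxed(boxed) + " == " + sumPrimitive(values, 1000));
  }
}
{code}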



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719117#action_12719117 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

bq. Would it be useful to take a shot at rewriting SparseVector to use this?

You could do that, or an alternate implementation.  Is there any case where one wouldn't want this?  Also, I wouldn't mind a little better name than FastIntDouble.  ;-)



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745097#action_12745097 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Hi Rob,

Thanks!  Here's the ASF's stance on licenses:  http://apache.org/legal/resolved.html.   I think for our use case, MPL would be fine, since we are just using it as a library and not extending/deriving from it, even if ASL would be better.

-Grant



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-121:
-----------------------------

    Attachment: MAHOUT-121.patch

Here's my latest patch, which incorporates a unit test, further refinements, as well as related changes we discussed on a separate thread. In fact, most of the diff is due to those other changes. The core of the change involves SparseVector and OrderedIntDoubleMapping.

Is this a good enough start to commit and move forward with? It seems like we have evidence it gives a significant performance boost. There was a question of correctness



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720473#action_12720473 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

The patch, of course, also needs a unit test for OrderedIntDoubleMapping.
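A minimal sketch of what such a test might cover (the no-arg constructor and the set/get/getNumMappings method names are assumptions about the class's API; adjust to whatever the patch actually exposes):

{code:java}
import junit.framework.TestCase;

// Sketch only: assumes a no-arg constructor plus set(int, double), get(int)
// and getNumMappings(); adapt to the real OrderedIntDoubleMapping API.
public class TestOrderedIntDoubleMappingSketch extends TestCase {

  public void testSetGetAndDefault() {
    OrderedIntDoubleMapping mapping = new OrderedIntDoubleMapping();
    mapping.set(7, 1.5);
    mapping.set(3, 2.5);                               // out-of-order insert
    assertEquals(2, mapping.getNumMappings());
    assertEquals(1.5, mapping.get(7), 1.0e-9);
    assertEquals(2.5, mapping.get(3), 1.0e-9);
    assertEquals(0.0, mapping.get(100), 1.0e-9);       // unset index defaults to 0
  }
}
{code}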



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724077#action_12724077 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Also, how many docs are you clustering on for these tests?  And have you checked out some of the utilities I've been putting in?

Can you share more about your tests so I can reproduce them?  I will put up some Wikipedia Vectors soon.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740420#action_12740420 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Declaring variables outside the loop looks like an optimization, but I suppose the compiler will do that automatically. Even if it doesn't, the performance penalty is not significant enough to sacrifice code readability (which is arguable). I ran a simple test with a million iterations of float multiplication; it runs in about 10ms, and moving the declarations outside the loop didn't change the results.

The floating-point comparison (lengthSquared != -1) is not correct. It should be (lengthSquared < 0), if we want to remove the Double object at all.
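For concreteness, a minimal sketch of the lazily cached squared length being discussed, using a primitive double with a negative sentinel; the field and the nonZeroValues() iterator here are illustrative, not Mahout's actual code:

    private double lengthSquared = -1.0;   // negative means "not computed yet"

    public double getLengthSquared() {
      if (lengthSquared < 0.0) {           // safer for doubles than != -1.0
        double sum = 0.0;
        for (double v : nonZeroValues()) { // assumed iterator over non-zero elements
          sum += v * v;
        }
        lengthSquared = sum;
      }
      return lengthSquared;
    }

This avoids a nullable Double and computes the value at most once per vector, as long as the vector is not mutated afterwards.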




Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Daniel Kulp <dk...@apache.org>.
It depends on the purpose of the Confluence space.   If the space is used at 
all for "official" project-related material, an ICLA must be on file.   
"Official" project-related material includes (but is not limited to):

1) The project web site
2) Anything going into a "release"
3) Any "official" documentation

Documentation, on a wiki or in SVN, is a project-owned resource and MUST have 
the same legal protection as the source code.   Thus, the ICLA must be on file 
for it.

If the space is just used for discussion and similar things, then no, an ICLA 
would not be required.

Dan


On Wed June 17 2009 10:38:41 am Benson Margulies wrote:
> Grant,
>
> That's not the policy as we've had it explained to us at CXF. If it's
> content on www.apache.org, any contributor has to have a CLA.
>
> Dan, do I have this right? Is there some difference between Confluence
> and cwiki?
>
> --benson
>
> On Wed, Jun 17, 2009 at 9:35 AM, Grant Ingersoll<gs...@apache.org> wrote:
> > On Jun 17, 2009, at 9:29 AM, Benson Margulies wrote:
> >> At CXF, permission to modify the confluence wiki is only granted to
> >> people with a CLA on file? Obviously, I have one, but do you need to
> >> grant me karma
> >> here before I can edit?
> >
> > We aren't distributing the wiki as a part of the release, so anyone with
> > an account should be able to modify it.

-- 
Daniel Kulp
dkulp@apache.org
http://www.dankulp.com/blog

Re: Confluence as a Wiki was Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Benson Margulies <bi...@gmail.com>.
As per Dan's email, all is well. Sorry for the faux alarm.

On Wed, Jun 17, 2009 at 10:49 AM, Grant Ingersoll <gs...@apache.org>wrote:

> I'm basing my understanding on reading http://cwiki.apache.org/CWIKI/.  We
> are not using the Wiki as our main site nor are we packaging it with our
> releases and it is clearly labeled as a Wiki.  It is effectively the same as
> any other Moin Moin Wiki in Apache land, AIUI.
>
> On Jun 17, 2009, at 10:38 AM, Benson Margulies wrote:
>
>  Grant,
>>
>> That's not the policy as we've had it explained to us at CXF. If it's
>> content on www.apache.org, any contributor has to have a CLA.
>>
>> Dan, do I have this right? Is there some difference between Confluence
>> and cwiki?
>>
>> --benson
>>
>>
>> On Wed, Jun 17, 2009 at 9:35 AM, Grant Ingersoll<gs...@apache.org>
>> wrote:
>>
>>>
>>> On Jun 17, 2009, at 9:29 AM, Benson Margulies wrote:
>>>
>>>  At CXF, permission to modify the confluence wiki is only granted to
>>>> people
>>>> with a CLA on file? Obviously, I have one, but do you need to grant me
>>>> karma
>>>> here before I can edit?
>>>>
>>>
>>> We aren't distributing the wiki as a part of the release, so anyone with
>>> an
>>> account should be able to modify it.
>>>
>>>
>>>
>
>

Confluence as a Wiki was Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
I'm basing my understanding on reading http://cwiki.apache.org/CWIKI/.  We  
are not using the Wiki as our main site nor are we packaging it with our  
releases and it is clearly labeled as a Wiki.  It is effectively the same  
as any other Moin Moin Wiki in Apache land, AIUI.

On Jun 17, 2009, at 10:38 AM, Benson Margulies wrote:

> Grant,
>
> That's not the policy as we've had it explained to us at CXF. If it's
> content on www.apache.org, any contributor has to have a CLA.
>
> Dan, do I have this right? Is there some difference between Confluence
> and cwiki?
>
> --benson
>
>
> On Wed, Jun 17, 2009 at 9:35 AM, Grant  
> Ingersoll<gs...@apache.org> wrote:
>>
>> On Jun 17, 2009, at 9:29 AM, Benson Margulies wrote:
>>
>>> At CXF, permission to modify the confluence wiki is only granted  
>>> to people
>>> with a CLA on file? Obviously, I have one, but do you need to  
>>> grant me
>>> karma
>>> here before I can edit?
>>
>> We aren't distributing the wiki as a part of the release, so anyone  
>> with an
>> account should be able to modify it.
>>
>>



Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Benson Margulies <bi...@gmail.com>.
Grant,

That's not the policy as we've had it explained to us at CXF. If it's
content on www.apache.org, any contributor has to have a CLA.

Dan, do I have this right? Is there some difference between Confluence
and cwiki?

--benson


On Wed, Jun 17, 2009 at 9:35 AM, Grant Ingersoll<gs...@apache.org> wrote:
>
> On Jun 17, 2009, at 9:29 AM, Benson Margulies wrote:
>
>> At CXF, permission to modify the confluence wiki is only granted to people
>> with a CLA on file? Obviously, I have one, but do you need to grant me
>> karma
>> here before I can edit?
>
> We aren't distributing the wiki as a part of the release, so anyone with an
> account should be able to modify it.
>
>

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 17, 2009, at 9:29 AM, Benson Margulies wrote:

> At CXF, permission to modify the confluence wiki is only granted to  
> people
> with a CLA on file? Obviously, I have one, but do you need to grant  
> me karma
> here before I can edit?

We aren't distributing the wiki as a part of the release, so anyone  
with an account should be able to modify it.


Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Benson Margulies <bi...@gmail.com>.
At CXF, permission to modify the confluence wiki is only granted to people
with a CLA on file? Obviously, I have one, but do you need to grant me karma
here before I can edit?


On Wed, Jun 17, 2009 at 9:22 AM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:
>
>  All I know is what I learned from reading the paper. However, I continue
>> to
>> think, from reading the paper, that you may be trying to make Canopy do
>> something it was not intended to do.
>>
>> As I read the paper, the idea here is to get a rough partitioning that is
>> used to optimize various downstream algorithms, not to tune for a precise
>> partitioning. The number of canopies doesn't need, as I read it, to be
>> particularly close to the number of eventual partitions to be useful.
>>
>> Thus the extended discussion of how to start up and run various other
>> algorithms, (e.g. k-means).
>>
>
> Makes sense.
>
>
>> Now, still, you need to get some useful number of partitions. The paper
>> has
>> a classic toss-off line, 'we used cross-validation,' without any details
>> about exactly what the authors did. Presumably, that means that the author
>> ran many possible values and hand-examined the results. The paper reports
>> no
>> general results about how sensitive the T values are to particular input
>> data sets. A pessimist would fear that, for any new input, you're going to
>> need to go through a lengthy process to find good values for T1 and T2.
>>
>> This leads me to wonder, ignorantly, why this project is so focused on
>> Canopy. The paper describes it as a tool for speeding up various other
>> things. Since you're hadooping all those other things, how much does it
>> help?
>>
>
> I don't think anyone is solely focused on it, but it is something that we
> have available in our arsenal of clustering tools, therefore it warrants
> documentation and understanding of when and how to use.  Personally, it's
> just something I could easily run to work on MAHOUT-121.
>
> At any rate, this kind of write up is exactly the advice that we need to be
> able to give people.  Care to add to
> http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?
>
>
>
>> Anyway, I expect that my ignorance is on comprehensive display here.
>>
>>
> Funny, I feel like my ignorance is the one on display, but that is
> something I got over a long time ago in open source.  Which is why I just
> come out and ask the questions!  One of my goals for Mahout is to make it a
> place where people can come and learn about Machine Learning and get
> practical advice and not be afraid to ask basic questions.  Machine learning
> is so shrouded in mystery it almost seems like a Dark Art.  I'm thankful
> every day on this project that smarter people than me show up and answer
> questions.  So, please, keep 'em coming!
>
> -Grant
>

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:

> All I know is what I learned from reading the paper. However, I  
> continue to
> think, from reading the paper, that you may be trying to make Canopy  
> do
> something it was not intended to do.
>
> As I read the paper, the idea here is to get a rough partitioning  
> that is
> used to optimize various downstream algorithms, not to tune for a  
> precise
> partitioning. The number of canopies doesn't need, as I read it, to be
> particularly close to the number of eventual partitions to be useful.
>
> Thus the extended discussion of how to start up and run various other
> algorithms, (e.g. k-means).

Makes sense.

>
> Now, still, you need to get some useful number of partitions. The  
> paper has
> a classic toss-off line, 'we used cross-validation,' without any  
> details
> about exactly what the authors did. Presumably, that means that the  
> author
> ran many possible values and hand-examined the results. The paper  
> reports no
> general results about how sensitive the T values are to particular  
> input
> data sets. A pessimist would fear that, for any new input, you're  
> going to
> need to go through a lengthy process to find good values for T1 and  
> T2.
>
> This leads me to wonder, ignorantly, why this project is so focused on
> Canopy. The paper describes it as a tool for speeding up various other
> things. Since you're hadooping all those other things, how much does  
> it
> help?

I don't think anyone is solely focused on it, but it is something that  
we have available in our arsenal of clustering tools, therefore it  
warrants documentation and understanding of when and how to use.   
Personally, it's just something I could easily run to work on  
MAHOUT-121.

At any rate, this kind of write up is exactly the advice that we need  
to be able to give people.  Care to add to  
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?


>
> Anyway, I expect that my ignorance is on comprehensive display here.
>

Funny, I feel like my ignorance is the one on display, but that is  
something I got over a long time ago in open source.  Which is why I  
just come out and ask the questions!  One of my goals for Mahout is to  
make it a place where people can come and learn about Machine Learning  
and get practical advice and not be afraid to ask basic questions.   
Machine learning is so shrouded in mystery it almost seems like a Dark  
Art.  I'm thankful every day on this project that smarter people than  
me show up and answer questions.  So, please, keep 'em coming!

-Grant

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
Grant,

Well, it's the other way round. Bandhan.com is a personal project.
It's a search engine for a matrimony (dating) profile aggregator
focused on the Indian market.

I am consulting for a company, and the Mahout work is for them. I am
not sure if I am at liberty to disclose the identity. Please accept my
apologies for that.

But I can say I am working with the kind of data we all work with -
PDF/Word documents, emails, text files, etc. Just that it's at scale.

--shashi

On Wed, Jun 17, 2009 at 8:08 PM, Grant Ingersoll<gs...@apache.org> wrote:
>
> On Jun 17, 2009, at 10:34 AM, Shashikant Kore wrote:
>>
>> OK. Now, let's not focus on my ignorance. I have got my hands dirty
>> with  Machine Learning, Mahout and Hadoop barely few days back.
>>
>> --shashi
>>
>> -- http://www.bandhan.com/
>
> Shashi, based on your link here, I'm curious if you could share what your
> use case is a bit more?  Or is this just a personal thing?  Feel free to say
> you can't, but I think it would be helpful to others.
>

--
http://www.bandhan.com/

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 17, 2009, at 10:34 AM, Shashikant Kore wrote:
> OK. Now, let's not focus on my ignorance. I have got my hands dirty
> with  Machine Learning, Mahout and Hadoop barely few days back.
>
> --shashi
>
> --  
> http://www.bandhan.com/

Shashi, based on your link here, I'm curious if you could share what  
your use case is a bit more?  Or is this just a personal thing?  Feel  
free to say you can't, but I think it would be helpful to others.

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
I have verified the results only by the "laugh test" method.  Many of
the clusters were excellent. There were some false positives, though,
which were farther from the centroid. It might be because I used only
4 iterations; a higher number of iterations will probably give better
results.

Right now, I don't have any visualization tools to make a confident
statement about the quality of the clusters.  I will report back when
I have something concrete.

--shashi

On Thu, Jun 18, 2009 at 12:16 AM, Ted Dunning<te...@gmail.com> wrote:
> Shashi,
>
> What were the results for k-means?
>
> (I have zero experience with canopy, but have generally had mildly useful
> results using k-means clustering.
>
> On Wed, Jun 17, 2009 at 7:34 AM, Shashikant Kore <sh...@gmail.com>wrote:
>
>> I ran Canopy and then K-Means on 50k doc vectors
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
Shashi,

What were the results for k-means?

(I have zero experience with Canopy, but have generally had mildly useful
results using k-means clustering.)

On Wed, Jun 17, 2009 at 7:34 AM, Shashikant Kore <sh...@gmail.com>wrote:

> I ran Canopy and then K-Means on 50k doc vectors
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
On Wed, Jun 17, 2009 at 6:35 PM, Benson Margulies<bi...@gmail.com> wrote:
>
> As I read the paper, the idea here is to get a rough partitioning that is
> used to optimize various downstream algorithms, not to tune for a precise
> partitioning. The number of canopies doesn't need, as I read it, to be
> particularly close to the number of eventual partitions to be useful.
>
> Thus the extended discussion of how to start up and run various other
> algorithms, (e.g. k-means).
>

That's right. But here is my experience.  I ran Canopy and then
K-Means on 50k doc vectors. (That, by the way, is a fraction of the
actual dataset.)  I used the code in the MAHOUT-121 patch, which uses
primitives for sparse vectors.

After some experimentation, with a t2 value of 0.9 I got only 1
cluster. When I changed it to 0.85, it generated 3000+ clusters (or
canopies). With an increasing number of canopies the code starts
crawling, and after some time even 2G of memory is not enough for it.

Canopy is one of the simplest clustering algorithms, and I still had
trouble getting it to work. Maybe it's my data set. I simply didn't
have the patience to find all the values of t1 and t2, which are going
to change anyway when the input changes. So, for now, I have just put
a cap on the number of canopies generated.  Not elegant, but the
results don't seem bad at all.
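Since t1 and t2 are doing all the work here, a rough sketch of the single canopy pass they control may help (illustrative only, not Mahout's exact code; a plain Euclidean distance stands in for whatever DistanceMeasure is in use):

    import java.util.ArrayList;
    import java.util.List;

    class CanopySketch {
      // Plain Euclidean distance; any DistanceMeasure could be substituted.
      static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum);
      }

      // One pass of canopy assignment with thresholds t1 > t2 (sketch only).
      static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        List<double[]> centers = new ArrayList<double[]>();
        for (double[] point : points) {
          boolean stronglyBound = false;
          for (double[] center : centers) {
            double d = distance(center, point);
            if (d < t1) {
              // point joins this canopy (loosely bound); bookkeeping omitted
            }
            if (d < t2) {
              stronglyBound = true;   // point will not seed a new canopy
            }
          }
          if (!stronglyBound) {
            centers.add(point);       // point becomes a new canopy center
          }
        }
        return centers;
      }
    }

With t2 = 0.9 nearly every point falls within t2 of the first center, so a single canopy swallows everything; lower t2 a little and many points escape that radius and seed canopies of their own, which is consistent with the jump from 1 to 3000+ over such a small window.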

OK. Now, let's not focus on my ignorance. I got my hands dirty with
Machine Learning, Mahout and Hadoop barely a few days ago.

--shashi

-- 
http://www.bandhan.com/

Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Benson Margulies <bi...@gmail.com>.
All I know is what I learned from reading the paper. However, I continue to
think, from reading the paper, that you may be trying to make Canopy do
something it was not intended to do.

As I read the paper, the idea here is to get a rough partitioning that is
used to optimize various downstream algorithms, not to tune for a precise
partitioning. The number of canopies doesn't need, as I read it, to be
particularly close to the number of eventual partitions to be useful.

Thus the extended discussion of how to start up and run various other
algorithms, (e.g. k-means).

Now, still, you need to get some useful number of partitions. The paper has
a classic toss-off line, 'we used cross-validation,' without any details
about exactly what the authors did. Presumably, that means that the author
ran many possible values and hand-examined the results. The paper reports no
general results about how sensitive the T values are to particular input
data sets. A pessimist would fear that, for any new input, you're going to
need to go through a lengthy process to find good values for T1 and T2.
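To put that concretely, finding workable T values typically comes down to a brute-force sweep along these lines (buildCanopies and points are hypothetical stand-ins for whatever canopy driver and data are in use):

    for (double t2 = 0.5; t2 < 1.0; t2 += 0.05) {
      double t1 = t2 + 0.3;                       // keep t1 comfortably above t2
      int canopies = buildCanopies(points, t1, t2).size();
      System.out.printf("t1=%.2f t2=%.2f -> %d canopies%n", t1, t2, canopies);
    }

followed by eyeballing which (t1, t2) pair gives a sensible canopy count, which is exactly the hand-examination the paper glosses over.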

This leads me to wonder, ignorantly, why this project is so focused on
Canopy. The paper describes it as a tool for speeding up various other
things. Since you're hadooping all those other things, how much does it
help?

Anyway, I expect that my ignorance is on comprehensive display here.




On Wed, Jun 17, 2009 at 7:16 AM, Grant Ingersoll <gs...@apache.org>wrote:

> Shashikant asked this over on mahout-dev, but I thought I would move it to
> user so that others can benefit from the discussion.
>
>
> On Jun 17, 2009, at 1:12 AM, Shashikant Kore (JIRA) wrote:
>
>>
>> Shashikant Kore commented on MAHOUT-121:
>> ----------------------------------------
>>
>>
>> [OT] Also, was wondering how you came up with the values of t1 and t2 as
>> 1.3 & 1.0. This is  voodoo for me. For the dataset I am working with has a
>> window of 0.05 in which the result changes from 0 canopies to 3,000
>> canopies.
>>
>
> I just picked some numbers based on what you did!  It is voodoo to me too.
>  I have not done much clustering, so I'm learning a lot here.  As for
> MAHOUT-121, I just wanted something to run.

[Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
Shashikant asked this over on mahout-dev, but I thought I would move  
it to user so that others can benefit from the discussion.


On Jun 17, 2009, at 1:12 AM, Shashikant Kore (JIRA) wrote:
>
> Shashikant Kore commented on MAHOUT-121:
> ----------------------------------------
>
>
> [OT] Also, was wondering how you came up with the values of t1 and  
> t2 as 1.3 & 1.0. This is  voodoo for me. For the dataset I am  
> working with has a window of 0.05 in which the result changes from 0  
> canopies to 3,000 canopies.

I just picked some numbers based on what you did!  It is voodoo to me  
too.  I have not done much clustering, so I'm learning a lot here.  As  
for MAHOUT-121, I just wanted something to run.

[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720498#action_12720498 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Grant, 

I am trying out the Wikipedia vectors with the parameters you gave. It is running slowly: only 8% map completion after 18 minutes. Can you please post the performance before and after applying the OrderedIntDoubleMapping patch?

[OT] Also, I was wondering how you came up with the t1 and t2 values of 1.3 & 1.0. This is voodoo for me. The dataset I am working with has a window of only 0.05 in which the result changes from 0 canopies to 3,000 canopies.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724278#action_12724278 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Committed revision 788504.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745015#action_12745015 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Let's open a new issue for this one, as this issue is really long and overloaded at this point. 

Solr uses Colt via a download mechanism, so we could probably use it via the POM, but I'm not sure if we can distribute it.

FWIW, in my profiling, I think find() was the major operation.
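For context, that find() is presumably the lookup of a dimension in the sorted index array of the two-array mapping; a minimal sketch of such a lookup (illustrative, not the actual OrderedIntDoubleMapping code):

    // Binary search over the sorted indices of an (int[] indices, double[] values) pair.
    static int find(int[] indices, int size, int key) {
      int low = 0;
      int high = size - 1;
      while (low <= high) {
        int mid = (low + high) >>> 1;
        if (indices[mid] < key) {
          low = mid + 1;
        } else if (indices[mid] > key) {
          high = mid - 1;
        } else {
          return mid;            // found
        }
      }
      return -(low + 1);         // not found; encodes the insertion point
    }

Even at O(log n) per lookup it can dominate when every element access during a dot product or distance computation goes through it.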



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718043#action_12718043 ] 

Benson Margulies commented on MAHOUT-121:
-----------------------------------------

Two arrays would indeed model all the high-speed C++ sparse vectors I have met. One optimization is to allow insertion without maintaining order, and then to re-sort lazily, for algorithms that need to do many inserts in a row before any read.
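For reference, a minimal sketch of that two-array layout (not Mahout's implementation; all names are illustrative): parallel index/value arrays, constant-time unsorted appends, and a single re-sort before any ordered read.

    import java.util.Arrays;

    class TwoArraySparseVector {
      private int[] indices = new int[8];
      private double[] values = new double[8];
      private int size;
      private boolean sorted = true;

      void append(int index, double value) {   // O(1); ordering is deferred
        if (size == indices.length) {
          indices = Arrays.copyOf(indices, size * 2);
          values = Arrays.copyOf(values, size * 2);
        }
        indices[size] = index;
        values[size] = value;
        size++;
        sorted = false;
      }

      void sortIfNeeded() {                    // call once before ordered traversal
        if (sorted) {
          return;
        }
        // simple insertion sort keeping the two arrays in step; a real
        // implementation might sort an index permutation instead
        for (int i = 1; i < size; i++) {
          int idx = indices[i];
          double val = values[i];
          int j = i - 1;
          while (j >= 0 && indices[j] > idx) {
            indices[j + 1] = indices[j];
            values[j + 1] = values[j];
            j--;
          }
          indices[j + 1] = idx;
          values[j + 1] = val;
        }
        sorted = true;
      }
    }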




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720246#action_12720246 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

I have document vectors created from some internal text.  I create canopies from those vectors.  Centroid calculation is a suspect area. 



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-121:
-----------------------------

    Attachment: MAHOUT-121.patch

Not sure if my very last version of the patch got posted. Here it is. It is relative to the root rather than trunk/ -- it seems my hand editing doesn't work.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720764#action_12720764 ] 

Jeff Eastman commented on MAHOUT-121:
-------------------------------------

A bit premature, perhaps. There is still an unresolved cardinality() reference in /Mahout/utils/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java, and the VectorTest testSparseVectorTimesX fails intermittently. AFAICT, none of the patches include /utils, and that compile fails after the unit tests and packaging have succeeded.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720694#action_12720694 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

I'm getting a lot of failures when applying this patch.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745101#action_12745101 ] 

Benson Margulies commented on MAHOUT-121:
-----------------------------------------

Trove is LGPL. If it were published to Maven, which it currently is not in any reliable way, I think Mahout could have an optional dependency on it. As per the referenced policy, LGPL is precluded for a full dependency.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724548#action_12724548 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

bq. I see, EuclideanDistanceMeasure still returns the sqrt value and other calculations are moved to SquaredEuclideanDistanceMeasure. If someone is interested in only relative measure, using SquaredEuclideanDistanceMeasure is a faster solution. The distance value returned by EuclideanDistanceMeasure still continues to be the same. Am I missing something here?

Nope, you're not missing anything.  I added the SquaredEDM class yesterday per this discussion.
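To make the distinction concrete, a toy illustration with plain arrays (not the actual Mahout classes):

    static double squaredEuclidean(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sum;                                 // the squared measure stops here
    }

    static double euclidean(double[] a, double[] b) {
      return Math.sqrt(squaredEuclidean(a, b));   // the plain measure adds the sqrt
    }

Because the square root is monotonic, if squaredEuclidean(x, c1) < squaredEuclidean(x, c2) then euclidean(x, c1) < euclidean(x, c2), so relative comparisons (canopy thresholds, nearest-centroid assignment) are unchanged; only the threshold values have to be squared to match.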



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719632#action_12719632 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

That's good. I expect that, normally, vector updates are relatively infrequent, so it should not take many such edits to construct a vector in the first place. What code does the work of creating the vectors, so I can add and integrate a faster mechanism there?

We could switch to a hashtable implementation later if that becomes necessary.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737956#action_12737956 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

Um... weird. That's the way I created the patch, too. I'm going to revise it on Monday at work and upload a new one. (Maybe it's just an empty line or a newline that I erased from the original .java and that isn't translating correctly into the patch?)



Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Grant Ingersoll (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740573#action_12740573 ] 
>
> Grant Ingersoll commented on MAHOUT-121:
> ----------------------------------------
>
> bq. Please, feel free to contradict me here - That was the whole point: getStd() is NEVER called.
>
> Ah, I see now.  We were in deed calculating it in computeCentroid lots of times, but it is only ever used for DisplayKMeans.  You are correct.
>
>
> As for loop unrolling, etc. it is usually best to let the compiler take care of that.  As for strings, you should never, ever use String for concatenation.  It is a horrible performance drain.  StringBuilder is definitely the way to go.  Especially be on the look out for String concats in logging statements that aren't guarded by if (log.isDebugEnabled())
>
> I will fix the lengthSquared comparison and commit.
>
> Thanks!
>
I added the getStd() method for use in the display examples, to show a
size metric for the clusters consistent with the Dirichlet models. The
std calculation should, at least, be refactored out of computeCentroid
so it is only done when needed. If calculating the pointSquaredTotal is
the source of unacceptable overhead, then by all means remove it and I
will figure out how to display something reasonable about cluster size.
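
As a minimal sketch of that refactoring (hypothetical class and field names, not the actual Mahout Cluster code), the std could be computed lazily from the running totals only when a caller asks for it:
{code}
// Hypothetical sketch: keep the running totals in computeCentroid(), but defer the
// std calculation to getStd() so it only costs anything when the display code needs it.
public class ClusterSizeSketch {
  private final double[] pointTotal;         // running sum of each component
  private final double[] pointSquaredTotal;  // running sum of each squared component
  private final int numPoints;

  public ClusterSizeSketch(double[] pointTotal, double[] pointSquaredTotal, int numPoints) {
    this.pointTotal = pointTotal;
    this.pointSquaredTotal = pointSquaredTotal;
    this.numPoints = numPoints;
  }

  /** Average per-dimension std, computed on demand: sqrt(E[x^2] - E[x]^2). */
  public double getStd() {
    double variance = 0.0;
    for (int i = 0; i < pointTotal.length; i++) {
      double mean = pointTotal[i] / numPoints;
      variance += pointSquaredTotal[i] / numPoints - mean * mean;
    }
    return Math.sqrt(variance / pointTotal.length);
  }
}
{code}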



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740573#action_12740573 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

bq. Please, feel free to contradict me here - That was the whole point: getStd() is NEVER called.

Ah, I see now.  We were indeed calculating it in computeCentroid lots of times, but it is only ever used for DisplayKMeans.  You are correct.


As for loop unrolling, etc., it is usually best to let the compiler take care of that.  As for strings, you should never, ever build them up by repeated String concatenation.  It is a horrible performance drain.  StringBuilder is definitely the way to go.  Especially be on the lookout for String concats in logging statements that aren't guarded by if (log.isDebugEnabled()).
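
To illustrate both points with a hypothetical snippet (assumed slf4j logger; not code from the patch):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ConcatExample {
  private static final Logger log = LoggerFactory.getLogger(ConcatExample.class);

  static String format(double[] values) {
    // Build the string once with StringBuilder instead of repeated String '+' in the loop.
    StringBuilder sb = new StringBuilder();
    for (double v : values) {
      sb.append(v).append(' ');
    }
    String result = sb.toString();
    // Guard the concatenation so it is never evaluated when debug logging is off.
    if (log.isDebugEnabled()) {
      log.debug("formatted vector: " + result);
    }
    return result;
  }
}
{code}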

I will fix the lengthSquared comparison and commit.

Thanks!



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Stephen Green (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719016#action_12719016 ] 

Stephen Green commented on MAHOUT-121:
--------------------------------------

Don't forget that there is java.util.Arrays.binarySearch (and it looks like they've improved it since the last time that I looked.)

Also, don't forget Josh Bloch's point about binary search.  For sufficiently large arrays, low+high might overflow an int, which will make your binary search unhappy.

Also, also, if you're going to do a lot of finds in a row, you might want to provide an iterator-returning method that returns the positions in the array one by one, since iterating through the array once will be faster than many binary searches.
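
For concreteness, a sketch of a find over a sorted index array that uses Bloch's overflow-safe midpoint (a hypothetical helper, not the patch code):
{code}
// Binary search over the sorted feature-index array of a sparse vector.
static int find(int[] indices, int key) {
  int low = 0;
  int high = indices.length - 1;
  while (low <= high) {
    int mid = (low + high) >>> 1;   // unsigned shift avoids overflow of (low + high)
    int value = indices[mid];
    if (value < key) {
      low = mid + 1;
    } else if (value > key) {
      high = mid - 1;
    } else {
      return mid;                   // position of the feature index
    }
  }
  return -(low + 1);                // not found; same convention as java.util.Arrays.binarySearch
}
{code}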




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720684#action_12720684 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Presumably the tests still pass, so there is some evidence of correctness.  



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712388#action_12712388 ] 

Dawid Weiss commented on MAHOUT-121:
------------------------------------

There is even a hash map implementation in Lucene, as far as I remember (part of the string interning/caching routine recently committed to the trunk). This could be a start. 

I've been pestering Soren Bak to donate his excellent PCJ library to Apache -- if so, we would have the problem of primitive collections of all kinds solved.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736627#action_12736627 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

Hi all - I've just signed up to Jira!

Before committing this patch, could any of you take a look at my latest mail on the Mahout mailing list (it's a bit too extensive to copy/paste in here)? With some enlightenment, I could create a patch from my work and we may be able to make bigger improvements for KMeans.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736036#action_12736036 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Looks like we missed some opportunities for using our new distance capabilities.  Will try to get a patch out soon.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724517#action_12724517 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Grant,

Didn't get your question. I see that EuclideanDistanceMeasure still returns the sqrt value and the other calculations are moved to SquaredEuclideanDistanceMeasure. If someone is interested only in a relative measure, using SquaredEuclideanDistanceMeasure is a faster solution.  The distance value returned by EuclideanDistanceMeasure remains the same. Am I missing something here?
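
As a simplified sketch of that relationship (dense arrays and made-up class names, not the actual Mahout measures):
{code}
// The squared measure does the work; the Euclidean one only adds the sqrt at the end.
class SquaredEuclideanSketch {
  public double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;                 // monotone in the true distance, so fine for relative comparisons
    }
    return sum;
  }
}

class EuclideanSketch extends SquaredEuclideanSketch {
  @Override
  public double distance(double[] a, double[] b) {
    return Math.sqrt(super.distance(a, b));   // same value as before the optimization
  }
}
{code}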






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740681#action_12740681 ] 

Nicolás Fantone commented on MAHOUT-121:
----------------------------------------

Sean, we are definitely not following each other. Probably due to my lack of communication skills.

{quote}
Your point about Strings is undeniable, but is irrelevant to this question. See my counter-example for an example which would be relevant to the point I am trying to make.
{quote}

How could it be irrelevant when it's the exact same point I tried to make in my patch?  There's a Vector being instantiated and allocated in every for iteration, in every task, in every reduce job, in every node of the cluster. And it is not necessary. Every time. The very same thing goes for my String example... except with Strings.

{quote}
Declaring outside the loop only incurs an extra initialization.
{quote}

Extra? There's only ONE initialization. If the declaration is done inside the loop, thousands of initializations are going to be done. That's thousands minus one "extra" initializations.

Grant, maybe you should leave the size comparison for now. It won't impact speed noticeably and, as of now, KMeans is only using the optimized distance calculation for both computing convergence and emitting points. Is there anywhere else a size check is done between input vectors? I believe there isn't.



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-121:
-----------------------------------

    Attachment: Canopy_Wiki_1000-2009-06-24.snapshot

Here's a YourKit snapshot of me clustering 1000 wikipedia docs.  There are a few key bottlenecks:
1. Distance calculation
2. SparseVector.setQuick
3. SparseVector.clone

All three of these are heavily called during both the Map and Reduce phases and are part of the distance calculation loops.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720627#action_12720627 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

After the patch, it was pretty fast, but like I said, I haven't evaluated correctness just yet.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740577#action_12740577 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Still picking on this loops / declaration issue:

Instantiating unneeded objects is indeed, obviously, a waste. But the changes in the last patch (which I think we're discussing?) do not save instantiations -- nor do they really add any. I don't follow this point, then.

The point illustrated by the String loop example has nothing to do with how variables are declared, and everything to do with the difference between String and StringBuilder. It doesn't seem to address the point previously raised.

Here is a difference that doesn't matter:

String s = null;
for (int i = 0; i < 250000; i++){s = "anotherString: " + i;}

for (int i = 0; i < 250000; i++){String s = "anotherString: " + i;}

In fact the first is ever so slightly worse since it sets s to null, but the value is unused. But it is worse for another reason: s continues to point to "anotherString: 249999" after the loop terminates, which is also pointless.

Hence I would undo that part of the patch unless there is another purpose to it I missed.




[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-121:
-----------------------------------

    Attachment: mahout-121.patch

This patch is intended to communicate the optimization idea. Not a final patch. 



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712193#action_12712193 ] 

Ted Dunning commented on MAHOUT-121:
------------------------------------

> Licensing of Trove seems to be an issue which needs to be addressed.

The licensing of Trove is a non-starter.

Typically, however, you can limit the use to a few cases like Map<Integer, Double> and then simply re-implement, or find an Apache-compatible implementation.

There is no magic involved in an open hash map in this kind of case.  Even just using two arrays with binary search could be acceptable.
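
A minimal sketch of the two-array idea (sorted indices plus parallel values; an illustration, not an actual Mahout or PCJ class):
{code}
import java.util.Arrays;

// Sparse vector stored as two parallel primitive arrays: sorted feature indices and their values.
class TwoArraySparseVector {
  private final int[] indices;
  private final double[] values;

  TwoArraySparseVector(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  /** Value at a feature index via binary search; 0.0 when the index is absent. */
  double get(int index) {
    int pos = Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  /** Dot product by merging the two sorted index arrays in a single pass. */
  double dot(TwoArraySparseVector other) {
    double sum = 0.0;
    int i = 0;
    int j = 0;
    while (i < indices.length && j < other.indices.length) {
      if (indices[i] == other.indices[j]) {
        sum += values[i++] * other.values[j++];
      } else if (indices[i] < other.indices[j]) {
        i++;
      } else {
        j++;
      }
    }
    return sum;
  }
}
{code}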



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720470#action_12720470 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Sparse Vector available at http://people.apache.org/~gsingers/mahout/out.txt.gz

Ran as: 
{quote}
org.apache.mahout.clustering.canopy.CanopyClusteringJob /Users/grantingersoll/projects/lucene/solr/wikipedia/part-out.txt /Users/grantingersoll/projects/lucene/solr/wikipedia/output org.apache.mahout.utils.EuclideanDistanceMeasure 1.3 1.0
{quote}
Not sure if the values actually make sense, but that's what I'm trying...



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-121:
-----------------------------------

    Attachment: MAHOUT-121-cluster-distance.patch

Brings back Shashi's distance change for KMeans.



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-121:
-----------------------------------

    Fix Version/s: 0.2



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-121:
-----------------------------------

    Status: Patch Available  (was: Open)

This is an initial patch just to show the idea behind the optimization. Not a final patch.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740654#action_12740654 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

My comments -- which narrowly concern where to declare variables -- are relevant to the patch. It is not a matter of choice or style, or morality or whatever.

Your point about Strings is undeniable, but is irrelevant to this question. See my counter-example for an example which would be relevant to the point I am trying to make.

This blog post corroborates exactly what I am saying. Declaring outside the loop only incurs an extra initialization. It also makes the point that it is not the same thing to *allocate an Object* outside the loop, versus inside. Of course -- it's the difference between allocating one object and many. But that is not what is happening in your patch.

I think, then, all evidence suggests the variable declaration changes should be reverted. I am making no other claims. For instance, removing the unnecessary boolean condition remains a positive change, I am sure.
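
To make the declaration-versus-allocation distinction concrete (a stand-in Buffer type, not the patch code):
{code}
// Where a variable is *declared* is free; where an object is *allocated* is what costs.
class LoopAllocationSketch {
  // Stand-in for a vector-like object; not an actual Mahout class.
  static final class Buffer {
    final double[] data;
    Buffer(int size) { this.data = new double[size]; }
    void clear() { java.util.Arrays.fill(data, 0.0); }
  }

  static double declaredInsideLoop(int n) {
    double last = 0.0;
    for (int i = 0; i < n; i++) {
      double d = i * 0.5;            // declared inside the loop: no allocation, no extra cost
      last = d;
    }
    return last;
  }

  static Buffer allocatedInsideLoop(int n) {
    Buffer b = null;
    for (int i = 0; i < n; i++) {
      b = new Buffer(1000);          // a fresh object every iteration: this is the real cost
      b.data[0] = i;
    }
    return b;
  }

  static Buffer allocatedOnceAndReused(int n) {
    Buffer b = new Buffer(1000);     // one allocation, reused (only safe if nothing keeps a reference)
    for (int i = 0; i < n; i++) {
      b.clear();
      b.data[0] = i;
    }
    return b;
  }
}
{code}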



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739712#action_12739712 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Nicolas, any word on your update?  Otherwise I will just apply my patch.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740576#action_12740576 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

What about the commenting out of:
{code}
if (centroid.size() != v.size()) {
{code}

It seems like if we are going to do this in this distance measure, we should do it in the other distance measure.  We could push this up a layer via documentation and require the caller to do the check.  Otherwise, do we have some guarantee that the centroid and the vector have the same cardinality in all calls to this method?  I think we do explicitly for the Cluster calls, but that isn't the same.
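
For illustration, the two options side by side (simplified dense-array signatures, not the real DistanceMeasure interface):
{code}
class CardinalityCheckSketch {
  // Option 1: the measure validates cardinality itself on every call.
  static double distanceChecked(double[] centroid, double[] v) {
    if (centroid.length != v.length) {
      throw new IllegalArgumentException("cardinality mismatch: " + centroid.length + " vs " + v.length);
    }
    return distanceUnchecked(centroid, v);
  }

  // Option 2: document the equal-cardinality precondition and skip the check in the hot path;
  // the callers (e.g. the clustering drivers) must guarantee it.
  static double distanceUnchecked(double[] centroid, double[] v) {
    double sum = 0.0;
    for (int i = 0; i < centroid.length; i++) {
      double d = centroid[i] - v[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}
{code}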



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Nicolás Fantone (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolás Fantone updated MAHOUT-121:
-----------------------------------

    Attachment: MAHOUT-121-new-distance-optimization.patch

Here it is. You'll notice some lines of code are commented out after applying the patch. Just ignore them.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737875#action_12737875 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

Hi Nicolas,

I'm getting an error applying your patch:
{quote}
 patch -p 0 -i ../patches/MAHOUT-121-distance-optimization.patch --dry-run
patching file core/src/main/java/org/apache/mahout/clustering/kmeans/Cluster.java
patch: **** malformed patch at line 46: @@ -323,6 +327,10 @@
{quote}

I create my patches by going to the top-level directory of mahout and running:
svn diff > ../mypatch.patch


[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-121:
-----------------------------

    Attachment: MAHOUT-121.patch

Ah yes, my fault. I could not get IntelliJ to change the base directory, so I just edited the patch by hand. I think that works?

Yes, this is a diff against HEAD, and the unit tests pass.


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720159#action_12720159 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

While you are at it, what are you running to load-test? That way I can focus on optimizing construction of the vectors in that path and, hopefully, achieve a better result.


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718966#action_12718966 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Here's a start at that last suggestion. Anyone care to see it finished? I can hack it out.

/**
 * Sparse int->double map backed by two parallel arrays, with the indices kept
 * in ascending order so entries can be iterated in index order and looked up
 * by binary search.
 */
public final class FastIntDouble {

  private static final double DEFAULT_VALUE = 0.0;

  private int[] indices;   // sorted ascending; only the first 'size' slots are used
  private double[] values; // values[i] pairs with indices[i]
  private int size;

  public FastIntDouble(int capacity) {
    indices = new int[capacity];
    values = new double[capacity];
    size = 0;
  }

  private void growTo(int newCapacity) {
    if (newCapacity > indices.length) {
      int[] newIndices = new int[newCapacity];
      System.arraycopy(indices, 0, newIndices, 0, size);
      indices = newIndices;
      double[] newValues = new double[newCapacity];
      System.arraycopy(values, 0, newValues, 0, size);
      values = newValues;
    }
  }

  /** Binary search; returns the offset if found, else -(insertionPoint + 1). */
  private int find(int index) {
    int low = 0;
    int high = size - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      int midVal = indices[mid];
      if (midVal < index) {
        low = mid + 1;
      } else if (midVal > index) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);
  }

  public double get(int index) {
    int offset = find(index);
    return offset >= 0 ? values[offset] : DEFAULT_VALUE;
  }

  public void set(int index, double value) {
    int offset = find(index);
    if (offset >= 0) {
      if (value == DEFAULT_VALUE) {
        // Remove the entry by shifting the tail left by one slot.
        System.arraycopy(indices, offset + 1, indices, offset, size - offset - 1);
        System.arraycopy(values, offset + 1, values, offset, size - offset - 1);
        size--;
      } else {
        values[offset] = value;
      }
    } else {
      if (value != DEFAULT_VALUE) {
        if (size >= indices.length) {
          growTo(Math.max(size << 1, 1)); // guard against a zero initial capacity
        }
        int at = -offset - 1; // insertion point recovered from find()
        if (size > at) {
          // Shift the tail right by one slot to make room.
          System.arraycopy(indices, at, indices, at + 1, size - at);
          System.arraycopy(values, at, values, at + 1, size - at);
        }
        indices[at] = index;
        values[at] = value;
        size++;
      }
    }
  }

}
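
For concreteness, a hypothetical harness (not part of any attached patch) exercising the sketch above; note that setting an entry back to the default value removes it, keeping the arrays dense and sorted:

// Hypothetical usage only; assumes the FastIntDouble sketch above.
public final class FastIntDoubleDemo {
  public static void main(String[] args) {
    FastIntDouble v = new FastIntDouble(4);
    v.set(42, 1.5);
    v.set(7, 2.0);
    v.set(42, 0.0);                // setting the default value removes the entry
    System.out.println(v.get(7));  // prints 2.0
    System.out.println(v.get(42)); // prints 0.0 -- back to the default
  }
}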


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740684#action_12740684 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

Where in the patch do you move an object allocation from inside the loop to outside? I do not see it. If I did, you would indeed have a point, which is why we may not be talking about the same thing. All I see is declarations being moved.

Yes, the one extra initialization is trivial. The fact that a reference lives on outside the loop unnecessarily is also tiny, though not quite as tiny. But they are both negatives, and I think the consensus was that it is also very slightly less readable. So we have a couple of small negatives -- how can that add up to support for a change?
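
To make the distinction concrete, a hypothetical illustration (this is not code from any of the patches) of why moving a declaration is not the same as hoisting an allocation:

// Hypothetical illustration only.
final class LoopAllocationExample {

  // One allocation per iteration; moving only the declaration above the loop
  // would not change that, it would merely widen the variable's scope.
  static double declaredInsideLoop(int n, int cardinality) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
      double[] scratch = new double[cardinality];
      scratch[0] = i;
      total += scratch[0];
    }
    return total;
  }

  // One allocation total, but the array stays reachable for the whole method
  // and must be reset by hand between iterations.
  static double allocationHoisted(int n, int cardinality) {
    double total = 0.0;
    double[] scratch = new double[cardinality];
    for (int i = 0; i < n; i++) {
      java.util.Arrays.fill(scratch, 0.0);
      scratch[0] = i;
      total += scratch[0];
    }
    return total;
  }
}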


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724225#action_12724225 ] 

Grant Ingersoll commented on MAHOUT-121:
----------------------------------------

I'm seeing a definite performance gain with this latest patch.  I'm going to commit, and then we can iterate.


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744980#action_12744980 ] 

Sean Owen commented on MAHOUT-121:
----------------------------------

These two implementations use different algorithms. This implementation is the analog of TreeSet in the Java Collections, whereas you are talking about an analog of HashMap.

A hash is likely going to be faster for gets. I imagine it's slower when required to produce the keys/values in order. Which did you benchmark? Let's make sure we are asking the right question first.

Of course, if we find that we in fact need fast gets far more often than fast iteration, then we do want a hash-based implementation instead. Is that the conclusion? (And could we open a new issue to continue that!)

We can't use Trove, of course. The parts of Colt we need appear to be Apache-compatible; that seems like a fine thing to try next.

If it's not suitable, we can easily roll our own primitive-based hash. We already have a long->Object map implementation.
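
As a rough sketch of what rolling our own primitive-based hash could look like (the class name, sizing policy, and hash mixing below are illustrative assumptions, not an existing Mahout class), an open-addressed int->double map with linear probing avoids boxing entirely:

// Illustrative sketch only. Keys and values live in flat primitive arrays;
// Integer.MIN_VALUE is reserved as the empty-slot sentinel, so that one key
// is not supported. A real implementation would also rehash on growth.
public final class OpenIntDoubleMap {

  private static final int EMPTY = Integer.MIN_VALUE;

  private int[] keys;
  private double[] values;
  private int size;

  public OpenIntDoubleMap(int expectedEntries) {
    int capacity = 4;
    while (capacity < expectedEntries * 4) { // keep the load factor low for this sketch
      capacity <<= 1;                        // power of two, so masking works below
    }
    keys = new int[capacity];
    values = new double[capacity];
    java.util.Arrays.fill(keys, EMPTY);
  }

  private int slotFor(int key) {
    int mask = keys.length - 1;
    int slot = (key * 0x9E3779B9) & mask;    // cheap integer mixing, then wrap
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;              // linear probing
    }
    return slot;
  }

  public double get(int key) {
    int slot = slotFor(key);
    return keys[slot] == key ? values[slot] : 0.0;
  }

  public void put(int key, double value) {
    int slot = slotFor(key);
    if (keys[slot] == EMPTY) {
      keys[slot] = key;
      size++;
    }
    values[slot] = value;
  }

  public int size() {
    return size;
  }
}

Gets and puts are O(1) on average, but iteration comes back in hash order, which is exactly the trade-off against the sorted-array implementation discussed above.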


[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-121:
-----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I think we have this one now, resolving.


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718026#action_12718026 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

That's right, Grant. Some simple tests showed that autoboxing can take 20x the CPU and 5x the memory compared to operations on primitives.

For sparse vectors, even Trove leaves some room for improvement. The input document vectors are read-only for the clustering code, in which case a sparse vector could simply be two parallel arrays. The sparse vectors used to generate the centroids, on the other hand, need an expandable representation.
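
A minimal sketch of that two-array idea (illustrative only, not the actual patch): with the index array sorted, the dot product is one merge pass over the non-zeros, and caching each vector's squared norm lets Euclidean distance use the expansion ||x - c||^2 = ||x||^2 - 2(x . c) + ||c||^2, which is essentially the optimization from the LingPipe post:

// Illustrative sketch of a read-only sparse vector as two parallel arrays.
public final class ReadOnlySparseVector {

  private final int[] indices;   // sorted ascending
  private final double[] values; // values[i] pairs with indices[i]
  private final double lengthSquared;

  public ReadOnlySparseVector(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
    double sum = 0.0;
    for (double v : values) {
      sum += v * v;
    }
    this.lengthSquared = sum;    // computed once, reused for every distance
  }

  /** Sparse dot product: a single merge pass over the two sorted index arrays. */
  public double dot(ReadOnlySparseVector other) {
    double sum = 0.0;
    int i = 0;
    int j = 0;
    while (i < indices.length && j < other.indices.length) {
      if (indices[i] == other.indices[j]) {
        sum += values[i] * other.values[j];
        i++;
        j++;
      } else if (indices[i] < other.indices[j]) {
        i++;
      } else {
        j++;
      }
    }
    return sum;
  }

  /** ||x - c||^2 = ||x||^2 - 2(x . c) + ||c||^2, using the cached norms. */
  public double distanceSquared(ReadOnlySparseVector other) {
    return lengthSquared - 2.0 * dot(other) + other.lengthSquared;
  }
}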



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744977#action_12744977 ] 

Shashikant Kore commented on MAHOUT-121:
----------------------------------------

Colt, from CERN, is as good as Trove.  Is its license compatible with the Apache License?

http://acs.lbl.gov/~hoschek/colt/index.html

Colt's license:  http://acs.lbl.gov/~hoschek/colt/license.html


[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-121:
--------------------------------

    Attachment: MAHOUT-121jfe.patch

I still had problems applying the previous patch, so here's another one from trunk with which all tests pass. I'm going to stop work on MAHOUT-65 until this gets stabilized.


[jira] Assigned: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-121:
--------------------------------------

    Assignee: Grant Ingersoll


[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-121:
-----------------------------------

    Attachment: mahout-121.patch

Updated to trunk and incorporates Shashi's distance work.
