Posted to user@mahout.apache.org by nfantone <nf...@gmail.com> on 2009/07/28 16:16:13 UTC

Distance calculation performance issue

Ok, this post is going to be a long one, and so it deserves its own
thread. My apologies beforehand.

Here's what I have tried to ease the distance calculation problem. I
know it's quite nasty, but I wanted to ensure our bottleneck was
there, in the euclidean distance computation.

But first, a little excursus:

/* EXCURSUS */
The "optimized" version for SparseVectors of distance() in
SquaredEuclideanDistanceMeasure.java, currently, does the following:

    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }

    double result = centroidLengthSquare;
    result += v.getDistanceSquared(centroid);
    return centroidLengthSquare + v.getDistanceSquared(centroid);

It expects a vector squared length, which is finally added to what
getDistanceSquared() returns. However, that method computes this
inside a while loop (for the comments, assume two 2D vectors [X0, X1]
and [Y0, Y1]):

      elt = iter.next();
      centroidValue = v.getQuick(elt.index());
      delta = elt.get() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue); //
This is  (X0 - Y0)*(X0 - Y0) - Y0*Y0 + (X1 - Y1)*(X1 - Y1)  - Y1*Y1; I
don't have the slightest idea what to call this value.

I certainly couldn't understand the reason behind that (centroidValue
* centroidValue) subtraction. Being called getDistanceSquared, the
method should return just that... and that is the value one gets by
just ignoring that subtraction, that is:

...
result += (delta * delta); //This is (X0 - Y0)*(X0 - Y0) + (X1 -
Y1)*(X1 - Y1); the ACTUAL squared distance.
...

Moreover, the sum of every (centroidValue * centroidValue) in each
iteration (Y0*Y0 + Y1*Y1, in the examples) corresponds to
centroidLengthSquare, which was previously calculated before calling
distance(centroidLengthSquare, centroid, v) and then added to the
result. So, to sum up: first, a certain length is calculated (the
squared norm of what the signature from distance() calls "centroid");
second, that SAME value gets calculated again in another method and
subtracted from a certain total; last, those values cancel each other
by summing both totals, leaving just the wanted squared distance,
here:

return centroidLengthSquare + v.getDistanceSquared(centroid);

Which could be re-written as:

return centroidLengthSquare + squaredDistanceBetweenCentroidAndV -
centroidLengthSquare;

Maybe this has been done on purpose, though that purpose eludes me for now.
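For what it's worth, here is a small standalone sketch (dense arrays and hypothetical class/method names of my own, not Mahout code) checking numerically that, when every index is visited, the squared-length term and the per-element subtraction do cancel out as described above:

```java
// Hypothetical check: the "centroidLengthSquare + oddIntermediate" route
// should equal the plain squared distance when iterating over ALL indices.
public class CancellationCheck {
  // Plain squared distance: sum of (x_i - y_i)^2 over all components.
  static double squaredDistance(double[] x, double[] y) {
    double result = 0.0;
    for (int i = 0; i < x.length; i++) {
      double delta = x[i] - y[i];
      result += delta * delta;
    }
    return result;
  }

  // What getDistanceSquared() currently accumulates: (delta^2 - y_i^2).
  static double oddIntermediate(double[] x, double[] y) {
    double result = 0.0;
    for (int i = 0; i < x.length; i++) {
      double delta = x[i] - y[i];
      result += delta * delta - y[i] * y[i];
    }
    return result;
  }

  // Squared norm of y, i.e. the "centroidLengthSquare" argument.
  static double lengthSquared(double[] y) {
    double result = 0.0;
    for (double value : y) {
      result += value * value;
    }
    return result;
  }

  public static void main(String[] args) {
    double[] x = {3.0, 4.0};
    double[] y = {1.0, 2.0};
    double direct = squaredDistance(x, y);                        // 4 + 4 = 8
    double viaCancellation = lengthSquared(y) + oddIntermediate(x, y);
    System.out.println(direct + " == " + viaCancellation);        // 8.0 == 8.0
  }
}
```

(With sparse iteration over only one vector's non-zeros the story changes, as discussed further down the thread.)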

/* END EXCURSUS */

And now, the fun part: code disrupting. Here are the changes I applied
(remember: just for testing performance's sake). I left the changed
bits commented.

EuclideanDistanceMeasure.java
Renamed distance(double centroidLengthSquare, Vector centroid, Vector
v) to sparseDistance and eliminated the first param. We don't need the
sqrt to compare real distances when emitting points to clusters (but we
do need it to compute a cluster's convergence), so I took that out as
well (I know the whole point of this class is to sqrt the results
from SquaredEuclideanDistance, but I needed the distance function
that's compromising performance to do a minimal set of calculations).

  @Override
  public double sparseDistance(/*double centroidLengthSquare,*/ Vector
centroid, Vector v) {
    return /*Math.sqrt(*/super.sparseDistance(/*centroidLengthSquare,*/
centroid, v);//*);
  }


SquaredEuclideanDistanceMeasure.java
Rewrote and renamed the implementation of distance(double
centroidLengthSquare, Vector centroid, Vector v). Ignored the size
check (again: just for the benefit of performance). It now just
returns the result of getDistanceSquared().

@Override
  public double sparseDistance(/*double centroidLengthSquare,*/ Vector
centroid, Vector v) {
/*    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }

    double result = centroidLengthSquare;
   result += v.getDistanceSquared(centroid);
*/ return /*centroidLengthSquare +*/ v.getDistanceSquared(centroid);
  }

SparseVector.java
Refactor of getDistanceSquared(). Variables should not be created in a
loop if there's no need to do so. Eliminated the centroidValue^2
calculation in each iteration.

@Override
  public double getDistanceSquared(Vector v) {
    //TODO: Check sizes?

    double result = 0.0;
    Iterator<Vector.Element> iter = iterateNonZero();
    Vector.Element elt;
    double delta;

    while (iter.hasNext()) {
      elt = iter.next();
      delta = elt.get() - v.getQuick(elt.index());
      result += (delta * delta); //- (centroidValue * centroidValue);
    }
    return result;
  }

Cluster.java
Refactor of emitPointToNearestCluster(). null comparison eliminated
(not necessary anymore). sparseDistance() is called now, instead.

...
   Cluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    double distance = 0;
    for (Cluster cluster : clusters) {
      distance = measure.sparseDistance(cluster.getCenter(), point);
      if (distance < nearestDistance) {
        nearestCluster = cluster;
        nearestDistance = distance;
      }
    }
...
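Dropping the sqrt is safe for picking the nearest cluster because sqrt is monotonic: the argmin over squared distances is the argmin over real distances. A tiny standalone check of that claim (hypothetical names, not Mahout code):

```java
// Sanity check: comparing squared distances selects the same nearest
// cluster as comparing real (sqrt'ed) distances, since sqrt preserves order.
public class SqrtComparisonCheck {
  // Index of the smallest value in the array.
  static int argmin(double[] values) {
    int best = 0;
    for (int i = 1; i < values.length; i++) {
      if (values[i] < values[best]) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[] squared = {9.0, 4.0, 16.0};
    double[] real = new double[squared.length];
    for (int i = 0; i < squared.length; i++) {
      real[i] = Math.sqrt(squared[i]);
    }
    // Both pick index 1 (distance 2.0 / squared distance 4.0).
    System.out.println(argmin(squared) == argmin(real)); // true
  }
}
```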

Add a Math.sqrt call to computeConvergence(), which now uses sparseDistance().

  public boolean computeConvergence() {
    Vector centroid = computeCentroid();
    converged =
Math.sqrt(measure.sparseDistance(/*centroid.getLengthSquared(),*/
centroid, center)) <= convergenceDelta;
    return converged;
  }


After all those modifications, kMeans now runs noticeably faster.
Running the whole thing locally (in a pseudo-distributed cluster),
every iteration takes about 7 minutes to complete using the
very same dataset I uploaded and posted some days ago, where it used
to take 3 hours. Again, this is not meant to be a final, closed
solution at all. But I think it could very well serve as a first step
towards that.

Re: Distance calculation performance issue

Posted by nfantone <nf...@gmail.com>.
Well, as I suspected, there was a misunderstanding on my end about the
euclidean distance "optimized" calculation, and the changes I suggested
weren't computing it accurately. Thanks, Shashikant, for clearing that
up.

I uploaded my patch to MAHOUT-121, although now it does not differ
much from Grant's.

Re: Distance calculation performance issue

Posted by nfantone <nf...@gmail.com>.
> Agreed, can you put your changes up as a patch on MAHOUT-121?  That way we can do file diffs, etc.

I'm willing to do so. But, then again, I didn't know for sure whether
what I changed was correct or whether I had simply misinterpreted the
distance calculation method. After reading what Shashikant suggested,
I'll try to make a patch.

> --input ../nfantone/user.data --clusters ../nfantone/output/clusters --k 10 --output ../content/nfantone/output/ --convergence 0.01 --overwrite

I see you chose k=10, as opposed to 200. That could explain our differences.

> We took the idea of optimized distance calculation from LingPipe. I
> suggest you read this post, as I won't be able to communicate the
> idea as crisply. After reading it, you will be able to relate it to
> the code.
>
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

I'll read this. I have a feeling I misunderstood the distance process.
I'll comment soon.

> Yes, some of the method names are misnomers (or downright misleading)
> as there is no good term to describe the intermediate values.

Surely we can think of another name for that particular method. A name
may be misleading, but it shouldn't refer to some OTHER thing the
method doesn't actually do.

>> SquaredEuclideanDistanceMeasure.java, currently, does the following:
>>
>>    if (centroid.size() != v.size()) {
>>      throw new CardinalityException();
>>    }
>>
>>    double result = centroidLengthSquare;
>>    result += v.getDistanceSquared(centroid);
>>    return centroidLengthSquare + v.getDistanceSquared(centroid);
>
> Here, we don't call v.getDistanceSquared(centroid) again (which is
> redundant). It simply returns the result calculated in the previous step.

You are completely right. My mistake. While refactoring, it seems I
modified those lines and then left them unchanged when writing my
post. That redundant call is not in the original code. However, I fail
to see the need to declare a double, += something to it, etc.; this
could simply be:

if (centroid.size() != v.size()) {
     throw new CardinalityException();
}
return centroidLengthSquare + v.getDistanceSquared(centroid);


> With the optimization of LingPipe, you incur calculations only for
> the non-zero features of one vector. You don't need to iterate over
> the centroid vector for every distance calculation.
>
> Let me know if I am off the mark here.

You aren't. Again, I think I got the algebra wrong. I'll let you know.

Re: Distance calculation performance issue

Posted by Shashikant Kore <sh...@gmail.com>.
Hi Nicolas,

We took the idea of optimized distance calculation from LingPipe. I
suggest you read this post, as I won't be able to communicate the idea
as crisply. After reading it, you will be able to relate it to the
code.

http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

Yes, some of the method names are misnomers (or downright misleading)
as there is no good term to describe the intermediate values.

A minor correction to the code you quoted.

> SquaredEuclideanDistanceMeasure.java, currently, does the following:
>
>    if (centroid.size() != v.size()) {
>      throw new CardinalityException();
>    }
>
>    double result = centroidLengthSquare;
>    result += v.getDistanceSquared(centroid);
>    return centroidLengthSquare + v.getDistanceSquared(centroid);

Here, we don't call v.getDistanceSquared(centroid) again (which is
redundant). It simply returns the result calculated in the previous step.

Now coming to your implementation.

> SparseVector.java
>
> @Override
>  public double getDistanceSquared(Vector v) {
>    double result = 0.0;
>    Iterator<Vector.Element> iter = iterateNonZero();
>    Vector.Element elt;
>    double delta;
>
>    while (iter.hasNext()) {
>      elt = iter.next();
>      delta = elt.get() - v.getQuick(elt.index());
>      result += (delta * delta); //- (centroidValue * centroidValue);
>    }
>    return result;
>  }

The distance calculation here is not correct. In this code, the unique
features of Vector v are ignored. To put it differently, only the
features of "this" vector are considered in the calculation. For
example, let's say the current vector v1 is [0:1, 3:1, 4:1, 6:1, 8:1]
and the incoming parameter "v" is [0:1, 3:1, 5:1, 7:1, 8:1]. The code
given above ignores the unique features of "v", namely 5 & 7. To get
this right, you have to consider the features of "v" as well. That,
essentially, is SquaredEuclideanDistanceMeasure.distance(Vector v1,
Vector v2).
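To make the example above concrete, here's a small standalone sketch (plain maps with hypothetical class/method names, not Mahout code) showing the one-sided loop undercounting the distance for exactly those two vectors:

```java
import java.util.HashMap;
import java.util.Map;

// Sparse vectors modeled as index -> value maps of the non-zero features.
public class SparseDistanceCheck {
  // Iterates only over the non-zeros of x, as the refactored
  // getDistanceSquared() does; misses features present only in y.
  static double oneSided(Map<Integer, Double> x, Map<Integer, Double> y) {
    double result = 0.0;
    for (Map.Entry<Integer, Double> e : x.entrySet()) {
      double delta = e.getValue() - y.getOrDefault(e.getKey(), 0.0);
      result += delta * delta;
    }
    return result;
  }

  // Full squared distance: also count the features unique to y.
  static double full(Map<Integer, Double> x, Map<Integer, Double> y) {
    double result = oneSided(x, y);
    for (Map.Entry<Integer, Double> e : y.entrySet()) {
      if (!x.containsKey(e.getKey())) {
        result += e.getValue() * e.getValue();
      }
    }
    return result;
  }

  // Convenience: a sparse vector with value 1.0 at the given indices.
  static Map<Integer, Double> vectorOf(int... indices) {
    Map<Integer, Double> m = new HashMap<>();
    for (int i : indices) {
      m.put(i, 1.0);
    }
    return m;
  }

  public static void main(String[] args) {
    Map<Integer, Double> v1 = vectorOf(0, 3, 4, 6, 8);
    Map<Integer, Double> v = vectorOf(0, 3, 5, 7, 8);
    System.out.println(oneSided(v1, v)); // 2.0 -- features 5 and 7 ignored
    System.out.println(full(v1, v));     // 4.0 -- the correct value
  }
}
```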

With the optimization of LingPipe, you incur calculations only for
the non-zero features of one vector. You don't need to iterate over
the centroid vector for every distance calculation.

Let me know if I am off the mark here.
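The LingPipe-style computation can be sketched like this (again plain maps and hypothetical names, not the actual Mahout API): precompute ||centroid||^2 once per centroid, then for each point visit only the point's own non-zeros.

```java
import java.util.Map;

// ||point - centroid||^2 = ||centroid||^2
//   + sum over non-zeros of point of [(p_i - c_i)^2 - c_i^2]
public class LingPipeStyleDistance {
  // Squared norm; computed once per centroid, reused for every point.
  static double lengthSquared(Map<Integer, Double> v) {
    double result = 0.0;
    for (double value : v.values()) {
      result += value * value;
    }
    return result;
  }

  // Only the point's non-zero features are visited; the centroid is
  // never iterated during distance calculation.
  static double squaredDistance(double centroidLengthSquared,
                                Map<Integer, Double> centroid,
                                Map<Integer, Double> point) {
    double result = centroidLengthSquared;
    for (Map.Entry<Integer, Double> e : point.entrySet()) {
      double c = centroid.getOrDefault(e.getKey(), 0.0);
      double delta = e.getValue() - c;
      result += delta * delta - c * c;
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, Double> centroid = Map.of(0, 1.0, 2, 2.0);
    Map<Integer, Double> point = Map.of(0, 3.0, 1, 1.0);
    double c2 = lengthSquared(centroid); // 1 + 4 = 5
    // Full distance: (3-1)^2 + (1-0)^2 + (0-2)^2 = 9
    System.out.println(squaredDistance(c2, centroid, point)); // 9.0
  }
}
```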

--shashi

On Tue, Jul 28, 2009 at 7:46 PM, nfantone<nf...@gmail.com> wrote:
> [original post quoted in full; snipped]

-- 
http://www.bandhan.com/

Re: Distance calculation performance issue

Posted by Ted Dunning <te...@gmail.com>.
Nice work.

This should make big improvements possible.

On Tue, Jul 28, 2009 at 7:16 AM, nfantone <nf...@gmail.com> wrote:

> After all those modifications, kMeans now runs noticeably faster.
> Running the whole thing locally -in a pseudo-distributed cluster-,
> every iteration is taking me about ~7 minutes to complete using the
> very same dataset I uploaded and posted some days ago, which used to
> last 3 hours. Again, this is not meant to be a final, closed solution
> at all. But, I think it could very well serve as a first step towards
> that.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Distance calculation performance issue

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 29, 2009, at 9:07 AM, nfantone wrote:

> Grant, I took a look at your patch. It seems as though you did
> something similar to what I did. However, I believe that there's still
> room for improvement as there are things being calculated
> unnecessarily for no apparent reason. Could you please read my
> previous post? At least the "excursus" bit. I may be totally wrong,
> though: some particular parts were a bit obscure to me. Perhaps you
> (or Shashikant) can throw some light in there? We might be able to
> release a bigger/better patch.

Agreed, can you put your changes up as a patch on MAHOUT-121?  That  
way we can do file diffs, etc.

>
>>>  I think your data set ran, for 10 iterations, in just over 2  
>>> minutes
>>> and that was with the profiler hooked up, too.
>
> Um... I also did that and, while it was considerably faster than
> before, it took about 2 hrs to complete (it used to take days, mind
> you), using a 4-node hadoop cluster. The actual vector clustering
> only, that is the final step, took just over an hour:
>
> Started at: Tue Jul 28 17:44:20 ART 2009
> Finished at: Tue Jul 28 18:46:24 ART 2009
> Finished in: 1hrs, 2mins, 4sec
>
> How exactly did you launch the job? What convergence delta did you
> choose? How many clusters did you set up initially?



--input ../nfantone/user.data --clusters ../nfantone/output/clusters -- 
k 10 --output ../content/nfantone/output/ --convergence 0.01 --overwrite

So, it wasn't exactly what you were running.  I will try to run yours
at some point.

-Grant

Re: Distance calculation performance issue

Posted by nfantone <nf...@gmail.com>.
Grant, I took a look at your patch. It seems as though you did
something similar to what I did. However, I believe that there's still
room for improvement as there are things being calculated
unnecessarily for no apparent reason. Could you please read my
previous post? At least the "excursus" bit. I may be totally wrong,
though: some particular parts were a bit obscure to me. Perhaps you
(or Shashikant) can throw some light in there? We might be able to
release a bigger/better patch.

>>  I think your data set ran, for 10 iterations, in just over 2 minutes
>> and that was with the profiler hooked up, too.

Um... I also did that and, while it was considerably faster than
before, it took about 2 hrs to complete (it used to take days, mind
you), using a 4-node hadoop cluster. The actual vector clustering
only, that is the final step, took just over an hour:

Started at: Tue Jul 28 17:44:20 ART 2009
Finished at: Tue Jul 28 18:46:24 ART 2009
Finished in: 1hrs, 2mins, 4sec

How exactly did you launch the job? What convergence delta did you
choose? How many clusters did you set up initially?

Re: Distance calculation performance issue

Posted by nfantone <nf...@gmail.com>.
I'll check things tomorrow at work, make the patch, and let you know.
Thanks, Grant.

On Tue, Jul 28, 2009 at 5:29 PM, Grant Ingersoll<gs...@apache.org> wrote:
> Funny, I did this same thing on the plane by revisiting Shashi's patch on
> MAHOUT-121 and properly applying it to K-Means (I missed one critical line
> that uses the centroid based distance method instead of the (Vector, Vector)
> one).  I think your data set ran, for 10 iterations, in just over 2 minutes
> and that was with the profiler hooked up, too.
>
> I will post a patch on MAHOUT-121, can you take a look?  Also, if you can
> post your changes as a patch, then we can likely compare/merge, etc. and
> come up with the best way of doing all of this.
>
> Thanks for the insight/analysis.
>
> -Grant
>
>
> On Jul 28, 2009, at 10:16 AM, nfantone wrote:
>
>> [original post quoted in full; snipped]

Re: Distance calculation performance issue

Posted by Grant Ingersoll <gs...@apache.org>.
Funny, I did this same thing on the plane by revisiting Shashi's patch  
on MAHOUT-121 and properly applying it to K-Means (I missed one  
critical line that uses the centroid based distance method instead of  
the (Vector, Vector) one).  I think your data set ran, for 10  
iterations, in just over 2 minutes and that was with the profiler  
hooked up, too.

I will post a patch on MAHOUT-121, can you take a look?  Also, if you  
can post your changes as a patch, then we can likely compare/merge,  
etc. and come up with the best way of doing all of this.

Thanks for the insight/analysis.

-Grant


On Jul 28, 2009, at 10:16 AM, nfantone wrote:

> [original post quoted in full; snipped]