Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/06/25 01:20:44 UTC

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

On Jun 24, 2009, at 6:53 PM, Ted Dunning wrote:

> Grant,
>
> This optimization should have made a large difference.  Did it?

Yes.  Still quantifying, but very promising.  Still having a hard time
finding good t1, t2 values for the simple tests I am running
clustering Wikipedia data, so that is clouding things.  It seems no
matter what I pick, I get one vector per canopy.  Obviously, something
is wrong, but I don't know what. Sigh.  Of course, it could be the
fact that the docs I'm clustering aren't related, I guess.  I'm only
doing the first 1000 from a dump.  I'll try a bigger version now.

All the tests pass with the changes, though, and I had the same  
problem before.

>
> The triangle inequality trick should help by a factor of two or more  
> as
> well.



Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 24, 2009, at 7:20 PM, Grant Ingersoll wrote:

>
> On Jun 24, 2009, at 6:53 PM, Ted Dunning wrote:
>
>> Grant,
>>
>> This optimization should have made a large difference.  Did it?
>
> Yes.  Still quantifying, but very promising.  Still having a hard  
> time finding good t1, t2 values for the simple tests I am running  
> clustering Wikipedia data, so that is clouding things.  It seems no  
> matter what I pick, I get one vector per canopy.  Obviously,  
> something is wrong, but I don't know what. Sigh.  Of course, it  
> could be the fact the docs I'm clustering aren't related, I guess.   
> I'm only doing the first 1000 from a dump.  I'll try a bigger  
> version now.

Switching to the CosineDistanceMeasure has improved things.  Sorry
for the thread hijack.
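For reference, cosine distance on sparse term vectors amounts to the following.  This is just a minimal sketch of the computation using dict-based sparse vectors, not Mahout's actual CosineDistanceMeasure implementation:

```python
from math import sqrt

def cosine_distance(a, b):
    """Cosine distance between two sparse vectors (index -> value dicts).

    Returns 1 - cos(theta): 0.0 for identical directions, 1.0 for
    orthogonal vectors.
    """
    # Compute the dot product by iterating only over the smaller
    # vector's nonzero entries -- the payoff of sparsity.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot = sum(v * large.get(i, 0.0) for i, v in small.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Because cosine distance normalizes away document length, it tends to behave better than raw Euclidean distance on term vectors of very different sizes, which may be why it helped here.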

>
> All the tests pass with the changes, though, and I had the same  
> problem before.
>
>>
>> The triangle inequality trick should help by a factor of two or  
>> more as
>> well.
>
>


Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
If the distance being used is a real metric, it satisfies the triangle
inequality (that is basically the definition of a metric).  For a given point
x relative to two centroids c1 and c2, this means that

         r(c1, c2) <= r(x, c1) + r(x, c2)

or

         r(c1, c2) - r(x, c1) <= r(x, c2)

This means that if r(c1, c2) > 2 r(x, c1), you don't have to compute
r(x, c2) at all: the inequality gives r(x, c2) >= r(c1, c2) - r(x, c1) >
r(x, c1), so c2 cannot be the closer centroid.
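The pruning can be sketched as follows.  This is a hypothetical helper, not Mahout code, assuming Euclidean distance and a precomputed table `cc` of inter-centroid distances:

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign(x, centroids, cc):
    """Assign point x to its nearest centroid, skipping candidates
    ruled out by the triangle inequality.

    cc[i][j] holds r(centroids[i], centroids[j]), computed once per
    iteration (cheap: there are few centroids, many points).
    """
    best, best_d = 0, dist(x, centroids[0])
    for j in range(1, len(centroids)):
        # If r(best, j) >= 2 * r(x, best), then
        # r(x, j) >= r(best, j) - r(x, best) >= r(x, best),
        # so centroid j cannot be strictly closer: skip it.
        if cc[best][j] >= 2.0 * best_d:
            continue
        d = dist(x, centroids[j])
        if d < best_d:
            best, best_d = j, d
    return best, best_d
```

The bound only ever skips a distance computation; it never changes which centroid wins, so the assignment is identical to the brute-force loop.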

Moreover, since many points stay with the same centroid from one iteration
to the next, especially in later iterations, you can compute the distance to
the previous centroid first and avoid computing distances to several other
centroids.

A related optimization is to not recompute distances to clusters that
haven't moved.  Typically, this would only apply to the closest cluster,
to save memory, but that is a late-stage decision.

In high-dimensional spaces with completely symmetric distributions of
points, it is pretty easy to see that the triangle optimization will not
help that much.  BUT, one essentially never has completely symmetric
distributions, and with text you have radically non-symmetric distributions
due to the sparsity of the document vectors.  So I am very curious how these
optimizations will work for the text clustering you are doing.

On Thu, Jun 25, 2009 at 1:04 PM, Grant Ingersoll <gs...@apache.org> wrote:

> 3) use triangle inequality to limit number of times we have to compute
>> distance (probably 2x win, but I am curious)
>>
>
> Reference?  I don't believe this is implemented, but I might not be
> understanding.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 24, 2009, at 10:11 PM, Ted Dunning wrote:

> To clarify, the optimizations I know about that are pending are:
>
> 1) better hash table that doesn't box/unbox (clear win)

This is committed already.

Seems like we should also remove the square root calculation for
Euclidean, right?  We don't actually care about exact distances, since
the computation is only used for relative comparison.
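The reasoning: sqrt is monotonically increasing, so comparing squared distances picks the same nearest centroid as comparing true distances.  A minimal sketch (not the Mahout code):

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two dense vectors.

    Skips the sqrt: since sqrt is monotonically increasing, the
    ordering of points by distance is unchanged, so nearest-centroid
    assignment gives identical results at lower cost.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

One caveat worth noting: squared Euclidean distance is not a metric (it violates the triangle inequality), so this shortcut would interact badly with the triangle-inequality pruning discussed in this thread unless the bound is adjusted accordingly.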


>
> 2) better centroid distance computation that uses sparsity of document
> vector to minimize the L2 norm computation time (this is what I am  
> curious
> about)

Yes, this is in the patch and is notably faster in my subjective
testing, which is backed up by Shashi.  I'm going to commit soon.  Just
wanted a final discussion on whether we should remove the sqrt calc
from EuclideanDistance.  Seems like we should, since it's just used for
comparison purposes.


>
> 3) use triangle inequality to limit number of times we have to compute
> distance (probably 2x win, but I am curious)

Reference?  I don't believe this is implemented, but I might not be  
understanding.

Re: [jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
To clarify, the optimizations I know about that are pending are:

1) better hash table that doesn't box/unbox (clear win)

2) better centroid distance computation that uses sparsity of document
vector to minimize the L2 norm computation time (this is what I am curious
about)

3) use triangle inequality to limit number of times we have to compute
distance (probably 2x win, but I am curious)
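The sparsity trick in (2) can be sketched as follows.  This is an illustrative helper, not the patch itself, assuming dict-based sparse document vectors and dense centroids.  It rests on the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2: with ||c||^2 precomputed once per centroid per iteration, the per-document loop touches only the document's nonzero entries instead of every dimension.

```python
from math import sqrt

def sparse_euclidean(x, centroid, centroid_norm_sq):
    """Euclidean distance from a sparse vector x (index -> value dict)
    to a dense centroid, via the norm expansion
        ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    so the loop runs over x's nonzeros only."""
    x_norm_sq = 0.0
    dot = 0.0
    for i, v in x.items():
        x_norm_sq += v * v
        dot += v * centroid[i]
    d2 = x_norm_sq - 2.0 * dot + centroid_norm_sq
    return sqrt(max(d2, 0.0))  # guard against tiny negatives from rounding
```

For a document with k nonzero terms out of d dimensions, this replaces an O(d) loop with an O(k) one, which for text (k << d) is where the large speedup comes from.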

I realized that my question was ambiguous after I asked.

On Wed, Jun 24, 2009 at 4:20 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Still quantifying, but very promising



