You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Christoph Hermann <he...@informatik.uni-freiburg.de> on 2010/01/25 17:26:10 UTC

MeanShift Clustering duplicating vectors in canopies?

Hello,

i'm running some clustering with the Mean Shift and in my final canopy i 
get 5x the same vector.

In the original input list i only had it once and i'm wondering why 
duplicates are allowed within the same canopy?

Attached is a file with the method i'm using to run mean shift as well 
as the ouput (i'm iterating over the getBoundPoints() list of the 
canopy).

I'd be happy if someone could explain this.

regards
Christoph Hermann

-- 
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: hermann@informatik.uni-freiburg.de

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Christoph Hermann <he...@informatik.uni-freiburg.de>.

Am Mittwoch, 27. Januar 2010 schrieben Sie:

Hello,

> > The mean shift canopy should never contain duplicates. If you can
> >  figure  out why this is occurring in your runs I'd be very
> >  interested in fixing it. If you can send me a copy of your dataset
> > I will look into it too.
> 
> I'll have a look at this and i will try to build an example.
> I'll also have a look if this still occurs with the latest trunk.
> 
> Give me a few days, i'll keep you informed.

ok, i created a Junit Test case trying to reproduce this, and i found 
out that i was wrong (stupid typo error).

My vector list dump method contained these three lines:

for (Vector aV : vList) {
  _LOG.debug("Vector: " + v.asFormatString());
}

where v is the the vector i was searching for.
It should be aV of course.

Apologies for wasting your time. I'm really sorry.

regards
Christoph

-- 
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: hermann@informatik.uni-freiburg.de

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Christoph Hermann <he...@informatik.uni-freiburg.de>.

Am Dienstag, 26. Januar 2010 schrieb Jeff Eastman:

Hello,

> The mean shift canopy should never contain duplicates. If you can
>  figure  out why this is occurring in your runs I'd be very
>  interested in fixing it. If you can send me a copy of your dataset I
>  will look into it too.

I'll have a look at this and i will try to build an example.
I'll also have a look if this still occurs with the latest trunk.

Give me a few days, i'll keep you informed.

regards
Christoph

-- 
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: hermann@informatik.uni-freiburg.de

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi again Christoph,

Further debugging of my test code indicates I was using List.contains() 
to test for duplicates and this uses o.equals() and not == so it was 
erroneously claiming the boundPoints had duplicates because the *values* 
of the vectors were the same and not the *identities* thereof. So, 
unless you can provide a dataset which exhibits the problem in the 
reference implementation you are running I cannot help much more.

PS: if your dataset happens to contain duplicate-valued points then it 
would be reasonable to see them all in an output cluster. Perhaps 
assigning names to your vectors - which are encoded in their writable 
state - will help to resolve this for you.

PPS: assigning unique names to the data points will be a pre-requisite 
to either of the optimization strategies I mentioned earlier in this thread.

Jeff

Jeff Eastman wrote:
> Hi Christoph,
>
> The only unit test which exhibits this problem is the one which runs 
> the full MR job (testCanopyEuclideanMRJob()). This is darn hard to 
> debug and is doubly baffling since all the vectors should be read from 
> Writable format into new, distinct instances. If you have a small 
> dataset which exhibits the problem while running the reference 
> implementation it would be very nice if you could share it.
>
> Jeff
>
> Jeff Eastman wrote:
>> I added some test code to detect duplicate boundPoint entries and can 
>> duplicate the issue in a unit test. I will see what is happening and 
>> let you know.
>> Jeff
>

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi Christoph,

The only unit test which exhibits this problem is the one which runs the 
full MR job (testCanopyEuclideanMRJob()). This is darn hard to debug and 
is doubly baffling since all the vectors should be read from Writable 
format into new, distinct instances. If you have a small dataset which 
exhibits the problem while running the reference implementation it would 
be very nice if you could share it.

Jeff

Jeff Eastman wrote:
> I added some test code to detect duplicate boundPoint entries and can 
> duplicate the issue in a unit test. I will see what is happening and 
> let you know.
> Jeff

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

I added some test code to detect duplicate boundPoint entries and can 
duplicate the issue in a unit test. I will see what is happening and let 
you know.
Jeff


Jeff Eastman wrote:
> Christoph Hermann wrote:
>> Am Montag, 25. Januar 2010 schrieben Sie:
>>
>> Hello,
>>
>>  
>>> The Meanshift canopy keeps copies of all the input points it has
>>> accreted. It does this for bookkeeping purposes, so that points can
>>>  be associated with each canopy when it is done, but this clearly
>>>  does not scale and is currently a showstopper for its utility in
>>>  large problems (despite the M/R implementation, a large number of
>>>  points will converge to a smaller number of very large cluster
>>>  descriptions). I've considered two ways to improve this situation:
>>>  1) associate identifiers with each point and just store the ids
>>>  instead of the whole point; 2) write out the accreted/merged
>>>  canopies to a separate log file so that final cluster membership can
>>>  be calculated after the fact. Option 1 would be the easiest to
>>>  implement but would only give an order-constant improvement in
>>>  space. Option 2 would solve the cluster space problem but would
>>>  introduce another post-processing step to track the cluster merges.
>>>     
>>
>> ok, thats good to know, thanks for the explanation.
>>
>> I'm currently making some small experiments with the different 
>> clustering implementations of mahout and to be honest although i 
>> think i  understood how the algorithms work, i have some problems 
>> with the code.
>>
>> I choose MeanShift first, because it allows me to start without 
>> specifying the number of clusters and because i easily found out how 
>> to check which vector belongs to which canopy after running the 
>> algorithm.
>>
>>   
> Yes, MeanShift is unique in that it does that bookkeeping as it goes. 
> It is; however, not scalable and needs another accounting mechanism.
>>> Unlike the other clustering algorithms, which define symmetrical
>>>  regions of n-space for each cluster, Meanshift clusters are
>>>  asymmetric and so points cannot be clustered after the fact using
>>>  just the cluster centers and distance measure.
>>>     
>>
>> Ok. Since a 'Cluster' (using k-means) does not seem to have 
>> references to the vectors it contains, how do i get all the vectors 
>> belonging to one cluster?
>>
>> Do i have to iterate over all points and calculate the nearest 
>> cluster again? That looks very inefficient to me, maybe you can point 
>> me in the right direction.
>>   
> Often it is the parameters of the clusters which are of primary 
> interest, and not so much the actual point assignments. Making another 
> pass over the data to assign points to clusters - when needed - does 
> seem inefficient but is really not that onerous since it is highly 
> parallel.
>> What i need to do is to locate a vector in the list of clusters and 
>> list all the other vectors belonging to the same cluster.
>>   
> Canopy and k-means both embody the notions that a point 'belongs' to 
> only one cluster; the one with the minimal distance measure. Each has 
> an optional, separate post-processing step to assign the points using 
> that metric. Fuzzy k-means and Dirichlet both allow points to 'belong' 
> to more than one cluster, making this determination more 
> probabilistic. The former does it by adding a fraction of each point's 
> value to each cluster center calculation depending upon the distance 
> measure. Dirichlet uses sampling methods to assign each point to a 
> likely cluster (model) during each iteration but usually not to the 
> same cluster in subsequent iterations.
>>> I'm not sure why you are getting duplicate copies of the same point
>>>  in your canopy. Your code looks like it was derived from the
>>> testReferenceImplementation unit test but has some minor differences.
>>> Why, since the code adds all the points to a new set of canopies
>>>  before iterating, are you passing in 'canopies' as an argument? Can
>>>  you say more about your input data set and the T1 & T2 values you
>>>  used? How many iterations occurred? What was your convergence test
>>>  value?
>>>     
>>
>> Actually i took it from "DisplayMeanShift" class.
>> I'm putting in "canopies" (which is currently just an empty list) bc 
>> i wanted to extend the code lateron. For now its useless.
>>
>> My input data is a distribution of downloads of files, so for each 
>> day i have the day, the id of a file, and the number of downloads.
>> I'm selecting for a certain period of time (i.e. 7 days, starting at 
>> day x) the download values and try to cluster similar download patterns.
>> T1 and T2 are varying, i'm still trying to find the best values.
>>
>>  
>>> Finally, our Vector library has improved its asFormatString in a
>>>  number of areas but at the cost of readibility. This makes debugging
>>>  terribly difficult and some sort of debuggable formatter is needed.
>>>     
>>
>> At least it shows me the ids and values of the vector, thats how i 
>> found out the canopy contains duplicates; i'll improve that lateron.
>>   
> The mean shift canopy should never contain duplicates. If you can 
> figure out why this is occurring in your runs I'd be very interested 
> in fixing it. If you can send me a copy of your dataset I will look 
> into it too.
>
> Jeff
>
>

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Christoph Hermann wrote:
> Am Montag, 25. Januar 2010 schrieben Sie:
>
> Hello,
>
>   
>> The Meanshift canopy keeps copies of all the input points it has
>> accreted. It does this for bookkeeping purposes, so that points can
>>  be associated with each canopy when it is done, but this clearly
>>  does not scale and is currently a showstopper for its utility in
>>  large problems (despite the M/R implementation, a large number of
>>  points will converge to a smaller number of very large cluster
>>  descriptions). I've considered two ways to improve this situation:
>>  1) associate identifiers with each point and just store the ids
>>  instead of the whole point; 2) write out the accreted/merged
>>  canopies to a separate log file so that final cluster membership can
>>  be calculated after the fact. Option 1 would be the easiest to
>>  implement but would only give an order-constant improvement in
>>  space. Option 2 would solve the cluster space problem but would
>>  introduce another post-processing step to track the cluster merges.
>>     
>
> ok, thats good to know, thanks for the explanation.
>
> I'm currently making some small experiments with the different 
> clustering implementations of mahout and to be honest although i think i  
> understood how the algorithms work, i have some problems with the code.
>
> I choose MeanShift first, because it allows me to start without 
> specifying the number of clusters and because i easily found out how to 
> check which vector belongs to which canopy after running the algorithm.
>
>   
Yes, MeanShift is unique in that it does that bookkeeping as it goes. It 
is; however, not scalable and needs another accounting mechanism.
>> Unlike the other clustering algorithms, which define symmetrical
>>  regions of n-space for each cluster, Meanshift clusters are
>>  asymmetric and so points cannot be clustered after the fact using
>>  just the cluster centers and distance measure.
>>     
>
> Ok. Since a 'Cluster' (using k-means) does not seem to have references 
> to the vectors it contains, how do i get all the vectors belonging to 
> one cluster?
>
> Do i have to iterate over all points and calculate the nearest cluster 
> again? That looks very inefficient to me, maybe you can point me in the 
> right direction.
>   
Often it is the parameters of the clusters which are of primary 
interest, and not so much the actual point assignments. Making another 
pass over the data to assign points to clusters - when needed - does 
seem inefficient but is really not that onerous since it is highly parallel.
> What i need to do is to locate a vector in the list of clusters and list 
> all the other vectors belonging to the same cluster.
>   
Canopy and k-means both embody the notions that a point 'belongs' to 
only one cluster; the one with the minimal distance measure. Each has an 
optional, separate post-processing step to assign the points using that 
metric. Fuzzy k-means and Dirichlet both allow points to 'belong' to 
more than one cluster, making this determination more probabilistic. The 
former does it by adding a fraction of each point's value to each 
cluster center calculation depending upon the distance measure. 
Dirichlet uses sampling methods to assign each point to a likely cluster 
(model) during each iteration but usually not to the same cluster in 
subsequent iterations.
>> I'm not sure why you are getting duplicate copies of the same point
>>  in your canopy. Your code looks like it was derived from the
>> testReferenceImplementation unit test but has some minor differences.
>> Why, since the code adds all the points to a new set of canopies
>>  before iterating, are you passing in 'canopies' as an argument? Can
>>  you say more about your input data set and the T1 & T2 values you
>>  used? How many iterations occurred? What was your convergence test
>>  value?
>>     
>
> Actually i took it from "DisplayMeanShift" class.
> I'm putting in "canopies" (which is currently just an empty list) bc i 
> wanted to extend the code lateron. For now its useless.
>
> My input data is a distribution of downloads of files, so for each day i 
> have the day, the id of a file, and the number of downloads.
> I'm selecting for a certain period of time (i.e. 7 days, starting at day 
> x) the download values and try to cluster similar download patterns.
> T1 and T2 are varying, i'm still trying to find the best values.
>
>   
>> Finally, our Vector library has improved its asFormatString in a
>>  number of areas but at the cost of readibility. This makes debugging
>>  terribly difficult and some sort of debuggable formatter is needed.
>>     
>
> At least it shows me the ids and values of the vector, thats how i found 
> out the canopy contains duplicates; i'll improve that lateron.
>   
The mean shift canopy should never contain duplicates. If you can figure 
out why this is occurring in your runs I'd be very interested in fixing 
it. If you can send me a copy of your dataset I will look into it too.
 
Jeff

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Christoph Hermann <he...@informatik.uni-freiburg.de>.

Am Montag, 25. Januar 2010 schrieben Sie:

Hello,

> The Meanshift canopy keeps copies of all the input points it has
> accreted. It does this for bookkeeping purposes, so that points can
>  be associated with each canopy when it is done, but this clearly
>  does not scale and is currently a showstopper for its utility in
>  large problems (despite the M/R implementation, a large number of
>  points will converge to a smaller number of very large cluster
>  descriptions). I've considered two ways to improve this situation:
>  1) associate identifiers with each point and just store the ids
>  instead of the whole point; 2) write out the accreted/merged
>  canopies to a separate log file so that final cluster membership can
>  be calculated after the fact. Option 1 would be the easiest to
>  implement but would only give an order-constant improvement in
>  space. Option 2 would solve the cluster space problem but would
>  introduce another post-processing step to track the cluster merges.

ok, thats good to know, thanks for the explanation.

I'm currently making some small experiments with the different 
clustering implementations of mahout and to be honest although i think i  
understood how the algorithms work, i have some problems with the code.

I choose MeanShift first, because it allows me to start without 
specifying the number of clusters and because i easily found out how to 
check which vector belongs to which canopy after running the algorithm.

> Unlike the other clustering algorithms, which define symmetrical
>  regions of n-space for each cluster, Meanshift clusters are
>  asymmetric and so points cannot be clustered after the fact using
>  just the cluster centers and distance measure.

Ok. Since a 'Cluster' (using k-means) does not seem to have references 
to the vectors it contains, how do i get all the vectors belonging to 
one cluster?

Do i have to iterate over all points and calculate the nearest cluster 
again? That looks very inefficient to me, maybe you can point me in the 
right direction.

What i need to do is to locate a vector in the list of clusters and list 
all the other vectors belonging to the same cluster.

> I'm not sure why you are getting duplicate copies of the same point
>  in your canopy. Your code looks like it was derived from the
> testReferenceImplementation unit test but has some minor differences.
> Why, since the code adds all the points to a new set of canopies
>  before iterating, are you passing in 'canopies' as an argument? Can
>  you say more about your input data set and the T1 & T2 values you
>  used? How many iterations occurred? What was your convergence test
>  value?

Actually i took it from "DisplayMeanShift" class.
I'm putting in "canopies" (which is currently just an empty list) bc i 
wanted to extend the code lateron. For now its useless.

My input data is a distribution of downloads of files, so for each day i 
have the day, the id of a file, and the number of downloads.
I'm selecting for a certain period of time (i.e. 7 days, starting at day 
x) the download values and try to cluster similar download patterns.
T1 and T2 are varying, i'm still trying to find the best values.

> Finally, our Vector library has improved its asFormatString in a
>  number of areas but at the cost of readibility. This makes debugging
>  terribly difficult and some sort of debuggable formatter is needed.

At least it shows me the ids and values of the vector, thats how i found 
out the canopy contains duplicates; i'll improve that lateron.

regards
Christoph

-- 
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: hermann@informatik.uni-freiburg.de

Re: MeanShift Clustering duplicating vectors in canopies?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi Christoph,

The Meanshift canopy keeps copies of all the input points it has 
accreted. It does this for bookkeeping purposes, so that points can be 
associated with each canopy when it is done, but this clearly does not 
scale and is currently a showstopper for its utility in large problems 
(despite the M/R implementation, a large number of points will converge 
to a smaller number of very large cluster descriptions). I've considered 
two ways to improve this situation: 1) associate identifiers with each 
point and just store the ids instead of the whole point; 2) write out 
the accreted/merged canopies to a separate log file so that final 
cluster membership can be calculated after the fact. Option 1 would be 
the easiest to implement but would only give an order-constant 
improvement in space. Option 2 would solve the cluster space problem but 
would introduce another post-processing step to track the cluster merges.

Unlike the other clustering algorithms, which define symmetrical regions 
of n-space for each cluster, Meanshift clusters are asymmetric and so 
points cannot be clustered after the fact using just the cluster centers 
and distance measure.

I'm not sure why you are getting duplicate copies of the same point in 
your canopy. Your code looks like it was derived from the 
testReferenceImplementation unit test but has some minor differences. 
Why, since the code adds all the points to a new set of canopies before 
iterating, are you passing in 'canopies' as an argument? Can you say 
more about your input data set and the T1 & T2 values you used? How many 
iterations occurred? What was your convergence test value?

Finally, our Vector library has improved its asFormatString in a number 
of areas but at the cost of readibility. This makes debugging terribly 
difficult and some sort of debuggable formatter is needed.

Jeff

Christoph Hermann wrote:
> Hello,
>
> i'm running some clustering with the Mean Shift and in my final canopy i 
> get 5x the same vector.
>
> In the original input list i only had it once and i'm wondering why 
> duplicates are allowed within the same canopy?
>
> Attached is a file with the method i'm using to run mean shift as well 
> as the ouput (i'm iterating over the getBoundPoints() list of the 
> canopy).
>
> I'd be happy if someone could explain this.
>
> regards
> Christoph Hermann
>
>