Posted to dev@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2009/05/31 19:48:27 UTC

The 'k' in k-means

As I read the K-means algorithm and the Canopy paper, I'd expect a
Canopy+KMeans job to still accept an input parameter of 'k' even if Canopy
is on the front. To quote McCallum,

One can also use the canopies idea to speed up prototype-based
clustering methods like K-means and Expectation-Maximization (EM).
In general, neither K-means nor EM specify how many clusters to
use. The canopies technique does not help this choice.

If this is an issue with anything, it is an issue with the syntheticcontrol
sample. The sample has me stumped: I don't see where the code turns the
canopies into a set of prototypes satisfying the requirement that each point
is in only one prototype.

Re: The 'k' in k-means

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Benson,

As I read the papers, the K in K-means is the single input parameter
(other than the DistanceMeasure) that must be specified. For an arbitrary K,
one can initialize the clusters by sampling K random points from the
input dataset. Each iteration then assigns every point to its closest
cluster and adjusts each cluster centroid for the next iteration. For
KMeans, it would certainly be possible to have a version in which the
initial clusters were constructed from K random samples of the input
data set.
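
To make that concrete, here is a minimal plain-Java sketch of that flow
(not Mahout's actual KMeansDriver): sample K random input points as the
initial centroids, then repeatedly assign every point to its closest
centroid and recompute each centroid as the mean of its assigned points.
The class and method names are made up for illustration, and the distance
measure is hard-coded to Euclidean.

import java.util.Random;

// Minimal K-means sketch: random initialization from the input points,
// then a fixed number of assign/recompute iterations. Illustrative only.
public class SimpleKMeans {

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static double[][] cluster(double[][] points, int k, int iterations, long seed) {
    Random random = new Random(seed);
    int dim = points[0].length;

    // Initialize the K centroids by sampling K random input points
    // (with replacement, so duplicates are possible in this sketch).
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) {
      centroids[c] = points[random.nextInt(points.length)].clone();
    }

    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];

      // Assignment step: each point goes to its closest centroid.
      for (double[] p : points) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = euclidean(p, centroids[c]);
          if (d < bestDist) {
            bestDist = d;
            best = c;
          }
        }
        counts[best]++;
        for (int j = 0; j < dim; j++) {
          sums[best][j] += p[j];
        }
      }

      // Update step: each centroid becomes the mean of its points;
      // a cluster that attracted no points keeps its old centroid.
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int j = 0; j < dim; j++) {
            centroids[c][j] = sums[c][j] / counts[c];
          }
        }
      }
    }
    return centroids;
  }
}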

In the SyntheticControl example, I see that somebody has commented out 
the OutputDriver.runJob() line which mapped each point to its closest 
cluster in a manner similar to the second pass in Canopy.
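
For what it's worth, the essence of that commented-out output step is just
a final classification pass: every point is assigned to exactly one
cluster, the one with the closest centroid, which is what makes the
resulting prototypes disjoint. A rough sketch (plain Java, not the actual
OutputDriver code):

// Final "output" pass: label each point with the index of its single
// closest centroid. Illustrative only; not the actual OutputDriver.
public class ClosestClusterPass {

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static int[] assign(double[][] points, double[][] centroids) {
    int[] labels = new int[points.length];
    for (int p = 0; p < points.length; p++) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = euclidean(points[p], centroids[c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      labels[p] = best;
    }
    return labels;
  }
}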

In a Canopy+KMeans job, the number of clusters is specified indirectly:
the Canopy step determines it from the given T1 and T2 values. The
KMeansDriver then iterates over the resulting clusters until the
convergence value is reached. It is possible in this process for one or
more of the initial clusters to become empty, but the current algorithm
does not attempt to fix this.
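
A rough sketch of that outer loop, just to pin down the terms (the Step
interface and the names here are hypothetical stand-ins, not the
KMeansDriver API): start from the canopy-derived centers, apply one
K-means pass per iteration, and stop once no centroid moves by more than
the convergence value.

// Illustrative driver loop: iterate K-means steps from the canopy-derived
// centers until no centroid moves more than convergenceDelta. The Step
// interface is a hypothetical stand-in for one assign-and-recompute pass
// (e.g. the loop body of the earlier K-means sketch).
public class ConvergenceLoop {

  interface Step {
    double[][] apply(double[][] centers);
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  static double[][] iterate(double[][] canopyCenters, Step kMeansStep,
                            double convergenceDelta, int maxIterations) {
    double[][] centers = canopyCenters;
    for (int i = 0; i < maxIterations; i++) {
      double[][] next = kMeansStep.apply(centers);

      // Converged when the largest centroid movement is below the threshold.
      double maxShift = 0.0;
      for (int c = 0; c < centers.length; c++) {
        maxShift = Math.max(maxShift, euclidean(centers[c], next[c]));
      }
      centers = next;
      if (maxShift < convergenceDelta) {
        break;
      }
    }
    // Note: a center that attracted no points simply stays where it was;
    // nothing here (or in the current algorithm, per the discussion above)
    // repairs an empty cluster.
    return centers;
  }
}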

I don't know how one could specify the number of Canopies as an input 
parameter and get it to work given the algorithm. I've always tweaked 
the T1 and T2 values until something reasonable came out the other end. 
There are, apparently, cross-validation techniques for estimating these 
values but I'm not familiar with them. Another new trick I guess...

Jeff

PS: Looking at the Canopy reference implementation again: since every
pass through the while loop removes one point (to seed a new canopy) and
then every other remaining point within T2 of it, the worst that can
happen, for any value of T2, is one canopy for every input point.
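
In rough code, that argument looks like the following sketch of the canopy
loop as described (plain Java, not the actual reference implementation;
assumes T1 >= T2): each pass removes at least the seed point itself, so the
loop terminates, and if T2 is smaller than every pairwise distance, every
point ends up seeding its own canopy.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Sketch of canopy generation: repeatedly take a remaining point as a new
// canopy center, then remove every remaining point within T2 of it so it
// cannot seed another canopy.
public class CanopySketch {

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Returns the canopy centers found for the given thresholds. */
  static List<double[]> generateCanopies(List<double[]> input, double t1, double t2) {
    List<double[]> centers = new ArrayList<double[]>();
    LinkedList<double[]> points = new LinkedList<double[]>(input);

    while (!points.isEmpty()) {
      // Remove one point and make it the center of a new canopy.
      double[] center = points.removeFirst();
      centers.add(center);

      // Remove every remaining point within T2 of the new center.
      // (In the full algorithm, any remaining point within T1 would also be
      // added to this canopy's member list; omitted here since only the
      // centers matter for seeding K-means.)
      for (Iterator<double[]> it = points.iterator(); it.hasNext();) {
        double[] p = it.next();
        if (euclidean(center, p) < t2) {
          it.remove();
        }
      }
    }
    return centers;
  }
}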

