Posted to dev@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2009/05/31 19:48:27 UTC
The 'k' in k-means
As I read the K-means algorithm and the Canopy paper, I'd expect a
Canopy+KMeans job to still accept an input parameter of 'k' even if Canopy
is on the front. To quote McCallum,
One can also use the canopies idea to speed up prototype based
clustering methods like K-means and Expectation-
Maximization (EM). In general, neither K-means nor EM
specify how many clusters to use. The canopies technique
does not help this choice.
If this is an issue with anything, it is an issue with the syntheticcontrol
sample. The sample has me stumped: I don't see where the code turns the
canopies into a set of prototypes satisfying the requirement that each point
belongs to only one prototype.
Re: The 'k' in k-means
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Benson,
As I read the papers, the K in K-means is the single input parameter
(other than DistanceMeasure) that must be specified. For an arbitrary K,
one can initialize the clusters by sampling K random points from the
input dataset. The iterations themselves then assign each point to its
closest cluster and the cluster centroid is then adjusted for the next
iteration. For KMeans, it would certainly be possible to have a version
in which the initial clusters were constructed by K random samples from
the input data set.
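The initialization and iteration Jeff describes can be sketched as follows. This is a minimal, hypothetical one-dimensional sketch for illustration only, not the Mahout implementation (which works on vectors with a pluggable DistanceMeasure); the class and method names are made up here. It samples K random points as initial centroids, then repeatedly assigns each point to its closest centroid and recomputes each centroid as the mean of its assigned points:

```java
import java.util.*;

/** Minimal 1-D k-means sketch (hypothetical; not the Mahout code). */
public class KMeansSketch {
    public static double[] kmeans(double[] points, int k, int maxIter, long seed) {
        // Initialize the k clusters by sampling k random points from the input
        List<Double> shuffled = new ArrayList<>();
        for (double p : points) shuffled.add(p);
        Collections.shuffle(shuffled, new Random(seed));
        double[] centroids = new double[k];
        for (int i = 0; i < k; i++) centroids[i] = shuffled.get(i);

        for (int iter = 0; iter < maxIter; iter++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            // Assign each point to its closest centroid
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }
            // Recompute each centroid as the mean of its assigned points.
            // An empty cluster keeps its old centroid -- as noted in the
            // thread, the algorithm does not attempt to fix empty clusters.
            for (int c = 0; c < k; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.1, 0.9, 10.0, 10.2, 9.8};
        double[] c = kmeans(pts, 2, 10, 42);
        Arrays.sort(c);
        System.out.println(Arrays.toString(c)); // two centroids near 1.0 and 10.0
    }
}
```

In a Canopy+KMeans pipeline, the random initialization above is simply replaced by the canopy centers.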
In the SyntheticControl example, I see that somebody has commented out
the OutputDriver.runJob() line which mapped each point to its closest
cluster in a manner similar to the second pass in Canopy.
In a Canopy+KMeans job, specifying the number of clusters is done
indirectly by the Canopy step using the given T1 and T2 values. Then,
the KMeansDriver iterates over them until a convergence value is
reached. It is possible in this process for one or more of the initial
clusters to become empty but the current algorithm does not attempt to
fix this.
I don't know how one could specify the number of Canopies as an input
parameter and get it to work given the algorithm. I've always tweaked
the T1 and T2 values until something reasonable came out the other end.
There are, apparently, cross-validation techniques for estimating these
values but I'm not familiar with them. Another new trick I guess...
Jeff
PS: Looking at the Canopy reference implementation again: since each
iteration of the while loop removes one point from the candidate list to
seed a new canopy, along with all other candidates within T2 of it, the
worst that can happen, for any value of T2, is one canopy for every input
point.
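That loop can be sketched roughly as below. Again this is a hypothetical one-dimensional illustration, not the Mahout reference implementation (which uses vectors and a DistanceMeasure); the names are made up. T1 is the loose threshold governing (possibly overlapping) canopy membership, and T2 is the tight threshold that removes points from the candidate list so they can never seed another canopy:

```java
import java.util.*;

/** Minimal 1-D canopy clustering sketch (hypothetical; not the Mahout code). */
public class CanopySketch {
    /** Returns canopy center -> member points. Requires t1 > t2. */
    public static Map<Double, List<Double>> canopies(double[] input, double t1, double t2) {
        List<Double> candidates = new ArrayList<>();
        for (double p : input) candidates.add(p);
        Map<Double, List<Double>> result = new LinkedHashMap<>();
        while (!candidates.isEmpty()) {
            // Remove one candidate to become the center of a new canopy
            final double center = candidates.remove(0);
            // Every input point within the loose threshold T1 joins this
            // canopy; a point may belong to several canopies this way
            List<Double> members = new ArrayList<>();
            for (double p : input)
                if (Math.abs(p - center) < t1) members.add(p);
            result.put(center, members);
            // Candidates within the tight threshold T2 can never seed
            // another canopy. With a tiny T2 nothing extra is removed here,
            // so the worst case is one canopy per input point.
            candidates.removeIf(p -> Math.abs(p - center) < t2);
        }
        return result;
    }
}
```

Shrinking T2 toward zero reproduces the worst case described above: every input point seeds its own canopy.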