Posted to user@mahout.apache.org by Aaron Kaplan <ma...@aaronkaplan.info> on 2011/07/19 01:21:01 UTC

Canopy clustering

The Mahout wiki page for canopy clustering [1] links to the KDD 2000
paper by McCallum et al. [2], but having read both the paper and the
Mahout code, I don't think they are doing the same thing. In the
paper, multiple seed prototypes are placed in each canopy, and each
point is compared only to the prototypes within its own canopies.
Avoiding the comparison of every point with every prototype is the
whole point of the method. The two uses of canopy that I've looked at in
the Mahout code are in
org.apache.mahout.clustering.syntheticcontrol.canopy.Job and
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job. As I
understand it, both just use the canopy centroids as seed prototypes
for a classical clustering approach, where every point is compared
with every prototype in each iteration. Is that accurate, or have I
misread the code?
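
To make the distinction concrete, here is a minimal Python sketch. It is
an illustration only, not Mahout's code and not the paper's
implementation. The function names (build_canopies, canopy_ids,
count_comparisons), the thresholds t1 and t2, and the toy data are all
hypothetical, and assigning prototypes to canopies by the same loose
threshold is a simplifying assumption. It counts how many
point-to-prototype distance computations are needed when comparisons are
restricted to shared canopies, versus comparing every point with every
prototype.

```python
import math


def build_canopies(points, t1, t2):
    """Greedy canopy creation with loose threshold t1 > tight threshold t2 > 0.

    Returns a list of (center_index, member_indices) pairs.
    """
    remaining = list(range(len(points)))  # candidate canopy centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy.
        members = [i for i in range(len(points))
                   if math.dist(points[center], points[i]) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer become centers.
        remaining = [i for i in remaining
                     if math.dist(points[center], points[i]) >= t2]
    return canopies


def canopy_ids(x, points, canopies, t1):
    """Set of canopy indices whose center lies within t1 of x (assumption)."""
    return {k for k, (center, _) in enumerate(canopies)
            if math.dist(points[center], x) < t1}


def count_comparisons(points, prototypes, canopies, t1):
    """Return (restricted, full) counts of point-prototype comparisons."""
    point_sets = [canopy_ids(p, points, canopies, t1) for p in points]
    proto_sets = [canopy_ids(q, points, canopies, t1) for q in prototypes]
    # Paper-style: compare only when point and prototype share a canopy.
    restricted = sum(1 for ps in point_sets for qs in proto_sets if ps & qs)
    # Classical style: every point against every prototype.
    full = len(points) * len(prototypes)
    return restricted, full


# Toy data: two well-separated groups, two prototypes near each group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
prototypes = [(0.3, 0.3), (0.5, 0.5), (10.3, 10.3), (10.5, 10.5)]
canopies = build_canopies(points, t1=3.0, t2=1.5)
restricted, full = count_comparisons(points, prototypes, canopies, t1=3.0)
print(len(canopies), restricted, full)  # prints: 2 12 24
```

On this toy data the canopy-restricted scheme does half the distance
computations per iteration. Using canopy centroids only as seeds
corresponds to the "full" count.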

Thanks
-Aaron

[1] https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
[2] http://www.kamalnigam.com/papers/canopy-kdd00.pdf