Posted to user@mahout.apache.org by Aaron Kaplan <ma...@aaronkaplan.info> on 2011/07/19 01:21:01 UTC
Canopy clustering
The Mahout wiki page for canopy clustering [1] links to the KDD 2000
paper by McCallum et al. [2], but having read both the paper and the
Mahout code, I don't think they are doing the same thing at all. In the
paper, multiple seed prototypes are placed in each canopy, and each
point is compared only to the prototypes within its own canopies. The
fact that they avoid comparing every point with every prototype is the
whole point of the method. The two uses of canopy that I've looked at in
the Mahout code are in
org.apache.mahout.clustering.syntheticcontrol.canopy.Job and
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job, and as I
understand it they both just use the canopy centroids as seed prototypes
for a classical clustering approach, where every point gets compared
with every prototype in each iteration. Is that accurate, or have I
misread the code?
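
To make the distinction concrete, here is a minimal Python sketch of the two
assignment strategies as I understand them. It is not Mahout code and not the
paper's implementation: the function names, the toy points, the thresholds
t1/t2, and the choice of data points as seed prototypes are all assumptions
made for illustration, and the same Euclidean distance stands in for both the
cheap and the expensive metric.

```python
import math

def build_canopies(points, t1, t2):
    """Greedy canopy construction with loose threshold t1 > tight threshold t2.

    Returns a list of canopies, each a set of point indices.
    """
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within t1 of the center joins this canopy.
        canopies.append({i for i in range(len(points))
                         if math.dist(points[center], points[i]) < t1})
        # Points within t2 of the center can no longer start a canopy.
        remaining = [i for i in remaining
                     if math.dist(points[center], points[i]) >= t2]
    return canopies

def assign(points, proto_idx, canopies=None):
    """Assign each point to its nearest prototype.

    With canopies=None every point is compared with every prototype.
    Otherwise a point is compared only with prototypes that share at
    least one canopy with it. Returns (assignment, comparison count).
    """
    comparisons = 0
    assignment = []
    for i, p in enumerate(points):
        if canopies is None:
            candidates = list(proto_idx)
        else:
            mine = [c for c in canopies if i in c]
            candidates = [q for q in proto_idx if any(q in c for c in mine)]
        comparisons += len(candidates)
        best = min(candidates, key=lambda q: math.dist(p, points[q]), default=None)
        assignment.append(best)
    return assignment, comparisons

# Toy data: two well-separated groups, two seed prototypes in each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
prototypes = [0, 1, 3, 4]  # hypothetical seeds, given as indices into points

canopies = build_canopies(points, t1=3.0, t2=1.5)
full_assign, full_cmp = assign(points, prototypes)
canopy_assign, canopy_cmp = assign(points, prototypes, canopies)

print("canopies:", canopies)
print("all-pairs comparisons:", full_cmp)
print("canopy-restricted comparisons:", canopy_cmp)
print("same assignment:", full_assign == canopy_assign)
```

On this toy data both strategies produce the same assignment, but the
canopy-restricted version performs half as many point-to-prototype
comparisons. That saving is what I would expect to see in an implementation
of the paper's method, and what I don't see when the canopy centroids are
used only as seeds.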
Thanks
-Aaron
[1] https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
[2] http://www.kamalnigam.com/papers/canopy-kdd00.pdf