Posted to user@mahout.apache.org by gabeweb <ga...@htc.com> on 2010/10/15 07:03:58 UTC

Canopy clustering question

I've been using canopy clustering to generate initial centroids for k-means
clustering.  But in this case, is t1 actually doing anything?  Because
canopy clustering keeps going until all points are within t2 of some canopy
center.  If a point is within t1 of a canopy center, it's placed in that
canopy, sure, but it's also kept in the list of "not yet within t2 of some
canopy center" points, so this step seems vacuous.

It seems to me that the case in which t1 is not vacuous is when performing
the second step of canopy clustering as described by McCallum et al., which
is the hard clustering using a better distance metric.  In this case, the
canopies generated by t1 are used:  when points are assigned to single
clusters, only the clusters whose centers are within t1 of a point are
considered.

However, when using canopy clustering to generate centroids for k-means, we
don't perform the second step of canopy clustering because we don't need the
hard decisions for each point; we just need the canopy centers.  K-means
clustering determines the actual cluster membership (along with changing the
cluster centers, obviously).

So am I right about t1 not being relevant in the case of canopy clustering
for centroids only?  Or if that's not right, where am I going wrong? 
Thanks.

Re: Canopy clustering question

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

You are correct that T2 is the prime driver in Canopy, but T1 is
important in determining how many nearby points will be considered in
the final centroid computation. If you look at the CanopyReducer, you
will note that it calls canopy.computeParameters() before writing
out the canopy. This recomputes the posterior statistics from all the
points that were within T1 of the canopy's original center and sets the
center to their mean. Note also that CanopyClusterer.addPointToCanopies()
observes a point in a canopy's posterior statistics based on the T1
distance.
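
To make the role of T1 concrete, here is a minimal single-pass sketch in
Python. It is not Mahout's code: the function name canopy_centers and the
dictionary layout are hypothetical, the distance is assumed to be plain
Euclidean, and a new canopy is assumed to count its own seed point. It only
mirrors the behaviour described above: T2 decides whether a point starts a
new canopy, while T1 decides which points are averaged into each center.

```python
from math import dist  # Euclidean distance, Python 3.8+


def canopy_centers(points, t1, t2):
    """Return one recomputed center per canopy (assumes t1 > t2)."""
    canopies = []  # each canopy keeps its seed center plus running sums
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = dist(p, c["center"])
            if d < t1:
                # Within T1: the point contributes to this canopy's mean.
                c["sum"] = [s + x for s, x in zip(c["sum"], p)]
                c["n"] += 1
            if d < t2:
                # Within T2: the point may not seed a canopy of its own.
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": p, "sum": list(p), "n": 1})
    # Final step: replace each seed center with the mean of its T1 points.
    return [tuple(s / c["n"] for s in c["sum"]) for c in canopies]


pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
narrow = canopy_centers(pts, t1=2.5, t2=2.0)
wide = canopy_centers(pts, t1=4.0, t2=2.0)
```

With these four made-up points, both calls produce three canopies, because
T2 is the same. The first center moves from (0.5, 0.0) with t1=2.5 to about
(1.33, 0.0) with t1=4.0, since the wider T1 also pulls the point (3.0, 0.0)
into the first canopy's mean.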

The hard clustering step you describe uses the same distance measure and 
canopy centers to assign points to clusters. It is only possible to use 
a different/better distance measure when calling the driver methods from 
Java. The CLI uses the same distance measure for both phases.
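
As a rough illustration of that assignment step, the Python sketch below
maps each point to its closest center and takes the distance measure as a
parameter. The names assign_to_nearest and manhattan are hypothetical and
this is not the Mahout driver API; it only shows that the measure used for
assignment can be swapped independently of how the centers were produced.

```python
from math import dist  # Euclidean distance, Python 3.8+


def manhattan(a, b):
    """City-block distance, used here as an alternative measure."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_to_nearest(points, centers, measure=dist):
    """Return, for each point, the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda i: measure(p, centers[i]))
        for p in points
    ]


centers = [(0.5, 0.0), (3.0, 0.0), (10.0, 0.0)]
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
euclidean_labels = assign_to_nearest(pts, centers)
manhattan_labels = assign_to_nearest(pts, centers, measure=manhattan)
```

Passing the same measure to both phases corresponds to the CLI behaviour;
passing a different one corresponds to calling the phases separately.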

You are correct that the clustered points from canopy are not needed if
you are feeding the canopies to k-means as initial cluster centers. But
again, these centers are influenced by T1, and of course by T2, so both
thresholds are important, for different reasons. I think you are mostly
on the right track, so I hope this helps.
