Posted to user@mahout.apache.org by Madhusudan Joshi <ma...@gmail.com> on 2011/04/13 08:22:27 UTC

Choosing appropriate values for T1 and T2 for canopy clustering

I have been using kmeans to cluster some data. For the initial clusters I used
a random seed, but that resulted in most of the documents (more than 70%) being
placed in a single cluster. So I want to use canopy clustering to create the
initial clusters, but I am having problems selecting suitable values for T1
and T2. Is there any method to calculate appropriate values depending
upon the number of documents used for clustering? Any help will be
appreciated.

-- 
Everything we hear is an opinion, not a fact.
Everything we see is perspective, not the truth.

RE: Choosing appropriate values for T1 and T2 for canopy clustering

Posted by Jeff Eastman <je...@Narus.com>.
The T2 value you select determines the number of clusters you get. The T1 value determines how much the points near each cluster influence its final centroid calculation. Your choice of distance measure also has a big impact on the outcome. If T2 is too small you will get too many clusters, in the limit as many as there are input vectors. If it is too large you will get a single cluster. I find that a few iterations of doubling or halving the T2 value converge pretty quickly on a reasonable set of clusters. If that does not happen, pick a different distance measure.
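
The doubling/halving search on T2 described above could look roughly like the following. This is a minimal sketch in plain Python, not Mahout's Java implementation. The function names canopy_centroids and tune_t2, the Euclidean distance, the fixed ratio T1 = 2 x T2, and the target range for the number of clusters (min_k, max_k) are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centroids(points, t1, t2, distance=euclidean):
    """Simplified single-pass canopy clustering; returns one centroid per canopy.

    Illustrative only: T2 controls when a point may seed a new canopy,
    T1 controls which points contribute to a canopy's centroid.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []  # each entry: (seed point, list of contributing points)
    for p in points:
        strongly_bound = False
        for seed, members in canopies:
            d = distance(seed, p)
            if d < t1:
                members.append(p)      # within T1: p influences this centroid
            if d < t2:
                strongly_bound = True  # within T2: p may not seed a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))
    # Centroid = per-dimension mean of the points that contributed to the canopy.
    return [
        tuple(sum(dim) / len(members) for dim in zip(*members))
        for _, members in canopies
    ]


def tune_t2(points, t2, min_k, max_k, t1_ratio=2.0, max_iter=20,
            distance=euclidean):
    """Double or halve T2 until the canopy count falls in [min_k, max_k].

    Returns (t2, canopy_count), or None if no value in range was found
    within max_iter steps. t1_ratio (T1 = t1_ratio * T2) is an assumption.
    """
    for _ in range(max_iter):
        k = len(canopy_centroids(points, t1_ratio * t2, t2, distance))
        if min_k <= k <= max_k:
            return t2, k
        # Too many canopies: T2 is too small, so double it; otherwise halve it.
        t2 = t2 * 2.0 if k > max_k else t2 / 2.0
    return None
```

If tune_t2 returns None, the search bounced between too many and too few clusters without landing in the target range, which is the case where the advice above is to try a different distance measure. The centroids returned by canopy_centroids would then serve as the initial clusters for kmeans.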
