
Posted to user@mahout.apache.org by Robert Stewart <bs...@gmail.com> on 2012/05/15 16:45:30 UTC

choosing appropriate t1,t2 for canopy clustering

I am trying to run canopy clustering on vectors extracted from a Lucene index.  I want to use CosineDistanceMeasure.  How do I choose appropriate values for the t1 and t2 distance thresholds?  I assumed the cosine distance measure would return "distances" in the range 0.0 to 1.0, but that does not seem to be the case. How can I find out the possible range of distances so that I can pick t1 and t2 (other than by repeated trial and error)?

Thanks
Bob

Re: choosing appropriate t1,t2 for canopy clustering

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

You can use the RepresentativePointsDriver to pick a set of n 
representative points from each cluster to speed these calculations, but 
it requires the clusters and clustered points so it may not help with 
what you are doing.

On 5/16/12 4:16 AM, Paritosh Ranjan wrote:
> "calculated the mean distance between all the pairs of vectors"
>
>
> This can be a very costly operation if the dataset is reasonably large.
>
> On 16-05-2012 13:34, ivan obeso wrote:
>> In my project of text clustering I used the Euclidean distance as
>> measurement method. I wrote a method which calculated the mean distance
>> between all the pairs of vectors (documents) and used this mean as 
>> T2, and
>> for T1 I used mean*2. This approach worked really good for me, giving
>> a reasonably
>> number of clusters in various corpus.
>>
>> On Tue, May 15, 2012 at 10:45 AM, Robert 
>> Stewart<bs...@gmail.com>wrote:
>>
>>> I am trying to run canopy clustering on vectors extracted from lucene
>>> index.  I want to use CosineDistanceMeasure.  How do I know what
>>> appropriate values to use for t1 and t2 distance threshold?  I would 
>>> assume
>>> that Cosine distance measure would return "distances" as a range 
>>> from 0.0
>>> to 1.0 but that seems not the case, so how do I know what the potential
>>> distance ranges are to pick t1 and t2 (other than many trial and 
>>> errors)?
>>>
>>> Thanks
>>> Bob
>
>
>

Re: choosing appropriate t1,t2 for canopy clustering

Posted by Paritosh Ranjan <pr...@xebia.com>.

"calculated the mean distance between all the pairs of vectors"


This can be a very costly operation if the dataset is reasonably large.
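
One way to limit that cost is to estimate the mean from a random sample of pairs instead of visiting all of them. The sketch below is only an illustration of that idea: the class name, method names, and the use of plain double[] arrays in place of Mahout vectors are assumptions, not Mahout API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: estimates the mean pairwise
// Euclidean distance from randomly sampled pairs instead of all pairs.
class SampledMeanDistance {

    // Euclidean distance between two dense vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Averages the distance over `samples` random pairs of distinct vectors.
    static double estimate(List<double[]> vectors, int samples, long seed) {
        int n = vectors.size();
        if (n < 2 || samples < 1) {
            throw new IllegalArgumentException("need at least two vectors and one sample");
        }
        Random rnd = new Random(seed);
        double total = 0.0;
        for (int s = 0; s < samples; s++) {
            int i = rnd.nextInt(n);
            int j = rnd.nextInt(n - 1);
            if (j >= i) {
                j++; // shift past i so the two indices always differ
            }
            total += euclidean(vectors.get(i), vectors.get(j));
        }
        return total / samples;
    }

    public static void main(String[] args) {
        List<double[]> docs = Arrays.asList(
                new double[] {0.0, 0.0},
                new double[] {3.0, 4.0},
                new double[] {6.0, 8.0});
        System.out.println("estimated mean distance: " + estimate(docs, 1000, 42L));
    }
}
```

The cost is then proportional to the number of sampled pairs rather than to the square of the number of documents, at the price of some sampling error in the estimate.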

On 16-05-2012 13:34, ivan obeso wrote:
> In my project of text clustering I used the Euclidean distance as
> measurement method. I wrote a method which calculated the mean distance
> between all the pairs of vectors (documents) and used this mean as T2, and
> for T1 I used mean*2. This approach worked really good for me, giving
> a reasonably
> number of clusters in various corpus.
>
> On Tue, May 15, 2012 at 10:45 AM, Robert Stewart<bs...@gmail.com>wrote:
>
>> I am trying to run canopy clustering on vectors extracted from lucene
>> index.  I want to use CosineDistanceMeasure.  How do I know what
>> appropriate values to use for t1 and t2 distance threshold?  I would assume
>> that Cosine distance measure would return "distances" as a range from 0.0
>> to 1.0 but that seems not the case, so how do I know what the potential
>> distance ranges are to pick t1 and t2 (other than many trial and errors)?
>>
>> Thanks
>> Bob

Re: choosing appropriate t1,t2 for canopy clustering

Posted by ivan obeso <se...@gmail.com>.

In my text clustering project I used Euclidean distance as the distance
measure. I wrote a method that calculated the mean distance between all
pairs of vectors (documents) and used this mean as T2; for T1 I used
mean*2. This approach worked really well for me, giving a reasonable
number of clusters across various corpora.
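
The heuristic above can be sketched roughly as follows. This is not the original method from the post; the class and method names are hypothetical, and plain double[] arrays stand in for Mahout vectors.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Mahout: T2 is the mean distance over all
// pairs of vectors and T1 is twice that value.
class PairwiseThresholds {

    // Euclidean distance between two dense vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mean distance over all unordered pairs; O(n^2) in the number of vectors.
    static double meanPairwiseDistance(List<double[]> vectors) {
        int n = vectors.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two vectors");
        }
        double total = 0.0;
        long pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                total += euclidean(vectors.get(i), vectors.get(j));
                pairs++;
            }
        }
        return total / pairs;
    }

    public static void main(String[] args) {
        List<double[]> docs = Arrays.asList(
                new double[] {0.0, 0.0},
                new double[] {3.0, 4.0},
                new double[] {6.0, 8.0});
        double t2 = meanPairwiseDistance(docs);
        double t1 = 2.0 * t2;
        System.out.println("T2 = " + t2 + ", T1 = " + t1);
    }
}
```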

On Tue, May 15, 2012 at 10:45 AM, Robert Stewart <bs...@gmail.com>wrote:

> I am trying to run canopy clustering on vectors extracted from lucene
> index.  I want to use CosineDistanceMeasure.  How do I know what
> appropriate values to use for t1 and t2 distance threshold?  I would assume
> that Cosine distance measure would return "distances" as a range from 0.0
> to 1.0 but that seems not the case, so how do I know what the potential
> distance ranges are to pick t1 and t2 (other than many trial and errors)?
>
> Thanks
> Bob

Re: choosing appropriate t1,t2 for canopy clustering

Posted by Pat Ferrel <pa...@occamsmachete.com>.

I've seen the same thing. I don't think there is a way to specify
acceptable distances ahead of time, because the algorithm finds centroids
and can only calculate distances after all centroids have converged. That
might be your answer, though, because these distances are stored at the
end of the job. In my case I only allow the docs closest to the centroid
into my UI. You could set your own threshold and discard docs that are too
far away. In some cases you may then get empty clusters.
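
A minimal sketch of that kind of post-filtering, assuming the centroid and the member vectors of a cluster are already available as plain double[] arrays. The class and method names are hypothetical and do not read Mahout's job output; cosine distance is taken to be 1 minus cosine similarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing step, not part of Mahout: keeps only the
// cluster members that lie within maxDistance of the cluster centroid.
class CentroidFilter {

    // Cosine distance defined as 1 - cosine similarity.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // assumption: a zero vector counts as unrelated
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Returns the members within maxDistance of the centroid; may be empty.
    static List<double[]> keepClose(double[] centroid, List<double[]> members,
                                    double maxDistance) {
        List<double[]> kept = new ArrayList<>();
        for (double[] member : members) {
            if (cosineDistance(centroid, member) <= maxDistance) {
                kept.add(member);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] centroid = {1.0, 0.0};
        List<double[]> members = Arrays.asList(
                new double[] {1.0, 0.0},
                new double[] {1.0, 1.0},
                new double[] {0.0, 1.0});
        System.out.println("kept: " + keepClose(centroid, members, 0.5).size());
    }
}
```

An empty result from keepClose corresponds to the empty-cluster case mentioned above.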

On 5/15/12 8:36 AM, Robert Stewart wrote:
> Thanks Jeff.  I do see that cosine distance does return 0.0-1.0 now as expected.  Something else was wrong in my initial run I guess.
>
> A different question about k-means:  I can successfully cluster using k-means but what happens is some clusters are very unrelated, so it seems like there needs to be some distance threshold to cluster documents using k-means (so clusters with very dis-similar items just dont get put into any cluster).  Is that possible with mahout?  I dont see any type of threshold parameters for k-means.
>
>
> On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:
>
>> Hi Bob,
>>
>> Cosine distance will return distances on 0.0...1.0 as you suggest. While there is no absolutely foolproof technique for priming canopy T1&  T2 values I recommend you begin by setting T1==T2 and doing a binary search from some initial distance, perhaps 0.1. If you get too few clusters, decrease T1==T2 by half and try again. If too many, double etc.
>>
>> If you want to be more analytical, use the RandomSeedGenerator to sample from your input vectors and compute a starting point using their inter-cluster distances. You can also skip Canopy and use k-means with -k specified to sample from your input data and produce k clusters. That works pretty well with text and Cosine distance
>>
>> Once you arrive at a "reasonable" number of clusters, you can mess with T1 to include more points in the centroid calculations but that will not change the number of clusters.
>>
>>
>> On 5/15/12 10:45 AM, Robert Stewart wrote:
>>> I am trying to run canopy clustering on vectors extracted from lucene index.  I want to use CosineDistanceMeasure.  How do I know what appropriate values to use for t1 and t2 distance threshold?  I would assume that Cosine distance measure would return "distances" as a range from 0.0 to 1.0 but that seems not the case, so how do I know what the potential distance ranges are to pick t1 and t2 (other than many trial and errors)?
>>>
>>> Thanks
>>> Bob
>>>
>
>

Re: choosing appropriate t1,t2 for canopy clustering

Posted by Paritosh Ranjan <pr...@xebia.com>.

A threshold parameter for k-means has been added in Mahout 0.7-snapshot; 
you will not find it in Mahout 0.6. See the clusterClassificationThreshold 
parameter in KMeansDriver:

  public static void run(Configuration conf, Path input, Path clustersIn, Path output,
      DistanceMeasure measure, double convergenceDelta, int maxIterations,
      boolean runClustering, double clusterClassificationThreshold,
      boolean runSequential)


On 15-05-2012 21:06, Robert Stewart wrote:
> Thanks Jeff.  I do see that cosine distance does return 0.0-1.0 now as expected.  Something else was wrong in my initial run I guess.
>
> A different question about k-means:  I can successfully cluster using k-means but what happens is some clusters are very unrelated, so it seems like there needs to be some distance threshold to cluster documents using k-means (so clusters with very dis-similar items just dont get put into any cluster).  Is that possible with mahout?  I dont see any type of threshold parameters for k-means.
>
>
> On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:
>
>> Hi Bob,
>>
>> Cosine distance will return distances on 0.0...1.0 as you suggest. While there is no absolutely foolproof technique for priming canopy T1&  T2 values I recommend you begin by setting T1==T2 and doing a binary search from some initial distance, perhaps 0.1. If you get too few clusters, decrease T1==T2 by half and try again. If too many, double etc.
>>
>> If you want to be more analytical, use the RandomSeedGenerator to sample from your input vectors and compute a starting point using their inter-cluster distances. You can also skip Canopy and use k-means with -k specified to sample from your input data and produce k clusters. That works pretty well with text and Cosine distance
>>
>> Once you arrive at a "reasonable" number of clusters, you can mess with T1 to include more points in the centroid calculations but that will not change the number of clusters.
>>
>>
>> On 5/15/12 10:45 AM, Robert Stewart wrote:
>>> I am trying to run canopy clustering on vectors extracted from lucene index.  I want to use CosineDistanceMeasure.  How do I know what appropriate values to use for t1 and t2 distance threshold?  I would assume that Cosine distance measure would return "distances" as a range from 0.0 to 1.0 but that seems not the case, so how do I know what the potential distance ranges are to pick t1 and t2 (other than many trial and errors)?
>>>
>>> Thanks
>>> Bob
>>>

Re: choosing appropriate t1,t2 for canopy clustering

Posted by Robert Stewart <bs...@gmail.com>.

Thanks Jeff.  I do see now that cosine distance returns 0.0-1.0 as expected.  Something else must have been wrong in my initial run.

A different question about k-means:  I can cluster successfully using k-means, but some of the resulting clusters contain very unrelated documents. It seems there needs to be some distance threshold when clustering documents with k-means, so that very dissimilar items just don't get put into any cluster.  Is that possible with Mahout?  I don't see any threshold parameters for k-means.


On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:

> Hi Bob,
> 
> Cosine distance will return distances on 0.0...1.0 as you suggest. While there is no absolutely foolproof technique for priming canopy T1 & T2 values I recommend you begin by setting T1==T2 and doing a binary search from some initial distance, perhaps 0.1. If you get too few clusters, decrease T1==T2 by half and try again. If too many, double etc.
> 
> If you want to be more analytical, use the RandomSeedGenerator to sample from your input vectors and compute a starting point using their inter-cluster distances. You can also skip Canopy and use k-means with -k specified to sample from your input data and produce k clusters. That works pretty well with text and Cosine distance
> 
> Once you arrive at a "reasonable" number of clusters, you can mess with T1 to include more points in the centroid calculations but that will not change the number of clusters.
> 
> 
> On 5/15/12 10:45 AM, Robert Stewart wrote:
>> I am trying to run canopy clustering on vectors extracted from lucene index.  I want to use CosineDistanceMeasure.  How do I know what appropriate values to use for t1 and t2 distance threshold?  I would assume that Cosine distance measure would return "distances" as a range from 0.0 to 1.0 but that seems not the case, so how do I know what the potential distance ranges are to pick t1 and t2 (other than many trial and errors)?
>> 
>> Thanks
>> Bob
>> 
>

Re: choosing appropriate t1,t2 for canopy clustering

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Hi Bob,

Cosine distance will return distances on 0.0...1.0 as you suggest. While 
there is no absolutely foolproof technique for priming canopy T1 & T2 
values I recommend you begin by setting T1==T2 and doing a binary search 
from some initial distance, perhaps 0.1. If you get too few clusters, 
decrease T1==T2 by half and try again. If too many, double etc.

If you want to be more analytical, use the RandomSeedGenerator to sample 
from your input vectors and compute a starting point using their 
inter-cluster distances. You can also skip Canopy and use k-means with 
-k specified to sample from your input data and produce k clusters. That 
works pretty well with text and Cosine distance.

Once you arrive at a "reasonable" number of clusters, you can mess with 
T1 to include more points in the centroid calculations but that will not 
change the number of clusters.
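
The T1==T2 search described above might look roughly like the following sketch. It is not Mahout code: countCanopies is a simplified stand-in for running the canopy job and counting the resulting clusters, all names are hypothetical, and cosine distance is taken as 1 minus cosine similarity, which stays within 0.0 to 1.0 for non-negative vectors such as term weights.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Mahout: halve or double a single
// threshold (T1 == T2) until a target number of canopies is reached.
class CanopyThresholdSearch {

    // Cosine distance defined as 1 - cosine similarity.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // assumption: a zero vector counts as unrelated
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Simplified single-pass canopy count with T1 == T2 == t: the first
    // remaining point becomes a center and removes every point closer than t.
    static int countCanopies(List<double[]> points, double t) {
        List<double[]> remaining = new ArrayList<>(points);
        int canopies = 0;
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            remaining.removeIf(p -> cosineDistance(center, p) < t);
            canopies++;
        }
        return canopies;
    }

    // Too few canopies: halve t. Too many: double t. Stops after maxRounds
    // and returns the last value tried if the target was never hit exactly.
    static double search(List<double[]> points, int target, double start, int maxRounds) {
        double t = start;
        for (int round = 0; round < maxRounds; round++) {
            int count = countCanopies(points, t);
            if (count == target) {
                return t;
            }
            t = count < target ? t / 2.0 : t * 2.0;
        }
        return t;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {1.0, 0.0},
                new double[] {1.0, 0.01},
                new double[] {0.0, 1.0},
                new double[] {0.01, 1.0});
        double t = search(points, 2, 0.1, 20);
        System.out.println("t = " + t + ", canopies = " + countCanopies(points, t));
    }
}
```

Plain halving and doubling can oscillate around a target that no threshold hits exactly, which is why the loop is bounded by maxRounds.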

On 5/15/12 10:45 AM, Robert Stewart wrote:
> I am trying to run canopy clustering on vectors extracted from lucene index.  I want to use CosineDistanceMeasure.  How do I know what appropriate values to use for t1 and t2 distance threshold?  I would assume that Cosine distance measure would return "distances" as a range from 0.0 to 1.0 but that seems not the case, so how do I know what the potential distance ranges are to pick t1 and t2 (other than many trial and errors)?
>
> Thanks
> Bob
>