Posted to user@mahout.apache.org by Camilo Lopez <ca...@camilolopez.com> on 2011/04/27 21:39:02 UTC

Finding thresholds for canopy

I'm using Canopy as the first step for K-means clustering. Is there any algorithm, or even a good heuristic, to estimate good T1 and T2 values from the vectorized data?

Re: Finding thresholds for canopy

Posted by Camilo Lopez <ca...@camilolopez.com>.

Thanks Jeff, I guess the "art" part of it is the initial reasonable number.


On 2011-04-27, at 4:23 PM, Jeff Eastman wrote:

> No good answers here. The T2 value is the one which will control the number of clusters that Canopy finds. Try an initial value that seems reasonable then do a binary search, halving or doubling the value etc., until you get a reasonable number of clusters. Increasing T2 will give you fewer clusters, decreasing will give you more. If your initial value is off a lot you will get either 1 or numPoints clusters. T1 will affect which points that are near to a cluster but farther than T2 will contribute to its ultimate centroid. You can make T1=T2 in your binary search then increase T1 incrementally to see how the centroids move. 
> 
> -----Original Message-----
> From: Camilo Lopez [mailto:admin@camilolopez.com] On Behalf Of Camilo Lopez
> Sent: Wednesday, April 27, 2011 12:39 PM
> To: user@mahout.apache.org
> Subject: Finding thresholds for canopy
> 
> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?

RE: Finding thresholds for canopy

Posted by Jeff Eastman <je...@Narus.com>.

No good answers here. The T2 value is the one that controls the number of clusters Canopy finds. Try an initial value that seems reasonable, then do a binary search, halving or doubling the value until you get a reasonable number of clusters. Increasing T2 gives you fewer clusters; decreasing it gives you more. If your initial value is far off, you will get either 1 cluster or numPoints clusters. T1 determines which points that are near a cluster, but farther from it than T2, still contribute to its final centroid. You can set T1=T2 during the binary search, then increase T1 incrementally to see how the centroids move.
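
The search described above could be sketched roughly as follows. This is plain Python, not Mahout: count_canopies is a simplified single-pass stand-in for the Canopy driver that models only T2, Euclidean distance is assumed, and the names search_t2, target and max_steps are hypothetical. The canopy count depends on point order and is not guaranteed to change monotonically with T2, so the result is only a starting point.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def count_canopies(points, t2, distance=euclidean):
    """Simplified canopy pass: a point starts a new canopy only if it is
    not within t2 of an existing canopy center. T1 is ignored here because
    it does not change how many canopies are created."""
    centers = []
    for p in points:
        if not any(distance(p, c) < t2 for c in centers):
            centers.append(p)
    return len(centers)

def search_t2(points, target, t2, max_steps=40, distance=euclidean):
    """Double or halve t2 until the target count is bracketed, then bisect."""
    lo = None  # largest t2 seen that gave too many canopies
    hi = None  # smallest t2 seen that gave too few canopies
    for _ in range(max_steps):
        k = count_canopies(points, t2, distance)
        if k == target:
            return t2, k
        if k > target:
            # Too many canopies: T2 is too small, so increase it.
            lo = t2
            t2 = t2 * 2 if hi is None else (lo + hi) / 2
        else:
            # Too few canopies: T2 is too large, so decrease it.
            hi = t2
            t2 = t2 / 2 if lo is None else (lo + hi) / 2
    return t2, count_canopies(points, t2, distance)
```

Calling search_t2(points, 10, 1.0) would return the last T2 tried together with the canopy count it produced. Once T2 is settled, T1 can be raised from T1=T2 in a separate pass.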

-----Original Message-----
From: Camilo Lopez [mailto:admin@camilolopez.com] On Behalf Of Camilo Lopez
Sent: Wednesday, April 27, 2011 12:39 PM
To: user@mahout.apache.org
Subject: Finding thresholds for canopy

I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?

Re: Finding thresholds for canopy

Posted by Paul Mahon <pm...@decarta.com>.

It certainly is 3D thinking. The idea generalizes to N dimensions, but
I agree, it's unlikely to be effective, since most of the space at
higher dimensions is empty in any application I've seen.

On 04/27/2011 02:04 PM, Ted Dunning wrote:
> That sounds like 3-dimensional thinking.
>
> High dimensional problems abound and have very different properties.
>
> On Wed, Apr 27, 2011 at 1:55 PM, Paul Mahon<pm...@decarta.com>  wrote:
>
>> No, I mean the area. If all the vectors fit in a AxBxC sized box, and you
>> expect about 10 clusters, you could make an initial guess that the clusters
>> will be (A/10)xBxC in size and you could try T1=(A/10)*B*C. I've no idea how
>> well this would work in practice... probably not very well.
>>
>> On 04/27/2011 01:50 PM, Camilo Lopez wrote:
>>
>>> By area of the space you mean just the total number of vectors I'm using?
>>> On 2011-04-27, at 4:46 PM, Paul Mahon wrote:
>>>
>>>   If you have a guess at how many clusters you want you could take the
>>>> total area of the space and divide by the number of clusters to get an
>>>> initial guess of T2 or T1. That might work to get you started, depending on
>>>> the distribution.
>>>>
>>>> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>>>>
>>>>> I'm using Canopy as first step for K-means clustering, is there any
>>>>> algorithmic, or even a good heuristic to estimate good T1 and T2 from the
>>>>> vectorized data?
>>>>>

Re: Finding thresholds for canopy

Posted by Ted Dunning <te...@gmail.com>.

That sounds like 3-dimensional thinking.

High dimensional problems abound and have very different properties.

On Wed, Apr 27, 2011 at 1:55 PM, Paul Mahon <pm...@decarta.com> wrote:

> No, I mean the area. If all the vectors fit in a AxBxC sized box, and you
> expect about 10 clusters, you could make an initial guess that the clusters
> will be (A/10)xBxC in size and you could try T1=(A/10)*B*C. I've no idea how
> well this would work in practice... probably not very well.
>
> On 04/27/2011 01:50 PM, Camilo Lopez wrote:
>
>> By area of the space you mean just the total number of vectors I'm using?
>> On 2011-04-27, at 4:46 PM, Paul Mahon wrote:
>>
>>  If you have a guess at how many clusters you want you could take the
>>> total area of the space and divide by the number of clusters to get an
>>> initial guess of T2 or T1. That might work to get you started, depending on
>>> the distribution.
>>>
>>> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>>>
>>>> I'm using Canopy as first step for K-means clustering, is there any
>>>> algorithmic, or even a good heuristic to estimate good T1 and T2 from the
>>>> vectorized data?
>>>>
>>>

Re: Finding thresholds for canopy

Posted by Paul Mahon <pm...@decarta.com>.

No, I mean the area. If all the vectors fit in an AxBxC-sized box, and
you expect about 10 clusters, you could make an initial guess that the 
clusters will be (A/10)xBxC in size and you could try T1=(A/10)*B*C. 
I've no idea how well this would work in practice... probably not very 
well.
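
T1 and T2 are distance thresholds, so a box volume cannot be compared with them directly. One way to turn the idea above into a length, offered purely as an assumption and not something proposed in this thread, is to take the d-th root of the per-cluster volume. The function name length_scale_from_box is hypothetical. As noted elsewhere in the thread, most of a high-dimensional bounding box is empty, so this number is likely to be a poor guess there.

```python
import math

def length_scale_from_box(extents, k):
    """Rough length scale for k equal-volume cells in a bounding box.

    extents: side lengths of the bounding box, one per dimension (A, B, C, ...).
    k: expected number of clusters.
    Returns the side of a hypercube whose volume is (box volume) / k.
    """
    volume = math.prod(extents)
    return (volume / k) ** (1.0 / len(extents))
```

For a 10x10x10 box and 8 expected clusters this gives a length of about 5, which could serve as a first T2 to try before refining it by search.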

On 04/27/2011 01:50 PM, Camilo Lopez wrote:
> By area of the space you mean just the total number of vectors I'm using?
> On 2011-04-27, at 4:46 PM, Paul Mahon wrote:
>
>> If you have a guess at how many clusters you want you could take the total area of the space and divide by the number of clusters to get an initial guess of T2 or T1. That might work to get you started, depending on the distribution.
>>
>> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>>> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?

Re: Finding thresholds for canopy

Posted by Camilo Lopez <ca...@camilolopez.com>.

By area of the space, do you mean just the total number of vectors I'm using?
On 2011-04-27, at 4:46 PM, Paul Mahon wrote:

> If you have a guess at how many clusters you want you could take the total area of the space and divide by the number of clusters to get an initial guess of T2 or T1. That might work to get you started, depending on the distribution.
> 
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?

RE: Finding thresholds for canopy

Posted by Jeff Eastman <je...@Narus.com>.

Sure, seems like a reasonable place to start. I will be fascinated to hear how it works.

-----Original Message-----
From: Frank Scholten [mailto:frank@frankscholten.nl] 
Sent: Tuesday, May 17, 2011 8:04 AM
To: user@mahout.apache.org
Subject: Re: Finding thresholds for canopy

Hi Jeff,

After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?

Frank

On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman <je...@narus.com> wrote:
> Worth a try, but it ultimately boils down to the distance measure you've chosen, the distributions of input vectors and T2. As a pre-run experiment, you could sample some points from your data set (e.g. using RandomSeedGenerator as you would to prime k-means), then build a distance matrix using your chosen distance measure. That would give you a T2 starting point in a more systematic manner than grabbing it completely out of thin air.
>
> -----Original Message-----
> From: Paul Mahon [mailto:pmahon@decarta.com]
> Sent: Wednesday, April 27, 2011 1:46 PM
> To: user@mahout.apache.org
> Subject: Re: Finding thresholds for canopy
>
> If you have a guess at how many clusters you want you could take the
> total area of the space and divide by the number of clusters to get an
> initial guess of T2 or T1. That might work to get you started,
> depending on the distribution.
>
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?
>

Re: Finding thresholds for canopy

Posted by Frank Scholten <fr...@frankscholten.nl>.

Hi Jeff,

After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?

Frank

On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman <je...@narus.com> wrote:
> Worth a try, but it ultimately boils down to the distance measure you've chosen, the distributions of input vectors and T2. As a pre-run experiment, you could sample some points from your data set (e.g. using RandomSeedGenerator as you would to prime k-means), then build a distance matrix using your chosen distance measure. That would give you a T2 starting point in a more systematic manner than grabbing it completely out of thin air.
>
> -----Original Message-----
> From: Paul Mahon [mailto:pmahon@decarta.com]
> Sent: Wednesday, April 27, 2011 1:46 PM
> To: user@mahout.apache.org
> Subject: Re: Finding thresholds for canopy
>
> If you have a guess at how many clusters you want you could take the
> total area of the space and divide by the number of clusters to get an
> initial guess of T2 or T1. That might work to get you started,
> depending on the distribution.
>
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?
>

RE: Finding thresholds for canopy

Posted by Jeff Eastman <je...@Narus.com>.

Worth a try, but it ultimately boils down to the distance measure you've chosen, the distributions of input vectors and T2. As a pre-run experiment, you could sample some points from your data set (e.g. using RandomSeedGenerator as you would to prime k-means), then build a distance matrix using your chosen distance measure. That would give you a T2 starting point in a more systematic manner than grabbing it completely out of thin air.
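
A minimal sketch of that pre-run experiment, in plain Python rather than Mahout. Here random.sample stands in for RandomSeedGenerator, Euclidean distance is assumed as the chosen measure, and the function names are hypothetical. Which statistic of the sampled distances makes a good T2 is left open; the summary just reports a few candidates to start searching from.

```python
import math
import random
import statistics

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sampled_distances(points, sample_size, seed=0, distance=euclidean):
    """Sample points without replacement and return the upper triangle
    of their pairwise distance matrix as a flat list."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return [
        distance(sample[i], sample[j])
        for i in range(len(sample))
        for j in range(i + 1, len(sample))
    ]

def summarize(dists):
    """Candidate starting values for T2 drawn from the sampled distances."""
    return {
        "min": min(dists),
        "median": statistics.median(dists),
        "mean": statistics.mean(dists),
        "max": max(dists),
    }
```

Something like summarize(sampled_distances(points, 200)) gives a range to pick an initial T2 from, for example somewhere between the minimum and the median, before adjusting it based on the number of clusters Canopy returns.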

-----Original Message-----
From: Paul Mahon [mailto:pmahon@decarta.com] 
Sent: Wednesday, April 27, 2011 1:46 PM
To: user@mahout.apache.org
Subject: Re: Finding thresholds for canopy

If you have a guess at how many clusters you want you could take the 
total area of the space and divide by the number of clusters to get an 
initial guess of T2 or T1. That might work to get you started, 
depending on the distribution.

On 04/27/2011 12:39 PM, Camilo Lopez wrote:
> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?

Re: Finding thresholds for canopy

Posted by Paul Mahon <pm...@decarta.com>.

If you have a guess at how many clusters you want you could take the 
total area of the space and divide by the number of clusters to get an 
initial guess of T2 or T1. That might work to get you started, 
depending on the distribution.

On 04/27/2011 12:39 PM, Camilo Lopez wrote:
> I'm using Canopy as first step for K-means clustering, is there any algorithmic, or even a good heuristic to estimate good T1 and T2 from the vectorized data?