Posted to user@mahout.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2011/05/17 17:04:29 UTC

Re: Finding thresholds for canopy

Hi Jeff,

After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?

Frank

On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman <je...@narus.com> wrote:
> Worth a try, but it ultimately boils down to the distance measure you've chosen, the distributions of input vectors and T2. As a pre-run experiment, you could sample some points from your data set (e.g. using RandomSeedGenerator as you would to prime k-means), then build a distance matrix using your chosen distance measure. That would give you a T2 starting point in a more systematic manner than grabbing it completely out of thin air.
>
> -----Original Message-----
> From: Paul Mahon [mailto:pmahon@decarta.com]
> Sent: Wednesday, April 27, 2011 1:46 PM
> To: user@mahout.apache.org
> Subject: Re: Finding thresholds for canopy
>
> If you have a guess at how many clusters you want you could take the
> total area of the space and divide by the number of clusters to get an
> initial guess of T2 or T1. That might work to get you started,
> depending on the distribution.
>
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as a first step for K-means clustering. Is there any algorithm, or even a good heuristic, to estimate good T1 and T2 from the vectorized data?
>
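
For illustration, here is one possible reading of Paul's suggestion as a small Python sketch. It is not Mahout code. It assumes dense numeric vectors, takes the "total area of the space" to be the volume of the axis-aligned bounding box of the data, and turns the per-cluster volume into a length by taking the d-th root so that it can be compared against distances. The function name bounding_box_t2 and the parameter k are hypothetical.

```python
import math

def bounding_box_t2(vectors, k):
    """Rough T2 guess: side length of a cube holding 1/k of the bounding box.

    Assumption: 'total area of the space' means the bounding-box volume.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    dims = len(vectors[0])
    # Extent of the data along each dimension.
    extents = [max(v[d] for v in vectors) - min(v[d] for v in vectors)
               for d in range(dims)]
    volume = math.prod(extents)
    # Volume per cluster, converted to a length scale via the d-th root.
    return (volume / k) ** (1.0 / dims)
```

A dimension with zero extent collapses the volume to 0, and in high-dimensional or sparse data the bounding box is mostly empty, so this is only a starting value to refine, in line with the "depending on the distribution" caveat above.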

RE: Finding thresholds for canopy

Posted by Jeff Eastman <je...@Narus.com>.

Sure, seems like a reasonable place to start. I will be fascinated to hear how it works.
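
The pre-run experiment discussed in this thread (sample some points, build a distance matrix, look at the average distance) might be sketched as follows. This is a minimal standalone Python illustration, not Mahout code: random.sample stands in for RandomSeedGenerator, Euclidean distance stands in for whatever distance measure you have chosen, and the names pairwise_distances, suggest_t2 and sample_size are hypothetical.

```python
import math
import random
from statistics import mean, median

def euclidean(a, b):
    """Stand-in distance measure; swap in the one you actually cluster with."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(points, distance=euclidean):
    """Upper triangle of the distance matrix, flattened into a list."""
    return [distance(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]

def suggest_t2(vectors, sample_size=100, distance=euclidean, seed=42):
    """Summarize pairwise distances over a random sample of the input."""
    if len(vectors) < 2:
        raise ValueError("need at least two vectors")
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = pairwise_distances(sample, distance)
    return {
        "mean": mean(dists),
        "median": median(dists),
        "min": min(dists),
        "max": max(dists),
    }
```

The mean is the candidate T2 raised in the question. Reporting the median, minimum and maximum alongside it shows whether the mean is being pulled by a few distant points; either way the result is a starting value to adjust after a trial run, not a tuned threshold.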
