Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2011/11/02 04:31:53 UTC

Canopy and other clustering approaches

While reviewing clustering for an upcoming training, I came across something we claim about Canopy clustering that I wanted to check here first. In the lectures and other material I've seen, the idea is to run Canopy first and then some more expensive algorithm, such as k-means, so that items further away than T2 are not even considered when scoring a centroid in the more expensive algorithm. However, I think I'm missing where in the code this actually happens. We do have code that lets k-means use the Canopy centroids as its initial centroids, but the other material seemed to imply that more aggressive pruning was possible, since points outside of T2 would not even need to be considered. Otherwise, it doesn't seem like we save anything by running Canopy first, other than likely getting a better set of starting centroids. I haven't thought about how this would be implemented.
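
For illustration only, the pruning idea could look roughly like the sketch below. It is plain Python, not Mahout code, and every name in it (build_canopies, pruned_kmeans, the toy points) is hypothetical. It assumes the usual canopy formulation with a loose threshold T1 and a tight threshold T2 (T1 > T2 > 0): canopy membership is decided by T1, while T2 only controls which points stop being candidate centres. So the sketch prunes on canopy membership rather than on T2 directly, which is an assumption about what the lectures meant.

```python
import math

def build_canopies(points, t1, t2):
    """Return a list of (centre, member_indices); assumes t1 > t2 > 0."""
    assert t1 > t2 > 0
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        centre = points[remaining[0]]
        # Every point within the loose threshold T1 joins this canopy.
        members = {i for i, p in enumerate(points) if math.dist(p, centre) < t1}
        canopies.append((centre, members))
        # Points within the tight threshold T2 can no longer become centres.
        remaining = [i for i in remaining if math.dist(points[i], centre) >= t2]
    return canopies

def pruned_kmeans(points, canopies, iterations=5):
    """k-means seeded with canopy centres; centroid k is only scored
    against points that are members of canopy k (hypothetical pruning rule)."""
    centroids = [centre for centre, _ in canopies]
    candidates = [
        [k for k, (_, members) in enumerate(canopies) if i in members]
        for i in range(len(points))
    ]
    comparisons = 0
    assign = []
    for _ in range(iterations):
        assign = []
        for i, p in enumerate(points):
            # Distances are computed only for centroids sharing a canopy with p.
            best = min(candidates[i], key=lambda k: math.dist(p, centroids[k]))
            comparisons += len(candidates[i])
            assign.append(best)
        for k in range(len(centroids)):
            mine = [points[i] for i, a in enumerate(assign) if a == k]
            if mine:
                centroids[k] = tuple(sum(x) / len(mine) for x in zip(*mine))
    return centroids, assign, comparisons

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
canopies = build_canopies(points, t1=3.0, t2=1.5)
centroids, assign, comparisons = pruned_kmeans(points, canopies, iterations=5)
unpruned = 5 * len(points) * len(canopies)  # cost if every centroid were scored
print(assign, comparisons, unpruned)
```

On this toy data the two groups never share a canopy, so each point is scored against one centroid per iteration instead of two. With overlapping canopies the saving would be smaller, and the result can differ from unpruned k-means.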

Then again, it's late and I'm tired.

-Grant

Re: Canopy and other clustering approaches

Posted by Grant Ingersoll <gs...@apache.org>.

Cool.

On Nov 1, 2011, at 11:46 PM, Paritosh Ranjan wrote:

> Hi Grant,
> 
> I have been working on Top Down Clustering. https://issues.apache.org/jira/browse/MAHOUT-843
> 
> In this, the top-level clustering algorithm (e.g. Canopy) can run with large T1 and T2 values. Then any other clustering algorithm (selected by the user) is executed on the clusters produced by the top-level clustering.
> 
> I have been able to configure top-level and bottom-level clustering with some of the available clustering algorithms.
> 
> I will be submitting the patch sometime this week. Using it, we will be able to run Canopy clustering (or another clustering algorithm) first to extract bigger clusters, and then apply other fine-grained clustering algorithms to the extracted clusters.
> 
> I think this will help in achieving what is needed.
> 
> Thanks and Regards,
> Paritosh
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: Canopy and other clustering approaches

Posted by Paritosh Ranjan <pr...@xebia.com>.

Hi Grant,

I have been working on Top Down Clustering. 
https://issues.apache.org/jira/browse/MAHOUT-843

In this, the top-level clustering algorithm (e.g. Canopy) can run
with large T1 and T2 values. Then any other clustering algorithm
(selected by the user) is executed on the clusters produced by the
top-level clustering.

I have been able to configure top-level and bottom-level clustering with
some of the available clustering algorithms.

I will be submitting the patch sometime this week. Using it, we will
be able to run Canopy clustering (or another clustering algorithm)
first to extract bigger clusters, and then apply other fine-grained
clustering algorithms to the extracted clusters.

I think this will help in achieving what is needed.

Thanks and Regards,
Paritosh
