Posted to dev@mahout.apache.org by Lahiru Samarakoon <la...@gmail.com> on 2011/03/02 11:50:10 UTC

Hierarchical Agglomerative Clustering

Hi,

 I am Lahiru Samarakoon, an M.Phil. student at the University of Moratuwa,
Sri Lanka. My research is in machine learning. I am interested in
implementing *Hierarchical Agglomerative Clustering* (HAC) in Mahout as a
GSoC project.

 Currently, Mahout does not have an HAC implementation. I believe the
project and the community would benefit from HAC integration.

 Please advise.

 Thank you,

 Best Regards,

Lahiru

Re: Hierarchical Agglomerative Clustering

Posted by Lahiru Samarakoon <la...@gmail.com>.
Hi Devs,

 Mahout's clustering algorithms list includes Hierarchical Clustering
(https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms).

So I guess HAC is on the roadmap?

 AT&T Labs describes an efficient implementation of HAC based on on-the-fly
cluster distance computation [1]. However, for the GSoC scope, standalone
and Hadoop-based implementations plus different similarity measures are
realistic.
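To make the on-the-fly idea concrete, here is a minimal single-link HAC sketch in which cluster distances are computed lazily from the member vectors at merge time, rather than from a precomputed n x n distance matrix. This is hypothetical illustration code (class and method names are my own), not the AT&T implementation and not a Mahout API:

```java
import java.util.*;

public class SingleLinkHac {
    // Euclidean distance between two dense vectors.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Single-link linkage: minimum pairwise distance between two clusters,
    // computed on the fly from the members.
    static double linkage(double[][] pts, List<Integer> a, List<Integer> b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) for (int j : b) min = Math.min(min, dist(pts[i], pts[j]));
        return min;
    }

    // Agglomerate bottom-up until k clusters remain; returns point indices.
    static List<List<Integer>> cluster(double[][] points, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < points.length; i++)
            clusters.add(new ArrayList<>(List.of(i)));
        while (clusters.size() > k) {
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(points, clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            // Merge the closest pair (bj > bi, so removal does not shift bi).
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(cluster(pts, 2).size());  // prints 2
    }
}
```

This naive version is O(n^3); the point of the cited work is precisely to avoid that cost in practice.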

 Please comment on this idea.

[1] L. Begeja, B. Renger, D. Gibbon, Z. Liu, and B. Shahraray, "Interactive
machine learning techniques for improving SLU models."


 Thank you,

 Best Regards,

Lahiru

Re: Hierarchical Agglomerative Clustering

Posted by Lahiru Samarakoon <la...@gmail.com>.
Hi Ted,

In HAC algorithms, a large number of dot-product computations is required,
so we can use an inverted index (e.g. a Lucene index) to improve
performance. We can adopt a practical formula for similarity computation,
as is done in Lucene scoring.
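As a sketch of why an inverted index helps here: with term-to-postings lists, the dot products of a sparse query document against the whole collection touch only documents that share at least one term. The class and method names below are my own illustration, not Lucene's actual API:

```java
import java.util.*;

public class InvertedIndexDot {
    // term -> (docId -> term weight): one postings map per term.
    final Map<String, Map<Integer, Double>> postings = new HashMap<>();

    // Index a sparse document given as term -> weight.
    void add(int docId, Map<String, Double> doc) {
        for (Map.Entry<String, Double> e : doc.entrySet())
            postings.computeIfAbsent(e.getKey(), t -> new HashMap<>())
                    .put(docId, e.getValue());
    }

    // Dot products of `query` with every indexed doc sharing a term;
    // documents with no common term are never visited.
    Map<Integer, Double> dotAll(Map<String, Double> query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> q : query.entrySet()) {
            Map<Integer, Double> p = postings.get(q.getKey());
            if (p == null) continue;
            for (Map.Entry<Integer, Double> e : p.entrySet())
                scores.merge(e.getKey(), q.getValue() * e.getValue(), Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        InvertedIndexDot idx = new InvertedIndexDot();
        idx.add(0, Map.of("machine", 1.0, "learning", 2.0));
        idx.add(1, Map.of("hadoop", 3.0));
        System.out.println(idx.dotAll(Map.of("learning", 1.0, "hadoop", 1.0)));
        // doc 0 scores 2.0 (shared "learning"), doc 1 scores 3.0 (shared "hadoop")
    }
}
```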

 Most of the time, the documents being clustered are high-dimensional
sparse vectors, so the number of computations required is small. One
exception is the dense centroids that arise when using group-average
agglomerative clustering. This issue can be addressed by using medoids (the
document vector closest to the centroid) instead of dense centroids.

 Anyway, did Mahout address this dense-centroid issue in its k-means
implementation?

 However, with very large datasets, HAC is infeasible. In such scenarios,
we can use an HAC algorithm with a low threshold to compute high-quality
seeds for the k-means algorithm (currently Canopy is used for this).
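The seeding idea above can be sketched as follows: agglomerate a small sample bottom-up with centroid linkage, stop as soon as the closest pair of clusters is farther apart than a threshold, and hand the surviving centroids to k-means as seeds. This is hypothetical illustration code, not Mahout's Canopy implementation:

```java
import java.util.*;

public class HacSeeder {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Centroid-linkage agglomeration over a sample, with a stopping
    // threshold; returns cluster centroids usable as k-means seeds.
    static List<double[]> seeds(double[][] sample, double threshold) {
        List<double[]> cents = new ArrayList<>();
        for (double[] p : sample) cents.add(p.clone());
        List<Integer> sizes = new ArrayList<>(Collections.nCopies(cents.size(), 1));
        while (cents.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < cents.size(); i++)
                for (int j = i + 1; j < cents.size(); j++) {
                    double d = dist(cents.get(i), cents.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > threshold) break;  // remaining clusters are well separated
            // Merge bj into bi as the size-weighted average of the centroids.
            int ni = sizes.get(bi), nj = sizes.get(bj);
            double[] merged = new double[cents.get(bi).length];
            for (int k = 0; k < merged.length; k++)
                merged[k] = (ni * cents.get(bi)[k] + nj * cents.get(bj)[k]) / (ni + nj);
            cents.set(bi, merged);
            sizes.set(bi, ni + nj);
            cents.remove(bj);
            sizes.remove(bj);
        }
        return cents;
    }

    public static void main(String[] args) {
        double[][] sample = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(seeds(sample, 3.0).size());  // prints 2
    }
}
```

A low threshold keeps the seeds tight; the number of seeds produced then suggests a value of k, which is one advantage over picking k blindly.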

 Even though there are limitations, I believe it is suitable and worthwhile
to include HAC in Mahout.


Thanks,

Lahiru

Re: Hierarchical Agglomerative Clustering

Posted by Ted Dunning <te...@gmail.com>.
Can you say how you would do this in a scalable way?

On Wed, Mar 2, 2011 at 2:50 AM, Lahiru Samarakoon <la...@gmail.com> wrote:

> Hi,
>
>  I am Lahiru Samarakoon, an M.Phil. student at the University of Moratuwa,
> Sri Lanka. My research is in machine learning. I am interested in
> implementing *Hierarchical Agglomerative Clustering* (HAC) in Mahout as a
> GSoC project.
>
>  Currently, Mahout does not have an HAC implementation. I believe the
> project and the community would benefit from HAC integration.
>
>  Please advise.
>
>  Thank you,
>
>  Best Regards,
>
> Lahiru
>