Posted to user@mahout.apache.org by Chih-Hsien Wu <ch...@gmail.com> on 2013/11/26 15:29:49 UTC

Good centroid generation algorithm for top-down clustering approach

Hi all, I'm trying to cluster text documents using a top-down approach. I
have tried both random seed and canopy generation, and have seen
their pros and cons. I realize that canopy works well when the exact number
of clusters is not known; however, canopy needs a lot of memory. I was
hoping to find something similar to canopy generation and was wondering
whether anyone has other recommendations.

Re: Good centroid generation algorithm for top-down clustering approach

Posted by Ted Dunning <te...@gmail.com>.
No.  Streaming k-means builds a small version of your data so that you can
use ball k-means or any other clever memory-resident centroid generation
algorithm.

See

http://www.cs.ucla.edu/~rafail/PUBLIC/76.pdf

http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
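
As a rough illustration of the in-memory step described above, here is a
minimal Python sketch of a ball k-means style pass over a small weighted
summary of the data. It is not Mahout's implementation. The function names
(seed_kmeans_pp, ball_kmeans), the trim parameter and its default of 0.5,
and the fixed iteration count are all assumptions made for this example.
The input is assumed to be a list of (vector, weight) pairs, such as a
streaming pass might produce.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_mean(members, dim):
    """Weighted mean of a non-empty list of (vector, weight) pairs."""
    total = sum(w for _, w in members)
    return [sum(v[i] * w for v, w in members) / total for i in range(dim)]


def seed_kmeans_pp(sketch, k, rng):
    """k-means++ style seeding: sample with probability ~ weight * squared distance."""
    vectors = [v for v, _ in sketch]
    weights = [w for _, w in sketch]
    centers = [list(rng.choices(vectors, weights=weights)[0])]
    while len(centers) < k:
        scores = [w * min(sq_dist(v, c) for c in centers) for v, w in sketch]
        if sum(scores) == 0:
            break  # every point coincides with an existing center
        centers.append(list(rng.choices(vectors, weights=scores)[0]))
    return centers


def ball_kmeans(sketch, k, iterations=10, trim=0.5, seed=0):
    """Update each center from the points inside a ball around it only.

    The ball radius is `trim` times the distance to the closest other
    center (hypothetical default), so points near cluster boundaries
    do not pull the centers around.
    """
    rng = random.Random(seed)
    dim = len(sketch[0][0])
    centers = seed_kmeans_pp(sketch, k, rng)
    for _ in range(iterations):
        new_centers = []
        for i, center in enumerate(centers):
            others = [sq_dist(center, c) for j, c in enumerate(centers) if j != i]
            # Distances are squared, so the radius factor is squared too.
            radius_sq = (trim ** 2) * min(others) if others else float("inf")
            ball = [(v, w) for v, w in sketch if sq_dist(v, center) <= radius_sq]
            new_centers.append(weighted_mean(ball, dim) if ball else center)
        centers = new_centers
    return centers


if __name__ == "__main__":
    # Made-up sketch: two groups of weighted centroids.
    demo = [([0.0, 0.0], 40.0), ([1.0, 0.5], 10.0),
            ([100.0, 100.0], 30.0), ([101.0, 99.0], 20.0)]
    print(ball_kmeans(demo, k=2))
```

Because the summary is small, a relatively expensive seeding step like the
one above stays cheap, which is the point of clustering the reduced data
rather than the original documents.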




On Tue, Nov 26, 2013 at 8:36 AM, Chih-Hsien Wu <ch...@gmail.com> wrote:

> I've heard about it but not familiar with it. Does Streaming K generate a
> list of centroids for other clustering algorithm?

Re: Good centroid generation algorithm for top-down clustering approach

Posted by Chih-Hsien Wu <ch...@gmail.com>.
I've heard of it but am not familiar with it. Does streaming k-means generate
a list of centroids for other clustering algorithms?


On Tue, Nov 26, 2013 at 10:55 AM, Ted Dunning <te...@gmail.com> wrote:

> Have you looked at the streaming k-means work?  The basic idea is that you
> generate a sketch of the data which you can then cluster in-memory.  That
> lets you use very advanced centroid generation algorithms that require lots
> of processing.

Re: Good centroid generation algorithm for top-down clustering approach

Posted by Ted Dunning <te...@gmail.com>.
Have you looked at the streaming k-means work?  The basic idea is that you
generate a sketch of the data which you can then cluster in-memory.  That
lets you use very advanced centroid generation algorithms that require lots
of processing.
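
The sketch idea can be illustrated with a small one-pass routine in Python.
This is a simplified sketch under stated assumptions, not Mahout's
streaming k-means code: the names (add_point, streaming_sketch), the
starting cutoff of 1.0 and the growth factor of 1.5 are hypothetical
choices for this example. Each incoming point either merges into its
nearest centroid or, with probability proportional to its squared distance,
starts a new one. When there are too many centroids, the cutoff is raised
and the centroids themselves are folded together.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def add_point(centroids, vec, weight, cutoff, rng):
    """Merge a weighted point into its nearest centroid or start a new one."""
    if not centroids:
        centroids.append([list(vec), weight])
        return
    nearest = min(centroids, key=lambda c: sq_dist(c[0], vec))
    d = sq_dist(nearest[0], vec)
    # Far-away points are likely to become new centroids.
    if rng.random() < weight * d / cutoff:
        centroids.append([list(vec), weight])
    else:
        total = nearest[1] + weight
        nearest[0] = [(m * nearest[1] + x * weight) / total
                      for m, x in zip(nearest[0], vec)]
        nearest[1] = total


def streaming_sketch(stream, max_centroids, cutoff=1.0, growth=1.5, seed=0):
    """Single pass over `stream`, returning at most `max_centroids` weighted centroids."""
    if max_centroids < 1:
        raise ValueError("max_centroids must be at least 1")
    rng = random.Random(seed)
    centroids = []
    for vec in stream:
        add_point(centroids, vec, 1.0, cutoff, rng)
        # Too many centroids: raise the cutoff and fold the centroids together.
        while len(centroids) > max_centroids:
            cutoff *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for c_vec, c_weight in old:
                add_point(centroids, c_vec, c_weight, cutoff, rng)
    return centroids


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Made-up data: two Gaussian blobs in two dimensions.
    points = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
    points += [[data_rng.gauss(20, 1), data_rng.gauss(20, 1)] for _ in range(1000)]
    sketch = streaming_sketch(points, max_centroids=20)
    print(len(sketch), "weighted centroids summarize", len(points), "points")
```

The returned list of [vector, weight] pairs is small enough to hold in
memory no matter how long the stream is, so a more expensive clustering
algorithm can then be run on it directly.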




On Tue, Nov 26, 2013 at 6:29 AM, Chih-Hsien Wu <ch...@gmail.com> wrote:

> Hi all, I'm trying to clustering text documents via top-down approach. I
> have experienced both random seed and canopy generation, and have seen
> their pros and cons. I realize that canopy is great for not known exact
> cluster numbers; nevertheless, the memory need for canopy is great. I was
> hoping to find something similar to canopy generation and was wondering if
> there is any other recommendation?
>