
Posted to user@mahout.apache.org by Mike Hugo <mi...@piragua.com> on 2013/09/17 21:06:54 UTC

Clustering algorithms

Hello,

I'm new to Mahout but have been working with Solr and Carrot2, clustering
documents with the Lingo algorithm.  This has worked well for us for
clustering small sets of search results, but we are now branching out into
clustering larger sets of documents (millions to tens of millions of
documents for now).

Could someone point me in the right direction as to which of the clustering
algorithms I should take a look at first (that would be similar to Lingo)?

Thanks,

Mike

Re: Clustering algorithms

Posted by Mike Hugo <mi...@piragua.com>.

Thanks Ted!


On Tue, Sep 17, 2013 at 2:59 PM, Ted Dunning <te...@gmail.com> wrote:

> Right now the best option in Mahout for speed without losing quality is
> the streaming k-means implementation.
>
> One exciting possibility is that you can probably combine a streaming
> k-means pre-pass with a regularized k-means algorithm to get results more
> like Lingo.  You could also follow with a DP-means pass to get an idea of
> the optimal number of clusters.
>
> The idea with streaming k-means is that a first pass does a rough
> clustering into a whole lot of clusters.  This pass is fast because only
> approximate search is needed.  It is also adaptive so you only have to
> specify very roughly how many clusters you will probably be interested in
> having later.  The output is an approximate k-means clustering with many
> more clusters than you asked for.  This output can then be clustered in
> memory using any weighted clustering algorithm you care to use.  For
> k-means and certain kinds of data, you can even get nice probabilistic
> accuracy bounds for the combo.

Re: Clustering algorithms

Posted by Ted Dunning <te...@gmail.com>.

Right now the best option in Mahout for speed without losing quality is
the streaming k-means implementation.

One exciting possibility is that you can probably combine a streaming
k-means pre-pass with a regularized k-means algorithm to get results more
like Lingo.  You could also follow with a DP-means pass to get an idea of
the optimal number of clusters.
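
To make the DP-means idea concrete, here is a minimal sketch in plain
Python.  It is an illustration only, not Mahout code: the function names,
the `lam` threshold (a squared distance) and the optional `weights`
argument are all assumptions made for this example.  A point that is
farther than `lam` from every existing centroid starts a new cluster, so
the number of clusters falls out of the data instead of being fixed up
front.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dp_means(points, lam, weights=None, max_iter=50):
    """DP-means sketch: returns (centroids, assignment).

    `lam` is a hypothetical name for the squared-distance threshold above
    which a point opens a new cluster instead of joining the nearest one.
    """
    weights = weights or [1.0] * len(points)
    dim = len(points[0])
    total = sum(weights)
    # Start from a single cluster at the weighted global mean.
    centroids = [[sum(w * p[d] for p, w in zip(points, weights)) / total
                  for d in range(dim)]]
    assign = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            if sq_dist(p, centroids[j]) > lam:
                centroids.append(list(p))      # too far from everything: new cluster
                j = len(centroids) - 1
            if assign[i] != j:
                assign[i], changed = j, True
        # Recompute each non-empty cluster's weighted mean.
        for j in range(len(centroids)):
            members = [(p, w) for p, w, a in zip(points, weights, assign) if a == j]
            if members:
                mass = sum(w for _, w in members)
                centroids[j] = [sum(w * p[d] for p, w in members) / mass
                                for d in range(dim)]
        if not changed:
            break
    # Drop clusters that ended up empty and renumber the rest.
    used = sorted(set(assign))
    remap = {old: new for new, old in enumerate(used)}
    return [centroids[j] for j in used], [remap[a] for a in assign]
```

Running this for a few values of `lam` and watching how the number of
returned centroids changes is one rough way to get a feel for how many
clusters the data supports.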

The idea with streaming k-means is that a first pass does a rough
clustering into a whole lot of clusters.  This pass is fast because only
approximate search is needed.  It is also adaptive so you only have to
specify very roughly how many clusters you will probably be interested in
having later.  The output is an approximate k-means clustering with many
more clusters than you asked for.  This output can then be clustered in
memory using any weighted clustering algorithm you care to use.  For
k-means and certain kinds of data, you can even get nice probabilistic
accuracy bounds for the combo.
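
The two-stage idea described above can be sketched roughly as follows.
This is a simplified illustration in plain Python and not Mahout's
implementation: the function names, the `cutoff` and `growth` parameters,
the use of squared Euclidean distance and the exact nearest-centroid scan
(where a real implementation would use approximate search) are all
assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def merge(centroids, point, weight, cutoff, rng):
    """Fold a weighted point into its nearest centroid, or open a new one.

    The farther the point is from everything seen so far, relative to
    `cutoff`, the more likely it is to become a new centroid.
    """
    if centroids:
        j = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i][0]))
        c, w = centroids[j]
        if rng.random() >= min(1.0, weight * sq_dist(point, c) / cutoff):
            total = w + weight
            centroids[j] = ([(w * x + weight * y) / total for x, y in zip(c, point)],
                            total)
            return
    centroids.append((list(point), weight))

def streaming_pass(points, max_centroids, cutoff=1e-3, growth=1.5, seed=0):
    """One pass over the data; returns a list of (centroid, weight) pairs.

    `max_centroids` (assumed >= 1) should be much larger than the final k.
    When the sketch grows past it, the cutoff is raised and the centroids
    themselves are re-clustered, which is the adaptive part.
    """
    rng = random.Random(seed)
    centroids = []
    for p in points:
        merge(centroids, p, 1.0, cutoff, rng)
        while len(centroids) > max_centroids:
            cutoff *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for c, w in old:
                merge(centroids, c, w, cutoff, rng)
    return centroids

def weighted_kmeans(sketch, k, iters=20, seed=0):
    """In-memory weighted Lloyd iterations over the sketch (needs k <= len(sketch))."""
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        mass = [0.0] * k
        for c, w in sketch:
            j = min(range(k), key=lambda i: sq_dist(c, centers[i]))
            mass[j] += w
            for d, x in enumerate(c):
                sums[j][d] += w * x
        # Keep the old center if a cluster received no weight this round.
        centers = [[s / mass[j] for s in sums[j]] if mass[j] else centers[j]
                   for j in range(k)]
    return centers
```

A call such as `weighted_kmeans(streaming_pass(points, 200), 10)` would
run the rough pass and then the in-memory pass.  The second stage only
ever sees the weighted sketch, so it could be swapped for any other
weighted clustering algorithm.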
