
Posted to user@mahout.apache.org by Borbála Siklósi <si...@gmail.com> on 2010/11/02 19:54:04 UTC

clustering after search

I may have quite a simple question, but I haven't been able to find the
solution. I have a Solr index of documents and I run k-means clustering
on them. It all works fine. How can I run a keyword search on the
Solr index and run the clustering only on the result set? Can I somehow
specify which documents the algorithm should cluster?

Re: clustering after search

Posted by Ted Dunning <te...@gmail.com>.

You definitely can do this.

Assuming that you are using Lucene to do the search, you should be able to
adapt the Lucene vector exporter to export a specified list of documents as
vectors and then run the normal clustering operations. The clustering itself
should be reasonably fast, but exporting a few hundred documents from Lucene
will probably be pretty slow for a large index.
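To make the idea concrete, here is a minimal sketch of "vectorize only the hits, then cluster". It is not Mahout or Lucene code: the dictionary `index` stands in for the search index, `hit_ids` for the IDs returned by a keyword search, and the function names, the term-frequency weighting, and the toy k-means loop are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tf_vector(text, vocab):
    """Term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocab]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members, if it has any.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_search_hits(index, hit_ids, k):
    """Cluster only the documents whose IDs came back from the search."""
    texts = [index[doc_id] for doc_id in hit_ids]
    vocab = sorted({term for text in texts for term in text.lower().split()})
    vectors = [tf_vector(text, vocab) for text in texts]
    return dict(zip(hit_ids, kmeans(vectors, k)))


# Hypothetical data: `index` plays the role of the full index, `hit_ids`
# the result set of a keyword search. "d5" is indexed but was not a hit.
index = {
    "d1": "solr search index",
    "d2": "solr search query",
    "d3": "kmeans cluster vector",
    "d4": "kmeans cluster centroid",
    "d5": "unrelated document text",
}
hit_ids = ["d1", "d2", "d3", "d4"]
labels = cluster_search_hits(index, hit_ids, k=2)
```

With a real index, the dictionary lookup would be replaced by the adapted exporter, and that export step is where the cost mentioned above would show up.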

If you just mean sorting a set of documents into pre-existing clusters, you
can do that as well. You would start with the same exporter, but would need
to glue more pieces together to build the classifier part.
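One simple reading of "sort documents into pre-existing clusters" is nearest-centroid assignment. The sketch below assumes the centroids were already produced by an earlier clustering run and that new documents have been exported as vectors in the same term space; the cluster names, document IDs, and vectors are made up for illustration, and this is not the Mahout classifier API.

```python
import math


def nearest_cluster(vector, centroids):
    """Return the name of the centroid closest to `vector` (Euclidean)."""
    return min(centroids, key=lambda name: math.dist(vector, centroids[name]))


# Hypothetical centroids from an earlier clustering run.
centroids = {
    "search": [1.0, 1.0, 0.0, 0.0],
    "clustering": [0.0, 0.0, 1.0, 1.0],
}

# Hypothetical vectors for newly exported documents.
new_docs = {
    "doc-7": [2.0, 1.0, 0.0, 0.0],
    "doc-8": [0.0, 0.0, 1.0, 3.0],
}

labels = {doc_id: nearest_cluster(vec, centroids) for doc_id, vec in new_docs.items()}
```

A trained classifier could replace the distance rule, but the surrounding plumbing (export, vectorize, assign) would stay the same.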

My guess is that I am still not quite understanding what you want.  Did my
suggestions come at all close?

On Tue, Nov 2, 2010 at 11:48 PM, Borbála Siklósi <si...@gmail.com> wrote:

> Yes, I know Carrot, but using it is not an option for me. Isn't there
> any way to tell Mahout which subset of documents to cluster?

Re: clustering after search

Posted by Borbála Siklósi <si...@gmail.com>.

Yes, I know Carrot, but using it is not an option for me. Isn't there
any way to tell Mahout which subset of documents to cluster?

2010/11/2 Ted Dunning <te...@gmail.com>

> Have you looked at Carrot? It works very well.
>
> http://search.carrot2.org/stable/search

Re: clustering after search

Posted by Ted Dunning <te...@gmail.com>.

Have you looked at Carrot? It works very well.

http://search.carrot2.org/stable/search

On Tue, Nov 2, 2010 at 11:54 AM, Borbála Siklósi <si...@gmail.com> wrote:

> I may have quite a simple question, but I haven't been able to find the
> solution. I have a Solr index of documents and I run k-means clustering
> on them. It all works fine. How can I run a keyword search on the
> Solr index and run the clustering only on the result set? Can I somehow
> specify which documents the algorithm should cluster?
>

Re: clustering after search

Posted by Grant Ingersoll <gs...@apache.org>.

Hmm, you should come to ApacheCon tomorrow in Atlanta, where I will be talking about and demonstrating this.

Assuming you won't, I've started a small prototype that hooks KMeans clustering into Solr's ClusteringComponent and runs the whole index through KMeans. The hooks are in there for clustering a DocSet (i.e. the results of a search), as you are suggesting, but I haven't implemented that yet and I don't know how well it will perform, especially compared to Carrot2, which is already integrated into Solr and is better designed for the kind of thing you want.
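For comparison, the Carrot2 route clusters the top hits of a query at request time. The sketch below only builds the request URL. The `clustering=true` parameter switches on Solr's ClusteringComponent, but the host, port, and the `/clustering` handler path are assumptions that depend on how the local solrconfig.xml is set up, and the function name is made up for illustration.

```python
from urllib.parse import urlencode


def clustering_query_url(base_url, keywords, rows=100):
    """Build a Solr query URL that asks for clustering of the top `rows` hits."""
    params = {
        "q": keywords,         # the keyword search
        "rows": rows,          # how many hits are fetched and clustered
        "clustering": "true",  # enable the ClusteringComponent for this request
        "wt": "json",          # response format
    }
    return f"{base_url}?{urlencode(params)}"


# Assumed local Solr instance with a handler registered at /clustering.
url = clustering_query_url("http://localhost:8983/solr/clustering", "kmeans")
```

Fetching this URL requires a running Solr with the clustering component configured. Only the hits returned for the query are clustered, which is the behaviour asked about in this thread.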

You can take a look at the _very early_ stage code at https://github.com/gsingers/ApacheCon2010. It is by no means production quality, and it is not even fully tested yet.

(For those interested, the code at that link also hooks in Mahout to provide recommendations and to classify documents using the Naive Bayes classifier. This last bit is courtesy of Drew, via our book Taming Text.)

The one gotcha with the code is that you have to un-WAR the Solr WAR file and put all the Mahout libs into it, because otherwise you get classloader problems between Solr's resource loader and Hadoop's class loader.

-Grant

On Nov 2, 2010, at 2:54 PM, Borbála Siklósi wrote:

> I may have quite a simple question, but I haven't been able to find the
> solution. I have a Solr index of documents and I run k-means clustering
> on them. It all works fine. How can I run a keyword search on the
> Solr index and run the clustering only on the result set? Can I somehow
> specify which documents the algorithm should cluster?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/