Posted to user@mahout.apache.org by Robert Stewart <bs...@gmail.com> on 2011/11/05 13:36:11 UTC

getting mahout clustering info back into lucene

If I run mahout clustering on lucene vectors, how would I go about getting that cluster information back into lucene, in order to use the cluster identifiers in field collapsing?

I know I can re-index with the new cluster info, but is there any way to put cluster info into an existing index (which may also be non-optimized and quite large)?  One way may be to have a custom field collapsing component that can read the Mahout cluster output.  Any thoughts?

Bob



Re: getting mahout clustering info back into lucene

Posted by Robert Stewart <bs...@gmail.com>.
I am thinking of this:

On commit:

1. Open the file produced by Mahout, which maps each unique document identifier to a cluster identifier.

2. Map the unique document identifiers to internal Lucene doc ids (using TermEnum/TermDocs on the unique-id field).

3. Output an integer array of size numDocs, where each value is a cluster identifier, so that indexing into the array by internal docid yields the cluster for that document.

During replication, that file gets copied to the slaves.

When a new snapshot is opened, that file can be loaded and cached in RAM.

During search, a collector collapses documents on the cluster ids (the same way field collapsing works).
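The array-building step could be sketched in plain Java roughly like this (the TermEnum/TermDocs walk and the Mahout output parsing are represented here as pre-built maps, so all names and inputs are hypothetical):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ClusterArrayBuilder {

    // Build an int array indexed by internal Lucene docid, holding the Mahout
    // cluster id for each document. Hypothetical inputs:
    //   uniqueIdToCluster - parsed from the Mahout clustering output file
    //   uniqueIdToDocId   - built by walking TermEnum/TermDocs on the unique field
    static int[] buildClusterArray(Map<String, Integer> uniqueIdToCluster,
                                   Map<String, Integer> uniqueIdToDocId,
                                   int numDocs) {
        int[] clusters = new int[numDocs];
        Arrays.fill(clusters, -1); // -1 marks documents not assigned to any cluster
        for (Map.Entry<String, Integer> e : uniqueIdToCluster.entrySet()) {
            Integer docId = uniqueIdToDocId.get(e.getKey());
            if (docId != null) {
                clusters[docId] = e.getValue();
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, Integer> idToCluster = new HashMap<>();
        idToCluster.put("doc-a", 7);
        idToCluster.put("doc-b", 3);
        Map<String, Integer> idToDocId = new HashMap<>();
        idToDocId.put("doc-a", 0);
        idToDocId.put("doc-b", 2);
        int[] clusters = buildClusterArray(idToCluster, idToDocId, 3);
        System.out.println(Arrays.toString(clusters)); // prints [7, -1, 3]
    }
}
```

The resulting array is cheap to serialize with the snapshot and to memory-map or cache on the slaves, and a collector only needs an O(1) lookup per hit.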




On Nov 5, 2011, at 10:06 AM, Grant Ingersoll wrote:

> 
> On Nov 5, 2011, at 8:36 AM, Robert Stewart wrote:
> 
>> If I run mahout clustering on lucene vectors, how would I go about getting that cluster information back into lucene, in order to use the cluster identifiers in field collapsing?
>> 
> 
> Since Lucene doesn't have incremental field update (which is seriously non-trivial to do in an inverted index), the only way to do this is to re-index.  Once DocValues are updateable, this may be a lot easier.  You could also perhaps use ParallelReader, but that has some restrictions (you have to keep docids in sync).
> 
> 
>> I know I can re-index with the new cluster info, but is there any way to put cluster info into an existing index (which may also be non-optimized and quite large)?  One way may be to have a custom field collapsing component that can read the Mahout cluster output.  Any thoughts?
> 
> Solr has some plugins around clustering already, if you are using that.  I've done some prototyping on hooking in Mahout, but there is nothing official yet.  I haven't looked at field collapsing in depth yet.
> 
> On trunk, you might be able to do some other fancy tricks to make this work via codecs.
> 
> -Grant
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 
> 


Re: getting mahout clustering info back into lucene

Posted by Ken Krugler <kk...@transpac.com>.
On Nov 5, 2011, at 7:06am, Grant Ingersoll wrote:

> 
> On Nov 5, 2011, at 8:36 AM, Robert Stewart wrote:
> 
>> If I run mahout clustering on lucene vectors, how would I go about getting that cluster information back into lucene, in order to use the cluster identifiers in field collapsing?
>> 
> 
> Since Lucene doesn't have incremental field update (which is seriously non-trivial to do in an inverted index), the only way to do this is to re-index.  Once DocValues are updateable, this may be a lot easier.  You could also perhaps use ParallelReader, but that has some restrictions (you have to keep docids in sync).
> 
>> I know I can re-index with the new cluster info, but is there any way to put cluster info into an existing index (which may also be non-optimized and quite large)?  One way may be to have a custom field collapsing component that can read the Mahout cluster output.  Any thoughts?

Two thoughts on this...

1. Normally for indexes that include clustering, we re-generate the complete Solr index using a Hadoop-based workflow, which includes all of the processing/machine learning.

One reason is that there's so much tweaking involved in getting good results that you often wind up needing to rebuild everything, rather than trying to do incremental updates.

2. You could potentially put the data into external fields, but then it would need to be used via a FunctionQuery.
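For the external-field route, a minimal sketch of what that might look like in Solr's schema.xml (the field and file names here are hypothetical; ExternalFileField keys off the unique-key field, and its values are only visible through function queries, not regular search or stored fields):

```
<!-- schema.xml: values are read from a file named external_cluster in the
     index data directory, with one line per document: uniqueId=clusterId -->
<fieldType name="clusterFile" class="solr.ExternalFileField"
           keyField="id" defVal="-1" valType="float"/>
<field name="cluster" type="clusterFile" indexed="false" stored="false"/>
```

The Mahout doc-to-cluster mapping would then be written out as that external file; as far as I recall, the file is re-read when a new searcher is opened, so updating cluster assignments would not require touching the index itself.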

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: getting mahout clustering info back into lucene

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 5, 2011, at 8:36 AM, Robert Stewart wrote:

> If I run mahout clustering on lucene vectors, how would I go about getting that cluster information back into lucene, in order to use the cluster identifiers in field collapsing?
> 

Since Lucene doesn't have incremental field update (which is seriously non-trivial to do in an inverted index), the only way to do this is to re-index.  Once DocValues are updateable, this may be a lot easier.  You could also perhaps use ParallelReader, but that has some restrictions (you have to keep docids in sync).


> I know I can re-index with the new cluster info, but is there any way to put cluster info into an existing index (which may also be non-optimized and quite large)?  One way may be to have a custom field collapsing component that can read the Mahout cluster output.  Any thoughts?

Solr has some plugins around clustering already, if you are using that.  I've done some prototyping on hooking in Mahout, but there is nothing official yet.  I haven't looked at field collapsing in depth yet.

On trunk, you might be able to do some other fancy tricks to make this work via codecs.

-Grant
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com