Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/10/12 05:00:43 UTC

Practical Advice on Clustering

I'm looking into adding document clustering capabilities to Solr,  
using Mahout [1][2].  I already have search-results clustering, thanks  
to Carrot2.  What I'm looking for is practical advice on deploying a  
system that is going to cluster potentially large corpora (but not  
huge, and let's assume one machine for now, but it shouldn't matter)

Here are some thoughts I have:

In Solr, I expect to send a request to go off and build the clusters  
for some non-trivial set of documents in the index.  The actual  
building needs to happen in a background thread, so as to not hold up  
the caller.  My thinking is the request will come in and spawn off a  
job that goes and calculates a similarity matrix for all the documents  
in the set (need to store the term vectors in Lucene) and then goes  
and runs the clustering job (user configurable, based on the  
implementations we have: k-means, mean-shift, fuzzy, whatever) and  
stores the results into Solr's data directory somehow (so that it can  
be replicated, but not a big concern of mine at the moment)
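The request-in, job-spawned-off flow described above could be sketched roughly as follows. This is a minimal sketch, not Solr's actual request-handler API; `ClusteringJobManager`, `submit`, and `runClustering` are hypothetical names, and the real work (term vectors, similarity matrix, the Mahout job) is left as a placeholder:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: a request comes in, the heavy clustering work is
// handed to a background executor, and a job id is returned immediately,
// so the caller is never blocked on the actual cluster building.
public class ClusteringJobManager {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final Map<String, String> status = new ConcurrentHashMap<>();

    /** Kick off clustering for a set of doc ids; returns a job id at once. */
    public String submit(List<Integer> docIds) {
        String jobId = UUID.randomUUID().toString();
        status.put(jobId, "RUNNING");
        pool.submit(() -> {
            try {
                runClustering(docIds);      // similarity matrix + k-means, etc.
                status.put(jobId, "DONE");
            } catch (Exception e) {
                status.put(jobId, "FAILED");
            }
        });
        return jobId;
    }

    public String getStatus(String jobId) {
        return status.getOrDefault(jobId, "UNKNOWN");
    }

    public void shutdown() {
        pool.shutdown();
    }

    // Placeholder for the real work (term vectors -> similarity -> clusters).
    private void runClustering(List<Integer> docIds) { /* ... */ }
}
```

The single-threaded executor is just the simplest choice for one machine; a real deployment would pick a pool size and a persistence story for the status map.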

Then, at any time, the application can ask Solr for the clusters  
(whatever that means) and it will return them (docids, fields,  
whatever the app asks for).  If the background task isn't done yet,  
the results set will be empty, or it will return a percentage  
completion or something useful.
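The "percentage completion" part could be as small as a shared counter the worker bumps and the status request reads. A hypothetical sketch (not an existing Solr or Mahout class), where a "unit" might be one row of the similarity matrix or one clustering iteration:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical progress tracker for a long-running clustering job: the
// worker thread increments the counter as it finishes units of work, and
// a status request reads the percentage from any thread.
public class ClusterProgress {
    private final long totalUnits;
    private final AtomicLong doneUnits = new AtomicLong();

    public ClusterProgress(long totalUnits) {
        this.totalUnits = totalUnits;
    }

    /** Called by the clustering worker after each completed unit. */
    public void unitDone() {
        doneUnits.incrementAndGet();
    }

    /** Safe to call from any thread serving a status request. */
    public int percentComplete() {
        if (totalUnits == 0) return 100;
        return (int) (doneUnits.get() * 100 / totalUnits);
    }

    public boolean isDone() {
        return doneUnits.get() >= totalUnits;
    }
}
```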

Obviously, my first step is to get it working, but...

Is it practical to return a partially done set of results?  i.e. the  
best clusters so far, with perhaps a percentage to completion value or  
perhaps a list of the comparisons that haven't been done yet?

What if something happens?  How can I make Mahout fault-tolerant, such
that, conceivably, I could pick up the job again from where it went
down, or at least be able to get the clusters so far?  How do people
approach this to date (w/ or w/o Mahout)?  What needs to be done in
Mahout to make this possible?  I suspect Hadoop has some support for it.
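One low-tech way to get "pick up where it went down" without leaning on Hadoop is to checkpoint after each iteration, assuming the clusterer is iterative like k-means, so that the centroids alone are enough to resume. A sketch under that assumption (the class and its file layout are hypothetical):

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical checkpointing for an iterative clusterer: after each
// iteration, persist the centroids; on restart, resume from the last
// checkpoint instead of starting the whole job over.
public class CentroidCheckpoint {
    private final Path file;

    public CentroidCheckpoint(Path file) {
        this.file = file;
    }

    /** Persist centroids safely: write to a temp file, then rename over. */
    public void save(double[][] centroids) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(centroids);
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns the last checkpoint, or null if the job never got that far. */
    public double[][] load() throws IOException, ClassNotFoundException {
        if (!Files.exists(file)) return null;
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(file))) {
            return (double[][]) in.readObject();
        }
    }
}
```

The write-then-rename keeps a crash mid-save from clobbering the previous good checkpoint, which also covers the weaker goal of "at least get the clusters so far".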

Anything else I don't know?  Does what I'm thinking about make sense?

Thanks for any insight,
Grant



[1] http://wiki.apache.org/solr/ClusteringComponent
[2] https://issues.apache.org/jira/browse/SOLR-769

Re: Practical Advice on Clustering

Posted by Grant Ingersoll <gs...@apache.org>.
On Oct 13, 2008, at 12:17 AM, Vaijanath N. Rao wrote:

> Hi Grant,
>
> My replies are inline.
>
> Grant Ingersoll wrote:
>> I'm looking into adding document clustering capabilities to Solr,  
>> using Mahout [1][2].  I already have search-results clustering,  
>> thanks to Carrot2.  What I'm looking for is practical advice on  
>> deploying a system that is going to cluster potentially large  
>> corpora (but not huge, and let's assume one machine for now, but it  
>> shouldn't matter)
>>
>> Here are some thoughts I have:
>>
>> In Solr, I expect to send a request to go off and build the  
>> clusters for some non-trivial set of documents in the index.  The  
>> actual building needs to happen in a background thread, so as to  
>> not hold up the caller.
> Bingo!  It's better to spawn a new process for clustering rather than
> hold up the caller.  It would also help to have a status page for the
> clustering job, so the caller can check against it to find out the
> current status.
>> My thinking is the request will come in and spawn off a job that  
>> goes and calculates a similarity matrix for all the documents in  
>> the set (need to store the term vectors in Lucene) and then goes  
>> and runs the clustering job (user configurable, based on the  
>> implementations we have: k-means, mean-shift, fuzzy, whatever) and  
>> stores the results into Solr's data directory somehow (so that it  
>> can be replicated, but not a big concern of mine at the moment)
> If we are going to work on a similarity matrix, I would like to add
> FIHC (Frequent Itemset Hierarchical Clustering).  If you need, I can
> definitely pitch in with this.  Ideally we should target replication,
> and I think the idea is good.

I'm open to anything.  I figured I'd start with the simplest, but if
you have references, that would be cool.

>
>> Then, at any time, the application can ask Solr for the clusters  
>> (whatever that means) and it will return them (docids, fields,  
>> whatever the app asks for).  If the background task isn't done yet,  
>> the results set will be empty, or it will return a percentage  
>> completion or something useful.
>>
> In my opinion it is better to return the percentage of completion
> rather than the top clusters at time X if the clustering is not yet
> over.  In most clustering cases the input data decides the centroids
> of the clusters, so a change in input might change the centroids, and
> you might get different results for different input samples derived
> from the same data set.

Yeah, I think percent complete is good; it will also keep the amount of
traffic down.  But, in true Solr fashion, maybe it can be optional to
send partial clusters, too.
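The caveat about partial clusters is easy to see numerically: a k-means centroid is just the mean of the points assigned to it, so a centroid computed from a partial sample generally differs from the one computed over the full set. Toy one-dimensional numbers, purely illustrative:

```java
// Toy illustration of why partial clusters can mislead: the centroid is
// the mean of the assigned points, so it drifts as more data arrives.
public class CentroidDrift {
    /** Mean of a 1-D set of points: the k-means centroid update. */
    public static double centroid(double[] points) {
        double sum = 0;
        for (double p : points) sum += p;
        return sum / points.length;
    }

    public static void main(String[] args) {
        double[] firstHalf = {1.0, 2.0, 3.0};        // seen so far
        double[] all       = {1.0, 2.0, 3.0, 10.0};  // full data set
        System.out.println(centroid(firstHalf));     // 2.0
        System.out.println(centroid(all));           // 4.0
    }
}
```

So a partial-clusters option would want a loud "provisional" flag on the response.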



>
>> Obviously, my first step is to get it working, but...
>>
>> Is it practical to return a partially done set of results?  i.e.  
>> the best clusters so far, with perhaps a percentage to completion  
>> value or perhaps a list of the comparisons that haven't been done  
>> yet?
>>
>> What if something happens?  How can I make Mahout fault-tolerant,
>> such that, conceivably, I could pick up the job again from where it
>> went down, or at least be able to get the clusters so far?  How do
>> people approach this to date (w/ or w/o Mahout)?  What needs to be
>> done in Mahout to make this possible?  I suspect Hadoop has some
>> support for it.
>>
> Not sure whether Mahout is fault-tolerant in that respect.  But I
> guess other members can comment on this.

No, I don't think it is at the moment in this respect.  I mean, Hadoop  
has it, so it probably isn't that hard to add...


Re: Practical Advice on Clustering

Posted by "Vaijanath N. Rao" <va...@gmail.com>.
Hi Grant,

My replies are inline.

Grant Ingersoll wrote:
> I'm looking into adding document clustering capabilities to Solr, 
> using Mahout [1][2].  I already have search-results clustering, thanks 
> to Carrot2.  What I'm looking for is practical advice on deploying a 
> system that is going to cluster potentially large corpora (but not 
> huge, and let's assume one machine for now, but it shouldn't matter)
>
> Here are some thoughts I have:
>
> In Solr, I expect to send a request to go off and build the clusters 
> for some non-trivial set of documents in the index.  The actual 
> building needs to happen in a background thread, so as to not hold up 
> the caller.  
Bingo!  It's better to spawn a new process for clustering rather than
hold up the caller.  It would also help to have a status page for the
clustering job, so the caller can check against it to find out the
current status.
> My thinking is the request will come in and spawn off a job that goes 
> and calculates a similarity matrix for all the documents in the set 
> (need to store the term vectors in Lucene) and then goes and runs the 
> clustering job (user configurable, based on the implementations we 
> have: k-means, mean-shift, fuzzy, whatever) and stores the results 
> into Solr's data directory somehow (so that it can be replicated, but 
> not a big concern of mine at the moment)
If we are going to work on a similarity matrix, I would like to add FIHC
(Frequent Itemset Hierarchical Clustering).  If you need, I can definitely
pitch in with this.  Ideally we should target replication, and I think the
idea is good.
> Then, at any time, the application can ask Solr for the clusters 
> (whatever that means) and it will return them (docids, fields, 
> whatever the app asks for).  If the background task isn't done yet, 
> the results set will be empty, or it will return a percentage 
> completion or something useful.
>
In my opinion it is better to return the percentage of completion rather
than the top clusters at time X if the clustering is not yet over.  In
most clustering cases the input data decides the centroids of the
clusters, so a change in input might change the centroids, and you might
get different results for different input samples derived from the same
data set.
> Obviously, my first step is to get it working, but...
>
> Is it practical to return a partially done set of results?  i.e. the 
> best clusters so far, with perhaps a percentage to completion value or 
> perhaps a list of the comparisons that haven't been done yet?
>
> What if something happens?  How can I make Mahout fault-tolerant, such
> that, conceivably, I could pick up the job again from where it went
> down, or at least be able to get the clusters so far?  How do people
> approach this to date (w/ or w/o Mahout)?  What needs to be done in
> Mahout to make this possible?  I suspect Hadoop has some support for it.
>
Not sure whether Mahout is fault-tolerant in that respect.  But I guess
other members can comment on this.
> Anything else I don't know?  Does what I'm thinking about make sense?
>
> Thanks for any insight,
> Grant
>
>
>
> [1] http://wiki.apache.org/solr/ClusteringComponent
> [2] https://issues.apache.org/jira/browse/SOLR-769
>

--Thanks and Regards
Vaijanath