Posted to java-user@lucene.apache.org by vivek sar <vi...@gmail.com> on 2011/04/26 13:49:15 UTC

Clustering with Lucene?

Hi,

  I've been researching clustering with Lucene. Here is what
I've found so far:

1) Lucene clustering with Carrot2 -
http://download.carrot2.org/head/manual/#section.getting-started.lucene
   - but this seems suitable only for small indexes (a few hundred
documents) - http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.choosing-algorithm

2) Lucene clustering with Mahout -
http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3
   - I'm not sure this is ready for prime time yet; there
seem to be very few examples of how to do this. Has anyone tried this
with a large index (millions of documents)?

3) Some clustering library in Lucene's contribution folder -
https://issues.apache.org/jira/browse/LUCENE-1421
   - this doesn't seem to be officially supported, and the last update to
it was a year ago. Has anyone tried this?

4) Lucene clustering with Terracotta -
http://orionl.blogspot.com/2006/11/clustering-lucene.html
  - I'm not sure how to do this, and there doesn't seem to be much
activity around it.

We have large indexes - over 500 million records; we usually partition
our index after 20 million records. There are a total of 20 fields in
our index, of which we are trying to cluster 5. Is there any
clustering solution for Lucene that would work for us? Carrot2 looked
the most active and promising, but it's clearly recommended for
small indexes. Any other suggestions?

Thanks,
-vivek

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Clustering with Lucene?

Posted by Dawid Weiss <da...@gmail.com>.
They may not be dictionary fields, but there is a limited number of term entries and
they seem regular. Your questions indicate you need a faceting feature (or
even an SQL-like set of queries backed by a fast index...), probably with
some pruning.

Clustering is an unsupervised process that attempts to find latent
relationships between concepts in text (and describe these somehow).
Faceting is, ehm... a "flattening" of your search results with respect to some
category for which the dictionary of terms is relatively limited. The
product categories (with counts) that you see in search results on Amazon or
similar sites are facets.
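
To make the facet idea concrete, here is a minimal sketch of facet counting in plain Python. It is not the Lucene or Solr API: the `hits` list, the `category` field and the `facet_counts` helper are all made up for illustration, and a real implementation would read field values from the index rather than from a list of dicts.

```python
from collections import Counter

def facet_counts(hits, field):
    """Count how many matching documents carry each value of `field`."""
    return Counter(doc[field] for doc in hits if field in doc)

# Hypothetical search results; in practice these would come from the index.
hits = [
    {"title": "Lucene in Action", "category": "Books"},
    {"title": "USB cable", "category": "Electronics"},
    {"title": "Mahout in Action", "category": "Books"},
]

# Print each category with its count, most frequent first.
for value, count in facet_counts(hits, "category").most_common():
    print(f"{value} ({count})")
```

Nothing is learned from the text here; the categories are simply read off a field and counted, which is why this works only when the set of distinct values is limited.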

Dawid

On Wed, Apr 27, 2011 at 6:07 AM, vivek sar <vi...@gmail.com> wrote:
> [...]

Re: Clustering with Lucene?

Posted by vivek sar <vi...@gmail.com>.
Thanks Dawid. I was trying to give an example, but this is not
exactly our text. Our fields include things like "user name", "IP
Address", "Application Name", "Port 3", "Byte Count" - all network
related stuff. So, if a user searches on a certain IP address, then we
would need to group the results by user and application, i.e. show me all
the users who have used this IP, what applications have been used on
that IP, etc. These are definitely not dictionary fields.
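
As a rough sketch of this requirement (plain Python, not Lucene code), the example below filters a handful of records by IP address and then counts the distinct users and applications among the matches. The record layout, the short field names (`user`, `ip`, `app`, `bytes`) and the `search` and `facets` helpers are assumptions made for illustration only.

```python
from collections import Counter

def search(records, field, value):
    """Stand-in for an index query: keep records whose `field` equals `value`."""
    return [r for r in records if r.get(field) == value]

def facets(hits, fields):
    """For each requested field, count the values that occur among the hits."""
    return {f: Counter(h[f] for h in hits if f in h) for f in fields}

# Made-up records using the kinds of fields mentioned above.
records = [
    {"user": "alice", "ip": "10.0.0.1", "app": "ssh", "bytes": 120},
    {"user": "bob", "ip": "10.0.0.1", "app": "http", "bytes": 300},
    {"user": "alice", "ip": "10.0.0.1", "app": "http", "bytes": 80},
    {"user": "carol", "ip": "10.0.0.2", "app": "ssh", "bytes": 50},
]

hits = search(records, "ip", "10.0.0.1")
by_field = facets(hits, ["user", "app"])
print(dict(by_field["user"]))  # users seen on this IP, with counts
print(dict(by_field["app"]))   # applications seen on this IP, with counts
```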

I'm looking at faceting right now - checking if this would work with
Lucene (as we cannot switch to Solr at this point). What's the main
difference between clustering and faceting?

Thanks,
-vivek

On Tue, Apr 26, 2011 at 12:02 PM, Dawid Weiss <da...@gmail.com> wrote:
> [...]



Re: Clustering with Lucene?

Posted by Dawid Weiss <da...@gmail.com>.
> 1) We index around 20 fields, and we want to have a grouping option
> for five of them. For example, a user can search on the name of a city and we
> should have the option to group by products available in that city (and
> vice versa).
>

Are these fields strictly defined or free text? If they are
product/dictionary fields, then what you're looking for is not text
clustering but faceting, and the solution is to use either Solr or its
components, which do exactly this.


> 2) We also need an aggregation facility, which would allow us to
> aggregate a certain field value from that group. For example, sum the qty
> for all the products in a category. The aggregation may not be part of
> clustering, but could be an add-on to it.
>

This definitely looks like faceting. Take a look at Solr's faceting
functionality -- I think this will solve your problem.
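
The aggregation described in point 2 amounts to a grouped sum over the matching documents. Here is a rough sketch in plain Python, with hypothetical field names (`category`, `qty`) and a hypothetical `sum_by` helper; it only shows the shape of the computation and says nothing about how Solr implements it.

```python
from collections import defaultdict

def sum_by(hits, group_field, value_field):
    """Sum `value_field` over the hits, grouped by the value of `group_field`."""
    totals = defaultdict(int)
    for doc in hits:
        totals[doc[group_field]] += doc[value_field]
    return dict(totals)

# Made-up matching documents for illustration.
hits = [
    {"product": "laptop", "category": "computers", "qty": 3},
    {"product": "mouse", "category": "accessories", "qty": 10},
    {"product": "desktop", "category": "computers", "qty": 2},
]

totals = sum_by(hits, "category", "qty")
print(totals)
```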

Dawid

Re: Clustering with Lucene?

Posted by vivek sar <vi...@gmail.com>.
Thanks Dawid for the reply. Here is what we are trying to do:

1) We index around 20 fields, and we want to have a grouping option
for five of them. For example, a user can search on the name of a city and we
should have the option to group by products available in that city (and
vice versa).
2) We also need an aggregation facility, which would allow us to
aggregate a certain field value from that group. For example, sum the qty
for all the products in a category. The aggregation may not be part of
clustering, but could be an add-on to it.

Any suggestions on what algorithm, or whether Mahout, may help in this scenario?

Thanks,
-vivek

On Tue, Apr 26, 2011 at 4:56 AM, Dawid Weiss <da...@gmail.com> wrote:
> [...]



Re: Clustering with Lucene?

Posted by Dawid Weiss <da...@gmail.com>.
Can you shed some more light on what you're trying to achieve (what is
the purpose of clustering -- are the clusters to be used for a front-end
user interface, further data mining analysis, etc.)?

With the sizes you report, Carrot2 won't work for you, I'm afraid, but
Mahout may. Still, there are plenty of algorithms and preprocessing
options to consider, so if you provide more background somebody may
push you in the right direction.

Dawid

On Tue, Apr 26, 2011 at 1:49 PM, vivek sar <vi...@gmail.com> wrote:
> [...]
