Posted to java-user@lucene.apache.org by "kapilChhabra (sent by Nabble.com)" <li...@nabble.com> on 2005/08/24 09:46:46 UTC

Search Results Clustering

Hi All,
I have been using Lucene in my application to search over 4 million records that are updated daily.
I am currently using a single index with 21 fields.
Some of my fields contain numbers that are foreign keys to my data. I have provided a dropdown of values to select from, on my search form, to search on these fields.

A typical scenario of my index/search is:
FIELD-1:token;index - formatted text using WhitespaceAnalyzer
FIELD-2:token;index - formatted text using WhitespaceAnalyzer
FIELD-3:index - integers [foreign keys] stored as strings.
FIELD-4:index - integers [foreign keys] stored as strings.

Sample Search Query:

(FIELD-1:apple OR FIELD-1:orange) AND (FIELD-3:4 OR FIELD-3:5)

The results are sorted on FIELD-4.

I am getting results as expected.

An additional requirement is to bunch the search results and display the count for each bunch.
e.g. output:
Search Results:
1. Doc 100
2. Doc 209
3. Doc 897
etc...

Search Clusters:
Total Results = 540
+results in [FIELD-3:4] = 400
---results in [FIELD-4:1] = 150
---results in [FIELD-4:7] = 130
---results in [FIELD-4:3] = 100
---results in [FIELD-4:others] = 20

+results in [FIELD-3:5] = 140
---results in [FIELD-4:2] = 90
---results in [FIELD-4:1] = 30
---results in [FIELD-4:others] = 20

I have no clue how to do it using a single index.
Any pointers in this will be highly appreciated.

Thanks in advance,

Regards,
KapilChhabra
--
Sent from the Lucene - Java Users forum at Nabble.com:
http://www.nabble.com/Search-Results-Clustering-t249355.html#a696937

Re: Search Results Clustering

Posted by Ray Tsang <sa...@gmail.com>.

I had similar requirements for "count" and "group by" on over 130 million
records, and it's really a pain.  My solution is currently usable but not
satisfactory.

Currently it groups at run-time by iterating through the ungrouped
items.  It collects the matching documents into a BitSet, so subsequent
queries can use that BitSet to retrieve the results of the original query.
Moreover, it can mark off documents in the BitSet once they have been
grouped.
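
A minimal sketch of this kind of run-time grouping, under stated assumptions: it uses only java.util.BitSet and makes no Lucene calls. The class LazyGroupingSketch, the method nextPage and the fieldValues array (which stands in for reading the group-by field from each document) are hypothetical names for illustration, not the code described above.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: group one page of results at a time from a BitSet of hits. */
final class LazyGroupingSketch {

    /**
     * Groups up to pageSize distinct field values from the still-ungrouped hits.
     * fieldValues[doc] is an assumed stand-in for reading the field from a document.
     * Every document that gets grouped is cleared from `remaining`.
     */
    static Map<String, Integer> nextPage(BitSet remaining, String[] fieldValues, int pageSize) {
        Map<String, Integer> page = new LinkedHashMap<>();
        for (int doc = remaining.nextSetBit(0);
             doc >= 0 && page.size() < pageSize;
             doc = remaining.nextSetBit(doc + 1)) {
            String value = fieldValues[doc];
            int count = 0;
            // Sweep the remaining hits for the same value and mark them off.
            for (int other = doc; other >= 0; other = remaining.nextSetBit(other + 1)) {
                if (value.equals(fieldValues[other])) {
                    remaining.clear(other);
                    count++;
                }
            }
            page.put(value, count);
        }
        return page;
    }

    public static void main(String[] args) {
        String[] fieldValues = {"4", "5", "4", "4", "5", "7"};
        BitSet remaining = new BitSet();
        for (int doc : new int[] {0, 1, 2, 4, 5}) { // pretend these docs matched the query
            remaining.set(doc);
        }
        while (!remaining.isEmpty()) {
            System.out.println(nextPage(remaining, fieldValues, 2));
        }
    }
}
```

Because each call only groups enough documents to fill one page, the total number of groups is not known until the BitSet has been drained.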

On a page that shows 10 records, it will only group 10 records at
a time. Consequently, there is no way to know the total number of grouped
records up front.

In addition, it feels like reading the field values from the documents
in order to look for group-by results is the most time-consuming part.

How does an RDBMS do it?

ray,

On 8/31/05, kapilChhabra (sent by Nabble.com) <li...@nabble.com> wrote:
> 
> thanks a lot for your suggestion.
> I'll try it and get back if need be.
> 
> Meanwhile, I gave it some thought and concluded that the best time to do the categorization/clustering is when Lucene calculates the Hits, i.e. in the Scorer.
> I am not sure if I am right.
> In addition to the current functionality, can we modify the Scorer class to add the following feature:
> The class generates a 2-dimensional array for the clustered field: the first dimension contains the distinct values of the field and the second dimension contains the count of results under each value. The count is incremented for every acceptable hit.
> Does it make sense?
> If it is possible, I'll dig deeper into the code of the Hits/Scorer classes.
> 
> Thanks in advance,
> kapilChhabra

Re: Search Results Clustering

Posted by "kapilChhabra (sent by Nabble.com)" <li...@nabble.com>.

Thanks a lot for your suggestion.
I'll try it and get back if need be.

Meanwhile, I gave it some thought and concluded that the best time to do the categorization/clustering is when Lucene calculates the Hits, i.e. in the Scorer.
I am not sure if I am right.
In addition to the current functionality, can we modify the Scorer class to add the following feature:
The class generates a 2-dimensional array for the clustered field: the first dimension contains the distinct values of the field and the second dimension contains the count of results under each value. The count is incremented for every acceptable hit.
Does it make sense?
If it is possible, I'll dig deeper into the code of the Hits/Scorer classes.
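
To make the idea concrete, here is a rough sketch of such a counter. It is not a modification of Lucene's Scorer and makes no Lucene calls; the class HitCounterSketch, its collect method and the fieldValues array (a stand-in for looking up the clustered field of a document) are all hypothetical. A map replaces the 2-dimensional array, since the distinct values are not known in advance.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-hit counter; not part of Lucene. */
final class HitCounterSketch {
    // fieldValues[doc] stands in for the clustered field's value of document `doc`.
    private final String[] fieldValues;
    // Distinct field value -> number of accepted hits carrying that value.
    private final Map<String, Integer> counts = new TreeMap<>();

    HitCounterSketch(String[] fieldValues) {
        this.fieldValues = fieldValues;
    }

    /** Call once for every accepted hit while the search runs. */
    void collect(int doc) {
        counts.merge(fieldValues[doc], 1, Integer::sum);
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        HitCounterSketch counter = new HitCounterSketch(new String[] {"4", "5", "4", "7"});
        for (int doc : new int[] {0, 1, 2}) { // pretend docs 0, 1 and 2 matched
            counter.collect(doc);
        }
        System.out.println(counter.counts());
    }
}
```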

Thanks in advance,
kapilChhabra


--
Sent from the Lucene - Java Users forum at Nabble.com:
http://www.nabble.com/Search-Results-Clustering-t249355.html#a748901

Re: Search Results Clustering

Posted by Chris Hostetter <ho...@fucit.org>.

: Suppose I cluster the results only on the 1st field, i.e. I do not show
: the constituent clusters. Even in this case, I'll require around 900
: Filters [I have 900 unique terms] in memory and will have to run the same
: query 900 times, once on each Filter. I am in a situation where I
: get around 15 queries/sec on average. Even if I spare another machine
: to return the clustering results, I'll be firing 15*900 = 13,500
: queries/sec.

1) If I remember correctly, just because query X takes S seconds to
complete doesn't mean that executing X N times in rapid succession
will take N * S seconds.  There is some internal caching taking place.

2) Regardless of what type of complex Query objects you need to generate
the set of products that match your user's search, or the set of products
in each category (what you've been calling clusters), and regardless of
which search method you use to generate the main list of results, you can
use the QueryFilter class to translate each Query into a Filter, and call
Filter.bits to get a BitSet for each category's Filter (and for your
main result set).  These BitSets can be intersected to find the counts you
need.  They can be cached for as long as your index remains unmodified
-- using CachingWrapperFilter, for example -- which means the work required
to do those 900 (category-specific) queries only happens once each time
the index is changed, not once per user search.
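
A minimal sketch of the intersection step, assuming the BitSets have already been obtained and cached. Plain java.util.BitSet values stand in for what Filter.bits would return, and no Lucene calls are made. FilterCountSketch, countPerCategory and bits are hypothetical names used only for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: per-category counts by intersecting cached BitSets. */
final class FilterCountSketch {

    /** For each category, counts the documents that also match the main query. */
    static Map<String, Integer> countPerCategory(BitSet queryBits,
                                                 Map<String, BitSet> categoryBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : categoryBits.entrySet()) {
            // and() mutates its receiver, so intersect on a copy of the query bits.
            BitSet both = (BitSet) queryBits.clone();
            both.and(entry.getValue());
            int n = both.cardinality();
            if (n > 0) {
                counts.put(entry.getKey(), n); // skip categories with no hits
            }
        }
        return counts;
    }

    /** Small helper that builds a BitSet from document numbers. */
    static BitSet bits(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    public static void main(String[] args) {
        Map<String, BitSet> categories = new LinkedHashMap<>();
        categories.put("FIELD-3:4", bits(0, 2, 3));
        categories.put("FIELD-3:5", bits(1, 4));
        System.out.println(countPerCategory(bits(0, 1, 2, 5), categories));
    }
}
```

The category BitSets are never modified here, so the same cached instances can be reused across user searches until the index changes.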

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search Results Clustering

Posted by "kapilChhabra (sent by Nabble.com)" <li...@nabble.com>.

Thanks for the response.
I understand that using Filters can do the trick, but there are other issues involved.
Suppose I cluster the results only on the 1st field, i.e. I do not show the constituent clusters.
Even in this case, I'll require around 900 Filters [I have 900 unique terms] in memory and will have to run the same query 900 times, once on each Filter.
I am in a situation where I get around 15 queries/sec on average. Even if I spare another machine to return the clustering results, I'll be firing 15*900 = 13,500 queries/sec.

1. Am I thinking in the right direction?
2. If yes, then what else can be a more feasible solution?


Thanks in anticipation,
kapilChhabra
--
Sent from the Lucene - Java Users forum at Nabble.com:
http://www.nabble.com/Search-Results-Clustering-t249355.html#a731549

Re: Search Results Clustering

Posted by Chris Hostetter <ho...@fucit.org>.

the approach(es) I described in this thread...

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3cPine.LNX.4.58.0505111358460.9671@hal.rescomp.berkeley.edu%3e

...should work, but you have the added complexity of wanting the counts
not just for all unique values in a field, but for all the permutations of
values from two fields -- which just means you need to compute a lot more
intersections.  Fortunately, Filters cache nicely, so you only have to pay
the cost of computing them when your index changes.

Having done an *extensive* amount of work along this line recently, I can
tell you that something worth considering is to only precompute the
Filters for the individual terms in each field, and then find the
intersection of all the permutations at search time -- it reduces the
amount of precomputation needed, and in many cases can make a huge
difference in the amount of RAM needed to cache all of the Filters.
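
A rough sketch of that trade-off, under the assumption that one BitSet per individual term has already been cached for each field: the per-pair intersections are then computed at search time instead of being cached themselves. It uses only java.util.BitSet, with no Lucene calls, and NestedCountSketch, nestedCounts and bits are hypothetical names.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: nested counts from per-term BitSets of two fields. */
final class NestedCountSketch {

    /** Returns outer value -> (inner value -> count) for documents matching the query. */
    static Map<String, Map<String, Integer>> nestedCounts(BitSet queryBits,
                                                          Map<String, BitSet> outerTerms,
                                                          Map<String, BitSet> innerTerms) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> outer : outerTerms.entrySet()) {
            BitSet inOuter = (BitSet) queryBits.clone(); // copy, because and() mutates
            inOuter.and(outer.getValue());
            if (inOuter.isEmpty()) {
                continue; // no hits carry this outer value
            }
            Map<String, Integer> perInner = new LinkedHashMap<>();
            for (Map.Entry<String, BitSet> inner : innerTerms.entrySet()) {
                BitSet both = (BitSet) inOuter.clone();
                both.and(inner.getValue());
                if (!both.isEmpty()) {
                    perInner.put(inner.getKey(), both.cardinality());
                }
            }
            result.put(outer.getKey(), perInner);
        }
        return result;
    }

    /** Small helper that builds a BitSet from document numbers. */
    static BitSet bits(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    public static void main(String[] args) {
        Map<String, BitSet> field3 = new LinkedHashMap<>();
        field3.put("4", bits(0, 1, 2, 3));
        field3.put("5", bits(4, 5));
        Map<String, BitSet> field4 = new LinkedHashMap<>();
        field4.put("1", bits(0, 1, 4));
        field4.put("7", bits(2, 3, 5));
        System.out.println(nestedCounts(bits(0, 1, 2, 4), field3, field4));
    }
}
```

With M terms in one field and N in the other, this caches M + N BitSets rather than M * N, at the cost of doing the pairwise intersections on every search.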




-Hoss



Re: Search Results Clustering

Posted by Nader Henein <ns...@bayt.net>.

Well, you're not going to like my answer to that. If what you're looking
for is a group-by result depending on the unique values of a field or a
combination of fields (field-3, field-4), something that in SQL would
look like this:

select field-3, field-4, count(*) from ... where ..... group by
field-3, field-4

then, short of doing X Lucene queries to get total result counts, I'm not
really sure how you would get this. Depending on how many unique
combinations you have, you could always denormalize the count(*) in a
separate table or Lucene index and run two queries: the first to get the
counts and the second to get the results, which you could group on display.

So the question is: how many unique values do you have? It would help if
you gave me a real-world example, because there are workarounds for these
things that can be a lot more performant and intelligent than a
straight DB hit.

Nader Henein

-- 

Nader S. Henein
Senior Applications Architect

Bayt.com


Re: Search Results Clustering

Posted by "kapilChhabra (sent by Nabble.com)" <li...@nabble.com>.

Thanks for the prompt reply.
I have to bunch the results [only the count] on the basis of the value of one of the fields,
let's say FIELD-3, and within it FIELD-4.

It is very similar to using "group by".

What are the options for doing this? And which is the best way to do it?


Thanks in anticipation,
Regards,
kapilChhabra

--
Sent from the Lucene - Java Users forum at Nabble.com:
http://www.nabble.com/Search-Results-Clustering-t249355.html#a697377

Re: Search Results Clustering

Posted by Nader Henein <ns...@bayt.net>.

I don't understand your requirement. What do you want to bunch your
results by?

Can you explain, so I can help?

Nader Henein

-- 

Nader S. Henein
Senior Applications Architect

Bayt.com

