Posted to java-user@lucene.apache.org by Mr Plate <pl...@gmail.com> on 2005/12/16 02:16:45 UTC

How to retrieve distinct field matches?

This puzzle has been bugging me for a while; I'm hoping there's an  
elegant way to handle it in Lucene.

DATA DESCRIPTION:

I've got an index of over 100,000 Documents. In addition to other  
fields, each of these Documents has 0 or more "category" field  
values. There are over 5,500 such categories (it's not a small set).  
Anywhere from 1 to 500+ Documents could belong to a single  
"category". This index does not get updated very often; anywhere from  
once a day to once a month. Indexing time is currently 15-30 minutes  
from start to finish/optimization.


PROBLEM:

I'd like to provide users a way to search these "category" values.  
For example, suppose the user searches for "fiction". They might see  
results of:  { "fiction", "non-fiction" }. However, I'd like to do  
this search as quickly and efficiently as reasonable. For example, if  
there are 500 Documents of category "fiction", and 400 of
"non-fiction", I don't want to Sort and iterate through each Hit to weed
out the duplicate values from my query.

For what it's worth, I imagine only 0-20 categories would match a  
given query.


SIMPLEST SOLUTION I CAN THINK OF:

The best I can imagine is to maintain a separate Lucene index for  
each of these category types. Each Document in this separate index  
would probably have fields of "field_name", and "field_value", and  
would not contain any duplicates. For example, you might see a  
Document of field_name "category" and field_value "non-fiction". My  
query would hit this second index instead, to perform these metadata  
searches.
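
A rough sketch of the de-duplication step this second index would need,
assuming each source document can be viewed as a map from field name to its
list of values. The class and method names are hypothetical and nothing here
calls Lucene; each (field_name, field_value) pair in the result would become
one Document in the second index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: gathers the distinct values seen for each field name.
class DistinctFieldValues {
    static SortedMap<String, SortedSet<String>> collect(List<Map<String, List<String>>> docs) {
        SortedMap<String, SortedSet<String>> distinct = new TreeMap<>();
        for (Map<String, List<String>> doc : docs) {
            for (Map.Entry<String, List<String>> field : doc.entrySet()) {
                // A set drops repeated values, so each pair is kept once.
                distinct.computeIfAbsent(field.getKey(), k -> new TreeSet<>())
                        .addAll(field.getValue());
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> docs = new ArrayList<>();
        docs.add(Collections.singletonMap("category", Arrays.asList("fiction")));
        docs.add(Collections.singletonMap("category", Arrays.asList("non-fiction", "fiction")));
        docs.add(Collections.emptyMap()); // a document with zero categories
        System.out.println(collect(docs)); // prints {category=[fiction, non-fiction]}
    }
}
```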


I hope that makes sense; do you know of a more elegant way to handle  
this type of problem?


Thanks,

Tyler

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to retrieve distinct field matches?

Posted by "Michael D. Curtin" <mi...@curtin.com>.

Plat wrote:

> Basically, pretend I do a regular search for "category:fiction". After
> stemming/etc, this would match any Document with a category of
> "fiction", "non-fiction", "fictitious", etc. All 900+ of them.
> 
> BUT as far as the results are concerned, I'm not actually interested
> in each Document that was hit, nor about any other field besides the
> "category" field. I just want a list of the unique categories that
> matched the search string of "fiction".
> ...
> Again, I want to find a *unique* list of "category" field values that
> match certain query text.
> 
> I know this can be done using a second index, but wanted to be sure
> there isn't an obvious, less-hacky way first. I'm used to Lucene
> surprising me with sneaky efficiencies.

Ah, yes, I misunderstood what you are trying to do.  How about doing a 
simple string search (like String.indexOf) on the contents of a TermEnum 
from IndexReader.terms()?  Since you've only got a few thousand distinct 
values, that should be pretty fast.
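
A minimal sketch of that scan, with a plain list of strings standing in for
the term texts a TermEnum over the "category" field would yield. The class
and method names are made up for illustration, and the lower-casing is an
assumption about how matching should behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: substring match over the distinct category terms.
class CategoryTermScan {
    static List<String> matching(Iterable<String> distinctTerms, String needle) {
        String n = needle.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String term : distinctTerms) {
            // Each term appears once in the enumeration, so no de-duplication is needed.
            if (term.toLowerCase(Locale.ROOT).indexOf(n) >= 0) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "biography", "fiction", "fictitious", "non-fiction", "poetry");
        System.out.println(matching(terms, "fiction")); // prints [fiction, non-fiction]
    }
}
```

Note that a plain substring test does not find "fictitious" for the input
"fiction"; catching variants like that would mean comparing stemmed forms
instead.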

--MDC

Re: How to retrieve distinct field matches?

Posted by Plat <pl...@gmail.com>.

Ahh, interesting point, though I'm afraid it solves a different
problem than my intentions. Re-reading this, I think I've described my
problem in a very obscure way. Sorry :-/.


Basically, pretend I do a regular search for "category:fiction". After
stemming/etc, this would match any Document with a category of
"fiction", "non-fiction", "fictitious", etc. All 900+ of them.

BUT as far as the results are concerned, I'm not actually interested
in each Document that was hit, nor about any other field besides the
"category" field. I just want a list of the unique categories that
matched the search string of "fiction".

In this example, my ultimate goal would be a String[] of:

     { "fiction", "fictitious", "non-fiction" }

... without any costly iterations of all 900+ Hit Documents' category values of:

     { "fiction", "non-fiction", "fiction", "fiction", "fiction",
"fictitious", "non-fiction", ... }

Again, I want to find a *unique* list of "category" field values that
match certain query text.

I know this can be done using a second index, but wanted to be sure
there isn't an obvious, less-hacky way first. I'm used to Lucene
surprising me with sneaky efficiencies.

Thanks for the valiant effort to make sense of me! :)

Tyler

Re: How to retrieve distinct field matches?

Posted by "Michael D. Curtin" <mi...@curtin.com>.

I'm guessing that each Document doesn't have a "category" field with 
multiple values in it but, instead, has a uniquely-named field for each 
category.  Would it work to change your data model to the former?  That 
is, have a Text field named "category" in each document, so that it gets 
tokenized and indexed.  Then you could do a search of the 5K category 
names (outside of Lucene, perhaps by getting the list of Terms from the 
"category" field) for the query term of interest, "fiction" in your 
example, then compose a Lucene query with the results.  Your example 
would produce a query equivalent to 'category:fiction 
category:non-fiction'.  For only 100K documents, this should be pretty fast.
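
A small sketch of the query-composition step, assuming the matching category
names have already been found. The helper name is hypothetical, and it only
builds the query text shown above; it does not escape special characters or
call Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: one "field:value" clause per matched category.
class CategoryQueryText {
    static String orQuery(String field, List<String> values) {
        List<String> clauses = new ArrayList<>();
        for (String value : values) {
            clauses.add(field + ":" + value);
        }
        // Space-separated clauses, as in the example query above.
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // prints category:fiction category:non-fiction
        System.out.println(orQuery("category", Arrays.asList("fiction", "non-fiction")));
    }
}
```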

Good luck!

--MDC

Re: How to retrieve distinct field matches?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

This is pretty much the same problem that many of us have faced when
it comes to faceted browsing.  I'm using a set of cached BitSets
that represent the documents that have a specific category (or
general "facet" in my case).  I do a full-text search for "some query
expression", using QueryFilter to get the BitSet for the query.  Then
I AND the query's BitSet with each of the facet BitSets, and the
cardinality of each result gives me the number of documents in each
category that match the query.  I load up these BitSets when my search
server is launched.  In my case I'm currently dealing with about 30k
documents, with maybe 100 unique facet values, and these load in the
blink of an eye.

I realize the above description was void of code specifics, but the  
gist is there.  Hope it helps.
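
For what it's worth, the gist could look roughly like the sketch below. It
uses only java.util.BitSet and assumes the per-facet BitSets and the query's
BitSet have already been obtained; the class and method names are invented
for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts query hits per facet by intersecting BitSets.
class FacetCounts {
    static Map<String, Integer> count(BitSet queryHits, Map<String, BitSet> facetBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> facet : facetBits.entrySet()) {
            // Work on a copy so the cached facet BitSet is not modified.
            BitSet both = (BitSet) facet.getValue().clone();
            both.and(queryHits);
            int n = both.cardinality();
            if (n > 0) {
                counts.put(facet.getKey(), n); // keep only facets the query touches
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, BitSet> facets = new LinkedHashMap<>();
        BitSet fiction = new BitSet();
        fiction.set(0, 3);          // documents 0, 1, 2
        BitSet nonFiction = new BitSet();
        nonFiction.set(3, 5);       // documents 3, 4
        BitSet poetry = new BitSet();
        poetry.set(5);              // document 5
        facets.put("fiction", fiction);
        facets.put("non-fiction", nonFiction);
        facets.put("poetry", poetry);

        BitSet hits = new BitSet();
        hits.set(1, 4);             // the query matched documents 1, 2, 3
        System.out.println(count(hits, facets)); // prints {fiction=2, non-fiction=1}
    }
}
```

The keys of the returned map are the distinct categories touched by the
query, at the cost of one AND per facet rather than one step per hit.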

	Erik


