Posted to java-user@lucene.apache.org by Mr Plate <pl...@gmail.com> on 2005/12/16 02:16:45 UTC
How to retrieve distinct field matches?
This puzzle has been bugging me for a while; I'm hoping there's an
elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other
fields, each of these Documents has 0 or more "category" field
values. There are over 5,500 such categories (it's not a small set).
Anywhere from 1 to 500+ Documents could belong to a single
"category". This index does not get updated very often; anywhere from
once a day to once a month. Indexing time is currently 15-30 minutes
from start to finish/optimization.
PROBLEM:
I'd like to provide users a way to search these "category" values.
For example, suppose the user searches for "fiction". They might see
results of: { "fiction", "non-fiction" }. However, I'd like to do
this search as quickly and efficiently as reasonable. For example, if
there are 500 Documents of category "fiction", and 400 of
"non-fiction", I don't want to Sort and iterate through each Hit to
weed out the duplicate values from my query.
For what it's worth, I imagine only 0-20 categories would match a
given query.
SIMPLEST SOLUTION I CAN THINK OF:
The best I can imagine is to maintain a separate Lucene index for
each of these category types. Each Document in this separate index
would probably have fields of "field_name", and "field_value", and
would not contain any duplicates. For example, you might see a
Document of field_name "category" and field_value "non-fiction". My
query would hit this second index instead, to perform these metadata
searches.
I hope that makes sense; do you know of a more elegant way to handle
this type of problem?
Thanks,
Tyler
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: How to retrieve distinct field matches?
Posted by "Michael D. Curtin" <mi...@curtin.com>.
Plat wrote:
> Basically, pretend I do a regular search for "category:fiction". After
> stemming/etc, this would match any Document with a category of
> "fiction", "non-fiction", "fictitious", etc. All 900+ of them.
>
> BUT as far as the results are concerned, I'm not actually interested
> in each Document that was hit, nor about any other field besides the
> "category" field. I just want a list of the unique categories that
> matched the search string of "fiction".
> ...
> Again, I want to find a *unique* list of "category" field values that
> match certain query text.
>
> I know this can be done using a second index, but wanted to be sure
> there isn't an obvious, less-hacky way first. I'm used to Lucene
> surprising me with sneaky efficiencies.
Ah, yes, I misunderstood what you are trying to do. How about doing a
simple string search (like String.indexOf) on the contents of a TermEnum
from IndexReader.terms()? Since you've only got a few thousand distinct
values, that should be pretty fast.
--MDC
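The term-scan idea above can be sketched in plain Java. This is only an illustration under stated assumptions: the sorted set stands in for the distinct terms of the "category" field (in Lucene of that era these would presumably be walked via the TermEnum returned by IndexReader.terms()), and it assumes the field is indexed untokenized so that each term is a whole category value. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

class CategoryTermScan {
    // Returns every distinct category value that contains the needle
    // (case-insensitive), in the sorted order of the term set.
    static List<String> matchingCategories(SortedSet<String> distinctTerms, String needle) {
        String lowered = needle.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String term : distinctTerms) {
            // Plain substring test, as in the String.indexOf suggestion.
            if (term.toLowerCase(Locale.ROOT).indexOf(lowered) >= 0) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the distinct "category" terms of the index (assumed data).
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "biography", "fiction", "fictitious", "non-fiction", "poetry"));
        // prints [fiction, non-fiction]
        System.out.println(matchingCategories(terms, "fiction"));
    }
}
```

Note that a substring scan does not stem: "fictitious" is not returned for "fiction". Because each distinct value appears once in the term list, no de-duplication over the 900+ matching Documents is needed.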
Re: How to retrieve distinct field matches?
Posted by Plat <pl...@gmail.com>.
Ahh, interesting point, though I'm afraid it solves a different
problem from the one I had in mind. Re-reading my post, I think I
described my problem in a very obscure way. Sorry :-/.
Basically, pretend I do a regular search for "category:fiction". After
stemming/etc, this would match any Document with a category of
"fiction", "non-fiction", "fictitious", etc. All 900+ of them.
BUT as far as the results are concerned, I'm not actually interested
in each Document that was hit, nor about any other field besides the
"category" field. I just want a list of the unique categories that
matched the search string of "fiction".
In this example, my ultimate goal would be a String[] of:
{ "fiction", "fictitious", "non-fiction" }
... without any costly iterations of all 900+ Hit Documents' category values of:
{ "fiction", "non-fiction", "fiction", "fiction", "fiction",
"fictitious", "non-fiction", ... }
Again, I want to find a *unique* list of "category" field values that
match certain query text.
I know this can be done using a second index, but wanted to be sure
there isn't an obvious, less-hacky way first. I'm used to Lucene
surprising me with sneaky efficiencies.
Thanks for the valiant effort to make sense of me! :)
Tyler
Re: How to retrieve distinct field matches?
Posted by "Michael D. Curtin" <mi...@curtin.com>.
I'm guessing that each Document doesn't have a "category" field with
multiple values in it but, instead, has a uniquely-named field for each
category. Would it work to change your data model to the former? That
is, have a Text field named "category" in each document, so that it gets
tokenized and indexed. Then you could do a search of the 5K category
names (outside of Lucene, perhaps by getting the list of Terms from the
"category" field) for the query term of interest, "fiction" in your
example, then compose a Lucene query with the results. Your example
would produce a query equivalent to 'category:fiction
category:non-fiction'. For only 100K documents, this should be pretty fast.
Good luck!
--MDC
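The last step described above, composing a query from the matched category names, might look like the following minimal sketch. It only builds the query-string form quoted in the message; the class and method names are hypothetical, and it assumes the values contain no spaces or query-syntax characters that would need quoting or escaping. A real implementation would more likely assemble a BooleanQuery of TermQuery clauses instead of a string.

```java
import java.util.Arrays;
import java.util.List;

class CategoryQueryBuilder {
    // Joins the matched values into "field:value field:value ...",
    // which the default query parser treats as an OR of the clauses.
    static String toQueryString(String field, List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed result of searching the category names for "fiction".
        List<String> matched = Arrays.asList("fiction", "non-fiction");
        // prints category:fiction category:non-fiction
        System.out.println(toQueryString("category", matched));
    }
}
```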
Re: How to retrieve distinct field matches?
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
This is pretty much the same problem that many of us have faced when
it comes to faceted browsing. I'm using a set of cached BitSets
that represent the documents that have a specific category (or
general "facet" in my case). I do a full-text search for "some query
expression", using QueryFilter to get the BitSet for the query. Then
I AND that query BitSet with each of the facet BitSets, and the
cardinality of each result gives me the number of documents in each
category that match the query. I load up these BitSets when my search
server is launched. In my case I'm currently dealing with about 30k
documents, with maybe 100 unique facet values, and these load in the
blink of an eye.
I realize the above description was devoid of code specifics, but the
gist is there. Hope it helps.
Erik
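The BitSet approach described above can be illustrated with java.util.BitSet alone. This is a sketch under assumptions, not the code from the message: the facet BitSets and the query BitSet are built by hand here, whereas in Lucene of that era they would presumably come from something like QueryFilter.bits(IndexReader). The class, method, and helper names are hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class FacetCounts {
    // For each facet, AND its cached BitSet with the query's BitSet and
    // record the cardinality. Facets with no overlap are left out, so the
    // keys of the result are the distinct categories present in the hits.
    static Map<String, Integer> countMatches(BitSet queryHits, Map<String, BitSet> facetBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : facetBits.entrySet()) {
            // BitSet.and mutates its receiver, so work on a copy of the cached set.
            BitSet overlap = (BitSet) entry.getValue().clone();
            overlap.and(queryHits);
            int n = overlap.cardinality();
            if (n > 0) {
                counts.put(entry.getKey(), n);
            }
        }
        return counts;
    }

    // Helper: a BitSet with the given document numbers set.
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        // Assumed toy data: document numbers per category.
        Map<String, BitSet> facets = new LinkedHashMap<>();
        facets.put("fiction", bits(0, 1, 2));
        facets.put("non-fiction", bits(3, 4));
        facets.put("poetry", bits(5));
        BitSet queryHits = bits(1, 2, 3); // documents matching the query
        // prints {fiction=2, non-fiction=1}
        System.out.println(countMatches(queryHits, facets));
    }
}
```

With 5,500 categories this means 5,500 AND operations per query over 100,000-bit sets, which is a different cost profile from the roughly 100 facets mentioned above; whether that is fast enough would need measuring.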