You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Robert Engels <re...@ix.netcom.com> on 2004/01/06 22:04:11 UTC

Lucene Optimized Query Broken?

I have implemented a IndexReader that uses a relational datastore, and in
performing the queries (and reviewing the Lucene code), I see the following
behavior with Lucene.

It does has no way of limiting further searches based 'hits' on the more
unique terms. UNLESS I AM MISSING SOMETHING...

Say for example:

I have a index with documents that have only 2 fields, the first (unique) is
'very unique', in that most document have at least somewhat varying terms,
the second is a boolean that contains only (boolean) 'true' or 'false'. The
index contains 100,000,000+ documents.

If I perform the following search "+unique:somevalue +boolean:true', lucene
with search on the first term, returning very few documents, but then it
will search the second term, returning possibly a million+ documents, then
it will intersect the list, return 'hits' of only a few documents.

Shouldn't Lucene look at the 'term frequency', build the query in order of
'uniqueness', and then have some method of restricting further 'term'
searches to only certain sets of documents? The only 'IndexReader' interface
based support is TermEnum and TermDocs, but neither of these can take a
'document id set restriction'.

THE SAME PROBLEM OCCURS WITH ONLY A SINGLE TERM AS WELL. Using the same
example as above, a search like

"+unique:someuniquevalue +unique:someveryuniquevalue"

will still cause Lucene to read all of the index information for
'someverynonuniqueterm', rather than restrict the search to only those
documents returned for 'someuniquevalue'.

All types of queries should be reordered to restrict further searches, based
on the matches/non-matches in REQUIRED/PROHIBITED term clauses.

This overhead may not be noticeable in the default file-system based index,
but given enough documents it would be..., or when the index information is
stored on a network (possibly remote) file system.

This behavior has been observed with the 1.3 final code.

Robert Engels

Re: Lucene Optimized Query Broken?

Posted by Doug Cutting <cu...@apache.org>.
Leo Galambos wrote:
> Do you mean inverted lists with skipping (Moffat, Zobel: Self-Indexing 
> Inverted Files for Fast Text Retrieval, ACM TIS 10/1996)?

Yes, similar to that.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene Optimized Query Broken?

Posted by Leo Galambos <Le...@seznam.cz>.
Doug Cutting wrote:

> The heart of it is to add data to the index so that TermDocs.skipTo() 
> is much faster.  Then the search algorithms are modified to call 
> TermDocs.skipTo().  This should make conjunctive queries (ANDs and 
> phrases) significantly faster when one term occurs much less 
> frequently than others.  I hope to check this in in the next week or so.


Do you mean inverted lists with skipping (Moffat, Zobel: Self-Indexing 
Inverted Files for Fast Text Retrieval, ACM TIS 10/1996)?

THX
Leo




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


RE: Lucene Optimized Query Broken?

Posted by Jochen <lu...@quontis.com>.
Robert,

	Could you share some details of the implementation, and performance
of the relational data store you implemented? I would be especially
interested in the DB design. How does a very large number of documents
affect your performance and DB size (as you hinted in other mail of yours)?

	Do you think it is worth the effort even if the indexes do not
change frequently (i.e. only increase in size over time)?

	Best,
		Jochen

> -----Original Message-----
> From: Robert Engels [mailto:rengels@ix.netcom.com]
> Sent: Wednesday, January 07, 2004 9:13 AM
> To: Lucene Developers List
> Subject: RE: Lucene Optimized Query Broken?
> 
------ snip -----
> 
> I do have an Lucene IndexReader & IndexWriter implementation that uses a
> relational datastore, and it is extremely fast, in many ways much faster
> than Lucene's file system based indexing, especially for indexes that
> change
> frequently.
> 
> This is the last holdup, on having a truely lightening fast search system
> in
> the relational store.
> 
> It sounds like your proposed changes ill work. If you need any assistance
> in
> debugging, etc. please let  me know.
> 
> Robert
> 
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Wednesday, January 07, 2004 10:56 AM
> To: Lucene Developers List
> Subject: Re: Lucene Optimized Query Broken?
> 
> 
> Robert Engels wrote:
> > I have a index with documents that have only 2 fields, the first
> (unique)
> is
> > 'very unique', in that most document have at least somewhat varying
> terms,
> > the second is a boolean that contains only (boolean) 'true' or 'false'.
> The
> > index contains 100,000,000+ documents.
> >
> > If I perform the following search "+unique:somevalue +boolean:true',
> lucene
> > with search on the first term, returning very few documents, but then it
> > will search the second term, returning possibly a million+ documents,
> then
> > it will intersect the list, return 'hits' of only a few documents.
> 
> First, this is not the sort of query that Lucene is designed to
> efficiently handle.  Rather, this is the sort of thing that a relational
> database is desgined for.  Lucene is primarily designed to support text
> searching, where field values are natural language text and query terms
> are words describing a user's interest.  You can implement full text
> search with a relational database, but it will be slow.  Similarly, you
> can search tabular data with Lucene, but it may be slow.
> 
> That said, I'm currently working on an optimization that will make such
> queries substantially faster in Lucene.  The heart of it is to add data
> to the index so that TermDocs.skipTo() is much faster.  Then the search
> algorithms are modified to call TermDocs.skipTo().  This should make
> conjunctive queries (ANDs and phrases) significantly faster when one
> term occurs much less frequently than others.  I hope to check this in
> in the next week or so.
> 
> Doug
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


RE: Lucene Optimized Query Broken?

Posted by Robert Engels <re...@ix.netcom.com>.
I agree, sort of. I would think that even Lucene standard usage would
benefit greatly from this, especially for terms that occur very frequently,
think of the search +epson +printer - why search the entire printer index in
this case, especially if the printer index can be randomly accessed - if it
is sorted by document number.

Now that I've dug deeper into the code, I think the Lucene structure is very
close to supporting this. Once the outer query is flattened, just sort the
sub boolean term queries by frequency, then when evalutating the hits, ust
pass the required hits as a set into TermDocs?

I do have an Lucene IndexReader & IndexWriter implementation that uses a
relational datastore, and it is extremely fast, in many ways much faster
than Lucene's file system based indexing, especially for indexes that change
frequently.

This is the last holdup, on having a truely lightening fast search system in
the relational store.

It sounds like your proposed changes ill work. If you need any assistance in
debugging, etc. please let  me know.

Robert

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Wednesday, January 07, 2004 10:56 AM
To: Lucene Developers List
Subject: Re: Lucene Optimized Query Broken?


Robert Engels wrote:
> I have a index with documents that have only 2 fields, the first (unique)
is
> 'very unique', in that most document have at least somewhat varying terms,
> the second is a boolean that contains only (boolean) 'true' or 'false'.
The
> index contains 100,000,000+ documents.
>
> If I perform the following search "+unique:somevalue +boolean:true',
lucene
> with search on the first term, returning very few documents, but then it
> will search the second term, returning possibly a million+ documents, then
> it will intersect the list, return 'hits' of only a few documents.

First, this is not the sort of query that Lucene is designed to
efficiently handle.  Rather, this is the sort of thing that a relational
database is desgined for.  Lucene is primarily designed to support text
searching, where field values are natural language text and query terms
are words describing a user's interest.  You can implement full text
search with a relational database, but it will be slow.  Similarly, you
can search tabular data with Lucene, but it may be slow.

That said, I'm currently working on an optimization that will make such
queries substantially faster in Lucene.  The heart of it is to add data
to the index so that TermDocs.skipTo() is much faster.  Then the search
algorithms are modified to call TermDocs.skipTo().  This should make
conjunctive queries (ANDs and phrases) significantly faster when one
term occurs much less frequently than others.  I hope to check this in
in the next week or so.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


RE: Lucene Optimized Query Broken?

Posted by Robert Engels <re...@ix.netcom.com>.
I thought of another case, probably more appropriate...

For example, the search is on 'printer', but the user only wants to see
documents in the last 7 days. There may be thousands of 'printer' documents,
but very few in last 7 days. Why search the entire index, when the date
range will probably restrict the documents (depending on how often document
are added to the index) to a much greater degree.

Maybe I am having this trouble because the documents I've indexed have
common terms that occur very frequently but really can't be thrown out. For
example, "ibm printer" is far different than "ibm computer".

Robert

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Wednesday, January 07, 2004 10:56 AM
To: Lucene Developers List
Subject: Re: Lucene Optimized Query Broken?


Robert Engels wrote:
> I have a index with documents that have only 2 fields, the first (unique)
is
> 'very unique', in that most document have at least somewhat varying terms,
> the second is a boolean that contains only (boolean) 'true' or 'false'.
The
> index contains 100,000,000+ documents.
>
> If I perform the following search "+unique:somevalue +boolean:true',
lucene
> with search on the first term, returning very few documents, but then it
> will search the second term, returning possibly a million+ documents, then
> it will intersect the list, return 'hits' of only a few documents.

First, this is not the sort of query that Lucene is designed to
efficiently handle.  Rather, this is the sort of thing that a relational
database is desgined for.  Lucene is primarily designed to support text
searching, where field values are natural language text and query terms
are words describing a user's interest.  You can implement full text
search with a relational database, but it will be slow.  Similarly, you
can search tabular data with Lucene, but it may be slow.

That said, I'm currently working on an optimization that will make such
queries substantially faster in Lucene.  The heart of it is to add data
to the index so that TermDocs.skipTo() is much faster.  Then the search
algorithms are modified to call TermDocs.skipTo().  This should make
conjunctive queries (ANDs and phrases) significantly faster when one
term occurs much less frequently than others.  I hope to check this in
in the next week or so.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Lucene Optimized Query Broken?

Posted by Doug Cutting <cu...@apache.org>.
Robert Engels wrote:
> I have a index with documents that have only 2 fields, the first (unique) is
> 'very unique', in that most document have at least somewhat varying terms,
> the second is a boolean that contains only (boolean) 'true' or 'false'. The
> index contains 100,000,000+ documents.
> 
> If I perform the following search "+unique:somevalue +boolean:true', lucene
> with search on the first term, returning very few documents, but then it
> will search the second term, returning possibly a million+ documents, then
> it will intersect the list, return 'hits' of only a few documents.

First, this is not the sort of query that Lucene is designed to 
efficiently handle.  Rather, this is the sort of thing that a relational 
database is desgined for.  Lucene is primarily designed to support text 
searching, where field values are natural language text and query terms 
are words describing a user's interest.  You can implement full text 
search with a relational database, but it will be slow.  Similarly, you 
can search tabular data with Lucene, but it may be slow.

That said, I'm currently working on an optimization that will make such 
queries substantially faster in Lucene.  The heart of it is to add data 
to the index so that TermDocs.skipTo() is much faster.  Then the search 
algorithms are modified to call TermDocs.skipTo().  This should make 
conjunctive queries (ANDs and phrases) significantly faster when one 
term occurs much less frequently than others.  I hope to check this in 
in the next week or so.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org