You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by ArtemGr <ar...@gmail.com> on 2009/03/12 16:32:50 UTC

What kind of performance to expect from a MultiTermQuery being used in BooleanQuery?

Hi!
I have this NotEmptyQuery class (http://gist.github.com/78115) which extends
the MultiTermQuery. The class is added into a BooleanQuery, after some other
queries (e.g. after TermQuery and LongTrieRangeFilter queries).
I wonder: does Lucene need to scan all the terms in the inverted index
and the collect all the document identifiers into the DocIdSet
in order to implement the MultiTermQuery which goes after some other
queries in a BooleanQuery? Like, collecting a thouthand of document
identifiers only to filter a few documents which remain after the other
queries had been fired up?
Or does Lucene use some trickery to only provide a subset of terms to the
MultiTermQuery?

Sorry if this is a dumb question or a part of a FAQ somewhere.
(I haven't found any related performance discussion of Lucene internals
on the site).


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: What kind of performance to expect from a MultiTermQuery being used in BooleanQuery?

Posted by Michael McCandless <lu...@mikemccandless.com>.
ArtemGr wrote:

>> I wonder: does Lucene need to scan all the terms in the inverted  
>> index
>> and the collect all the document identifiers into the DocIdSet
>> in order to implement the MultiTermQuery which goes after some other
>> queries in a BooleanQuery? Like, collecting a thouthand of document
>> identifiers only to filter a few documents which remain after the  
>> other
>> queries had been fired up?
>
> Adding some debugging output shows that MultiTermQuery still scans  
> all the
> terms even though the other parts of the enclosing BooleanQuery
> had already diminished the results to one or two documents.

By default, MultiTermQuery rewrites itself to a BooleanQuery OR'ing  
together
BooleanClauses for each term in the range.

So, yes, MultiTermQuery enumerates all terms in the range up front,
but you should not see it visiting all the docs for those terms
when another clause in the toplevel BooleanQuery is very restrictive.

If you use a constant-score query (ConstantScoreRangeQuery in 2.4,
or on trunk many queries can now do constant-score mode), then
unfortunately it will enumerate all docs for all terms, up front (well,
each time it advances to another segment).

> But then, the MultiTermQuery itself looks to be implemented as a  
> filter.
> (I first created a filter, but then decided to make a query
> from the MultiTermQuery in hopes that it would be faster due to some
> query combination technology, but it turns out there is no difference,
> since MultiTermQuery itself only implements a filter).
> I do not understand how MultiTermQuery works with BooleanQuery.
> BooleanWeight asks all the underlying queries for their weights,
> but MultiTermQuery does not implement the createWeight method and  
> should
> throw an UnsupportedOperationException...

MultiTermQuery gets rewritten to a new Query impl that implements
createWeight.  (It is rather confusing to "follow" how exactly a query
is searched).

> I think there is a space for optimization here:
> IndexReader is basically a "SortedMap<Term,List<Doc>> reader".
> Now, some of the BooleanQuery clauses will diminish that SortedMap.
> Currently they return a "List<(Doc,Float)> scores", not the filtered
> SortedMap<Term,List<Doc>>, but the optimization is still possible
> by dynamically filtering a subset of SortedMap<Term,List<Doc>> terms,
> only returning from it the entries which does still have a passing  
> score.
> E.g., moving the "merge" operation on List<Doc> sets earlier, so that
> the subsequent queries does not even see the documents which were  
> already
> excluded by the previous BooleanQuery clauses.
> It is quite possible that the previous BooleanQuery clause will  
> remove a lot
> of documents from the consideration, so that some terms in the  
> IndexReader
> will have an empty set of documents to consider by the subsequent
> BooleanQuery clauses. A filtered IndexReader won't even pass such  
> terms
> into the subsequent clauses (e.g. queries).
> Thus, in the best case, instead of considering tens of thouthands of  
> unique
> terms and documents, a subsequent clause in a BooleanQuery will only  
> have to
> consider a few.

There is a heuristic in ConjunctionScorer that tries to order the  
clauses
so that we visit the most restrictive ones first.  It uses the first  
docID for
all clauses to predict sparsness.  (It would be better to add an  
approxCount()
to scorers, but likely this heuristic works well in practice).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: What kind of performance to expect from a MultiTermQuery being used in BooleanQuery?

Posted by ArtemGr <ar...@gmail.com>.
> I wonder: does Lucene need to scan all the terms in the inverted index
> and the collect all the document identifiers into the DocIdSet
> in order to implement the MultiTermQuery which goes after some other
> queries in a BooleanQuery? Like, collecting a thouthand of document
> identifiers only to filter a few documents which remain after the other
> queries had been fired up?

Adding some debugging output shows that MultiTermQuery still scans all the
terms even though the other parts of the enclosing BooleanQuery
had already diminished the results to one or two documents.

But then, the MultiTermQuery itself looks to be implemented as a filter.
(I first created a filter, but then decided to make a query
from the MultiTermQuery in hopes that it would be faster due to some
query combination technology, but it turns out there is no difference,
since MultiTermQuery itself only implements a filter).
I do not understand how MultiTermQuery works with BooleanQuery.
BooleanWeight asks all the underlying queries for their weights,
but MultiTermQuery does not implement the createWeight method and should
throw an UnsupportedOperationException...

I think there is a space for optimization here:
IndexReader is basically a "SortedMap<Term,List<Doc>> reader".
Now, some of the BooleanQuery clauses will diminish that SortedMap.
Currently they return a "List<(Doc,Float)> scores", not the filtered
SortedMap<Term,List<Doc>>, but the optimization is still possible
by dynamically filtering a subset of SortedMap<Term,List<Doc>> terms,
only returning from it the entries which does still have a passing score.
E.g., moving the "merge" operation on List<Doc> sets earlier, so that
the subsequent queries does not even see the documents which were already
excluded by the previous BooleanQuery clauses.
It is quite possible that the previous BooleanQuery clause will remove a lot
of documents from the consideration, so that some terms in the IndexReader
will have an empty set of documents to consider by the subsequent
BooleanQuery clauses. A filtered IndexReader won't even pass such terms
into the subsequent clauses (e.g. queries).
Thus, in the best case, instead of considering tens of thouthands of unique
terms and documents, a subsequent clause in a BooleanQuery will only have to
consider a few.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org