Posted to java-user@lucene.apache.org by "Delalande, Thierry" <Th...@uk.daiwacm.com> on 2012/02/15 12:33:42 UTC

Short circuit AND or subquerying in lucene for performance

Hi,

 

I've been looking for a short circuit AND operator in Lucene or a way to
do subquerying.

Basically for queries such as field1:foo AND field2:*bar, I think it
would be highly beneficial to restrict evaluation of the second field on
the result of the first to avoid scanning the index in its entirety due
to the leading wildcard.

This can be seen as a subquery (running a query only on the result of a
first query) or as a short circuit AND, and would exist for performance
reasons.

Using SAND to denote the short-circuit variety, the short-circuit
expression x SAND y is equivalent to the conditional expression if x
then y else false.
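
To make the SAND definition above concrete, here is a minimal Java sketch of "if x then y else false". The class name SandSketch and the method sand are made up for illustration and are not part of Lucene; the right-hand operand is passed as a supplier so that it is only evaluated when the left-hand side is true.

```java
import java.util.function.BooleanSupplier;

public class SandSketch {
    // Hypothetical helper: x SAND y == if x then y else false.
    // y is wrapped in a supplier so it is not evaluated unless x is true.
    static boolean sand(boolean x, BooleanSupplier y) {
        return x ? y.getAsBoolean() : false;
    }

    public static void main(String[] args) {
        int[] calls = {0}; // counts how often the right-hand side actually runs
        BooleanSupplier expensive = () -> { calls[0]++; return true; };

        // Left side false: the expensive right side is skipped.
        System.out.println(sand(false, expensive) + ", right side ran " + calls[0] + " time(s)");
        // Left side true: the right side decides the result.
        System.out.println(sand(true, expensive) + ", right side ran " + calls[0] + " time(s)");
    }
}
```

This is the same behaviour as Java's own && operator on booleans; the question in this thread is whether a query engine can get the equivalent effect per document.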

 

So my example query would be more performant expressed as field1:foo
SAND field2:*bar

Other examples:

field1:(foo AND *bar) would be more performant expressed as field1:(foo
SAND *bar)

 

Please let me know what's already possible in terms of subquerying and
what it would take to implement this new operator in Lucene.

 

Thanks



RE: Short circuit AND or subquerying in lucene for performance

Posted by Uwe Schindler <uw...@thetaphi.de>.
> : Basically for queries such as field1:foo AND field2:*bar, I think it
> : would be highly beneficial to restrict evaluation of the second field on
> : the result of the first to avoid scanning the index in its entirety due
> : to the leading wildcard.
> 
> This is exactly how the BooleanQuery class in Lucene works.
> 
> Please note the logic in ConjunctionScorer and BooleanScorer2 (how much
> optimizing can be done depends on whether all of the clauses are required
> or not)

The real problem here is the leading wildcard query. Its terms are scanned
before scoring/result collection starts (partly during query rewrite,
partly into a bitset before the scorer starts, depending on term density).
Short-circuiting in BooleanScorer2 (BS2) therefore only kicks in once the
wildcard bitsets have already been calculated. For wildcard queries there is
no possibility to optimize the document collection, because *every* matching
term has to be scanned and its termdocs retrieved.
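
A toy model of the point above, not Lucene code: the class WildcardCostSketch, its term map and its methods are all invented for illustration, and the rewrite/bitset details are collapsed into one expansion step. It only shows that a suffix pattern such as *bar gives no prefix to seek to, so the whole term dictionary of the field is visited before the intersection with the other clause can begin.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class WildcardCostSketch {
    // Toy term dictionary for one field: term -> doc ids (an assumption, not Lucene's structure).
    final TreeMap<String, TreeSet<Integer>> field2 = new TreeMap<>();
    int termsVisited = 0;

    void add(String term, Integer... docs) {
        field2.put(term, new TreeSet<>(Arrays.asList(docs)));
    }

    // Expand "*suffix": every term has to be checked, matching or not.
    Set<Integer> expandSuffix(String suffix) {
        Set<Integer> matches = new TreeSet<>();
        for (Map.Entry<String, TreeSet<Integer>> e : field2.entrySet()) {
            termsVisited++;
            if (e.getKey().endsWith(suffix)) {
                matches.addAll(e.getValue());
            }
        }
        return matches;
    }

    // AND with the docs of the other clause: the expansion runs first, the intersection second.
    Set<Integer> and(Set<Integer> field1Docs, String suffix) {
        Set<Integer> wildcardDocs = expandSuffix(suffix);
        Set<Integer> result = new TreeSet<>(field1Docs);
        result.retainAll(wildcardDocs);
        return result;
    }

    public static void main(String[] args) {
        WildcardCostSketch idx = new WildcardCostSketch();
        idx.add("bar", 1, 4);
        idx.add("crowbar", 2);
        idx.add("baz", 3);
        idx.add("foo", 5);
        // field1:foo is assumed to match only doc 2, yet all four terms are still visited.
        Set<Integer> hits = idx.and(new TreeSet<>(Arrays.asList(2)), "bar");
        System.out.println(hits + " after visiting " + idx.termsVisited + " terms");
    }
}
```

However small the field1:foo result is, termsVisited always ends up equal to the number of terms in the field, which is the cost described above.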

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Short circuit AND or subquerying in lucene for performance

Posted by Chris Hostetter <ho...@fucit.org>.
: Basically for queries such as field1:foo AND field2:*bar, I think it
: would be highly beneficial to restrict evaluation of the second field on
: the result of the first to avoid scanning the index in its entirety due
: to the leading wildcard.

This is exactly how the BooleanQuery class in Lucene works.

Please note the logic in ConjunctionScorer and BooleanScorer2 (how much
optimizing can be done depends on whether all of the clauses are required
or not)
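
As a rough illustration of the kind of skipping a conjunction of required clauses allows, here is a simplified "leapfrog" intersection over two sorted doc id lists. This is a sketch under assumptions, not the actual ConjunctionScorer source: LeapfrogSketch, intersect and advance are hypothetical names, and the linear advance stands in for whatever skipping the real postings provide.

```java
import java.util.ArrayList;
import java.util.List;

public class LeapfrogSketch {
    // Intersect two sorted doc id arrays by always advancing the side that is behind.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]); // jump a forward to b's current doc
            } else {
                j = advance(b, j, a[i]); // jump b forward to a's current doc
            }
        }
        return out;
    }

    // First index at or after 'from' whose doc id is >= target (linear scan in this sketch).
    static int advance(int[] docs, int from, int target) {
        int k = from;
        while (k < docs.length && docs[k] < target) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        int[] rare = {3, 40, 900};
        int[] common = {1, 2, 3, 5, 8, 40, 41, 900, 901};
        System.out.println(intersect(rare, common));
    }
}
```

The sparse clause drives the iteration, so most doc ids of the dense clause are skipped rather than scored, which is why it matters whether all clauses are required.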

-Hoss
