Posted to java-user@lucene.apache.org by Peter Bloem <p...@peterbloem.nl> on 2007/05/21 02:37:51 UTC

Optional terms in BooleanQuery

I'm constructing a search with some required terms and some optional 
terms in the query. According to some earlier posts that looks like 
"+(A B) C D E" in query syntax for required terms A and B and optional 
terms C D and E. In other words, Lucene considers all documents that 
have both A and B, and ranks them higher if they also have C D or E.

I'm wondering how this translates to a BooleanQuery. I know I should use 
BooleanClause.Occur.MUST for A and B, and I guess I should use 
BooleanClause.Occur.SHOULD for C, D and E. However, the javadocs for 
BooleanClause.Occur.SHOULD state:

"Use this operator for clauses that /should/ appear in the matching 
documents. For a BooleanQuery with two |SHOULD| subqueries, at least one 
of the clauses must appear in the matching documents."

Does this last sentence actually mean that a query with _just_ two 
SHOULD clauses (i.e. only SHOULD clauses) must match at least one of 
them, or will the BooleanQuery described above actually constrain the 
search results to (A AND B) AND (C OR D OR E)? If so, what should I use 
instead?

thank you,
 Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Optional terms in BooleanQuery

Posted by Paul Elschot <pa...@xs4all.nl>.
This is actually more for java-dev, but anyway.

On Tuesday 22 May 2007 11:04, Mark Miller wrote:
> Sorry, didn't mean to imply that that whole spiel was a technical 
> explanation...just a "how I like to think of it" to get my head around 
> the BooleanQuery system. If you're reading that, think high-level overview 
> more than technically accurate. I'll be more specific in the future -- 
> as always, the javadocs are the best place to get down to the nitty-gritty.
> 
> HitCollector:
>   /** Called once for every non-zero scoring document, with the document
>    * number and its score.
> 
> TopDocCollector (used by Hits and returned by a Searcher) does ensure 
> scores are greater than 0. If you roll your own HitCollector, you 
> shouldn't need my thoughts on how I think of BooleanQuery.

Among others, this javadoc is corrected by the patch here:
http://issues.apache.org/jira/browse/LUCENE-584
It introduces Matcher as a superclass of Scorer.

Regards,
Paul Elschot



Re: Optional terms in BooleanQuery

Posted by Mark Miller <ma...@gmail.com>.
Sorry, didn't mean to imply that that whole spiel was a technical 
explanation...just a "how I like to think of it" to get my head around 
the BooleanQuery system. If you're reading that, think high-level overview 
more than technically accurate. I'll be more specific in the future -- 
as always, the javadocs are the best place to get down to the nitty-gritty.

HitCollector:
  /** Called once for every non-zero scoring document, with the document
   * number and its score.

TopDocCollector (used by Hits and returned by a Searcher) does ensure 
scores are greater than 0. If you roll your own HitCollector, you 
shouldn't need my thoughts on how I think of BooleanQuery.

- Mark



Re: Optional terms in BooleanQuery

Posted by Chris Hostetter <ho...@fucit.org>.
: Each doc is going to get a score -- if the score is positive the doc
: will be a hit, if the score is 0 the doc will not be a hit.

that's actually a fairly misleading statement ... the guts of Lucene
don't prevent documents from "matching" with a negative score
(specifically: a HitCollector can be asked to collect a match with a
negative score)

(dropping matches with negative scores only happens in the Hits
class/collector as i recall)
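
To illustrate the point above, here is a minimal sketch in plain Java, not Lucene code. It assumes a search simply hands every match to a collector callback, and that any filtering on score is the collector's own decision. The names MatchCollector, KeepAll, KeepPositive and runSearch are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are not Lucene's classes.
class CollectorSketch {

    // Stand-in for a collector callback: receives every match the search reports.
    interface MatchCollector {
        void collect(int doc, float score);
    }

    // Keeps every match it is handed, whatever the score.
    static final class KeepAll implements MatchCollector {
        final List<Integer> docs = new ArrayList<Integer>();

        public void collect(int doc, float score) {
            docs.add(doc);
        }
    }

    // Keeps only matches with a positive score. The filtering happens here,
    // inside the collector, not before collect() is called.
    static final class KeepPositive implements MatchCollector {
        final List<Integer> docs = new ArrayList<Integer>();

        public void collect(int doc, float score) {
            if (score > 0.0f) {
                docs.add(doc);
            }
        }
    }

    // Simulates a search reporting three matches: one positive score,
    // one zero score and one negative score.
    static void runSearch(MatchCollector collector) {
        collector.collect(0, 1.5f);
        collector.collect(1, 0.0f);
        collector.collect(2, -0.3f);
    }

    public static void main(String[] args) {
        KeepAll all = new KeepAll();
        KeepPositive positive = new KeepPositive();
        runSearch(all);
        runSearch(positive);
        System.out.println("KeepAll collected:      " + all.docs);
        System.out.println("KeepPositive collected: " + positive.docs);
    }
}
```

In this model the same three matches reach both collectors; only the second one discards the zero and negative scores.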




-Hoss




Re: Optional terms in BooleanQuery

Posted by Mark Miller <ma...@gmail.com>.
I like to think of it like this:

Each doc is going to get a score -- if the score is positive the doc 
will be a hit, if the score is 0 the doc will not be a hit.

If a boolean clause is Occur.MUST and it is not found, the score will be 
dropped to 0 no matter what (if found, the score is obviously 
increased). If a boolean clause is Occur.MUST_NOT and is found, then the 
score will be dropped to 0 no matter what.
If a boolean clause is Occur.SHOULD and it is found, a positive number 
is added to the score...if it is not found, nothing is added to the score.

Now you see why it says: "Use this operator for clauses that /should/ 
appear in the matching documents. For a BooleanQuery with two |SHOULD| 
subqueries, at least one of the clauses must appear in the matching 
documents."

To get a positive score and make a hit, one of the Occur.SHOULD clauses 
needs to be found to increase the score above 0.

- Mark



Re: Optional terms in BooleanQuery

Posted by Chris Hostetter <ho...@fucit.org>.
: BooleanQuery.Occur.SHOULD for C, D and E. However the javadocs for
: BooleanClause.Occur.SHOULD states:
:
: "Use this operator for clauses that /should/ appear in the matching
: documents. For a BooleanQuery with two |SHOULD| subqueries, at least one
: of the clauses must appear in the matching documents."

Yeah, that's misleading... i've committed an update that reads...

    /** Use this operator for clauses that <i>should</i> appear in the
     * matching documents. For a BooleanQuery with no <code>MUST</code>
     * clauses one or more <code>SHOULD</code> clauses must match a document
     * for the BooleanQuery to match.
     * @see BooleanQuery#setMinimumNumberShouldMatch
     */
    public static final Occur SHOULD = new Occur("SHOULD");

...does that make more sense?
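
The rule in that javadoc can be sketched as a small model in plain Java. This is not Lucene code: it only assumes that a document is a set of terms and that each clause is a single term. The names BooleanMatchSketch, Clause, matches, requiredAndOptional and optionalOnly are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the matching rule only; scoring is deliberately left out.
// This is NOT Lucene code and every name here is hypothetical.
class BooleanMatchSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    // One clause: a single term plus how it has to occur.
    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    // Returns true if the document (a set of terms) matches the clauses.
    static boolean matches(List<Clause> clauses, Set<String> doc) {
        boolean hasMust = false;
        boolean anyShouldMatched = false;
        for (Clause c : clauses) {
            boolean present = doc.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false; // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;  // a prohibited term is present
                    break;
                case SHOULD:
                    if (present) anyShouldMatched = true;
                    break;
            }
        }
        // With at least one MUST clause, SHOULD clauses are purely optional.
        // With no MUST clause, at least one SHOULD clause has to match.
        return hasMust || anyShouldMatched;
    }

    static Set<String> doc(String... terms) {
        return new HashSet<String>(Arrays.asList(terms));
    }

    // Required A and B, optional C, D and E.
    static List<Clause> requiredAndOptional() {
        List<Clause> q = new ArrayList<Clause>();
        q.add(new Clause("A", Occur.MUST));
        q.add(new Clause("B", Occur.MUST));
        q.add(new Clause("C", Occur.SHOULD));
        q.add(new Clause("D", Occur.SHOULD));
        q.add(new Clause("E", Occur.SHOULD));
        return q;
    }

    // Only optional clauses: C and D.
    static List<Clause> optionalOnly() {
        List<Clause> q = new ArrayList<Clause>();
        q.add(new Clause("C", Occur.SHOULD));
        q.add(new Clause("D", Occur.SHOULD));
        return q;
    }

    public static void main(String[] args) {
        System.out.println(matches(requiredAndOptional(), doc("A", "B")));
        System.out.println(matches(requiredAndOptional(), doc("A", "C")));
        System.out.println(matches(optionalOnly(), doc("A", "B")));
        System.out.println(matches(optionalOnly(), doc("D")));
    }
}
```

In this model a document containing only A and B matches the first query even though none of C, D or E is present, while the second query needs at least one of C or D.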



-Hoss




Re: Optional terms in BooleanQuery

Posted by Soeren Pekrul <so...@gmx.de>.
Peter Bloem wrote:
[...]
> "+(A B) C D E"
[...]
> In other words, Lucene considers all documents that 
> have both A and B, and ranks them higher if they also have C D or E.

Hello Peter,

As I understand it, "+(A B) C D E" means that at least one of the terms 
"A" or "B" must be present, and the terms "C", "D", and "E" are optional. 
The following documents d are hits:
d(A, B)
d(A)
d(B)
d(A, C)
...
Documents containing neither "A" nor "B" are not hits.

To require both terms "A" and "B" in a document, the query should be: 
"+(+A +B) C D E" or "+A +B C D E".
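
The difference between "+(A B) C D E" and "+A +B C D E" can be sketched as two query trees in plain Java. This is a toy model, not Lucene code: it assumes a document is a set of terms, that an unmarked term is an optional clause, and that a parenthesized group is a nested boolean query. All names (QuerySyntaxSketch, Query, Clause, term, bool, must, should) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of nested boolean queries; NOT Lucene code, all names hypothetical.
class QuerySyntaxSketch {

    enum Occur { MUST, SHOULD }

    interface Query {
        boolean matches(Set<String> doc);
    }

    static final class Clause {
        final Query query;
        final Occur occur;

        Clause(Query query, Occur occur) {
            this.query = query;
            this.occur = occur;
        }
    }

    static Clause must(Query q) { return new Clause(q, Occur.MUST); }

    static Clause should(Query q) { return new Clause(q, Occur.SHOULD); }

    // A single term matches if the document contains it.
    static Query term(final String t) {
        return doc -> doc.contains(t);
    }

    // A boolean query: every MUST clause has to match; if there is no MUST
    // clause, at least one SHOULD clause has to match.
    static Query bool(final Clause... clauses) {
        return doc -> {
            boolean hasMust = false;
            boolean anyShould = false;
            for (Clause c : clauses) {
                boolean hit = c.query.matches(doc);
                if (c.occur == Occur.MUST) {
                    hasMust = true;
                    if (!hit) return false;
                } else if (hit) {
                    anyShould = true;
                }
            }
            return hasMust || anyShould;
        };
    }

    // "+(A B) C D E": the group (A B) is required, but inside it A and B are optional.
    static final Query EITHER_REQUIRED = bool(
            must(bool(should(term("A")), should(term("B")))),
            should(term("C")), should(term("D")), should(term("E")));

    // "+A +B C D E": A and B are each required at the top level.
    static final Query BOTH_REQUIRED = bool(
            must(term("A")), must(term("B")),
            should(term("C")), should(term("D")), should(term("E")));

    static Set<String> doc(String... terms) {
        return new HashSet<String>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        System.out.println(EITHER_REQUIRED.matches(doc("A", "C")));
        System.out.println(BOTH_REQUIRED.matches(doc("A", "C")));
        System.out.println(BOTH_REQUIRED.matches(doc("A", "B")));
    }
}
```

In this model a document containing A and C matches the first tree but not the second, which is the distinction drawn above.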

Sören


