Posted to java-user@lucene.apache.org by Peter Bloem <p...@peterbloem.nl> on 2007/05/21 02:37:51 UTC
Optional terms in BooleanQuery
I'm constructing a search with some required terms and some optional
terms in the query. According to some earlier posts that looks like
"+(A B) C D E" in query syntax for required terms A and B and optional
terms C D and E. In other words, Lucene considers all documents that
have both A and B, and ranks them higher if they also have C D or E.
I'm wondering how this translates to a BooleanQuery. I know I should use
BooleanClause.Occur.MUST for A and B, and I guess I should use
BooleanClause.Occur.SHOULD for C, D and E. However the javadocs for
BooleanClause.Occur.SHOULD state:
"Use this operator for clauses that /should/ appear in the matching
documents. For a BooleanQuery with two |SHOULD| subqueries, at least one
of the clauses must appear in the matching documents."
Does this last sentence mean that only a query with _just_ SHOULD
clauses (i.e. no MUST clauses) requires one of them to match, or will
the BooleanQuery described above actually constrain the search results
to (A AND B) AND (C OR D OR E)? If so, what should I use instead?
thank you,
Peter
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Optional terms in BooleanQuery
Posted by Paul Elschot <pa...@xs4all.nl>.
This is actually more for java-dev, but anyway.
On Tuesday 22 May 2007 11:04, Mark Miller wrote:
> Sorry, didn't mean to imply that that whole spiel was a technical
> explanation...just a "how I like to think of it" to get my head around
> the BooleanQuery system. If you're reading that, think high level overview
> more than technically accurate. I'll be more specific in the future --
> as always, the javadocs are the best place to get down to the nitty gritty.
>
> HitCollector:
> /** Called once for every non-zero scoring document, with the document number
>  * and its score.
>
> TopDocCollector (used by Hits and returned by a Searcher) does ensure
> scores are greater than 0. If you roll your own HitCollector, you
> shouldn't need my thoughts on how I think of BooleanQuery.
Among others, this javadoc is corrected by the patch here:
http://issues.apache.org/jira/browse/LUCENE-584
It introduces Matcher as a superclass of Scorer.
Regards,
Paul Elschot
Re: Optional terms in BooleanQuery
Posted by Mark Miller <ma...@gmail.com>.
Sorry, didn't mean to imply that that whole spiel was a technical
explanation...just a "how I like to think of it" to get my head around
the BooleanQuery system. If you're reading that, think high level overview
more than technically accurate. I'll be more specific in the future --
as always, the javadocs are the best place to get down to the nitty gritty.
HitCollector:
/** Called once for every non-zero scoring document, with the document number
 * and its score.
TopDocCollector (used by Hits and returned by a Searcher) does ensure
scores are greater than 0. If you roll your own HitCollector, you
shouldn't need my thoughts on how I think of BooleanQuery.
- Mark
Chris Hostetter wrote:
> : Each doc is going to get a score -- if the score is positive the doc
> : will be a hit, if the score is 0 the doc will not be a hit.
>
> that's actually a fairly misleading statement ... the guts of Lucene
> don't prevent documents from "matching" with a negative score
> (specifically: a HitCollector can be asked to collect a match with a
> negative score)
>
> (dropping matches with negative scores only happens in the Hits
> class/collector as I recall)
>
>
>
>
> -Hoss
Re: Optional terms in BooleanQuery
Posted by Chris Hostetter <ho...@fucit.org>.
: Each doc is going to get a score -- if the score is positive the doc
: will be a hit, if the score is 0 the doc will not be a hit.
that's actually a fairly misleading statement ... the guts of Lucene
don't prevent documents from "matching" with a negative score
(specifically: a HitCollector can be asked to collect a match with a
negative score)
(dropping matches with negative scores only happens in the Hits
class/collector as I recall)
-Hoss
Re: Optional terms in BooleanQuery
Posted by Mark Miller <ma...@gmail.com>.
I like to think of it like this:
Each doc is going to get a score -- if the score is positive the doc
will be a hit, if the score is 0 the doc will not be a hit.
If a boolean clause is Occur.MUST and it is not found, the score will be
dropped to 0 no matter what (if found, the score is obviously
increased). If a boolean clause is Occur.MUST_NOT and is found, then the
score will be dropped to 0 no matter what.
If the boolean clause is Occur.SHOULD and it is found, a positive number
is added to the score...if it is not found, nothing is added to the score.
Now you see why it says: "Use this operator for clauses that /should/
appear in the matching documents. For a BooleanQuery with two |SHOULD|
subqueries, at least one of the clauses must appear in the matching
documents."
To get a positive score and make a hit when there are no Occur.MUST
clauses, one of the Occur.SHOULD clauses needs to be found to increase
the score above 0.
- Mark
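As a rough illustration of the mental model above, here is a small self-contained sketch. It is not Lucene code and does not reproduce Lucene's scoring; the class `ScoreModelSketch`, the helper `doc`, and the weight of 1.0 per matching term are all assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy version of the "score drops to 0" mental model; not Lucene code. */
final class ScoreModelSketch {

    /** Builds a document as a plain set of terms (hypothetical helper). */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** Scores a document; every matching term adds an assumed weight of 1.0. */
    static double score(Set<String> docTerms, List<String> must,
                        List<String> should, List<String> mustNot) {
        for (String term : mustNot) {
            if (docTerms.contains(term)) {
                return 0.0; // a MUST_NOT term was found
            }
        }
        double score = 0.0;
        for (String term : must) {
            if (!docTerms.contains(term)) {
                return 0.0; // a MUST term is missing
            }
            score += 1.0;
        }
        for (String term : should) {
            if (docTerms.contains(term)) {
                score += 1.0; // optional terms only ever add to the score
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> must = Arrays.asList("A", "B");
        List<String> should = Arrays.asList("C", "D", "E");
        List<String> mustNot = Collections.emptyList();
        System.out.println(score(doc("A", "B"), must, should, mustNot));      // 2.0
        System.out.println(score(doc("A", "B", "C"), must, should, mustNot)); // 3.0
        System.out.println(score(doc("A", "C"), must, should, mustNot));      // 0.0
    }
}
```

In this toy model a document containing only A and B already has a positive score, so the optional terms C, D and E affect ranking but not whether the document is a hit.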
Peter Bloem wrote:
> I'm constructing a search with some required terms and some optional
> terms in the query. According to some earlier posts that looks like
> "+(A B) C D E" in query syntax for required terms A and B and optional
> terms C D and E. In other words, Lucene considers all documents that
> have both A and B, and ranks them higher if they also have C D or E.
>
> I'm wondering how this translates to a BooleanQuery. I know I should
> use BooleanClause.Occur.MUST for A and B, and I guess I should use
> BooleanClause.Occur.SHOULD for C, D and E. However the javadocs for
> BooleanClause.Occur.SHOULD state:
>
> "Use this operator for clauses that /should/ appear in the matching
> documents. For a BooleanQuery with two |SHOULD| subqueries, at least
> one of the clauses must appear in the matching documents."
>
> Does this last sentence mean that only a query with _just_ SHOULD
> clauses (i.e. no MUST clauses) requires one of them to match, or will
> the BooleanQuery described above actually constrain the search results
> to (A AND B) AND (C OR D OR E)? If so, what should I use instead?
>
> thank you,
> Peter
Re: Optional terms in BooleanQuery
Posted by Chris Hostetter <ho...@fucit.org>.
: BooleanClause.Occur.SHOULD for C, D and E. However the javadocs for
: BooleanClause.Occur.SHOULD state:
:
: "Use this operator for clauses that /should/ appear in the matching
: documents. For a BooleanQuery with two |SHOULD| subqueries, at least one
: of the clauses must appear in the matching documents."
Yeah, that's misleading... I've committed an update that reads...
/** Use this operator for clauses that <i>should</i> appear in the
* matching documents. For a BooleanQuery with no <code>MUST</code>
* clauses one or more <code>SHOULD</code> clauses must match a document
* for the BooleanQuery to match.
* @see BooleanQuery#setMinimumNumberShouldMatch
*/
public static final Occur SHOULD = new Occur("SHOULD");
...does that make more sense?
-Hoss
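The rule in the revised javadoc can be sketched as a small self-contained predicate. This is a toy model over plain term sets, not Lucene's implementation; the class `ShouldRuleSketch` and its methods are hypothetical names, and the minimum-should-match setting mentioned in the `@see` tag is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the matching rule in the revised javadoc; not Lucene code. */
final class ShouldRuleSketch {

    /** Builds a document as a plain set of terms (hypothetical helper). */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** Returns true if the document matches a flat query of MUST and SHOULD terms. */
    static boolean matches(Set<String> docTerms, List<String> must, List<String> should) {
        // Every MUST term has to be present.
        if (!docTerms.containsAll(must)) {
            return false;
        }
        // With at least one MUST clause, SHOULD clauses are purely optional.
        if (!must.isEmpty()) {
            return true;
        }
        // With no MUST clauses, one or more SHOULD clauses has to match.
        for (String term : should) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> must = Arrays.asList("A", "B");
        List<String> should = Arrays.asList("C", "D", "E");
        List<String> noMust = Arrays.asList();
        System.out.println(matches(doc("A", "B"), must, should));   // true
        System.out.println(matches(doc("A", "C"), must, should));   // false
        System.out.println(matches(doc("C"), noMust, should));      // true
        System.out.println(matches(doc("X"), noMust, should));      // false
    }
}
```

Under this reading, the query from the original post (A and B required, C, D and E optional) matches a document that contains only A and B.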
Re: Optional terms in BooleanQuery
Posted by Soeren Pekrul <so...@gmx.de>.
Peter Bloem wrote:
[...]
> "+(A B) C D E"
[...]
> In other words, Lucene considers all documents that
> have both A and B, and ranks them higher if they also have C D or E.
Hello Peter,
as I understand it, "+(A B) C D E" means at least one of the terms "A"
or "B" must be contained and the terms "C", "D", and "E" are optional.
The following documents d are hits:
d(A, B)
d(A)
d(B)
d(A, C)
...
Documents containing neither "A" nor "B" are not hits.
To have both terms "A" and "B" in a document the query should be: "(+A
+B) C D E" or "+A +B C D E".
Sören
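The difference between the two query strings can be spelled out with a small self-contained sketch. It is only a toy model over plain term sets, assuming the query parser's default operator is OR; it does not call Lucene, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy comparison of "+(A B) C D E" and "+A +B C D E"; not Lucene code. */
final class QuerySyntaxSketch {

    /** Builds a document as a plain set of terms (hypothetical helper). */
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** "+(A B) C D E": the required group (A B) is satisfied by A or B. */
    static boolean matchesEitherRequired(Set<String> docTerms) {
        return docTerms.contains("A") || docTerms.contains("B");
    }

    /** "+A +B C D E": A and B are each required; C, D and E stay optional. */
    static boolean matchesBothRequired(Set<String> docTerms) {
        return docTerms.contains("A") && docTerms.contains("B");
    }

    public static void main(String[] args) {
        String[][] docs = { {"A", "B"}, {"A"}, {"B"}, {"A", "C"}, {"C", "D"} };
        for (String[] terms : docs) {
            Set<String> d = doc(terms);
            System.out.println(Arrays.toString(terms)
                    + " +(A B): " + matchesEitherRequired(d)
                    + " +A +B: " + matchesBothRequired(d));
        }
    }
}
```

Only the first document, d(A, B), satisfies both forms; d(A), d(B) and d(A, C) satisfy only the "+(A B)" form, and d(C, D) satisfies neither.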