You are viewing a plain text version of this content. The canonical link for it is here.
Posted to general@lucene.apache.org by Jenny Brown <sk...@gmail.com> on 2009/03/06 18:53:24 UTC
operator precedence and confusing result
I have Lucene running against a fairly large set of documents, and a
user interface for full text querying. All queries mentioned below go
against the document full text.
I have one query from a user that is giving me bizarre behavior. Its
structure is:
A) medicine AND (cat OR dog OR horse OR fish) = 10,000 results
B) medicine AND cat OR dog OR horse OR fish = 4,000 results
C) (medicine AND cat) OR dog OR horse OR fish = 90,000 results
(which tells me this isn't how B is interpretted)
This is backwards from what I would expect, as I would guess that case
B would bind (medicine AND cat) OR dog... etc., thus resulting in
-more- hits than A, rather than less. Can someone explain what Lucene
is doing for case B?
I'm using StandardAnalyzer and a typically tokenized document text
field, and BooleanQuery. Thanks for ideas and help.
Jenny Brown
Re: operator precedence and confusing result
Posted by Chris Hostetter <ho...@fucit.org>.
to elaborate a bit on Mike's comments...
: version of Lucene). Lucene doesn't have a hierarchical boolean query model
...it has a relevancy based query model. documents don't just
match(true|false) ... if they match, they have a score.
: A AND B OR C OR D OR E OR F
: -> +A +B C D E F
: -> find documents that match clause A and clause B (other clauses don't affect
: matching)
...but the other caluses do affect scoring, the more of those clauses that
match, the higher the score.
-Hoss
Re: operator precedence and confusing result
Posted by Chris Hostetter <ho...@fucit.org>.
: Yes, but until that work is done (and it probably belongs in a separate
: queryparser module, imo), we're left with the current situation, which doesn't
: work but seductively gives the appearance of working.
:
: Anyway, the point seems moot given the volumous silence my suggestion has
: incited.
In general i agree with you Mike: the world would be a lot simpler if we
removed the AND/NOT/OR keywords from the query parser ... but the hitch is
that when used in a particular way (i don't want to say "correctly"
because that's a subjective evaluation, so instead i'll say
"unambiguously") they can be extremely effective for people.
((A AND B) OR (C AND D)) AND X
...works the way *everyone* would exepect it to work, and is much simpler
for some people to udnerstand then...
+((+A +B) (+C +D)) +X
Personally: I'd be very happy if there were a lot more QueryParsers, and
we had one that supported the +/- syntax and a seperate one that supported
the AND/OR/NOT syntax ... in fact: i think there's a nice long thread
(that i really need to find time to read) on java-dev about how the IBM
folks may have solved that problem for everyone.
-Hoss
Re: operator precedence and confusing result
Posted by Mike Klaas <mi...@gmail.com>.
On 13-Mar-09, at 11:33 AM, Ted Dunning wrote:
> Actually, the nesting of BooleanQueries with required, optional and
> excluded
> clauses are equivalent in power to (), AND, OR and AND-NOT. Any
> expression
> in one form can be expressed in the other. Moreover, the Lucene
> form allows
> very nice intepretation with relevance scoring that is difficult
> with the
> normal Boolean expressions.
>
> The only thing you need to have is an accurate translation. No need
> to
> remove anything.
Yes, but until that work is done (and it probably belongs in a
separate queryparser module, imo), we're left with the current
situation, which doesn't work but seductively gives the appearance of
working.
Anyway, the point seems moot given the volumous silence my suggestion
has incited.
-Mike
Re: operator precedence and confusing result
Posted by Ted Dunning <te...@gmail.com>.
Actually, the nesting of BooleanQueries with required, optional and excluded
clauses are equivalent in power to (), AND, OR and AND-NOT. Any expression
in one form can be expressed in the other. Moreover, the Lucene form allows
very nice intepretation with relevance scoring that is difficult with the
normal Boolean expressions.
The only thing you need to have is an accurate translation. No need to
remove anything.
On Fri, Mar 13, 2009 at 10:51 AM, Mike Klaas <mi...@gmail.com> wrote:
> Anyway, just a note to say that boolean matching is important to me
>> and my users; it'd be good if it worked the way it looks like it
>> would. If it doesn't, I need to understand better what the current
>> limitations are.
>>
>
> Well, this is precisely why I am suggesting that we remove it (in some
> future version of Lucene). Lucene doesn't have a hierarchical boolean query
> model that works like people "expect", and bugs filed that report
> discrepancies between the way boolean operators work and intuition are
> rejected. We are left with something that is convenient if you understand
> how it works, but if that is so, there is no reason that translation into
> the alternate syntax can't be used.
>
> Lucene's query model is based on REQUIRED, OPTIONAL, and EXCLUDED clauses.
> A clause with no annotation is always OPTIONAL, and doesn't affect matching
> unless there are only OPTIONAL clauses on that level. brackets () create a
> subclause (note that this is OPTIONAL by default!). AND terms are
> translated into REQUIRED clauses, AND NOT's are translated into EXCLUDED
> clauses. Require clauses are annotated with +'s
--
Ted Dunning, CTO
DeepDyve
Re: operator precedence and confusing result
Posted by Mike Klaas <mi...@gmail.com>.
On 11-Mar-09, at 7:13 PM, Jenny Brown wrote:
> I use the boolean logic heavily in a production app, because it's the
> grammar that my users understand (and they put together complex
> boolean queries in other apps too). Also, we're not using relevance
> ranking. A document either "matches the query" and gets returned, or
> "doesn't match" and doesn't get returned. We only want yes/no
> answers.
>
> I haven't had time to really figure out what the earlier commenter
> meant with the + operators syntax conversion. I still thought it
> would have meant the same thing as the query I had posted, ie, article
> has to match all terms in the AND clauses, and at least one of the
> terms in the OR list. I guess I'm still missing what his explanation
> was trying to demonstrate.
>
> Anyway, just a note to say that boolean matching is important to me
> and my users; it'd be good if it worked the way it looks like it
> would. If it doesn't, I need to understand better what the current
> limitations are.
Well, this is precisely why I am suggesting that we remove it (in some
future version of Lucene). Lucene doesn't have a hierarchical boolean
query model that works like people "expect", and bugs filed that
report discrepancies between the way boolean operators work and
intuition are rejected. We are left with something that is convenient
if you understand how it works, but if that is so, there is no reason
that translation into the alternate syntax can't be used.
Lucene's query model is based on REQUIRED, OPTIONAL, and EXCLUDED
clauses. A clause with no annotation is always OPTIONAL, and doesn't
affect matching unless there are only OPTIONAL clauses on that level.
brackets () create a subclause (note that this is OPTIONAL by
default!). AND terms are translated into REQUIRED clauses, AND NOT's
are translated into EXCLUDED clauses. Require clauses are annotated
with +'s
A AND B OR C OR D OR E OR F
-> +A +B C D E F
-> find documents that match clause A and clause B (other clauses
don't affect matching)
C OR D OR E OR F
-> C D E F
-> find documents matching at least one of these clauses
A AND (B OR C OR D OR E OR F)
-> +A +(B C D E F)
-> find documents that match A, and match one of B, C, D, E, or F
(A AND B) OR C OR D OR E OR F
-> (+A +B) C D E F
-> find documents that match at least one of C, D, E, F, or both of A
and B
The key takeaway: once you have an AND in a grouped set of clauses,
the OR are completely irrelevant for matching.
-Mike
Re: operator precedence and confusing result
Posted by Jenny Brown <sk...@gmail.com>.
I use the boolean logic heavily in a production app, because it's the
grammar that my users understand (and they put together complex
boolean queries in other apps too). Also, we're not using relevance
ranking. A document either "matches the query" and gets returned, or
"doesn't match" and doesn't get returned. We only want yes/no
answers.
I haven't had time to really figure out what the earlier commenter
meant with the + operators syntax conversion. I still thought it
would have meant the same thing as the query I had posted, ie, article
has to match all terms in the AND clauses, and at least one of the
terms in the OR list. I guess I'm still missing what his explanation
was trying to demonstrate.
Anyway, just a note to say that boolean matching is important to me
and my users; it'd be good if it worked the way it looks like it
would. If it doesn't, I need to understand better what the current
limitations are.
I'll be out of town and away from net for 4 days, starting tomorrow,
so I'll have to continue conversation once I'm back. Thanks for the
help and comments thus far.
Jenny Brown
On Wed, Mar 11, 2009 at 8:58 PM, Mike Klaas <mi...@gmail.com> wrote:
>
> On 8-Mar-09, at 3:46 PM, Chris Hostetter wrote:
>>
>> it's important to rememebr that Lucene isn't a boolean matching system --
>> the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
>> keywords are just syntactic sugar that attempt to apply underlying
>> semantics as binary opperators.
>
> Has anyone (but me) considered ripping out this misleading (and imho not
> terribly useful) syntactic sugar in Lucene 3.0? It seems that the boolean
> operator support is little but an impediment to understanding the Lucene
> query model. Analogies that only occasionally accord with a person's
> intuition are among the most dangerous.
>
> There could always be a query parser mode or module that is devoted to
> parsing boolean operators correctly.
>
> -Mike
>
Re: operator precedence and confusing result
Posted by Mike Klaas <mi...@gmail.com>.
On 8-Mar-09, at 3:46 PM, Chris Hostetter wrote:
>
> it's important to rememebr that Lucene isn't a boolean matching
> system --
> the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
> keywords are just syntactic sugar that attempt to apply underlying
> semantics as binary opperators.
Has anyone (but me) considered ripping out this misleading (and imho
not terribly useful) syntactic sugar in Lucene 3.0? It seems that the
boolean operator support is little but an impediment to understanding
the Lucene query model. Analogies that only occasionally accord with
a person's intuition are among the most dangerous.
There could always be a query parser mode or module that is devoted to
parsing boolean operators correctly.
-Mike
Re: operator precedence and confusing result
Posted by Ted Dunning <te...@gmail.com>.
Jenny,
The best thing to do is to use Luke to explain how the query parser is
reading your query.
See http://luke.sourceforge.net/ or http://www.getopt.org/luke/ for a
download and web start link.
And, no, the proximity query should not match if both words are not
present. But users are notoriously unreliable witnesses relative to this
sort of thing.
On Mon, Mar 16, 2009 at 12:55 PM, Jenny Brown <sk...@gmail.com> wrote:
> I often have users performing a search like:
>
> "music arts"~10 OR "band festival" OR "rock concert"
>
> and then reporting back to me that they'll see results containing just
> the word band, or just the word concert, that don't match an entire
> phrase. Am I misunderstanding phrase search here? Is the proximity
> search changing the behavior of the whole query?
>
>
Re: operator precedence and confusing result
Posted by Jenny Brown <sk...@gmail.com>.
Ok, after Mike Klaas's explanation of how it's translated, and a few
minutes of really studying your examples, this makes sense to me now.
And the number of results makes sense. Now my challenge is how to
explain this to my users - oi, what a difference, and subtleties.
Thanks for the help. I do have one remaining question. How do
proximity searches fit into this?
I often have users performing a search like:
"music arts"~10 OR "band festival" OR "rock concert"
and then reporting back to me that they'll see results containing just
the word band, or just the word concert, that don't match an entire
phrase. Am I misunderstanding phrase search here? Is the proximity
search changing the behavior of the whole query?
Jenny Brown
On Sun, Mar 8, 2009 at 5:46 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : A) medicine AND (cat OR dog OR horse OR fish) = 10,000 results
> : B) medicine AND cat OR dog OR horse OR fish = 4,000 results
> : C) (medicine AND cat) OR dog OR horse OR fish = 90,000 results
> : (which tells me this isn't how B is interpretted)
>
> it's important to rememebr that Lucene isn't a boolean matching system --
> the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
> keywords are just syntactic sugar that attempt to apply underlying
> semantics as binary opperators.
>
> if you look at the toString output of queries you can see what the parser
> has done to try and apply your use of AND/OR to the underlying query
> structures. here's an example of what the toString for those queries
> looks like (my analyzer does some stemming, but the structure of hte
> queries is going to be the same regardless of the analyzer)
>
>
> A) +text:medicin +(text:cat text:dog text:hors text:fish)
> B) +text:medicin +text:cat text:dog text:hors text:fish
> C) (+text:medicin +text:cat) text:dog text:hors text:fish
>
> ...Does the discrepency in the number of results make sense now?
>
>
> -Hoss
>
>
Re: operator precedence and confusing result
Posted by Chris Hostetter <ho...@fucit.org>.
: A) medicine AND (cat OR dog OR horse OR fish) = 10,000 results
: B) medicine AND cat OR dog OR horse OR fish = 4,000 results
: C) (medicine AND cat) OR dog OR horse OR fish = 90,000 results
: (which tells me this isn't how B is interpretted)
it's important to rememebr that Lucene isn't a boolean matching system --
the undelrying semantics are MUST/MUST_NOT/SHOULD -- the AND/OR
keywords are just syntactic sugar that attempt to apply underlying
semantics as binary opperators.
if you look at the toString output of queries you can see what the parser
has done to try and apply your use of AND/OR to the underlying query
structures. here's an example of what the toString for those queries
looks like (my analyzer does some stemming, but the structure of hte
queries is going to be the same regardless of the analyzer)
A) +text:medicin +(text:cat text:dog text:hors text:fish)
B) +text:medicin +text:cat text:dog text:hors text:fish
C) (+text:medicin +text:cat) text:dog text:hors text:fish
...Does the discrepency in the number of results make sense now?
-Hoss