You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Jonathan Rochkind <ro...@jhu.edu> on 2011/05/18 00:57:50 UTC
lucene parser, negative OR operands
(changed subject for this topic). Weird. I'm seeing it wrong myself, and
have for a while -- I even wrote some custom pre-processor logic at my
app level to work around it. Weird, I dunno.
Wait. "Queries with -one OR -two return less documents than a either
operand does on its own."
Wait, that's exactly what's wrong, isn't it? How can there be fewer
documents that have "-one OR -two" then have "-one" alone? If there are
X documents that do not have a "one" in them, there can't be less than X
documents that EITHER do not have a "one" OR do not have a "two" (ie,
documents that do not have BOTH one and two), can there? We didn't ask
for "-one AND -two", we asked for "-one OR -two".
On 5/17/2011 6:42 PM, Markus Jelsma wrote:
> mmmm, that's not what i see while testing right now. Queries with -one OR -two
> return less documents than a either operand does on its own, this is with
> LuceneQParser. I haven't done extensive testing since i rarely use boolean
> algebra in Lucene or Solr.
>
>> Oops, you're right, I had misremembered --- Solr 1.4.1 "lucene" qp
>> handles pure negative fine, it's Solr 1.4.1 _dismax_ that does not.
>>
>> Although, here's one, not actually related to this thread, that DOESN'T
>> work in Solr 1.4.1 lucene query parser. Curious if it's been fixed in
>> Solr 3.1.
>>
>> &defType=lucene&q=-one OR -two
>>
>> That one does NOT work as expected in solr 1.4.1, although I can't
>> explain exactly what it's doing, it's not right. (It returns FEWER
>> results than "-one" alone, which can't be right algebraicly). I think.
>> So there are still some kinds of negative queries that do weird things.
>>
>> On 5/17/2011 6:29 PM, Markus Jelsma wrote:
>>> Such a negation works just as one would expect.
>>>
>>> q=*:*
>>> <result name="response" numFound="158" start="0">
>>>
>>> q=*:*&fq=-type:text/html
>>> <result name="response" numFound="25" start="0">
>>>
>>> q=*:*&fq=type:text/html
>>> <result name="response" numFound="133" start="0">
>>>
>>> Well, that adds up , doesn't it ;)
>>>
>>>> 1. I don't think Solr will re-use the filter cache in that situation,
>>>> although I'm not sure. But I comment anyway because, not what you asked
>>>> but something else that will trip you up with your example:
>>>>
>>>> 2. In fact, a pure-negative query like that doesn't work _at all_ in the
>>>> default solr/lucene query parser used for 'fq', at least in Solr 1.4.1.
>>>> Not sure if it's been improved in 3.1, but I don't think so. It will
>>>> always return 0 hits, the solr/lucene query parser can't generate a
>>>> proper lucene query from a pure negative query like that.
>>>>
>>>> To get around this, you can find a variation the query that means the
>>>> same thing but isn't that form. Here's a really ugly one I use, with a
>>>> nested dismax -- dismax ALSO has trouble with pure negatives, although I
>>>> think maybe edismax can handle em? But this weird as heck combo works,
>>>> maybe there's a better way.
>>>>
>>>> NOT _query_:"{!dismax qf=something}history"
>>>>
>>>> And to come around full circle, I have NO idea what effect nested
>>>> queries have on the filter cache. I think that STILL won't re-use the
>>>> filter cache.... but I wonder if it'll re-use the _query_ cache for
>>>> "history"? I forget even more how the query cache works though.
>>>>
>>>> On 5/17/2011 6:07 PM, Burton-West, Tom wrote:
>>>>> If I have a query with a filter query such as : " q=art&fq=history" and
>>>>> then run a second query "q=art&fq=-history", will Solr realize that it
>>>>> can use the cached results of the previous filter query "history" (in
>>>>> the filter cache) or will it not realize this and have to actually do a
>>>>> second filter query against the index for "not history"?
>>>>>
>>>>> Tom
Re: lucene parser, negative OR operands
Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 5/17/2011 7:07 PM, Markus Jelsma wrote:
>
> This is propably due to the contents (HTML bodies) of the documents i've
> queried. It's not so strange for this type of document to return less
> documents when two negated operands are specified. In my case (i tested it) a
> conjunction returned the same documents as a disjunction did.
>
> Again, i haven't done extensive testing on this subject.
I think we're disagreeing on what the proper behavior is. Because my
understanding of the proper behavior under boolean logic, it doesn't
matter the contents of your documents, it is logically impossible.
Perhaps I am wrong to expect that lucene's pseudo-boolean operators will
behave like actual boolean logic?
under boolean logic -- assuming "-one" means the same thing as "NOT one"
-- both mean "all documents that do not have 'one'", right? :
-one OR -two === (NOT one) OR (NOT two) === NOT ( one AND two )
And it is logically impossible for that query to return FEWER results
than "-one" alone does, or than "-two" alone does, in ANY corpus. It
can return the same #, or it can return more. You can never get fewer
documents by adding an "OR" union on, right? That's a set union, union
of set A with some other (possibly empty) set B can never have fewer
members than set A alone!
In fact, playing around more and comparing hit counts, it looks like
Solr 1.4.1 lucene query parser treats:
"-one OR -two"
the same as
NOT (one OR two)
Which is not/should not be the same query at all.
The first is "all documents that don't have 'one' COMBINED WITH all
documents that don't have 'two'". The second is "all documents that
have NEITHER 'one' NOR 'two'". Those are two different things, or
ought to be.
Or am I wrong to think that? That is certainly the way boolean algebra
works; if "-one" is a boolean negation the same as "NOT one". Then "-one
OR -two" definitely ought _not_ to be the same query as "NOT (one OR
two)". But maybe I should not be expecting predictable boolean algebra
here? But if that's the case then I'm not sure what behavior I should be
expecting, what the expected predictable behavior of these operators is!
If we want to make things even more confusing, I can supply some other
patterns involving an explicit "NOT" that also don't work how I expect
or according to any predictable way I can figure out.
Re: lucene parser, negative OR operands
Posted by Markus Jelsma <ma...@openindex.io>.
> (changed subject for this topic). Weird. I'm seeing it wrong myself, and
> have for a while -- I even wrote some custom pre-processor logic at my
> app level to work around it. Weird, I dunno.
>
> Wait. "Queries with -one OR -two return less documents than a either
> operand does on its own."
>
> Wait, that's exactly what's wrong, isn't it? How can there be fewer
> documents that have "-one OR -two" then have "-one" alone? If there are
> X documents that do not have a "one" in them, there can't be less than X
> documents that EITHER do not have a "one" OR do not have a "two" (ie,
> documents that do not have BOTH one and two), can there? We didn't ask
> for "-one AND -two", we asked for "-one OR -two".
This is propably due to the contents (HTML bodies) of the documents i've
queried. It's not so strange for this type of document to return less
documents when two negated operands are specified. In my case (i tested it) a
conjunction returned the same documents as a disjunction did.
Again, i haven't done extensive testing on this subject.
>
> On 5/17/2011 6:42 PM, Markus Jelsma wrote:
> > mmmm, that's not what i see while testing right now. Queries with -one OR
> > -two return less documents than a either operand does on its own, this
> > is with LuceneQParser. I haven't done extensive testing since i rarely
> > use boolean algebra in Lucene or Solr.
> >
> >> Oops, you're right, I had misremembered --- Solr 1.4.1 "lucene" qp
> >> handles pure negative fine, it's Solr 1.4.1 _dismax_ that does not.
> >>
> >> Although, here's one, not actually related to this thread, that DOESN'T
> >> work in Solr 1.4.1 lucene query parser. Curious if it's been fixed in
> >> Solr 3.1.
> >>
> >> &defType=lucene&q=-one OR -two
> >>
> >> That one does NOT work as expected in solr 1.4.1, although I can't
> >> explain exactly what it's doing, it's not right. (It returns FEWER
> >> results than "-one" alone, which can't be right algebraicly). I think.
> >> So there are still some kinds of negative queries that do weird things.
> >>
> >> On 5/17/2011 6:29 PM, Markus Jelsma wrote:
> >>> Such a negation works just as one would expect.
> >>>
> >>> q=*:*
> >>> <result name="response" numFound="158" start="0">
> >>>
> >>> q=*:*&fq=-type:text/html
> >>> <result name="response" numFound="25" start="0">
> >>>
> >>> q=*:*&fq=type:text/html
> >>> <result name="response" numFound="133" start="0">
> >>>
> >>> Well, that adds up , doesn't it ;)
> >>>
> >>>> 1. I don't think Solr will re-use the filter cache in that situation,
> >>>> although I'm not sure. But I comment anyway because, not what you
> >>>> asked but something else that will trip you up with your example:
> >>>>
> >>>> 2. In fact, a pure-negative query like that doesn't work _at all_ in
> >>>> the default solr/lucene query parser used for 'fq', at least in Solr
> >>>> 1.4.1. Not sure if it's been improved in 3.1, but I don't think so.
> >>>> It will always return 0 hits, the solr/lucene query parser can't
> >>>> generate a proper lucene query from a pure negative query like that.
> >>>>
> >>>> To get around this, you can find a variation the query that means the
> >>>> same thing but isn't that form. Here's a really ugly one I use, with a
> >>>> nested dismax -- dismax ALSO has trouble with pure negatives, although
> >>>> I think maybe edismax can handle em? But this weird as heck combo
> >>>> works, maybe there's a better way.
> >>>>
> >>>> NOT _query_:"{!dismax qf=something}history"
> >>>>
> >>>> And to come around full circle, I have NO idea what effect nested
> >>>> queries have on the filter cache. I think that STILL won't re-use the
> >>>> filter cache.... but I wonder if it'll re-use the _query_ cache for
> >>>> "history"? I forget even more how the query cache works though.
> >>>>
> >>>> On 5/17/2011 6:07 PM, Burton-West, Tom wrote:
> >>>>> If I have a query with a filter query such as : " q=art&fq=history"
> >>>>> and then run a second query "q=art&fq=-history", will Solr realize
> >>>>> that it can use the cached results of the previous filter query
> >>>>> "history" (in the filter cache) or will it not realize this and
> >>>>> have to actually do a second filter query against the index for
> >>>>> "not history"?
> >>>>>
> >>>>> Tom
Re: lucene parser, negative OR operands
Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 5/18/2011 9:07 PM, Chris Hostetter wrote:
> You could implement a parser like that relatively easily -- just make sure
> you put a MatchAllDocsQuery in every BooleanQuery object thta you
> construct, and only ever use the PROHIBITED and MANDATORY clause types
> (never OPTIONAL) ... the thing is, a parser like that isn't as useful
> as you think it might be when dealing with search results. "OPTIONAL"
> clauses are where most of the useful factors of scoring documents ocme
> into play.
Thanks for the background and ideas, very helpful.
Hmm, but what if it DID use OPTIONAL clause types.... but just turned
all pure-negative clauses into the alternative combination with
MatchAllDocsQuery ( "*:* AND $pure_negative")? Just like lucene query
parser does now, but not only for top-level clauses. Seems like that
might maintain the power of optional clauses for scoring, but still
allow negative clauses to work the 'boolean logic' way people expect --
same rationale that has the query parser doing this at top-level, why
not do it for sub-clauses as well? Does that have any promise do you think?
Jonathan
Re: lucene parser, negative OR operands
Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Yonik. I recall hearing about this before, but was vague on the
: details, thanks for supplying some and refreshing my memory.
matching in Lucene is addative ... queries must match *something*, a
clause ofa boolean query can be the negation of a query, but that only
defines how documents should be removed from the set matched by the other
queries in that boolean.
To put it another way: imagine modeling the list of documents matching a
query as a bitset. you can set bits to true, and you can set bits to
false, but the bitset starts out with all bits as false, so if all
you do is set bits to false, your bitset will *end* will all bits as false
: If I want to understand more about how the lucene query parser does it's
: thing, can anyone suggest the source files I should be looking at?
the QueryParser.jj is the grammer for parsing, but the crux is to
understand that the BooleanQuery class supports three types of clauses:
PROHIBITED, MANDATORY, and OPTIONAL. The QueryParser implements those as
"-", "+" and the default beahvior when neither +/- is present. The
QueryParser also jumps through some hoops to support AND, OR, NOT but not
all permutations of those are viable
: If I really do want actual boolean logic behavior, what are my options? I
: guess one is trying to write my own query parser.
"boolean logic" generally is defined in some form relative "the universe"
.. so a pure negative query like "-red" really means "all things IN THE
UNIVERSE that are not 'red'" ... you can express that using "*:* -red"
What solr does (and how this thread started) is pointing out that for top
level queries, (like "q=-red" or "fq=-red") solr adds the *:* to the
boolean query for you.
: Hmm, for that particular query, what about using parens to force a sub-query?
:
: (-one) OR (-two)
:
: Ha, nope, that runs into a different problem (or is it the same problem?), and
: always returns 0 hits. It looks like the lucene query parser can't handle a
: pure-negative sub-query like that seperate by OR? Not sure why, can anyone
: explain that one?
the query parser can handle it, and it produces a valid query object, but
that query object doesn't match anything. "-one" matches nothing, "-two"
matchines nothing ... nothing union nothing is still nothing.
: For that particular pattern, this crazy refactoring of the query does work and
: get the actual boolean logic result of "(not 'one') OR (not 'two')":
:
: (*:* AND -one) OR (*:* AND -two)
correct -- that is you formally saying "give me all docs IN THE UNIVERSE
that are not 'one', and union that with all docs IN THE UNIVERSE that are
not 'two'"
: behavior for that pattern, but in general, I'm kind of wanting a parser that
: will give actual boolean logic behavior. Maybe someday I can find time to
: write it in Java (not the quickest thing for me, not familiar with the code at
: all).
You could implement a parser like that relatively easily -- just make sure
you put a MatchAllDocsQuery in every BooleanQuery object thta you
construct, and only ever use the PROHIBITED and MANDATORY clause types
(never OPTIONAL) ... the thing is, a parser like that isn't as useful
as you think it might be when dealing with search results. "OPTIONAL"
clauses are where most of the useful factors of scoring documents ocme
into play.
-Hoss
Re: lucene parser, negative OR operands
Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 5/17/2011 8:00 PM, Yonik Seeley wrote:
> This doesn't have to do with Solr's support of pure-negative top-level
> queries, but does have to do with
> a long standing confusion of how the lucene queryparser works with
> some of the operators (i.e. not really boolean logic).
>
> In a Lucene BooleanQuery, clauses are mandatory, optional, or prohibited.
> -foo OR -bar actually parses to a boolean query with two prohibited
> clauses... essentially the
> same as -foo AND -bar. You can see this by adding debugQuery=true to
> the request.
Thanks Yonik. I recall hearing about this before, but was vague on the
details, thanks for supplying some and refreshing my memory.
So I guess there is no such thing as an "optional prohibited" clause.
Which is what makes "-one OR -two" the same thing as "-one AND -two".
Actually, yeah, an "optional prohibited" clause doesn't reallly even
make sense. Hmm.
If I want to understand more about how the lucene query parser does it's
thing, can anyone suggest the source files I should be looking at?
If I really do want actual boolean logic behavior, what are my options?
I guess one is trying to write my own query parser.
Hmm, for that particular query, what about using parens to force a
sub-query?
(-one) OR (-two)
Ha, nope, that runs into a different problem (or is it the same
problem?), and always returns 0 hits. It looks like the lucene query
parser can't handle a pure-negative sub-query like that seperate by OR?
Not sure why, can anyone explain that one?
For that particular pattern, this crazy refactoring of the query does
work and get the actual boolean logic result of "(not 'one') OR (not
'two')":
(*:* AND -one) OR (*:* AND -two)
Phew, crazy stuff. So that's a weird solution to getting actual boolean
logic behavior for that pattern, but in general, I'm kind of wanting a
parser that will give actual boolean logic behavior. Maybe someday I can
find time to write it in Java (not the quickest thing for me, not
familiar with the code at all).
Jonathan
Re: lucene parser, negative OR operands
Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, May 17, 2011 at 6:57 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
> (changed subject for this topic). Weird. I'm seeing it wrong myself, and
> have for a while -- I even wrote some custom pre-processor logic at my app
> level to work around it. Weird, I dunno.
>
> Wait. "Queries with -one OR -two return less documents than a either operand
> does on its own."
This doesn't have to do with Solr's support of pure-negative top-level
queries, but does have to do with
a long standing confusion of how the lucene queryparser works with
some of the operators (i.e. not really boolean logic).
In a Lucene BooleanQuery, clauses are mandatory, optional, or prohibited.
-foo OR -bar actually parses to a boolean query with two prohibited
clauses... essentially the
same as -foo AND -bar. You can see this by adding debugQuery=true to
the request.
-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco