You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Jonathan Rochkind <ro...@jhu.edu> on 2011/05/18 00:57:50 UTC

lucene parser, negative OR operands

(changed subject for this topic). Weird. I'm seeing it wrong myself, and 
have for a while -- I even wrote some custom pre-processor logic at my 
app level to work around it.  Weird, I dunno.

Wait. "Queries with -one OR -two return less documents than a either 
operand does on its own."

Wait, that's exactly what's wrong, isn't it?  How can there be fewer 
documents that have "-one OR -two" then have "-one" alone?  If there are 
X documents that do not have a "one" in them, there can't be less than X 
documents that EITHER do not have a "one" OR do not have a "two" (ie, 
documents that do not have BOTH one and two), can there? We didn't ask 
for "-one AND -two", we asked for "-one OR -two".

On 5/17/2011 6:42 PM, Markus Jelsma wrote:
> mmmm, that's not what i see while testing right now. Queries with -one OR -two
> return less documents than a either operand does on its own, this is with
> LuceneQParser. I haven't done extensive testing since i rarely use boolean
> algebra in Lucene or Solr.
>
>> Oops, you're right, I had misremembered --- Solr 1.4.1 "lucene" qp
>> handles pure negative fine, it's Solr 1.4.1 _dismax_ that does not.
>>
>> Although, here's one, not actually related to this thread,  that DOESN'T
>> work in Solr 1.4.1 lucene query parser. Curious if it's been fixed in
>> Solr 3.1.
>>
>> &defType=lucene&q=-one OR -two
>>
>> That one does NOT work as expected in solr 1.4.1, although I can't
>> explain exactly what it's doing, it's not right. (It returns FEWER
>> results than "-one" alone, which can't be right algebraicly). I think.
>> So there are still some kinds of negative queries that do weird things.
>>
>> On 5/17/2011 6:29 PM, Markus Jelsma wrote:
>>> Such a negation works just as one would expect.
>>>
>>> q=*:*
>>> <result name="response" numFound="158" start="0">
>>>
>>> q=*:*&fq=-type:text/html
>>> <result name="response" numFound="25" start="0">
>>>
>>> q=*:*&fq=type:text/html
>>> <result name="response" numFound="133" start="0">
>>>
>>> Well, that adds up , doesn't it ;)
>>>
>>>> 1. I don't think Solr will re-use the filter cache in that situation,
>>>> although I'm not sure. But I comment anyway because, not what you asked
>>>> but something else that will trip you up with your example:
>>>>
>>>> 2. In fact, a pure-negative query like that doesn't work _at all_ in the
>>>> default solr/lucene query parser used for 'fq', at least in Solr 1.4.1.
>>>> Not sure if it's been improved in 3.1, but I don't think so.  It will
>>>> always return 0 hits, the solr/lucene query parser can't generate a
>>>> proper lucene query from a pure negative query like that.
>>>>
>>>> To get around this, you can find a variation the query that means the
>>>> same thing but isn't that form. Here's a really ugly one I use, with a
>>>> nested dismax -- dismax ALSO has trouble with pure negatives, although I
>>>> think maybe edismax can handle em? But this weird as heck combo works,
>>>> maybe there's a better way.
>>>>
>>>> NOT _query_:"{!dismax qf=something}history"
>>>>
>>>> And to come around full circle, I have NO idea what effect nested
>>>> queries have on the filter cache. I think that STILL won't re-use the
>>>> filter cache.... but I wonder if it'll re-use the _query_ cache for
>>>> "history"?  I forget even more how the query cache works though.
>>>>
>>>> On 5/17/2011 6:07 PM, Burton-West, Tom wrote:
>>>>> If I have a query with a filter query such as : " q=art&fq=history" and
>>>>> then run a second query  "q=art&fq=-history", will Solr realize that it
>>>>> can use the cached results of the previous filter query "history"  (in
>>>>> the filter cache) or will it not realize this and have to actually do a
>>>>> second filter query against the index  for "not history"?
>>>>>
>>>>> Tom

Re: lucene parser, negative OR operands

Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 5/17/2011 7:07 PM, Markus Jelsma wrote:
>
> This is propably due to the contents (HTML bodies) of the documents i've
> queried. It's not so strange for this type of document to return less
> documents when two negated operands are specified. In my case (i tested it) a
> conjunction returned the same documents as a disjunction did.
>
> Again, i haven't done extensive testing on this subject.

I think we're disagreeing on what the proper behavior is. Because my 
understanding of the proper behavior under boolean logic, it doesn't 
matter the contents of your documents, it is logically impossible. 
Perhaps I am wrong to expect that lucene's pseudo-boolean operators will 
behave like actual boolean logic?

under boolean logic -- assuming "-one" means the same thing as "NOT one" 
-- both mean "all documents that do not have 'one'", right? :

-one OR -two === (NOT one) OR (NOT two) ===  NOT (  one AND two )

And it is logically impossible for that query to return FEWER results 
than "-one" alone does, or than "-two" alone does, in ANY corpus.  It 
can return the same #, or it can return more.  You can never get fewer 
documents by adding an "OR" union on, right? That's a set union, union 
of set A with some other (possibly empty) set B can never have fewer 
members than set A alone!

In fact, playing around more and comparing hit counts, it looks like 
Solr 1.4.1 lucene query parser treats:

"-one OR -two"
the same as
NOT (one OR two)

Which is not/should not be the same query at all.

The first is "all documents that don't have 'one' COMBINED WITH all 
documents that don't have 'two'".  The second is "all documents that 
have NEITHER 'one' NOR 'two'".   Those are two different things, or 
ought to be.

Or am I wrong to think that? That is certainly the way boolean algebra 
works; if "-one" is a boolean negation the same as "NOT one". Then "-one 
OR -two" definitely ought _not_ to be the same query as "NOT (one OR 
two)".   But maybe I should not be expecting predictable boolean algebra 
here? But if that's the case then I'm not sure what behavior I should be 
expecting, what the expected predictable behavior of these operators is!

If we want to make things even more confusing, I can supply some other 
patterns involving an explicit "NOT" that also don't work how I expect 
or according to any predictable way I can figure out.

Re: lucene parser, negative OR operands

Posted by Markus Jelsma <ma...@openindex.io>.
> (changed subject for this topic). Weird. I'm seeing it wrong myself, and
> have for a while -- I even wrote some custom pre-processor logic at my
> app level to work around it.  Weird, I dunno.
> 
> Wait. "Queries with -one OR -two return less documents than a either
> operand does on its own."
> 
> Wait, that's exactly what's wrong, isn't it?  How can there be fewer
> documents that have "-one OR -two" then have "-one" alone?  If there are
> X documents that do not have a "one" in them, there can't be less than X
> documents that EITHER do not have a "one" OR do not have a "two" (ie,
> documents that do not have BOTH one and two), can there? We didn't ask
> for "-one AND -two", we asked for "-one OR -two".

This is propably due to the contents (HTML bodies) of the documents i've 
queried. It's not so strange for this type of document to return less 
documents when two negated operands are specified. In my case (i tested it) a 
conjunction returned the same documents as a disjunction did.

Again, i haven't done extensive testing on this subject.

> 
> On 5/17/2011 6:42 PM, Markus Jelsma wrote:
> > mmmm, that's not what i see while testing right now. Queries with -one OR
> > -two return less documents than a either operand does on its own, this
> > is with LuceneQParser. I haven't done extensive testing since i rarely
> > use boolean algebra in Lucene or Solr.
> > 
> >> Oops, you're right, I had misremembered --- Solr 1.4.1 "lucene" qp
> >> handles pure negative fine, it's Solr 1.4.1 _dismax_ that does not.
> >> 
> >> Although, here's one, not actually related to this thread,  that DOESN'T
> >> work in Solr 1.4.1 lucene query parser. Curious if it's been fixed in
> >> Solr 3.1.
> >> 
> >> &defType=lucene&q=-one OR -two
> >> 
> >> That one does NOT work as expected in solr 1.4.1, although I can't
> >> explain exactly what it's doing, it's not right. (It returns FEWER
> >> results than "-one" alone, which can't be right algebraicly). I think.
> >> So there are still some kinds of negative queries that do weird things.
> >> 
> >> On 5/17/2011 6:29 PM, Markus Jelsma wrote:
> >>> Such a negation works just as one would expect.
> >>> 
> >>> q=*:*
> >>> <result name="response" numFound="158" start="0">
> >>> 
> >>> q=*:*&fq=-type:text/html
> >>> <result name="response" numFound="25" start="0">
> >>> 
> >>> q=*:*&fq=type:text/html
> >>> <result name="response" numFound="133" start="0">
> >>> 
> >>> Well, that adds up , doesn't it ;)
> >>> 
> >>>> 1. I don't think Solr will re-use the filter cache in that situation,
> >>>> although I'm not sure. But I comment anyway because, not what you
> >>>> asked but something else that will trip you up with your example:
> >>>> 
> >>>> 2. In fact, a pure-negative query like that doesn't work _at all_ in
> >>>> the default solr/lucene query parser used for 'fq', at least in Solr
> >>>> 1.4.1. Not sure if it's been improved in 3.1, but I don't think so. 
> >>>> It will always return 0 hits, the solr/lucene query parser can't
> >>>> generate a proper lucene query from a pure negative query like that.
> >>>> 
> >>>> To get around this, you can find a variation the query that means the
> >>>> same thing but isn't that form. Here's a really ugly one I use, with a
> >>>> nested dismax -- dismax ALSO has trouble with pure negatives, although
> >>>> I think maybe edismax can handle em? But this weird as heck combo
> >>>> works, maybe there's a better way.
> >>>> 
> >>>> NOT _query_:"{!dismax qf=something}history"
> >>>> 
> >>>> And to come around full circle, I have NO idea what effect nested
> >>>> queries have on the filter cache. I think that STILL won't re-use the
> >>>> filter cache.... but I wonder if it'll re-use the _query_ cache for
> >>>> "history"?  I forget even more how the query cache works though.
> >>>> 
> >>>> On 5/17/2011 6:07 PM, Burton-West, Tom wrote:
> >>>>> If I have a query with a filter query such as : " q=art&fq=history"
> >>>>> and then run a second query  "q=art&fq=-history", will Solr realize
> >>>>> that it can use the cached results of the previous filter query
> >>>>> "history"  (in the filter cache) or will it not realize this and
> >>>>> have to actually do a second filter query against the index  for
> >>>>> "not history"?
> >>>>> 
> >>>>> Tom

Re: lucene parser, negative OR operands

Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 5/18/2011 9:07 PM, Chris Hostetter wrote:
> You could implement a parser like that relatively easily -- just make sure
> you put a MatchAllDocsQuery in every BooleanQuery object thta you
> construct, and only ever use the PROHIBITED and MANDATORY clause types
> (never OPTIONAL) ...  the thing is, a parser like that isn't as useful
> as you think it might be when dealing with search results.  "OPTIONAL"
> clauses are where most of the useful factors of scoring documents ocme
> into play.

Thanks for the background and ideas, very helpful.

Hmm, but what if it DID use OPTIONAL clause types.... but just turned 
all pure-negative clauses into the alternative combination with 
MatchAllDocsQuery ( "*:* AND $pure_negative")?  Just like lucene query 
parser does now, but not only for top-level clauses. Seems like that 
might maintain the power of optional clauses for scoring, but still 
allow negative clauses to work the 'boolean logic' way people expect -- 
same rationale that has the query parser doing this at top-level, why 
not do it for sub-clauses as well? Does that have any promise do you think?

Jonathan

Re: lucene parser, negative OR operands

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Yonik. I recall hearing about this before, but was vague on the
: details, thanks for supplying some and refreshing my memory.

matching in Lucene is addative ... queries must match *something*, a 
clause ofa boolean query can be the negation of a query, but that only 
defines how documents should be removed from the set matched by the other 
queries in that boolean.

To put it another way: imagine modeling the list of documents matching a 
query as a bitset.  you can set bits to true, and you can set bits to 
false, but the bitset starts out with all bits as false, so if all 
you do is set bits to false, your bitset will *end* will all bits as false 

: If I want to understand more about how the lucene query parser does it's
: thing, can anyone suggest the source files I should be looking at?

the QueryParser.jj is the grammer for parsing, but the crux is to 
understand that the BooleanQuery class supports three types of clauses: 
PROHIBITED, MANDATORY, and OPTIONAL.  The QueryParser implements those as 
"-", "+" and the default beahvior when neither +/- is present. The 
QueryParser also jumps through some hoops to support AND, OR, NOT but not 
all permutations of those are viable

: If I really do want actual boolean logic behavior, what are my options?  I
: guess one is trying to write my own query parser.

"boolean logic" generally is defined in some form relative "the universe" 
.. so a pure negative query like "-red" really means "all things IN THE 
UNIVERSE that are not 'red'" ... you can express that using "*:* -red"

What solr does (and how this thread started) is pointing out that for top 
level queries, (like "q=-red" or "fq=-red") solr adds the *:* to the 
boolean query for you.

: Hmm, for that particular query, what about using parens to force a sub-query?
: 
: (-one) OR (-two)
: 
: Ha, nope, that runs into a different problem (or is it the same problem?), and
: always returns 0 hits.  It looks like the lucene query parser can't handle a
: pure-negative sub-query like that seperate by OR?  Not sure why, can anyone
: explain that one?

the query parser can handle it, and it produces a valid query object, but 
that query object doesn't match anything. "-one" matches nothing, "-two" 
matchines nothing ... nothing union nothing is still nothing.

: For that particular pattern, this crazy refactoring of the query does work and
: get the actual boolean logic result of "(not 'one') OR (not 'two')":
: 
: (*:* AND -one) OR (*:* AND -two)

correct -- that is you formally saying "give me all docs IN THE UNIVERSE 
that are not 'one', and union that with all docs IN THE UNIVERSE that are 
not 'two'"

: behavior for that pattern, but in general, I'm kind of wanting a parser that
: will give actual boolean logic behavior. Maybe someday I can find time to
: write it in Java (not the quickest thing for me, not familiar with the code at
: all).

You could implement a parser like that relatively easily -- just make sure 
you put a MatchAllDocsQuery in every BooleanQuery object thta you 
construct, and only ever use the PROHIBITED and MANDATORY clause types 
(never OPTIONAL) ...  the thing is, a parser like that isn't as useful 
as you think it might be when dealing with search results.  "OPTIONAL" 
clauses are where most of the useful factors of scoring documents ocme 
into play.

-Hoss

Re: lucene parser, negative OR operands

Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 5/17/2011 8:00 PM, Yonik Seeley wrote:
> This doesn't have to do with Solr's support of pure-negative top-level
> queries, but does have to do with
> a long standing confusion of how the lucene queryparser works with
> some of the operators (i.e. not really boolean logic).
>
> In a Lucene BooleanQuery, clauses are mandatory, optional, or prohibited.
> -foo OR -bar actually parses to a boolean query with two prohibited
> clauses... essentially the
> same as -foo AND -bar.  You can see this by adding debugQuery=true to
> the request.

Thanks Yonik. I recall hearing about this before, but was vague on the 
details, thanks for supplying some and refreshing my memory.

So I guess there is no such thing as an "optional prohibited" clause.  
Which is what makes "-one OR -two" the same thing as "-one AND -two".  
Actually, yeah, an "optional prohibited" clause doesn't reallly even 
make sense. Hmm.

If I want to understand more about how the lucene query parser does it's 
thing, can anyone suggest the source files I should be looking at?

If I really do want actual boolean logic behavior, what are my options?  
I guess one is trying to write my own query parser.

Hmm, for that particular query, what about using parens to force a 
sub-query?

(-one) OR (-two)

Ha, nope, that runs into a different problem (or is it the same 
problem?), and always returns 0 hits.  It looks like the lucene query 
parser can't handle a pure-negative sub-query like that seperate by OR?  
Not sure why, can anyone explain that one?

For that particular pattern, this crazy refactoring of the query does 
work and get the actual boolean logic result of "(not 'one') OR (not 
'two')":

(*:* AND -one) OR (*:* AND -two)

Phew, crazy stuff. So that's a weird solution to getting actual boolean 
logic behavior for that pattern, but in general, I'm kind of wanting a 
parser that will give actual boolean logic behavior. Maybe someday I can 
find time to write it in Java (not the quickest thing for me, not 
familiar with the code at all).

Jonathan


Re: lucene parser, negative OR operands

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, May 17, 2011 at 6:57 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
> (changed subject for this topic). Weird. I'm seeing it wrong myself, and
> have for a while -- I even wrote some custom pre-processor logic at my app
> level to work around it.  Weird, I dunno.
>
> Wait. "Queries with -one OR -two return less documents than a either operand
> does on its own."

This doesn't have to do with Solr's support of pure-negative top-level
queries, but does have to do with
a long standing confusion of how the lucene queryparser works with
some of the operators (i.e. not really boolean logic).

In a Lucene BooleanQuery, clauses are mandatory, optional, or prohibited.
-foo OR -bar actually parses to a boolean query with two prohibited
clauses... essentially the
same as -foo AND -bar.  You can see this by adding debugQuery=true to
the request.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco