Posted to java-user@lucene.apache.org by Walt Stoneburner <wa...@gmail.com> on 2007/04/09 05:27:42 UTC

Standard Parser Behavior

I'm trying to understand the specifics behind the notation +(...) and -(...)
as it applies to the standard parser.

I have three lists of words.  I want documents that have at least one word
from list A and also at least one word from list B (just one list isn't
enough), and, finally, no documents can contain any words in list C.

I believe the correct syntax for that is:
+(apple animal aspirin)  +(bacon banana book)  -candle -computer -currency

Can someone confirm that?

What I'm trying to convince myself of is that -(candle computer currency)
doesn't do what one might think at first glance.  However, Lucene seems
to be giving the correct answer, though I'm having a hard time understanding
why.  Let me explain with some simple pseudocode:

X = 1
IF (X != 1 OR X != 2) THEN True ELSE False

It ought to come as no surprise that this actually evaluates to True.  The
reason is that X != 1 is false, and X != 2 is true, and false or true is
...true.  More interestingly, this statement should always be true (because
if X is 2, it can't also be 1, making that part of the subexpression true).
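
The pseudocode above can be checked directly.  Here is a minimal Python
sketch (the function names are made up for illustration) that compares the
"OR of negations" written above with the "negation of an OR", which is a
different expression:

```python
def or_of_negations(x):
    # The expression from the pseudocode: (X != 1 OR X != 2).
    return x != 1 or x != 2


def negation_of_or(x):
    # NOT (X == 1 OR X == 2), which by De Morgan's law is the same as
    # (X != 1 AND X != 2).
    return not (x == 1 or x == 2)


for x in (1, 2, 3):
    print(x, or_of_negations(x), negation_of_or(x))
```

The first function is true for every value tried, while the second is
false whenever X is 1 or 2.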

Thus, moving back to Lucene from trivial boolean algebra, the notation
-(candle OR computer OR currency) would, in my mind, match any and all
documents unless every word in the negation list was found.  Clearly this
can't be right.

Is the minus operator distributive?  I suspect what I'm seeing is the
reality that Lucene is not doing boolean logic at all, but set operations.
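
To make the set-operation suspicion concrete, here is a toy Python model.
It is only a sketch under an assumption: that a group matches the union of
its terms' document sets, that + intersects, and that - subtracts.  The
tiny index and the helper names are hypothetical, and none of this is
Lucene's actual code.

```python
# Hypothetical toy index: document id -> set of words in that document.
docs = {
    1: {"apple", "bacon"},
    2: {"apple", "banana", "candle"},
    3: {"animal"},
    4: {"aspirin", "book", "candle", "computer", "currency"},
    5: {"animal", "book", "tree"},
}


def matching_any(words):
    """Ids of documents containing at least one of the given words."""
    return {doc_id for doc_id, text in docs.items() if text & set(words)}


def matching_all(words):
    """Ids of documents containing every one of the given words."""
    return {doc_id for doc_id, text in docs.items() if set(words) <= text}


list_a = ["apple", "animal", "aspirin"]
list_b = ["bacon", "banana", "book"]
list_c = ["candle", "computer", "currency"]

# +(list A) +(list B): must be in both groups' sets.
required = matching_any(list_a) & matching_any(list_b)

# -(candle computer currency): subtract the group's set in one step.
grouped = required - matching_any(list_c)

# -candle -computer -currency: subtract each term's set in turn.
separate = required
for word in list_c:
    separate = separate - matching_any([word])

# The reading I was worried about: exclude only documents that contain
# every word in list C.
feared = required - matching_all(list_c)

print(sorted(grouped), sorted(separate), sorted(feared))
```

Under this model the grouped and ungrouped forms exclude exactly the same
documents, while the "unless every word was found" reading would let
document 2 (which contains only "candle") through.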

A co-worker of mine came up with an interesting syntax, and I had no idea
what it meant either:  +( -A -B )   ...which to him meant "must have no A
and no B".

Can anyone clarify how + and - work on groups, and whether the above has
any coherent meaning?

-Walt Stoneburner