Posted to java-user@lucene.apache.org by Bob Rhodes <Bo...@trssllc.com> on 2012/06/07 19:50:08 UTC

easy one? IN and OR stopword help

Hi all,

This is driving me crazy. In my data, if I search "state" AND "GA" I get
hits. If I search "state" AND "OR" or "state" AND "IN" I get no hits, even
though I can see examples of state AND IN in the content. I've tried
searching with "in" in lower case and in quotes, to no avail.

The data is indexed and searched with Lucene 3.3 and the ClassicAnalyzer. In
my searching code I have a stopword list containing OR and IN, but I'm
pretty sure the indexing code didn't have this stopword list.

Can anyone please help me solve this problem and understand what I'm doing
wrong?

Thanks!

Bob


Re: easy one? IN and OR stopword help

Posted by Jack Krupansky <ja...@basetechnology.com>.
It depends on whether the query parser is smart enough to optimize away 
empty boolean terms. Otherwise, the semantics of "x AND y" (or BooleanQuery 
with two "MUST" clauses) is the intersection of the documents selected by 
matching x and the documents selected by matching y. If y selects no 
documents, the intersection will be empty. Analysis is a separate semantic 
step from syntactic parsing, so if y is a stopword or a quoted phrase 
containing only a stopword, it parses fine, but a dumb query parser might 
generate a TermQuery with an empty term, which will match no documents.

Or, if stopwords are disabled at query time, but were enabled at index time, 
the TermQuery would refer to a term that cannot be found in the index.
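
To make the mismatch concrete, here is a minimal sketch in plain Java. It is
a toy model, not Lucene code: the class StopwordMismatchDemo, its methods,
and the three sample documents are made up for illustration, and it assumes
a naive parser that treats a clause with no surviving term as matching
nothing. Under those assumptions, a stopword list that differs between index
time and query time loses the hits in either direction.

```java
import java.util.*;

// Toy inverted index (hypothetical, NOT Lucene's implementation).
class StopwordMismatchDemo {
    static final List<String> DOCS = Arrays.asList("state GA", "state IN", "state OR");
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("in", "or"));
    static final Set<String> NO_STOP_WORDS = Collections.emptySet();

    // Stand-in for an analyzer: lowercase, split on non-letters, drop stopwords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Build a map from term to the ids of the documents containing it.
    static Map<String, Set<Integer>> index(List<String> docs, Set<String> stopWords) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyze(docs.get(id), stopWords)) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // AND query over single-word clauses: intersect the hits of every clause.
    // Assumption: a clause that analyzes to nothing matches nothing.
    static Set<Integer> searchAll(Map<String, Set<Integer>> postings,
                                  Set<String> queryStopWords, String... clauses) {
        Set<Integer> result = null;
        for (String clause : clauses) {
            Set<Integer> hits = new TreeSet<>();
            List<String> terms = analyze(clause, queryStopWords);
            if (!terms.isEmpty()) {
                hits.addAll(postings.getOrDefault(terms.get(0),
                        Collections.<Integer>emptySet()));
            }
            if (result == null) {
                result = hits;
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> plainIndex = index(DOCS, NO_STOP_WORDS);
        Map<String, Set<Integer>> filteredIndex = index(DOCS, STOP_WORDS);

        // Stopwords only at query time: "GA" still hits, "IN" analyzes to nothing.
        System.out.println(searchAll(plainIndex, STOP_WORDS, "state", "GA")); // [0]
        System.out.println(searchAll(plainIndex, STOP_WORDS, "state", "IN")); // []
        // Stopwords only at index time: the term "in" was never indexed.
        System.out.println(searchAll(filteredIndex, NO_STOP_WORDS, "state", "IN")); // []
        // No stopwords on either side: the document is found.
        System.out.println(searchAll(plainIndex, NO_STOP_WORDS, "state", "IN")); // [1]
    }
}
```

In this sketch the "IN" document is only found when neither side filters the
word, which is why it helps to check that indexing and searching use the
same stopword configuration.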

-- Jack Krupansky

-----Original Message----- 
From: Trejkaz
Sent: Thursday, June 07, 2012 5:44 PM
To: java-user@lucene.apache.org
Subject: Re: easy one? IN and OR stopword help





Re: easy one? IN and OR stopword help

Posted by Trejkaz <tr...@trypticon.org>.
On Fri, Jun 8, 2012 at 5:35 AM, Jack Krupansky <ja...@basetechnology.com> wrote:
> Well, if you have defined OR/or and IN/in as stopwords, what is it you expect other than for the analyzer to ignore those terms (which with a boolean “AND” means match nothing)?

Is this behaviour really logical?

If I search for a single phrase like "Jack and Jill", and "and" is a
stop word, it becomes "Jack - Jill", right? And then matches documents
which have Jack and Jill next to each other (although I'm not 100%
sure on whether term positions mess it up for this specific case as I
can't remember whether the term position increments on a stop word or
not. It's irrelevant for the next step in my logic anyway.)

If I search for a single term like "and" and "and" is a stop word, the
equivalent behaviour should be to search for [] (the empty term set),
and every item matches the empty term set, so {X} AND "and" should
return the same as {X} for any query {X}, I would have thought.

Is this some peculiarity with boolean query or query parser implementation?
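
The two readings of an all-stopword clause can be put side by side in a
small sketch. This is plain Java with made-up names (EmptyClauseDemo,
Policy, andQuery, samplePostings), not the Lucene query parser, and it only
shows what each policy would return if a parser chose it; it does not claim
which one Lucene 3.3 actually applies.

```java
import java.util.*;

// Toy model of a required-clauses (AND) query; hypothetical, not Lucene.
class EmptyClauseDemo {
    enum Policy { MATCH_NOTHING, DROP_CLAUSE }

    // Each clause is the list of terms left after analysis. An empty list
    // means the clause consisted only of stopwords. For simplicity, a clause
    // with several terms requires all of them (no phrase positions).
    static Set<Integer> andQuery(Map<String, Set<Integer>> postings, int docCount,
                                 Policy policy, List<List<String>> clauses) {
        // Start from all documents: the identity element for intersection.
        Set<Integer> result = new TreeSet<>();
        for (int id = 0; id < docCount; id++) {
            result.add(id);
        }
        for (List<String> terms : clauses) {
            if (terms.isEmpty()) {
                if (policy == Policy.MATCH_NOTHING) {
                    result.clear(); // empty clause selects no documents
                }
                continue; // DROP_CLAUSE: empty clause constrains nothing
            }
            for (String term : terms) {
                result.retainAll(postings.getOrDefault(term,
                        Collections.<Integer>emptySet()));
            }
        }
        return result;
    }

    // Three sample documents: all contain "state", only doc 0 contains "ga".
    static Map<String, Set<Integer>> samplePostings() {
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("state", new TreeSet<>(Arrays.asList(0, 1, 2)));
        postings.put("ga", new TreeSet<>(Arrays.asList(0)));
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = samplePostings();
        // "state" AND "and", where "and" was removed as a stopword.
        List<List<String>> clauses = Arrays.asList(
                Arrays.asList("state"), Collections.<String>emptyList());

        System.out.println(andQuery(postings, 3, Policy.MATCH_NOTHING, clauses)); // []
        System.out.println(andQuery(postings, 3, Policy.DROP_CLAUSE, clauses));   // [0, 1, 2]
    }
}
```

With DROP_CLAUSE, {X} AND "and" returns the same documents as {X} alone,
which is the behaviour argued for above; with MATCH_NOTHING the whole query
comes back empty.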

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: easy one? IN and OR stopword help

Posted by Jack Krupansky <ja...@basetechnology.com>.
Well, if you have defined OR/or and IN/in as stopwords, what is it you expect other than for the analyzer to ignore those terms (which with a boolean “AND” means match nothing)?

What does your constructor look like for ClassicAnalyzer – do you pass in an explicit stop word set or nothing, which uses the default set of stopwords (which includes “or” and “in”)?

Try passing null as the second argument to ClassicAnalyzer – it disables the default stop word list.

-- Jack Krupansky
