Posted to java-user@lucene.apache.org by Mike Barry <mb...@cos.com> on 2005/06/15 18:12:49 UTC

QueryParser, phrases and stopwords

I have a situation where a query such as "climate control" is returning
documents with the phrase "climate of control".  (I'm using QueryParser).

After searching, I found a similar issue on the mailing list from
Greg Robertson, with a patch from Steve Rowe.

Looking at the source repository for StopFilter.java, the patch was applied
in November of 2003 and then reverted in Dec 2003 (by Erik), with the note:

revert position increment change due to conflict with PhraseQuery

(the patch incremented the token position to inhibit exact matching across
removed stopword(s)).

I couldn't find any info on how or why this approach conflicted with
PhraseQuery. Can anyone enlighten me on this? Does anyone know of a way
to inhibit exact matching across removed stopword(s)?

Pointers to Nutch are appreciated; pointers into Nutch where this
situation is handled are appreciated even more :-)

Thanks, MikeB.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: QueryParser, phrases and stopwords

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Are there any other issues or concerns with making this change to  
StopFilter?  Should we make this change in 1.9?  Or wait until after  
2.0 is released?

Mike - if you could create some test cases for this scenario and
contribute your patch and tests to Bugzilla, then barring any objections,
I'll apply it.

     Erik


On Jun 16, 2005, at 8:57 AM, Mike Barry wrote:

> Erik,
>    Thanks, I applied the changes found in version 150148 of StopFilter.java
> and they work great for me. I did remove the setting of position=1 before
> the return of the token since that seemed spurious to me. Here's a context
> diff of the current StopFilter.java and my changes:
>
> *** analysis/StopFilter.java.old        Thu Jun 16 07:42:28 2005
> --- analysis/StopFilter.java    Thu Jun 16 08:44:50 2005
> ***************
> *** 94,109 ****
>      * Returns the next input Token whose termText() is not a stop word.
>      */
>     public final Token next() throws IOException {
> -     int position = 1;
> -
>       // return the first non-stop word found
> !     for (Token token = input.next(); token != null; token = input.next()) {
> !       if (!stopWords.contains(token.termText)) {
> !         token.setPositionIncrement( position );
>           return token;
> -       }
> -       position++;
> -     }
>       // reached EOS -- return null
>       return null;
>     }
> --- 94,103 ----
>      * Returns the next input Token whose termText() is not a stop word.
>      */
>     public final Token next() throws IOException {
>       // return the first non-stop word found
> !     for (Token token = input.next(); token != null; token = input.next())
> !       if (!stopWords.contains(token.termText))
>           return token;
>       // reached EOS -- return null
>       return null;
>     }




Re: QueryParser, phrases and stopwords

Posted by Mike Barry <MB...@cos.com>.
Erik,
   Thanks, I applied the changes found in version 150148 of StopFilter.java
and they work great for me. I did remove the setting of position=1 before
the return of the token since that seemed spurious to me. Here's a context
diff of the current StopFilter.java and my changes:

*** analysis/StopFilter.java.old        Thu Jun 16 07:42:28 2005
--- analysis/StopFilter.java    Thu Jun 16 08:44:50 2005
***************
*** 94,109 ****
     * Returns the next input Token whose termText() is not a stop word.
     */
    public final Token next() throws IOException {
-     int position = 1;
-
      // return the first non-stop word found
!     for (Token token = input.next(); token != null; token = input.next()) {
!       if (!stopWords.contains(token.termText)) {
!         token.setPositionIncrement( position );
          return token;
-       }
-       position++;
-     }
      // reached EOS -- return null
      return null;
    }
--- 94,103 ----
     * Returns the next input Token whose termText() is not a stop word.
     */
    public final Token next() throws IOException {
      // return the first non-stop word found
!     for (Token token = input.next(); token != null; token = input.next())
!       if (!stopWords.contains(token.termText))
          return token;
      // reached EOS -- return null
      return null;
    }






Re: QueryParser, phrases and stopwords

Posted by Stephen Halsey <st...@moreover.com>.
Hi,

I'm just writing to ask whether the change discussed below is likely to be in the next version of Lucene as a default for StopFilter. I'm happy to apply the diff supplied by Mike Barry to my own source code to stop "climate control" matching "climate of control", but if it's likely to go into the new version soon I'll hold off and download that.

thanks a lot

Stephen Halsey
  ----- Original Message ----- 
  From: Erik Hatcher 
  To: java-user@lucene.apache.org 
  Sent: Thursday, June 16, 2005 7:55 PM
  Subject: Re: QueryParser, phrases and stopwords



  On Jun 16, 2005, at 2:03 PM, Daniel Naber wrote:

  > On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
  >
  >
  >> So we could change StopFilter to put the gaps back in safely now, I
  >> think.
  >>
  >> Thoughts?
  >>
  >
  > I personally don't have a problem with this, but shouldn't such a change
  > be optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
  > there are people who prefer the way it is done now.

  Making it optional is ok by me, though I'm curious about a use case  
  that would prefer it the way it is now.  Searching for "lucene in  
  action" and having it match documents with "lucene action" in them  
  seems awkward to me in a precision context.  Google allows  
  wildcarding of words with an asterisk:

       <http://www.google.com/search?client=safari&rls=en&q=%22lucene+*+action%22&ie=UTF-8&oe=UTF-8>

  Erik




Re: QueryParser, phrases and stopwords

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 16, 2005, at 2:03 PM, Daniel Naber wrote:

> On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
>
>
>> So we could change StopFilter to put the gaps back in safely now, I
>> think.
>>
>> Thoughts?
>>
>
> I personally don't have a problem with this, but shouldn't such a change
> be optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
> there are people who prefer the way it is done now.

Making it optional is ok by me, though I'm curious about a use case  
that would prefer it the way it is now.  Searching for "lucene in  
action" and having it match documents with "lucene action" in them  
seems awkward to me in a precision context.  Google allows  
wildcarding of words with an asterisk:

     <http://www.google.com/search?client=safari&rls=en&q=%22lucene+*+action%22&ie=UTF-8&oe=UTF-8>

Erik




Re: QueryParser, phrases and stopwords

Posted by Mike Barry <MB...@cos.com>.
Daniel Naber wrote:

> On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
>
>> So we could change StopFilter to put the gaps back in safely now, I
>> think.
>>
>> Thoughts?
>
> I personally don't have a problem with this, but shouldn't such a change be
> optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
> there are people who prefer the way it is done now.
>
> Regards
>  Daniel

To me the change is more consistent with the documentation
for PhraseQuery.  If you need it to be a bit looser, set the slop
accordingly.

If you need "lucene action" to match "Lucene In Action",
you should adjust your query  (using QueryParser syntax):

"lucene action"~2


-MikeB.
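
To make the slop suggestion above concrete: with gaps preserved, "Lucene In
Action" would index lucene at position 0 and action at position 2, while the
query "lucene action" expects the two terms one position apart. The sketch
below uses a deliberately simplified two-term, in-order model. The class
SlopSketch and its method are hypothetical names, and this is not how
Lucene's sloppy phrase matching is actually implemented; it only shows the
arithmetic.

```java
/** Simplified two-term, in-order slop model; illustrative only, not Lucene code. */
public class SlopSketch {

    /**
     * @param docDistance   position of the second term minus the first, in the document
     * @param queryDistance the same difference as written in the query (1 for adjacent terms)
     * @param slop          allowed number of position moves, as in "..."~slop
     * @return true if the document spacing is within slop of the query spacing
     */
    public static boolean matchesWithSlop(int docDistance, int queryDistance, int slop) {
        return Math.abs(docDistance - queryDistance) <= slop;
    }

    public static void main(String[] args) {
        int docDistance = 2;   // lucene@0, action@2 once "in" leaves a gap
        int queryDistance = 1; // "lucene action"
        for (int slop = 0; slop <= 2; slop++) {
            System.out.println("~" + slop + " -> "
                    + matchesWithSlop(docDistance, queryDistance, slop));
        }
    }
}
```

Under this model the exact phrase (slop 0) no longer matches across the
removed stop word, while the ~2 query above does.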



Re: QueryParser, phrases and stopwords

Posted by Daniel Naber <lu...@danielnaber.de>.
On Thursday 16 June 2005 04:17, Erik Hatcher wrote:

> So we could change StopFilter to put the gaps back in safely now, I  
> think.
>
> Thoughts?

I personally don't have a problem with this, but shouldn't such a change be 
optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure 
there are people who prefer the way it is done now.
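
A minimal, library-free sketch of what such an optional setting might look
like. GapStopFilter, its preserveGaps flag and the Tok type are hypothetical
names invented for illustration; none of this is Lucene's API. It only
mirrors the position-increment idea from the patch discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a stop filter; not Lucene's StopFilter. */
public class GapStopFilter {

    /** A kept term plus its distance (in positions) from the previous kept term. */
    public static final class Tok {
        public final String text;
        public final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    private final Set<String> stopWords;
    private final boolean preserveGaps; // hypothetical option, not a real Lucene flag

    public GapStopFilter(Set<String> stopWords, boolean preserveGaps) {
        this.stopWords = stopWords;
        this.preserveGaps = preserveGaps;
    }

    /** Drops stop words; when preserveGaps is set, each dropped word widens the next increment. */
    public List<Tok> filter(List<String> words) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (String w : words) {
            if (stopWords.contains(w)) {
                skipped++;
                continue;
            }
            out.add(new Tok(w, preserveGaps ? 1 + skipped : 1));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("of", "in", "the"));
        List<String> doc = Arrays.asList("climate", "of", "control");
        for (boolean gaps : new boolean[] {false, true}) {
            StringBuilder sb = new StringBuilder("preserveGaps=" + gaps + ":");
            for (Tok t : new GapStopFilter(stops, gaps).filter(doc)) {
                sb.append(' ').append(t.text).append("(+").append(t.positionIncrement).append(')');
            }
            System.out.println(sb);
        }
    }
}
```

With the flag off, "control" directly follows "climate", so an exact phrase
query for "climate control" matches "climate of control". With it on,
"control" sits two positions after "climate" and the exact phrase does not
match.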

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: QueryParser, phrases and stopwords

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 15, 2005, at 12:12 PM, Mike Barry wrote:

> I have a situation where a query such as "climate control" is returning
> documents with the phrase "climate of control".  (I'm using QueryParser).
>
> After searching, I found the similar issue on the mailing list from
> Greg Robertson with a patch from Steve Rowe.
>
> Looking at the source repository for StopFilter.java, the patch was applied
> in November of 2003 and then reverted in Dec 2003 (by Erik), with the note:
>
> revert position increment change due to conflict with PhraseQuery
>
> (the patch incremented the token position to inhibit exact matching across
> removed stopword(s)).
>
> I couldn't find any info on how/why this approach conflicted with
> PhraseQuery.
> Can anyone enlighten me on this? Does anyone know of a way to inhibit
> exact matching across removed stopword(s)?

PhraseQuery originally did not account for gaps left in the terms of  
the phrase.

PhraseQuery was modified last year to allow for this though:

r150509 | goller | 2004-09-15 05:38:50 -0400 (Wed, 15 Sep 2004) | 5  
lines

PhraseQuery and PhrasePrefixQuery are extended. It's now
possible to specify the relative position of a term within
a phrase. This allows gaps and multiple terms at the same
position.
-----

So we could change StopFilter to put the gaps back in safely now, I  
think.
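
As a rough illustration of what "relative position of a term within a
phrase" makes possible, here is a small stand-alone sketch. It is not
PhraseQuery's implementation: the class PositionalPhrase, its matches method
and the map-based index are assumptions made purely for illustration. Each
query term carries an explicit offset, so a phrase can contain a gap where a
stop word was removed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: exact phrase matching with explicit relative positions. */
public class PositionalPhrase {

    /**
     * @param index   term -> positions of that term in one document
     * @param terms   query terms
     * @param offsets relative position of each query term within the phrase
     * @return true if some start position places every term at start + offset
     */
    public static boolean matches(Map<String, List<Integer>> index,
                                  String[] terms, int[] offsets) {
        List<Integer> first = index.get(terms[0]);
        if (first == null) {
            return false;
        }
        for (int p : first) {
            int start = p - offsets[0];
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                List<Integer> positions = index.get(terms[i]);
                all = positions != null && positions.contains(start + offsets[i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "climate of control" indexed with the gap kept: climate@0, control@2
        Map<String, List<Integer>> doc = new HashMap<>();
        doc.put("climate", Arrays.asList(0));
        doc.put("control", Arrays.asList(2));

        String[] terms = {"climate", "control"};
        System.out.println(matches(doc, terms, new int[] {0, 1})); // terms adjacent in the query
        System.out.println(matches(doc, terms, new int[] {0, 2})); // query keeps a one-position gap
    }
}
```

In this toy model the adjacent form of the query does not match the gapped
document, while the form that keeps a gap at the stop word's position does.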

Thoughts?

     Erik

