Posted to java-user@lucene.apache.org by Mike Barry <mb...@cos.com> on 2005/06/15 18:12:49 UTC
QueryParser, phrases and stopwords
I have a situation where a query such as "climate control" is returning
documents with the phrase "climate of control". (I'm using QueryParser).
After searching, I found a similar issue on the mailing list from
Greg Robertson, with a patch from Steve Rowe.
Looking at the source repository for StopFilter.java, the patch was applied
in November of 2003 and then reverted in Dec 2003 (by Erik), with the note:
revert position increment change due to conflict with PhraseQuery
(the patch incremented the token position to inhibit exact matching across
removed stopword(s)).
I couldn't find any info on how/why this approach conflicted with
PhraseQuery.
Can anyone enlighten me on this? Does anyone know of a way to inhibit
exact matching across removed stopword(s)?
Pointers to Nutch are appreciated; pointers into Nutch where this
situation is handled are appreciated even more :-)
Thanks, MikeB.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
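To make the reported behaviour concrete, here is a minimal stand-alone sketch. It does not use Lucene; the class name, the stop list and the helper methods are all hypothetical and only model how term positions behave when stop words are dropped with or without leaving a gap.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Stand-alone model of term positions; no Lucene classes are used.
class StopwordPositionDemo {
    // Hypothetical stop list for the demo.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "in", "of", "the"));

    // Maps each surviving term to its position (assumes no repeated terms).
    // keepGaps=false: survivors are numbered consecutively.
    // keepGaps=true: a removed stop word still consumes a position.
    static Map<String, Integer> index(String text, boolean keepGaps) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(word)) {
                if (keepGaps) {
                    position++;
                }
                continue;
            }
            positions.put(word, position);
            position++;
        }
        return positions;
    }

    // Exact two-term phrase: the second term must directly follow the first.
    static boolean matchesExactPhrase(Map<String, Integer> doc, String first, String second) {
        Integer p1 = doc.get(first);
        Integer p2 = doc.get(second);
        return p1 != null && p2 != null && p2 - p1 == 1;
    }

    public static void main(String[] args) {
        Map<String, Integer> collapsed = index("climate of control", false);
        Map<String, Integer> gapped = index("climate of control", true);
        // prints {climate=0, control=1} -> true
        System.out.println(collapsed + " -> " + matchesExactPhrase(collapsed, "climate", "control"));
        // prints {climate=0, control=2} -> false
        System.out.println(gapped + " -> " + matchesExactPhrase(gapped, "climate", "control"));
    }
}
```

Without gaps, "climate" and "control" end up adjacent, so the exact phrase matches. With gaps, the removed "of" keeps them one position apart.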
Re: QueryParser, phrases and stopwords
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Are there any other issues or concerns with making this change to
StopFilter? Should we make this change in 1.9? Or wait until after
2.0 is released?
Mike - if you could create some test cases for this scenario and
contribute your patch and tests to Bugzilla, barring any objections,
I'll apply it.
Erik
On Jun 16, 2005, at 8:57 AM, Mike Barry wrote:
> Erik,
> Thanks, I applied the changes found in version 150148 of StopFilter.java
> and they work great for me. I did remove the setting of position=1 before
> the return of the token since that seemed spurious to me. Here's a context
> diff of the current StopFilter.java and my changes:
Re: QueryParser, phrases and stopwords
Posted by Mike Barry <MB...@cos.com>.
Erik,
Thanks, I applied the changes found in version 150148 of StopFilter.java
and they work great for me. I did remove the setting of position=1 before
the return of the token since that seemed spurious to me. Here's a context
diff of the current StopFilter.java and my changes:
*** analysis/StopFilter.java.old Thu Jun 16 07:42:28 2005
--- analysis/StopFilter.java Thu Jun 16 08:44:50 2005
***************
*** 94,109 ****
     * Returns the next input Token whose termText() is not a stop word.
     */
    public final Token next() throws IOException {
-     int position = 1;
-
      // return the first non-stop word found
!     for (Token token = input.next(); token != null; token = input.next()) {
!       if (!stopWords.contains(token.termText)) {
!         token.setPositionIncrement( position );
          return token;
-       }
-       position++;
-     }
      // reached EOS -- return null
      return null;
    }
--- 94,103 ----
     * Returns the next input Token whose termText() is not a stop word.
     */
    public final Token next() throws IOException {
      // return the first non-stop word found
!     for (Token token = input.next(); token != null; token = input.next())
!       if (!stopWords.contains(token.termText))
          return token;
      // reached EOS -- return null
      return null;
    }
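For readers who want to try the increment-tracking loop outside Lucene, the following is a rough sketch modelled on the variant in the diff that calls token.setPositionIncrement( position ). The Token class, the filter class and the describe helper are hypothetical stand-ins, not Lucene's own types, and the input is a plain iterator rather than a TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stop filter that records skipped positions.
class GapPreservingStopFilterSketch {
    // Minimal token: text plus the distance from the previously emitted token.
    static final class Token {
        final String text;
        int positionIncrement = 1;
        Token(String text) { this.text = text; }
    }

    private final Iterator<Token> input;
    private final Set<String> stopWords;

    GapPreservingStopFilterSketch(Iterator<Token> input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    // Returns the next non-stop token, or null at end of stream.
    // Increments of skipped stop words are added to the returned token.
    Token next() {
        int skipped = 0;
        while (input.hasNext()) {
            Token token = input.next();
            if (!stopWords.contains(token.text)) {
                token.positionIncrement += skipped;
                return token;
            }
            skipped += token.positionIncrement;
        }
        return null;
    }

    // Runs the filter over whitespace-separated text and renders "term+increment".
    static List<String> describe(String text, Set<String> stopWords) {
        List<Token> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            tokens.add(new Token(word));
        }
        GapPreservingStopFilterSketch filter =
                new GapPreservingStopFilterSketch(tokens.iterator(), stopWords);
        List<String> out = new ArrayList<>();
        for (Token t = filter.next(); t != null; t = filter.next()) {
            out.add(t.text + "+" + t.positionIncrement);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("of", "the"));
        // prints [climate+1, control+2]
        System.out.println(describe("climate of control", stop));
    }
}
```

An increment of 2 on "control" is what keeps it from being indexed directly after "climate".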
Re: QueryParser, phrases and stopwords
Posted by Stephen Halsey <st...@moreover.com>.
Hi,
I'm just writing to ask whether the change discussed below is likely to be in the next version of Lucene as a default for StopFilter. I'm happy to apply the diff supplied by Mike Barry to my own source code to stop "climate control" matching "climate of control", but if it's likely to go into the new version soon I'll hold off and download that.
Thanks a lot,
Stephen Halsey
----- Original Message -----
From: Erik Hatcher
To: java-user@lucene.apache.org
Sent: Thursday, June 16, 2005 7:55 PM
Subject: Re: QueryParser, phrases and stopwords
On Jun 16, 2005, at 2:03 PM, Daniel Naber wrote:
> On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
>
>> So we could change StopFilter to put the gaps back in safely now, I
>> think.
>>
>> Thoughts?
>
> I personally don't have a problem with this, but shouldn't such a change be
> optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
> there are people who prefer the way it is done now.
Making it optional is ok by me, though I'm curious about a use case
that would prefer it the way it is now. Searching for "lucene in
action" and having it match documents with "lucene action" in them
seems awkward to me in a precision context. Google allows
wildcarding of words with an asterisk:
<http://www.google.com/search?client=safari&rls=en&q=%22lucene+*+action%22&ie=UTF-8&oe=UTF-8>
Erik
Re: QueryParser, phrases and stopwords
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 16, 2005, at 2:03 PM, Daniel Naber wrote:
> On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
>
>> So we could change StopFilter to put the gaps back in safely now, I
>> think.
>>
>> Thoughts?
>
> I personally don't have a problem with this, but shouldn't such a change be
> optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
> there are people who prefer the way it is done now.
Making it optional is ok by me, though I'm curious about a use case
that would prefer it the way it is now. Searching for "lucene in
action" and having it match documents with "lucene action" in them
seems awkward to me in a precision context. Google allows
wildcarding of words with an asterisk:
<http://www.google.com/search?client=safari&rls=en&q=%22lucene+*+action%22&ie=UTF-8&oe=UTF-8>
Erik
Re: QueryParser, phrases and stopwords
Posted by Mike Barry <MB...@cos.com>.
Daniel Naber wrote:
> On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
>
>> So we could change StopFilter to put the gaps back in safely now, I
>> think.
>>
>> Thoughts?
>
> I personally don't have a problem with this, but shouldn't such a change be
> optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
> there are people who prefer the way it is done now.
>
> Regards
> Daniel
To me the change is more consistent with the documentation
for PhraseQuery. If you need it to be a bit looser, set the slop
accordingly.
If you need "lucene action" to match "Lucene In Action",
you should adjust your query (using QueryParser syntax):
"lucene action"~2
-MikeB.
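As a rough illustration of how slop interacts with a gap left by a removed stop word, here is a deliberately simplified two-term model. It is not Lucene's sloppy phrase scorer; the class and method names are hypothetical, and the positions assume that "in" was removed from "Lucene In Action" but still consumed one position.

```java
// Simplified two-term model of sloppy phrase matching; not Lucene's scorer.
class SlopSketch {
    // The phrase "first second" expects the second term at firstPos + 1.
    // The match is accepted when the second term is displaced from that
    // expected slot by at most `slop` positions.
    static boolean matchesWithSlop(int firstPos, int secondPos, int slop) {
        int displacement = Math.abs((secondPos - firstPos) - 1);
        return displacement <= slop;
    }

    public static void main(String[] args) {
        int lucene = 0; // position of "lucene"
        int action = 2; // position of "action" when the gap for "in" is kept
        for (int slop = 0; slop <= 2; slop++) {
            System.out.println("slop " + slop + " -> " + matchesWithSlop(lucene, action, slop));
        }
    }
}
```

In this model the exact phrase (slop 0) no longer matches across the gap, while the `~2` query from the message does.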
Re: QueryParser, phrases and stopwords
Posted by Daniel Naber <lu...@danielnaber.de>.
On Thursday 16 June 2005 04:17, Erik Hatcher wrote:
> So we could change StopFilter to put the gaps back in safely now, I
> think.
>
> Thoughts?
I personally don't have a problem with this, but shouldn't such a change be
optional? Like a parameter for StopFilter or a StopGapFilter? I'm sure
there are people who prefer the way it is done now.
Regards
Daniel
--
http://www.danielnaber.de
Re: QueryParser, phrases and stopwords
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 15, 2005, at 12:12 PM, Mike Barry wrote:
> I have a situation where a query such as "climate control" is returning
> documents with the phrase "climate of control". (I'm using QueryParser).
>
> After searching, I found a similar issue on the mailing list from
> Greg Robertson, with a patch from Steve Rowe.
>
> Looking at the source repository for StopFilter.java, the patch was applied
> in November of 2003 and then reverted in Dec 2003 (by Erik), with the note:
>
> revert position increment change due to conflict with PhraseQuery
>
> (the patch incremented the token position to inhibit exact matching across
> removed stopword(s)).
>
> I couldn't find any info on how/why this approach conflicted with
> PhraseQuery.
> Can anyone enlighten me on this? Does anyone know of a way to inhibit
> exact matching across removed stopword(s)?
PhraseQuery originally did not account for gaps left in the terms of
the phrase.
PhraseQuery was modified last year to allow for this though:
r150509 | goller | 2004-09-15 05:38:50 -0400 (Wed, 15 Sep 2004) | 5 lines
PhraseQuery and PhrasePrefixQuery are extended. It's now
possible to specify the relative position of a term within
a phrase. This allows gaps and multiple terms at the same
position.
-----
So we could change StopFilter to put the gaps back in safely now, I
think.
Thoughts?
Erik
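The idea in the commit note, a phrase whose terms carry explicit relative positions, can be sketched without Lucene as follows. This is only a model under stated assumptions: the document is an array of position slots in which a removed stop word leaves a null, the class and method names are hypothetical, and only gaps are modelled, not multiple terms at the same position.

```java
// Stand-alone model of a phrase whose terms have explicit relative positions.
class PositionalPhraseSketch {
    // docSlots[i] is the term indexed at position i, or null for a gap.
    // The phrase matches if some start position lines every term up with
    // the slot at start + relativePositions[i] (offsets assumed >= 0).
    static boolean matches(String[] docSlots, String[] terms, int[] relativePositions) {
        for (int start = 0; start < docSlots.length; start++) {
            boolean allTermsFound = true;
            for (int i = 0; i < terms.length; i++) {
                int slot = start + relativePositions[i];
                if (slot >= docSlots.length || !terms[i].equals(docSlots[slot])) {
                    allTermsFound = false;
                    break;
                }
            }
            if (allTermsFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "lucene in action" indexed with a gap where "in" was removed.
        String[] gappedDoc = {"lucene", null, "action"};
        String[] terms = {"lucene", "action"};

        // Query "lucene in action": the gap is kept on the query side too.
        System.out.println(matches(gappedDoc, terms, new int[] {0, 2})); // true
        // Query "lucene action": terms expected to be adjacent.
        System.out.println(matches(gappedDoc, terms, new int[] {0, 1})); // false
    }
}
```

Once both the indexed document and the parsed phrase keep the gap, the two sides agree on positions, which is the situation the reverted StopFilter change could not rely on before PhraseQuery supported it.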