You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by greg <gr...@webpipe.net> on 2003/07/17 15:20:52 UTC

interesting phrase query issue

I have several document sections that are being indexed via the 
StandardAnalyzer.  One of these documents has the line "access, the 
manager".  When searching for the phrase "access manager", this document is 
being returned.  I understand why (at least i think i do), because a stop 
word is "the" and the "," is being removed by the tokenizer, my question is 
is there any way I can avoid having this returned in the results?  My 
thoughts were to create a new analyzer that indexes the word "the" (blick to 
many of those), or index the "," in some way (also not good).  Any 
suggestions? 

Thanks, 

Greg T Robertson

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: interesting phrase query issue

Posted by Victor Hadianto <vi...@nuix.com.au>.

> One of these documents has the line "access, the
> manager".  When searching for the phrase "access manager", this document is
> being returned.  I understand why (at least i think i do), because a stop
> word is "the" and the "," is being removed by the tokenizer, my question is
> is there any way I can avoid having this returned in the results?  

I don't think you can't without reindexing the documents and changing 
QueryParser a bit. The reasons is although if you introduce your new 
tokenizer/analyzer the original documents have been indexed with those stop 
words removed.

You have to create an analyzer that doesn't drop your stop words and start the 
reindexing again.

However you must be careful when using your custom analyser to do the query 
parsing, because sometime you may want to drop the stop words in a non-quoted 
query, so 

hello and world ---> +hello +world

but

"hello and world" --> +"hello and world"

One solution that I can think of is by passing two analysers in QueryParser, 
one is for the "standard" analyser and the other is for the "phrase query" 
analyser. Down in the QueryParser.jj around this area do something like this:

     | term=<QUOTED>
       [ slop=<SLOP> ]
       [ <CARAT> boost=<NUMBER> ]
       {
         if (phraseAnalyzer == null)  {
		// use phrase query custom analyser that doesn't drop stop words
	} else {  
		 // otherwise use normal analyzer
	}

This may work as a matter of fact I think it should.

HTH

victor


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: interesting phrase query issue

Posted by greg <gr...@webpipe.net>.

Steve, 

This is great stuff.  IMHO this should be the default behaviour - especially 
coming from someone who is new to Lucene.  I can not tell you how surprised 
i was when someone searched for a phrase and it was returned  but did not 
exist in the document except as "access, the manager".  Your patch really 
did the job. 

Thanks again, 

Greg T Robertson 

Steve Rowe writes: 

> Greg, 
> 
> I include a patch below to StopFilter.java which should inhibit exact 
> phrase
> matching across a removed stopword. 
> 
> [Would it be useful for this to be the default behavior?] 
> 
> See the API docs for Token.setPositionIncrement(int): 
> 
> <URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/ 
> Token.html#setPositionIncrement(int)>
> "Set [the position increment] to values greater than one to inhibit exact 
> phrase
> matches. If, for example, one does not want phrases to match across 
> removed stop
> words, then one could build a stop word filter that removes stop words and 
> also
> sets the increment to the number of stop words removed before each 
> non-stop
> word. Then exact phrase queries will only match when the terms occur with 
> no
> intervening stop words." 
> 
> --------->8-----------cut here--------->8-----------
> Index: StopFilter.java
> ===================================================================
> RCS file:
> /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFil 
> ter.java,v
> retrieving revision 1.3
> diff -r1.3 StopFilter.java
> 64a65
> >   private int       positionIncrement = 1;
> 93,94c94,97
> <     for (Token token = input.next(); token != null; token = 
> input.next())
> <       if (table.get(token.termText) == null)
> ---
> >     for (Token token = input.next(); token != null; token = 
> input.next()) {
> >       if (table.get(token.termText) == null) {
> >         token.setPositionIncrement(positionIncrement);
> >         positionIncrement = 1; // reset the position increment
> 95a99,102
> >       } else {
> >         ++positionIncrement;  // stopword -- increase the position 
> increment
> >       }
> >     }
> --------->8-----------cut here--------->8----------- 
> 
> greg wrote:
> > I have several document sections that are being indexed via the
> > StandardAnalyzer.  One of these documents has the line "access, the
> > manager".  When searching for the phrase "access manager", this document
> > is being returned.  I understand why (at least i think i do), because a
> > stop word is "the" and the "," is being removed by the tokenizer, my
> > question is is there any way I can avoid having this returned in the
> > results?  My thoughts were to create a new analyzer that indexes the
> > word "the" (blick to many of those), or index the "," in some way (also
> > not good).  Any suggestions?
> > Thanks,
> > Greg T Robertson 
> 
> -- 
> Steve Rowe
> Software Engineer
> Center for Natural Language Processing
> School of Information Studies
> Syracuse University
> www.cnlp.org 
> 
>  
> 
>  
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org 
> 
 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: interesting phrase query issue

Posted by Steve Rowe <sa...@gwmail.syr.edu>.

Greg,

I include a patch below to StopFilter.java which should inhibit exact phrase
matching across a removed stopword.

[Would it be useful for this to be the default behavior?]

See the API docs for Token.setPositionIncrement(int):

<URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>
"Set [the position increment] to values greater than one to inhibit exact phrase
matches. If, for example, one does not want phrases to match across removed stop
words, then one could build a stop word filter that removes stop words and also
sets the increment to the number of stop words removed before each non-stop
word. Then exact phrase queries will only match when the terms occur with no
intervening stop words."

--------->8-----------cut here--------->8-----------
Index: StopFilter.java
===================================================================
RCS file:
/home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFilter.java,v
retrieving revision 1.3
diff -r1.3 StopFilter.java
64a65
 >   private int       positionIncrement = 1;
93,94c94,97
<     for (Token token = input.next(); token != null; token = input.next())
<       if (table.get(token.termText) == null)
---
 >     for (Token token = input.next(); token != null; token = input.next()) {
 >       if (table.get(token.termText) == null) {
 >         token.setPositionIncrement(positionIncrement);
 >         positionIncrement = 1; // reset the position increment
95a99,102
 >       } else {
 >         ++positionIncrement;  // stopword -- increase the position increment
 >       }
 >     }
--------->8-----------cut here--------->8-----------

greg wrote:
 > I have several document sections that are being indexed via the
 > StandardAnalyzer.  One of these documents has the line "access, the
 > manager".  When searching for the phrase "access manager", this document
 > is being returned.  I understand why (at least i think i do), because a
 > stop word is "the" and the "," is being removed by the tokenizer, my
 > question is is there any way I can avoid having this returned in the
 > results?  My thoughts were to create a new analyzer that indexes the
 > word "the" (blick to many of those), or index the "," in some way (also
 > not good).  Any suggestions?
 > Thanks,
 > Greg T Robertson

-- 
Steve Rowe
Software Engineer
Center for Natural Language Processing
School of Information Studies
Syracuse University
www.cnlp.org





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: interesting phrase query issue

Posted by Tatu Saloranta <ta...@hypermall.net>.

On Thursday 17 July 2003 07:20, greg wrote:
> I have several document sections that are being indexed via the
> StandardAnalyzer.  One of these documents has the line "access, the
> manager".  When searching for the phrase "access manager", this document is
> being returned.  I understand why (at least i think i do), because a stop
> word is "the" and the "," is being removed by the tokenizer, my question is
> is there any way I can avoid having this returned in the results?  My
> thoughts were to create a new analyzer that indexes the word "the" (blick
> to many of those), or index the "," in some way (also not good).  Any
> suggestions?

You can also replace all stop words with "dummy" token ("" might be an ok 
candidate?). That would be similar to indexing "the" (which probably is  
better idea than indexing ",").

I'm planning to do something similar for paragraph breaks (in case of plain 
text, double linefeed, for HTML <p> etc), to prevent similar problems.

-+ Tatu +-



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org