Posted to java-user@lucene.apache.org by Bill Taylor <wa...@alum.mit.edu> on 2007/04/13 03:41:00 UTC

I have a question about phrase query with stop words

I found some discussions of this question from back in 2003, but that was
many updates ago.

I have built an index using the standard stop analyser which uses the
standard list of stop words.  "will" and "the" are stop words.

As I understand analyzers and phrase queries, when I search for

you will find the answer

using the default slop of 0, I should find any pattern like

you <any stop word> find <any stop word> answer

because the analyzer replaces "will" and "the" in the query with a space
indicator as it did when analyzing the original input text.  Instead, I find
phrases such as

you find an answer

"an" is a stop word, so matching "find an answer" is as expected, but there
is no stop word between "you" and "find" in the original input string.  I do
not see why "you find an answer" matches.

What am I doing wrong?


Also, when I try to highlight after searching for a phrase, the highlighter
highlights individual words wherever it finds them in the input text.  The
documentation suggests that if I use the right scoring system, I will
highlight only long strings of adjacent tokens which are found in the
phrase, but I am not sure how to do that.

If necessary, I will paste in samples of my code for creating the indexes
and doing the search.


Thanks.

Bill Taylor

Re: I have a question about phrase query with stop words

Posted by Paul Elschot <pa...@xs4all.nl>.
On Friday 13 April 2007 04:04, Erick Erickson wrote:
> As I understand it, there really is no "space indicator". I think of it
> as replacing the stop word with a space, which is then discarded.

You can replace all stop words with your own special term value
to get a space indicator.

It is also possible to index nothing at a particular position, for example
at the position of a stop word. This gives a "gap" in the index,
see below.
 
> So you're indexing 'you find answer', and both your searches are
> looking for 'you find answer'; the stop words are just gone as though
> they never were. So both queries match.
> 
> But I've been wrong before <G>...
> 
> I can't really speak to the highlighter question, so I'll let someone
> more knowledgeable pipe up.
> 
> Erick
> 
> On 4/12/07, Bill Taylor <wa...@alum.mit.edu> wrote:
> >
> > I found some discussions of this question from back in 2003, but that was
> > many updates ago.
> >
> > I have built an index using the standard stop analyser which uses the
> > standard list of stop words.  "will" and "the" are stop words.
> >
> > As I understand analyzers and phrase queries, when I search for
> >
> > you will find the answer
> >
> > using the default slop of 0, I should find any pattern like
> >
> > you <any stop word> find <any stop word> answer
> >
> > because the analyzer replaces "will" and "the" in the query with a space
> > indicator as it did when analyzing the original input text.  Instead, I
> > find
> > phrases such as
> >
> > you find an answer
> >
> > "an" is a stop word, so matching "find an answer" is as expected, but
> > there
> > is no stop word between "you" and "find" in the original input string.  I
> > do
> > not see why "you find an answer" matches.
> >
> > What am I doing wrong?

The problem may be that you expect a gap in the index.
When there is a gap in the index, it is also necessary to adapt
the analyzer used for the phrase query to query for a gap.
I don't know whether PhraseQuery can handle such an analyzer.

To have a gap in the index, you need to change your analyzer
to add a gap for a stop word. This can be done by changing the
position increment when a stop word is encountered, see
Token.setPositionIncrement(). IIRC, you need to write a variation
of StopFilter for this.
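
To make the effect of such a gap concrete, here is a minimal plain-Java sketch. It does not call Lucene at all: the class name StopWordGaps, the four-word stop list, and the whitespace tokenizer are assumptions made for illustration. A removed stop word leaves an empty slot (null), which plays the role that a position increment greater than 1 on the following token would play in a real analyzer. In this sketch a gap in the query accepts any word at that position, not only a stop word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java simulation of position gaps; it does not call Lucene.
class StopWordGaps {
    // Assumed stop list for illustration only (not Lucene's full list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("will", "the", "an", "a"));

    // Lower-case and split on whitespace; a stop word becomes null,
    // i.e. an empty position that still counts as a position.
    static String[] analyze(String text) {
        String cleaned = text.toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        String[] slots = cleaned.split("\\s+");
        for (int i = 0; i < slots.length; i++) {
            if (STOP_WORDS.contains(slots[i])) {
                slots[i] = null;
            }
        }
        return slots;
    }

    // Slop-0 phrase match that respects gaps between the query terms.
    static boolean phraseMatches(String document, String phrase) {
        String[] doc = analyze(document);
        String[] query = analyze(phrase);
        // Ignore gaps at the edges of the query; only gaps between terms matter.
        int first = 0;
        int last = query.length - 1;
        while (first <= last && query[first] == null) first++;
        while (last >= first && query[last] == null) last--;
        if (first > last) {
            return false; // the query contained only stop words
        }
        int span = last - first + 1;
        for (int start = 0; start + span <= doc.length; start++) {
            boolean match = true;
            for (int i = 0; i < span && match; i++) {
                String wanted = query[first + i];
                // A null query slot is a gap and accepts whatever sits there.
                match = wanted == null || wanted.equals(doc[start + i]);
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String query = "you will find the answer";
        System.out.println(phraseMatches("you find an answer", query));      // false
        System.out.println(phraseMatches("you will find an answer", query)); // true
        System.out.println(phraseMatches("you can find the answer", query)); // true
    }
}
```

With gaps kept on both the index side and the query side, "you find an answer" no longer matches, because nothing sits between "you" and "find" in that text.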

Regards,
Paul Elschot





Re: I have a question about phrase query with stop words

Posted by Erick Erickson <er...@gmail.com>.
As I understand it, there really is no "space indicator". I think of it
as replacing the stop word with a space, which is then discarded.

So you're indexing 'you find answer', and both your searches are
looking for 'you find answer'; the stop words are just gone as though
they never were. So both queries match.
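
A minimal plain-Java sketch of that reasoning follows. It does not call Lucene: the class name DroppedStopWords, the four-word stop list, and the whitespace tokenizer are assumptions made for illustration. Stop words are dropped without leaving any gap, so the text and the query reduce to the same three terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation; it does not call Lucene.
class DroppedStopWords {
    // Assumed stop list for illustration only (not Lucene's full list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("will", "the", "an", "a"));

    // Lower-case, split on whitespace, and drop stop words without leaving a gap.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                terms.add(word);
            }
        }
        return terms;
    }

    // Slop-0 phrase match: the query terms must appear consecutively
    // in the list of document terms.
    static boolean phraseMatches(String document, String phrase) {
        List<String> doc = analyze(document);
        List<String> query = analyze(phrase);
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(analyze("you will find the answer")); // [you, find, answer]
        System.out.println(analyze("you find an answer"));       // [you, find, answer]
        System.out.println(
                phraseMatches("you find an answer", "you will find the answer")); // true
    }
}
```

Both strings analyze to the same term list, which is why the phrase query matches even though the original text has no stop word between "you" and "find".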

But I've been wrong before <G>...

I can't really speak to the highlighter question, so I'll let someone
more knowledgeable pipe up.

Erick
