Posted to java-user@lucene.apache.org by Avi Rosenschein <ar...@gmail.com> on 2010/01/19 12:50:38 UTC

PhraseQuery with term positions

Hi,

I am using PhraseQuery with explicitly set term positions and slop=0, in
order to skip stop words. The field in my index is indexed with TermVector
positions.

When I run a query with the stop words skipped, for example "internet for
research" (translated into the PhraseQuery "internet ? research"), I get
results that have a non-stop word in the position where the stop word
should be (e.g. "internet related research"), as well as results that have
a stop word there.

Is this expected behavior? If so, is there any way to do what I want, which
is for the query to match only results like "internet [stop-word] research"?

Thanks,
-- Avi
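
[Editor's illustration] The behavior described above can be reproduced with a toy positional index. This is a minimal sketch in plain Java, not Lucene code: the class name, the stop list, and the helper methods are all hypothetical. It assumes stop words are dropped at index time while still advancing the position counter, and that the query places "internet" at position 0 and "research" at position 2 (in Lucene of this era that would presumably be done with PhraseQuery.add(Term, int) and setSlop(0)).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of a positional index; NOT Lucene code. All names are hypothetical.
final class PhrasePositionSketch {
    // Assumed stop list, chosen only for this illustration.
    static final Set<String> STOP_WORDS = Set.of("for", "the", "of", "a");

    // Map each kept term to its positions. Stop words are dropped but still
    // advance the position counter, which leaves a hole in the index.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new HashMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!STOP_WORDS.contains(tokens[pos])) {
                postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
            }
        }
        return postings;
    }

    // Exact (slop = 0) match: some start offset must place every query term at
    // start + its query position. Positions the query skips are never inspected.
    static boolean matches(Map<String, List<Integer>> postings, String[] terms, int[] positions) {
        List<Integer> first = postings.getOrDefault(terms[0], List.of());
        for (int p : first) {
            int start = p - positions[0];
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                all = postings.getOrDefault(terms[i], List.of()).contains(start + positions[i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"internet", "research"};
        int[] positions = {0, 2}; // position 1 is the skipped stop word
        System.out.println(matches(index("internet for research"), terms, positions));     // true
        System.out.println(matches(index("internet related research"), terms, positions)); // true
        System.out.println(matches(index("internet research"), terms, positions));         // false
    }
}
```

In this model both "internet for research" and "internet related research" match, because the matcher only looks at the postings of the query terms and never asks what occupies the skipped position.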

Re: PhraseQuery with term positions

Posted by Avi Rosenschein <ar...@gmail.com>.
The index is pretty large (50 GB, divided into 8 shards). I'm afraid I would
start running into memory issues by adding the stop words (though it is
definitely something I would like to test at some point).

My question was more an attempt to understand whether this is known behavior
in Lucene, since I can't really think of a situation where it would be
desired (maybe if the user was knowingly searching for "a
[one-word-wildcard] b"; but a better way to do that would be with slop, not
with term positions). Wouldn't it be better to have the ExactPhraseScorer
not allow unmatched holes (i.e. terms in the document that fall inside the
phrase but are not matched by the query)?

-- Avi
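
[Editor's illustration] The "no unmatched holes" rule proposed above can be sketched as follows. This is plain Java over a toy per-position view of a document, not a patch to ExactPhraseScorer; the class, the stop list, and the method names are hypothetical. A real inverted index stores positions per term, so asking whether a position is empty may not be as cheap there as it is in this model.

```java
import java.util.Arrays;
import java.util.Set;

// Toy model of a stricter phrase match; NOT Lucene code. All names are hypothetical.
final class NoUnmatchedHolesSketch {
    // Assumed stop list, chosen only for this illustration.
    static final Set<String> STOP_WORDS = Set.of("for", "the", "of", "a");

    // Indexed view of a document: slot i holds the term at position i,
    // or null where a stop word was removed.
    static String[] indexedSlots(String text) {
        String[] slots = text.toLowerCase().split("\\s+");
        for (int i = 0; i < slots.length; i++) {
            if (STOP_WORDS.contains(slots[i])) {
                slots[i] = null;
            }
        }
        return slots;
    }

    // Strict match: every query slot must equal the document slot at the same
    // offset, so a hole in the query (null) only matches a hole in the document.
    static boolean strictMatch(String[] doc, String[] query) {
        for (int start = 0; start + query.length <= doc.length; start++) {
            String[] window = Arrays.copyOfRange(doc, start, start + query.length);
            if (Arrays.equals(window, query)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] query = {"internet", null, "research"}; // null marks the skipped stop word
        System.out.println(strictMatch(indexedSlots("internet for research"), query));     // true
        System.out.println(strictMatch(indexedSlots("internet related research"), query)); // false
    }
}
```

Under this rule "internet related research" is rejected because position 1 holds an indexed term, while any stop word in that position still matches.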

On Tue, Jan 19, 2010 at 3:28 PM, Erick Erickson <er...@gmail.com> wrote:

> How big is your index? Because the simplest thing would be
> to just not remove stopwords at index or query time. Perhaps
> in a duplicate field depending upon your needs.
>
> Erick

Re: PhraseQuery with term positions

Posted by Erick Erickson <er...@gmail.com>.
How big is your index? Because the simplest thing would be
to just not remove stopwords at index or query time. Perhaps
in a duplicate field depending upon your needs.

Erick
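
[Editor's illustration] A rough model of the suggestion above: keep every token, stop words included, in a separate field and run the phrase against that field. This is a plain Java sketch with hypothetical names, not Lucene analyzer or field configuration, and it says nothing about the extra index size such a field would cost.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of a field that keeps stop words; NOT Lucene code. All names are hypothetical.
final class DuplicateFieldSketch {
    // Hypothetical "exact" field: every token is kept, stop words included.
    static List<String> exactField(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // A phrase matches when its tokens appear as a contiguous run in the field.
    static boolean exactPhrase(List<String> field, String phrase) {
        return Collections.indexOfSubList(field, exactField(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String phrase = "internet for research";
        System.out.println(exactPhrase(exactField("internet for research"), phrase));     // true
        System.out.println(exactPhrase(exactField("internet related research"), phrase)); // false
    }
}
```

Because the stop word is now an ordinary indexed term, the phrase only matches documents that contain that exact word in the middle position. Note that this is stricter than matching any stop word there.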
