You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Marc Hadfield <ma...@animarc.com> on 2006/01/05 03:39:13 UTC
span / position increment issue
hello all -
i have a problem with a SpanNearQuery returning incorrect (false
positive) results.
I am creating the context of a field using tokens which have position
increment set to either 1 or 0. The position increment is set to 0 for
special tokens, in this case part-of-speech markers.
Thus, brackets set of position increments:
[The __pos_dt] [cow __pos_noun] [jumped __pos_verb] [over __pos_prep]
[the __pos_dt] [moon __pos_noun] [. __pos_.]
My Span Query looks like:
SpanQuery sq = new SpanNearQuery(new SpanQuery[]
{
new SpanTermQuery(new Term("content", "jumped")),
new SpanTermQuery(new Term("content", "__pos_verb"))
}, 0, false);
This correctly finds the span: [jumped __pos_verb]
However, if I query:
SpanQuery sq = new SpanNearQuery(new SpanQuery[]
{
new SpanTermQuery(new Term("content", "jumped")),
new SpanTermQuery(new Term("content", "__pos_noun"))
}, 0, false);
This incorrectly finds the span: [cow __pos_noun] [jumped __pos_verb]
This is wrong because there is a distance of 1 between the tokens, not 0.
I am using a recent version of Lucene from SVN.
I am thinking that the problem is related to the position increment
being set to 0 for the first token of the incorrect "match" -- thus
perhaps this is a bug in the SpanNearQuery?
Best,
Marc
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: span / position increment issue
Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 05 January 2006 19:33, Marc Hadfield wrote:
>
> Thanks Erik, Hoss -
>
> I will try MultiPhraseQuery and report back.
>
> Back in an email thread with Doug be mentioned SpanQuery would work, and
> in a fashion it does, but I can't differentiate between terms at the
> same position and contiguous positions. The problem gets worse if I
> want to test for more than two terms at the same position ( moon / noun
> / end_of_sentence / ...), which opens it up to wider contiguous spaces.
>
> Would it make sense to extend SpanNearQuery for this case, perhaps with
> a slop of a special value (-1) to indicate terms should be found at the
> same position? I'm not sure how difficult this would be to represent in
> the underlying queues keeping track of matches.
Fwiw, there is an alternative implementation of ordered NearSpans here:
http://issues.apache.org/jira/browse/LUCENE-413
The NearSpansOrdered there might actually work as needed for this case.
Spans are normally slower than Phrases, so I'd try the MultiPhraseQuery first.
Regards,
Paul Elschot
>
> ---Marc
>
> Erik Hatcher wrote:
>
> > Marc,
> >
> > SpanNearQuery isn't capable of performing the proximity to within
> > only a single position in the manner you've described. A slop of 0
> > means the terms must be contiguous with no gaps, which also allows
> > for matches in the same position as in your first example.
> >
> > I think MultiPhraseQuery (from svn trunk) will do what you want
> > though. Please try that and report back on how it works for you.
> >
> > Erik
> >
> >
>
> >: i have a problem with a SpanNearQuery returning incorrect (false
> >: positive) results.
> >
> >I'm not familiar with how Span queries are implimented, but there doesn't
> >appear to be any test cases dealing with an index where term position
> >increments are ever 0, so i can neither confirm nor deny the bug you're
> >seeing.
> >
> >Can you post code demonstrating the problem? ideally in the form of
> >a simple, self contained, JUnit test?
> >
> >
> >
> >-Hoss
> >
>
>
>
>
> > On Jan 4, 2006, at 9:39 PM, Marc Hadfield wrote:
> >
> >> hello all -
> >>
> >> i have a problem with a SpanNearQuery returning incorrect (false
> >> positive) results.
> >>
> >> I am creating the context of a field using tokens which have
> >> position increment set to either 1 or 0. The position increment is
> >> set to 0 for special tokens, in this case part-of-speech markers.
> >> Thus, brackets set of position increments:
> >>
> >> [The __pos_dt] [cow __pos_noun] [jumped __pos_verb] [over
> >> __pos_prep] [the __pos_dt] [moon __pos_noun] [. __pos_.]
> >>
> >>
> >> My Span Query looks like:
> >> SpanQuery sq = new SpanNearQuery(new SpanQuery[]
> >> {
> >> new SpanTermQuery(new Term("content",
> >> "jumped")),
> >> new SpanTermQuery(new
> >> Term("content", "__pos_verb"))
> >> }, 0, false);
> >>
> >> This correctly finds the span: [jumped __pos_verb]
> >>
> >> However, if I query:
> >> SpanQuery sq = new SpanNearQuery(new SpanQuery[]
> >> {
> >> new SpanTermQuery(new Term("content",
> >> "jumped")),
> >> new SpanTermQuery(new
> >> Term("content", "__pos_noun"))
> >> }, 0, false);
> >>
> >> This incorrectly finds the span: [cow __pos_noun] [jumped __pos_verb]
> >>
> >> This is wrong because there is a distance of 1 between the tokens,
> >> not 0.
> >>
> >> I am using a recent version of Lucene from SVN.
> >>
> >> I am thinking that the problem is related to the position increment
> >> being set to 0 for the first token of the incorrect "match" -- thus
> >> perhaps this is a bug in the SpanNearQuery?
> >>
> >>
> >> Best,
> >> Marc
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: span / position increment issue
Posted by Marc Hadfield <ma...@animarc.com>.
Thanks Erik, Hoss -
I will try MultiPhraseQuery and report back.
Back in an email thread with Doug be mentioned SpanQuery would work, and
in a fashion it does, but I can't differentiate between terms at the
same position and contiguous positions. The problem gets worse if I
want to test for more than two terms at the same position ( moon / noun
/ end_of_sentence / ...), which opens it up to wider contiguous spaces.
Would it make sense to extend SpanNearQuery for this case, perhaps with
a slop of a special value (-1) to indicate terms should be found at the
same position? I'm not sure how difficult this would be to represent in
the underlying queues keeping track of matches.
---Marc
Erik Hatcher wrote:
> Marc,
>
> SpanNearQuery isn't capable of performing the proximity to within
> only a single position in the manner you've described. A slop of 0
> means the terms must be contiguous with no gaps, which also allows
> for matches in the same position as in your first example.
>
> I think MultiPhraseQuery (from svn trunk) will do what you want
> though. Please try that and report back on how it works for you.
>
> Erik
>
>
>: i have a problem with a SpanNearQuery returning incorrect (false
>: positive) results.
>
>I'm not familiar with how Span queries are implimented, but there doesn't
>appear to be any test cases dealing with an index where term position
>increments are ever 0, so i can neither confirm nor deny the bug you're
>seeing.
>
>Can you post code demonstrating the problem? ideally in the form of
>a simple, self contained, JUnit test?
>
>
>
>-Hoss
>
> On Jan 4, 2006, at 9:39 PM, Marc Hadfield wrote:
>
>> hello all -
>>
>> i have a problem with a SpanNearQuery returning incorrect (false
>> positive) results.
>>
>> I am creating the context of a field using tokens which have
>> position increment set to either 1 or 0. The position increment is
>> set to 0 for special tokens, in this case part-of-speech markers.
>> Thus, brackets set of position increments:
>>
>> [The __pos_dt] [cow __pos_noun] [jumped __pos_verb] [over
>> __pos_prep] [the __pos_dt] [moon __pos_noun] [. __pos_.]
>>
>>
>> My Span Query looks like:
>> SpanQuery sq = new SpanNearQuery(new SpanQuery[]
>> {
>> new SpanTermQuery(new Term("content",
>> "jumped")),
>> new SpanTermQuery(new
>> Term("content", "__pos_verb"))
>> }, 0, false);
>>
>> This correctly finds the span: [jumped __pos_verb]
>>
>> However, if I query:
>> SpanQuery sq = new SpanNearQuery(new SpanQuery[]
>> {
>> new SpanTermQuery(new Term("content",
>> "jumped")),
>> new SpanTermQuery(new
>> Term("content", "__pos_noun"))
>> }, 0, false);
>>
>> This incorrectly finds the span: [cow __pos_noun] [jumped __pos_verb]
>>
>> This is wrong because there is a distance of 1 between the tokens,
>> not 0.
>>
>> I am using a recent version of Lucene from SVN.
>>
>> I am thinking that the problem is related to the position increment
>> being set to 0 for the first token of the incorrect "match" -- thus
>> perhaps this is a bug in the SpanNearQuery?
>>
>>
>> Best,
>> Marc
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: span / position increment issue
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Marc,
SpanNearQuery isn't capable of performing the proximity to within
only a single position in the manner you've described. A slop of 0
means the terms must be contiguous with no gaps, which also allows
for matches in the same position as in your first example.
I think MultiPhraseQuery (from svn trunk) will do what you want
though. Please try that and report back on how it works for you.
Erik
On Jan 4, 2006, at 9:39 PM, Marc Hadfield wrote:
> hello all -
>
> i have a problem with a SpanNearQuery returning incorrect (false
> positive) results.
>
> I am creating the context of a field using tokens which have
> position increment set to either 1 or 0. The position increment is
> set to 0 for special tokens, in this case part-of-speech markers.
> Thus, brackets set of position increments:
>
> [The __pos_dt] [cow __pos_noun] [jumped __pos_verb] [over
> __pos_prep] [the __pos_dt] [moon __pos_noun] [. __pos_.]
>
>
> My Span Query looks like:
> SpanQuery sq = new SpanNearQuery(new SpanQuery[]
> {
> new SpanTermQuery(new Term("content",
> "jumped")),
> new SpanTermQuery(new
> Term("content", "__pos_verb"))
> }, 0, false);
>
> This correctly finds the span: [jumped __pos_verb]
>
> However, if I query:
> SpanQuery sq = new SpanNearQuery(new SpanQuery[]
> {
> new SpanTermQuery(new Term("content",
> "jumped")),
> new SpanTermQuery(new
> Term("content", "__pos_noun"))
> }, 0, false);
>
> This incorrectly finds the span: [cow __pos_noun] [jumped __pos_verb]
>
> This is wrong because there is a distance of 1 between the tokens,
> not 0.
>
> I am using a recent version of Lucene from SVN.
>
> I am thinking that the problem is related to the position increment
> being set to 0 for the first token of the incorrect "match" -- thus
> perhaps this is a bug in the SpanNearQuery?
>
>
> Best,
> Marc
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: span / position increment issue
Posted by Chris Hostetter <ho...@fucit.org>.
: i have a problem with a SpanNearQuery returning incorrect (false
: positive) results.
I'm not familiar with how Span queries are implimented, but there doesn't
appear to be any test cases dealing with an index where term position
increments are ever 0, so i can neither confirm nor deny the bug you're
seeing.
Can you post code demonstrating the problem? ideally in the form of
a simple, self contained, JUnit test?
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org