Posted to dev@lucene.apache.org by Nathan Ashworth <na...@yahoo.com> on 2008/08/28 00:54:13 UTC

SpanNearQuery: All matches within slop

A more detailed explanation of the issue was posted about a year ago,
http://www.nabble.com/Possible-bug-in-SpanNearQuery-td10345758.html. I
couldn't find any signs of resolution.

As a brief summary, consider a field with these terms,

"two one one two"

An ordered SpanNearQuery,

spanNear([text:two, text:one], 1, true)

yields one span,

two one [0,2]

An unordered SpanNearQuery,

spanNear([text:two, text:one], 1, false)

yields three spans,

two one [0,2]
one one two [1,4]
one two [2,4]

Neither query includes the span, "two one one" [0,3].
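
To make the missing span concrete, here is a small Python model of the example. It is not Lucene code: `term_positions` and `all_near_spans` are hypothetical helpers, and the slop rule (span width minus the number of terms must not exceed the slop) is an assumption chosen to agree with the spans listed above. It enumerates every combination of term positions by brute force, so it only illustrates what "all matches within slop" would contain.

```python
from itertools import product


def term_positions(tokens, term):
    """Return every position at which `term` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def all_near_spans(tokens, terms, slop, in_order):
    """Enumerate every (start, end) span, end exclusive, that covers one
    occurrence of each term with at most `slop` extra positions.

    Assumption: slop is measured as span width minus number of terms.
    """
    spans = set()
    for combo in product(*(term_positions(tokens, t) for t in terms)):
        if len(set(combo)) != len(combo):
            continue  # two clauses may not share a position
        if in_order and list(combo) != sorted(combo):
            continue  # ordered query: positions must follow clause order
        start, end = min(combo), max(combo) + 1
        if (end - start) - len(terms) <= slop:
            spans.add((start, end))
    return sorted(spans)


tokens = "two one one two".split()
ordered = all_near_spans(tokens, ["two", "one"], 1, True)
unordered = all_near_spans(tokens, ["two", "one"], 1, False)
print(ordered)    # [(0, 2), (0, 3)]
print(unordered)  # [(0, 2), (0, 3), (1, 4), (2, 4)]
```

In this model, (0, 3) is the "two one one" span that neither query above reports.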

This shows up as a problem in my work when I want to define an
inverted proximity operation. Say I want to find all instances of the word
"one" that don't follow the word "two" within some slop value. My initial
thought was that this query,

spanNot(text:one, spanNear([text:two, text:one], 1, true))

would work. With the example string, I would have expected 0 spans returned.
However, that query returns a span, "one" [2,3]. I understand now why this
happens.
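
The effect on the spanNot query can be sketched in a few lines of Python. This is a model, not the Lucene implementation: `span_not` is a hypothetical helper, and it assumes that an included span is dropped whenever it overlaps any excluded span. The span lists are written out by hand from the example string.

```python
def span_not(include, exclude):
    """Keep the spans in `include` that overlap no span in `exclude`.

    Spans are (start, end) pairs with an exclusive end.
    """
    return [
        (s, e)
        for (s, e) in include
        if not any(s < xe and xs < e for (xs, xe) in exclude)
    ]


one_spans = [(1, 2), (2, 3)]   # the two occurrences of "one"
reported = [(0, 2)]            # the single span the ordered query yields
complete = [(0, 2), (0, 3)]    # the same, plus "two one one" [0,3]

print(span_not(one_spans, reported))  # [(2, 3)]
print(span_not(one_spans, complete))  # []
```

With only the reported span excluded, "one" [2,3] survives; with [0,3] also excluded, nothing is left, which is the result I had expected.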

As a result of SpanNearQuery not matching all possible spans, the
SpanNotQuery operator cannot provide a logically inverted set of all
possible spans. Any compound SpanQuery that is dependent on that inverted
set being complete will be glaringly inaccurate.

I've looked at the code enough to know that I would have to look at it
a lot longer in order to fully understand the algorithm. Is there any
general interest in modifying NearSpanOrdered/NearSpanUnordered to include
all possible spans?

Thanks,

Nathan
--
View this message in context: http://www.nabble.com/SpanNearQuery%3A-All-matches-within-slop-tp19191359p19191359.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: SpanNearQuery: All matches within slop

Posted by Mark Miller <ma...@gmail.com>.

Correction: I meant greedy, not lazy.

Mark Miller wrote:
> Looks pretty interesting. A lazy span implementation could certainly 
> be useful for certain highlighting situations. Barrier of entry looks 
> a bit high unfortunately, but the text def helps for anyone looking to 
> learn how the current spanquery stuff works.
>
>
> Paul Elschot wrote:
>> A bit late in reacting, but you may also want to take a look
>> at this:
>> Paolo Boldi, Sebastiano Vigna
>> Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics
>> Oct 2007, arXiv:0710.1525v1
>>
>> The algorithms used in the lucene spans package are surprisingly
>> similar. Nevertheless, there are some differences too, especially
>> in the queue ordering conditions.
>>
>> Regards,
>> Paul Elschot
>>
>>
>> On Thursday 28 August 2008 01:09:51, Mark Miller wrote:
>>
>>> It's a matter of speed. Once you know the document matches the query,
>>> it would, in general, make no sense to keep looking unless you had a
>>> strong reason to factor it into scoring. So I don't think it makes
>>> much sense to modify...now adding new Span classes...
>>>
>>>    
>>>>  Is there any
>>>> general interest in modifying NearSpanOrdered/NearSpanUnordered to
>>>> include all possible spans?
>>>>
>>>> Thanks,
>>>>
>>>> Nathan

Re: SpanNearQuery: All matches within slop

Posted by Mark Miller <ma...@gmail.com>.

Looks pretty interesting. A lazy span implementation could certainly be 
useful for certain highlighting situations. Barrier of entry looks a bit 
high unfortunately, but the text def helps for anyone looking to learn 
how the current spanquery stuff works.


Paul Elschot wrote:
> A bit late in reacting, but you may also want to take a look
> at this:
> Paolo Boldi, Sebastiano Vigna
> Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics
> Oct 2007, arXiv:0710.1525v1
>
> The algorithms used in the lucene spans package are surprisingly
> similar. Nevertheless, there are some differences too, especially
> in the queue ordering conditions.
>
> Regards,
> Paul Elschot
>
>
> On Thursday 28 August 2008 01:09:51, Mark Miller wrote:
>
>> It's a matter of speed. Once you know the document matches the query,
>> it would, in general, make no sense to keep looking unless you had a
>> strong reason to factor it into scoring. So I don't think it makes
>> much sense to modify...now adding new Span classes...
>>
>>     
>>>  Is there any
>>> general interest in modifying NearSpanOrdered/NearSpanUnordered to
>>> include all possible spans?
>>>
>>> Thanks,
>>>
>>> Nathan

Re: SpanNearQuery: All matches within slop

Posted by Paul Elschot <pa...@xs4all.nl>.

A bit late in reacting, but you may also want to take a look
at this:
Paolo Boldi, Sebastiano Vigna
Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics
Oct 2007, arXiv:0710.1525v1

The algorithms used in the lucene spans package are surprisingly
similar. Nevertheless, there are some differences too, especially
in the queue ordering conditions.
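
As a rough illustration of what "minimal-interval" means here, the Python sketch below filters a set of spans down to those that contain no other span from the same set. It is a model only: `minimal_intervals` is a hypothetical helper, the containment rule is the usual reading of minimal-interval semantics rather than something taken from the spans package, and the input lists are the brute-force "all matches within slop 1" for the "two one one two" example from the start of this thread.

```python
def minimal_intervals(spans):
    """Keep only the spans that contain no other, different span.

    Spans are (start, end) pairs with an exclusive end.
    """
    def contains(outer, inner):
        return (
            outer != inner
            and outer[0] <= inner[0]
            and inner[1] <= outer[1]
        )

    unique = set(spans)
    return sorted(s for s in unique if not any(contains(s, t) for t in unique))


# Every match within slop 1 for terms "two" and "one" in "two one one two".
all_ordered = [(0, 2), (0, 3)]
all_unordered = [(0, 2), (0, 3), (1, 4), (2, 4)]

print(minimal_intervals(all_ordered))    # [(0, 2)]
print(minimal_intervals(all_unordered))  # [(0, 2), (2, 4)]
```

For the ordered case this gives the single span [0,2] reported at the start of the thread. For the unordered case the thread also reports [1,4], which is not minimal under this definition, so the model does not reproduce the unordered behaviour exactly.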

Regards,
Paul Elschot


On Thursday 28 August 2008 01:09:51, Mark Miller wrote:
> It's a matter of speed. Once you know the document matches the query,
> it would, in general, make no sense to keep looking unless you had a
> strong reason to factor it into scoring. So I don't think it makes
> much sense to modify...now adding new Span classes...
>
> >  Is there any
> > general interest in modifying NearSpanOrdered/NearSpanUnordered to
> > include all possible spans?
> >
> > Thanks,
> >
> > Nathan

Re: SpanNearQuery: All matches within slop

Posted by Mark Miller <ma...@gmail.com>.

It's a matter of speed. Once you know the document matches the query, it 
would, in general, make no sense to keep looking unless you had a strong 
reason to factor it into scoring. So I don't think it makes much sense 
to modify...now adding new Span classes...
>  Is there any
> general interest in modifying NearSpanOrdered/NearSpanUnordered to include
> all possible spans? 
>
> Thanks,
>
> Nathan