Posted to solr-user@lucene.apache.org by Dmitry Kan <so...@gmail.com> on 2012/12/19 09:53:37 UTC

using PositionIncrementAttribute to increment certain term positions to large values

Dear list,

We are currently evaluating proximity searches ("term1 term2"~slop) for a
specific use case. Each document contains artificial delimiter characters
(one character between each pair of adjacent sentences in the text). Our
goal is for every proximity search to match within individual sentences and
never across a sentence boundary.

We figured that by using a PositionIncrementAttribute as a field in a
descendant of the TokenFilter class, it is possible to set the position
increment of each artificial delimiter (which is a term in Lucene / Solr
terminology) to an arbitrarily large number. Any proximity search with a
reasonably small slop value should then automatically match only within
sentence boundaries.
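
To make the idea concrete, here is a minimal plain-Java model of it. It is
not Lucene code: the class name, the "|" delimiter, the gap of 1000 and the
simplified in-order slop check are all assumptions made for illustration. In
a real TokenFilter the large increment would be applied by calling
PositionIncrementAttribute.setPositionIncrement(int) inside incrementToken().

```java
// Plain-Java model of position increments; no Lucene classes are used.
class SentenceGapDemo {
    // Hypothetical values: the thread names neither the delimiter nor the gap.
    static final String DELIMITER = "|";
    static final int SENTENCE_GAP = 1000;

    // Assigns a position to every token. Ordinary tokens advance the
    // position by 1; the delimiter token advances it by SENTENCE_GAP.
    static int[] positions(String[] tokens) {
        int[] result = new int[tokens.length];
        int position = -1; // so the first ordinary token lands at position 0
        for (int i = 0; i < tokens.length; i++) {
            int increment = tokens[i].equals(DELIMITER) ? SENTENCE_GAP : 1;
            position += increment;
            result[i] = position;
        }
        return result;
    }

    // Simplified proximity check: 'second' must occur after 'first' with at
    // most 'slop' positions in between. Real sloppy phrase matching also
    // handles reordered terms; that is left out of this model.
    static boolean matchesInOrder(String[] tokens, String first, String second, int slop) {
        int[] pos = positions(tokens);
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = 0; j < tokens.length; j++) {
                boolean after = pos[j] > pos[i];
                if (tokens[j].equals(second) && after && pos[j] - pos[i] - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = "the cat sat | the dog ran".split(" ");
        // Same sentence: adjacent terms, so any slop >= 0 matches.
        System.out.println(matchesInOrder(tokens, "cat", "sat", 2));
        // Different sentences: the gap of 1000 exceeds a slop of 5.
        System.out.println(matchesInOrder(tokens, "sat", "dog", 5));
    }
}
```

In this model the gap only blocks cross-sentence matches while the slop stays
below SENTENCE_GAP, which is why the increment has to be large relative to
any slop value users are expected to supply.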

Does this sound like the right way to tackle the problem? Are there any
performance costs involved?

Thanks in advance for any input,

Dmitry Kan

Re: using PositionIncrementAttribute to increment certain term positions to large values

Posted by Dmitry Kan <so...@gmail.com>.

Hi,

For the sake of completeness: I was able to fix the highlighter so that it
works with token matches whose positions go beyond the length of the text
field. The solution was to take the matched token positions modulo the
length of the text whenever they exceed it.

Dmitry

On Thu, Dec 27, 2012 at 10:13 AM, Dmitry Kan <so...@gmail.com> wrote:

> Hi,
>
> Answering my own question for the record: our experiments show that the
> described functionality is achievable with a TokenFilter implementation.
> The only caveat is that the Highlighter component stops working properly
> if the match position goes beyond the length of the text field.
>
> As for performance, we have noticed no major delays compared to the
> original proximity search implementation.
>
> Best,
>
> Dmitry Kan

Re: using PositionIncrementAttribute to increment certain term positions to large values

Posted by Dmitry Kan <so...@gmail.com>.

Hi,

Answering my own question for the record: our experiments show that the
described functionality is achievable with a TokenFilter implementation.
The only caveat is that the Highlighter component stops working properly
if the match position goes beyond the length of the text field.

As for performance, we have noticed no major delays compared to the
original proximity search implementation.

Best,

Dmitry Kan
