Posted to java-user@lucene.apache.org by Michael Sokolov <ms...@gmail.com> on 2018/07/25 12:27:10 UTC

offsets

I've run into some difficulties with offsets in some TokenFilters I've been
writing, and I wonder if anyone can shed any light. Because characters may
be inserted or removed by prior filters (eg ICUFoldingFilter does this with
ellipses), and there is no offset-correcting data structure available to
TokenFilters (as there is in CharFilter), there doesn't seem to be any
reliable way to calculate the offset at a point interior to a token, which
means that essentially the only reasonable thing to do with OffsetAttribute
is to preserve the offsets from the input. This means that filters that
split their tokens (like WordDelimiterGraphFilter) have no reliable way of
mapping their split tokens' offsets. One can try, but it seems inevitable
that one needs some arbitrary "fixup" stage in order to guarantee that
the offsets are nondecreasing and properly bounded by the original text
length.
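
To make the problem concrete, here is a minimal standalone sketch (plain Java, no Lucene classes; the class and method names are mine, for illustration only) of the arithmetic that goes wrong when a filter splits a token:

```java
// Standalone sketch (no Lucene classes; names are illustrative) of why a
// TokenFilter cannot compute offsets interior to a token: the term text it
// sees is in filtered coordinates, while OffsetAttribute values are in the
// original text's coordinates, and no correction data is available to it.
public class SplitOffsetDemo {

    // What a naive token-splitting filter would compute for the start
    // offset of a sub-token beginning at splitPos within the term text.
    static int naiveStart(int parentStartOffset, int splitPos) {
        return parentStartOffset + splitPos;
    }

    public static void main(String[] args) {
        // Original text "a…b" (length 3); an ICU-style fold upstream has
        // already rewritten the one-char ellipsis, so the filter sees "a...b".
        String filteredTerm = "a...b";
        int startOffset = 0, endOffset = 3;  // offsets into the ORIGINAL text

        int splitPos = filteredTerm.indexOf('b');       // 4, filtered coords
        int start = naiveStart(startOffset, splitPos);  // 4 -- past endOffset!

        System.out.println("naive sub-token start = " + start
                + " but parent endOffset = " + endOffset);
        // Without the CharFilter's correction arrays there is no way to map
        // position 4 back to the right original position, so the only safe
        // choice is to reuse the parent token's [0, 3) span for the split.
    }
}
```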

If this analysis is correct, it seems one should really never call
OffsetAttribute.setOffset at all, which makes it seem like a trappy
method to provide. (Hmm, now I see a comment in OffsetAttributeImpl
suggesting making the method call-once). If that really is the case, I
think some assertion, deprecation, or other API protection would be useful
so the policy is clear.

Alternatively, do we want to consider providing a "fixup" API as we have
for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
API for setting offset deltas. This would make more precise highlighting
possible in these cases, at least. I'm not sure what other use cases folks
have come up with for offsets?

-Mike

Re: offsets

Posted by Robert Muir <rc...@gmail.com>.
I think you see it correctly. Currently, only tokenizers can really
safely modify offsets, because only they have access to the correction
logic from the charfilter.

Doing it from a tokenfilter just means you will have bugs...

On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov <ms...@gmail.com> wrote:
> I've run into some difficulties with offsets in some TokenFilters I've been
> writing, and I wonder if anyone can shed any light. Because characters may
> be inserted or removed by prior filters (eg ICUFoldingFilter does this with
> ellipses), and there is no offset-correcting data structure available to
> TokenFilters (as there is in CharFilter), there doesn't seem to be any
> reliable way to calculate the offset at a point interior to a token, which
> means that essentially the only reasonable thing to do with OffsetAttribute
> is to preserve the offsets from the input. This means that filters that
> split their tokens (like WordDelimiterGraphFilter) have no reliable way of
> mapping their split tokens' offsets. One can try, but it seems inevitable
> that one needs some arbitrary "fixup" stage in order to guarantee that
> the offsets are nondecreasing and properly bounded by the original text
> length.
>
> If this analysis is correct, it seems one should really never call
> OffsetAttribute.setOffset at all? Which makes it seem like a trappy kind of
> method to provide. (hmm now I see this comment in OffsetAttributeImpl
> suggesting making the method call-once). If that really is the case, I
> think some assertion, deprecation, or other API protection would be useful
> so the policy is clear.
>
> Alternatively, do we want to consider providing a "fixup" API as we have
> for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
> API for setting offset deltas. This would make more precise highlighting
> possible in these cases, at least. I'm not sure what other use cases folks
> have come up with for offsets?
>
> -Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: offsets

Posted by Michael Sokolov <ms...@gmail.com>.
OK, so I thought some more concrete evidence might be helpful to make the
case here and did a quick POC. To get access to precise within-token
offsets we do need to make some changes to the public API, but the profile
could be kept small. In the version I worked up, I extracted the character
offset mapping implementation from BaseCharFilter into a separate
CharOffsetMap interface/class and added these new public methods to
existing classes:

TokenStream.getCharOffsetMap()

CharFilter.uncorrect(int correctOffset)  (pseudo-inverse of correct --
returns the left-most offset in the current character coordinates that
corresponds to the given original character offset)

The CharOffsetMap interface has just two methods, correctOffset and
uncorrectOffset, that support the offset mapping in both CharFilter and
TokenStream.

To fully support setting offsets in TokenFilters we need (at least
something like) this inverse offset-correction method (uncorrect) because
OffsetAttribute's offsets are in the original "correct" character
coordinates, but token lengths in incrementToken() are in filtered ("not
correct") character space, and are not anchored to the origin so cannot be
converted directly.
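
A rough standalone sketch of what such a CharOffsetMap could look like (the linear scans are mine for readability; a real implementation would binary-search the arrays the way BaseCharFilter does, and the exact semantics of uncorrect are an assumption on my part):

```java
// Sketch of the proposed CharOffsetMap: maps between filtered ("current")
// character coordinates and original coordinates. offsets[i] is the
// current-coordinate position at which cumulative correction diffs[i]
// takes effect -- the same two-int-array scheme BaseCharFilter keeps.
public class CharOffsetMap {
    private final int[] offsets;
    private final int[] diffs;

    public CharOffsetMap(int[] offsets, int[] diffs) {
        this.offsets = offsets;
        this.diffs = diffs;
    }

    /** Original offset for a position in current (filtered) coordinates. */
    public int correctOffset(int current) {
        int d = 0;
        for (int i = 0; i < offsets.length && offsets[i] <= current; i++) {
            d = diffs[i];
        }
        return current + d;
    }

    /** Pseudo-inverse: the left-most current-coordinate position whose
     *  corrected value reaches the given original offset. */
    public int uncorrectOffset(int original) {
        int c = 0;
        while (correctOffset(c) < original) {
            c++;
        }
        return c;
    }
}
```

For example, if a char filter deleted one original character at position 1, the map would hold offsets={1}, diffs={+1}: correctOffset(1) == 2, and uncorrectOffset(2) == 1 recovers the filtered position. A token-splitting filter could then compute a sub-token's start roughly as correctOffset(uncorrectOffset(parentStart) + splitPos).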

I recognize the impact is not huge here, but we do have TokenFilters that
split tokens, and a currently trappy OffsetAttribute API. Personally I
think it makes sense to acknowledge that and make it a first-class citizen,
but I guess another alternative (for fixing the trappiness) would be to
make OffsetAttribute unmodifiable. I know that either approach would have
saved me hours of confusion as I tried to correctly implement offsets.


On Wed, Aug 1, 2018 at 8:57 AM Michael Sokolov <ms...@gmail.com> wrote:

> Given that character transformations do happen in TokenFilters, shouldn't
> we strive to have an API that supports correct offsets (ie highlighting)
> for any combination of token filters? Currently we can't do that. For
> example because of the current situation, WordDelimiterGraphFilter,
> decompounding filters and the like cannot assign offsets correctly, so eg
> it becomes impossible to highlight the text that exactly corresponds to the
> user query.
>
> Just one example, if I have URLs in some document text, and analysis chain
> is Whitespace tokenizer followed by WordDelimiterGraphFilter, then a query
> for "http" will end up highlighting the entire URL.
>
> Do you have an idea how we can address this without making our apis crazy?
> Or are you just saying we should live with it as it is?
>
> -Mike
>
>
> On Tue, Jul 31, 2018 at 6:36 AM Robert Muir <rc...@gmail.com> wrote:
>
>> The problem is not a performance one, it's a complexity thing. Really I
>> think only the tokenizer should be messing with the offsets...
>> They are the ones actually parsing the original content so it makes
>> sense they would produce the pointers back to them.
>> I know there are some tokenfilters out there trying to be tokenizers,
>> but we don't need to make our apis crazy to support that.
>>
>> On Mon, Jul 30, 2018 at 11:53 PM, Michael Sokolov <ms...@gmail.com>
>> wrote:
>> > Yes, in fact Tokenizer already provides correctOffset which just
>> delegates
>> > to CharFilter. We could expand on this, moving correctOffset up to
>> > TokenStream, and also adding correct() so that TokenFilters can add to
>> the
>> > character offset data structure (two int arrays) and share it across the
>> > analysis chain.
>> >
>> > Implementation-wise this could continue to delegate to CharFilter I
>> guess,
>> > but I think it would be better to add a character-offset-map abstraction
>> > that wraps the two int arrays and provides the correct/correctOffset
>> > methods to both TokenStream and CharFilter.
>> >
>> > This would let us preserve correct offsets in the face of manipulations
>> > like replacing ellipses, ligatures (like AE, OE), trademark symbols
>> > (replaced by "tm") and the like so that we can have the invariant that
>> > correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length()
>> ==
>> > correctOffset(OffsetAttribute.endOffset), roughly speaking, and enable
>> > token-splitting with correct offsets.
>> >
>> > I can work up a proof of concept; I don't think it would be too
>> > API-intrusive or change performance in a significant way.  Only
>> > TokenFilters that actually care about this (ie that insert or remove
>> > characters, or split tokens) would need to change; others would
>> continue to
>> > work as-is.
>>

Re: offsets

Posted by Michael Sokolov <ms...@gmail.com>.
Given that character transformations do happen in TokenFilters, shouldn't
we strive to have an API that supports correct offsets (ie highlighting)
for any combination of token filters? Currently we can't do that. For
example because of the current situation, WordDelimiterGraphFilter,
decompounding filters and the like cannot assign offsets correctly, so eg
it becomes impossible to highlight the text that exactly corresponds to the
user query.

Just one example, if I have URLs in some document text, and analysis chain
is Whitespace tokenizer followed by WordDelimiterGraphFilter, then a query
for "http" will end up highlighting the entire URL.
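
This failure mode can be modeled without Lucene at all (the class and names below are illustrative, not Lucene API): a word-delimiter-style split that has no way to correct offsets must fall back to the parent token's span, so a highlighter marks the whole URL.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model (not Lucene API): a whitespace token carries original
// offsets; a word-delimiter-style split cannot map sub-token offsets, so
// every part keeps the parent span, and highlighting any part marks it all.
public class HighlightDemo {
    record Token(String term, int start, int end) {}

    static List<Token> splitKeepingParentOffsets(Token t) {
        List<Token> out = new ArrayList<>();
        for (String part : t.term().split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                out.add(new Token(part, t.start(), t.end())); // offsets unchanged
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "see http://foo.com/bar now";
        Token url = new Token("http://foo.com/bar", 4, 22);
        for (Token t : splitKeepingParentOffsets(url)) {
            if (t.term().equals("http")) {
                // A match on "http" highlights text[4..22): the entire URL.
                System.out.println(text.substring(t.start(), t.end()));
            }
        }
    }
}
```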

Do you have an idea how we can address this without making our apis crazy?
Or are you just saying we should live with it as it is?

-Mike


On Tue, Jul 31, 2018 at 6:36 AM Robert Muir <rc...@gmail.com> wrote:

> The problem is not a performance one, it's a complexity thing. Really I
> think only the tokenizer should be messing with the offsets...
> They are the ones actually parsing the original content so it makes
> sense they would produce the pointers back to them.
> I know there are some tokenfilters out there trying to be tokenizers,
> but we don't need to make our apis crazy to support that.
>
> On Mon, Jul 30, 2018 at 11:53 PM, Michael Sokolov <ms...@gmail.com>
> wrote:
> > Yes, in fact Tokenizer already provides correctOffset which just
> delegates
> > to CharFilter. We could expand on this, moving correctOffset up to
> > TokenStream, and also adding correct() so that TokenFilters can add to
> the
> > character offset data structure (two int arrays) and share it across the
> > analysis chain.
> >
> > Implementation-wise this could continue to delegate to CharFilter I
> guess,
> > but I think it would be better to add a character-offset-map abstraction
> > that wraps the two int arrays and provides the correct/correctOffset
> > methods to both TokenStream and CharFilter.
> >
> > This would let us preserve correct offsets in the face of manipulations
> > like replacing ellipses, ligatures (like AE, OE), trademark symbols
> > (replaced by "tm") and the like so that we can have the invariant that
> > correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length()
> ==
> > correctOffset(OffsetAttribute.endOffset), roughly speaking, and enable
> > token-splitting with correct offsets.
> >
> > I can work up a proof of concept; I don't think it would be too
> > API-intrusive or change performance in a significant way.  Only
> > TokenFilters that actually care about this (ie that insert or remove
> > characters, or split tokens) would need to change; others would continue
> to
> > work as-is.
>

Re: offsets

Posted by Robert Muir <rc...@gmail.com>.
The problem is not a performance one, it's a complexity thing. Really I
think only the tokenizer should be messing with the offsets...
They are the ones actually parsing the original content so it makes
sense they would produce the pointers back to them.
I know there are some tokenfilters out there trying to be tokenizers,
but we don't need to make our apis crazy to support that.

On Mon, Jul 30, 2018 at 11:53 PM, Michael Sokolov <ms...@gmail.com> wrote:
> Yes, in fact Tokenizer already provides correctOffset which just delegates
> to CharFilter. We could expand on this, moving correctOffset up to
> TokenStream, and also adding correct() so that TokenFilters can add to the
> character offset data structure (two int arrays) and share it across the
> analysis chain.
>
> Implementation-wise this could continue to delegate to CharFilter I guess,
> but I think it would be better to add a character-offset-map abstraction
> that wraps the two int arrays and provides the correct/correctOffset
> methods to both TokenStream and CharFilter.
>
> This would let us preserve correct offsets in the face of manipulations
> like replacing ellipses, ligatures (like AE, OE), trademark symbols
> (replaced by "tm") and the like so that we can have the invariant that
> correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length() ==
> correctOffset(OffsetAttribute.endOffset), roughly speaking, and enable
> token-splitting with correct offsets.
>
> I can work up a proof of concept; I don't think it would be too
> API-intrusive or change performance in a significant way.  Only
> TokenFilters that actually care about this (ie that insert or remove
> characters, or split tokens) would need to change; others would continue to
> work as-is.



Re: offsets

Posted by Michael Sokolov <ms...@gmail.com>.
Yes, in fact Tokenizer already provides correctOffset which just delegates
to CharFilter. We could expand on this, moving correctOffset up to
TokenStream, and also adding correct() so that TokenFilters can add to the
character offset data structure (two int arrays) and share it across the
analysis chain.

Implementation-wise this could continue to delegate to CharFilter I guess,
but I think it would be better to add a character-offset-map abstraction
that wraps the two int arrays and provides the correct/correctOffset
methods to both TokenStream and CharFilter.

This would let us preserve correct offsets in the face of manipulations
like replacing ellipses, ligatures (like AE, OE), trademark symbols
(replaced by "tm") and the like so that we can have the invariant that
correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length() ==
correctOffset(OffsetAttribute.endOffset), roughly speaking, and enable
token-splitting with correct offsets.

I can work up a proof of concept; I don't think it would be too
API-intrusive or change performance in a significant way.  Only
TokenFilters that actually care about this (ie that insert or remove
characters, or split tokens) would need to change; others would continue to
work as-is.

Re: offsets

Posted by Michael McCandless <lu...@mikemccandless.com>.
How would a fixup API work?  We would try to provide correctOffset
throughout the full analysis chain?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov <ms...@gmail.com> wrote:

> I've run into some difficulties with offsets in some TokenFilters I've been
> writing, and I wonder if anyone can shed any light. Because characters may
> be inserted or removed by prior filters (eg ICUFoldingFilter does this with
> ellipses), and there is no offset-correcting data structure available to
> TokenFilters (as there is in CharFilter), there doesn't seem to be any
> reliable way to calculate the offset at a point interior to a token, which
> means that essentially the only reasonable thing to do with OffsetAttribute
> is to preserve the offsets from the input. This means that filters that
> split their tokens (like WordDelimiterGraphFilter) have no reliable way of
> mapping their split tokens' offsets. One can try, but it seems inevitable
> that one needs some arbitrary "fixup" stage in order to guarantee that
> the offsets are nondecreasing and properly bounded by the original text
> length.
>
> If this analysis is correct, it seems one should really never call
> OffsetAttribute.setOffset at all? Which makes it seem like a trappy kind of
> method to provide. (hmm now I see this comment in OffsetAttributeImpl
> suggesting making the method call-once). If that really is the case, I
> think some assertion, deprecation, or other API protection would be useful
> so the policy is clear.
>
> Alternatively, do we want to consider providing a "fixup" API as we have
> for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
> API for setting offset deltas. This would make more precise highlighting
> possible in these cases, at least. I'm not sure what other use cases folks
> have come up with for offsets?
>
> -Mike
>