Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2018/08/09 08:47:00 UTC

[jira] [Commented] (LUCENE-8450) Enable TokenFilters to assign offsets when splitting tokens

    [ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574493#comment-16574493 ] 

Robert Muir commented on LUCENE-8450:
-------------------------------------

As mentioned on the mailing list, I don't think we should do this for TokenFilters because of the complexity: it is not just the API side but the maintenance side too, worrying about how to keep all the token filters correct.

I am sorry that some tokenfilters that really should be tokenizers extend the wrong base class, but that problem should simply be fixed.

Separately, I don't like the correctOffset() method that we already have on Tokenizer today. Maybe it could be in OffsetAttributeImpl or similar instead. But at least today it's just one method.
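To make the coupling concrete, here is a toy sketch of how offset correction is wired today. These types are simplified stand-ins for Lucene's Tokenizer and CharFilter, not the real API: only the tokenizer holds a reference to the (possibly char-filtered) input, so only it can delegate correctOffset(); a TokenFilter has no equivalent hook.

```java
// Toy mirror of today's arrangement (illustrative names, not Lucene's API).
public class TodayCorrectOffset {
    interface OffsetCorrector {
        int correctOffset(int currentOff);
    }

    // e.g. a markup-stripping char filter that removed 7 characters before the
    // start of the tokenized text, shifting every downstream offset by 7
    static class ToyCharFilter implements OffsetCorrector {
        public int correctOffset(int currentOff) {
            return currentOff + 7;
        }
    }

    static class ToyTokenizer {
        private final OffsetCorrector input; // null when reading a plain Reader

        ToyTokenizer(OffsetCorrector input) {
            this.input = input;
        }

        // mirrors Tokenizer.correctOffset(): delegate to the char filter if any
        int correctOffset(int currentOff) {
            return input != null ? input.correctOffset(currentOff) : currentOff;
        }
    }

    public static void main(String[] args) {
        ToyTokenizer t = new ToyTokenizer(new ToyCharFilter());
        System.out.println(t.correctOffset(3)); // offset 3 in filtered text maps to 10 in the original
    }
}
```

This is the single method Robert refers to: one delegation point on the tokenizer, rather than correction logic spread across every filter.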

> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent filters cannot perform simple arithmetic to calculate the original ("correct") offset of a character in the interior of the token. A similar situation exists for Tokenizers, but these can call CharFilter.correctOffset() to map offsets back to their original location in the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the correct portion of a compound token. For example the german word "außerstand" (meaning afaict "unable to do something") will be decompounded and match "stand" and "ausser", but as things are today, offsets are always set using the start and end of the tokens produced by Tokenizer, meaning that highlighters will match the entire compound.
> I'm proposing to add this method to `TokenStream`:
> {{     public CharOffsetMap getCharOffsetMap()}}
> referencing a CharOffsetMap with these methods:
> {{     int correctOffset(int currentOff);}}
> {{     int uncorrectOffset(int originalOff);}}
>  
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from original offset forward to the current "offset space".
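The proposed pair of mappings can be sketched as follows. This is a hypothetical, self-contained illustration of the CharOffsetMap idea from the issue text (the pivot-based implementation is an assumption, not the attached patch): it records pivot points where the length difference between the current token text and the original input changes, and answers queries in both directions with a floor lookup, similar in spirit to how BaseCharFilter corrects offsets.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the CharOffsetMap proposed in LUCENE-8450.
public class CharOffsetMapSketch {
    // pivot: current offset -> original offset at that point
    private final TreeMap<Integer, Integer> currentToOriginal = new TreeMap<>();
    // inverse pivots, used by the pseudo-inverse uncorrectOffset()
    private final TreeMap<Integer, Integer> originalToCurrent = new TreeMap<>();

    public void addPivot(int currentOff, int originalOff) {
        currentToOriginal.put(currentOff, originalOff);
        originalToCurrent.put(originalOff, currentOff);
    }

    // Map an offset in the current (filtered) stream back to the original input.
    public int correctOffset(int currentOff) {
        Map.Entry<Integer, Integer> e = currentToOriginal.floorEntry(currentOff);
        if (e == null) return currentOff; // no correction recorded this early
        return e.getValue() + (currentOff - e.getKey());
    }

    // Pseudo-inverse: map an original-input offset forward to the current stream.
    public int uncorrectOffset(int originalOff) {
        Map.Entry<Integer, Integer> e = originalToCurrent.floorEntry(originalOff);
        if (e == null) return originalOff;
        return e.getValue() + (originalOff - e.getKey());
    }

    public static void main(String[] args) {
        // "außer" (5 chars) normalized to "ausser" (6 chars): the ß at original
        // index 2 became "ss", so from current offset 4 onward lengths differ by 1.
        CharOffsetMapSketch map = new CharOffsetMapSketch();
        map.addPivot(0, 0);
        map.addPivot(4, 3);
        System.out.println(map.correctOffset(6));   // end of "ausser" -> end of "außer" (5)
        System.out.println(map.uncorrectOffset(5)); // end of "außer" -> end of "ausser" (6)
    }
}
```

With such a map, a highlighter asking about the decompounded token "ausser" could recover the exact original character range instead of the whole compound.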



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org