Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2018/10/01 07:47:00 UTC

[jira] [Commented] (LUCENE-8516) Make WordDelimiterGraphFilter a Tokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633677#comment-16633677 ] 

Alan Woodward commented on LUCENE-8516:
---------------------------------------

Comment from [~msokolov@gmail.com]:

My current usage of this filter requires it to be a filter, since I need to precede it with other filters. I think the idea of not touching offsets preserves more flexibility, and since the offsets are already unreliable, we wouldn't be losing much.
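For context, here is a minimal sketch of the kind of chain being described, using Lucene's standard analysis API. The WhitespaceTokenizer, the ASCIIFoldingFilter stage, and the flag set are illustrative assumptions, not the actual setup; the point is only that WordDelimiterGraphFilter consumes the output of an earlier filter, which would not be possible if it became a Tokenizer:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

public class DelimiterAfterFilterAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // An earlier filter stage runs before the word-delimiter split;
    // this ordering is what requires WDGF to be a TokenFilter.
    TokenStream stream = new ASCIIFoldingFilter(source);
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
        | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE;
    stream = new WordDelimiterGraphFilter(stream, flags, null);
    return new TokenStreamComponents(source, stream);
  }
}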

> Make WordDelimiterGraphFilter a Tokenizer
> -----------------------------------------
>
>                 Key: LUCENE-8516
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8516
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8516.patch
>
>
> Being able to split tokens at arbitrary points in a filter chain, in effect adding a second round of tokenization, can cause any number of problems when trying to keep token streams consistent with their contract. The most common offender here is WordDelimiterGraphFilter, which can produce broken offsets in a wide range of situations.
> We should make WDGF a Tokenizer in its own right; this should preserve all the functionality we need while making reasoning about the resulting token stream much simpler.
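
For anyone who wants to observe the offsets in question directly, a small self-contained sketch against the Lucene 7.x analysis API (the input string, the ASCIIFoldingFilter stage, and the flag are illustrative assumptions): it prints each token's term text and offsets after a chain in which an earlier filter has changed the token text's length ("æ" folded to "ae") before the split, which is the kind of situation where reported offsets stop lining up with the original input.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class WdgfOffsetDump {
  public static void main(String[] args) throws Exception {
    String input = "dæmon-log";
    Tokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader(input));
    // Folding changes the token's length before WDGF splits on the hyphen.
    TokenStream ts = new ASCIIFoldingFilter(tok);
    ts = new WordDelimiterGraphFilter(ts,
        WordDelimiterGraphFilter.GENERATE_WORD_PARTS, null);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()
          + " [" + off.startOffset() + "," + off.endOffset() + ")");
    }
    ts.end();
    ts.close();
  }
}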



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org