Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2019/09/16 15:56:00 UTC

[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

    [ https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930664#comment-16930664 ] 

Michael Gibney commented on LUCENE-8972:
----------------------------------------

Thanks for the feedback/advice, [~rcmuir]. Along the same lines as what you mention, I think some attention also needs to be paid to the resolution/accuracy of offset correction. I'm going to take a crack at this and hope to have something shortly.
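For the sake of discussion, here is a naive sketch of what such a CharFilter might look like (the class name and the eager whole-input approach are hypothetical, not necessarily what the eventual patch will do; BaseCharFilter, addOffCorrectMap, and Transliterator are the existing APIs). It records only the net length change, which is precisely where the offset-correction resolution problem shows up:

{code:java}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.BaseCharFilter;

import com.ibm.icu.text.Transliterator;

// Hypothetical sketch: eagerly transliterate the entire input up front, then
// serve the result through read(). Real code would transliterate incrementally.
public final class SketchICUTransformCharFilter extends BaseCharFilter {

  private final Reader transformed;

  public SketchICUTransformCharFilter(Reader in, Transliterator transliterator) throws IOException {
    super(in);
    StringBuilder buf = new StringBuilder();
    char[] chunk = new char[1024];
    for (int n = in.read(chunk); n > 0; n = in.read(chunk)) {
      buf.append(chunk, 0, n);
    }
    String original = buf.toString();
    String result = transliterator.transliterate(original);
    // Coarse correction: only the net length change at end-of-input is
    // recorded, so offsets interior to the transformed region drift. This
    // is the resolution/accuracy problem mentioned above.
    if (result.length() != original.length()) {
      addOffCorrectMap(result.length(), original.length() - result.length());
    }
    transformed = new StringReader(result);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    return transformed.read(cbuf, off, len);
  }
}
{code}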

> CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8972
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8972
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly dictionary-based ones) may assume pre-normalized input (e.g., for Chinese characters, there may be an assumption of traditional-only or simplified-only input, at the level of either the entire input or each dictionary-defined token).
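> To make the current state concrete, a minimal sketch of the existing post-tokenizer wiring (the analyzer itself is illustrative, not from this issue; ICUTokenizer, ICUTransformFilter, and Transliterator are the real classes):
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.icu.ICUTransformFilter;
> import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
>
> import com.ibm.icu.text.Transliterator;
>
> public class PostTokenizerTransform {
>   public static Analyzer analyzer() {
>     return new Analyzer() {
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName) {
>         // Dictionary-based CJ segmentation happens here, on the raw input...
>         Tokenizer tokenizer = new ICUTokenizer();
>         // ...and the transliteration is applied only to the tokens it emits.
>         TokenStream stream = new ICUTransformFilter(
>             tokenizer, Transliterator.getInstance("Traditional-Simplified"));
>         return new TokenStreamComponents(tokenizer, stream);
>       }
>     };
>   }
> }
> {code}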
> The potential usefulness of a CharFilter that exposes the ICU Transliteration API was suggested in a [thread on the Solr mailing list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E], and my hope is that this issue can facilitate more detailed discussion of the proposed addition.
> Concrete examples of simplified-only, traditional-only, and mixed character sequences that are currently tokenized differently by the ICUTokenizer:
>  * 红楼梦 (SSS)
>  * 紅樓夢 (TTT)
>  * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are included in the [CJ dictionary that backs ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt], but the last (a mixture of traditional and simplified characters) is absent, and is therefore not recognized as a token. Even _if_ we assume this is an intentional omission from the dictionary, yielding behavior that could be desirable for some use cases, there are surely other use cases that would benefit from a more permissive dictionary-based tokenization strategy, such as the one pre-tokenizer transliteration could support.
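> To see why pre-tokenizer transliteration would help, plain ICU4J (no Lucene involved) already folds all three sequences above onto the simplified-only form that _is_ in cjdict:
> {code:java}
> import com.ibm.icu.text.Transliterator;
>
> public class TransliterateDemo {
>   public static void main(String[] args) {
>     // Built-in ICU system transliterator; maps traditional Han characters to simplified.
>     Transliterator t2s = Transliterator.getInstance("Traditional-Simplified");
>     System.out.println(t2s.transliterate("红楼梦")); // 红楼梦 (SSS, unchanged)
>     System.out.println(t2s.transliterate("紅樓夢")); // 红楼梦 (TTT -> SSS)
>     System.out.println(t2s.transliterate("紅楼夢")); // 红楼梦 (TST -> SSS)
>   }
> }
> {code}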


