Posted to issues@lucene.apache.org by "Tomoko Uchida (Jira)" <ji...@apache.org> on 2020/06/20 06:11:00 UTC
[jira] [Commented] (LUCENE-9413) Add a char filter corresponding to CJKWidthFilter
[ https://issues.apache.org/jira/browse/LUCENE-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140976#comment-17140976 ]
Tomoko Uchida commented on LUCENE-9413:
---------------------------------------
I cannot take time to work on this soon, but wanted to log it as an issue... comments and thoughts are welcome.
> Add a char filter corresponding to CJKWidthFilter
> -------------------------------------------------
>
> Key: LUCENE-9413
> URL: https://issues.apache.org/jira/browse/LUCENE-9413
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Tomoko Uchida
> Priority: Minor
>
> In association with issues in Elasticsearch ([https://github.com/elastic/elasticsearch/issues/58384] and [https://github.com/elastic/elasticsearch/issues/58385]), it might be useful for the default Japanese analyzer.
> Although I don't think it's a bug to not normalize FULL- and HALF-width characters before tokenization, the behaviour sometimes confuses beginners or users with limited knowledge of Japanese analysis (and Unicode).
> If we had a FULL- and HALF-width character normalization filter in {{analyzers-common}}, we could include it in JapaneseAnalyzer (currently, JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization, so some FULL-width numbers or alphabetic runs are split by the tokenizer).
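For illustration of the normalization the description refers to: the core of full-width-to-half-width folding for ASCII variants is a fixed code-point shift (U+FF01..U+FF5E map to U+0021..U+007E by subtracting 0xFEE0, and the ideographic space U+3000 maps to a plain space). The sketch below is plain Java showing only that mapping; it is not the actual Lucene CharFilter API, and a real char filter would also need to handle half-width katakana and offset correction. The class name `WidthNormalizer` is hypothetical.

```java
// Minimal sketch (not Lucene code): fold full-width ASCII variants
// to their half-width counterparts, as a char filter applied before
// tokenization could do. Half-width katakana handling is omitted.
public class WidthNormalizer {
    public static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                // Full-width ASCII variant: shift down by 0xFEE0.
                sb.append((char) (c - 0xFEE0));
            } else if (c == '\u3000') {
                // Ideographic space becomes a plain space.
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Full-width "ＡＢＣ１２３" folds to "ABC123".
        System.out.println(normalize("ＡＢＣ１２３"));
    }
}
```

Running this mapping before the tokenizer sees the text is what keeps a full-width run like "ＡＢＣ１２３" from being split; applying the same fold as a token filter (as CJKWidthFilter does today) happens too late for the tokenizer's segmentation.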
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org