Posted to issues@lucene.apache.org by "Jim Ferenczi (Jira)" <ji...@apache.org> on 2020/06/03 19:39:00 UTC
[jira] [Created] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
Jim Ferenczi created LUCENE-9390:
------------------------------------
Summary: Kuromoji tokenizer discards tokens if they start with a punctuation character
Key: LUCENE-9390
URL: https://issues.apache.org/jira/browse/LUCENE-9390
Project: Lucene - Core
Issue Type: Improvement
Reporter: Jim Ferenczi
This issue was first raised in Elasticsearch.
The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation with other characters. For instance, the following entry:
(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ
can be found in the Noun.csv file.
Today, tokens that start with a punctuation character are removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?
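
To make the difference between the two policies concrete, here is a minimal, standalone sketch. It is not the Lucene implementation: the class PunctuationCheck and its methods are hypothetical names introduced for illustration, and the set of Unicode categories treated as punctuation is a simplifying assumption.

```java
import java.util.List;

// Hypothetical helper, NOT part of Lucene. It only contrasts two discard policies.
final class PunctuationCheck {

  // Assumption: a code point counts as punctuation if it falls into one of
  // these Unicode general categories (separators, punctuation, symbols).
  static boolean isPunctuation(int codePoint) {
    switch (Character.getType(codePoint)) {
      case Character.SPACE_SEPARATOR:
      case Character.LINE_SEPARATOR:
      case Character.PARAGRAPH_SEPARATOR:
      case Character.CONTROL:
      case Character.FORMAT:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.MATH_SYMBOL:
      case Character.CURRENCY_SYMBOL:
      case Character.MODIFIER_SYMBOL:
      case Character.OTHER_SYMBOL:
        return true;
      default:
        return false;
    }
  }

  // Policy described in the issue: look only at the first character.
  static boolean startsWithPunctuation(String token) {
    return !token.isEmpty() && isPunctuation(token.codePointAt(0));
  }

  // Suggested alternative: discard only if every character is punctuation.
  static boolean isAllPunctuation(String token) {
    return !token.isEmpty()
        && token.codePoints().allMatch(PunctuationCheck::isPunctuation);
  }

  public static void main(String[] args) {
    for (String token : List.of("(株)", "、", "東京")) {
      System.out.println(token
          + " firstCharOnly=" + startsWithPunctuation(token)
          + " entireToken=" + isAllPunctuation(token));
    }
  }
}
```

With these definitions, the dictionary entry (株) is flagged by the first-character check but not by the whole-token check, so only the former would discard it.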