Posted to issues@lucene.apache.org by "Jim Ferenczi (Jira)" <ji...@apache.org> on 2020/06/03 19:39:00 UTC

[jira] [Created] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

Jim Ferenczi created LUCENE-9390:
------------------------------------

             Summary: Kuromoji tokenizer discards tokens if they start with a punctuation character
                 Key: LUCENE-9390
                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Jim Ferenczi


This issue was first raised in Elasticsearch.

The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation with other characters. For instance, the following entry:

(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ

can be found in the Noun.csv file.

Today, tokens that start with a punctuation character are removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?
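
To make the difference concrete, here is a minimal standalone sketch that compares the two checks on the dictionary entry above. It is not the tokenizer's actual code: the class and method names (PunctuationCheckSketch, isPunctuation, startsWithPunctuation, isAllPunctuation) are hypothetical, and the set of Unicode categories treated as punctuation is an assumption made for illustration only.

```java
// Standalone sketch; does not depend on Lucene.
// All names here are hypothetical and only illustrate the idea in this issue.
class PunctuationCheckSketch {

    // Assumed per-character test: treats Unicode punctuation, symbol,
    // separator, control and format categories as "punctuation".
    static boolean isPunctuation(char ch) {
        switch (Character.getType(ch)) {
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Behaviour described in this issue: only the first character is inspected.
    static boolean startsWithPunctuation(String token) {
        return !token.isEmpty() && isPunctuation(token.charAt(0));
    }

    // Suggested alternative: discard only if every character is punctuation.
    static boolean isAllPunctuation(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!isPunctuation(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String mixed = "(\u682a)";          // the "(株)" entry from Noun.csv
        String punctOnly = "\u3001\u3002";  // "、。", punctuation only

        System.out.println(startsWithPunctuation(mixed));
        System.out.println(isAllPunctuation(mixed));
        System.out.println(isAllPunctuation(punctOnly));
    }
}
```

With the first-character check, the "(株)" entry is classified as punctuation and would be dropped. With the whole-token check it is kept, while a token made only of punctuation is still discarded.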
