Posted to issues@lucene.apache.org by "Jim Ferenczi (Jira)" <ji...@apache.org> on 2020/06/03 19:40:00 UTC

[jira] [Updated] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

     [ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Ferenczi updated LUCENE-9390:
---------------------------------
    Description: 
This issue was first raised in Elasticsearch [here|https://github.com/elastic/elasticsearch/issues/57614]

The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation and other characters. For instance, the following entry:

_(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_

can be found in the Noun.csv file.

Today, tokens that start with a punctuation character are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?
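To illustrate the proposed change, here is a minimal, self-contained sketch. The isPunctuation check mirrors the kind of Unicode-category test the Kuromoji tokenizer uses internally; the class name PunctuationCheck and the isAllPunctuation helper are hypothetical, introduced only to contrast "discard if the first character is punctuation" with "discard only if the whole token is punctuation":

```java
public class PunctuationCheck {

  // Character-level check, modeled on the category-based test
  // used inside the Kuromoji tokenizer (a sketch, not the exact code).
  static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.SPACE_SEPARATOR:
      case Character.LINE_SEPARATOR:
      case Character.PARAGRAPH_SEPARATOR:
      case Character.CONTROL:
      case Character.FORMAT:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
      case Character.MATH_SYMBOL:
      case Character.CURRENCY_SYMBOL:
      case Character.MODIFIER_SYMBOL:
      case Character.OTHER_SYMBOL:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  // Proposed whole-token check: only discard a token when every
  // character in it is punctuation.
  static boolean isAllPunctuation(String token) {
    for (int i = 0; i < token.length(); i++) {
      if (!isPunctuation(token.charAt(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // First-character check would wrongly discard the dictionary entry "(株)":
    System.out.println(isPunctuation("(株)".charAt(0))); // true
    // Whole-token check keeps it, because 株 is a letter:
    System.out.println(isAllPunctuation("(株)"));        // false
    // A pure punctuation token would still be discarded:
    System.out.println(isAllPunctuation("()"));           // true
  }
}
```

With this check, an entry like (株) from Noun.csv would survive tokenization even when discardPunctuation is enabled, while tokens made up entirely of punctuation would still be dropped.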

  was:
This issue was first raised in Elasticsearch here.

The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation and other characters. For instance, the following entry:

_(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_

can be found in the Noun.csv file.

Today, tokens that start with a punctuation character are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?



> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-9390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This issue was first raised in Elasticsearch [here|https://github.com/elastic/elasticsearch/issues/57614]
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation and other characters. For instance, the following entry:
> _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with a punctuation character are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org