Posted to issues@lucene.apache.org by "Tomoko Uchida (Jira)" <ji...@apache.org> on 2020/06/04 04:55:00 UTC

[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

    [ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125542#comment-17125542 ] 

Tomoko Uchida edited comment on LUCENE-9390 at 6/4/20, 4:54 AM:
----------------------------------------------------------------

Personally, I usually set the "discardPunctuation" flag to false to avoid such subtle situations.

As a possible solution, instead of the "discardPunctuation" flag, we could add a token filter that discards all tokens composed only of punctuation characters after tokenization (just like a stop filter). To me, that is a token filter's job rather than a tokenizer's...
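
To make the proposal concrete, here is a minimal sketch of such a post-tokenization filter in plain Java. It deliberately avoids the Lucene API: in Lucene this would presumably be a TokenFilter subclass, whereas the class and method names below are hypothetical. Treating "punctuation" as the Unicode punctuation general categories is also an assumption, not necessarily what the tokenizer uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class PunctuationOnlyFilter {

    // Assumption: "punctuation" means the Unicode punctuation general categories.
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // True only if every code point of a non-empty token is punctuation.
    static boolean isPunctuationOnly(String token) {
        return !token.isEmpty()
                && token.codePoints().allMatch(PunctuationOnlyFilter::isPunctuation);
    }

    // Drops punctuation-only tokens and keeps everything else, in order,
    // much like a stop filter drops stop words.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isPunctuationOnly(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With this approach a token such as （株） survives because it contains a non-punctuation character, while tokens such as 、 or 。 are dropped.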


> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-9390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This issue was first raised in Elasticsearch [here|https://github.com/elastic/elasticsearch/issues/57614]
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuation and other characters. For instance, the following entry:
> _（株）,1285,1285,3690,名詞,一般,*,*,*,*,（株）,カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with punctuation are removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuation to be separated from normal tokens, but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuation?
>  
>  
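
The difference between the two checks discussed in the issue description can be illustrated with a small sketch. This is not the tokenizer's actual code: the class and method names are hypothetical, and the punctuation test is an assumed simplification based on a few Unicode general categories.

```java
// Hypothetical illustration, not taken from the Lucene source.
final class DiscardCheck {

    // Assumption: a simplified punctuation test covering common categories.
    static boolean isPunct(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.START_PUNCTUATION
                || type == Character.END_PUNCTUATION
                || type == Character.OTHER_PUNCTUATION
                || type == Character.DASH_PUNCTUATION;
    }

    // Behavior described in the issue: only the first character is inspected.
    static boolean discardByFirstChar(String token) {
        return !token.isEmpty() && isPunct(token.codePointAt(0));
    }

    // Suggested alternative: discard only if the whole token is punctuation.
    static boolean discardByWholeToken(String token) {
        return !token.isEmpty() && token.codePoints().allMatch(DiscardCheck::isPunct);
    }
}
```

For the dictionary entry （株）, the first-character check says "discard" because the token starts with a full-width parenthesis, while the whole-token check keeps it because 株 is not punctuation.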



