Posted to dev@lucene.apache.org by "Jim Ferenczi (Jira)" <ji...@apache.org> on 2019/09/05 09:05:00 UTC

[jira] [Created] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

Jim Ferenczi created LUCENE-8966:
------------------------------------

             Summary: KoreanTokenizer should split unknown words on digits
                 Key: LUCENE-8966
                 URL: https://issues.apache.org/jira/browse/LUCENE-8966
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Jim Ferenczi


Since LUCENE-XXX, the Korean tokenizer groups the characters of an unknown word if they belong to the same script or to an inherited one. This is fine for inputs like Мoscow (a Cyrillic М followed by Latin letters), but the rule does not work well for digits, since they are considered common to all scripts. For instance, the input "44사이즈" is kept as a single token even though "사이즈" is in the dictionary. We should restore the original behavior and split any unknown word where a digit is followed by a character of another type.

This issue was first discovered in [https://github.com/elastic/elasticsearch/issues/46365]
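
To illustrate the grouping rule and the proposed change, here is a minimal, self-contained sketch. It is not the KoreanTokenizer code: the class UnknownWordSplitSketch, the method group and the flag splitAfterDigits are hypothetical names used only for this example, and it ignores the dictionary, whitespace and everything else the real tokenizer handles. It only assumes that digits report the COMMON script through java.lang.Character.UnicodeScript and reads "a digit followed by another type" as "a digit followed by a non-digit".

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Lucene implementation.
class UnknownWordSplitSketch {

    // COMMON and INHERITED characters do not force a script boundary.
    static boolean isCommonOrInherited(UnicodeScript script) {
        return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
    }

    // Groups the code points of an unknown word by script. When
    // splitAfterDigits is true, a group also ends where a digit is
    // followed by a non-digit (an assumption about the proposed rule).
    static List<String> group(String text, boolean splitAfterDigits) {
        List<String> groups = new ArrayList<>();
        int start = 0;                    // start offset of the current group
        int prev = -1;                    // previous code point, -1 at the beginning
        UnicodeScript groupScript = null; // last specific script seen in the current group

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = isCommonOrInherited(script);

            boolean scriptBreak = !neutral && groupScript != null && script != groupScript;
            boolean digitBreak = splitAfterDigits && prev != -1
                && Character.isDigit(prev) && !Character.isDigit(cp);

            if (scriptBreak || digitBreak) {
                groups.add(text.substring(start, i));
                start = i;
                groupScript = null;
            }
            if (!neutral) {
                groupScript = script;
            }
            prev = cp;
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            groups.add(text.substring(start));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Script-only grouping: the digits are COMMON, so they attach to the Hangul run.
        System.out.println(group("44사이즈", false)); // [44사이즈]
        // With the digit rule, the digits are split off from the Hangul run.
        System.out.println(group("44사이즈", true));  // [44, 사이즈]
    }
}
```

With the flag off, "44사이즈" stays in one group, which is the behavior reported above. With the flag on, "사이즈" becomes its own group and could then be matched against the dictionary.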


