Posted to issues@lucene.apache.org by "Namgyu Kim (Jira)" <ji...@apache.org> on 2019/09/18 17:53:00 UTC
[jira] [Commented] (LUCENE-8977) Handle punctuation characters in
KoreanTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-8977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932713#comment-16932713 ]
Namgyu Kim commented on LUCENE-8977:
------------------------------------
Sorry for the late reply, [~jim.ferenczi] :(
First, I'll change this issue's type from Bug to Improvement, because it is debatable whether this is really a bug.
{quote}I wonder why you think that this is an issue. Punctuations are removed by default so this is only an issue if you want to use the Korean number filter ?
{quote}
As you said, the main use case is KoreanNumberFilter.
However, users can also simply set the discardPunctuation option of KoreanTokenizer to false without using KoreanNumberFilter at all:
{code:java}
Analyzer myAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Last two arguments: outputUnknownUnigrams = false, discardPunctuation = false
    Tokenizer tokenizer = new KoreanTokenizer(newAttributeFactory(), userDictionary, DecompoundMode.NONE, false, false);
    return new TokenStreamComponents(tokenizer, tokenizer);
  }
};
{code}
With discardPunctuation set to false, users may find the following result strange (at least I do).
Example:
Input : ...사이즈...
Expect1 : [.][..][사이즈][.][..]
Expect2 : [...][사이즈][...]
Result : [...][사이즈][.][..]
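
To make the second expected output ([...][사이즈][...]) concrete, here is a minimal sketch of the merging behavior as a post-processing step over plain token strings. It is only an illustration: PunctuationMerger and its methods are hypothetical names, not part of Lucene or Nori, and it assumes that adjacent punctuation tokens were also adjacent in the input text (no whitespace between them), which a real fix inside the tokenizer would have to check using offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Lucene class): merges runs of punctuation-only tokens. */
final class PunctuationMerger {

    /** Returns true if the code point belongs to one of the Unicode punctuation categories. */
    static boolean isPunctuationCodePoint(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Returns true if the token is non-empty and consists only of punctuation. */
    static boolean isPunctuation(String token) {
        return !token.isEmpty()
            && token.codePoints().allMatch(PunctuationMerger::isPunctuationCodePoint);
    }

    /** Concatenates each run of consecutive punctuation tokens into a single token. */
    static List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String token : tokens) {
            if (isPunctuation(token)) {
                run.append(token);           // extend the current punctuation run
            } else {
                if (run.length() > 0) {      // flush the pending run before a word token
                    out.add(run.toString());
                    run.setLength(0);
                }
                out.add(token);
            }
        }
        if (run.length() > 0) {              // flush a trailing run
            out.add(run.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Token strings taken from the "Result" line above.
        System.out.println(merge(Arrays.asList("...", "사이즈", ".", "..")));
    }
}
```

Applied to the current result [...][사이즈][.][..], this yields three tokens, with the trailing [.] and [..] joined into one [...] token.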
What do you think about this?
> Handle punctuation characters in KoreanTokenizer
> ------------------------------------------------
>
> Key: LUCENE-8977
> URL: https://issues.apache.org/jira/browse/LUCENE-8977
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Namgyu Kim
> Priority: Minor
>
> As we discussed on LUCENE-8966, when there are consecutive punctuation marks, KoreanTokenizer currently always splits them into the first mark and the rest.
> (사이즈.... => [사이즈] [.] [...])
> But KoreanTokenizer doesn't split them when the first character is punctuation.
> (...사이즈 => [...] [사이즈])
> This looks like a result of the Viterbi path, but users may find the following cases strange:
> ("사이즈" means "size" in Korean)
> ||Case #1||Case #2||
> |Input : "...사이즈..."|Input : "...4......4사이즈"|
> |Result : [...] [사이즈] [.] [..]|Result : [...] [4] [.] [.....] [4] [사이즈]|
> From what I checked, Nori has punctuation characters (such as . and ,) in its dictionary, but Kuromoji does not.
> ("サイズ" means "size" in Japanese)
> ||Case #1||Case #2||
> |Input : "...サイズ..."|Input : "...4......4サイズ"|
> |Result : [...] [サイズ] [...]|Result : [...] [4] [......] [4] [サイズ]|
> There are some ways to resolve this, such as hard-coding the handling of punctuation, but that does not seem like a good approach.
> So I think we need to discuss it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)