Posted to issues@lucene.apache.org by "Uihyun Kim (Jira)" <ji...@apache.org> on 2022/03/17 23:28:00 UTC

[jira] [Commented] (LUCENE-10416) Update Korean Dictionary for Nori

    [ https://issues.apache.org/jira/browse/LUCENE-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508465#comment-17508465 ] 

Uihyun Kim commented on LUCENE-10416:
-------------------------------------

[~tomoko] [~uschindler] Thank you for reviewing this. The update should give more precise output from the Korean tokenizer. The overall output should stay the same, since the word costs used for the FST aren't changed. But to adopt this change fully and avoid any potential issues, a full reindex (or building a new index) is recommended. It seems reasonable to treat this as a major change.
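
To illustrate why reindexing is the safe choice, here is a minimal sketch in plain Java (no Lucene classes). It assumes a simple all-terms-must-match query, and the two segmentations are hypothetical, chosen only to show what could happen if the same text were tokenized differently at index time and at query time. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReindexSketch {
    /** Returns true if every query term is present among the indexed terms (a simple AND match). */
    public static boolean matches(List<String> indexedTerms, List<String> queryTerms) {
        Set<String> indexed = new HashSet<>(indexedTerms);
        return indexed.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Hypothetical segmentations of the same text under two dictionary versions.
        List<String> oldTokens = Arrays.asList("도서", "관");
        List<String> newTokens = Arrays.asList("도서관");

        // Index built with the old dictionary, query analyzed with the new one.
        System.out.println(matches(oldTokens, newTokens));
        // Index rebuilt with the new dictionary, query analyzed with the new one.
        System.out.println(matches(newTokens, newTokens));
    }
}
```

If the two segmentations differ, the first check fails and the second succeeds, which is the mismatch a full reindex avoids.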

> Update Korean Dictionary for Nori
> ---------------------------------
>
>                 Key: LUCENE-10416
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10416
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Uihyun Kim
>            Priority: Minor
>             Fix For: 10.0 (main)
>
>         Attachments: LUCENE-10416.patch
>
>
> Nori, the Korean analyzer, uses a Korean dictionary named mecab-ko-dic, which is available under an Apache license here: [https://bitbucket.org/eunjeon/mecab-ko-dic]
>  
> The copy of the dictionary in Nori hasn't been updated, although upstream has released updates that provide better analysis results. Downloads are available here: [https://bitbucket.org/eunjeon/mecab-ko-dic/downloads]
>  * Currently used in Nori: [https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz]
>  * Latest: [https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz]
>  
> The following changes were made between the currently used version and the latest release (change log: [https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/CHANGES.md]):
>  * New feature: added semantic class for NNG - 장소, 행위, 상태변화, 정적상태
>  * Fix: correct unexpectedly huge cost on NNG/장소
>  * New words
>  
> Running :lucene:analysis:nori:test and building a new binary both complete without issues.
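
As a rough illustration of why an unexpectedly huge word cost matters, here is a minimal sketch in plain Java of minimum-cost segmentation over a dictionary. It is not Nori's implementation: it ignores connection costs between adjacent words and uses a plain map instead of an FST, and the words and cost values are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CostSketch {
    /**
     * Returns the segmentation of text into dictionary words with the lowest
     * total word cost, or an empty list if the text cannot be covered.
     */
    public static List<String> segment(String text, Map<String, Integer> wordCost) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: minimal total cost to cover text[0, i)
        int[] prev = new int[n + 1];   // prev[i]: start offset of the last word on that path
        Arrays.fill(best, Long.MAX_VALUE);
        Arrays.fill(prev, -1);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer cost = wordCost.get(text.substring(start, end));
                if (cost == null || best[start] == Long.MAX_VALUE) {
                    continue; // not a dictionary word, or the prefix is unreachable
                }
                long total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    prev[end] = start;
                }
            }
        }

        List<String> words = new ArrayList<>();
        if (best[n] == Long.MAX_VALUE) {
            return words; // no segmentation covers the whole text
        }
        // Walk the back-pointers from the end of the text to recover the path.
        for (int i = n; i > 0; i = prev[i]) {
            words.add(text.substring(prev[i], i));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("도서", 500);
        dict.put("관", 500);

        dict.put("도서관", 10000); // made-up inflated cost
        System.out.println(segment("도서관", dict));

        dict.put("도서관", 800);   // made-up corrected cost
        System.out.println(segment("도서관", dict));
    }
}
```

With the inflated cost, the two shorter words are cheaper in total and the compound entry is never chosen; with the lower cost, the single entry wins.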


