Posted to issues@lucene.apache.org by "Uwe Schindler (Jira)" <ji...@apache.org> on 2022/02/20 13:20:00 UTC

[jira] [Comment Edited] (LUCENE-10416) Update Korean Dictionary for Nori

    [ https://issues.apache.org/jira/browse/LUCENE-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495157#comment-17495157 ] 

Uwe Schindler edited comment on LUCENE-10416 at 2/20/22, 1:19 PM:
------------------------------------------------------------------

I have one question: if you have already indexed text with the Korean analyzer, do you need to reindex, or is it "mostly fine"?

The problem is that if tokens are generated with different rules or normalization, you won't find them in the index anymore.

In older Lucene versions we had the "matchVersion" parameter for this, but that would require shipping both dictionaries.

If there are significant changes, we should ship this only in 10.0, not in 9.1.
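
To make the concern concrete, here is a minimal, self-contained sketch. It does not use Nori or any Lucene API: the greedy longest-match tokenizer, the two toy dictionaries, and the romanized strings standing in for Korean text are all assumptions made for illustration only. It shows how a term indexed under an old dictionary can become unreachable once the query side segments the same text with a newer dictionary.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryChangeDemo {

    // Toy greedy longest-match segmentation (hypothetical stand-in for a
    // dictionary-based tokenizer; this is NOT how Nori actually works).
    static List<String> tokenize(String text, Set<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos + 1; // fall back to a single character if nothing matches
            for (int candidate = text.length(); candidate > pos; candidate--) {
                if (dictionary.contains(text.substring(pos, candidate))) {
                    end = candidate;
                    break;
                }
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    // True only if every query term exists among the indexed terms.
    static boolean allTermsFound(Set<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Hypothetical dictionaries: the "new" one adds a compound entry.
        Set<String> oldDict = Set.of("seoul", "sicheong");
        Set<String> newDict = Set.of("seoul", "sicheong", "seoulsicheong");

        String text = "seoulsicheong";

        // The index was built with the old dictionary.
        Set<String> indexedTerms = new HashSet<>(tokenize(text, oldDict));

        // After an upgrade, queries are analyzed with the new dictionary.
        List<String> queryTerms = tokenize(text, newDict);

        System.out.println("indexed terms: " + indexedTerms);
        System.out.println("query terms:   " + queryTerms);
        System.out.println("query matches: " + allTermsFound(indexedTerms, queryTerms));
    }
}
```

In this toy setup the old dictionary splits the text into two terms while the new one keeps it as a single compound, so the query term is not in the index. Whether the actual dictionary update changes segmentation in this way would have to be checked against real analyzer output.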



> Update Korean Dictionary for Nori
> ---------------------------------
>
>                 Key: LUCENE-10416
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10416
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Uihyun Kim
>            Priority: Minor
>             Fix For: 9.1, 10.0 (main)
>
>         Attachments: LUCENE-10416.patch
>
>
> Nori, the Korean analyzer, uses a Korean dictionary named mecab-ko-dic, which is available under an Apache license here: [https://bitbucket.org/eunjeon/mecab-ko-dic]
>  
> The copy of the dictionary used by Nori has not been updated, although newer releases provide better analysis results. Downloads are available here: [https://bitbucket.org/eunjeon/mecab-ko-dic/downloads]
>  * Currently used in Nori: [https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz]
>  * Latest: [https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz]
>  
> There are changes between the currently used version and the latest release (change log: [https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/CHANGES.md]):
>  * New feature: added semantic class for NNG - 장소, 행위, 상태변화, 정적상태
>  * Fix: correct unexpectedly huge cost on NNG/장소
>  * New words
>  
> Running :lucene:analysis:nori:test and building a new binary both complete without issues.


