Posted to issues@solr.apache.org by GitBox <gi...@apache.org> on 2021/08/24 13:16:21 UTC

[GitHub] [solr] mocobeta commented on a change in pull request #270: SOLR-12255: Add docs for Nori Korean tokenizer

mocobeta commented on a change in pull request #270:
URL: https://github.com/apache/solr/pull/270#discussion_r694839187



##########
File path: solr/solr-ref-guide/src/language-analysis.adoc
##########
@@ -2419,6 +2422,130 @@ Example:
 ====
 --
 
+=== Korean
+
+The Korean (nori) analyzer integrates Lucene's nori analysis module into Solr.
+It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic] dictionary to perform morphological analysis of Korean texts.
+
+The dictionary was built with http://taku910.github.io/mecab/[MeCab] and defines a format for the features adapted for the Korean language.
+
+Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings without a need to specify weights.
+
+*Example*:
+
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-lang-korean]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer name="korean" decompoundMode="discard" outputUnknownUnigrams="false"/>
+    <filter name="koreanPartOfSpeechStop" />
+    <filter name="koreanReadingForm" />
+    <filter name="lowercase" />
+  </analyzer>
+</fieldType>
+----
+====
+
+[example.tab-pane#byclass-lang-korean]
+====
+[.tab-label]*With class name (legacy)*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
+    <filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
+    <filter class="solr.KoreanReadingFormFilterFactory" />
+    <filter class="solr.LowerCaseFilterFactory" />
+  </analyzer>
+</fieldType>
+----
+====
+--
+
+
+==== Korean Tokenizer
+
+*Factory class*: `solr.KoreanTokenizerFactory`
+
+*SPI name*: `korean`
+
+*Arguments*:
+
+`userDictionary`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.
+
+`userDictionaryEncoding`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Character encoding of the user dictionary.
+
+`decompoundMode`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `discard`
+|===
++
+Defines how to handle compound tokens. The options are:
+
+* `none`: No decomposition for tokens.
+* `discard`: Tokens are decomposed and the original form is discarded.
+* `mixed`: Tokens are decomposed and the original form is retained.
+
+`outputUnknownUnigrams`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `true`

Review comment:
       I think the default value is `false`, not `true`? See the factory source:
   https://github.com/apache/lucene/blob/83ba5d859c377c6882947253ce0c6435153a1139/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizerFactory.java#L96
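
For context: both example field types in this hunk set `outputUnknownUnigrams="false"` explicitly, so the documented default only matters when the attribute is left out. A minimal sketch of such a declaration follows (an illustration only, not part of the PR; the value that applies is whatever the linked factory code uses as its default):

```xml
<!-- Illustrative only: outputUnknownUnigrams is omitted here, so the
     factory default applies instead of an explicit value. -->
<tokenizer name="korean" decompoundMode="discard"/>
```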




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


