Posted to commits@lucene.apache.org by sa...@apache.org on 2017/05/26 22:05:49 UTC

[1/3] lucene-solr:branch_6_6: SOLR-10758: fix broken internal link to new HMM Chinese Tokenizer section

Repository: lucene-solr
Updated Branches:
  refs/heads/branch_6_6 732e8331c -> 430f6c9be


SOLR-10758: fix broken internal link to new HMM Chinese Tokenizer section


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/430f6c9b
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/430f6c9b
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/430f6c9b

Branch: refs/heads/branch_6_6
Commit: 430f6c9be2fdf73982d67b1c4a5ed69bcfd21de1
Parents: 82b5350
Author: Steve Rowe <sa...@gmail.com>
Authored: Fri May 26 16:57:53 2017 -0400
Committer: Steve Rowe <sa...@gmail.com>
Committed: Fri May 26 18:05:27 2017 -0400

----------------------------------------------------------------------
 solr/solr-ref-guide/src/language-analysis.adoc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/430f6c9b/solr/solr-ref-guide/src/language-analysis.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
index c82cd61..11b0b78 100644
--- a/solr/solr-ref-guide/src/language-analysis.adoc
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -565,7 +565,7 @@ See the example under <<LanguageAnalysis-TraditionalChinese,Traditional Chinese>
 [[LanguageAnalysis-SimplifiedChinese]]
 === Simplified Chinese
 
-For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the <<LanguageAnalysis-HMMChineseTokenizerFactory,HMM Chinese Tokenizer`>>. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<lib-directives-in-solrconfig.adoc#lib-directives-in-solrconfig,Lib Directives in SolrConfig>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add to your `SOLR_HOME/lib`.
+For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the <<LanguageAnalysis-HMMChineseTokenizer,HMM Chinese Tokenizer`>>. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<lib-directives-in-solrconfig.adoc#lib-directives-in-solrconfig,Lib Directives in SolrConfig>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add to your `SOLR_HOME/lib`.
 
 The default configuration of the <<tokenizers.adoc#Tokenizers-ICUTokenizer,ICU Tokenizer>> is also suitable for Simplified Chinese text.  It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.  To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<lib-directives-in-solrconfig.adoc#lib-directives-in-solrconfig,Lib Directives in SolrConfig>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add to your `SOLR_HOME/lib`.
 
@@ -598,6 +598,7 @@ Also useful for Chinese analysis:
 </analyzer>
 ----
 
+[[LanguageAnalysis-HMMChineseTokenizer]]
 === HMM Chinese Tokenizer
 
 For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the `solr.HMMChineseTokenizerFactory` in the `analysis-extras` contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
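
For reference, the classpath setup described in this hunk can be done with `<lib/>` directives in `solrconfig.xml`. A minimal sketch follows; the directory paths assume a stock Solr install layout rather than anything specified in this commit, and `solr/contrib/analysis-extras/README.txt` remains the authoritative list of required jars.

[source,xml]
----
<!-- Sketch only: paths assume the stock install layout; alternatively copy the
     jars into SOLR_HOME/lib as described in solr/contrib/analysis-extras/README.txt -->
<lib dir="${solr.install.dir:../../..}/contrib/analysis-extras/lib" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
----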


[3/3] lucene-solr:branch_6_6: SOLR-10758: Modernize the Solr ref guide's Chinese language analysis coverage

Posted by sa...@apache.org.
SOLR-10758: Modernize the Solr ref guide's Chinese language analysis coverage


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/e9a91805
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/e9a91805
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/e9a91805

Branch: refs/heads/branch_6_6
Commit: e9a918058b271e1ad69cf786bcc22215db26b4de
Parents: 732e833
Author: Steve Rowe <sa...@gmail.com>
Authored: Fri May 26 14:47:24 2017 -0400
Committer: Steve Rowe <sa...@gmail.com>
Committed: Fri May 26 18:05:27 2017 -0400

----------------------------------------------------------------------
 .../icu/segmentation/TestICUTokenizerCJK.java   |   9 +-
 solr/CHANGES.txt                                |   5 +
 solr/solr-ref-guide/src/language-analysis.adoc  | 120 ++++++++++++++-----
 solr/solr-ref-guide/src/tokenizers.adoc         |   2 +-
 4 files changed, 102 insertions(+), 34 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/e9a91805/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizerCJK.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizerCJK.java b/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizerCJK.java
index 96f44d6..75481f1 100644
--- a/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizerCJK.java
+++ b/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizerCJK.java
@@ -53,7 +53,14 @@ public class TestICUTokenizerCJK extends BaseTokenStreamTestCase {
         new String[] { "我", "购买", "了", "道具", "和", "服装" }
     );
   }
-  
+
+  public void testTraditionalChinese() throws Exception {
+    assertAnalyzesTo(a, "我購買了道具和服裝。",
+        new String[] { "我", "購買", "了", "道具", "和", "服裝"});
+    assertAnalyzesTo(a, "定義切分字串的基本單位是訂定分詞標準的首要工作", // From http://godel.iis.sinica.edu.tw/CKIP/paper/wordsegment_standard.pdf
+        new String[] { "定義", "切", "分", "字串", "的", "基本", "單位", "是", "訂定", "分詞", "標準", "的", "首要", "工作" });
+  }
+
   public void testChineseNumerics() throws Exception {
     assertAnalyzesTo(a, "9483", new String[] { "9483" });
     assertAnalyzesTo(a, "院內分機9483。",

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/e9a91805/solr/CHANGES.txt
----------------------------------------------------------------------
diff --git a/solr/CHANGES.txt b/solr/CHANGES.txt
index 8a34d02..cce7973 100644
--- a/solr/CHANGES.txt
+++ b/solr/CHANGES.txt
@@ -239,6 +239,11 @@ Bug Fixes
 * SOLR-10735: Windows script (solr.cmd) didn't work properly with directory containing spaces. Adding quotations
   to fix (Uwe Schindler, janhoy, Tomas Fernandez-Lobbe, Ishan Chattopadhyaya) 
 
+Ref Guide
+----------------------
+
+* SOLR-10758: Modernize the Solr ref guide's Chinese language analysis coverage. (Steve Rowe)
+
 Other Changes
 ----------------------
 

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/e9a91805/solr/solr-ref-guide/src/language-analysis.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
index 0cf8e13..c55a0cd 100644
--- a/solr/solr-ref-guide/src/language-analysis.adoc
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -378,9 +378,8 @@ These factories are each designed to work with specific languages. The languages
 * <<Brazilian Portuguese>>
 * <<Bulgarian>>
 * <<Catalan>>
-* <<Chinese>>
+* <<Traditional Chinese>>
 * <<Simplified Chinese>>
-* <<CJK>>
 * <<LanguageAnalysis-Czech,Czech>>
 * <<LanguageAnalysis-Danish,Danish>>
 
@@ -508,55 +507,112 @@ Solr can stem Catalan using the Snowball Porter Stemmer with an argument of `lan
 
 *Out:* "llengu"(1), "llengu"(2)
 
-[[LanguageAnalysis-Chinese]]
-=== Chinese
+[[LanguageAnalysis-TraditionalChinese]]
+=== Traditional Chinese
 
-<<tokenizers.adoc#Tokenizers-StandardTokenizer,`solr.StandardTokenizerFactory`>> is suitable for Traditional Chinese text.  Following the Word Break rules from the Unicode Text Segmentation algorithm, it produces one token per Chinese character.
+The default configuration of the <<tokenizers.adoc#Tokenizers-ICUTokenizer,ICU Tokenizer>> is suitable for Traditional Chinese text.  It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.  To use this tokenizer, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
 
-[[LanguageAnalysis-SimplifiedChinese]]
-=== Simplified Chinese
+<<tokenizers.adoc#Tokenizers-StandardTokenizer,Standard Tokenizer>> can also be used to tokenize Traditional Chinese text.  Following the Word Break rules from the Unicode Text Segmentation algorithm, it produces one token per Chinese character.  When combined with <<LanguageAnalysis-CJKBigramFilter,CJK Bigram Filter>>, overlapping bigrams of Chinese characters are formed.
+ 
+<<LanguageAnalysis-CJKWidthFilter,CJK Width Filter>> folds fullwidth ASCII variants into the equivalent Basic Latin forms.
 
-For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the `solr.HMMChineseTokenizerFactory` in the `analysis-extras` contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this filter, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
+*Examples:*
 
-*Factory class:* `solr.HMMChineseTokenizerFactory`
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.ICUTokenizerFactory"/>
+  <filter class="solr.CJKWidthFilterFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+</analyzer>
+----
 
-*Arguments:* None
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.CJKBigramFilterFactory"/>
+  <filter class="solr.CJKWidthFilterFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+</analyzer>
+----
 
-*Examples:*
+[[LanguageAnalysis-CJKBigramFilter]]
+=== CJK Bigram Filter
 
-To use the default setup with fallback to English Porter stemmer for English words, use:
+Forms bigrams (overlapping 2-character sequences) of CJK characters that are generated from <<tokenizers.adoc#Tokenizers-StandardTokenizer,Standard Tokenizer>> or <<tokenizers.adoc#Tokenizers-ICUTokenizer,ICU Tokenizer>>.
 
-`<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>`
+By default, all CJK characters produce bigrams, but finer grained control is available by specifying orthographic type arguments `han`, `hiragana`, `katakana`, and `hangul`.  When set to `false`, characters of the corresponding type will be passed through as unigrams, and will not be included in any bigrams.
+
+When a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the `outputUnigrams` argument to `true`.
 
-Or to configure your own analysis setup, use the `solr.HMMChineseTokenizerFactory` along with your custom filter setup.
+In all cases, all non-CJK input is passed through unmodified.
+
+*Arguments:*
+
+`han`:: (true/false) If false, Han (Chinese) characters will not form bigrams. Default is true.
+
+`hiragana`:: (true/false) If false, Hiragana (Japanese) characters will not form bigrams. Default is true.
+
+`katakana`:: (true/false) If false, Katakana (Japanese) characters will not form bigrams. Default is true.
+
+`hangul`:: (true/false) If false, Hangul (Korean) characters will not form bigrams. Default is true.
+
+`outputUnigrams`:: (true/false) If true, in addition to forming bigrams, all characters are also passed through as unigrams. Default is false.
+
+See the example under <<LanguageAnalysis-TraditionalChinese,Traditional Chinese>>.
+
+[[LanguageAnalysis-SimplifiedChinese]]
+=== Simplified Chinese
+
+For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the <<LanguageAnalysis-HMMChineseTokenizerFactory,HMM Chinese Tokenizer`>>. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<lib-directives-in-solrconfig.adoc#lib-directives-in-solrconfig,Lib Directives in SolrConfig>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add to your `SOLR_HOME/lib`.
+
+The default configuration of the <<tokenizers.adoc#Tokenizers-ICUTokenizer,ICU Tokenizer>> is also suitable for Simplified Chinese text.  It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.  To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<lib-directives-in-solrconfig.adoc#lib-directives-in-solrconfig,Lib Directives in SolrConfig>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add to your `SOLR_HOME/lib`.
+
+Also useful for Chinese analysis:
+
+<<LanguageAnalysis-CJKWidthFilter,CJK Width Filter>> folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
+
+*Examples:*
 
 [source,xml]
 ----
 <analyzer>
   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
+  <filter class="solr.CJKWidthFilterFactory"/>
   <filter class="solr.StopFilterFactory"
           words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
   <filter class="solr.PorterStemFilterFactory"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
+</analyzer>
+----
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.ICUTokenizerFactory"/>
+  <filter class="solr.CJKWidthFilterFactory"/>
+  <filter class="solr.StopFilterFactory"
+          words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
+  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 ----
 
-[[LanguageAnalysis-CJK]]
-=== CJK
+=== HMM Chinese Tokenizer
 
-This tokenizer breaks Chinese, Japanese and Korean language text into tokens. These are not whitespace delimited languages. The tokens generated by this tokenizer are "doubles", overlapping pairs of CJK characters found in the field text.
+For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the `solr.HMMChineseTokenizerFactory` in the `analysis-extras` contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
 
-*Factory class:* `solr.CJKTokenizerFactory`
+*Factory class:* `solr.HMMChineseTokenizerFactory`
 
 *Arguments:* None
 
-*Example:*
+*Examples:*
 
-[source,xml]
-----
-<analyzer type="index">
-  <tokenizer class="solr.CJKTokenizerFactory"/>
-</analyzer>
-----
+To use the default setup with fallback to English Porter stemmer for English words, use:
+
+`<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>`
+
+Or to configure your own analysis setup, use the `solr.HMMChineseTokenizerFactory` along with your custom filter setup.  See an example of this in the <<LanguageAnalysis-SimplifiedChinese,Simplified Chinese>> section. 
 
 [[LanguageAnalysis-Czech]]
 === Czech
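
The CJK Bigram Filter arguments documented in the hunk above do not appear in any of the example analyzers; a hedged sketch of per-script control combined with `outputUnigrams`, using only the arguments listed there, might look like this:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Bigram Han characters only; Hiragana, Katakana and Hangul pass through as
       unigrams, and unigrams are also emitted alongside the Han bigrams -->
  <filter class="solr.CJKBigramFilterFactory"
          han="true" hiragana="false" katakana="false" hangul="false"
          outputUnigrams="true"/>
  <filter class="solr.CJKWidthFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
----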
@@ -947,15 +1003,15 @@ Solr can stem Irish using the Snowball Porter Stemmer with an argument of `langu
 
 Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which includes several analysis components - more details on each below:
 
-* `JapaneseIterationMarkCharFilter` normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
-* `JapaneseTokenizer` tokenizes Japanese using morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
-* `JapaneseBaseFormFilter` replaces original terms with their base forms (a.k.a. lemmas).
-* `JapanesePartOfSpeechStopFilter` removes terms that have one of the configured parts-of-speech.
-* `JapaneseKatakanaStemFilter` normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
+* <<LanguageAnalysis-JapaneseIterationMarkCharFilter,`JapaneseIterationMarkCharFilter`>> normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
+* <<LanguageAnalysis-JapaneseTokenizer,`JapaneseTokenizer`>> tokenizes Japanese using morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
+* <<LanguageAnalysis-JapaneseBaseFormFilter,`JapaneseBaseFormFilter`>> replaces original terms with their base forms (a.k.a. lemmas).
+* <<LanguageAnalysis-JapanesePartOfSpeechStopFilter,`JapanesePartOfSpeechStopFilter`>> removes terms that have one of the configured parts-of-speech.
+* <<LanguageAnalysis-JapaneseKatakanaStemFilter,`JapaneseKatakanaStemFilter`>> normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
 
 Also useful for Japanese analysis, from lucene-analyzers-common:
 
-* `CJKWidthFilter` folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
+* <<LanguageAnalysis-CJKWidthFilter,`CJKWidthFilter`>> folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.
 
 [[LanguageAnalysis-JapaneseIterationMarkCharFilter]]
 ==== Japanese Iteration Mark CharFilter
@@ -1022,7 +1078,7 @@ Removes terms with one of the configured parts-of-speech. `JapaneseTokenizer` an
 
 Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.
 
-`CJKWidthFilterFactory` should be specified prior to this filter to normalize half-width katakana to full-width.
+<<LanguageAnalysis-CJKWidthFilter,`solr.CJKWidthFilterFactory`>> should be specified prior to this filter to normalize half-width katakana to full-width.
 
 *Factory class:* `JapaneseKatakanaStemFilterFactory`
 

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/e9a91805/solr/solr-ref-guide/src/tokenizers.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/tokenizers.adoc b/solr/solr-ref-guide/src/tokenizers.adoc
index 5c7a819..7a8bdeb 100644
--- a/solr/solr-ref-guide/src/tokenizers.adoc
+++ b/solr/solr-ref-guide/src/tokenizers.adoc
@@ -286,7 +286,7 @@ This tokenizer processes multilingual text and tokenizes it appropriately based
 
 You can customize this tokenizer's behavior by specifying http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[per-script rule files]. To add per-script rules, add a `rulefiles` argument, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter `Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi`.
 
-The default `solr.ICUTokenizerFactory` provides UAX#29 word break rules tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (specializing handling of double and single quotation marks), and for syllable tokenization for Khmer, Lao, and Myanmar.
+The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word break rules tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (specializing handling of double and single quotation marks), for syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.
 
 *Factory class:* `solr.ICUTokenizerFactory`
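
A sketch of a tokenizer declaration using the `rulefiles` argument described above, with the same illustrative file names from that paragraph (the `.rbbi` files are assumed to live in the collection's config directory):

[source,xml]
----
<analyzer>
  <!-- Per-script RBBI rule files: four-letter ISO 15924 script code, a colon, then a resource path -->
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----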
 


[2/3] lucene-solr:branch_6_6: SOLR-10758: Point to Lib Directives in SolrConfig page from the Traditional Chinese ICUTokenizer paragraph.

Posted by sa...@apache.org.
SOLR-10758: Point to Lib Directives in SolrConfig page from the Traditional Chinese ICUTokenizer paragraph.


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/82b53503
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/82b53503
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/82b53503

Branch: refs/heads/branch_6_6
Commit: 82b535039b730a9db0d4e3a6e308750a46e53cff
Parents: e9a9180
Author: Steve Rowe <sa...@gmail.com>
Authored: Fri May 26 16:04:22 2017 -0400
Committer: Steve Rowe <sa...@gmail.com>
Committed: Fri May 26 18:05:27 2017 -0400

----------------------------------------------------------------------
 solr/solr-ref-guide/src/language-analysis.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/82b53503/solr/solr-ref-guide/src/language-analysis.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
index c55a0cd..c82cd61 100644
--- a/solr/solr-ref-guide/src/language-analysis.adoc
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -510,7 +510,7 @@ Solr can stem Catalan using the Snowball Porter Stemmer with an argument of `lan
 [[LanguageAnalysis-TraditionalChinese]]
 === Traditional Chinese
 
-The default configuration of the <<tokenizers.adoc#Tokenizers-ICUTokenizer,ICU Tokenizer>> is suitable for Traditional Chinese text.  It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.  To use this tokenizer, see `solr/contrib/analysis-extras/README.txt` for instructions on which jars you need to add to your `solr_home/lib`.
+The default configuration of the <<tokenizers.adoc#Tokenizers-ICUTokenizer,ICU Tokenizer>> is suitable for Traditional Chinese text.  It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.  To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<lib-directives-in-solrconfig.adoc#lib-directives-in-solrconfig,Lib Directives in SolrConfig>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add to your `SOLR_HOME/lib`.
 
 <<tokenizers.adoc#Tokenizers-StandardTokenizer,Standard Tokenizer>> can also be used to tokenize Traditional Chinese text.  Following the Word Break rules from the Unicode Text Segmentation algorithm, it produces one token per Chinese character.  When combined with <<LanguageAnalysis-CJKBigramFilter,CJK Bigram Filter>>, overlapping bigrams of Chinese characters are formed.
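
Pulling this section together, a complete field type wiring the ICU-based Traditional Chinese analyzer shown earlier in the diff might look like the sketch below; the field type name is illustrative and not part of this commit.

[source,xml]
----
<fieldType name="text_zh_tw" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Dictionary-based segmentation of Chinese plus width folding and lowercasing -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
----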