Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/02/25 07:24:35 UTC

[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by RobertMuir

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by RobertMuir.
The comment on this change is: add docs for icu analysis factories.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=109&rev2=110

--------------------------------------------------

      </analyzer>
    </fieldType>
  }}}
+ 
+ === solr.ICUTokenizerFactory ===
+ <!> [[Solr3.1]] Uses [[http://site.icu-project.org/|ICU]]'s text bounds capabilities to tokenize text.
+ 
+ This tokenizer first identifies the writing system (the "Script" attribute) of each run of text within the document. Then it tokenizes
+ the text according to rules or dictionaries, depending on the writing system. For example, if it encounters
+ Thai, it applies dictionary-based segmentation to split the Thai text (Thai uses no spaces between words).
+ 
+ ||'''Input String'''||'''Output Tokens'''||'''Script Attribute'''||
+ ||Testing บริษัทชื่อ נאסק"ר||Testing<<BR>>บริษัท<<BR>>ชื่อ<<BR>>נאסק"ר||Latin<<BR>>Thai<<BR>>Thai<<BR>>Hebrew||
+ 
+ {{{
+     <fieldType name="text_icu" class="solr.TextField" autoGeneratePhraseQueries="false">
+       <analyzer>
+         <tokenizer class="solr.ICUTokenizerFactory"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ Note: to use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
  
  == TokenFilterFactories ==
  
@@ -699, +719 @@

  <<Anchor(CollationKeyFilterFactory)>>
  
  === solr.CollationKeyFilterFactory ===
- <!> [[Solr1.5]]
+ <!> [[Solr3.1]]
  
  A filter that lets one specify:
  
@@ -715, +735 @@

   1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/CollationKeyFilter.html|Lucene's CollationKeyFilter javadocs]]
   1. UnicodeCollation
  
+ === solr.ICUCollationKeyFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter works like CollationKeyFilterFactory, except that it uses ICU for collation. ICU produces smaller sort keys, computes them faster, and supports more locales. See UnicodeCollation for more information; the same concepts apply.
+ 
+ The only configuration difference is that locales should be specified to this filter with RFC 3066 locale IDs.
+ 
+ {{{
+     <fieldType name="icu_sort_en" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.KeywordTokenizerFactory"/>
+         <filter class="solr.ICUCollationKeyFilterFactory" locale="en" strength="primary"/>
+       </analyzer>
+     </fieldType>
+ }}}
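
To make the effect of strength="primary" concrete, here is a minimal sketch that uses the JDK's own java.text.Collator rather than ICU. That substitution is an assumption made so the example runs without extra jars; the class name CollationSketch and its helper method are hypothetical, and the sort keys ICU produces will differ byte for byte. The concept is the same: at primary strength, case and accent differences are ignored, and the collation key is what gets compared.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Illustration only: JDK collation, not the ICU collation used by the filter.
class CollationSketch {
    // Build an English collator that ignores accent and case differences.
    static Collator primaryEnglish() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        // Canonical decomposition makes precomposed and combining forms compare alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = primaryEnglish();

        // Primary strength: only base letters matter.
        System.out.println(collator.compare("resume", "R\u00C9SUM\u00C9") == 0); // true
        System.out.println(collator.compare("apple", "banana") < 0);             // true

        // A collation key is a precomputed, directly comparable form of the string.
        CollationKey a = collator.getCollationKey("resume");
        CollationKey b = collator.getCollationKey("Resume");
        System.out.println(a.compareTo(b) == 0);                                 // true
    }
}
```

Indexing a key instead of the raw string is what lets sorting follow the locale's rules without re-running the collator at query time.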
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUNormalizer2FilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter normalizes text to a [[http://unicode.org/reports/tr15/|Unicode Normalization Form]].
+ 
+ {{{
+     <fieldType name="normalized" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ These are the supported normalization forms: 
+ {{{
+ NFC: name="nfc" mode="compose"
+ NFD: name="nfc" mode="decompose"
+ NFKC: name="nfkc" mode="compose"
+ NFKD: name="nfkc" mode="decompose"
+ NFKC_Casefold: name="nfkc_cf" mode="compose"
+ }}}
+ 
+ NFKC_Casefold (nfkc_cf) applies the Unicode Case-Folding algorithm together with NFKC normalization. Unicode Case-Folding does more than lowercasing, e.g. it handles cases like ß/SS. Behind the scenes this is its own form (nfkc_cf): both algorithms have been recursively computed across all of Unicode offline, so it's an efficient single-pass algorithm.
+ For practical purposes this means you can use this factory with nfkc_cf as a better substitute for the combined behavior of LowerCaseFilter and NFKC normalization.
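
The difference between the forms can be seen with a small sketch based on the JDK's java.text.Normalizer. This is an assumption for illustration: the filter itself uses ICU, and the JDK has no NFKC_Casefold form, so the roughFold helper below (a hypothetical name, as is the class NormalizationSketch) only approximates case folding by upper-casing and then lower-casing after NFKC.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: JDK normalization, not the ICU Normalizer2 used by the filter.
class NormalizationSketch {
    // Apply one of the four standard Unicode normalization forms.
    static String normalize(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form);
    }

    // Rough stand-in for NFKC_Casefold: NFKC, then upper-case (which expands
    // sharp s to "SS"), then lower-case. This is NOT the real algorithm.
    static String roughFold(String s) {
        String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
        return nfkc.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // e with acute as a single code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent

        // NFC composes, NFD decomposes.
        System.out.println(normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true
        System.out.println(normalize(composed, Normalizer.Form.NFD).equals(decomposed)); // true

        // NFKC also replaces compatibility characters such as the "fi" ligature.
        System.out.println(normalize("\uFB01", Normalizer.Form.NFKC)); // fi

        // Lowercasing alone does not unify sharp s with SS; folding does.
        String lower1 = "Stra\u00DFe".toLowerCase(Locale.ROOT);
        String lower2 = "STRASSE".toLowerCase(Locale.ROOT);
        System.out.println(lower1.equals(lower2));                                       // false
        System.out.println(roughFold("Stra\u00DFe").equals(roughFold("STRASSE")));       // true
    }
}
```

The two-step helper makes two passes over the string, which is the cost the precomputed nfkc_cf form avoids.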
+ 
+ If you want to do more advanced normalization (e.g. apply a filter to work only on a subset of Unicode), see the javadocs.
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUFoldingFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter implements a custom Unicode normalization form that applies the foldings specified in [[http://www.unicode.org/reports/tr30/tr30-4.html|UTR#30]] in addition to NFKC_Casefold.
+ 
+ {{{
+     <fieldType name="folded" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUFoldingFilterFactory"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ This means NFKC normalization, Unicode case folding, and search term folding (removing accents, etc.) have been recursively computed across all of Unicode offline, so it's an efficient single pass through the string.
+ For practical purposes this means you can use this factory as a better substitute for the combined behavior of ASCIIFoldingFilter, LowerCaseFilter, and ICUNormalizer2Filter.
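
As a rough picture of what "search term folding" does to a token, the sketch below chains three separate JDK steps: compatibility decomposition, removal of combining marks, and lowercasing. It is an approximation under stated assumptions, not the UTR#30 foldings or ICU's implementation; the class FoldingSketch and the method roughFold are hypothetical names, and real folding covers many more cases than accents.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: a multi-pass approximation of accent and case folding.
class FoldingSketch {
    static String roughFold(String s) {
        // Split accented letters into base letter plus combining marks,
        // and expand compatibility characters such as ligatures.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Drop every combining mark (Unicode general category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase with a locale-independent rule set.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(roughFold("R\u00E9sum\u00E9"));     // resume
        System.out.println(roughFold("\u00C5ngstr\u00F6m"));   // angstrom
        System.out.println(roughFold("\uFB01anc\u00E9"));      // fiance
    }
}
```

Each step here walks the string again, which is the overhead the filter's single precomputed pass is meant to remove.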
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+ 
+ === solr.ICUTransformFilterFactory ===
+ <!> [[Solr3.1]]
+ 
+ This filter applies [[http://userguide.icu-project.org/transforms/general|ICU Transforms]] to text.
+ 
+ Currently the filter supports only System transforms (or compounds built from them); custom rulesets are not yet supported.
+ 
+ {{{
+     <fieldType name="transformed" class="solr.TextField">
+       <analyzer>
+         <tokenizer class="solr.StandardTokenizerFactory"/>
+         <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
+       </analyzer>
+     </fieldType>
+ }}}
+ 
+ You can see a list of the supported System transforms by going to [[http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html|this link]], clicking the drop-down, and scrolling down to System.
+ 
+ Note: to use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
+