Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/12/04 03:42:29 UTC

[Solr Wiki] Update of "UnicodeCollation" by OtisGospodnetic

The "UnicodeCollation" page has been changed by OtisGospodnetic.
The comment on this change is: Clarification FIXME for.... Robert Muir?.
http://wiki.apache.org/solr/UnicodeCollation?action=diff&rev1=1&rev2=2

--------------------------------------------------

  == Sorting text for multiple languages ==
  There are two approaches to supporting multiple languages:
  
-  * If there is a small list, consider defining collated fields for each language and using copyField.
+  * If there is a small list (FIXME: small list of Languages? Fields?), consider defining collated fields for each language and using copyField.
   * If there is a very large list, an alternative is to use the "Unicode default" collator.
  
  The Unicode default, or "ROOT" Locale, has rules that are designed to work well in general for most languages. To use it, simply define the language as the empty string.
@@ -70, +70 @@

  The example code below shows how to create a custom ruleset and dump it to a file.
  
  {{{
-     // get the default rules for germany
+     // get the default rules for Germany
      // these are called DIN 5007-1 sorting
      RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));
  
@@ -116, +116 @@

    </analyzer>
  </fieldType>
  }}}
- 
  Below is an example of what this would look like for two words that should match with this collator: Töne and toene.
  
  '''org.apache.solr.analysis.StandardTokenizerFactory'''
@@ -127, +126 @@

  ||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||
  
  
+ 
+ 
  '''org.apache.solr.analysis.CollationKeyFilterFactory   {strength=primary, custom=customRules.dat}'''
  ||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
  ||<style="text-align: center;" |1>term text ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||
@@ -134, +135 @@

  ||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
  ||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||
  
- Please note that the strange output you see from the filter is really a binary collation key encoded in a special form.
- What is important is that it is the same value for equivalent tokens as defined by that collator.
  
+ 
+ 
+ Please note that the strange output you see from the filter is really a binary collation key encoded in a special form. What is important is that it is the same value for equivalent tokens as defined by that collator.
+
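
The closing note in the diff says that equivalent tokens receive the same collation key. A minimal Java sketch of that idea follows. It is an illustration under stated assumptions, not the wiki's own code: the class name, helper names, and the single "oe" tailoring rule are hypothetical, chosen only to mirror the Töne / toene example, and it uses the JDK `java.text` collator with primary strength as in the filter configuration shown in the diff.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollationKeyDemo {

    // Hypothetical tailoring (an assumption, not taken from the diff):
    // make o + combining diaeresis sort like "oe", differing only at the tertiary level.
    static final String TAILORING = "& oe , o\u0308 & OE , O\u0308";

    // Build a German collator, append the tailoring, and compare at primary strength.
    static RuleBasedCollator buildCollator() throws ParseException {
        // Locale.GERMANY is the same locale as new Locale("de", "DE")
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.GERMANY);
        RuleBasedCollator tailored = new RuleBasedCollator(base.getRules() + TAILORING);
        tailored.setStrength(Collator.PRIMARY); // ignore case and accent differences
        return tailored;
    }

    // True when both strings produce equal collation keys under the given collator.
    static boolean sameKey(Collator collator, String a, String b) {
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return keyA.compareTo(keyB) == 0;
    }

    public static void main(String[] args) throws ParseException {
        RuleBasedCollator collator = buildCollator();
        // "T\u00f6ne" is "Töne" written with a Unicode escape
        System.out.println("Töne vs toene: " + sameKey(collator, "T\u00f6ne", "toene"));
        System.out.println("Tone vs toene: " + sameKey(collator, "Tone", "toene"));
    }
}
```

The key bytes themselves are opaque, which is why the analysis output above looks unreadable. Only equality and ordering of keys are meaningful.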