Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/12/03 05:22:37 UTC
[Solr Wiki] Update of "UnicodeCollation" by RobertMuir

The "UnicodeCollation" page has been changed by RobertMuir.
http://wiki.apache.org/solr/UnicodeCollation

--------------------------------------------------

New page:
= Unicode Collation =
<!> [[Solr1.5]]

== Overview ==
[[http://en.wikipedia.org/wiki/Unicode_collation_algorithm|Unicode Collation]] is a method to sort text in a language-sensitive way. It is primarily intended for sorting, but can also be used for advanced search purposes.

Unicode Collation in Solr is fast because all the work is done at index time. For more information, see the [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/CollationKeyFilterFactory.html|Javadocs]].

<<TableOfContents>>

== Sorting text for a specific language ==
In the example below, text will be sorted according to the default German rules provided by Java. Java selects the sorting rules for a language through a Locale.

Locales are typically defined as a combination of language and country, but you can specify just the language if you want. For example, if you specify "de" as the language, you will get sorting that works well for the German language. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for Switzerland.

You can see a list of supported Locales [[http://java.sun.com/j2se/1.5.0/docs/guide/intl/locale.doc.html#util-text|here]].

{{{
<!-- define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language="de"
        strength="primary"
    />
  </analyzer>
</fieldType>
...
<!-- define a field to store the German collated manufacturer names -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="true" stored="false" />
...
<!-- Copy the text to this field. We could create French, English, and Spanish versions too, and sort differently for different users. -->
<copyField source="manu" dest="manuGERMAN"/>
}}}
In the example above, you will notice we defined the strength as "primary". The strength of the collation determines how "picky" the sort order will be, and its exact effect depends on the language. For example, in English, "primary" strength ignores differences in case and accents.

For more information, see the [[http://java.sun.com/j2se/1.5.0/docs/api/java/text/Collator.html|Collator javadocs]].
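
To make the effect of strength concrete, here is a minimal sketch that uses the JDK's own `java.text.Collator` directly, outside of Solr. The class name `StrengthDemo` and the helper `compareAt` are made up for this illustration. It assumes the JDK's default English rules.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class: compares two strings at different collation strengths.
final class StrengthDemo {

    // Compare a and b with the JDK's English collator at the given strength.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String plain = "resume";
        String accented = "R\u00e9sum\u00e9"; // "Résumé": differs in case and accents

        // PRIMARY ignores case and accents, so the strings compare as equal.
        System.out.println("primary equal:   " + (compareAt(Collator.PRIMARY, plain, accented) == 0));
        // SECONDARY takes accents into account, so they no longer compare as equal.
        System.out.println("secondary equal: " + (compareAt(Collator.SECONDARY, plain, accented) == 0));
        // TERTIARY additionally takes case into account.
        System.out.println("tertiary equal:  " + (compareAt(Collator.TERTIARY, plain, accented) == 0));
    }
}
```

The `strength` attribute in the field type above corresponds to these constants, so a "primary" field sorts "resume" and "Résumé" as the same value.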

== Sorting text for multiple languages ==
There are two approaches to supporting multiple languages:

 * If the list of languages is small, consider defining a collated field for each language and populating it with copyField.
 * If the list of languages is very large, an alternative is to use the "Unicode default" collator.

The Unicode default, or "ROOT" Locale, has rules that are designed to work well in general for most languages. To use it, simply define the language as the empty string.

This Unicode default sort is still significantly more advanced than the standard Solr sort.

{{{
<fieldType name="collatedROOT" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        language=""
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}
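To illustrate how a ROOT collator differs from a plain string sort, here is a small sketch that uses the JDK's `Collator.getInstance(Locale.ROOT)` outside of Solr. The class `RootSortDemo` and its methods are hypothetical names for this example. The plain `String` ordering is only a rough stand-in for sorting on an uncollated field.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: contrasts String ordering with ROOT collation.
final class RootSortDemo {

    // Sort by String.compareTo, i.e. by UTF-16 code unit values.
    static List<String> naturalOrder(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sort with the JDK collator for the ROOT locale.
    static List<String> rootCollatorOrder(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collator root = Collator.getInstance(Locale.ROOT);
        copy.sort(root);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "Apple", "\u00e9clair", "banana");
        // Code unit order puts the accented word last: Apple, banana, zebra, éclair
        System.out.println(naturalOrder(words));
        // Collated order files é next to e: Apple, banana, éclair, zebra
        System.out.println(rootCollatorOrder(words));
    }
}
```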
== Sorting text with custom rules ==
For advanced usage, you can define your own set of rules that determine how the sorting takes place. It's easiest not to start from scratch, but instead to take existing rules that are close to what you want, and "tailor" or customize them.

In the example below, we create a custom ruleset for German known as DIN 5007-2. This ruleset treats German umlauts differently. For example, it treats ö as equivalent to oe.

For more information, see the [[http://java.sun.com/j2se/1.5.0/docs/api/java/text/RuleBasedCollator.html|RuleBasedCollator javadocs]].

The example code below shows how to create a custom ruleset and dump it to a file.

{{{
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.text.Collator;
    import java.text.RuleBasedCollator;
    import java.util.Locale;
    import org.apache.commons.io.IOUtils;

    // get the default rules for Germany
    // these are called DIN 5007-1 sorting
    RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new Locale("de", "DE"));

    // define some tailorings, to make it DIN 5007-2 sorting.
    // For example, this makes ö equivalent to oe
    String DIN5007_2_tailorings =
      "& ae , a\u0308 & AE , A\u0308"+
      "& oe , o\u0308 & OE , O\u0308"+
      "& ue , u\u0308 & UE , U\u0308";

    // concatenate the default rules to the tailorings, and dump it to a String
    RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
    String tailoredRules = tailoredCollator.getRules();
    // write these to a file, be sure to use UTF-8 encoding!!!
    // the try-with-resources block closes the stream once the rules are written
    try (OutputStream out = new FileOutputStream("/solr_home/conf/customRules.dat")) {
      IOUtils.write(tailoredRules, out, "UTF-8");
    }
}}}
This file of rules can now be used for custom collation in Solr.

{{{
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}
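Before deploying a custom rules file, it can help to sanity-check the tailoring in plain Java. The sketch below rebuilds the DIN 5007-2 tailoring described above and compares Töne with toene at primary strength. The class `Din5007Demo` and its methods are hypothetical names. It assumes the JDK's German rules can be re-parsed after the tailoring is appended, which is the same assumption the rule-dumping code relies on.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical demo class: checks a DIN 5007-2 style tailoring outside of Solr.
final class Din5007Demo {

    // Tailorings that place each umlaut next to its two-letter spelling.
    static final String TAILORINGS =
        "& ae , a\u0308 & AE , A\u0308"
      + "& oe , o\u0308 & OE , O\u0308"
      + "& ue , u\u0308 & UE , U\u0308";

    // The JDK's default German collator (DIN 5007-1 style), at primary strength.
    static RuleBasedCollator base() {
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.GERMANY);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // The default German rules plus the tailorings, at primary strength.
    static RuleBasedCollator tailored() {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(base().getRules() + TAILORINGS);
            collator.setStrength(Collator.PRIMARY);
            return collator;
        } catch (ParseException e) {
            throw new IllegalStateException("tailored rules could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String umlaut = "T\u00f6ne"; // "Töne"
        String spelledOut = "toene";
        // With the default rules, ö sorts like o, so the two words differ.
        System.out.println("default:  " + (base().compare(umlaut, spelledOut) == 0));
        // With the tailoring, ö sorts like oe, so the two words should match.
        System.out.println("tailored: " + (tailored().compare(umlaut, spelledOut) == 0));
    }
}
```

`Locale.GERMANY` is the same locale as `new Locale("de", "DE")`.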
== Searching ==
For advanced use cases, collation can also be used for search, on a tokenized field.

In the example below, we use the same custom German rules defined above on a tokenized field. As with a stemmer, the output tokens are not human-readable, but equivalent input tokens produce the same value and therefore match at search time.

{{{
<fieldType name="collatedCUSTOM" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
  </analyzer>
</fieldType>
}}}

Below is an example of what this would look like for two words that should match with this collator: Töne and toene.

'''org.apache.solr.analysis.StandardTokenizerFactory'''
||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
||<style="text-align: center;" |1>term text ||<class="debugdata">Töne ||<class="debugdata">toene ||
||<style="text-align: center;" |1>term type ||<class="debugdata"><ALPHANUM> ||<class="debugdata"><ALPHANUM> ||
||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||


'''org.apache.solr.analysis.CollationKeyFilterFactory   {strength=primary, custom=customRules.dat}'''
||<tablewidth="" tableclass="analysis"style="text-align: center;" |1>term position ||<class="debugdata">1 ||<class="debugdata">2 ||
||<style="text-align: center;" |1>term text ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||<class="debugdata">3䀘䀋#6;ࠂ怀#0;#0;#0; ||
||<style="text-align: center;" |1>term type ||<class="debugdata"><ALPHANUM> ||<class="debugdata"><ALPHANUM> ||
||<style="text-align: center;" |1>source start,end ||<class="debugdata">0,4 ||<class="debugdata">5,10 ||
||<style="text-align: center;" |1>payload ||<class="debugdata"> ||<class="debugdata"> ||

Please note that the strange output you see from the filter is really a binary collation key encoded in a special form.
What is important is that it is the same value for equivalent tokens as defined by that collator.
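
The same property can be observed with the JDK's `java.text.CollationKey`, sketched below for an English collator at primary strength. The class `CollationKeyDemo` and the helper `primaryKey` are hypothetical names. The bytes shown are the JDK's raw key, not necessarily the exact encoding Solr writes into the index.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class: equivalent strings yield identical collation keys.
final class CollationKeyDemo {

    // Return the binary collation key of the text at primary strength.
    static byte[] primaryKey(String text) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.getCollationKey(text).toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = primaryKey("resume");
        byte[] accented = primaryKey("R\u00e9sum\u00e9"); // "Résumé"
        byte[] other = primaryKey("resumes");

        // The key is opaque binary data, not readable text.
        System.out.println(Arrays.toString(plain));
        // Strings the collator considers equal share the same key bytes.
        System.out.println("same key:      " + Arrays.equals(plain, accented));
        // A genuinely different string gets a different key.
        System.out.println("different key: " + !Arrays.equals(plain, other));
    }
}
```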