Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/05/18 18:33:36 UTC

[Solr Wiki] Update of "LanguageAnalysis" by RobertMuir

The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: first cut at improving this documentation.
http://wiki.apache.org/solr/LanguageAnalysis

--------------------------------------------------

New page:
= Language Analysis =

== Overview ==

This page describes some of the language-specific analysis components available in Solr. These components can be used to improve search results for specific languages.

Please look at [[AnalyzersTokenizersTokenFilters|AnalyzersTokenizersTokenFilters]] for other analysis components you can use in combination with these components.

<<TableOfContents>>

=== By language ===
==== Arabic ====
Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]] stemming algorithm, and Lucene includes an example stopword list.

This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.

{{{
...
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.ArabicStemFilterFactory"/>
...
}}}

Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Brazilian Portuguese ====
Solr includes a modified version of the Snowball Portuguese algorithm for Brazilian Portuguese, and Lucene includes an example stopword list. This stemmer handles diacritical marks differently than the European Portuguese stemmer.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.BrazilianStemFilterFactory"/>
... 
}}}

Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java|stopwords]] (Look for BRAZILIAN_STOP_WORDS)

==== Bulgarian ====
<!> [[Solr3.1]]

Solr includes a light stemmer for Bulgarian, following this [[http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf|algorithm]], and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.BulgarianStemFilterFactory"/>
...
}}}

Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Chinese, Japanese, Korean ====
Lucene provides support for these languages with CJKTokenizer, which indexes bigrams and does some character folding of full-width forms.

{{{
   <tokenizer class="solr.CJKTokenizerFactory"/>
...
}}}

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words. 
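
The query-time-only setup from the note above could be sketched as follows. This is a minimal example, not a tested configuration: it assumes {{{solr.PositionFilterFactory}}} is available in your Solr version, and the field type name {{{text_cjk}}} is hypothetical.

{{{
<fieldtype name="text_cjk" class="solr.TextField">
  <!-- index time: bigrams only -->
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
  <!-- query time: same bigrams, then PositionFilter so they are not turned into a phrase query -->
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldtype>
}}}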

==== Czech ====
<!> [[Solr3.1]]

Solr includes a light stemmer for Czech, following this [[http://portal.acm.org/citation.cfm?id=1598600|algorithm]], and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CzechStemFilterFactory"/>
...
}}}

Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java|stopwords]] (Look for CZECH_STOP_WORDS)

==== Danish ====
Solr includes support for stemming Danish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Danish" />
...
}}}

Example set of Danish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Dutch ====
Solr includes two stemmers for Dutch via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Dutch" />
...
}}}

An alternative stemmer (Kraaij-Pohlmann) can be used by specifying the language as "Kp".
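
For illustration, selecting the Kraaij-Pohlmann stemmer would look roughly like this (same filter chain as above, only the language attribute changes):

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Kp" />
...
}}}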

Example set of Dutch [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== English ====
Solr includes two stemmers for English, the original Porter stemmer via {{{solr.PorterStemFilterFactory}}}, and the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, as well as an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
...
}}}

<!> Note: The standard {{{PorterStemFilterFactory}}} is significantly faster than {{{solr.SnowballPorterFilterFactory}}}.
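
If you prefer the Porter2 stemmer despite the speed difference, a sketch of the equivalent chain is shown here. The "English" value is the one listed for Porter2 in the Snowball notes further down this page.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English" />
...
}}}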

Larger example set of English [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]

==== Finnish ====
Solr includes support for stemming Finnish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Finnish" />
...
}}}

Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== French ====
Solr includes support for stemming French via {{{solr.SnowballPorterFilterFactory}}}, removing elisions via ElisionFilterFactory, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ElisionFilterFactory"/>
  <!-- do word delimiter, etc here -->
  <filter class="solr.SnowballPorterFilterFactory" language="French" />
...
}}}

Example set of French [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This will prevent very slow phrase queries.

==== German ====
Solr includes support for stemming German with three different algorithms: two via {{{solr.SnowballPorterFilterFactory}}}, and one via {{{solr.GermanStemFilterFactory}}}, and Lucene includes an example stopword list.

With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different language attributes: "German" and "German2". German2 is a modified version of German that handles the umlaut characters differently: for example, it treats "ü" as "ue" in most contexts.

The {{{solr.GermanStemFilterFactory}}} instead uses a different [[http://www.inf.fu-berlin.de/inst/pubs/tr-b-99-16.abstract.html|algorithm]].

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2" />
...
}}}
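
As a sketch of the third option, the chain below swaps in {{{solr.GermanStemFilterFactory}}}. It assumes the factory needs no attributes; which of the three stemmers works best is something to evaluate on your own data.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.GermanStemFilterFactory"/>
...
}}}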

Example set of German [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Greek ====
Solr includes support for stemming Greek following this [[http://people.dsv.su.se/~hercules/papers/Ntais_greek_stemmer_thesis_final.pdf|algorithm]] <!> [[Solr3.1]], as well as support for case/diacritics-insensitive search via {{{solr.GreekLowerCaseFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.GreekLowerCaseFilterFactory"/>
  <filter class="solr.GreekStemFilterFactory"/>
...
}}}

Example set of Greek [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory

==== Hindi ====
<!> [[Solr3.1]]

Solr includes support for stemming Hindi following this [[http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf|algorithm]], support for common spelling differences via {{{solr.HindiNormalizationFilterFactory}}} following this [[http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf|algorithm]], support for encoding differences via {{{solr.IndicNormalizationFilterFactory}}} following this [[http://ldc.upenn.edu/myl/IndianScriptsUnicode.html|algorithm]], and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.IndicNormalizationFilterFactory"/>
  <filter class="solr.HindiNormalizationFilterFactory"/>
  <filter class="solr.HindiStemFilterFactory"/>
...
}}}

Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Hungarian ====

Solr includes support for stemming Hungarian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
...
}}}

Example set of Hungarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Indonesian ====
<!> [[Solr3.1]]

Solr includes support for stemming Indonesian (Bahasa Indonesia) following this [[http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf|algorithm]], and Lucene includes an example stopword list.

You can set the stemDerivational attribute to false to only stem inflectional suffixes, for a lighter approach.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
...
}}}
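
The lighter, inflectional-only variant mentioned above might look like this. It is a sketch that assumes the factory is registered as {{{solr.IndonesianStemFilterFactory}}}:

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- stemDerivational="false": only inflectional suffixes are stemmed -->
  <filter class="solr.IndonesianStemFilterFactory" stemDerivational="false" />
...
}}}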

Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]

==== Italian ====
Solr includes support for stemming Italian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Italian" />
...
}}}

Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Norwegian ====
Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Norwegian" />
...
}}}

Example set of Norwegian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Persian / Farsi ====
Solr includes support for normalizing Persian via {{{solr.PersianNormalizationFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.ArabicNormalizationFilterFactory"/>
  <filter class="solr.PersianNormalizationFilterFactory"/>
...
}}}

Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]]

<!> Note: WordDelimiterFilter does not split on joiners by default. You can solve this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider using PositionFilter at query-time (only), as the QueryParser does not consider joiners and could create unwanted phrase queries.
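
One way to combine the suggestions in the note above is sketched below. The field type name {{{text_fa}}} is hypothetical, and the sketch assumes {{{solr.PositionFilterFactory}}} is available in your Solr version.

{{{
<fieldtype name="text_fa" class="solr.TextField">
  <analyzer type="index">
    <!-- ArabicLetterTokenizer splits on joiners -->
    <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <!-- query time only: avoid unwanted phrase queries -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldtype>
}}}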

==== Portuguese ====
Solr includes support for stemming Portuguese via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
...
}}}

Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Romanian ====
Solr includes support for stemming Romanian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
...
}}}

Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Russian ====
Solr includes support for stemming Russian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Russian" />
...
}}}

Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Spanish ====
Solr includes support for stemming Spanish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
...
}}}

Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

==== Swedish ====
Solr includes support for stemming Swedish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Swedish" />
...
}}}

Example set of Swedish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

==== Thai ====
Solr includes support for breaking Thai text into words via {{{solr.ThaiWordFilterFactory}}}

{{{
...
  <filter class="solr.ThaiWordFilterFactory"/>
...
}}}

<!> Note: Be sure to use PositionFilter at query-time (only) as this language does not use spaces between words.
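
A possible field type that follows the note above is sketched here. The name {{{text_th}}} and the choice of {{{solr.StandardTokenizerFactory}}} as the tokenizer are assumptions for illustration.

{{{
<fieldtype name="text_th" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
    <!-- query time only -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldtype>
}}}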

==== Turkish ====
Solr includes support for stemming Turkish via {{{solr.SnowballPorterFilterFactory}}}, as well as support for case-insensitive search via {{{solr.TurkishLowerCaseFilterFactory}}} <!> [[Solr3.1]], and Lucene includes an example stopword list.

{{{
...
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Turkish" />
...
}}}

Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!> [[Solr3.1]]

=== Not yet Integrated ===

The following languages have explicit support in Lucene, but it is not yet integrated into Solr. If you need to support these languages you might find this information useful in the meantime.

==== Chinese, Japanese, Korean ====

Lucene provides support for Chinese word segmentation (SentenceTokenizer, WordTokenFilter) in a separate jar file (lucene-analyzers-smartcn.jar). This component includes a large dictionary and segments Chinese text into words using a Hidden Markov Model.

<!> [[Lucene3.1]]

Additionally, Lucene provides support for matching between Traditional and Simplified Chinese and for matching between Hiragana and Katakana (ICUTransformFilter) in a separate jar file (lucene-icu.jar).

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words.

==== Polish ====
<!> [[Lucene3.1]]

Lucene provides support for Polish stemming (StempelFilter) in a separate jar file (lucene-analyzers-stempel.jar). This component includes an algorithmic stemmer with tables for Polish.

==== Lao, Myanmar, Khmer ====
<!> [[Lucene3.1]]

Lucene provides support for segmenting these languages into syllables (ICUTokenizer) in a separate jar file (lucene-icu.jar).

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words. 

=== My language is not listed!!! ===

Your language might work anyway. A first step is to start with the "textgen" type in the example schema. Remember, things like stemming and stopwords aren't mandatory for the search to work, only optional language-specific improvements.

If you have problems (your language is highly-inflectional, etc), you might want to try using an n-gram approach as an alternative.
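
As a rough sketch of the n-gram approach, the field type below indexes character n-grams of each token. The name {{{text_ngram}}} and the gram sizes (2 to 4) are illustrative assumptions, not recommendations; suitable sizes depend on the language and on how much index growth you can accept.

{{{
<fieldtype name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- gram sizes are example values -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
  </analyzer>
</fieldtype>
}}}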

=== Tokenization ===

In general most languages don't require special tokenization (and will work just fine with Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema definition to fit.
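
A language-neutral starting point along these lines might look like the sketch below. The field type name {{{text_mylang}}} is hypothetical and the WordDelimiterFilter settings are example values to tune for your content.

{{{
<fieldtype name="text_mylang" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- add a language-specific stemmer here if one is available -->
  </analyzer>
</fieldtype>
}}}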

=== Ignoring Case ===

In most cases LowerCaseFilterFactory is sufficient. 
However, some languages have special casing properties, and these have their own filters:

 * TurkishLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Turkish language. It includes special handling for [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|dotted and dotless I]].
 * GreekLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Greek language. It removes Greek diacritics and has special handling for the Greek final sigma.

=== Ignoring Diacritics ===

Some languages use diacritics, but people are not always consistent about typing them in queries or documents.

If you are using a stemmer, most stemmers (especially Snowball stemmers) are a bit forgiving about diacritics, and these are handled on a language-specific basis.

For Latin-script writing systems, you can remove all diacritics with ASCIIFoldingFilterFactory. However, this might not be the best approach for your language: for example, in German you may want ü to match ue. In that case it is better not to use ASCIIFoldingFilter before stemming, and instead to use the "German2" stemmer, which has language-specific handling for this case.
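
For the simple case where removing all diacritics is acceptable, the filter chain could be sketched as follows (no stemmer is shown, since folding before stemming can change the stemmer's output):

{{{
...
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
...
}}}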

For some languages in non-Latin writing systems (Arabic, Greek, Hindi, Persian), there are filters to support the idea of "diacritics-insensitive search" already included in Solr. These filters are described above under the relevant languages.

For other languages, the ASCIIFoldingFilterFactory won't do the foldings that you need. One solution is to use the ICUFoldingFilter <!> [[Lucene3.1]], which implements a [[http://unicode.org/reports/tr30/tr30-4.html|similar idea]] across all of Unicode. Unfortunately, this filter is not yet integrated into Solr, so for now you must make the factory yourself.

=== Stopwords ===

Stopwords affect Solr in three ways: relevance, performance, and resource utilization.

From a relevance perspective, these extremely high-frequency terms tend to throw off the scoring algorithm, and you won't get very good results if you leave them. At the same time, if you remove them, you can return bad results when the stopword is actually important.

From a performance perspective, if you keep stopwords, some queries (especially phrase queries) can be very slow.

From a resource utilization perspective, if you keep stopwords, the index is much larger than if you remove them.

One tradeoff you can make if you have the disk space: You can use CommonGramsFilter/CommonGramsQueryFilter instead of StopFilter. This solves the relevance and performance problems, at the expense of even more resource utilization, because it will form bigrams of stopwords to their adjacent words.
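
A sketch of this tradeoff is shown below, with CommonGramsFilter at index time and CommonGramsQueryFilter at query time. The field type name {{{text_commongrams}}} and the file name {{{stopwords.txt}}} are assumptions; both filters should point at the same word list.

{{{
<fieldtype name="text_commongrams" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldtype>
}}}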

=== Stemming ===

Stemming can help improve relevance, but it can also hurt.

There is no general rule for whether or not to stem: It depends not only on the language, but also on the properties of your documents and queries.

In general, if the language is highly inflectional, it's worth evaluating stemming, as it might bring a significant improvement. Some annoyances caused by stemming can then be handled with tuning: See {{{CustomizingStemming}}} below.

==== Notes about solr.PorterStemFilterFactory ====

Porter stemmer for the English language.

Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a normalization process that removes common endings from words.

  Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

Note: This implementation differs slightly from the "Porter" algorithm available in `solr.SnowballPorterFilterFactory`, because it deviates from the published algorithm in a few places.
For more details, see the section "Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].

This is the fastest stemmer for English: approximately twice as fast as using SnowballPorterFilterFactory.

<<Anchor(SnowballPorterFilter)>>
==== Notes about solr.SnowballPorterFilterFactory ====

Creates `org.apache.lucene.analysis.snowball.SnowballFilter`.

Creates an [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification.  The language attribute is used to specify the language of the stemmer.
{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>
}}}

Valid values for the language attribute (creates the snowball stemmer class language + "Stemmer"):
 * [[http://snowball.tartarus.org/algorithms/danish/stemmer.html|Danish]]
 * [[http://snowball.tartarus.org/algorithms/dutch/stemmer.html|Dutch]]
 * [[http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html|Kp]]: The Kraaij-Pohlmann stemming algorithm for Dutch.
 * [[http://snowball.tartarus.org/algorithms/porter/stemmer.html|Porter]]: The original Porter stemming algorithm for English.
 * [[http://snowball.tartarus.org/algorithms/english/stemmer.html|English]]: The Porter2 stemming algorithm for English.
 * [[http://snowball.tartarus.org/algorithms/lovins/stemmer.html|Lovins]]: The early Lovins stemming algorithm for English.
 * [[http://snowball.tartarus.org/algorithms/finnish/stemmer.html|Finnish]]
 * [[http://snowball.tartarus.org/algorithms/french/stemmer.html|French]]
 * [[http://snowball.tartarus.org/algorithms/german/stemmer.html|German]]
 * [[http://snowball.tartarus.org/algorithms/german2/stemmer.html|German2]]: A variation of the German algorithm with handling to allow ä, ö and ü to be represented by ae, oe and ue
 * [[http://snowball.tartarus.org/algorithms/hungarian/stemmer.html|Hungarian]]
 * [[http://snowball.tartarus.org/algorithms/italian/stemmer.html|Italian]]
 * [[http://snowball.tartarus.org/algorithms/norwegian/stemmer.html|Norwegian]]
 * [[http://snowball.tartarus.org/algorithms/portuguese/stemmer.html|Portuguese]]
 * [[http://snowball.tartarus.org/algorithms/romanian/stemmer.html|Romanian]]
 * [[http://snowball.tartarus.org/algorithms/russian/stemmer.html|Russian]]
 * [[http://snowball.tartarus.org/algorithms/spanish/stemmer.html|Spanish]]
 * [[http://snowball.tartarus.org/algorithms/swedish/stemmer.html|Swedish]]
 * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]

<!> Gotchas:
 * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr, as it is implemented using reflection.
 * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
 * The Turkish stemmer requires special lowercasing. You should use TurkishLowerCaseFilter instead of LowerCaseFilter with this language. See [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
 * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more profound for non-English stemmers.

<<Anchor(CustomizingStemming)>>
=== Customizing Stemming ===

Sometimes a stemmer might not do what you want out of the box. For example, you might be happy with the results on average, but have a few particular cases (such as product names or similar) where it annoys you or actually hurts your search results.

The components below allow you to fine-tune the stemming process by preventing words from being stemmed at all, or by overriding the stemming algorithm with custom mappings.

==== solr.KeywordMarkerFilterFactory ====
<!> [[Solr3.1]]

Protects words from being modified by stemmers.

A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.

A [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt|sample Solr protwords.txt with comments]] can be found in the Source Repository.

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldtype>
}}}

==== solr.StemmerOverrideFilterFactory ====
<!> [[Solr3.1]]

Overrides stemming algorithms, by applying a custom mapping, then protecting these terms from being modified by stemmers.

A customized mapping of words to stems, in a tab-separated file, can be specified with the "dictionary" attribute in the schema.  Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer.

A [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/test/test-files/solr/conf/stemdict.txt|sample stemdict.txt with comments]] can be found in the Source Repository.

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldtype>
}}}

<<Anchor(Decompounding)>>
=== Decompounding ===

Decompounding can improve search results for some languages. At the same time, it can increase the time it takes to index and search, as well as increase the index size itself.

Solr provides dictionary-based decompounding support via solr.DictionaryCompoundWordTokenFilterFactory. This factory allows you to provide a dictionary, along with some settings (min/max subword size, etc), to break compound words into pieces.
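
A minimal sketch of such a configuration is shown below. The field type name {{{text_decompound}}}, the dictionary file name {{{compound-dictionary.txt}}}, and all size values are assumptions for illustration; the dictionary is a word list that you supply yourself.

{{{
<fieldtype name="text_decompound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- size limits are example values; tune them for your language -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="compound-dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
  </analyzer>
</fieldtype>
}}}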

One alternative is to use n-gram tokenization so that the search is less sensitive to compound words.

TODO: Add support for Lucene's hyphenation grammar-based decompounding and document it here.