Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/05/19 23:08:01 UTC

[Solr Wiki] Update of "LanguageAnalysis" by HossMan

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageAnalysis" page has been changed by HossMan.
The comment on this change is: normalize header nesting level.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=1&rev2=2

--------------------------------------------------

  = Language Analysis =
  
- == Overview ==
- 
  This page describes some of the language-specific analysis components available in Solr. These components can be used to improve search results for specific languages.
  
  Please look at [[AnalyzersTokenizersTokenFilters|AnalyzersTokenizersTokenFilters]] for other analysis components you can use in combination with these components.
  
  <<TableOfContents>>
  
- === By language ===
+ == By language ==
+ 
- ==== Arabic ====
+ === Arabic ===
  Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]] stemming algorithm, and Lucene includes an example stopword list.
  
  This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.
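  A minimal field type wiring these pieces together might look something like the sketch below (the type name and stopword file name are only placeholders, and the exact filter order is an assumption based on the description above):
  
  {{{
  <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ArabicLetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ar.txt"/>
      <!-- character normalization first, then the Light-10 stemmer -->
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}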
@@ -25, +24 @@

  
  Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Brazilian Portuguese ====
+ === Brazilian Portuguese ===
  Solr includes a modified version of the Snowball Portuguese algorithm for Brazilian Portuguese, and Lucene includes an example stopword list. This stemmer handles diacritical marks differently than the European Portuguese stemmer.
  
  {{{
@@ -37, +36 @@

  
  Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/br/BrazilianAnalyzer.java|stopwords]] (Look for BRAZILIAN_STOP_WORDS)
  
- ==== Bulgarian ====
+ === Bulgarian ===
  <!> [[Solr3.1]]
  
  Solr includes a light stemmer for Bulgarian, following this [[http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf|algorithm]], and Lucene includes an example stopword list.
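  A minimal field type sketch along these lines (the type name and stopword file name are placeholders):
  
  {{{
  <fieldType name="text_bg" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_bg.txt"/>
      <!-- light stemmer for Bulgarian, available as of Solr 3.1 -->
      <filter class="solr.BulgarianStemFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}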
@@ -51, +50 @@

  
  Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Chinese, Japanese, Korean ====
+ === Chinese, Japanese, Korean ===
  Lucene provides support for these languages with CJKTokenizer, which indexes bigrams and does some character folding of full-width forms.
  
  {{{
@@ -61, +60 @@

  
  <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words. 
  
- ==== Czech ====
+ === Czech ===
  <!> [[Solr3.1]]
  
  Solr includes a light stemmer for Czech, following this [[http://portal.acm.org/citation.cfm?id=1598600|algorithm]], and Lucene includes an example stopword list.
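  A minimal field type sketch along these lines (the type name and stopword file name are placeholders):
  
  {{{
  <fieldType name="text_cz" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_cz.txt"/>
      <!-- light stemmer for Czech, available as of Solr 3.1 -->
      <filter class="solr.CzechStemFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}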
@@ -75, +74 @@

  
  Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechAnalyzer.java|stopwords]] (Look for CZECH_STOP_WORDS)
  
- ==== Danish ====
+ === Danish ===
  Solr includes support for stemming Danish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -89, +88 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Dutch ====
+ === Dutch ===
  Solr includes two stemmers for Dutch via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -105, +104 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== English ====
+ === English ===
  Solr includes two stemmers for English: the original Porter stemmer via {{{solr.PorterStemFilterFactory}}} and the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, as well as an example stopword list.
  
  {{{
@@ -120, +119 @@

  Larger example set of English
  [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
  
- ==== Finnish ====
+ === Finnish ===
  Solr includes support for stemming Finnish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -133, +132 @@

  Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== French ====
+ === French ===
  Solr includes support for stemming French via {{{solr.SnowballPorterFilterFactory}}} and for removing elisions via ElisionFilterFactory, and Lucene includes an example stopword list.
  
  {{{
@@ -149, +148 @@

  
  <!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This will prevent very slow phrase queries.
  
- ==== German ====
+ === German ===
  Solr includes support for stemming German with three different algorithms: two via {{{solr.SnowballPorterFilterFactory}}} and one via {{{solr.GermanStemFilterFactory}}}. Lucene also includes an example stopword list.
  
  With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different language attributes: "German" and "German2". German2 is just a modified version of German that handles the umlaut characters differently: for example, it treats "ü" as "ue" in most contexts.
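  For example, a sketch of a field type using the "German2" variant (the type name is only a placeholder):
  
  {{{
  <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- "German2" folds umlauts, e.g. "ü" is treated like "ue" -->
      <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    </analyzer>
  </fieldType>
  }}}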
@@ -167, +166 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Greek ====
+ === Greek ===
  Solr includes support for stemming Greek following this [[http://people.dsv.su.se/~hercules/papers/Ntais_greek_stemmer_thesis_final.pdf|algorithm]] <!> [[Solr3.1]], as well as support for case/diacritics-insensitive search via {{{solr.GreekLowerCaseFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -181, +180 @@

  
  <!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory
  
- ==== Hindi ====
+ === Hindi ===
  <!> [[Solr3.1]]
  
  Solr includes support for stemming Hindi following this [[http://computing.open.ac.uk/Sites/EACLSouthAsia/Papers/p6-Ramanathan.pdf|algorithm]], support for common spelling differences via {{{solr.HindiNormalizationFilterFactory}}} following this [[http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf|algorithm]], support for encoding differences via {{{solr.IndicNormalizationFilterFactory}}} following this [[http://ldc.upenn.edu/myl/IndianScriptsUnicode.html|algorithm]], and Lucene includes an example stopword list.
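  A rough sketch of a field type combining these filters (the type name and stopword file name are placeholders, a plain standard tokenizer is used here for simplicity, and the filter order is an assumption: normalize encoding first, then spelling, then stem):
  
  {{{
  <fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- normalize encoding differences, then common spelling differences -->
      <filter class="solr.IndicNormalizationFilterFactory"/>
      <filter class="solr.HindiNormalizationFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_hi.txt"/>
      <filter class="solr.HindiStemFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}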
@@ -196, +195 @@

  
  Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Hungarian ====
+ === Hungarian ===
  
  Solr includes support for stemming Hungarian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
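  A minimal sketch (the type name and stopword file name are placeholders):
  
  {{{
  <fieldType name="text_hu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_hu.txt"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
    </analyzer>
  </fieldType>
  }}}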
  
@@ -211, +210 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Indonesian ====
+ === Indonesian ===
  <!> [[Solr3.1]]
  
  Solr includes support for stemming Indonesian (Bahasa Indonesia) following this [[http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf|algorithm]], and Lucene includes an example stopword list.
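  A minimal sketch (the type name and stopword file name are placeholders):
  
  {{{
  <fieldType name="text_id" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_id.txt"/>
      <!-- Indonesian stemmer, available as of Solr 3.1 -->
      <filter class="solr.IndonesianStemFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}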
@@ -227, +226 @@

  
  Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]
  
- ==== Italian ====
+ === Italian ===
  Solr includes support for stemming Italian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -239, +238 @@

  
  Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Norwegian ====
+ === Norwegian ===
  Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -253, +252 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Persian / Farsi ====
+ === Persian / Farsi ===
  Solr includes support for normalizing Persian via {{{solr.PersianNormalizationFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -267, +266 @@

  
  <!> Note: WordDelimiterFilter does not split on joiners by default. You can solve this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider using PositionFilter at query-time (only), as the QueryParser does not consider joiners and could create unwanted phrase queries.
  
- ==== Portuguese ====
+ === Portuguese ===
  Solr includes support for stemming Portuguese via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -279, +278 @@

  
  Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Romanian ====
+ === Romanian ===
  Solr includes support for stemming Romanian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -291, +290 @@

  
  Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Russian ====
+ === Russian ===
  Solr includes support for stemming Russian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -303, +302 @@

  
  Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Spanish ====
+ === Spanish ===
  Solr includes support for stemming Spanish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -315, +314 @@

  
  Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
  
- ==== Swedish ====
+ === Swedish ===
  Solr includes support for stemming Swedish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
  
  {{{
@@ -329, +328 @@

  
  <!> Note: See also {{{Decompounding}}} below.
  
- ==== Thai ====
+ === Thai ===
  Solr includes support for breaking Thai text into words via {{{solr.ThaiWordFilterFactory}}}.
  
  {{{
@@ -340, +339 @@

  
  <!> Note: Be sure to use PositionFilter at query-time (only) as this language does not use spaces between words.
  
- ==== Turkish ====
+ === Turkish ===
  Solr includes support for stemming Turkish via {{{solr.SnowballPorterFilterFactory}}}, as well as support for case-insensitive search via {{{solr.TurkishLowerCaseFilterFactory}}} <!> [[Solr3.1]], and Lucene includes an example stopword list.
  
  {{{
@@ -354, +353 @@

  
  <!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!> [[Solr3.1]]
  
- === Not yet Integrated ===
+ == Not yet Integrated ==
  
  The following languages have explicit support in Lucene, but that support is not yet integrated into Solr. If you need to support these languages, you might find this information useful in the meantime.
  
- ==== Chinese, Japanese, Korean ====
+ === Chinese, Japanese, Korean ===
  
  Lucene provides support for Chinese word segmentation (SentenceTokenizer, WordTokenFilter) in a separate jar file (lucene-analyzers-smartcn.jar). This component includes a large dictionary and segments Chinese text into words with a Hidden Markov Model.
  
@@ -368, +367 @@

  
  <!> Note: Be sure to use PositionFilter at query-time (only) as this language does not use spaces between words.
  
- ==== Polish ====
+ === Polish ===
  <!> [[Lucene3.1]]
  
  Lucene provides support for Polish stemming (StempelFilter) in a separate jar file (lucene-analyzers-stempel.jar). This component includes an algorithmic stemmer with tables for Polish.
  
- ==== Lao, Myanmar, Khmer ====
+ === Lao, Myanmar, Khmer ===
  <!> [[Lucene3.1]]
  
  Lucene provides support for segmenting these languages into syllables (ICUTokenizer) in a separate jar file (lucene-icu.jar).
  
  <!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words. 
  
- === My language is not listed!!! ===
+ == My language is not listed!!! ==
  
  Your language might work anyway. A first step is to start with the "textgen" type in the example schema. Remember, things like stemming and stopwords aren't mandatory for search to work; they are optional language-specific improvements.
  
  If you have problems (your language is highly inflectional, etc.), you might want to try an n-gram approach as an alternative.
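  A rough sketch of such a fallback (the type name and the gram sizes are only starting points to experiment with):
  
  {{{
  <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index overlapping character n-grams instead of relying on a stemmer -->
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
    </analyzer>
  </fieldType>
  }}}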
  
+ == Other Tips ==
  === Tokenization ===
  
  In general, most languages don't require special tokenization (and will work just fine with Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema definition to fit.
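  For instance, a sketch of such a whitespace-based analyzer (the type name is a placeholder, and the WordDelimiterFilter parameters shown are common settings, not requirements):
  
  {{{
  <fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- split on punctuation and case changes; also keep catenated forms -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  }}}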