Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2006/07/25 22:46:37 UTC
[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by YonikSeeley

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

The comment on the change is:
configurable stemmer languages, latin1 filter

------------------------------------------------------------------------------
  }}}
  
  '''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses Java reflection to stem every word. 
+ 
+ ==== solr.SnowballPorterFilterFactory ====
+ 
+ Creates `org.apache.lucene.analysis.snowball.SnowballFilter`.
+ 
+ Creates a [http://snowball.tartarus.org/algorithms/english/stemmer.html Porter2 stemmer] from the Java classes generated from a [http://snowball.tartarus.org/ Snowball] specification. The `language` attribute specifies the language of the stemmer.
+ {{{
+ <fieldtype name="myfieldtype" class="solr.TextField">
+   <analyzer>
+     <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
+     <filter class="solr.SnowballPorterFilterFactory" language="German" />
+   </analyzer>
+ </fieldtype>
+ }}}
+ 
+ Valid values for the `language` attribute (the factory instantiates the Snowball stemmer class named language + "Stemmer", so "German" selects `GermanStemmer`):
+  * Danish
+  * Dutch
+  * English
+  * Finnish
+  * French
+  * German2
+  * German
+  * Italian
+  * Kp
+  * Lovins
+  * Norwegian
+  * Porter
+  * Portuguese
+  * Russian
+  * Spanish
+  * Swedish
+ 
  
  ==== solr.WordDelimiterFilterFactory ====
  
@@ -358, +391 @@

     * Many thousands of documents containing the term "text:TV"
     * A few hundred documents containing the term "text:Television"
  
- A query for `text:TV` will expand into `(text:TV text:Television)` and the lower docFreq for `text:Television` will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuative to the client.  Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the orriginal text contained.
+ A query for `text:TV` will expand into `(text:TV text:Television)`, and the lower docFreq for `text:Television` will give the documents that match "Television" a much higher score than comparable documents that match "TV", which may be somewhat counterintuitive to the client.  Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
  
  ==== solr.RemoveDuplicatesTokenFilterFactory ====
  
@@ -366, +399 @@

  
  Filters out any tokens which are at the same logical position in the tokenstream as a previous token with the same text.  This can arise in several ways depending on what the upstream token filters are, notably when stemming synonyms with similar roots.  It is useful to remove the duplicates to prevent `idf` inflation at index time, or `tf` inflation (in a !MultiPhraseQuery) at query time.
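  
  A minimal sketch of where this filter might sit, assuming a query analyzer that expands synonyms and then stems them. The field type name and the `synonyms.txt` file are illustrative assumptions:
  {{{
  <fieldtype name="text_dedup" class="solr.TextField">
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- assumed synonyms file; expansion can emit several tokens at one position -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <!-- stemming may reduce those synonyms to identical text -->
      <filter class="solr.EnglishPorterFilterFactory"/>
      <!-- placed last so it sees the stemmed tokens and drops same-text tokens sharing a position -->
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>
  }}}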
  
+ 
+ ==== solr.ISOLatin1AccentFilterFactory ====
+ 
+ Creates `org.apache.lucene.analysis.ISOLatin1AccentFilter`.
+ 
+ Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents.
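+ 
+ A minimal configuration sketch, following the pattern of the other factories on this page. The field type name and the choice of tokenizer are illustrative assumptions:
+ {{{
+ <fieldtype name="text_unaccented" class="solr.TextField">
+   <analyzer>
+     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
+     <!-- replaces accented ISO Latin 1 characters with their unaccented equivalents -->
+     <filter class="solr.ISOLatin1AccentFilterFactory"/>
+   </analyzer>
+ </fieldtype>
+ }}}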
+