Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2006/02/11 10:42:41 UTC

[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by HossMan

The following page has been changed by HossMan:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

The comment on the change is:
initial import from CNET wiki page PI/AnalyzersTokenizersTokenFilters

New page:
= Analyzers, Tokenizers, and Token Filters =

/!\ :TODO: /!\ Package names are all probably wrong and need fixed

When a document comes in, individual fields are subject to the analyzing and tokenizing filters that can transform the data in the fields. For example: removing blank spaces, removing HTML code, stemming, or removing a particular character and replacing it with another. These and similar operations may be needed at collection time as well as at query time. For example, you might perform a [http://en.wikipedia.org/wiki/Soundex Soundex] transformation (a type of phonetic hashing) on a string to enable a search based upon the word and upon its 'sound-alikes'.

'''Note:''' 
Before continuing with this doc, it's recommended that you read the following sections in [http://lucenebook.com/search Lucene In Action]: 
 * 1.5.3 : Analyzer
 * Chapter 4.0 through 4.7 at least 

Try searches for "analyzer", "token", and "stemming".

[[TableOfContents]]


== Stemming ==

Two types of stemming are available to you:
   * [http://tartarus.org/~martin/PorterStemmer/ Porter] or Reduction stemming: a transforming algorithm that reduces any of the forms of a word, such as "runs, running, ran", to its elemental root, e.g. "run". Porter stemming must be performed ''both'' at insertion time and at query time.
   * Expansion stemming: takes a root word and 'expands' it to all of its various forms; it can be used ''either'' at insertion time ''or'' at query time.
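As a toy illustration of the difference between the two strategies (hypothetical word lists, not a real stemmer implementation):

```python
# Hypothetical word lists for illustration only; not a real stemmer.

# Reduction stemming: collapse every form of a word down to its root.
# Must be applied at both insertion time and query time.
REDUCE = {"runs": "run", "running": "run", "ran": "run"}

# Expansion stemming: expand a root out to all of its forms.
# Can be applied at either insertion time or query time.
EXPAND = {"run": ["run", "runs", "running", "ran"]}

def reduce_stem(word):
    return REDUCE.get(word, word)

def expand_stem(word):
    return EXPAND.get(word, [word])
```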

== Analyzers ==

Analyzers are components that pre-process input text at index time and/or at search time. Because a search string has to be processed compatibly with the way the indexed text was processed, ''it is important that the Analyzer used at query time be compatible with the Analyzer used at indexing time. Using incompatible Analyzers will likely result in unexpected search results.''

The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. If you need to pre-process input text and queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need to implement a custom Analyzer.  

== Tokens and Token Filters ==

An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer is normally implemented by creating a '''Tokenizer''' that splits-up a stream (normally a single field value) into a series of tokens. These tokens are then passed through Token Filters that add, change, or remove tokens. The field is then indexed by the resulting token stream.
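The tokenizer-plus-filters pipeline described above can be sketched in Python (illustrative only; the function names are invented and this is not Lucene's actual API):

```python
# Sketch of the Analyzer pipeline: a tokenizer splits a stream into tokens,
# then each token filter transforms that token stream in turn.

def whitespace_tokenizer(text):
    return text.split()

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    tokens = whitespace_tokenizer(text)  # split the stream into tokens
    tokens = lowercase_filter(tokens)    # filters run in the listed order
    tokens = stop_filter(tokens)
    return tokens                        # the field is indexed by this stream
```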

== Specifying an Analyzer in the schema ==

A Solr schema.xml file allows two methods for specifying the way a text field is analyzed. (Normally only fieldtypes of `solr.TextField` will have Analyzers explicitly specified in the schema): 

  1.  Specifying the '''class name''' of an Analyzer: anything extending org.apache.lucene.analysis.Analyzer. [[BR]] Example: [[BR]] {{{
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
}}}
  1.  Specifying a '''Tokenizer''' followed by a list of optional !TokenFilters that are applied in the listed order. Factories that can create the tokenizers or token filters are used to avoid the overhead of creation via reflection. [[BR]] Example: [[BR]] {{{
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
}}}

=== TokenizerFactories ===

Solr provides the following !TokenizerFactories:

==== solr.LetterTokenizerFactory ====

Creates `org.apache.lucene.analysis.LetterTokenizer`. 

Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.

  Example: `"I can't" ==> "I", "can", "t"` 
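This behavior can be approximated with a regular expression (an illustrative sketch; Lucene's LetterTokenizer accepts any Unicode letter, while this ASCII-only pattern does not):

```python
import re

# Approximation of LetterTokenizer: emit maximal runs of contiguous
# letters and discard everything else. ASCII-only for illustration.
def letter_tokenize(text):
    return re.findall(r"[A-Za-z]+", text)
```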

==== solr.WhitespaceTokenizerFactory ====

Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.

Creates tokens of characters separated by splitting on whitespace. 

==== solr.LowerCaseTokenizerFactory ====

Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.

Creates tokens by lowercasing all letters and dropping non-letters.

  Example: `"I can't" ==> "i", "can", "t"`

==== solr.StandardTokenizerFactory ====

Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The StandardFilter is the only Lucene filter that utilizes token type.
   
Some token types are number, alphanumeric, email, acronym, URL, etc.

  Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`

==== solr.HTMLStripWhitespaceTokenizerFactory ====

Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer.

HTML stripping features:
 * The input need not be an HTML document as only constructs that look like HTML will be removed.
 * Removes HTML/XML tags while keeping the content
   * Attributes within tags are also removed, and attribute quoting is optional.
 * Removes XML processing instructions: <?foo bar?>
 * Removes XML comments
 * Removes XML elements starting with <! and ending with > 
 * Removes contents of <script> and <style> elements.
   * Handles XML comments inside these elements (normal comment processing won't always work)
 * Replaces numeric character entity references like &#65; or &#x7f;
   * The terminating ';' is optional if the entity reference is followed by whitespace.
 * Replaces all [http://www.w3.org/TR/REC-html40/sgml/entities.html named character entity references].
   * &nbsp; is replaced with a space instead of 0xa0
   * The terminating ';' is mandatory to avoid false matches on something like "Alpha&Omega Corp" 

HTML stripping examples:

|| my <a href="www.foo.bar">link</a> || my link ||
|| <?xml?><br>hello<!--comment--> || hello ||
|| hello<script><-- f('<--internal--></script>'); --></script> || hello ||
|| if a<b then print a; || if a<b then print a; ||
|| hello <td height=22 nowrap align="left"> || hello ||
|| a&lt;b &#65 Alpha&Omega &Omega; || a<b A Alpha&Omega Ω ||
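As a rough sketch of this kind of stripping, Python's standard-library HTML parser can keep text content while dropping tags and <script>/<style> bodies (illustrative only; Solr's stripper handles many more edge cases, such as optional entity terminators and comments inside scripts):

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    def __init__(self):
        super().__init__()   # convert_charrefs=True decodes &#65;, &lt;, etc.
        self.parts = []
        self.skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def strip_html(text):
    s = Stripper()
    s.feed(text)
    s.close()  # flush any buffered trailing data
    return "".join(s.parts)
```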


==== solr.HTMLStripStandardTokenizerFactory ====

Strips HTML from the input stream and passes the result to a !StandardTokenizer.

See solr.HTMLStripWhitespaceTokenizerFactory for details on HTML stripping.

=== TokenFilterFactories ===

==== solr.StandardFilterFactory ====

Creates `org.apache.lucene.analysis.standard.StandardFilter`.

Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e., those produced by !StandardTokenizer or equivalent.

  Example of !StandardTokenizer followed by !StandardFilter:
     `"I.B.M. cat's can't" ==> "IBM", "cat", "can't"`
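The two rewrite rules can be sketched as follows (illustrative only; the real StandardFilter is type-aware and only rewrites tokens that StandardTokenizer typed as ACRONYM or APOSTROPHE):

```python
import re

# Sketch of StandardFilter's two rules: remove dots from acronyms,
# and strip a possessive 's from the end of a token.
def standard_filter(token):
    if re.fullmatch(r"(?:[A-Za-z]\.)+", token):  # acronym like I.B.M.
        return token.replace(".", "")
    if token.endswith("'s"):                     # possessive like cat's
        return token[:-2]
    return token
```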

==== solr.LowerCaseFilterFactory ====

Creates `org.apache.lucene.analysis.LowerCaseFilter`.

Lowercases the letters in each token. Leaves non-letter tokens alone.

  Example: `"I.B.M.", "Solr" ==> "i.b.m.", "solr"`.

==== solr.StopFilterFactory ====

Creates `org.apache.lucene.analysis.StopFilter`.

Discards common words.

The default English stop words are:
{{{
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
}}}
 
A customized stop word list may be specified with the "words" attribute in the schema. The file referenced by the words parameter will be loaded by the !ClassLoader and hence must be in the classpath.

{{{
<fieldtype name="teststop" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.LowerCaseTokenizerFactory"/> 
     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
   </analyzer>
</fieldtype>
}}}
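The filtering itself is a simple membership test; a sketch with an abbreviated version of the default stop list above:

```python
# Discard common low-content words from the token stream.
STOPWORDS = frozenset({"a", "an", "and", "the", "to", "of", "is"})

def stop_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]
```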

==== solr.LengthFilterFactory ====

Creates `solr.LengthFilter`.

Filters out tokens whose length is ''not'' in the range min through max, inclusive.
{{{
<fieldtype name="lengthfilt" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="5" />
  </analyzer>
</fieldtype>
}}}
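The length check, sketched in plain Python with the min=2/max=5 values from the schema example above:

```python
# Keep only tokens whose length is within [min_len, max_len] inclusive.
def length_filter(tokens, min_len=2, max_len=5):
    return [t for t in tokens if min_len <= len(t) <= max_len]
```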

==== solr.PorterStemFilterFactory ====

Creates `org.apache.lucene.analysis.PorterStemFilter`.

Standard Lucene implementation of the [http://tartarus.org/~martin/PorterStemmer/ Porter Stemming Algorithm], a normalization process that removes common endings from words.

  Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

==== solr.EnglishPorterFilterFactory ====

Creates `solr.EnglishPorterFilter`.

Creates an [http://snowball.tartarus.org/algorithms/english/stemmer.html English Porter2 stemmer] from the Java classes generated from a [http://snowball.tartarus.org/ Snowball] specification. 

A customized protected word list may be specified with the "protected" attribute in the schema. The file referenced will be loaded by the !ClassLoader and hence must be in the classpath. Any words in the protected word list will not be modified (stemmed).

{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/> 
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
  </analyzer>
</fieldtype>
}}}

'''Note:''' Due to performance concerns, this implementation does not utilize `org.apache.lucene.analysis.snowball.SnowballFilter`, as that class uses reflection to stem every word. 

==== solr.WordDelimiterFilterFactory ====

Creates `solr.analysis.WordDelimiterFilter`.

Splits words into subwords and performs optional transformations on subword groups.
Words are split into subwords with the following rules:
 * split on intra-word delimiters (by default, all non alpha-numeric characters).
   * `"Wi-Fi" -> "Wi", "Fi"`
 * split on case transitions
   * `"PowerShot" -> "Power", "Shot"`
 * split on letter-number transitions
   * `"SD500" -> "SD", "500"`
 * leading and trailing intra-word delimiters on each subword are ignored
   * `"//hello---there, 'dude'" -> "hello", "there", "dude"`
 * trailing "'s" are removed for each subword
   * `"O'Neil's" -> "O", "Neil"`
     * Note: this step isn't performed in a separate filter because of possible subword combinations.
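The splitting rules above can be roughly sketched with regular expressions (illustrative only; this omits Solr's token-position bookkeeping and the subword-combination parameters described below):

```python
import re

def word_delimiter_split(token):
    token = re.sub(r"'s$", "", token)               # drop a trailing 's
    subwords = []
    # Split on intra-word delimiters (non alpha-numeric characters);
    # empty parts from leading/trailing delimiters are ignored.
    for part in re.split(r"[^A-Za-z0-9]+", token):
        if not part:
            continue
        # Split on case transitions and letter-number transitions.
        subwords.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|[0-9]+", part))
    return subwords
```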

There are a number of parameters that affect what tokens are generated and if subwords are combined:
 * '''generateWordParts="1"''' causes parts of words to be generated:
   * `"PowerShot" => "Power" "Shot"`
 * '''generateNumberParts="1"''' causes number subwords to be generated:
   * `"500-42" => "500" "42"`
 * '''catenateWords="1"''' causes maximum runs of word parts to be catenated:
    * `"wi-fi" => "wifi"`
 * '''catenateNumbers="1"''' causes maximum runs of number parts to be catenated:
   * `"500-42" => "50042"`
 * '''catenateAll="1"''' causes all subword parts to be catenated:
   * `"wi-fi-4000" => "wifi4000"`

These parameters may be combined in any way.  
 * Example of generateWordParts="1" and  catenateWords="1":
   * `"PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"` [[BR]] (where 0,1,1 are token positions)
   * `"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"`
   * `"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"`

One use for !WordDelimiterFilter is to help match words with [:SolrRelevancyCookbook#IntraWordDelimiters:different delimiters].  One way of doing so is to specify `generateWordParts="1" catenateWords="1"` in the analyzer used for indexing, and `generateWordParts="1"` in the analyzer used for querying.  Given that the current !StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as !WhitespaceTokenizer). 

{{{
    <fieldtype name="subword" class="solr.TextField">
      <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"   
                generateNumberParts="1" 
                catenateWords="0"       
                catenateNumbers="0"     
                catenateAll="0"         
                />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
      <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"   
                generateNumberParts="1" 
                catenateWords="1"       
                catenateNumbers="1"     
                catenateAll="0"         
                />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldtype>
}}}

==== solr.SynonymFilterFactory ====

Creates `SynonymFilter`.

Matches strings of tokens and replaces them with other strings of tokens.

 * The '''synonyms''' parameter names an external file defining the synonyms.
 * If '''ignoreCase''' is true, matching will lowercase before checking equality.
 * If '''expand''' is true, a synonym will be expanded to all equivalent synonyms.  If it is false, all equivalent synonyms will be reduced to the first in the list.
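A minimal sketch of the expand=false (reduce) behavior for single-token synonyms (illustrative only, with an invented mapping; the real filter also matches multi-token sequences such as "i pod"):

```python
# Map each equivalent synonym to the first entry in its list (reduce mode).
SYNONYMS = {"cosmos": "universe", "foosball": "foozball"}

def synonym_filter(tokens, ignore_case=True):
    out = []
    for t in tokens:
        key = t.lower() if ignore_case else t  # lowercase before matching
        out.append(SYNONYMS.get(key, t))
    return out
```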

Example usage in schema:
{{{
    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldtype>
}}}

Synonym file format:
{{{
# blank lines and lines starting with pound are comments.

#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS.  These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit

#Equivalent synonyms may be separated with commas and give
#no explicit mapping.  In this case the mapping behavior will
#be taken from the expand parameter in the schema.  This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos

# If expand==true, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent to the explicit mapping:
ipod, i-pod, i pod => ipod

#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz

}}}