Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2007/09/01 19:32:08 UTC

[Solr Wiki] Trivial Update of "AnalyzersTokenizersTokenFilters" by noodl

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by noodl:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

The comment on the change is:
A few typos

------------------------------------------------------------------------------
    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
  </fieldtype>
  }}}
-   1.  Specifing a '''!TokenizerFactory''' followed by a list of optional !TokenFilterFactories that are applied in the listed order. Factories that can create the tokenizers or token filters are used to prepare configuration for the tokenizer or filter and avoid the overhead of creation via reflection. [[BR]] Example: [[BR]] {{{
+   1.  Specifying a '''!TokenizerFactory''' followed by a list of optional !TokenFilterFactories that are applied in the listed order. Factories that can create the tokenizers or token filters are used to prepare configuration for the tokenizer or filter and avoid the overhead of creation via reflection. [[BR]] Example: [[BR]] {{{
  <fieldtype name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
@@ -103, +103 @@

  
  Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
  
- A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The !StandardFilter is currently the only Lucene filter that utilizes token type.
+ A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The !StandardFilter is currently the only Lucene filter that utilizes token types.
     
  Some token types are number, alphanumeric, email, acronym, URL, etc. &#151;
  
@@ -243, +243 @@

  
  Creates `org.apache.lucene.analysis.PorterStemFilter`.
  
- Standard Lucene implementation of the     [http://tartarus.org/~martin/PorterStemmer/ Porter Stemming Algorithm], a normalization process that removes common endings from words.
+ Standard Lucene implementation of the [http://tartarus.org/~martin/PorterStemmer/ Porter Stemming Algorithm], a normalization process that removes common endings from words.
  
    Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
  
@@ -439, +439 @@

   1. The Lucene !QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words `sea biscit` the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
   1. Phrase searching (i.e. `"sea biscit"`) will cause the !QueryParser to pass the entire string to the analyzer, but if the !SynonymFilter is configured to expand the synonyms, then when the !QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a !MultiPhraseQuery that will not have the desired effect.  This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term.  For our example the resulting !MultiPhraseQuery would be `"(sea | sea | seabiscuit) (biscuit | biscit)"` which would not match the simple case of "seabiscuit" occurring in a document.
  
- Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenerio:
+ Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
  
     * An index with a "text" field, which at query time uses the !SynonymFilter with the synonym `TV, Television` and `expand="true"`
     * Many thousands of documents containing the term "text:TV"
     * A few hundred documents containing the term "text:Television"
  
- A query for `text:TV` will expand into `(text:TV text:Television)` and the lower docFreq for `text:Television` will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuative to the client.  Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
+ A query for `text:TV` will expand into `(text:TV text:Television)` and the lower docFreq for `text:Television` will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuitive to the client.  Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
  
  [[Anchor(RemoveDuplicatesTokenFilter)]]
  ==== solr.RemoveDuplicatesTokenFilterFactory ====