Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/08/07 04:57:48 UTC

[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by KojiSekiguchi

The following page has been changed by KojiSekiguchi:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

------------------------------------------------------------------------------
  
  ==== solr.HTMLStripCharFilterFactory ====
  
+ Creates `org.apache.solr.analysis.HTMLStripCharFilter`. HTMLStripCharFilter strips HTML from the input stream and passes the result to either another CharFilter or a Tokenizer.
- Creates `org.apache.solr.analysis.HTMLStripCharFilter`.
- 
- === TokenizerFactories ===
- 
- Solr provides the following  !TokenizerFactories (Tokenizers and !TokenFilters):
- 
- ==== solr.LetterTokenizerFactory ====
- 
- Creates `org.apache.lucene.analysis.LetterTokenizer`.
- 
- Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
- 
-   Example: `"I can't" ==> "I", "can", "t"`
- 
- [[Anchor(WhitespaceTokenizer)]]
- ==== solr.WhitespaceTokenizerFactory ====
- 
- Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.
- 
- Creates tokens by splitting the input on whitespace.
- 
- ==== solr.LowerCaseTokenizerFactory ====
- 
- Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.
- 
- Creates tokens by lowercasing all letters and dropping non-letters.
- 
-   Example: `"I can't" ==> "i", "can", "t"`
- 
- [[Anchor(StandardTokenizer)]]
- ==== solr.StandardTokenizerFactory ====
- 
- Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
- 
- A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The !StandardFilter is currently the only Lucene filter that utilizes token types.
- 
- Some token types are number, alphanumeric, email, acronym, URL, etc.
- 
-   Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
- 
- [[Anchor(HTMLStripWhitespaceTokenizer)]]
- ==== solr.HTMLStripWhitespaceTokenizerFactory ====
- 
- Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer.
  
  HTML stripping features:
   * The input need not be an HTML document as only constructs that look like HTML will be removed.
@@ -154, +111 @@

  || hello <td height=22 nowrap align="left"> || hello ||
  || a&lt;b &#65 Alpha&Omega &Omega; || a<b A Alpha&Omega Ω ||
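The two table rows above can be approximated with a short Python sketch. This is only an illustration under stated assumptions: `strip_html` is a hypothetical helper, not Solr code, and it relies on a simple regular expression plus Python's `html.unescape` rather than the logic of HTMLStripCharFilter, so it will differ from Solr on many other inputs (comments, scripts, malformed markup).

```python
import html
import re

# Hypothetical helper for illustration only; not part of Solr or Lucene.
# Matches anything that looks like a tag: "<", some non-">" characters, ">".
_TAG_LIKE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove tag-like constructs, then decode character references."""
    without_tags = _TAG_LIKE.sub("", text)
    # html.unescape decodes references such as &lt; and &Omega; and leaves
    # unrecognised ones (for example "&Omega" with no semicolon) untouched.
    return html.unescape(without_tags)


if __name__ == "__main__":
    print(strip_html('hello <td height=22 nowrap align="left">').strip())
    print(strip_html("a&lt;b &#65 Alpha&Omega &Omega;"))
```

Tags are removed before entities are decoded so that a decoded `<` (as in `a&lt;b`) is not mistaken for the start of a tag.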
  
+ === TokenizerFactories ===
+ 
+ Solr provides the following  !TokenizerFactories (Tokenizers and !TokenFilters):
+ 
+ ==== solr.LetterTokenizerFactory ====
+ 
+ Creates `org.apache.lucene.analysis.LetterTokenizer`.
+ 
+ Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
+ 
+   Example: `"I can't" ==> "I", "can", "t"`
+ 
+ [[Anchor(WhitespaceTokenizer)]]
+ ==== solr.WhitespaceTokenizerFactory ====
+ 
+ Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.
+ 
+ Creates tokens by splitting the input on whitespace.
+ 
+ ==== solr.LowerCaseTokenizerFactory ====
+ 
+ Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.
+ 
+ Creates tokens by lowercasing all letters and dropping non-letters.
+ 
+   Example: `"I can't" ==> "i", "can", "t"`
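The difference between the three tokenizers described above can be sketched in a few lines of Python. The function names are hypothetical and the code only mimics the behaviour described in the text (letters as judged by `str.isalpha`, whitespace as judged by `str.split`); it is not the Lucene implementation and may disagree with it on edge cases.

```python
from itertools import groupby


def letter_tokenize(text):
    """Emit maximal runs of letters; every non-letter character is dropped."""
    return ["".join(run) for is_letter, run in groupby(text, key=str.isalpha) if is_letter]


def whitespace_tokenize(text):
    """Split on runs of whitespace and keep everything else intact."""
    return text.split()


def lowercase_tokenize(text):
    """Same as letter_tokenize, with every token lowercased."""
    return [token.lower() for token in letter_tokenize(text)]


if __name__ == "__main__":
    sample = "I can't"
    print(letter_tokenize(sample))
    print(whitespace_tokenize(sample))
    print(lowercase_tokenize(sample))
```

Note that only the whitespace variant keeps `can't` as a single token, because the apostrophe is a non-letter but not whitespace.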
+ 
+ [[Anchor(StandardTokenizer)]]
+ ==== solr.StandardTokenizerFactory ====
+ 
+ Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
+ 
+ A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The !StandardFilter is currently the only Lucene filter that utilizes token types.
+ 
+ Some token types are number, alphanumeric, email, acronym, URL, etc.
+ 
+   Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
+ 
+ [[Anchor(HTMLStripWhitespaceTokenizer)]]
+ ==== solr.HTMLStripWhitespaceTokenizerFactory ====
+ 
+ Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer.
+ 
+ See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping.
  
  ==== solr.HTMLStripStandardTokenizerFactory ====
  
  Strips HTML from the input stream and passes the result to a !StandardTokenizer.
  
- See {{{solr.HTMLStripWhitespaceTokenizerFactory}}} for details on HTML stripping.
+ See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping.
  
  ==== solr.PatternTokenizerFactory ====