Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/08/07 04:57:48 UTC
[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by KojiSekiguchi
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by KojiSekiguchi:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
------------------------------------------------------------------------------
==== solr.HTMLStripCharFilterFactory ====
+ Creates `org.apache.solr.analysis.HTMLStripCharFilter`. HTMLStripCharFilter strips HTML from the input stream and passes the result on to either another CharFilter or the Tokenizer.
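As a rough illustration of what the filter does (this is only a sketch, not the actual Lucene implementation, which also handles comments, script/style bodies, character entities, and offset correction), HTML stripping can be approximated in Python as:

```python
import re

def html_strip(text):
    # Naive sketch of HTMLStripCharFilter: drop anything that looks like
    # an HTML/XML tag, keeping the surrounding text content.
    return re.sub(r"<[^>]*>", "", text)

html_strip('hello <td height=22 nowrap align="left">')  # "hello "
```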
- Creates `org.apache.solr.analysis.HTMLStripCharFilter`.
-
- === TokenizerFactories ===
-
- Solr provides the following !TokenizerFactories (Tokenizers and !TokenFilters):
-
- ==== solr.LetterTokenizerFactory ====
-
- Creates `org.apache.lucene.analysis.LetterTokenizer`.
-
- Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
-
- Example: `"I can't" ==> "I", "can", "t"`
-
- [[Anchor(WhitespaceTokenizer)]]
- ==== solr.WhitespaceTokenizerFactory ====
-
- Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.
-
- Creates tokens of characters separated by splitting on whitespace.
-
- ==== solr.LowerCaseTokenizerFactory ====
-
- Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.
-
- Creates tokens by lowercasing all letters and dropping non-letters.
-
- Example: `"I can't" ==> "i", "can", "t"`
-
- [[Anchor(StandardTokenizer)]]
- ==== solr.StandardTokenizerFactory ====
-
- Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
-
- A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The !StandardFilter is currently the only Lucene filter that utilizes token types.
-
- Some token types are number, alphanumeric, email, acronym, URL, etc. —
-
- Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
-
- [[Anchor(HTMLStripWhitespaceTokenizer)]]
- ==== solr.HTMLStripWhitespaceTokenizerFactory ====
-
- Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer.
HTML stripping features:
 * The input need not be an HTML document, as only constructs that look like HTML will be removed.
@@ -154, +111 @@
|| hello <td height=22 nowrap align="left"> || hello ||
|| a<b A Alpha&Omega Ω || a<b A Alpha&Omega Ω ||
+ === TokenizerFactories ===
+
+ Solr provides the following !TokenizerFactories:
+
+ ==== solr.LetterTokenizerFactory ====
+
+ Creates `org.apache.lucene.analysis.LetterTokenizer`.
+
+ Creates tokens consisting of strings of contiguous letters. Any non-letter characters will be discarded.
+
+ Example: `"I can't" ==> "I", "can", "t"`
+
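The example above can be approximated in Python (a sketch only; the real !LetterTokenizer uses `Character.isLetter`, which covers all Unicode letters, not just ASCII):

```python
import re

def letter_tokenize(text):
    # Runs of contiguous ASCII letters become tokens; everything else
    # (apostrophes, digits, whitespace) is discarded.
    return re.findall(r"[A-Za-z]+", text)

letter_tokenize("I can't")  # ["I", "can", "t"]
```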
+ [[Anchor(WhitespaceTokenizer)]]
+ ==== solr.WhitespaceTokenizerFactory ====
+
+ Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.
+
+ Creates tokens by splitting the input on whitespace.
+
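A minimal Python sketch of this behavior (the real !WhitespaceTokenizer streams the input rather than splitting a string, but the token boundaries are the same):

```python
def whitespace_tokenize(text):
    # str.split() with no argument splits on any run of whitespace;
    # punctuation stays attached to its token, unlike LetterTokenizer.
    return text.split()

whitespace_tokenize("I can't")  # ["I", "can't"]
```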
+ ==== solr.LowerCaseTokenizerFactory ====
+
+ Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.
+
+ Creates tokens by lowercasing all letters and dropping non-letters.
+
+ Example: `"I can't" ==> "i", "can", "t"`
+
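Sketched in Python under the same ASCII-letters assumption as the !LetterTokenizer sketch above:

```python
import re

def lowercase_tokenize(text):
    # Same tokenization as LetterTokenizer, with each token lowercased.
    return [tok.lower() for tok in re.findall(r"[A-Za-z]+", text)]

lowercase_tokenize("I can't")  # ["i", "can", "t"]
```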
+ [[Anchor(StandardTokenizer)]]
+ ==== solr.StandardTokenizerFactory ====
+
+ Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
+
+ A good general-purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The !StandardFilter is currently the only Lucene filter that utilizes token types.
+
+ Some token types are number, alphanumeric, email, acronym, URL, etc.
+
+ Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
+
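A very rough Python sketch of typed tokens for the example above. The real !StandardTokenizer is generated from a JFlex grammar and also recognizes NUM, EMAIL, HOST, CJ, and other types; only the three types in the example are sketched here:

```python
import re

# Ordered alternatives: more specific patterns must come first.
TOKEN_PATTERNS = [
    ("ACRONYM", r"(?:[A-Za-z]\.){2,}"),      # e.g. I.B.M.
    ("APOSTROPHE", r"[A-Za-z]+'[A-Za-z]+"),  # e.g. can't
    ("ALPHANUM", r"[A-Za-z0-9]+"),
]

def standard_tokenize(text):
    # Build one alternation of named groups; m.lastgroup reports which
    # alternative matched, giving each token its type label.
    pattern = "|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS)
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, text)]

standard_tokenize("I.B.M. cat's can't")
# [("ACRONYM", "I.B.M."), ("APOSTROPHE", "cat's"), ("APOSTROPHE", "can't")]
```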
+ [[Anchor(HTMLStripWhitespaceTokenizer)]]
+ ==== solr.HTMLStripWhitespaceTokenizerFactory ====
+
+ Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer.
+
+ See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping.
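The two-stage pipeline can be sketched in Python by composing the naive strip-then-split steps from the sketches above (again an approximation, not the Lucene implementation):

```python
import re

def html_strip_whitespace_tokenize(text):
    # Stage 1: naively strip tag-like constructs.
    # Stage 2: split the remainder on whitespace.
    return re.sub(r"<[^>]*>", "", text).split()

html_strip_whitespace_tokenize("<p>hello world</p>")  # ["hello", "world"]
```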
==== solr.HTMLStripStandardTokenizerFactory ====
Strips HTML from the input stream and passes the result to a !StandardTokenizer.
- See {{{solr.HTMLStripWhitespaceTokenizerFactory}}} for details on HTML stripping.
+ See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping.
==== solr.PatternTokenizerFactory ====