Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/15 23:48:58 UTC
[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by SteveRowe
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "AnalyzersTokenizersTokenFilters" page has been changed by SteveRowe.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=98&rev2=99
--------------------------------------------------
HTML stripping examples:
||my <a href="www.foo.bar">link</a> ||my link ||
- ||<br>hello<!--comment--> ||hello ||
+ || <br>hello<!--comment--> ||hello ||
||hello<script><-- f('<--internal--></script>'); --></script> ||hello ||
||if a<b then print a; ||if a<b then print a; ||
||hello <td height=22 nowrap align="left"> ||hello ||
- ||a<b A Alpha&Omega Ω ||a<b A Alpha&Omega Ω ||
+ ||a<b &#65; Alpha&Omega Ω ||a<b A Alpha&Omega Ω ||
@@ -180, +180 @@
A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The !StandardFilter is currently the only Lucene filter that utilizes token types.
+ ||'''Solr Version'''||'''Behavior'''||
+ ||pre-3.1||Token types include number, alphanumeric, email, acronym, URL, etc.<<BR>><<BR>>Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`||
+ || <!> [[Solr3.1]]||Word boundary rules from [[http://unicode.org/reports/tr29/#Word_Boundaries|Unicode standard annex UAX#29]]<<BR>>Token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`<<BR>><<BR>>Example: `"I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM:"8.5", ALPHANUM:"can't"`||
+
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
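
As an illustration of the `maxTokenLength` argument in the table above, a schema.xml field type might be configured roughly as follows. This is a minimal sketch: the field type name `text_std` is hypothetical, and the value shown only restates the documented default.

```xml
<!-- Hypothetical field type; only the factory class and the maxTokenLength argument come from the page. -->
<fieldType name="text_std" class="solr.TextField">
  <analyzer>
    <!-- maxTokenLength="255" restates the default; tokens longer than this are silently ignored. -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
```

Setting the attribute explicitly is only needed when a value other than the default is wanted.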
+
+ <<Anchor(ClassicTokenizer)>>
+
+ === solr.ClassicTokenizerFactory ===
+ <!> [[Solr3.1]]
+
+ Creates `org.apache.lucene.analysis.standard.ClassicTokenizer`.
+
+ This tokenizer preserves the pre-Solr 3.1 behavior of !StandardTokenizer: a good general-purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware. The !StandardFilter is currently the only Lucene filter that utilizes token types.
+
Token types include number, alphanumeric, email, acronym, URL, etc.
. Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
+
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
+
+
+ <<Anchor(UAX29URLEmailTokenizer)>>
+
+ === solr.UAX29URLEmailTokenizerFactory ===
+ <!> [[Solr3.1]]
+
+ Creates `org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer`.
+
+ Like !StandardTokenizer, this tokenizer implements the word boundary rules from [[http://unicode.org/reports/tr29/#Word_Boundaries|Unicode standard annex UAX#29]]. In addition, this tokenizer recognizes: full URLs using the `file://`, `http(s)://`, and `ftp://` schemes; hostnames with a registered TLD (top-level domain, e.g. ".com"); IPv4 and IPv6 addresses; and e-mail addresses.
+
+ In addition to the token types output by !StandardTokenizer from [[Solr3.1]] onward, !UAX29URLEmailTokenizer can also output `<URL>` and `<EMAIL>` token types.
+
+ . Example: `"Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"`
+ . `==> ALPHANUM:"Visit", URL:"http://accarol.com/contact.htm?from=external&a=10", ALPHANUM:"or", ALPHANUM:"e-mail", EMAIL:"bob.cratchet@accarol.com"`
+
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
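
A field type using this tokenizer could be declared along these lines. This is a sketch under assumptions: the field type name `text_url_email` is hypothetical, and `maxTokenLength` is omitted so the documented default of 255 applies.

```xml
<!-- Hypothetical field type; the factory class name is the one documented in this section. -->
<fieldType name="text_url_email" class="solr.TextField">
  <analyzer>
    <!-- Per the description above, URLs and e-mail addresses are emitted as single tokens of type <URL> and <EMAIL>. -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Any type-aware filters would be added after the tokenizer inside the same `<analyzer>` element.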
+
<<Anchor(HTMLStripWhitespaceTokenizer)>>