Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2010/12/15 23:48:58 UTC

[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by SteveRowe

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by SteveRowe.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=98&rev2=99

--------------------------------------------------

  
  HTML stripping examples:
  ||my <a href="www.foo.bar">link</a> ||my link ||
- ||<br>hello<!--comment--> ||hello ||
+ || <br>hello<!--comment--> ||hello ||
  ||hello<script><-- f('<--internal--></script>'); --></script> ||hello ||
  ||if a<b then print a; ||if a<b then print a; ||
  ||hello <td height=22 nowrap align="left"> ||hello ||
- ||a<b &#65 Alpha&Omega Ω ||a<b A Alpha&Omega Ω ||
+ ||a<b &amp;#65; Alpha&Omega Ω ||a<b A Alpha&Omega Ω ||
  
  
  
@@ -180, +180 @@

  
  A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The !StandardFilter is currently the only Lucene filter that utilizes token types.
  
+ ||'''Solr Version'''||'''Behavior'''||
+ ||pre-3.1||Some token types are number, alphanumeric, email, acronym, URL, etc. —<<BR>><<BR>>Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`||
+ || <!> [[Solr3.1]]||Word boundary rules from [[http://unicode.org/reports/tr29/#Word_Boundaries|Unicode standard annex UAX#29]]<<BR>>Token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`<<BR>><<BR>>Example: `"I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM:"8.5", ALPHANUM:"can't"`||
+ 
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
+ 
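As a rough illustration of the `maxTokenLength` argument in the table above, a `schema.xml` field type might be configured along the lines of the sketch below. The field type name `text_standard` and the limit of 1024 are placeholders chosen for illustration; neither comes from this page.

```xml
<!-- Hypothetical field type: the name and the limit of 1024 are illustrative only. -->
<fieldType name="text_standard" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokens longer than maxTokenLength are silently ignored; the default is 255. -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
  </analyzer>
</fieldType>
```

Leaving the attribute out keeps the default of 255.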
+ <<Anchor(ClassicTokenizer)>>
+ 
+ === solr.ClassicTokenizerFactory ===
+ <!> [[Solr3.1]]
+ 
+ Creates `org.apache.lucene.analysis.standard.ClassicTokenizer`.
+ 
+ This tokenizer preserves !StandardTokenizer's behavior pre-Solr 3.1: A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.  The !StandardFilter is currently the only Lucene filter that utilizes token types.
+ 
  Some token types are number, alphanumeric, email, acronym, URL, etc. —
  
   . Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
+ 
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
+ 
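A field that depends on the pre-3.1 token types (for example `ACRONYM` or `APOSTROPHE`) could presumably keep that behavior by naming this factory as its tokenizer. A minimal sketch, assuming a hypothetical field type called `text_classic`:

```xml
<!-- Hypothetical field type: only the tokenizer class is taken from this page. -->
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ClassicTokenizer preserves StandardTokenizer's pre-Solr 3.1 behavior. -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>
</fieldType>
```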
+ 
+ <<Anchor(UAX29URLEmailTokenizer)>>
+ 
+ === solr.UAX29URLEmailTokenizerFactory ===
+ <!> [[Solr3.1]]
+ 
+ Creates `org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer`.
+ 
+ Like !StandardTokenizer, this tokenizer implements the word boundary rules from [[http://unicode.org/reports/tr29/#Word_Boundaries|Unicode standard annex UAX#29]].  In addition, this tokenizer recognizes: full URLs using the `file://`, `http(s)://`, and `ftp://` schemes; hostnames with a registered TLD (top-level domain, e.g. ".com"); IPv4 and IPv6 addresses; and e-mail addresses.
+ 
+ In addition to the token types output by !StandardTokenizer from [[Solr3.1]] onward, !UAX29URLEmailTokenizer can also output `<URL>` and `<EMAIL>` token types.
+ 
+  . Example: `"Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"` 
+  . `==> ALPHANUM:"Visit", URL: "http://accarol.com/contact.htm?from=external&a=10", ALPHANUM:"or", ALPHANUM:"e-mail", EMAIL:"bob.cratchet@accarol.com"`
+ 
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxTokenLength ||255 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-2188|SOLR-2188]]<<BR>>Tokens longer than `maxTokenLength` are silently ignored. ||
+ 
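To keep URLs and e-mail addresses as single tokens, as in the example above, a field type could use this factory in place of `solr.StandardTokenizerFactory`. The sketch below assumes a hypothetical field type named `text_url_email`; only the tokenizer class and the `maxTokenLength` argument come from this page.

```xml
<!-- Hypothetical field type: the name is illustrative only. -->
<fieldType name="text_url_email" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emits <URL> and <EMAIL> token types in addition to the StandardTokenizer types.
         URLs longer than maxTokenLength (default 255) are silently ignored. -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
```

Since long URLs with query strings can exceed the default limit, the value may need to be raised for data of that kind.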
  
  <<Anchor(HTMLStripWhitespaceTokenizer)>>