Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/07/08 08:56:31 UTC
[Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by Bill Bell
The "AnalyzersTokenizersTokenFilters" page has been changed by Bill Bell:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=121&rev2=122
=== solr.HTMLStripCharFilterFactory ===
Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML from the input stream and passes the result to the next `CharFilter` or, if there is none, to the `Tokenizer`. Like other CharFilters, it is specified using a <charFilter> tag and must come before the <tokenizer>. An example:
+
{{{
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
@@ -116, +117 @@
<filter class="solr.StandardFilterFactory"/>
</analyzer>
}}}
-
HTML stripping features:
* The input need not be an HTML document as only constructs that look like HTML will be removed.
@@ -134, +134 @@
* terminating '`;`' is mandatory to avoid false matches on something like "`Alpha&Omega Corp`"
HTML stripping examples:
- ||{{{my <a href="www.foo.bar">link</a> }}}||`my link `||
+ ||{{{my <a href="www.foo.bar">link</a> }}} ||`my link ` ||
- ||{{{<br>hello<!--comment--> }}}||`hello `||
+ ||{{{<br>hello<!--comment--> }}} ||`hello ` ||
- ||{{{hello<script><!-- f('<!--internal--></script>'); --></script> }}}||`hello `||
+ ||{{{hello<script><!-- f('<!--internal--></script>'); --></script> }}} ||`hello ` ||
- ||{{{if a<b then print a; }}}||`if a<b then print a; `||
+ ||{{{if a<b then print a; }}} ||`if a<b then print a; ` ||
- ||{{{hello <td height=22 nowrap align="left"> }}}||`hello `||
+ ||{{{hello <td height=22 nowrap align="left"> }}} ||`hello ` ||
- ||{{{a<b A Alpha&Omega O}}} ||`a<b A Alpha&Omega O `||
+ ||{{{a<b A Alpha&Omega O}}} ||`a<b A Alpha&Omega O ` ||
- ||{{{México}}}||`México`||
+ ||{{{México}}} ||`México` ||
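
As a hedged illustration of the ordering rule described above (the <charFilter> must come before the <tokenizer>), a complete field type might look like the following sketch. The field type name `text_html` and the choice of `solr.StandardTokenizerFactory` and `solr.LowerCaseFilterFactory` are assumptions made for illustration; they are not taken from the page being changed.

{{{
<!-- Hypothetical field type: the name and the tokenizer/filter choices are illustrative assumptions -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Char filters run first, on the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- The tokenizer then receives text with the markup already removed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
}}}

With a field type along these lines, input such as `my <a href="www.foo.bar">link</a>` would be expected to reach the tokenizer as `my link`, consistent with the examples table above.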