Posted to solr-user@lucene.apache.org by Koji Sekiguchi <ko...@r.email.ne.jp> on 2009/08/07 05:04:22 UTC

Re: Multi tokenizer

Chris Hostetter wrote:
> : I need to tokenize my field on whitespaces, html, punctuation, apostrophe
>
> : but if I use HTMLStripStandardTokenizerFactory it strips only html.... 
> : but no apostrophes
>
> you might consider using one of the HTML Tokenizers, and then use a 
> PatternReplaceFilterFactory ... or if you know Java, write a 
> simple Tokenizer that uses the HTMLStripReader.
>
> in the long run, changing the HTMLStripReader to be usable as a 
> "CharFilter" so it can work with any Tokenizer is probably the way we'll 
> go -- but i don't think anyone has started working on a patch for that.
>
>   
I opened:
https://issues.apache.org/jira/browse/SOLR-1343
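
In the meantime, the first suggestion quoted above might look roughly like
this in schema.xml. This is an untested sketch, not a verified config: the
field type name "text_html" and the apostrophe-only pattern are assumptions
made for illustration. Note that PatternReplaceFilterFactory only rewrites
text inside each token; it does not split a token in two.

```xml
<!-- Hypothetical field type; the name "text_html" is an assumption. -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Strips HTML markup, then splits the remaining text on whitespace. -->
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <!-- Removes every apostrophe from each token (assumed pattern). -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

If the goal is to break tokens apart at apostrophes rather than delete them,
a filter that splits tokens would be needed instead of a replace filter.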

Koji