Posted to solr-user@lucene.apache.org by Andy <an...@yahoo.com> on 2010/10/05 07:21:10 UTC

Differences between FilterFactory and TokenizerFactory?

There are EdgeNGramFilterFactory & EdgeNGramTokenizerFactory.

Likewise there are StandardFilterFactory & StandardTokenizerFactory.

LowerCaseFilterFactory & LowerCaseTokenizerFactory.

Seems like they always come in pairs. 

What are the differences between FilterFactory and TokenizerFactory? When should I use one as opposed to the other?

Thanks

Re: Differences between FilterFactory and TokenizerFactory?

Posted by Ahmet Arslan <io...@yahoo.com>.
> There are EdgeNGramFilterFactory
> & EdgeNGramTokenizerFactory.
> 
> Likewise there are StandardFilterFactory &
> StandardTokenizerFactory.
> 
> LowerCaseFilterFactory & LowerCaseTokenizerFactory.
> 
> Seems like they always come in pairs. 
> 
> What are the differences between FilterFactory and
> TokenizerFactory? When should I use one as opposed to the
> other?

A tokenizer breaks the input text into words/tokens. Its input is a Reader, and an Analyzer contains exactly one tokenizer. For example, StandardTokenizer strips punctuation and recognizes e-mail addresses.

TokenFilters operate on the output of the tokenizer. Their input is a stream of words/tokens rather than raw text, and an Analyzer may chain any number of them.
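
To make the division of labour concrete, here is a minimal sketch in plain Python. It is not the Lucene API: whitespace_tokenizer, lowercase_filter, stop_filter and analyze are hypothetical stand-ins that only mimic the shape of an analysis chain (one tokenizer that reads raw text, followed by zero or more filters that read tokens).

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-ins for illustration only; none of this is Lucene/Solr code.
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Tokenizer: consumes raw text and emits tokens."""
    yield from text.split()


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Token filter: consumes tokens and emits lowercased tokens."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Token filter: drops a few example stop words (list is made up)."""
    stop_words = {"the", "a", "of"}
    for token in tokens:
        if token not in stop_words:
            yield token


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Analyzer: exactly one tokenizer, then the filters in order."""
    stream: Iterable[str] = whitespace_tokenizer(text)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)


tokens = analyze("The Differences of Tokenizers", [lowercase_filter, stop_filter])
print(tokens)
```

Note that the order of the filters matters in this sketch: the stop filter only matches "The" because the lowercase filter ran before it.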

LowerCaseTokenizerFactory is equivalent to the combination LetterTokenizer + LowerCaseFilter.
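
A rough illustration of that equivalence, again in plain Python rather than Lucene code. The three functions below are hypothetical models: letter_tokenizer keeps maximal runs of letters, lowercase_filter lowercases each token, and lowercase_tokenizer does both in a single pass over the characters. Using str.isalpha as the "is a letter" test is an assumption of the sketch.

```python
from itertools import groupby
from typing import List


def letter_tokenizer(text: str) -> List[str]:
    """Model of a letter tokenizer: tokens are maximal runs of letters."""
    return ["".join(run) for is_letter, run in groupby(text, key=str.isalpha) if is_letter]


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Model of a lowercase filter: a second pass over the tokens."""
    return [token.lower() for token in tokens]


def lowercase_tokenizer(text: str) -> List[str]:
    """Model of a combined tokenizer: split on non-letters and lowercase in one pass."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:  # flush the last token
        tokens.append("".join(current))
    return tokens


text = "I can't find Solr-1.4 docs"
two_step = lowercase_filter(letter_tokenizer(text))
one_step = lowercase_tokenizer(text)
print(two_step)
print(one_step)
```

For this input both variants yield the same tokens; the combined version simply avoids a second pass over the token stream.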

EdgeNGramTokenizerFactory can be thought of as KeywordTokenizer + EdgeNGramFilterFactory.
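
The same idea as a small Python sketch (hypothetical models, not Lucene code; the min_gram and max_gram parameter names are made up for the example). It also shows when the filter is the better choice: placed after a tokenizer that splits on whitespace, it produces edge n-grams per word instead of only for the start of the whole input.

```python
from typing import List


def keyword_tokenizer(text: str) -> List[str]:
    """Model of a keyword tokenizer: the whole input is a single token."""
    return [text]


def edge_ngram_filter(tokens: List[str], min_gram: int, max_gram: int) -> List[str]:
    """Model of an edge n-gram filter: front prefixes of every incoming token."""
    grams: List[str] = []
    for token in tokens:
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:size])
    return grams


def edge_ngram_tokenizer(text: str, min_gram: int, max_gram: int) -> List[str]:
    """Model of an edge n-gram tokenizer: front prefixes of the raw input."""
    return [text[:size] for size in range(min_gram, min(max_gram, len(text)) + 1)]


text = "apache solr"
as_tokenizer = edge_ngram_tokenizer(text, 1, 2)
as_keyword_plus_filter = edge_ngram_filter(keyword_tokenizer(text), 1, 2)
per_word = edge_ngram_filter(text.split(), 1, 2)  # whitespace split, then filter
print(as_tokenizer, as_keyword_plus_filter, per_word)
```

In this model the tokenizer and the keyword + filter combination agree, while the per-word variant additionally emits prefixes of "solr".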

For example, if your analyzer chain contains the LetterTokenizer + LowerCaseFilter combination, you can replace the pair with LowerCaseTokenizerFactory for a performance gain, since it splits and lowercases in a single pass.