Posted to java-user@lucene.apache.org by Dan Armbrust <da...@gmail.com> on 2005/08/08 16:43:53 UTC
Analyzer question
It is my understanding that the StandardAnalyzer will remove underscores
- so "some_word" will be indexed as 'some' and 'word'.
I want to keep the underscores, so I was thinking of changing over to an
Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter, and StopFilter.
What other tokenizing magic will I lose by changing away from the
StandardAnalyzer?
Thanks,
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Analyzer question
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 8, 2005, at 10:43 AM, Dan Armbrust wrote:
> It is my understanding that the StandardAnalyzer will remove
> underscores - so "some_word" will be indexed as 'some' and 'word'.
>
> I want to keep the underscores, so I was thinking of changing over
> to an Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter,
> and StopFilter.
>
> What other tokenizing magic will I lose by changing away from the
> StandardAnalyzer?
The best thing you can do is set up a test environment to try out
sample text with various analyzers. Lucene in Action's source code
(http://www.lucenebook.com) comes with such a demo that you can
easily tweak. Here's a sample of running "ant AnalyzerDemo":
[echo] Running lia.analysis.AnalyzerDemo...
[java] Analyzing "some_word"
[java] WhitespaceAnalyzer:
[java] [some_word]
[java] SimpleAnalyzer:
[java] [some] [word]
[java] StopAnalyzer:
[java] [some] [word]
[java] StandardAnalyzer:
[java] [some] [word]
[java] SnowballAnalyzer:
[java] [some] [word]
[java] SnowballAnalyzer:
[java] [some] [word]
[java] SnowballAnalyzer:
[java] [some] [word]
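
To get a feel for the chain proposed in the question (split on whitespace, lowercase, drop stop words) without setting up the demo, here is a rough plain-Java sketch. It does not use Lucene at all: the class name UnderscoreChainSketch, the analyze method, and the short stop list are made up for illustration, and the real WhitespaceTokenizer, LowerCaseFilter, and StopFilter may differ in details. It only approximates the idea of the chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class UnderscoreChainSketch {
    // Hypothetical stop list for illustration only; not Lucene's default list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    // Split on whitespace only, lowercase each token, then drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // splitting an empty input yields one empty string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("some_word"));
        System.out.println(analyze("The some_word, and Another_Word."));
    }
}
```

In this sketch the underscore survives, but so does any punctuation attached to a word: "some_word," comes out with its trailing comma, because nothing other than whitespace separates tokens. That is worth checking against your own sample text before switching analyzers.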