Posted to java-user@lucene.apache.org by Dan Armbrust <da...@gmail.com> on 2005/08/08 16:43:53 UTC

Analyzer question

It is my understanding that the StandardAnalyzer will remove underscores 
- so "some_word" would be indexed as 'some' and 'word'.

I want to keep the underscores, so I was thinking of changing over to an 
Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter, and StopFilter.

What other tokenizing magic will I lose by changing away from the 
StandardAnalyzer?

Thanks,

Dan

-- 
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Analyzer question

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Aug 8, 2005, at 10:43 AM, Dan Armbrust wrote:
> It is my understanding that the StandardAnalyzer will remove  
> underscores - so "some_word" would be indexed as 'some' and 'word'.
>
> I want to keep the underscores, so I was thinking of changing over  
> to an Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter,  
> and StopFilter.
>
> What other tokenizing magic will I lose by changing away from the  
> StandardAnalyzer?

The best thing you can do is set up a test environment to try out  
sample text with various analyzers.  Lucene in Action's source code  
(http://www.lucenebook.com) comes with such a demo that you can  
easily tweak.  Here's a sample of running "ant AnalyzerDemo":

      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "some_word"
      [java]   WhitespaceAnalyzer:
      [java]     [some_word]

      [java]   SimpleAnalyzer:
      [java]     [some] [word]

      [java]   StopAnalyzer:
      [java]     [some] [word]

      [java]   StandardAnalyzer:
      [java]     [some] [word]

      [java]   SnowballAnalyzer:
      [java]     [some] [word]

      [java]   SnowballAnalyzer:
      [java]     [some] [word]

      [java]   SnowballAnalyzer:
      [java]     [some] [word]
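
For readers without the book's source at hand, the difference in the
output above can be approximated in plain Java. This is only a rough
sketch and does not call Lucene at all: the class TokenizerSketch, its
method names, and the small stop list are hypothetical stand-ins chosen
for illustration. It contrasts splitting on whitespace with splitting at
every non-letter character, then shows the whitespace, lowercase, and
stop-word chain described in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of this is Lucene's API.
public class TokenizerSketch {
    // Assumed stop list for the example, NOT Lucene's built-in English stop words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Split on whitespace only, so "some_word" stays a single token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Split at every non-letter character, so '_' acts as a separator.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Whitespace split, then lowercase, then drop stop words.
    static List<String> whitespaceLowerStop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("some_word"));               // [some_word]
        System.out.println(letterTokens("some_word"));                   // [some, word]
        System.out.println(whitespaceLowerStop("The Some_Word index"));  // [some_word, index]
    }
}
```

A whitespace-only split keeps the underscore, but it also keeps any
punctuation attached to a word (for example a trailing comma), which is
worth checking against real sample text before switching analyzers.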
