Posted to java-user@lucene.apache.org by Clas Rydergren <cl...@hotmail.com> on 2003/09/06 00:23:48 UTC

Modify the StandardAnalyzer

Hi,

I have been experimenting with Lucene for a few hours, and now I'm looking 
for a solution to this:

When using the SimpleAnalyzer for indexing text, data like www.hotmail.com 
seems to be indexed as www, hotmail and com, which means that a search for 
"hotmail" will return a record. This is the behavior I am looking for! 
However, since SimpleAnalyzer does not index numbers by default, I would like 
to use the StandardAnalyzer. But StandardAnalyzer does not split the input 
stream at ".".

Ideally I should probably make my own analyzer, but that seems a bit 
complicated to me :(. What is the simplest possible modification that I 
need to make to the Lucene source to make the StandardAnalyzer split, for 
example, web addresses at "." into separately indexed words?

Can this be done by modifying StandardTokenizer.jj? How? What is 
the easiest way of getting such a modification into the "compiled" Lucene? Is 
there a need to recompile everything?

Appreciate all help!

regards
clas

Re: Modify the StandardAnalyzer

Posted by Incze Lajos <in...@mail.matav.hu>.

You can stack the two analyzers: first run the simple one, then the standard
one on its output.

incze
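
For illustration, the splitting rule asked about in this thread (break at "."
and other punctuation, but keep numbers) can be sketched in plain Java without
touching Lucene. This is not Lucene code: the class name LetterOrDigitSplitter
and its tokenize method are made up for this example. It treats every run of
letters or digits as one lowercased token. Inside Lucene, the same rule would
presumably live in a custom tokenizer (for example a CharTokenizer subclass
whose isTokenChar accepts letters and digits) wrapped in a small Analyzer;
treat that as an assumption to check against the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits text into lowercased
// tokens made of letters and digits; every other character is a separator.
class LetterOrDigitSplitter {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Still inside a token: collect the lowercased character.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Separator reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            // Emit the last token if the text did not end with a separator.
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [mail, www, hotmail, com, before, 2003]
        System.out.println(tokenize("Mail www.hotmail.com before 2003"));
    }
}
```

With this rule a search for "hotmail" would match text containing
www.hotmail.com, and numbers such as 2003 are kept as tokens. Unlike
StandardAnalyzer, this sketch does no stop-word removal and has no special
handling for e-mail addresses, acronyms or host names.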