Posted to java-user@lucene.apache.org by Clas Rydergren <cl...@hotmail.com> on 2003/09/06 00:23:48 UTC
Modify the StandardAnalyzer
Hi,
I have been experimenting with Lucene for a few hours, and now I'm looking
for a solution to this:
When using the SimpleAnalyzer for indexing text, data like www.hotmail.com
seems to be indexed as www, hotmail and com, which means that a search for
"hotmail" will return a record. This is the behavior I am looking for!
However, since SimpleAnalyzer does not index numbers by default, I would like
to use the StandardAnalyzer. But StandardAnalyzer does not split the input
stream at ".".
Ideally I should probably write my own analyzer, but that seems a bit
complicated to me :(. What is the simplest possible modification that I
need to make to the Lucene source to make the StandardAnalyzer split, for
example, web addresses at "." into separately indexed words?
Can this be done by modifying StandardTokenizer.jj? How? What is
the easiest way of getting such a modification into the "compiled" Lucene? Do
I need to recompile everything?
Appreciate all help!
regards
clas
Re: Modify the StandardAnalyzer
Posted by Incze Lajos <in...@mail.matav.hu>.
On Fri, Sep 05, 2003 at 10:23:48PM +0000, Clas Rydergren wrote:
> Hi,
>
> I have been experimenting with Lucene for a few hours, and now I'm looking
> for a solution to this:
>
> When using the SimpleAnalyzer for indexing text, data like www.hotmail.com
> seems to be indexed as www, hotmail and com, which means that a search for
> "hotmail" will return a record. This is the behavior I am looking for!
> However, since SimpleAnalyzer does not index numbers by default, I would like
> to use the StandardAnalyzer. But StandardAnalyzer does not split the input
> stream at ".".
>
> Ideally I should probably write my own analyzer, but that seems a bit
> complicated to me :(. What is the simplest possible modification that I
> need to make to the Lucene source to make the StandardAnalyzer split, for
> example, web addresses at "." into separately indexed words?
>
> Can this be done by modifying StandardTokenizer.jj? How? What is
> the easiest way of getting such a modification into the "compiled" Lucene? Do
> I need to recompile everything?
>
> Appreciate all help!
>
> regards
> clas
You can stack the two analyzers: first run the simple one, then the standard
one on its output.
incze
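
For reference, the tokenization asked for in the original question (split
at "." and other punctuation, but keep numbers) can be sketched in plain
Java. This is only an illustration of the rule, not Lucene code: the class
LetterOrDigitSplitter and its tokenize method are hypothetical names, and
lowercasing the tokens is an assumption made to mirror what SimpleAnalyzer
does. A custom tokenizer inside Lucene would need to apply the same
per-character test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone helper; it is NOT part of Lucene.
final class LetterOrDigitSplitter {

    // Splits text into lowercase tokens made of letters and digits only.
    // Every other character (".", "-", whitespace, ...) ends the current token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Letters AND digits are token characters, so numbers survive.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [mail, www, hotmail, com, before, 2003]
        System.out.println(tokenize("Mail www.hotmail.com before 2003"));
    }
}
```

With this rule a search term such as "hotmail" matches the middle part of
www.hotmail.com, and "2003" is still indexed as a token.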