Posted to java-user@lucene.apache.org by Rob Young <bu...@gmail.com> on 2005/10/27 18:13:56 UTC
Better analysis of hyphenated words
Hi,
I'm using StandardAnalyzer during indexing and I have noticed that it
splits hyphenated words in two, ditching the hyphen. This is messing up
some of my search results. I would like to keep using StandardAnalyzer
because it's very good on the whole; however, I would like to add an
extra term in these cases. I am fine doing everything except figuring
out when StandardTokenizer has split a hyphenated word. All I get is the
individual tokens with a type ALPHANUM. Can anyone think of a way I can
do this without having to dive into StandardTokenizer?
I have looked at the source for StandardTokenizer and I really really
really don't want to have to go there :/
Cheers
Rob
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Better analysis of hyphenated words
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On 27 Oct 2005, at 12:13, Rob Young wrote:
> I'm using StandardAnalyzer during indexing and I have noticed that
> it splits hyphenated words in two, ditching the hyphen. This is
> messing up some of my search results. I would like to keep using
> StandardAnalyzer because it's very good on the whole; however, I
> would like to add an extra term in these cases. I am fine doing
> everything except figuring out when StandardTokenizer has split a
> hyphenated word. All I get is the individual tokens with a type
> ALPHANUM. Can anyone think of a way I can do this without having to
> dive into StandardTokenizer?
>
> I have looked at the source for StandardTokenizer and I really
> really really don't want to have to go there :/
StandardTokenizer is a JavaCC grammar - and it's actually not that
complex, though JavaCC is a whole other technology to learn if you've
not done it before. Look at StandardTokenizer.jj, not .java.
You could pretty easily modify the .jj file to add the hyphen to the
alphanumeric tokens, then rebuild it using JavaCC (the Ant build file
for Lucene can do this for you once you have JavaCC).
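
To make the intended effect concrete, here is a minimal plain-Java sketch
of the tokenization that such a grammar change is aiming for. It is not
Lucene code and not the JavaCC grammar: the class name
HyphenAwareTokenizerSketch, the regular expression, and the choice to
emit the parts followed by the whole hyphenated word are all assumptions
made for illustration, and it only handles ASCII letters and digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a hypothetical stand-in, not Lucene's StandardTokenizer.
class HyphenAwareTokenizerSketch {

    // Assumption: a token is a run of ASCII letters/digits, optionally
    // joined by single hyphens ("e-mail", "state-of-the-art").
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (token.indexOf('-') >= 0) {
                // Emit the individual parts, as a hyphen-splitting
                // tokenizer would...
                for (String part : token.split("-")) {
                    terms.add(part);
                }
            }
            // ...and then the token itself, which for a hyphenated word
            // is the extra term that keeps the hyphen.
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("well-known e-mail client"));
    }
}
```

The point of the sketch is only that the hyphen has to be recognised
while the token is being matched; once the word has been cut into
separate tokens there is nothing left to tell you a hyphen was there.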
It won't be possible to achieve what you're after using
StandardTokenizer without modifying it - the damage is already done by
the time you see the output of StandardTokenizer.
Erik