
Posted to java-user@lucene.apache.org by Rob Young <bu...@gmail.com> on 2005/10/27 18:13:56 UTC

Better analysis of hyphenated words

Hi,

I'm using StandardAnalyzer during indexing and I have noticed that it 
splits hyphenated words in two, ditching the hyphen. This is messing up 
some of my search results. I would like to keep using StandardAnalyzer 
because it's very good on the whole, but I would like to add an 
extra term in these cases. I am fine doing everything except figuring 
out when StandardTokenizer has split a hyphenated word. All I get is the 
individual tokens with a type ALPHANUM. Can anyone think of a way I can 
do this without having to dive into StandardTokenizer?

I have looked at the source for StandardTokenizer and I really really 
really don't want to have to go there :/

Cheers
Rob

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Better analysis of hyphenated words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 27 Oct 2005, at 12:13, Rob Young wrote:
> I'm using StandardAnalyzer during indexing and I have noticed that  
> it splits hyphenated words in two, ditching the hyphen. This is  
> messing up some of my search results. I would like to keep using  
> StandardAnalyzer because it's very good on the whole, however I  
> would like to add an extra term in these cases. I am fine doing  
> everything except figuring out when StandardTokenizer has split a  
> hyphenated word. All I get is the individual tokens with a type  
> ALPHANUM. Can anyone think of a way I can do this without having to  
> dive into StandardTokenizer?
>
> I have looked at the source for StandardTokenizer and I really  
> really really don't want to have to go there :/

StandardTokenizer is a JavaCC grammar - and it's actually not that  
complex, though JavaCC is a whole other technology to learn if you've  
not done it before.  Look at StandardTokenizer.jj, not .java.

You could pretty easily modify the .jj file to add the hyphen to the  
alphanumeric tokens, then rebuild it using JavaCC (the Ant build file  
for Lucene can do this for you once you have JavaCC).

You won't be able to achieve what you're after with an unmodified  
StandardTokenizer - the hyphen is already gone by the time the tokens  
come out of it.
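
To make the target behaviour concrete, here is a minimal plain-Java sketch
of "keep the hyphenated word as one term and also emit its parts". It does
not use any Lucene API and is not the StandardTokenizer grammar: the class
name HyphenAwareSplit, the terms() method and the regular expression are
all made up for illustration, and the word rule is far simpler than what
the .jj file actually defines. It also ignores token positions and offsets,
which a real analyzer would need to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenAwareSplit {
    // Hypothetical word rule: runs of ASCII letters/digits, optionally joined
    // by single internal hyphens. Much simpler than the real grammar.
    private static final Pattern WORD =
        Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    /**
     * Returns lower-cased terms in order of appearance. A hyphenated word
     * yields the whole word first, followed by each of its parts.
     */
    public static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            out.add(word);                       // e.g. "well-known"
            if (word.indexOf('-') >= 0) {
                for (String part : word.split("-")) {
                    out.add(part);               // e.g. "well", "known"
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, well-known, well, known, e-mail, e, mail, client]
        System.out.println(terms("A well-known e-mail client"));
    }
}
```

The point of the sketch is only that the decision has to be made while the
hyphen is still visible in the text, which is why it belongs in the
tokenizer rather than in a filter applied to its output.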

     Erik
