Posted to java-user@lucene.apache.org by Michael Böckling <Mi...@dmc.de> on 2007/05/30 11:57:49 UTC

Re: Modifying StandardAnalyzer so that it also splits words after punctuation characters that are not followed by whitespace

OK, I've followed your advice and commented out some lines in the NUM
section. I just tried it, and it now works as expected and does what I
wanted it to do. Thanks a lot. The grammar looks scary, but it isn't that bad.
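
For anyone reading this thread later, the behaviour I was after can be illustrated with a small standalone sketch. This is not the StandardTokenizer.jj change itself, and it does not use Lucene at all: it only uses java.util.regex to show the intended splitting. The class name PunctuationSplitSketch and the exact punctuation set (the characters from the grammar's <P> rule plus ":" and ";") are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class PunctuationSplitSketch {

    // Split on whitespace and on an assumed punctuation set
    // ("_", "-", "/", ".", ",", ":", ";"), whether or not
    // whitespace follows the punctuation character.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[\\s_\\-/.,:;]+")) {
            // String.split can yield an empty leading element
            // when the input starts with a delimiter; skip it.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hostnames, IP addresses and "a,b" style input are all broken apart.
        System.out.println(split("See www.example.com,foo:bar 192.168.0.1"));
    }
}
```

The real analyzer also lowercases, removes stop words and so on; the sketch only covers the splitting.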

Thanks!

Regards,
Michael



> -----Original Message-----
> From: Steven Rowe [mailto:sarowe@syr.edu]
> Sent: Tuesday, May 29, 2007 19:54
> To: java-user@lucene.apache.org
> Subject: Re: Modifying StandardAnalyzer so that it also splits words
> after punctuation characters that are not followed by whitespace
> 
> 
> Hi Michael,
> 
> Michael Böckling wrote:
> > Hi folks!
> > 
> > The topic says it all: I want to modify the StandardAnalyzer so that
> > it also splits words after punctuation characters (.,: etc.) that are
> > NOT followed by a whitespace character, in addition to punctuation
> > characters that ARE followed by whitespace.
> > 
> > Of course I've looked at StandardTokenizer.jj, but I don't quite get
> > it. The recursive nature of the grammar bends my mind.
> > 
> > Can someone smarter than me help here?
> 
> Um, that probably disqualifies me, but anyway...
> 
> There are several regexes in StandardTokenizer.jj that generate tokens
> containing punctuation.  You should be able to selectively comment them
> out to achieve what you want:
> 
> 1. Acronyms:
> 
>   | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> 
> 2. Company names:
> 
>   | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> 
> 3. Email addresses:
> 
>   | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>     (("."|"-") <ALPHANUM>)+ >
> 
> 4. Hostnames:
> 
>   | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> 
> 5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
> 
>   | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>          | <HAS_DIGIT> <P> <ALPHANUM>
>          | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>          | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>          | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>          | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>           )
>     >
>   | <#P: ("_"|"-"|"/"|"."|",") >
>   | <#HAS_DIGIT:		  // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>     >
> 
> 
> Steve
> 
> -- 
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
