You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Michael Böckling <Mi...@dmc.de> on 2007/05/29 17:36:53 UTC

Modifying StandardAnalyzer so that it also splits words after pun ctuation characters that are not followed by whitespace

Hi folks!

The topic says it all: I want to modify the StandardAnalyzer so that it also
splits words after punctuation characters (.,: etc.) that are NOT followed
by a whitespace character, in addition to punctuation characters that ARE
followed by whitespace.

Of course i've looked at StandardTokenizer.jj, but I don't quite get it. The
recursive nature of the grammar bends my mind.

Can someone smarter than me help here? I'd be most thankful!
Regards,


Michael


-- 
Michael Böckling
Java Engineer
dmc digital media center GmbH 
Rommelstraße 11 
70376 Stuttgart (Germany) 
Telefon: +49 711 601747-0
Telefax: +49 711 601747-141 
E-Mail: Michael.Boeckling@dmc.de 
Internet: www.dmc.de 

Handelsregister: AG Stuttgart HRB 18974
Geschäftsführer: Andreas Magg, Daniel Rebhorn, Andreas Schwend

---------------------------------------------
Besseres E-Business.
dmc ist die kreative Vernetzung von Agentur, Systemhaus und Service. Seit
über 10 Jahren entwickeln und realisieren wir zukunftweisende und
erfolgreiche E-Business-Lösungen. Zu unseren langjährigen Kunden zählen
neckermann.de, Kodak und Telekom Training.

dmc auf Platz 8 im aktuellen New Media Service Ranking.
Als inhabergeführte und netzwerkunabhängige Agentur gehören wir mit einem
Umsatz von 13,50 Mio. Euro zu den Top 10 der erfolgreichsten New Media
Dienstleister in Deutschland.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Modifying StandardAnalyzer so that it also splits words after pun ctuation characters that are not followed by whitespace

Posted by Steven Rowe <sa...@syr.edu>.
Hi Michael,

Michael Böckling wrote:
> Hi folks!
> 
> The topic says it all: I want to modify the StandardAnalyzer so that it also
> splits words after punctuation characters (.,: etc.) that are NOT followed
> by a whitespace character, in addition to punctuation characters that ARE
> followed by whitespace.
> 
> Of course i've looked at StandardTokenizer.jj, but I don't quite get it. The
> recursive nature of the grammar bends my mind.
> 
> Can someone smarter than me help here?

Um, that probably disqualifies me, but anyway...

There are several regexes in StandardTokenizer.jj that generate tokens
containing punctuation.  You should be able to selectively comment them
out to achieve what you want:

1. Acronyms:

  | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

2. Company names:

  | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

3. Email addresses:

  | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
    (("."|"-") <ALPHANUM>)+ >

4. Hostnames:

  | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:

  | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
         | <HAS_DIGIT> <P> <ALPHANUM>
         | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
         | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
         | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
          )
    >
  | <#P: ("_"|"-"|"/"|"."|",") >
  | <#HAS_DIGIT:		  // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
    >


Steve

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Modifying StandardAnalyzer so that it also splits words after pun ctuation characters that are not followed by whitespace

Posted by Erick Erickson <er...@gmail.com>.
Well, one possibility is to do something simpler. Rather than
modifying StandardAnalyzer, modify the input stream. That is,
substitute spaces for punctuation NOT followed by whitespace
and then just let the analyzer handle the result.

For that matter, if you're going to alter the input stream
before giving it to the analyzer, you can then use pretty
simple analyzers unless you need some of the
other characteristics of StandardAnalyzer....

Or not....

Erick

On 5/29/07, Michael Böckling <Mi...@dmc.de> wrote:
>
> Hi folks!
>
> The topic says it all: I want to modify the StandardAnalyzer so that it
> also
> splits words after punctuation characters (.,: etc.) that are NOT followed
> by a whitespace character, in addition to punctuation characters that ARE
> followed by whitespace.
>
> Of course i've looked at StandardTokenizer.jj, but I don't quite get it.
> The
> recursive nature of the grammar bends my mind.
>
> Can someone smarter than me help here? I'd be most thankful!
> Regards,
>
>
> Michael
>
>
> --
> Michael Böckling
> Java Engineer
> dmc digital media center GmbH
> Rommelstraße 11
> 70376 Stuttgart (Germany)
> Telefon: +49 711 601747-0
> Telefax: +49 711 601747-141
> E-Mail: Michael.Boeckling@dmc.de
> Internet: www.dmc.de
>
> Handelsregister: AG Stuttgart HRB 18974
> Geschäftsführer: Andreas Magg, Daniel Rebhorn, Andreas Schwend
>
> ---------------------------------------------
> Besseres E-Business.
> dmc ist die kreative Vernetzung von Agentur, Systemhaus und Service. Seit
> über 10 Jahren entwickeln und realisieren wir zukunftweisende und
> erfolgreiche E-Business-Lösungen. Zu unseren langjährigen Kunden zählen
> neckermann.de, Kodak und Telekom Training.
>
> dmc auf Platz 8 im aktuellen New Media Service Ranking.
> Als inhabergeführte und netzwerkunabhängige Agentur gehören wir mit einem
> Umsatz von 13,50 Mio. Euro zu den Top 10 der erfolgreichsten New Media
> Dienstleister in Deutschland.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>