Posted to java-user@lucene.apache.org by Christian Schrader <sc...@evendi.de> on 2002/05/29 11:19:19 UTC

JavaCC Tokenizer

I need to construct a Tokenizer that tokenizes at word/number boundaries, so
that "IBM Deskstar IC35L060AVER07" would result in the following tokens:
IBM
Deskstar
IC
35
L
060
AVER
07

Has anybody solved this with the StandardTokenizer?

Christian


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: JavaCC Tokenizer

Posted by Peter Carlson <ca...@bookandhammer.com>.

Hi Christian,

You will need to create your own Tokenizer.
Use the StandardTokenizer.jj file as a guide and, instead of using a token
like

  // basic word: a sequence of digits & letters
  <ALPHANUM: (<LETTER>|<DIGIT>)+ >

Use

<ALPHAONLY: (<LETTER>)+>

And

<NUMONLY: (<DIGIT>)+>

I don't know what your patterns are, but this will help you out.
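
To check what splitting into letter-only and digit-only tokens would produce for the example string, here is a minimal plain-Java sketch. It is only an illustration: it does not use JavaCC or any Lucene API, the class name AlphaNumSplitter is hypothetical, and the regular expression (Unicode letters and decimal digits) is an assumption that only approximates the <LETTER> and <DIGIT> definitions in StandardTokenizer.jj.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: emits runs of letters and runs of
// digits as separate tokens, analogous to the ALPHAONLY and NUMONLY rules.
class AlphaNumSplitter {
    // First alternative: one or more letters. Second: one or more digits.
    // Assumption: \p{L} and \p{Nd} stand in for the grammar's <LETTER> and <DIGIT>.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\p{Nd}+");

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        // Each match is a maximal run of letters or of digits; anything else
        // (whitespace, punctuation) is skipped.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : split("IBM Deskstar IC35L060AVER07")) {
            System.out.println(token);
        }
    }
}
```

Running main on "IBM Deskstar IC35L060AVER07" prints the eight tokens from the original question, one per line. A real JavaCC-generated tokenizer would also have to carry token types and offsets, which this sketch leaves out.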

Also, you may have to change QueryParser.jj to do the same thing.

--Peter

On 5/29/02 2:19 AM, "Christian Schrader" <sc...@evendi.de> wrote:

> I need to construct a Tokenizer that tokenizes at word/number boundaries, so
> that "IBM Deskstar IC35L060AVER07" would result in the following tokens:
> IBM
> Deskstar
> IC
> 35
> L
> 060
> AVER
> 07
> 
> Has anybody solved this with the StandardTokenizer?
> 
> Christian
