Posted to java-user@lucene.apache.org by Christian Schrader <sc...@evendi.de> on 2002/05/29 11:19:19 UTC
JavaCC Tokenizer
I need to construct a Tokenizer that tokenizes at word/number boundaries, so
that "IBM Deskstar IC35L060AVER07" would result in the following tokens:
IBM
Deskstar
IC
35
L
060
AVER
07
Has anybody solved this with the StandardTokenizer?
Christian
--
To unsubscribe, e-mail: <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>
Re: JavaCC Tokenizer
Posted by Peter Carlson <ca...@bookandhammer.com>.
Hi Christian,
You will need to create your own Tokenizer.
Use the StandardTokenizer.jj file as a guide and, instead of using a token
like
// basic word: a sequence of digits & letters
<ALPHANUM: (<LETTER>|<DIGIT>)+ >
Use
<ALPHAONLY: (<LETTER>)+>
And
<NUMONLY: (<DIGIT>)+>
I don't know what your patterns are, but this will help you out.
Also, you may have to change the QueryParser.jj to do the same thing.
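
To check the expected token output before touching the grammar, the letter-run / digit-run split can be sketched in plain Java. This is only an illustration under assumptions: the class name AlphaNumSplitter and its split method are made up for this example and are not part of Lucene, and a regular expression stands in for the JavaCC-generated tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: splits text into runs of
// letters and runs of digits, mirroring the ALPHAONLY / NUMONLY idea.
class AlphaNumSplitter {

    // Either one or more letters, or one or more decimal digits.
    // Anything else (spaces, punctuation) is skipped between matches.
    private static final Pattern RUN = Pattern.compile("\\p{L}+|\\p{Nd}+");

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [IBM, Deskstar, IC, 35, L, 060, AVER, 07]
        System.out.println(split("IBM Deskstar IC35L060AVER07"));
    }
}
```

If this matches the tokens you want, the same two patterns should carry over to the .jj file; how they interact with the other token definitions there would still need to be checked.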
--Peter
On 5/29/02 2:19 AM, "Christian Schrader" <sc...@evendi.de> wrote:
> I need to construct a Tokenizer that tokenizes at word/number boundaries, so
> that "IBM Deskstar IC35L060AVER07" would result in the following tokens:
> IBM
> Deskstar
> IC
> 35
> L
> 060
> AVER
> 07
>
> Has anybody solved this with the StandardTokenizer?
>
> Christian