
Posted to java-user@lucene.apache.org by ma...@oksphere.com on 2002/08/19 18:45:08 UTC

StandardTokenizer and Unicode

Hi all,

Has anyone had any luck using StandardTokenizer for
Unicode text beyond the Latin-1 set? I have tried it on
Cyrillic (U+0400..U+04FF), and the characters don't seem
to get through, even though the Cyrillic range IS
included in StandardTokenizer.jj (it is part of the set
of Unicode characters used to define the Letter token).
If I set UNICODE_INPUT = true in StandardTokenizer.jj
(and turn off the USER_CHAR_STREAM = true setting), it
starts working perfectly.

So does that mean I have to maintain my own version of
StandardTokenizer to make Unicode input possible?
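
In case it helps anyone reproduce this: a minimal sketch of a check
that the test input really holds Cyrillic letters once it is a Java
String, so the input side can be ruled out before blaming the
grammar. It does not touch Lucene at all; the class and method names
(CyrillicCheck, isCyrillic, countCyrillicLetters) are made up for
illustration, and the sample string is just an example.

```java
// Sanity check, independent of Lucene: does the input contain Cyrillic
// letters (U+0400..U+04FF) by the time it is a Java String?
// CyrillicCheck and its methods are illustrative names, not Lucene API.
class CyrillicCheck {

    // True if c lies in the Cyrillic block U+0400..U+04FF.
    static boolean isCyrillic(char c) {
        return c >= '\u0400' && c <= '\u04FF';
    }

    // Counts the characters of s that are Cyrillic letters.
    static int countCyrillicLetters(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isCyrillic(c) && Character.isLetter(c)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The Russian word "Privet" spelled with escapes, so the encoding
        // of the source file does not matter, followed by an ASCII word.
        String sample = "\u041F\u0440\u0438\u0432\u0435\u0442 Lucene";
        System.out.println(countCyrillicLetters(sample) + " Cyrillic letters");
    }
}
```

If the count comes out as expected here but the same text yields no
tokens from StandardTokenizer, the input is fine and the problem is
somewhere on the tokenizer side.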

Boris Okner 
