Posted to regexp-user@jakarta.apache.org by Sanjay Chopra <sa...@sapient.com> on 2002/01/16 14:34:18 UTC
detecting UTF-8 characters
I am trying to write a validation function that detects any
UTF-8 characters.
Just to give the context: it's a user-driven program, and we would like to
detect when the user has entered any UTF-8 character vs. only ASCII
characters.
An expression something like this: "[a-zA-Z][\\w\\-]{2,31}" would allow the user to
enter only ASCII characters.
Is there a way, using regular expressions, to detect whether the user is
entering UTF-8 characters ... \u or \x or something of that sort?
Any help would be greatly appreciated.
many thanks
-sanjay
Re: detecting UTF-8 characters
Posted by Holger Stratmann <Ho...@cheerful.com>.
Hi Sanjay,
even though I'm not 100% sure what you're trying to do and why you're using
regular expressions for it, I think I can roughly guess and give you some hints:
\xNN refers to the character with hex code NN (e.g. \x20 = ASCII 32 =
space).
Therefore, you could use [\x20-\x7F] for ASCII 32-127 or [\x20-\xFF] for ANSI
32-255.
Likewise, [^\x20-\xFF] would match any character above 255 (or below
32).
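As a rough illustration of the two negated ranges above, here is a minimal sketch. It assumes the JDK's java.util.regex package rather than Jakarta Regexp, and the class and method names (RangeCheck, hasCharOutsideAscii, hasCharOutsideLatin1) are made up for this example:

```java
import java.util.regex.Pattern;

class RangeCheck {
    // Matches any single character outside \x20-\x7F (i.e. above 127 or below 32).
    private static final Pattern OUTSIDE_ASCII = Pattern.compile("[^\\x20-\\x7F]");

    // Matches any single character outside \x20-\xFF (i.e. above 255 or below 32).
    private static final Pattern OUTSIDE_LATIN1 = Pattern.compile("[^\\x20-\\xFF]");

    // True if the string contains at least one character outside 32-127.
    static boolean hasCharOutsideAscii(String s) {
        return OUTSIDE_ASCII.matcher(s).find();
    }

    // True if the string contains at least one character outside 32-255.
    static boolean hasCharOutsideLatin1(String s) {
        return OUTSIDE_LATIN1.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasCharOutsideAscii("hello world")); // false
        System.out.println(hasCharOutsideAscii("caf\u00E9"));    // true (U+00E9 is above 127)
        System.out.println(hasCharOutsideLatin1("caf\u00E9"));   // false (U+00E9 is below 256)
        System.out.println(hasCharOutsideLatin1("\u20AC"));      // true (U+20AC is above 255)
    }
}
```

Note that both patterns also flag control characters such as tab or newline, because those fall below 32.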
I hope that will help you solve your problem - some additional hints about your
question below:
> I am trying to write a validation function that would allow me to detect any
> UTF-8 characters
Just a clarification:
UTF-8 is just a "transfer" format: a space-efficient way of encoding Unicode
characters as bytes when transferring or saving them. There is no such thing as a
"UTF-8 character".
If there were, UTF-8 characters would be a superset of ASCII, and every ASCII
character would also be a UTF-8 character.
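To make the distinction between characters and their UTF-8 encoding concrete, here is a small sketch that counts the bytes a string occupies once encoded. The class and method names (Utf8Demo, utf8Length) are hypothetical, and StandardCharsets requires a reasonably modern JDK:

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // 1: ASCII characters stay one byte
        System.out.println(utf8Length("\u00E9")); // 2: e with acute accent (U+00E9)
        System.out.println(utf8Length("\u20AC")); // 3: euro sign (U+20AC)
    }
}
```

A Java String holds the characters themselves; the choice of UTF-8 only matters at the point where the text is turned into bytes, which is why the validation can work on char values alone.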
> Just to give the context -- its a user driven program and we would like to
> detect when the user has entered any UTF-8 character Vs. only ASCII
> characters.
By the way: if that's all you're trying to do, it will probably be MUCH more
efficient to just check each character
(like: for (int i = 0; i < s.length(); i++) {if (s.charAt(i) > 255) return i;}
return -1;)
Use 127 instead of 255 if you want to flag everything outside ASCII.
A regexp does nothing less than that; in fact it does much more and adds a lot of
overhead.
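The character loop above could be packaged roughly as follows. This is only a sketch: the names AsciiCheck, firstCharAbove and firstNonAscii are invented for illustration, and the limit is a parameter so the same loop serves both the 255 and the 127 cut-off:

```java
class AsciiCheck {
    // Index of the first character whose code is greater than limit, or -1 if none.
    static int firstCharAbove(String s, int limit) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > limit) {
                return i;
            }
        }
        return -1;
    }

    // Index of the first non-ASCII character (code above 127), or -1 if none.
    static int firstNonAscii(String s) {
        return firstCharAbove(s, 127);
    }

    public static void main(String[] args) {
        System.out.println(firstNonAscii("abc"));             // -1
        System.out.println(firstNonAscii("ab\u20AC"));         // 2
        System.out.println(firstCharAbove("caf\u00E9", 255));  // -1
    }
}
```

Returning the index rather than a boolean lets the caller point the user at the offending character.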
> Is there a way by way of expressions that I can detect if the user is
> entering UTF-8 characters ... \u or \x something of that sorts...
\uNNNN matches the Unicode character with hex code NNNN.
Btw: If you live in Europe, the euro symbol (€) is an excellent thing for
testing :-)
Easy to enter (if you live in Europe *g*), but a "strange" Unicode character in
Java (\u20AC)
> any help would be greatly appreciated..
HTH,
Holger