Posted to regexp-user@jakarta.apache.org by Sanjay Chopra <sa...@sapient.com> on 2002/01/16 14:34:18 UTC

detecting UTF-8 characters

I am trying to write a validation function that would allow me to detect any
UTF-8 characters

Just to give the context -- it's a user-driven program and we would like to
detect when the user has entered any UTF-8 character vs. only ASCII
characters.

An expression something like this: "[a-zA-Z][\\w\\-]{2,31}" would allow the
user to enter only ASCII characters.

Is there a way by way of expressions that I can detect if the user is
entering UTF-8 characters ... \u or \x something of that sorts...

any help would be greatly appreciated..

many thanks
-sanjay

Re: detecting UTF-8 characters

Posted by Holger Stratmann <Ho...@cheerful.com>.

Hi Sanjay,

even though I'm not 100% sure what you're trying to do and why you're using
regular expressions for it, I think I can roughly guess and give you some hints:

\xNN refers to the character with hex code NN (e.g. \x20 = ASCII 32 = space).
Therefore, you could use [\x20-\x7F] for ASCII 32-127 or [\x20-\xFF] for ANSI
32-255.
Likewise, [^\x20-\xFF] would match any character above 255 (or below 32).
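
As a rough illustration of the character classes above, here is a minimal
sketch. It assumes the JDK's java.util.regex package rather than Jakarta
Regexp, and the class and method names are hypothetical, chosen only for
this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not taken from the thread.
final class AsciiRegexCheck {
    // Matches any single character outside the range 0x20-0x7F.
    private static final Pattern OUTSIDE_ASCII = Pattern.compile("[^\\x20-\\x7F]");

    // Matches any single character outside the range 0x20-0xFF.
    private static final Pattern OUTSIDE_ANSI = Pattern.compile("[^\\x20-\\xFF]");

    // True if s contains a character below 0x20 or above 0x7F.
    static boolean hasCharOutsideAscii(String s) {
        return OUTSIDE_ASCII.matcher(s).find();
    }

    // True if s contains a character below 0x20 or above 0xFF.
    static boolean hasCharOutsideAnsi(String s) {
        return OUTSIDE_ANSI.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasCharOutsideAscii("hello world"));
        System.out.println(hasCharOutsideAscii("caf\u00e9"));
        System.out.println(hasCharOutsideAnsi("caf\u00e9"));
    }
}
```

Note that both classes also flag control characters such as tab, because
the ranges start at \x20.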

I hope that will help you solve your problem - some additional hints about your
question below:

> I am trying to write a validation function that would allow me to detect any
> UTF-8 characters

Just a clarification:
UTF-8 is only a "transfer" format: a special way of not wasting too much space
when transferring or saving Unicode characters. There is no such thing as a
"UTF-8 character".
If there were, the set would be a superset of ASCII, and every ASCII character
would also be a UTF-8 character.
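
To make the "transfer format" point concrete, the following sketch encodes a
few strings as UTF-8 and counts the resulting bytes. The class and method
names are hypothetical, and it assumes a JDK that provides
java.nio.charset.StandardCharsets (Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library mentioned in the thread.
final class Utf8LengthDemo {
    // Number of bytes needed to store s when it is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // ASCII letter: 1 byte
        System.out.println(utf8Length("\u00e9"));  // e with acute accent: 2 bytes
        System.out.println(utf8Length("\u20ac"));  // euro sign: 3 bytes
    }
}
```

An ASCII character takes the same single byte in UTF-8 as in ASCII, which is
why every ASCII string is also valid UTF-8.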

> Just to give the context -- it's a user-driven program and we would like to
> detect when the user has entered any UTF-8 character vs. only ASCII
> characters.

By the way: If that's all you're trying to do, it will probably be MUCH more
efficient if you just check each character
(like: for (int i = 0; i < s.length(); i++) {if (s.charAt(i) > 255) return i;}
return -1;)
A regexp does no less work than this loop; in fact it does much more and adds
a lot of overhead.
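
The per-character check could be written out as a complete method along
these lines. This is only a sketch: the class name, the method name and the
limit parameter are hypothetical additions. Passing 255 reproduces the check
above, while passing 127 would flag anything beyond 7-bit ASCII.

```java
// Hypothetical helper; the limit parameter is an addition for illustration.
final class CharScan {
    // Returns the index of the first char whose code is greater than limit,
    // or -1 if every char in s is within the limit.
    static int indexOfCharAbove(String s, int limit) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > limit) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfCharAbove("abc", 127));        // -1
        System.out.println(indexOfCharAbove("caf\u00e9", 127));  // 3
        System.out.println(indexOfCharAbove("caf\u00e9", 255));  // -1
    }
}
```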

> Is there a way by way of expressions that I can detect if the user is
> entering UTF-8 characters ... \u or \x something of that sorts...

\uNNNN matches the Unicode character with hex code NNNN
Btw: If you live in Europe, the EURO-symbol (€) is an excellent thing for
testing :-)
Easy to enter (if you live in Europe *g*), but a "strange" Unicode character in
Java (\u20AC)
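
A small sketch of the four-digit Unicode escape in a pattern, using the euro
sign as the test character. It assumes the JDK's java.util.regex package
rather than Jakarta Regexp, and all names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class UnicodeEscapeCheck {
    // The doubled backslash passes the escape to the regex engine
    // instead of letting the Java compiler translate it.
    private static final Pattern EURO = Pattern.compile("\\u20AC");

    // Any character above code point 0x7F, i.e. anything beyond 7-bit ASCII.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\u0000-\\u007F]");

    static boolean containsEuro(String s) {
        return EURO.matcher(s).find();
    }

    static boolean containsNonAscii(String s) {
        return NON_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEuro("price: 5 \u20ac"));
        System.out.println(containsNonAscii("price: 5 EUR"));
    }
}
```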

> any help would be greatly appreciated..

HTH,
     Holger



