Posted to j-users@xerces.apache.org by Benson Cheng <Be...@viacore.net> on 2002/11/21 23:16:12 UTC
UTF-8 encoding question
I have an XML document that has an international character in it, see below (hex value is 0xD2). If I use "US-ASCII" in the XML declaration (<?xml version="1.0" encoding="US-ASCII"?>), then I can view the document in IE without any problems, but if I change it to "UTF-8" (<?xml version="1.0" encoding="UTF-8"?>), then IE reports an "invalid character was found" error. Xerces (1.4.4) does not report any error with either encoding. My question is: is this character a valid UTF-8 character? If it is not, why did Xerces not report an error?
<FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>
thanks,
Benson.
---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org
Re: UTF-8 encoding question
Posted by Joseph Kesselman <ke...@us.ibm.com>.
In UTF-8, characters over 0x7F are encoded as multi-byte sequences. Your
0xD2 character (binary 11010010) should be encoded as the two bytes
11000011 10010010, or 0xC3 0x92.
See http://www.faqs.org/rfcs/rfc2279.html for the exact details.
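
The bit arithmetic above can be checked with a short Java sketch. The class and method names (Utf8EncodeDemo, encodeTwoByte, hex) are made up for illustration and are not part of Xerces; the hand-rolled encoder only covers code points below U+0800 and is compared against the JDK's own UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodeDemo {
    // Hypothetical helper: encodes a code point below U+0800 by hand,
    // using the one-byte and two-byte UTF-8 forms only.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("only U+0000..U+07FF handled here");
        }
        if (codePoint < 0x80) {
            return new byte[] { (byte) codePoint };       // 0xxxxxxx
        }
        byte lead = (byte) (0xC0 | (codePoint >> 6));     // 110xxxxx: top 5 bits
        byte trail = (byte) (0x80 | (codePoint & 0x3F));  // 10xxxxxx: low 6 bits
        return new byte[] { lead, trail };
    }

    // Formats bytes as space-separated hex, e.g. "0xC3 0x92".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hand-rolled encoding of U+00D2
        System.out.println(hex(encodeTwoByte(0xD2)));
        // Same character through the JDK's UTF-8 encoder
        System.out.println(hex("\u00D2".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Both lines should print 0xC3 0x92, matching the two bytes given above.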
As to why an ancient version of Xerces accepted it: It was a bug. Try a
modern release of Xerces and see if it still accepts that byte; I'd bet it
won't.
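
Independently of any particular parser, one way to see why a strict UTF-8 reader should reject the original bytes is to run them through the JDK's CharsetDecoder with error reporting turned on. This is only a sketch: the class and method names (Utf8StrictCheck, isValidUtf8) are illustrative, and it says nothing about how Xerces itself implements its check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8StrictCheck {
    // Hypothetical helper: returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "SK?YEN" with a lone 0xD2 byte, as in the original document
        byte[] raw = { 'S', 'K', (byte) 0xD2, 'Y', 'E', 'N' };
        // The same text with the character encoded as 0xC3 0x92
        byte[] encoded = { 'S', 'K', (byte) 0xC3, (byte) 0x92, 'Y', 'E', 'N' };

        System.out.println(isValidUtf8(raw));
        System.out.println(isValidUtf8(encoded));
    }
}
```

The lone 0xD2 is a two-byte lead byte, so a strict decoder expects a continuation byte (10xxxxxx) after it; the following 'Y' (0x59) is not one, which makes the first sequence malformed.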
______________________________________
Joe Kesselman / IBM Research