Posted to j-users@xerces.apache.org by Benson Cheng <Be...@viacore.net> on 2002/11/21 23:16:12 UTC

UTF-8 encoding question

I have an XML document with an international character in it (see below; its hex value is 0xD2). If I use "US-ASCII" in the XML declaration (<?xml version="1.0" encoding="US-ASCII"?>), I can view the document in IE without any problems, but if I change it to "UTF-8" (<?xml version="1.0" encoding="UTF-8"?>), IE reports an "invalid character was found" error. Xerces (1.4.4) does not report an error with either encoding. My question: is this character valid UTF-8? If it is not, why did Xerces not report an error?

<FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>

thanks,
Benson.

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: UTF-8 encoding question

Posted by Joseph Kesselman <ke...@us.ibm.com>.

In UTF-8, characters over 0x7F are encoded as multi-byte sequences. Your
0xD2 character (binary 11010010) should be encoded as the two bytes
11000011 10010010, or 0xC3 0x92.
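
As a rough illustration of the encoding rule above, here is a minimal Java
sketch. The class and method names (Utf8Check, utf8Hex, isValidUtf8) are
made up for this example and are not part of Xerces. It encodes U+00D2 to
UTF-8 and then checks whether a byte sequence containing a lone 0xD2 byte
decodes as strict UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Check {
    // Encode a string as UTF-8 and render the bytes as hex, e.g. "0xC3 0x92".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Return true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00D2 becomes two bytes in UTF-8.
        System.out.println(utf8Hex("\u00D2"));

        // "SK" + single byte 0xD2 + "YEN": 0xD2 announces a two-byte
        // sequence, but 'Y' (0x59) is not a continuation byte.
        byte[] raw = {'S', 'K', (byte) 0xD2, 'Y', 'E', 'N'};
        System.out.println(isValidUtf8(raw));

        // The same text with the character properly encoded as 0xC3 0x92.
        byte[] encoded = {'S', 'K', (byte) 0xC3, (byte) 0x92, 'Y', 'E', 'N'};
        System.out.println(isValidUtf8(encoded));
    }
}
```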

See http://www.faqs.org/rfcs/rfc2279.html for the exact details.

As to why an ancient version of Xerces accepted it: it was a bug. Try a
modern release of Xerces and see if it still accepts that byte; I'd bet it
won't.
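
One way to try this out is sketched below, assuming the JAXP parser
bundled with a current JDK (which is derived from Xerces) rather than a
standalone Xerces download. The class and method names (XmlUtf8Parse,
parses) are invented for this example. It parses the same document twice,
once with the character stored as a single 0xD2 byte (as a single-byte
encoding such as ISO-8859-1 would store it) and once with it properly
encoded as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class XmlUtf8Parse {
    static final String DOC =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<FreeFormText>POSTBOKS 60 SK\u00D2YEN</FreeFormText>";

    // Return true if the parser accepts the bytes, false if it rejects them.
    static boolean parses(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return true;
        } catch (IOException | SAXException e) {
            // Malformed byte sequences surface as one of these exceptions.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Declared as UTF-8 but written with a single 0xD2 byte.
        System.out.println(parses(DOC.getBytes(StandardCharsets.ISO_8859_1)));
        // Declared as UTF-8 and actually written as UTF-8 (0xC3 0x92).
        System.out.println(parses(DOC.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The parser may also print a fatal-error message to standard error for the
first document; the exact wording depends on the parser version.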

______________________________________
Joe Kesselman  / IBM Research