Posted to j-users@xerces.apache.org by Benson Cheng <Be...@viacore.net> on 2002/12/03 02:25:24 UTC

RE: UTF-8 encoding question

Thanks for the info. Xerces 2.2.1 did report an error (java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence) on the following line:

<FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>

But I have another question: when I replace the international character with "&#210;", Xerces 2.2.1 does not report any error. Is this correct? I thought an escaped character is the same as the character itself, for example &#65; = A.

<FreeFormText>POSTBOKS 60 SK&#210;YEN</FreeFormText>
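
A note on the question above: the character reference &#210; is written using ASCII characters only, so the bytes in the file are valid UTF-8 and the parser has nothing to reject. The parser expands the reference to U+00D2 after the bytes have been decoded, which is why the two forms are equivalent in the parsed document but not on disk. The following is a minimal sketch using the JAXP parser bundled with a modern JDK (an assumption; the thread itself concerns Xerces 2.2.1), and the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of Xerces.
class CharRefDemo {

    /** Parses an ASCII-only XML string and returns the root element's text. */
    static String parseText(String asciiXml) {
        try {
            // Only ASCII bytes reach the parser, so they are valid UTF-8.
            byte[] bytes = asciiXml.getBytes(StandardCharsets.US_ASCII);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = parseText("<FreeFormText>POSTBOKS 60 SK&#210;YEN</FreeFormText>");
        // The reference has been expanded to the single character U+00D2.
        System.out.println(text.equals("POSTBOKS 60 SK\u00D2YEN")); // true
    }
}
```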


thanks for your help,
Benson.
-----Original Message-----
From: Joseph Kesselman [mailto:keshlam@us.ibm.com]
Sent: Thursday, November 21, 2002 3:19 PM
To: xerces-j-user@xml.apache.org
Subject: Re: UTF-8 encoding question


 In UTF-8, characters over 0x7F are encoded as multi-byte sequences.  Your 
0xD2 character (binary 11010010) should be encoded as the two bytes 
11000011 10010010, or 0xC3 0x92.

See http://www.faqs.org/rfcs/rfc2279.html for the exact details.
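
The two-byte result above can be checked with a few lines of Java. This is only an illustrative sketch for a modern JDK (StandardCharsets did not exist in 2002), and the class and helper names are made up for the example:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing the UTF-8 bytes of a string.
class Utf8BytesDemo {

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Returns the UTF-8 encoding of s as hex. */
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00D2")); // C3 92
        System.out.println(utf8Hex("A"));      // 41
    }
}
```

Characters up to 0x7F stay as one byte, which is why plain ASCII files are already valid UTF-8.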

As to why an ancient version of Xerces accepted it: It was a bug. Try a 
modern release of Xerces and see if it still accepts that byte; I'd bet it 
won't.

______________________________________
Joe Kesselman  / IBM Research

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: UTF-8 encoding question

Posted by Andy Clark <an...@apache.org>.
Benson Cheng wrote:
> Thanks for the info, the xerces 2.2.1 did report error (java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence) on the following line.  
> 
> <FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>

You get this error when your document contains a
non-ASCII character but the file encoding is
specified incorrectly.
The first line of the XML document (called the
XMLDecl) specifies the encoding of the file. For
example:

   <?xml version='1.0' encoding='ISO-8859-1'?>

If this line is missing, then the default encoding
is UTF-8. However, if you've created your document
with a text editor like Notepad, it will save the
file with the default encoding of the system --
usually Cp1252 (aka Windows-1252).
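
To illustrate the mismatch, the sketch below encodes the same text in a single-byte encoding and in UTF-8, then checks each byte array with a strict UTF-8 decoder. It uses ISO-8859-1 as a stand-in for Cp1252, on the assumption that only this one character matters (both encodings store it as the byte 0xD2). The class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: is a byte array well-formed UTF-8?
class EncodingMismatchDemo {

    /** Returns true only if the bytes decode as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "SK\u00D2YEN";
        // Single-byte encoding: the character becomes the lone byte 0xD2.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8: the character becomes the two bytes 0xC3 0x92.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

In the single-byte case, 0xD2 looks like the start of a two-byte UTF-8 sequence, but the next byte ('Y', 0x59) is not a continuation byte, which matches the "Invalid byte 2 of 2-byte UTF-8 sequence" message.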

However, be aware that simply adding an XMLDecl
line to your file does *not* change the encoding.
To do that, the program that creates the file MUST
save the contents in the appropriate encoding. In
Notepad under Win2K or XP, there is an encoding
selection on the Save dialog that allows you to
select various Unicode encodings like "UTF-8".
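
If the file is produced by a program rather than an editor, the same rule applies: the writer has to use the charset named in the XMLDecl. Below is a rough sketch for a modern JDK, with hypothetical class and method names, that writes the declaration and the content through one Charset object so the two cannot drift apart, then parses the file back with the JDK's bundled JAXP parser (an assumption; any conforming parser should behave the same way):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class: keep the XMLDecl and the real encoding in sync.
class DeclaredEncodingDemo {

    /** Writes a tiny document; text must not contain markup characters. */
    static void writeXml(Path file, Charset charset, String text) throws IOException {
        // The same charset drives both the bytes written and the declaration.
        try (Writer out = Files.newBufferedWriter(file, charset)) {
            out.write("<?xml version='1.0' encoding='" + charset.name() + "'?>\n");
            out.write("<FreeFormText>" + text + "</FreeFormText>\n");
        }
    }

    /** Writes the text to a temp file in the given charset and parses it back. */
    static String roundTrip(Charset charset, String text) {
        try {
            Path file = Files.createTempFile("freeform", ".xml");
            try {
                writeXml(file, charset, text);
                try (InputStream in = Files.newInputStream(file)) {
                    return DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder()
                            .parse(in)
                            .getDocumentElement()
                            .getTextContent();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "POSTBOKS 60 SK\u00D2YEN";
        System.out.println(roundTrip(StandardCharsets.ISO_8859_1, text).equals(text)); // true
        System.out.println(roundTrip(StandardCharsets.UTF_8, text).equals(text));      // true
    }
}
```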

Hope this helps...

-- 
Andy Clark * andyc@apache.org

