Posted to j-users@xerces.apache.org by rf <ru...@yahoo.com> on 2002/12/31 07:52:06 UTC

Encoding and mystery

hello
This might be a fairly simple one, but I don't know the answer.
How does an XML reader know the encoding of an XML
document before reading it? The encoding is given inside
the XML, in the first processing instruction. One book
says that if you are reading XML across a
network (say, HTTP), then you (have to) state the
encoding in the MIME type header. But it makes no mention of
reading files from disk. What's the answer?

Thanks.
~rf~

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: Encoding and mystery

Posted by Joseph Kesselman <ke...@us.ibm.com>.

On Monday, 12/30/2002 at 10:52 PST, rf <ru...@yahoo.com> wrote:
> How does an XML reader know of the encoding of an XML
> before reading it

The simple answer is, it can't. It needs to examine the XML Declaration 
*before* selecting the encoding. The XML Recommendation discusses this 
process:

1. Look for the byte-order mark, which may or may not be present.

2. Look for the start of the XML Declaration. Since we know it starts with
   "<?", we can usually recognize the general family of encodings
   (UTF-8-like, UTF-16-like, EBCDIC-like, etc.) from those first few bytes.

3. Use that information to interpret the rest of the XML Declaration. If an
   encoding was specified, read the rest of the document using that
   encoding. If it wasn't specified, you can/should usually assume it's
   UTF-8 or UTF-16.
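
The detection steps above can be sketched roughly as follows. This is only an illustration, not how Xerces actually implements it: the class name EncodingSniffer and the method detect are made up for this example, the byte patterns are the ones listed in the Recommendation's non-normative appendix on autodetection, UCS-4 is ignored, and the returned strings are labels that are not guaranteed to be valid Java charset names (in particular "EBCDIC" only names the family).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Xerces or JAXP.
final class EncodingSniffer {
    // Matches the encoding pseudo-attribute inside a leading XML Declaration.
    private static final Pattern DECLARED = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    private EncodingSniffer() {}

    // head: the first bytes of the document (a few hundred are plenty).
    static String detect(byte[] head) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);

        // Step 1: byte-order mark, if present.
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";

        // Step 2: no BOM, so see how "<?" (or "<?xm") is laid out in bytes.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94) return "EBCDIC";

        // Step 3: ASCII-compatible family. The declaration itself is plain
        // ASCII, so decoding it as ISO-8859-1 is safe for this purpose.
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) {
            int n = Math.min(head.length, 256);
            String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
            Matcher m = DECLARED.matcher(text);
            if (m.find()) return m.group(1);
        }

        // No BOM and no declared encoding: fall back to UTF-8.
        return "UTF-8";
    }

    // Returns the unsigned byte at index i, or -1 if the array is too short.
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }
}
```

A real parser does more than this (for example, it checks that the declared encoding is consistent with the detected family), so treat the sketch as a way to follow the reasoning rather than as something to ship.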

> One book
> says that if you are reading an XML across a
> network(say http), then you (have to) mention the
> encoding in the MIME type header.

This is highly encouraged, since switching encodings after you've started 
reading the stream tends to be less efficient. But a correctly implemented 
parser *ought* to be able to handle the cases where the encoding 
is specified only by the file.
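
For the HTTP case, one way to hand the externally supplied encoding to a SAX parser is through org.xml.sax.InputSource.setEncoding. The sketch below assumes you already have the response body as an InputStream and the Content-Type header as a string; the class HttpXmlSource and its two methods are hypothetical names for this example, and the header parsing is deliberately simplistic.

```java
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class HttpXmlSource {
    // Pulls the charset parameter out of a header such as
    // "text/xml; charset=ISO-8859-1". Simplified; not a full MIME parser.
    private static final Pattern CHARSET = Pattern.compile(
        ";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    private HttpXmlSource() {}

    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Wraps the body in an InputSource, passing along the charset if the
    // header named one. With no charset, the parser is left to work the
    // encoding out from the document itself.
    static InputSource sourceFor(InputStream body, String contentType) {
        InputSource source = new InputSource(body);
        String charset = charsetFromContentType(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        }
        return source;
    }
}
```

The resulting InputSource can then be passed to XMLReader.parse or DocumentBuilder.parse in the usual way.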

> reading files from the disk - what's the answer?

If it isn't specified in the XML Declaration, the data should be read as 
UTF-8 or UTF-16. Some parsers may attempt to guess non-UTF encodings if 
you haven't specified the encoding, but that guessing is error-prone and 
shouldn't be relied upon.
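
In practice, for files on disk this usually means giving the parser raw bytes rather than characters, so that it can do the detection itself. A minimal sketch with the standard JAXP API follows; the class ByteStreamParse and the method rootText are made-up names, and it assumes a JAXP parser is available, as it is in the JDK.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only.
final class ByteStreamParse {
    private ByteStreamParse() {}

    // Parses XML from a byte stream and returns the root element's text.
    // Because the parser sees bytes, it reads the BOM and XML Declaration
    // itself and picks the encoding; nothing is decoded beforehand.
    static String rootText(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

For a file you would pass a java.io.FileInputStream (or hand the File straight to DocumentBuilder.parse). Wrapping the file in a Reader instead means the bytes have already been decoded by the time the parser sees them, so the encoding named in the XML Declaration no longer determines how they are read.
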
______________________________________
Joe Kesselman  / IBM Research
