Posted to dev@cocoon.apache.org by Jeremy Quinn <je...@media.demon.co.uk> on 2000/09/01 11:34:52 UTC

Re: How to determine encoding?

At 11:19 +0100 04/08/00, Jeremy Quinn wrote:
>At 14:09 -0500 03/08/00, Ricardo Rocha wrote:
>>
>>Is there a way to determine what encoding was specified for an
>>XML document after parsing?

Ricardo,

Did you ever work out how to do this?

How?

thanks Jeremy
-- 
   ___________________________________________________________________

   Jeremy Quinn                                           Karma Divers
                                                       webSpace Design
                                            HyperMedia Research Centre

   <ma...@mac.com>     		 <http://www.media.demon.co.uk>
    <phone:+44.[0].20.7737.6831>        <pa...@sms.genie.co.uk>

Re: [Cocoon Devel] Re: How to determine encoding?

Posted by Stephen Zisk <sz...@mediabridge.net>.


> >>Is there a way to determine what encoding was specified for an
> >>XML document after parsing?
>
>Ricardo,
>Did you ever work out how to do this?
>How?
>
>thanks Jeremy


Am I missing something? I would have said that if there is no explicit
encoding information, there is no way to derive the encoding accurately.
The ISO-8859-x character encodings, the Windows code pages, and even
UTF-8 all represent the ASCII characters using the same one-byte values
as ASCII itself. So unless you propose matching accented characters
against a known language, how can you distinguish among any of these in
a file when most of the characters are plain ASCII?
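
To make the point above concrete, here is a minimal Python sketch (the
sample strings and variable names are invented for illustration). It
shows that pure-ASCII text produces identical bytes under several
ASCII-compatible codecs, and that the byte streams only diverge once a
non-ASCII character appears.

```python
# Pure-ASCII text encodes to identical bytes under all of these codecs.
ascii_text = "<?xml version='1.0'?><doc>hello</doc>"
codec_names = ["ascii", "iso-8859-1", "cp1252", "utf-8"]
ascii_bytes = {name: ascii_text.encode(name) for name in codec_names}
all_identical = len(set(ascii_bytes.values())) == 1

# One accented character is enough to make the byte streams diverge.
accented = "caf\u00e9"                          # "cafe" with e-acute
latin1_bytes = accented.encode("iso-8859-1")    # e-acute is one byte, 0xE9
utf8_bytes = accented.encode("utf-8")           # e-acute is two bytes, 0xC3 0xA9

print(all_identical)
print(latin1_bytes != utf8_bytes)
```

A file that contains only the first kind of text gives a detector
nothing to work with, whichever of these encodings the author had in
mind.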

You might have a chance of distinguishing UTF-8 from the others by
recognizing its multi-byte sequences, but for the one-byte encodings,
most of the non-ASCII byte values represent meaningful characters, so
nearly any byte stream is valid in all of them. This is especially true
for minor variants like ISO-8859-1 vs ISO-8859-15.
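
As a rough sketch of that heuristic in Python (the function name and
its return labels are hypothetical, not part of any library): a strict
UTF-8 decode fails on most non-ASCII text written in a one-byte
encoding, so success on input that contains high bytes is a reasonable
hint of UTF-8. Nothing comparable exists for telling the one-byte
encodings apart, as the last lines show for a pair of close variants
such as ISO-8859-1 and ISO-8859-15.

```python
def guess_encoding_family(data: bytes) -> str:
    """Heuristic only: returns a hint, not a definitive answer."""
    if all(b < 0x80 for b in data):
        # Only ASCII bytes: every ASCII-compatible encoding fits equally well.
        return "ascii-only (undetermined)"
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "not utf-8 (some other encoding, undetermined)"
    return "probably utf-8"


print(guess_encoding_family("caf\u00e9".encode("utf-8")))
print(guess_encoding_family("caf\u00e9".encode("iso-8859-1")))
print(guess_encoding_family(b"plain text"))

# The same byte is valid in both variants but means different characters.
as_latin1 = b"\xa4".decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_latin9 = b"\xa4".decode("iso-8859-15")   # U+20AC EURO SIGN
```

Both decodes succeed without error, so only knowledge of the expected
text can say which reading was intended.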

Stephen Zisk

----------
Stephen Zisk                      MediaBridge Technologies
email:  szisk@mediabridge.net     100 Nagog Park
tel:    978-795-7040              Acton, MA 01720    USA
fax:    978-795-7100              http://www.mediabridge.net