Posted to dev@cocoon.apache.org by Stephen Zisk <sz...@mediabridge.net> on 2000/09/01 19:16:37 UTC

Re: [Cocoon Devel] Re: How to determine encoding?


> >>Is there a way to determine what encoding was specified for an
> >>XML document after parsing?
>
>Ricardo,
>Did you ever work out how to do this?
>How?
>
>thanks Jeremy


Am I missing something? I would have said that if there is no explicit
encoding information, there is no way to accurately derive the encoding.
The ISO-8859-x character encodings, the Windows code pages, and even
UTF-8 all represent the ASCII character set using the same one-byte
encoding as ASCII itself. So unless you propose matching accented
characters against a language, how can you distinguish among any of these
in a file when most of the characters are plain ASCII?
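
To make that concrete, here is a small Python sketch (not Cocoon code; the
sample string and variable names are made up for illustration). It encodes
an ASCII-only document under several encodings and compares the bytes, then
decodes one non-ASCII byte two ways to show that the byte alone does not say
which encoding was meant.

```python
# Hypothetical sample document containing only ASCII characters.
text = '<?xml version="1.0"?><doc>plain ASCII content</doc>'

# Encodings that all share the ASCII range (standard Python codec names).
encodings = ["ascii", "iso-8859-1", "iso-8859-15", "cp1252", "utf-8"]

# Encode the same text under each encoding.
encoded = {name: text.encode(name) for name in encodings}

# True when every encoding produced the identical byte string.
all_same = len(set(encoded.values())) == 1

# A single non-ASCII byte is meaningful in more than one encoding:
# 0xA4 is the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15.
ambiguous = b"\xa4"
as_latin1 = ambiguous.decode("iso-8859-1")
as_latin9 = ambiguous.decode("iso-8859-15")

print(all_same)
print(repr(as_latin1), repr(as_latin9))
```

With only the bytes in hand, nothing in the ASCII-only case points to one
encoding over another, and the 0xA4 case decodes cleanly both ways.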

You might have a chance of distinguishing UTF-8 from the others by
recognizing its multi-byte sequences, but in the one-byte encodings most
of the non-ASCII byte values represent meaningful characters, so the same
byte is plausible text in more than one of them. This is especially true
for minor variants like ISO-8859-1 vs ISO-8859-15.
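
A rough sketch of that UTF-8 check, in Python rather than anything from
Cocoon: the function name guess_utf8 and its return strings are my own
invention. It relies on a strict UTF-8 decode failing for most byte
sequences produced by one-byte encodings. It is a heuristic only; a
successful decode suggests UTF-8 but does not prove it, and a failure says
nothing about which one-byte encoding was used.

```python
def guess_utf8(data: bytes) -> str:
    """Heuristic guess; the labels returned are illustrative only."""
    # Pure ASCII bytes look the same in all of the encodings discussed.
    if all(b < 0x80 for b in data):
        return "ascii-only, undetermined"
    try:
        # Strict decoding rejects malformed multi-byte sequences.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not UTF-8, one-byte encoding undetermined"
    return "probably UTF-8"


print(guess_utf8("café".encode("utf-8")))
print(guess_utf8("café".encode("iso-8859-1")))
print(guess_utf8(b"plain text"))
```

The second call fails the decode because the lone byte 0xE9 starts a
three-byte UTF-8 sequence that never completes.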

Stephen Zisk

----------
Stephen Zisk                      MediaBridge Technologies
email:  szisk@mediabridge.net     100 Nagog Park
tel:    978-795-7040              Acton, MA 01720    USA
fax:    978-795-7100              http://www.mediabridge.net