Posted to dev@cocoon.apache.org by Jeremy Quinn <je...@media.demon.co.uk> on 2000/09/01 11:34:52 UTC
Re: How to determine encoding?
At 11:19 +0100 04/08/00, Jeremy Quinn wrote:
>At 14:09 -0500 03/08/00, Ricardo Rocha wrote:
>>
>>Is there a way to determine what encoding was specified for an
>>XML document after parsing?
Ricardo,
Did you ever work out how to do this?
How?
thanks Jeremy
--
___________________________________________________________________
Jeremy Quinn Karma Divers
webSpace Design
HyperMedia Research Centre
<ma...@mac.com> <http://www.media.demon.co.uk>
<phone:+44.[0].20.7737.6831> <pa...@sms.genie.co.uk>
Re: [Cocoon Devel] Re: How to determine encoding?
Posted by Stephen Zisk <sz...@mediabridge.net>.
> >>Is there a way to determine what encoding was specified for an
> >>XML document after parsing?
>
>Ricardo,
>Did you ever work out how to do this?
>How?
>
>thanks Jeremy
Am I missing something? I would have said that if there is no explicit
encoding information, there is no way to accurately derive the encoding.
The ISO-8859-x character encoding definitions, Windows code pages, and even
UTF-8 all represent the ASCII character complement using the same one-byte
encoding as ASCII itself, so unless you propose accented character and
language matching, how can you distinguish among any of these in a file
when most of the characters are part of the ASCII complement?
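To illustrate the point, here is a minimal Python sketch (not taken from
Cocoon; the sample strings and the helper name encoded_forms are made up,
and cp1252 stands in for "Windows code pages"). Pure-ASCII text produces
identical bytes under all of these encodings, so the bytes alone carry no
hint; a single accented character is enough to make them diverge.

```python
# Encodings to compare; cp1252 is used here as a representative Windows code page.
ENCODINGS = ["ascii", "iso-8859-1", "iso-8859-15", "cp1252", "utf-8"]


def encoded_forms(text, encodings=ENCODINGS):
    """Return the set of distinct byte strings that `text` encodes to."""
    forms = set()
    for name in encodings:
        try:
            forms.add(text.encode(name))
        except UnicodeEncodeError:
            # This encoding cannot represent some character in the text.
            continue
    return forms


ascii_only = '<?xml version="1.0"?><doc>hello</doc>'
accented = "<doc>caf\u00e9</doc>"  # contains e-acute

# ASCII-only text: every encoding yields the same bytes.
print(len(encoded_forms(ascii_only)))
# Accented text: the one-byte encodings agree with each other, UTF-8 differs.
print(len(encoded_forms(accented)))
```

With these samples the ASCII-only document has a single byte form, while
the accented one has two: one shared by the one-byte encodings and one for
UTF-8.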
You might have a chance distinguishing UTF-8 from the others by recognizing
common multi-byte sequences, but for all of the one-byte encodings, most of
the non-ASCII character codes represent meaningful characters. This is
especially true for minor variants like ISO-8859-1 vs ISO-8859-15.
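A rough sketch of that heuristic in Python (the function name
looks_like_utf8 and the sample bytes are hypothetical, chosen only for
illustration). A strict UTF-8 decode rejects most byte sequences produced
by a one-byte encoding, whereas the one-byte encodings accept nearly any
byte, so a successful decode there proves nothing.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if `data` is well-formed UTF-8 containing at least one non-ASCII byte."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is also valid UTF-8, but then it is no evidence either way.
    return any(b >= 0x80 for b in data)


utf8_bytes = "caf\u00e9".encode("utf-8")         # e-acute as two bytes
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # e-acute as one byte

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))
print(looks_like_utf8(b"cafe"))

# The same byte is a meaningful character in both one-byte variants, so
# nothing in the data itself says which reading was intended.
print(repr(b"\xa4".decode("iso-8859-1")), repr(b"\xa4".decode("iso-8859-15")))
```

Byte 0xA4 decodes to the currency sign under ISO-8859-1 and to the euro
sign under ISO-8859-15; both are plausible in real text, which is why only
a declared encoding settles the question.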
Stephen Zisk
----------
Stephen Zisk MediaBridge Technologies
email: szisk@mediabridge.net 100 Nagog Park
tel: 978-795-7040 Acton, MA 01720 USA
fax: 978-795-7100 http://www.mediabridge.net