Posted to dev@cocoon.apache.org by Ricardo Rocha <ri...@apache.org> on 2000/08/03 21:09:16 UTC
How to determine encoding?
Is there a way to determine what encoding was specified for an
XML document after parsing?
I'm referring to the case when an encoding is explicitly specified
in the XML declaration, like in:
<?xml version="1.0" encoding="ISO-8859-5"?>
I've browsed the SAX and DOM javadocs looking for an "official"
way of determining the document's original encoding to no avail
so far... (Btw, my tests reveal that this declaration is _not_
processed as a processing instruction)
This would be very handy for XSP: for proper i18n support,
generated Java programs should be compiled using the same
encoding as the original document. Right now, the author must
specify (redundantly) the encoding as an attribute in the
<xsp:page> root element. We're working on making this root
element optional, so if there's a way of finding out what the
original encoding was we'd remove one more case in which the
<xsp:page> root element is necessary.
Any ideas?
Ricardo
Re: [Cocoon Devel] Re: How to determine encoding?
Posted by Stephen Zisk <sz...@mediabridge.net>.
> >>Is there a way to determine what encoding was specified for an
> >>XML document after parsing?
>
>Ricardo,
>Did you ever work out how to do this?
>How?
>
>thanks Jeremy
Am I missing something? I would have said that if there is no explicit
encoding information, there is no way to accurately derive the encoding.
The ISO-8859-x character encoding definitions, Windows code pages, and even
UTF-8 all represent the ASCII character complement using the same one-byte
encoding as ASCII itself, so unless you propose accented character and
language matching, how can you distinguish among any of these in a file
when most of the characters are part of the ASCII complement?
You might have a chance distinguishing UTF-8 from the others by recognizing
common multi-byte sequences, but for all of the one-byte encodings, most of
the non-ASCII character codes represent meaningful characters. This is
especially true for minor variants like ISO-8859-1 vs ISO-8859-15.
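
To make the point concrete, here is a minimal sketch (the class and method names are made up for illustration) that decodes the same bytes under several one-byte encodings. It assumes the JVM ships the ISO-8859-1, ISO-8859-5 and ISO-8859-15 charsets. The ASCII bytes come out identical every time, while the single byte 0xA4 yields a different, perfectly valid character in each encoding, so the bytes alone do not say which one was intended.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingAmbiguity {
    // Decode raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] ascii = "<page/>".getBytes(StandardCharsets.US_ASCII);
        byte[] high = { (byte) 0xA4 }; // one non-ASCII byte
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-5", "ISO-8859-15" }) {
            String text = decode(ascii, name);
            int codePoint = decode(high, name).codePointAt(0);
            // Print the code point in hex to avoid console encoding problems.
            System.out.println(name + ": " + text + " / U+"
                    + Integer.toHexString(codePoint).toUpperCase());
        }
    }
}
```
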
Stephen Zisk
----------
Stephen Zisk MediaBridge Technologies
email: szisk@mediabridge.net 100 Nagog Park
tel: 978-795-7040 Acton, MA 01720 USA
fax: 978-795-7100 http://www.mediabridge.net
Re: How to determine encoding?
Posted by Jeremy Quinn <je...@media.demon.co.uk>.
At 11:19 +0100 04/08/00, Jeremy Quinn wrote:
>At 14:09 -0500 03/08/00, Ricardo Rocha wrote:
>>
>>Is there a way to determine what encoding was specified for an
>>XML document after parsing?
Ricardo,
Did you ever work out how to do this?
How?
thanks Jeremy
--
___________________________________________________________________
Jeremy Quinn Karma Divers
webSpace Design
HyperMedia Research Centre
<ma...@mac.com> <http://www.media.demon.co.uk>
<phone:+44.[0].20.7737.6831> <pa...@sms.genie.co.uk>
Re: How to determine encoding?
Posted by Jeremy Quinn <je...@media.demon.co.uk>.
At 14:09 -0500 03/08/00, Ricardo Rocha wrote:
>
>Is there a way to determine what encoding was specified for an
>XML document after parsing?
I doubt this helps ....
But I did something like this in the FP TagLib fpResource Library:
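    // Assumes Xerces' org.apache.xerces.parsers.DOMParser and a String
    // field named workEncoding declared elsewhere in the class.
    public Document loadDocument(File f) throws SAXException, IOException {
        DOMParser parser = new DOMParser();
        // An InputSource system id must be a URI, not a bare file path.
        InputSource is = new InputSource(f.toURI().toString());
        // getEncoding() only reports an encoding set by the application;
        // it is null otherwise and never reflects the XML declaration.
        String enc = is.getEncoding();
        if (enc != null && !enc.equals("")) { workEncoding = enc; }
        parser.parse(is);
        return parser.getDocument();
    }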
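
For comparison, here is a hedged sketch of one way to read the declared encoding after parsing. It relies on Document.getXmlEncoding() from DOM Level 3, which is newer than the parsers discussed in this thread, so whether a given parser supports it is an assumption. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, for illustration only.
final class DeclaredEncoding {
    // Returns the encoding named in the XML declaration,
    // or null when the document does not declare one.
    static String declaredEncoding(byte[] xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml));
        return doc.getXmlEncoding(); // DOM Level 3 Core
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-5\"?><page/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(xml));
    }
}
```

For SAX-based code, org.xml.sax.ext.Locator2.getEncoding() is the comparable call, provided the parser hands a Locator2 to setDocumentLocator().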
regards Jeremy