
Posted to dev@cocoon.apache.org by Ricardo Rocha <ri...@apache.org> on 2000/08/03 21:09:16 UTC

How to determine encoding?

Is there a way to determine what encoding was specified for an
XML document after parsing?

I'm referring to the case when an encoding is explicitly specified
in the XML declaration, like in:

  <?xml version="1.0" encoding="ISO-8859-5"?>

I've browsed the SAX and DOM javadocs looking for an "official"
way of determining the document's original encoding to no avail
so far... (Btw, my tests reveal that this declaration is _not_
processed as a processing instruction)

This would be very handy for XSP: for proper i18n support,
generated Java programs should be compiled using the same
encoding as the original document. Right now, the author must
specify (redundantly) the encoding as an attribute in the
<xsp:page> root element. We're working on making this root
element optional, so if there's a way of finding out what the
original encoding was we'd remove one more case in which the
<xsp:page> root element is necessary.

Any ideas?

Ricardo
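
The remark above that the XML declaration is not delivered as a processing instruction can be checked with a small SAX handler. This is only a sketch: it assumes a JAXP-capable parser on the classpath, and the class name DeclarationIsNotAPi and the "demo" instruction are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Cocoon or XSP.
public class DeclarationIsNotAPi {

    /** Parses the bytes with SAX and returns the target of every processing instruction reported. */
    public static List<String> piTargets(byte[] xml) throws Exception {
        final List<String> targets = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void processingInstruction(String target, String data) {
                targets.add(target);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // The XML declaration is followed by a genuine (made-up) processing instruction.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<?demo some data?><page/>";
        // Only the real processing instruction reaches the handler.
        System.out.println(piTargets(xml.getBytes(StandardCharsets.ISO_8859_1))); // prints [demo]
    }
}
```

Only "demo" shows up in the list, so the declared encoding has to be obtained some other way than through the processingInstruction callback.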

Re: [Cocoon Devel] Re: How to determine encoding?

Posted by Stephen Zisk <sz...@mediabridge.net>.


> >>Is there a way to determine what encoding was specified for an
> >>XML document after parsing?
>
>Ricardo,
>Did you ever work out how to do this?
>How?
>
>thanks Jeremy


Am I missing something? I would have said that if there is no explicit
encoding information, there is no way to accurately derive the encoding.
The ISO-8859-x character encoding definitions, Windows code pages, and even
UTF-8 all represent the ASCII characters using the same one-byte
encoding as ASCII itself, so unless you propose matching accented
characters against a known language, how can you distinguish among any of
these in a file when most of the characters are ASCII?

You might have a chance distinguishing UTF-8 from the others by recognizing 
common multi-byte sequences, but for all of the one-byte encodings, most of 
the non-ASCII character codes represent meaningful characters. This is 
especially true for minor variants like ISO-8859-1 vs ISO-8859-15.

Stephen Zisk

----------
Stephen Zisk                      MediaBridge Technologies
email:  szisk@mediabridge.net     100 Nagog Park
tel:    978-795-7040              Acton, MA 01720    USA
fax:    978-795-7100              http://www.mediabridge.net
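
The point above about recognizing UTF-8 by its multi-byte sequences can be sketched with a strict decoder from java.nio, which rejects byte sequences that are not well-formed UTF-8. The class name Utf8Probe is hypothetical, and this is a heuristic only: a pure-ASCII file passes the check under every encoding mentioned in the message, and the check says nothing about which one-byte encoding a non-UTF-8 file uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
public class Utf8Probe {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1); // 63 61 66 E9
        byte[] utf8   = "caf\u00e9".getBytes(StandardCharsets.UTF_8);      // 63 61 66 C3 A9
        byte[] ascii  = "cafe".getBytes(StandardCharsets.US_ASCII);        // 63 61 66 65

        System.out.println(isValidUtf8(latin1)); // false: a lone E9 is a truncated sequence
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(ascii));  // true, but equally valid ISO-8859-x
    }
}
```

A false result rules UTF-8 out. A true result on text that contains non-ASCII bytes makes UTF-8 likely, but the one-byte encodings remain indistinguishable from one another by this test.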

Re: How to determine encoding?

Posted by Jeremy Quinn <je...@media.demon.co.uk>.

At 11:19 +0100 04/08/00, Jeremy Quinn wrote:
>At 14:09 -0500 03/08/00, Ricardo Rocha wrote:
>>
>>Is there a way to determine what encoding was specified for an
>>XML document after parsing?

Ricardo,

Did you ever work out how to do this?

How?

thanks Jeremy
-- 
   ___________________________________________________________________

   Jeremy Quinn                                           Karma Divers
                                                       webSpace Design
                                            HyperMedia Research Centre

   <ma...@mac.com>     		 <http://www.media.demon.co.uk>
    <phone:+44.[0].20.7737.6831>        <pa...@sms.genie.co.uk>

Re: How to determine encoding?

Posted by Jeremy Quinn <je...@media.demon.co.uk>.

At 14:09 -0500 03/08/00, Ricardo Rocha wrote:
>
>Is there a way to determine what encoding was specified for an
>XML document after parsing?

I doubt this helps ....

But I did something like this in the FP TagLib fpResource Library:


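// Assumes imports of org.apache.xerces.parsers.DOMParser,
// org.xml.sax.InputSource, org.w3c.dom.Document and java.io.File,
// plus a String field named workEncoding in the enclosing class.
public Document loadDocument(File f) throws Exception {
	DOMParser parser = new DOMParser();
	InputSource is = new InputSource(f.toURI().toString());
	// getEncoding() only returns a value supplied earlier via
	// setEncoding(); it is null here, so guard against that.
	String enc = is.getEncoding();
	if (enc != null && !enc.equals("")) { workEncoding = enc; }
	parser.parse(is);
	return parser.getDocument();
}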

regards Jeremy
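
For readers on a later JDK: DOM Level 3, which postdates this thread, added Document.getXmlEncoding(), and that reports the encoding named in the XML declaration. A minimal sketch, assuming a JAXP parser that implements DOM Level 3 (the one bundled with the JDK does); the class name DeclaredEncoding is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of Cocoon or the FP TagLib.
public class DeclaredEncoding {

    /** Returns the encoding named in the XML declaration, or null if none was declared. */
    public static String declaredEncoding(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        // Same declaration as in the original question; the content is pure ASCII,
        // so the bytes are identical under any ISO-8859-x encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-5\"?><page/>";
        System.out.println(declaredEncoding(xml.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Passing the parser a byte stream rather than a Reader matters here, since it lets the parser read the declaration and pick the decoder itself.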