Posted to xsp-dev@xml.apache.org by Robin Green <gr...@hotmail.com> on 2000/11/09 13:36:51 UTC
Re: Encoding and Ext. Entities (was Deprecating the root element)
I've annotated this with initials...
Matt Sergeant <ma...@sergeant.org> wrote:
> > > > [RR] The "encoding" attribute introduced in Cocoon2 should remain
> > > > in place as a root-level attribute:
> > > > . . .
> > > > For the Java language (at least) this information is necessary
> > > > to direct the compiler to generate a class with the proper
> > > > string encoding (javac -encoding ...)
> > >
> > > [MS] Hmm, why don't you get everything from the parser in UTF16 (or
> > > UTF8)? I'm confused...
[RDG] Actually we get everything from the parser as Java Strings, which are
effectively encoding-neutral, because there is NO WAY for any code outside
the JVM itself to determine what encoding is actually being used to store
them. This is by design, so in theory a JVM could use whatever internal
encoding it likes to store Strings.
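To illustrate the point (my own sketch, not Cocoon code): bytes are decoded at the I/O boundary, and once you have a String the original byte encoding is gone:

```java
import java.io.UnsupportedEncodingException;

public class EncodingNeutral {
    /** Decode bytes with the given charset into an encoding-neutral String. */
    public static String decode(byte[] bytes, String charset)
            throws UnsupportedEncodingException {
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] latin1 = {(byte) 0xE9};               // "e-acute" in ISO-8859-1
        byte[] utf8   = {(byte) 0xC3, (byte) 0xA9};  // "e-acute" in UTF-8
        // Same character, two different byte encodings, one identical String:
        System.out.println(decode(latin1, "ISO-8859-1")
                .equals(decode(utf8, "UTF-8")));     // prints "true"
    }
}
```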
> > [RR] This is probably a Java-specific problem. The parser does not
> > provide document encoding info _and_ the Java compiler requires
> > specifying the proper encoding for locale-specific characters to be
> > processed correctly.
[RDG] I think this whole problem stems from one basic confusion. The current
"solution" seems to work for some encodings, but it does not make much sense
to me (it's probably just good luck that it works at all). From the JDK 1.3
tool docs for native2ascii I quote:
"The Java compiler and other Java tools can only process files which contain
Latin-1 and/or Unicode-encoded (\udddd notation) characters.
native2ascii converts files which contain other character encodings into
files containing Latin-1 and/or Unicode-encoded charaters."
Therefore we should NOT be trying to compile files with non-Latin-1
characters in them! We should either output as UTF8 or UTF16 and use
native2ascii to convert (the easy option) or escape the out-of-range
characters in the XSPJavaProcessor (could in theory escape all characters,
but this would make the source totally unreadable!). Fortunately, neither of
these approaches requires knowing the input encoding.
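The second option can be sketched like this (my own illustration, not existing XSPJavaProcessor code; the class name and the Latin-1 threshold are assumptions). Anything above U+00FF is turned into a \udddd escape, which javac accepts regardless of file encoding:

```java
public class UnicodeEscaper {
    /** Escape every character above U+00FF as a \udddd sequence, leaving
        Latin-1 characters untouched, so the generated source is safe to
        feed to javac without knowing the original document's encoding. */
    public static String escape(String s) {
        StringBuffer out = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0xFF) {
                out.append(c);
            } else {
                String hex = Integer.toHexString(c);
                out.append("\\u");
                // Pad to exactly four hex digits, as \udddd requires.
                for (int pad = hex.length(); pad < 4; pad++) {
                    out.append('0');
                }
                out.append(hex);
            }
        }
        return out.toString();
    }
}
```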
The OUTPUT encoding is a separate issue, which is currently handled in C1
by cocoon.properties (although I think this is also Latin-centric, because
it assumes you only want to output one encoding per Cocoon engine. That is
blatantly not good enough, in my experience helping a Chinese-English ISP,
so it should instead be an option to <?cocoon-format?>. But anyway.)
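The underlying mechanism is already per-call in Java (my own sketch, not Cocoon's serializer code): a String only commits to an encoding at the moment it is written out, so nothing in principle prevents choosing the output encoding per request rather than per engine:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class OutputEncodingDemo {
    /** Serialize a String in whatever encoding this response calls for. */
    public static byte[] serialize(String text, String charset)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(bytes, charset);
        w.write(text);
        w.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Seven ASCII characters plus two CJK characters:
        String page = "<p>\u4e2d\u6587</p>";
        System.out.println(serialize(page, "UTF-8").length);    // prints "13"
        System.out.println(serialize(page, "UTF-16BE").length); // prints "18"
    }
}
```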
What I must stress is that it doesn't matter what encoding is used for
intermediate representations, as long as that encoding can cope with all the
characters used. Since "ASCII" Java source files can represent ANY character
via \udddd escapes, we should follow the JDK documentation instead of the
current hacky approach (sorry Ricardo, no offense).
>[MS] That cocoon doesn't invalidate the cache when external entities
>change. It's a bug, and it needs fixing to be able to apply Cocoon to any
>reasonable XML documents... Anyway, that's way off topic (although I've
>seen it brought up on the cocoon-users archive before now).
IMHO this is really a gap in the XML spec, which gives us no standard way to
track changes to external entities. Ditto with the XSLT spec and
<xsl:include> and <xsl:import>. Since Cocoon does not mandate the use of a
specific parser (or XSLT processor), in principle it cannot guarantee that
external entities will be cached correctly (even if it gets its own caching
right). We should instead recommend the use of <util:include-*> tags and/or
XInclude, both of which are under Cocoon's control and can be made to work
correctly.
(Sorry this post is very Cocoon-centric! ;)
Re: Encoding and Ext. Entities (was Deprecating the root element)
Posted by Matt Sergeant <ma...@sergeant.org>.
On Thu, 9 Nov 2000, Robin Green wrote:
> "The Java compiler and other Java tools can only process files which contain
> Latin-1 and/or Unicode-encoded (\udddd notation) characters.
> native2ascii converts files which contain other character encodings into
> files containing Latin-1 and/or Unicode-encoded characters."
Yikes - how odd that Java is so forward-looking in Unicode, yet the
compiler can't cope with UTF-encoded files directly (that is what
the above says, right?). *shrug*
Anyway, I think your solution is better, Robin. It certainly sounds more
like the right thing to do. FWIW, it's a non-issue in Perl, in case anyone
was wondering.
--
<Matt/>
/|| ** Director and CTO **
//|| ** AxKit.com Ltd ** ** XML Application Serving **
// || ** http://axkit.org ** ** XSLT, XPathScript, XSP **
// \\| // ** Personal Web Site: http://sergeant.org/ **
\\//
//\\
// \\