Posted to xsp-dev@xml.apache.org by Robin Green <gr...@hotmail.com> on 2000/11/09 13:36:51 UTC

Re: Encoding and Ext. Entities (was Deprecating the root element)

I've annotated this with initials...

Matt Sergeant <ma...@sergeant.org> wrote:
> > > > [RR] The "encoding" attribute introduced in Cocoon2, should remain
> > > > in place as a root-level attribute:
> > > >  . . .
> > > > For the Java language (at least) this information is necessary
> > > > to direct the compiler to generate a class with the proper
> > > > string encoding (javac -encoding ...)
> > >
> > > [MS] Hmm, why don't you get everything from the parser in UTF16 (or
> > > UTF8)? I'm confused...

[RDG] Actually we get everything from the parser as Java Strings, which are 
effectively encoding-neutral, because there is NO WAY for any code outside 
the JVM itself to determine what encoding is actually being used to store 
them. This is by design. So in theory a JVM could use whatever encoding it 
likes to store Strings.
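
A minimal sketch of what I mean (illustration only, the class name is made 
up; the encoding names are the JDK canonical ones). An encoding only enters 
the picture when a String is converted to or from bytes:

// Illustration only: how the JVM stores the String internally is its own
// business; an encoding is only chosen at the byte boundary.
public class EncodingNeutral {
    public static void main(String[] args) throws Exception {
        String s = "caf\u00e9";                   // e-acute via a Unicode escape
        byte[] latin1 = s.getBytes("ISO8859_1");  // 1 byte for the e-acute
        byte[] utf8   = s.getBytes("UTF8");       // 2 bytes for the e-acute
        System.out.println(latin1.length + " vs " + utf8.length);  // "4 vs 5"
    }
}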

> > [RR] This is probably a Java-specific problem. The parser does not
> > provide document encoding info _and_ the Java compiler requires
> > specifying the proper encoding for locale-specific characters to be
> > processed correctly.

[RDG] I think this whole problem stems from one basic confusion. The current 
"solution" seems to work for some encodings, but it does not make much sense 
to me (it's probably just good luck that it does). From the JDK 1.3 tool 
docs for native2ascii I quote:

"The Java compiler and other Java tools can only process files which contain 
Latin-1 and/or Unicode-encoded (\udddd notation) characters.
native2ascii converts files which contain other character encodings into 
files containing Latin-1 and/or Unicode-encoded charaters."

Therefore we should NOT be trying to compile files with non-Latin-1 
characters in them! We should either output as UTF-8 or UTF-16 and use 
native2ascii to convert (the easy option), or escape the out-of-range 
characters in the XSPJavaProcessor (we could in theory escape all 
characters, but that would make the source totally unreadable!). 
Fortunately, neither of these approaches requires knowing the input encoding.
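
Roughly what the easy option would look like on the command line (the file 
names here are made up, and this assumes the generated source gets written 
out as UTF-8):

  native2ascii -encoding UTF8 _page_xsp.java.utf8 _page_xsp.java
  javac _page_xsp.java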

The OUTPUT encoding is a separate issue, which is currently taken care of in 
C1 by cocoon.properties. (Although I think this is also Latin-centric: it 
assumes that you only want to output one encoding per Cocoon engine, which 
is blatantly not good enough in my experience helping a Chinese-English ISP, 
so it should be an option to <?cocoon-format?>. But anyway.)
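
Something like this, say (hypothetical syntax -- the encoding attribute is 
the part I'm proposing, not something <?cocoon-format?> accepts today):

  <?cocoon-format type="text/html" encoding="Big5"?>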

What I must stress is that it doesn't matter what encoding is used for 
intermediate representations, as long as the encoding can cope with all the 
characters used. Since "ASCII" Java source files can represent ANY character 
using Unicode escapes (the \udddd notation), we should follow the JDK 
documentation instead of the current hacky approach (sorry Ricardo, no 
offense).
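
For the escaping option, here is a rough sketch of the kind of helper the 
XSPJavaProcessor could use (not what it does now, and the name is made up). 
Escaping everything above 7-bit ASCII keeps the generated source safe to 
hand to javac without knowing any encoding at all:

// Sketch only: escape any character outside 7-bit ASCII as a Unicode escape
// (backslash-u plus four hex digits); ASCII, including newlines, passes
// through untouched, so ordinary source stays readable.
public class SourceEscaper {
    public static String escape(String source) {
        StringBuffer out = new StringBuffer(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            } else {
                String hex = Integer.toHexString(c);
                out.append("\\u");
                for (int pad = hex.length(); pad < 4; pad++) {
                    out.append('0');
                }
                out.append(hex);
            }
        }
        return out.toString();
    }
}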

>[MS] That Cocoon doesn't invalidate the cache when external entities
>change. It's a bug, and it needs fixing to be able to apply Cocoon to any
>reasonable XML documents... Anyway, that's way off topic (although I've
>seen it brought up on the cocoon-users archive before now).

IMHO this is a bug in the XML spec, which does not mandate that. Ditto with 
the XSLT spec and <xsl:include> and <xsl:import>. Since Cocoon does not 
mandate the use of a specific parser (or XSLT processor), in principle it 
cannot guarantee that external entities will be cached correctly (even if it 
gets its own caching right). We should instead recommend the use of 
<util:include-*> tags and/or XInclude, both of which are under Cocoon's 
control and can be made to work correctly.
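
For example, an XInclude-style include might look something like this 
(illustrative only; the file name is made up, and the xi prefix has to be 
bound to whatever namespace the current XInclude draft specifies):

  <xi:include href="footer.xml"/>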

(Sorry this post is very Cocoon-centric! ;)



Re: Encoding and Ext. Entities (was Deprecating the root element)

Posted by Matt Sergeant <ma...@sergeant.org>.
On Thu, 9 Nov 2000, Robin Green wrote:

> "The Java compiler and other Java tools can only process files which contain 
> Latin-1 and/or Unicode-encoded (\udddd notation) characters.
> native2ascii converts files which contain other character encodings into 
> files containing Latin-1 and/or Unicode-encoded characters."

Yikes - how odd that Java is so forward-looking with Unicode, yet the
compiler can't cope with UTF-encoded source files directly (that is what
the above says, right?). *shrug*

Anyway, I think your solution is better, Robin. It certainly sounds more
like the right thing to do. FWIW, it's a non-issue in Perl, in case anyone
was wondering.

-- 
<Matt/>

    /||    ** Director and CTO **
   //||    **  AxKit.com Ltd   **  ** XML Application Serving **
  // ||    ** http://axkit.org **  ** XSLT, XPathScript, XSP  **
 // \\| // **     Personal Web Site: http://sergeant.org/     **
     \\//
     //\\
    //  \\