Posted to dev@xalan.apache.org by Jon Smirl <jo...@mediaone.net> on 2000/08/18 02:20:59 UTC

Xalan-C: strings and Unicode

Here's another idea that came up while talking about Unicode on the Xerces
list. It will only work in the C version of Xalan.

98% of the time the input and output charsets of an XSLT transform are the
same.  The notable exception to this is when the input document is in EBCDIC
and the output is Latin1. But right now Xalan is dealing with everything in
UCS2. For eight-bit input documents, UCS2 means a double transcode always
takes place (once on input, once on output), and memory consumption and copy
time are doubled.

The second part of this is that many of the user's character strings (not
element names) are passed from input to output without Xalan ever looking at
them. Xalan just takes an input pointer to the string and then copies it to
the output stream. In other words, doing a substring() in Xalan is an
uncommon thing.

The idea is to keep the strings in their native charset until Xalan needs to
access them. So a string that is simply copied from input to output would be
stored in its native format (in my case 8 bits) and just be copied to the
output stream without being transcoded. If the sheet did something like a
substring() on the string, it would trigger a conversion to UCS2, and the
result would then be transcoded back at output time.
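
To make the idea concrete, here is a rough sketch of what such a string
might look like. It is only an illustration under stated assumptions: the
class name LazyString and its methods are hypothetical (nothing like this
exists in Xalan or Xerces today), the native charset is assumed to be
Latin1 so that widening is a simple byte-to-code-unit copy, and
std::u16string stands in for the UCS2 buffer.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sketch: a string that stays in its native 8-bit charset
// (assumed Latin1 here) until something actually needs to inspect it.
class LazyString {
public:
    explicit LazyString(std::string nativeBytes)
        : native_(std::move(nativeBytes)), transcoded_(false) {}

    // Pass-through path: if the string was never widened, the native
    // bytes are appended to the output as-is, with no transcoding.
    void writeTo(std::string& out) const {
        if (!transcoded_) {
            out += native_;
            return;
        }
        // Otherwise transcode the 16-bit form back to Latin1.
        for (char16_t c : wide_) {
            out += (c <= 0xFF)
                       ? static_cast<char>(static_cast<unsigned char>(c))
                       : '?';  // not representable in Latin1
        }
    }

    // Access path: the first operation that looks inside the string
    // triggers the conversion to 16-bit code units.
    LazyString substring(std::size_t pos, std::size_t len) {
        widen();
        LazyString result{std::string()};
        result.wide_ = wide_.substr(pos, len);
        result.transcoded_ = true;
        return result;
    }

    bool isTranscoded() const { return transcoded_; }

private:
    // Latin1 maps one-to-one onto the first 256 Unicode code points,
    // so widening is a plain copy of each byte into a 16-bit unit.
    void widen() {
        if (transcoded_) return;
        wide_.reserve(native_.size());
        for (unsigned char b : native_) {
            wide_.push_back(static_cast<char16_t>(b));
        }
        transcoded_ = true;
    }

    std::string native_;     // bytes exactly as they arrived from the parser
    std::u16string wide_;    // filled in only on demand
    bool transcoded_;
};
```

With this shape, a string that is only copied through pays nothing: calling
writeTo() on an untouched LazyString appends the original bytes and
isTranscoded() stays false. Only a call like substring() pays for the
widening, and only for that one string. A real version would need a proper
transcoder for charsets other than Latin1.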

This enhancement would be in the middle of my performance gain list. But
when Xalan converts to using an internal, build-on-demand DOM, it could be
designed into the internal string support. This is also going to require
some cooperation from the Xerces people. Right now the strings in Xerces'
events have already been transcoded to UCS2 by the time Xalan sees them.

Jon Smirl
jonsmirl@mediaone.net