Posted to dev@xalan.apache.org by Isaac Shabtay <is...@netvision.net.il> on 2001/10/11 23:03:34 UTC

Apache's Java XML projects' portability into non-ASCII environments

Hey there.
Been to Prague during the past week, took some fresh air there... What a
beautiful city.

---------------------------

It appears that Apache's Java-based XML projects rely on some "configuration
files" in order to operate. Those files are read through a Reader object
(such as java.io.BufferedReader or java.io.InputStreamReader) that is
constructed on the input stream without an explicit encoding, so it falls
back to the platform-default encoding.

For example, in the Xalan-J project, when class
org.apache.xalan.serialize.CharInfo reads XMLEntities.res or
HTMLEntities.res, a BufferedReader is constructed on an InputStreamReader,
which in turn is constructed on an InputStream without specifying the
encoding in which the file is stored - thus assuming the platform-default
encoding.

This works well in ASCII environments, since those text files are stored as
ASCII text, so the platform-default encoding decodes them correctly.

However, a problem arises when trying to use Xalan-J in a non-ASCII
environment, such as IBM's OS/390, which uses EBCDIC. The product simply
doesn't work. I have already opened a bug report about the specific problem
in org.apache.xalan.serialize.CharInfo (see the link to bug #4000 at the
bottom), and I believe it should be fixed, but that might not be enough.
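
To illustrate the failure, here is a minimal sketch (not the actual CharInfo
code). It assumes a resource line of the form "quot 34" and uses IBM037 as a
stand-in for the EBCDIC code page of the target platform; the class and
method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Decode raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The bytes as they would sit on disk: an ASCII-encoded text line
        // (the line content is an assumed sample, not taken from the file).
        byte[] onDisk = "quot 34".getBytes(StandardCharsets.US_ASCII);

        // Decoded with the encoding the file was written in: text survives.
        System.out.println("as ASCII: " + decode(onDisk, "US-ASCII"));

        // Decoded with an EBCDIC code page, as a reader relying on the
        // platform default would do on an EBCDIC system: text is lost.
        // IBM037 ships with most full JDKs, but check before using it.
        if (Charset.isSupported("IBM037")) {
            boolean intact = "quot 34".equals(decode(onDisk, "IBM037"));
            System.out.println("intact when read as EBCDIC: " + intact);
        } else {
            System.out.println("IBM037 charset not available in this JVM");
        }
    }
}
```

The same bytes give different text depending only on the charset used to
decode them, which is exactly what changes when the platform default changes.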

Xalan-J was just an example. If I recall correctly, Xerces-J has this
problem too.

If we want Apache's Java-based XML projects to be as portable as possible,
then every place that reads a text file whose encoding is NOT standardized
(unlike Manifest files, for example, which must be UTF-8) should NOT make
any assumption about the platform-default encoding. If Xalan is built in an
ASCII environment, and its configuration files are in ASCII, then ASCII
should be specified explicitly when reading the configuration file.
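
A minimal sketch of what specifying the encoding explicitly could look
like. This is not a patch against Xalan; the class name, the readLines
helper and the sample lines are assumptions made for illustration only.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ExplicitEncodingReader {

    // Read all lines from the stream, always decoding them as US-ASCII,
    // regardless of the platform-default encoding.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.US_ASCII))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for a resource file stored as ASCII (assumed content).
        byte[] data = "quot 34\namp 38\n".getBytes(StandardCharsets.US_ASCII);
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

The only difference from the pattern described above is the second argument
to the InputStreamReader constructor; on JDKs that predate StandardCharsets,
the constructor taking a charset name (for example "ASCII") plays the same
role.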

I'm not talking about the actual way of doing this (hard-coding it into the
source, or reading some properties from the manifest files, which are
standardized to UTF-8 so no problem there); I'm just raising the problem.
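
Purely to make the manifest option concrete, here is a rough sketch. The
"Resource-Encoding" attribute is hypothetical (no such attribute is defined
by the JAR specification or used by Xalan), as are the class and method
names, and falling back to US-ASCII is just one possible choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

public class ManifestEncodingLookup {

    // Hypothetical main attribute naming the encoding of the resource files.
    static final String ATTRIBUTE = "Resource-Encoding";

    // Parse a manifest and return the charset it declares, or US-ASCII
    // when the attribute is absent.
    static Charset resourceCharset(InputStream manifestStream) {
        try {
            Manifest manifest = new Manifest(manifestStream);
            String name = manifest.getMainAttributes().getValue(ATTRIBUTE);
            return name == null ? StandardCharsets.US_ASCII
                                : Charset.forName(name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory manifest used for the demonstration; a real one would
        // be read from META-INF/MANIFEST.MF, which is always UTF-8.
        String text = "Manifest-Version: 1.0\r\n"
                    + "Resource-Encoding: ISO-8859-1\r\n"
                    + "\r\n";
        InputStream in =
            new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        System.out.println(resourceCharset(in).name());
    }
}
```

The returned Charset could then be handed to the InputStreamReader that reads
the resource files, so the build environment rather than the runtime platform
decides how they are decoded.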

I think we should state our position on this subject. We can go and fix
this wherever needed; we can also just state that "these are the changes
that must be made in order for this product to run in a non-ASCII
environment", etc. Either way, some action is needed. My opinion is that the
code should be as portable as possible, with no modifications required from
the user (such as converting the files from ASCII to EBCDIC).


Any comments, please?


    - Isaac


P.S. Here's the link to the original problem I reported in Bugzilla:

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=4000