Posted to general@xerces.apache.org by ro...@us.ibm.com on 2000/01/13 02:08:36 UTC

More Xerces-C character encoding discussion



[Xerces-C : HPUX]

We have been discussing some character encoding issues with the HP folks
regarding their platforms. Some of this has been tangentially
discussed already, but this time I'll concentrate on the HP specific
issues. Anyone with any comments, please feel free to speak up.

Issue:

HP platforms don't necessarily store Unicode in their wchar_t type. What
they store is actually locale specific, and I assume it's never actually
Unicode in any locale. This definitely raises some issues, because of the
fact that the parser is very Unicode-centric.


Possible Solutions:

1 - Definition of XMLCh.

I don't think that such platforms would want to float XMLCh to wchar_t. They
should set XMLCh to a 16 bit unsigned value. This setting is controlled in
the per-compiler file. The XML parser code will automatically readjust to
this, though there might be some remaining issues in some of the platform
specific files like the pluggable transcoders, which will get worked out as
we find them. But the rest of the XML parser code and DOM code will
automagically just compile with XMLCh set to either a 16 or 32 bit value.

This will prevent any accidental interoperability of the local non-Unicode
wide character APIs and the Unicode APIs of the parser. All of the parser
APIs would then only accept unsigned shorts and it would also spit out
unsigned short XML data, so it would be obvious where and when you needed
to transcode in and out of the system, and L"foo" won't be passable to any
XMLCh API.
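
As a rough illustration of point 1, here is a minimal sketch that assumes
the per-compiler file sets XMLCh to unsigned short. The function
xmlStringLen and the constant kFoo are made-up names for this example, not
actual parser APIs, and the remark about L"foo" only holds on compilers
where wchar_t is a distinct built-in type.

```cpp
#include <cstddef>

// Assumption: the per-compiler file pins XMLCh to a 16 bit unsigned type
// instead of letting it float to wchar_t.
typedef unsigned short XMLCh;

// Hypothetical stand-in for any parser API that accepts XMLCh text.
std::size_t xmlStringLen(const XMLCh* text)
{
    std::size_t len = 0;
    while (text[len] != 0)
        ++len;
    return len;
}

// "foo" spelled out as Unicode code units, null terminated.
static const XMLCh kFoo[] = { 0x0066, 0x006F, 0x006F, 0x0000 };

// xmlStringLen(kFoo) compiles.
// xmlStringLen(L"foo") is rejected by the compiler when wchar_t is a
// distinct type, since const wchar_t* does not convert to const XMLCh*.
```

The compile error is the point: it marks each place where local wide
character data would have to be transcoded before entering the parser.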

2 - Calls to the System:

All calls to system or runtime APIs from the parser itself go through the
base abstractions that are plugged into the bottom of the parser. In
particular all of the system APIs are called in the per-platform support
file. So, for a platform such as HP, certainly these support files will
have to preflight the incoming XMLCh data before passing it on to the
system APIs that they call. By providing such pre-flighting code in the HP
platform support file, a large number of the problems will be taken care
of.
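
The preflighting described in point 2 might look roughly like the sketch
below. All of the names here (preflightToLocal, platformOpenFile,
kSampleName) are hypothetical, and the conversion is a deliberately crude
stand-in that only passes ASCII through. A real HP support file would call
the platform's own transcoder at that spot.

```cpp
#include <cstdio>
#include <string>

typedef unsigned short XMLCh;  // assumption: XMLCh is a 16 bit unsigned value

// Hypothetical preflight step. Stand-in conversion only: ASCII passes
// through unchanged and every other code unit becomes '?'.
std::string preflightToLocal(const XMLCh* text)
{
    std::string local;
    for (; *text != 0; ++text)
        local += (*text < 0x80) ? static_cast<char>(*text) : '?';
    return local;
}

// Hypothetical platform support entry point. The parser hands in XMLCh
// data, and the system API only ever sees the preflighted local string.
std::FILE* platformOpenFile(const XMLCh* fileName)
{
    const std::string localName = preflightToLocal(fileName);
    return std::fopen(localName.c_str(), "rb");
}

// Sample input: "a.xml" as Unicode code units, null terminated.
static const XMLCh kSampleName[] =
    { 0x0061, 0x002E, 0x0078, 0x006D, 0x006C, 0x0000 };
```

Keeping the conversion inside the support file means the parser code above
it never needs to know what the local encoding is.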

3 - Transcoders.

There are issues wrt the plugged in transcoder implementation. It is likely
that the HP platform will have to provide its own Iconv based transcoder
service implementation. This implementation will have to put a buffer
between incoming Unicode and the local iconv APIs, and between any outgoing
transcoded text that needs to come back to the parser in Unicode format.
Providing this specialized transcoder implementation will handle the bulk
of the remaining issues.

Whether this means that the existing Iconv based transcoder is just spiffed
up with some conditional code or not, I don't know. If supporting these
extra steps imposed any significant extra complexity or overhead, I would
argue for the HP folks maintaining their own Iconv based transcoder
implementation. But they can always just do the work and let's see what the
differences are. If they are reasonable to get into the existing iconv
transcoder, then we can go that way.
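
To make the buffering idea in point 3 a little more concrete, here is a
hedged sketch of the outgoing direction using the standard iconv_open,
iconv, and iconv_close calls. The helper name transcodeToLocal and the
sample data are invented for the example. The encoding names "UTF-16LE" and
"UTF-16BE" are the ones GNU iconv accepts; encoding names are not
standardized across iconv implementations, so an HP version would likely
need different ones. Some platforms also declare the input argument of
iconv as const char**, which would require a small adjustment.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

typedef unsigned short XMLCh;  // assumption: XMLCh is a 16 bit unsigned value

// Hypothetical helper: converts 'count' XMLCh code units (host byte order)
// into the named local encoding, going through a staging buffer.
std::string transcodeToLocal(const XMLCh* src, std::size_t count,
                             const char* localEncoding)
{
    // Pick the UTF-16 flavor that matches the host byte order.
    const XMLCh probe = 1;
    const bool littleEndian =
        *reinterpret_cast<const unsigned char*>(&probe) == 1;
    const char* unicodeName = littleEndian ? "UTF-16LE" : "UTF-16BE";

    iconv_t cd = iconv_open(localEncoding, unicodeName);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Staging buffer between the parser's Unicode and the iconv call.
    const char* raw = reinterpret_cast<const char*>(src);
    std::vector<char> in(raw, raw + count * sizeof(XMLCh));
    std::vector<char> out(count * 4 + 4);

    char* inPtr = in.data();
    std::size_t inLeft = in.size();
    char* outPtr = out.data();
    std::size_t outLeft = out.size();

    const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    return std::string(out.data(), out.size() - outLeft);
}

// Sample input: "caf" followed by U+00E9 (e with acute accent).
static const XMLCh kCafe[] = { 0x0063, 0x0061, 0x0066, 0x00E9 };
```

The incoming direction would mirror this, with the staging buffer sitting
between the iconv output and the XMLCh data handed back to the parser.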

4 - Unicode normalization.

The XML parser assumes that all plugged in
transcoding services pre-normalize all text that they transcode into the
Unicode encoding. If a service does not, then the parser will make no attempt to
compensate for this. So, if you provide a transcoding service, and
normalization is important to you, you might have to do some post
processing of transcoded text blocks to pre-normalize them before returning
the block of Unicode characters to the parser. The HP folks believe that
the HP implementation of iconv does not do this. We do not know if the other
Unixes do so.
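
For anyone wondering what such post processing might involve, here is a toy
sketch. It assumes, purely for illustration, that the desired form is the
composed one, and it only knows the three pairs in its table. A real
service would need the full Unicode composition data. The names
composeKnownPairs, ComposePair, and kDecomposed are made up for this
example.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned short XMLCh;  // assumption: XMLCh is a 16 bit unsigned value

struct ComposePair { XMLCh base; XMLCh mark; XMLCh composed; };

// Toy table only. Real normalization needs the full Unicode data.
static const ComposePair kPairs[] = {
    { 0x0061, 0x0301, 0x00E1 },  // a + combining acute
    { 0x0065, 0x0301, 0x00E9 },  // e + combining acute
    { 0x006F, 0x0308, 0x00F6 }   // o + combining diaeresis
};

// Hypothetical post processing pass run on a transcoded block before it
// is returned to the parser. Known base + mark pairs are merged into a
// single code unit; everything else is copied through untouched.
std::vector<XMLCh> composeKnownPairs(const XMLCh* text, std::size_t count)
{
    const std::size_t pairCount = sizeof(kPairs) / sizeof(kPairs[0]);
    std::vector<XMLCh> out;
    for (std::size_t i = 0; i < count; ++i) {
        XMLCh ch = text[i];
        if (i + 1 < count) {
            for (std::size_t p = 0; p < pairCount; ++p) {
                if (kPairs[p].base == ch && kPairs[p].mark == text[i + 1]) {
                    ch = kPairs[p].composed;
                    ++i;  // skip the combining mark that was consumed
                    break;
                }
            }
        }
        out.push_back(ch);
    }
    return out;
}

// Sample input: e, combining acute, x.
static const XMLCh kDecomposed[] = { 0x0065, 0x0301, 0x0078 };
```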

5 - The Samples.

Basically, we are leaning towards saying that on platforms such as the HP
ones, where wide chars are not Unicode, the samples just won't work. We
are loath to turn the samples into overly complicated lessons on
internationalization, when they are really intended to be simple demonstrations
of how to use the parser. Making them industrial strength would not
necessarily be in the best interests of keeping them relatively
straightforward.

We will probably just have to document this fact, and provide some basic
guidance about the real effort required in writing fully portable code on
top of the parser. Though I don't think that we can provide any really deep
tutorial on the subject, since it would be a book unto itself and that's
not what we are geared for. Perhaps the Internationalization folks might
provide some good links to send people to look at. Though most of
their discussion is likely oriented towards wchar_t being Unicode as well.
But at least we can provide a little warning that such platforms present
special concerns.

6 - Short Character Constants.

Basically, most incoming APIs of the parser have an alternate method that
takes a short character. This character is transcoded to Unicode using the
'local code page transcoder'. This transcoder is obtained by the parser by
asking the installed transcoding service to provide one. The platform code
initialization of each particular platform should do whatever is required
to make sure that this transcoder is doing the right thing for whatever
encoding a short character constant (i.e. "foo") means on that platform. If
this means consulting locale data or whatever, this can be done by the
platform implementation's initialization code. The parser does not get
involved in such things.
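
The pattern in point 6 could be sketched as below. The class names
(LocalCodePageTranscoder, AsciiTranscoder, Element) are invented for this
example and are not the parser's real classes. The stand-in transcoder
simply assumes that short character constants are ASCII, which is exactly
the kind of decision the platform initialization code would make for real.

```cpp
#include <string>
#include <vector>

typedef unsigned short XMLCh;  // assumption: XMLCh is a 16 bit unsigned value
typedef std::vector<XMLCh> XMLText;

// Hypothetical interface for the local code page transcoder that the
// installed transcoding service hands to the parser.
class LocalCodePageTranscoder {
public:
    virtual ~LocalCodePageTranscoder() {}
    virtual XMLText transcode(const char* localText) const = 0;
};

// Stand-in implementation: assumes short character constants are ASCII.
class AsciiTranscoder : public LocalCodePageTranscoder {
public:
    XMLText transcode(const char* localText) const {
        XMLText out;
        for (; *localText != 0; ++localText)
            out.push_back(static_cast<unsigned char>(*localText));
        return out;
    }
};

// Hypothetical parser side class with both flavors of the same API.
class Element {
public:
    explicit Element(const LocalCodePageTranscoder& lcp) : fLCP(lcp) {}

    // Unicode flavor: stores the text as given.
    void setName(const XMLText& name) { fName = name; }

    // Short character flavor: transcodes, then reuses the Unicode flavor.
    void setName(const char* name) { setName(fLCP.transcode(name)); }

    const XMLText& name() const { return fName; }

private:
    const LocalCodePageTranscoder& fLCP;
    XMLText fName;
};

// Usage: build an element name from a short character constant.
XMLText exampleName()
{
    AsciiTranscoder lcp;
    Element elem(lcp);
    elem.setName("foo");
    return elem.name();
}
```

The short character overload contains no encoding logic of its own, so
swapping in a locale aware transcoder changes the behavior without touching
the parser side class.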


* We want to stress that there should be NO calls to system APIs or wide
character runtime APIs in the parser itself. If you find any, please report
them since they are bugs. Any such work done by the parser should be done
via the provided abstraction classes in the util/ directory, mostly
XMLString and the transcoding service abstractions. If this is not strictly
followed, then #2 and #3 won't work correctly in these types of situations.



I think that, if the HP platform utilities are written to take this issue
into account, and they provide a transcoder aware of the issues, probably
that will be sufficient to solve the vast bulk of the issues, and perhaps
all of them (at least all of the ones that we believe should be dealt
with.) It will always be required on such platforms to transcode data going
into the parser or coming out of it. This only leaves the access by the
parser to system and transcoding services. As long as appropriately aware
versions of such services are plugged into the parser from the bottom,
everything should work out.

Anyway, these are some of the obvious issues, and this is intended to just kick
start the discussion. Please respond to this document with any thoughts you
have on the subject and let's beat them out.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
roddey@us.ibm.com