
Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2003/11/13 19:06:35 UTC

[d'oh!] java APIs are not powerful enough to handle the XML spec!!

The day somebody asks you why java needs to be replaced, one answer 
will be 'it only supports 16-bits chars'. laughable as it might seem, 
it's true.

yes, people, a Unicode char is not 16 bit (as I always thought!) but 32!!

And even the XML specification says so.

Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF]

do the math and you find that #x10000 cannot fit in 16 bits!
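To make that arithmetic concrete, here is a minimal sketch in Java. The class and method names are made up for illustration, and Integer.numberOfLeadingZeros() assumes Java 5 or later. It shows that #x10000 needs 17 bits, and how UTF-16 splits such a code point into two 16-bit surrogate units.

```java
// Hypothetical helper, for illustration only.
final class SurrogateMath {
    /** Number of bits needed to represent a positive code point. */
    static int bitsNeeded(int codePoint) {
        return 32 - Integer.numberOfLeadingZeros(codePoint);
    }

    /** Splits a code point in #x10000..#x10FFFF into a UTF-16 surrogate pair. */
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(0xFFFF));   // prints 16
        System.out.println(bitsNeeded(0x10000));  // prints 17
        char[] pair = toSurrogates(0x10000);
        // prints d800 dc00
        System.out.println(Integer.toHexString(pair[0]) + " "
                + Integer.toHexString(pair[1]));
    }
}
```

So the code point itself does not fit in a Java char, but it does fit in two of them.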

now, if you thought you could take the characters() SAX event and create 
a String out of it and do something useful with it (like print it, for 
example), forget it. The result will very likely not be the one you 
expect.

Another reason not to use Stings at all.

--
Stefano.

Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!

Posted by Konstantin Piroumian <kp...@apache.org>.

According to the Unicode Design Principles in the Unicode 3.0 specification:
<quot>Unicode characters have a width of 16 bits.</quot>

In the Unicode 4.0 standard, however, there are no principles related to
character width.

And according to the JavaDocs of the Character class (J2SE 1.4):
<quot>Character information is based on the Unicode Standard, version 3.0.
</quot>
And one more from the Java Language Specification:
<quot>
Versions of the Java programming language prior to 1.1 used Unicode version
1.1.5 (see The Unicode Standard: Worldwide Character Encoding (1.4) and
updates). Later versions prior to JDK version 1.1.7 used Unicode version
2.0. Since JDK version 1.1.7, Unicode 2.1 has been in use. The Java platform
will track the Unicode specification as it evolves. The precise version of
Unicode used by a given release is specified in the documentation of the
class Character.
</quot>

So, it seems that the only thing (optimistic mode is on) that should be
changed in future versions of Java to support Unicode 4.0 is the
Character class.

Regards,
  Konstantin Piroumian


Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!

Posted by Stefano Mazzocchi <st...@apache.org>.

On 13 Nov 2003, at 21:08, J.Pietschmann wrote:

> Stefano Mazzocchi wrote:
>> The day somebody asks you why java needs to be replaced, one answer 
>> will be 'it only supports 16-bits chars'. laughable as it might seem, 
>> it's true.
>> yes, people, a Unicode char is not 16 bit (as I always thought!) but 32!!
>
> This is a misconception.

yeah, well, I'm not talking about the encoding, but about the fact that 
you can't fit all unicode chars in 16 bits address space, that was my 
point.

UTF-32 uses 32 bit flat encoding. UTF-16 and UTF-8 use a different type 
of encoding (which is the same I used in the SAX compiler that cocoon 
uses).

> Unicode is an odd mixture: at the same time it defines codepoints for
> representing characters and "surrogate characters" for encoding
> non-baseplane characters (whose codepoints don't fit into 16 bit).
>
> ISO 10646 originally intended to use full 32bit for 2^32 characters.
> Because of slow progress and complaints about "wasting space", the
> Unicode consortium was formed which made quick progress on specifying
> a 16-bit character set. The surrogate characters were built in, in case
> more than 2^16 characters came along, and to give people plenty of
> room to experiment themselves in the "private areas" there. Meanwhile,
> ISO-10646 and Unicode converged: ISO limited the charset to 0x110000
> characters, which should be enough for everyone, and Unicode dropped
> the "16 bit charset" notation, they just define codepoints.
> Unfortunately for them, they can't undo the surrogate character mess
> and other wicked problems they now like to get rid of (singletons,
> certain compatibility characters, some presentation forms, ligatures).

I'm very ignorant on these things, I must admit! thanks for sharing.

> A Java "char" variable can't hold non-baseplane Unicode characters, but
> Java strings can. For Sun JVMs, they are basically UTF-16 encoded
> Unicode strings. BTW there are JVMs out there which use UTF-8 in
> Java Strings, the same way strings are stored in class files.

> The point is of course: can the run time libraries handle non-baseplane
> characters?

It's even worse! Is javac able to handle UTF-16 encoded files? If so, 
would it be able to do:

  String nonBaseplaneString = "... some non-baseplane chars here ";

and what would be the use of this, if I can't guarantee that

  new String(nonBaseplaneString.toCharArray()).equals(nonBaseplaneString)

will yield true all the time?
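For what it's worth, a quick check suggests that this particular round trip is safe: toCharArray() copies the 16-bit units unchanged, surrogates included, and equals() compares those same units. A minimal sketch, assuming U+1D538 as the sample non-baseplane character (written as an escaped surrogate pair so the source file encoding does not matter); the class and method names are made up:

```java
// Hypothetical check, for illustration only.
final class RoundTripCheck {
    /** True if the String survives a trip through a char array. */
    static boolean roundTrips(String s) {
        return new String(s.toCharArray()).equals(s);
    }

    public static void main(String[] args) {
        // 'x', U+1D538 as a surrogate pair, 'y'
        String nonBaseplaneString = "x\uD835\uDD38y";
        System.out.println(nonBaseplaneString.length());    // prints 4
        System.out.println(roundTrips(nonBaseplaneString)); // prints true
    }
}
```

Note that length() reports 4 for three characters, which is where the trouble starts for code that treats one char as one character.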

> The java.text.BreakIterator can, but that's no magic. I
> have no idea whether for example AWT display routines can display non-
> baseplane characters, mainly because I've yet to get an appropriate
> font. The TTF unicode mapping tables allocate, lo and behold, 16 bits
> for the character. Who's complaining about Java?
>
> BTW Mozilla can't deal with non-baseplane characters either, to the
> chagrin of the MathML folks who use them for mathematical presentation
> forms. Guess what's the main reason, beside fonts: C's wchar_t is 16 bit
> too.

well, to be honest, I thought as well that moving from 8 bits to 16 
bits for address space would have solved all our issues with chars once 
and for all... so I don't feel like blaming them for not having thought 
of more complex issues :-/

>> now, if you thought you could take the characters() SAX event and 
>> create a String out of it and do something useful with it (like print 
>> it, for example), forget it. The result will very likely not be the 
>> one you expect.
>
> That's an interesting observation. I never had problems in this area.
> But this may have something to do with the fact that I never went out of
> the Unicode baseplane with my chars.

Yeah, nobody ever did (this came out after testing Slide for webdav 
compliance)... but I have the feeling this will bite us in the back in 
the future.

>> Another reason not to use Stings at all.
> Stings are bad, of course :-)

:-)

> Strings are another matter. In fact, Strings should be preferred over
> char arrays because they can hide the actual representation of the Unicode
> strings.

Very true! Missed that.

> If you use character arrays, you have to deal with surrogate
> character pairs yourself. A substring() could be implemented to deal
> with non-baseplane characters correctly. Of course, Java was invented
> when people thought of Unicode as 16 bit charset, and the standardized
> behaviour is that the String methods operate on the internal char array.

talk about a mess :-(

--
Stefano.

Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Stefano Mazzocchi wrote:
> The day somebody asks you why java needs to be replaced, one answer will 
> be 'it only supports 16-bits chars'. laughable as it might seem, it's true.
> 
> yes, people, a Unicode char is not 16 bit (as I always thought!) but 32!!

This is a misconception.
Unicode is an odd mixture: at the same time it defines codepoints for
representing characters and "surrogate characters" for encoding
non-baseplane characters (whose codepoints don't fit into 16 bit).

ISO 10646 originally intended to use full 32bit for 2^32 characters.
Because of slow progress and complaints about "wasting space", the
Unicode consortium was formed which made quick progress on specifying
a 16-bit character set. The surrogate characters were built in, in case
more than 2^16 characters came along, and to give people plenty of
room to experiment themselves in the "private areas" there. Meanwhile,
ISO-10646 and Unicode converged: ISO limited the charset to 0x110000
characters, which should be enough for everyone, and Unicode dropped
the "16 bit charset" notation, they just define codepoints.
Unfortunately for them, they can't undo the surrogate character mess
and other wicked problems they now like to get rid of (singletons,
certain compatibility characters, some presentation forms, ligatures).

A Java "char" variable can't hold non-baseplane Unicode characters, but
Java strings can. For Sun JVMs, they are basically UTF-16 encoded
Unicode strings. BTW there are JVMs out there which use UTF-8 in
Java Strings, the same way strings are stored in class files.

The point is of course: can the run time libraries handle non-baseplane
characters? The java.text.BreakIterator can, but that's no magic. I
have no idea whether for example AWT display routines can display non-
baseplane characters, mainly because I've yet to get an appropriate
font. The TTF unicode mapping tables allocate, lo and behold, 16 bits
for the character. Who's complaining about Java?

BTW Mozilla can't deal with non-baseplane characters either, to the
chagrin of the MathML folks who use them for mathematical presentation
forms. Guess what's the main reason, beside fonts: C's wchar_t is 16 bit
too.

> now, if you thought you could take the characters() SAX event and create 
> a String out of it and do something useful with it (like print it, for 
> example), forget it. The result will very likely not be the one you expect.

That's an interesting observation. I never had problems in this area.
But this may have something to do with the fact that I never went out of
the Unicode baseplane with my chars. Heck, I'

> Another reason not to use Stings at all.
Stings are bad, of course :-)
Strings are another matter. In fact, Strings should be preferred over
char arrays because they can hide the actual representation of the Unicode
strings. If you use character arrays, you have to deal with surrogate
character pairs yourself. A substring() could be implemented to deal
with non-baseplane characters correctly. Of course, Java was invented
when people thought of Unicode as 16 bit charset, and the standardized
behaviour is that the String methods operate on the internal char array.
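To illustrate the point about substring(): the sketch below shows a plain substring() cutting a surrogate pair in half, next to a hypothetical variant that counts in code points. It assumes Java 5 or later, where String has codePointCount() and offsetByCodePoints(); those methods are not in J2SE 1.4. The class and method names are invented for the example.

```java
// Hypothetical helper, for illustration only (assumes Java 5 or later).
final class CodePointSubstring {
    /** Substring addressed in code points rather than 16-bit char units. */
    static String substringByCodePoints(String s, int begin, int end) {
        int from = s.offsetByCodePoints(0, begin);
        int to = s.offsetByCodePoints(from, end - begin);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD38b"; // 'a', U+1D538 (a surrogate pair), 'b'

        // Plain substring() counts char units, so it keeps only half the pair.
        String broken = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // prints true

        // Counting in code points keeps the pair together.
        String whole = substringByCodePoints(s, 0, 2);
        System.out.println(whole.length());                          // prints 3
        System.out.println(whole.codePointCount(0, whole.length())); // prints 2
    }
}
```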

J.Pietschmann

Re: [d'oh!] java APIs are not powerful enough to handle the XML spec!!

Posted by Sylvain Wallez <sy...@apache.org>.

Stefano Mazzocchi wrote:

> The day somebody asks you why java needs to be replaced, one answer 
> will be 'it only supports 16-bits chars'. laughable as it might seem, 
> it's true.
>
> yes, people, a Unicode char is not 16 bit (as I always thought!) but 32!!
>
> And even the XML specification says so.
>
> Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
> [#x10000-#x10FFFF]
>
> do the math and you find that #x10000 cannot fit in 16 bits!
>
> now, if you thought you could take the characters() SAX event and 
> create a String out of it and do something useful with it (like print 
> it, for example), forget it. The result will very likely not be the 
> one you expect.
>
> Another reason not to use Stings at all.


What are Strings but a char[] wrapped into an object? How is it 
different from the char[] given by characters() ?
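A small experiment along these lines, as a sketch only: it assumes the JAXP SAX parser bundled with the JDK, uses U+1D538 as an arbitrary non-baseplane character, and the class and method names are made up. The handler appends whatever char units characters() delivers, and the resulting String holds the surrogate pair as two chars.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo, for illustration only.
final class CharactersDemo {
    /** Parses the XML and returns all character data as one String. */
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Append the raw 16-bit units; a parser may deliver the
                // text in several chunks, so accumulate rather than replace.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = textOf("<a>&#x1D538;</a>");
        System.out.println(text.length());                         // char units
        System.out.println(text.codePointCount(0, text.length())); // code points
        System.out.println(text.equals("\uD835\uDD38"));
    }
}
```

With the stock JDK parser this should report two char units, one code point, and a match against the escaped pair, i.e. the String and the char[] carry the same UTF-16 units.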

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com