Posted to c-dev@xerces.apache.org by Gianni Mariani <ma...@orconet.com> on 2000/07/24 03:58:03 UTC

UTF-16 or UTF-8 or UCS-4

I've noticed that Xerces defines XMLCh as a 4 byte type for
Linux/GCC.

util/Compilers/GCCDefs.hpp:118:typedef wchar_t XMLCh;

Apart from

    a) Not being as per the DOM Spec
    b) Exceptionally wasteful of memory on Linux

It also is less useful for most implementors than pure UTF-8.

What are the plans for handling a more standard string type
(like the C++ standard "string") as part of the interface and assuming
that it is utf-8 ?

Re: UTF-16 or UTF-8 or UCS-4

Posted by Andy Heninger <an...@jtcsv.com>.

Yow, what a thread.

Just to be clear, what is being discussed is the format in which the
parser delivers data back to the application, not the format in which XML
documents are delivered to the parser.

In no particular order,



Gianni Mariani wrote

> [...]it is really silly for the XML standard to specify the DOMString
> encoding,
> it should specify the character set but it's really silly to specify the
> encoding.

At this point, the DOM is a standard and is not going to change.  And
compatibility with it is one of the reasons that Xerces-C XMLCh uses
UTF-16 encoding.

(As an historical aside, much of the DOM definition was a converging and
standardization of the JavaScript support found in Netscape 3 and IE 3 for
access to HTML documents.)

The other reason for 16 bit strings is the influence of Java strings.  The
C++ parser has its origins in the Xerces (XML4J at the time) Java parser,
where characters are internally 16 bits.

There's a question here that I don't know the answer to, though.  Just
what is the encoding of those 16 bit Java characters?

Unicode character values can range from 0 to 0x10ffff.  (This isn't saying
how they are represented as bits, but just saying what the abstract
numerical range of character values is.  These numbers are often called
Code Points.)

UTF-8, UTF-16 and UCS-4 can all represent all of these values.  UTF-8 and
UTF-16 may require more than one of their basic 8 or 16 bit chunks,
depending on the value to be represented.  UCS-4 just uses a 32 bit int to
hold the 21 bit value.

UCS-2 is a 16 bit fixed size encoding that can only represent values <
64k.   UCS-2 does not allow surrogate pairs, and can not represent the
characters between 0x10000 and 0x10ffff.
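
To make the surrogate pair arithmetic concrete, here is a minimal sketch in
modern C++ (C++11 fixed-width types, which the compilers discussed in this
thread would not have had).  The name encodeUtf16 is made up for illustration
and is not part of Xerces.  It assumes the Unicode code point range of 0 to
0x10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Values below 0x10000 take one unit; everything above takes a surrogate pair.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    // Surrogate values themselves are not valid code points to encode.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000;  // 20 bits remain after the offset
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

A UCS-2 consumer would have no way to interpret the two-unit result, which is
the distinction the UCS-2 versus UTF-16 question turns on.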

So now, the question - Are Java characters UCS-2 or UTF-16.  Which is to
say, are surrogate pairs supposed to be allowed in Java strings?  I
couldn't quickly find a clear statement one way or the other on Sun's web
site.

-------

Dean Roddey wrote

> Most applications, safely, can ignore surrogates and they rarely occur. If
> they are worried about them, they can deal with them, but very few things
> generate them in the real world.

> We only spit out surrogate pairs at this time, so we don't create 32 bit
> characters even if the local character size is 32 bits. This has been
> discussed as we could do so, but we wouldn't drive it purely off the size of
> the wide char.

I disagree.

The parser needs to handle all Unicode 100% correctly.  If an individual
application decides that surrogates aren't worth the trouble, that's their
business.  But Xerces itself needs to be correct, and support applications
that want to be correct.

Which means that we probably should not put UTF-16 encoded values into
UCS-4 encoded wchar_t variables.

The encoding of wchar_t is all over the place, anyway.  In addition to the
16 - 32 bit split, HPUX uses something unique to it (not Unicode based at
all) and 390 mainframes use an EBCDIC variant.

My inclination would be to never use wchar_t for the output delivered by
the parser.  Use something that is very explicitly a Unicode type that
makes no claims of direct interoperability with wchar_t.
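
As a rough illustration of what "very explicitly a Unicode type" could look
like, here is a sketch in C++11.  The names Utf16Unit and countCodePoints are
hypothetical, not anything in Xerces, and the choice of a 16 bit unit is an
assumption made only for the example.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical code unit type: always 16 bits, never tied to wchar_t.
typedef std::uint16_t Utf16Unit;
static_assert(sizeof(Utf16Unit) * CHAR_BIT == 16, "Utf16Unit must be 16 bits");

inline bool isHighSurrogate(Utf16Unit u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(Utf16Unit u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points (not code units) in a null terminated UTF-16 string.
// A well formed surrogate pair counts once; a lone surrogate counts as one.
std::size_t countCodePoints(const Utf16Unit* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (*s != 0) {
        // s[1] is at worst the terminator, so reading it is safe here.
        s += (isHighSurrogate(s[0]) && isLowSurrogate(s[1])) ? 2 : 1;
        ++n;
    }
    return n;
}
```

An application that chooses to ignore surrogates can keep counting units; one
that wants to be correct has what it needs to walk the string by code point.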

Gianni Mariani asked

> What are the plans for handling a more standard string type
> (like the C++ standard "string") as part of the interface and assuming
> that it is utf-8 ?

Using the C++ std library string would be a problem, in that we are on
some antique compilers that don't have it available.

As for the memory footprint and performance advantages of using utf-8
encoded char * strings as the parser's native internal type, I can
definitely see how this could help, particularly if the input stream was
UTF-8 or ASCII to start with.  If we were rewriting the parser, I'd want to
seriously consider it.



> You [Dean] said intel won the war - well, there were way more MIPS processors
> shipped last year than intel processors, most of them running big endian
> in embedded environments.
>
> doing
>
> cat file1 file2 > file3
>
> is no longer guaranteed to work.
>
> Or 2 programs on a networked file system appending to a "log" file can get
> their messages all mixed up.
>
> Heterogeneous environments work fine in utf-8.
>

The whole argument was, I thought, over the internal format in which the
parser delivers data to the application code.  If the data is going back
out over a wire, or into a log file, or anywhere outside of the
application process, you've got a different problem on your hands.  XML
would be a good solution; utf-8 would be a good encoding.
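
The byte order point in the quoted text can be shown with a small sketch
(C++11; the function name toBytes is invented for illustration).  The same
UTF-16 code units produce different byte streams depending on the writer's
byte order, whereas UTF-8 is defined as a byte sequence and has no such
variants.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Serialize UTF-16 code units to bytes in the requested byte order.
std::vector<unsigned char> toBytes(const std::vector<std::uint16_t>& units,
                                   bool bigEndian) {
    std::vector<unsigned char> out;
    for (std::uint16_t u : units) {
        unsigned char hi = static_cast<unsigned char>(u >> 8);
        unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        if (bigEndian) { out.push_back(hi); out.push_back(lo); }
        else           { out.push_back(lo); out.push_back(hi); }
    }
    assert(out.size() == units.size() * 2);  // two bytes per code unit
    return out;
}
```

Concatenating a file written with bigEndian set and one written without it
yields a stream that no single UTF-16 decoder setting reads correctly, which
is the "cat file1 file2" concern.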

------


Julian Pardoe wrote

> What's more the claim that Xerces currently supports internationalized
> programming is bogus in that I don't have a library of routines that
> correctly converts array of unsigned shorts from upper to lower case or
> collates them using locale-specific rules.  In fact I have no routines that
> treat unsigned shorts as Unicode characters at all!  I can't even << a
> string.

This is essentially the argument that caused us to switch XMLCh to be
wchar_t.

Also, see http://oss.software.ibm.com/icu/ for a free opensource library
that does all of these things on (!) strings of unsigned shorts.



Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

On the European issue you make a good point, and it's not just European.
It turns out that for many applications, there are still large portions of
documents that are 1 or 2 byte characters, so the impact of utf8 is minimal
(though not for all applications).  When you weigh in the other factors, my
conclusion is that, overall, it's a non-issue - not everyone can win.

Andy Heninger wrote:

> Dean wrote
>
> > I too think its not correct for the W3C to indicate the storage format,
> > though of course the encoding is perfectly legal for them to indicate. By
> > their definition, on Linux for instance, no text that we spit out could be
> > passed to local wide character APIs, which would be pretty much a waste of
> > time.
> >
>
> The W3C specifies the storage format for Java, ECMAScript and Corba IDL.
> The interfaces are live, compilable code that implementations use
> directly; the types have to be nailed down for them to work.
>
> For languages, such as C++, where the W3C has not specified a binding, we
> can do whatever we feel makes the most sense.

If you encode UTF-16, make sure the code units are not 4 bytes.

>
>
> UTF-8 Strings could be used with a DOM implementation.  The interfaces are
> just that, interfaces;  the data itself is not directly exposed by the
> DOM API.  Indexing operations would be slower in such an implementation,
> but this might well be a reasonable trade off for the reduced storage.
>
> (On the other hand, non-Europeans may have good grounds for complaint with
> utf-8 based implementations. Utf-8 requires substantially more storage
> than utf-16 for non-latin based data)
>
> Andy Heninger
> IBM XML Technology Group, Cupertino, CA
> heninger@us.ibm.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-c-dev-help@xml.apache.org

Re: UTF-16 or UTF-8 or UCS-4

Posted by Andy Heninger <an...@jtcsv.com>.

Dean wrote

> I too think its not correct for the W3C to indicate the storage format,
> though of course the encoding is perfectly legal for them to indicate. By
> their definition, on Linux for instance, no text that we spit out could be
> passed to local wide character APIs, which would be pretty much a waste of
> time.
>

The W3C specifies the storage format for Java, ECMAScript and Corba IDL.
The interfaces are live, compilable code that implementations use
directly; the types have to be nailed down for them to work.

For languages, such as C++, where the W3C has not specified a binding, we
can do whatever we feel makes the most sense.

UTF-8 Strings could be used with a DOM implementation.  The interfaces are
just that, interfaces;  the data itself is not directly exposed by the
DOM API.  Indexing operations would be slower in such an implementation,
but this might well be a reasonable trade off for the reduced storage.

(On the other hand, non-Europeans may have good grounds for complaint with
utf-8 based implementations. Utf-8 requires substantially more storage
than utf-16 for non-latin based data)
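
The storage trade-off can be put in numbers with a small sketch (C++11; the
function names are invented for this example).  It returns how many bytes one
code point needs in each encoding, following the standard UTF-8 and UTF-16
length rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8Bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. Latin-1 supplement, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the BMP, including CJK
    return 4;                    // supplementary planes
}

// Bytes needed to encode one code point in UTF-16.
std::size_t utf16Bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;  // one unit, or a surrogate pair
}
```

For 'A' (0x41) UTF-8 needs 1 byte against 2 for UTF-16; for a CJK ideograph
such as 0x65E5 it is 3 against 2, which is the complaint raised above.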



Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

This conversation (and I contributed to this) keeps getting encodings and
storage mixed up. On almost all of the platforms we are on, it is UTF-16. We
only *store* it in the wchar_t. If that happens to be 32 bits, then that's
what it is. We have to be able to pass our stuff to local wide character
APIs, or it's a huge PITA for our users. So the storage format doesn't
necessarily indicate the encoding. We currently always spit out UTF-16 (with
surrogates if the content causes them), and either spit them out in 16 or 32
bit storage, according to the custom of the local host.
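
The difference between "UTF-16 stored in 32 bit slots" and real UCS-4 only
shows up for characters outside the 16 bit range.  The sketch below (C++11,
with invented function names; it is not the Xerces implementation) puts the
two side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Storage change only: each UTF-16 unit is copied into a 32 bit slot.
// Surrogate pairs stay as two separate elements.
std::vector<std::uint32_t> widenUnits(const std::vector<std::uint16_t>& in) {
    return std::vector<std::uint32_t>(in.begin(), in.end());
}

// Encoding change: surrogate pairs are combined into one code point (UCS-4).
std::vector<std::uint32_t> decodeToUcs4(const std::vector<std::uint16_t>& in) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t u = in[i];
        bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            u = 0x10000 + ((u - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        out.push_back(u);
    }
    assert(out.size() <= in.size());
    return out;
}
```

For text with no surrogates the two functions return identical results, so
the distinction matters only when supplementary characters appear.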

I too think it's not correct for the W3C to indicate the storage format,
though of course the encoding is perfectly legal for them to indicate. By
their definition, on Linux for instance, no text that we spit out could be
passed to local wide character APIs, which would be pretty much a waste of
time.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"


----- Original Message -----
From: "Joe Polastre" <jp...@apache.org>
To: <xe...@xml.apache.org>; <ma...@orconet.com>
Sent: Monday, July 24, 2000 10:34 AM
Subject: Re: UTF-16 or UTF-8 or UCS-4


> From: "Gianni Mariani" <ma...@orconet.com>
> > My point exactly - so why is it using wchar_t - which is definitely not
> > defined as UTF-16 except on IBM and NT platforms ?
>
> It is my understanding that wchar_t is UTF-16 on just about every platform
> except for HPUX.  Or at least that XMLCh is typedef'd to something UTF-16 on
> most platforms.
>
> This is in no way an "IBM-centric" thing.
>
> Dean & Arundhati probably know much more about this than I do though.
>
> -Joe Polastre  (jpolast@apache.org)
> IBM Cupertino, XML Technology Group
>
>
>

Re: SAX2??!

Posted by Joe Polastre <jp...@apache.org>.

Simon Fell wrote up a SAX2 implementation.  It is not yet complete (a lot of
the naming conventions weren't changed from SAX1 and features/properties
weren't supported).

Anyway, I'm working with Simon and we've made some significant progress..
Like I said, we're hoping to get a fully functional SAX2 implementation in
the next point release 1.2.1.

-Joe Polastre  (jpolast@apache.org)
IBM Cupertino, XML Technology Group


----- Original Message -----
From: "Dean Roddey" <dr...@charmedquark.com>
To: <xe...@xml.apache.org>
Sent: Tuesday, July 25, 2000 7:59 PM
Subject: Re: SAX2??!


> The parser supports SAX(1), which has no namespace support. Someone recently
> whipped up a partial SAX2 implementation and put it forward for acceptance,
> but I'm not sure if anyone on the primary Xerces crew has taken him up on it
> yet.
>
> --------------------------
> Dean Roddey
> The CIDLib C++ Frameworks
> Charmed Quark Software
> droddey@charmedquark.com
> http://www.charmedquark.com
>
> "You young, and you gotcha health. Whatchoo wanna job fer?"
>
>
> ----- Original Message -----
> From: "Octav Chipara" <oc...@cse.unl.edu>
> To: <xe...@xml.apache.org>
> Sent: Tuesday, July 25, 2000 6:52 AM
> Subject: SAX2??!
>
>
> >
> >
> > Hi,
> >
> > Could some one point out what is the namespace/schema support for a SAX
> > parser in the xerces.
> >
> > Thank you in advance,
> > - Octav
> >
> >
> > ******************************************************************************
> > e-mail:         ochipara@cse.unl.edu
> > phone: (402)472-9492
> > web page: www.cse.unl.edu/~ochipara
> > ******************************************************************************
> >
> >
> >
>
>

Re: SAX2??!

Posted by Dean Roddey <dr...@charmedquark.com>.

The parser supports SAX(1), which has no namespace support. Someone recently
whipped up a partial SAX2 implementation and put it forward for acceptance,
but I'm not sure if anyone on the primary Xerces crew has taken him up on it
yet.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"


----- Original Message -----
From: "Octav Chipara" <oc...@cse.unl.edu>
To: <xe...@xml.apache.org>
Sent: Tuesday, July 25, 2000 6:52 AM
Subject: SAX2??!


>
>
> Hi,
>
> Could some one point out what is the namespace/schema support for a SAX
> parser in the xerces.
>
> Thank you in advance,
> - Octav
>
>
> ******************************************************************************
> e-mail:         ochipara@cse.unl.edu
> phone: (402)472-9492
> web page: www.cse.unl.edu/~ochipara
> ******************************************************************************
>
>
>
>

Re: SAX2??!

Posted by Joe Polastre <jp...@apache.org>.

currently there is no sax2 support.

simon fell has submitted a sax2 patch that I'm going to look at today... if
it is a good fit with the current code, it'll all be included.

the goal is to have full sax2 functionality by the next point release
(1.2.1) which should come in the next month or two.

-Joe Polastre  (jpolast@apache.org)
IBM Cupertino, XML Technology Group


----- Original Message -----
From: "Octav Chipara" <oc...@cse.unl.edu>
To: <xe...@xml.apache.org>
Sent: Tuesday, July 25, 2000 6:52 AM
Subject: SAX2??!


>
>
> Hi,
>
> Could some one point out what is the namespace/schema support for a SAX
> parser in the xerces.
>
> Thank you in advance,
> - Octav
>
>
> ******************************************************************************
> e-mail:         ochipara@cse.unl.edu
> phone: (402)472-9492
> web page: www.cse.unl.edu/~ochipara
> ******************************************************************************
>
>
>
>
>

SAX2??!

Posted by Octav Chipara <oc...@cse.unl.edu>.


Hi,

Could someone point out what the namespace/schema support is for the SAX
parser in Xerces?

Thank you in advance,
- Octav

******************************************************************************
e-mail:         ochipara@cse.unl.edu
phone:		(402)472-9492
web page:	www.cse.unl.edu/~ochipara
******************************************************************************

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

Yes it is. The local wide character APIs on Linux take wchar_t, and they
expect Unicode code points in it. We are doing exactly what is desirable on
Linux. If we didn't do this, then what we spit out wouldn't be passable to
the local wide character APIs, which would be completely stupid. What part
of this argument don't you understand? You really think that we're going to
do that just because it's technically against the standard? Do you really
think that the people who use the Linux platform would accept that argument
for why they have to transcode the Unicode that we spit out before they can
pass it to their own local Unicode APIs? I doubt it seriously. But, if you
want to take on this quest, feel free. If you get all the Linux users to
agree with you, I'm sure that we'd change it.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

----- Original Message -----
From: "Gianni Mariani" <ma...@orconet.com>
To: "Dean Roddey" <dr...@charmedquark.com>
Cc: <xe...@xml.apache.org>
Sent: Monday, July 24, 2000 11:39 PM
Subject: Re: UTF-16 or UTF-8 or UCS-4

> Dean Roddey wrote:
>
> > We *do* define our own type, which is XMLCh. On those platforms where
> > wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> > defined to what is appropriate.
>
> Well, that is not configured correctly for Linux.
>
>
>
>
>

Re: UTF-16 or UTF-8 or UCS-4

Posted by Jon Smirl <jo...@mediaone.net>.

Xalan is already thinking along these lines...

http://xml.apache.org/websrc/cvsweb.cgi/xml-xalan/c/src/XalanDOM/XalanDOMString.hpp?rev=1.2&content-type=text/vnd.viewcvs-markup

Jon Smirl
jonsmirl@mediaone.net

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

I'd suggest it would be a requirement that at least Xalan and Xerces share
a compatible string model.  You will regret it otherwise.

Andy Heninger wrote:

> ----- Original Message -----
> From: "Gianni Mariani" <ma...@orconet.com>
>
> > You probably won't like my patch.
>
> I don't know whether I'd like it or not, but it's certain that
> it wouldn't be compatible with what's there now, so it couldn't
> just transparently replace the existing implementation.
>
> If any major redesign / re-implementation is going to happen,
> I've got a long wish list too.  We should not
> just start making major API overhauls without some significant
> discussion of requirements and design alternatives first.
>
> The DOM and the Scanner/Parser/SAX API need to be treated
> separately - somehow they seem to keep getting mushed together
> in these discussions.
>
> I've got no significant complaint with the existing null terminated
> XMLCh * strings being delivered from the scanner as a way to
> feed DOM building code.  It's simple, fast, and memory consumption
> is pretty much a non-issue, since the same buffer is reused
> forever.
>
> Andy Heninger
> IBM XML Technology Group, Cupertino, CA
> heninger@us.ibm.com
>
> >
> > Firstly, I'd make DOMString a template based on standard c++ basic_string.
> > Secondly, I would make pure virtual base classes of all the interfaces as
> > templates on a DOMString specialization.
> >
> > OR
> >
> > I would make a DOMString class that exposes itself as either a standard c++
> > string or wstring and that manages to convert when appropriate.  I've
> > implemented a similar string for WinCE ports and it works remarkably well.
> >
> > OR
> >
> > Do both of the options above.
> >
> > I would remove DOMString from Xerces and place it in a standard string
> > library that had very rich support for all kinds of Unicode facilities.  I
> > would start getting C++ standardization support for it.
> >
> > And, I would not depend on wchar_t to mean anything.
> >
> > If you would take a patch like this I'd be happy to do it, but I would need a
> > commitment that it would be applied and released.
> >
> > Joe Polastre wrote:
> >
> > > ----- Original Message -----
> > > From: "Gianni Mariani" <ma...@orconet.com>
> > > > > We *do* define our own type, which is XMLCh. On those platforms where
> > > > > wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> > > > > defined to what is appropriate.
> > > > Well, that is not configured correctly for Linux.
> > >
> > > alright, this is getting ridiculous!
> > >
> > > this is an open source project!  if everyone wants the change (as you
> > > advocate) and it is incorrectly configured for linux, fix it!  and send me
> > > or dean the patch and we'll throw it into CVS for you...  you can complain
> > > until you're blue in the face on the mailing list, but nothing is going to
> > > happen unless you do it.
> > >
> > > -Joe Polastre  (jpolast@apache.org)
> > > IBM Cupertino, XML Technology Group
> >
> >

Re: UTF-16 or UTF-8 or UCS-4

Posted by Andy Heninger <an...@jtcsv.com>.


----- Original Message -----
From: "Gianni Mariani" <ma...@orconet.com>


> You probably won't like my patch.

I don't know whether I'd like it or not, but it's certain that
it wouldn't be compatible with what's there now, so it couldn't
just transparently replace the existing implementation.

If any major redesign / re-implementation is going to happen,
I've got a long wish list too.  We should not
just start making major API overhauls without some significant
discussion of requirements and design alternatives first.


The DOM and the Scanner/Parser/SAX API need to be treated
separately - somehow they seem to keep getting mushed together
in these discussions.

I've got no significant complaint with the existing null terminated
XMLCh * strings being delivered from the scanner as a way to
feed DOM building code.  It's simple, fast, and memory consumption
is pretty much a non-issue, since the same buffer is reused
forever.
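
One consequence of a reused scanner buffer is that a caller who wants to keep
the text has to copy it before the next chunk arrives.  The sketch below
(C++11) shows that pattern with a stand-in type ScanCh and invented helper
names; the real XMLCh and the scanner's actual callback contract are not
shown here and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint16_t ScanCh;  // hypothetical stand-in for XMLCh

// Length in code units of a null terminated ScanCh string.
std::size_t scanChLength(const ScanCh* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Take a private copy so the data survives the buffer being overwritten.
std::vector<ScanCh> copyScanChars(const ScanCh* s) {
    return std::vector<ScanCh>(s, s + scanChLength(s));
}
```

The copy costs one allocation per kept string, while text that is only
inspected in place costs nothing extra.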

Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com


>
> Firstly, I'd make DOMString a template based on standard c++ basic_string.
> Secondly, I would make pure virtual base classes of all the interfaces as
> templates on a DOMString specialization.
>
> OR
>
> I would make a DOMString class that exposes itself as either a standard c++
> string or wstring and that manages to convert when appropriate.  I've
> implemented a similar string for WinCE ports and it works remarkably well.
>
> OR
>
> Do both of the options above.
>
> I would remove DOMString from Xerces and place it in a standard string
> library that had very rich support for all kinds of Unicode facilities.  I
> would start getting C++ standardization support for it.
>
> And, I would not depend on wchar_t to mean anything.
>
> If you would take a patch like this I'd be happy to do it, but I would need a
> commitment that it would be applied and released.
>
> Joe Polastre wrote:
>
> > ----- Original Message -----
> > From: "Gianni Mariani" <ma...@orconet.com>
> > > > We *do* define our own type, which is XMLCh. On those platforms where
> > > > wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> > > > defined to what is appropriate.
> > > Well, that is not configured correctly for Linux.
> >
> > alright, this is getting ridiculous!
> >
> > this is an open source project!  if everyone wants the change (as you
> > advocate) and it is incorrectly configured for linux, fix it!  and send me
> > or dean the patch and we'll throw it into CVS for you...  you can complain
> > until you're blue in the face on the mailing list, but nothing is going to
> > happen unless you do it.
> >
> > -Joe Polastre  (jpolast@apache.org)
> > IBM Cupertino, XML Technology Group
>
>
>
>

Re: UTF-16 or UTF-8 or UCS-4

Posted by Joe Polastre <jp...@apache.org>.

well, I can't *guarantee* that a patch would go in...

but...

if your patch allows Xerces to continue to conform to the specs (being able
to get DOMStrings as UTF-16 by default to external applications [don't
really care what the underlying implementation is]), and does not have a
performance/size hit, then i'm all for it.  oh, and it has to compile and
work on all the platforms that are currently supported: AIX, HPUX, Solaris,
Linux, NT, OS/390, Mac, Win32/COM, etc...

-Joe Polastre  (jpolast@apache.org)
IBM Cupertino, XML Technology Group


----- Original Message -----
From: "Gianni Mariani" <ma...@orconet.com>
To: "Joe Polastre" <jp...@apache.org>
Cc: <xe...@xml.apache.org>; "Dean Roddey" <dr...@charmedquark.com>
Sent: Tuesday, July 25, 2000 11:04 AM
Subject: Re: UTF-16 or UTF-8 or UCS-4


>
> You probably won't like my patch.
>
> Firstly, I'd make DOMString a template based on standard c++ basic_string.
> Secondly, I would make pure virtual base classes of all the interfaces as
> templates on a DOMString specialization.
>
> OR
>
> I would make a DOMString class that exposes itself as either a standard c++
> string or wstring and that manages to convert when appropriate.  I've
> implemented a similar string for WinCE ports and it works remarkably well.
>
> OR
>
> Do both of the options above.
>
> I would remove DOMString from Xerces and place it in a standard string
> library that had very rich support for all kinds of Unicode facilities.  I
> would start getting C++ standardization support for it.
>
> And, I would not depend on wchar_t to mean anything.
>
> If you would take a patch like this I'd be happy to do it, but I would need a
> commitment that it would be applied and released.
>
> Joe Polastre wrote:
>
> > ----- Original Message -----
> > From: "Gianni Mariani" <ma...@orconet.com>
> > > > We *do* define our own type, which is XMLCh. On those platforms where
> > > > wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> > > > defined to what is appropriate.
> > > Well, that is not configured correctly for Linux.
> >
> > alright, this is getting ridiculous!
> >
> > this is an open source project!  if everyone wants the change (as you
> > advocate) and it is incorrectly configured for linux, fix it!  and send me
> > or dean the patch and we'll throw it into CVS for you...  you can complain
> > until you're blue in the face on the mailing list, but nothing is going to
> > happen unless you do it.
> >
> > -Joe Polastre  (jpolast@apache.org)
> > IBM Cupertino, XML Technology Group
>
>

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

You probably won't like my patch.

Firstly, I'd make DOMString a template based on standard c++ basic_string.
Secondly, I would make pure virtual base classes of all the interfaces as
templates on a DOMString specialization.

OR

I would make a DOMString class that exposes itself as either a standard c++
string or wstring and that manages to convert when appropriate.  I've
implemented a similar string for WinCE ports and it works remarkably well.

OR

Do both of the options above.

I would remove DOMString from Xerces and place it in a standard string
library that had very rich support for all kinds of Unicode facilities.  I
would start getting C++ standardization support for it.

And, I would not depend on wchar_t to mean anything.

If you would take a patch like this I'd be happy to do it, but I would need a
commitment that it would be applied and released.
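
For readers trying to picture the first option, here is one way it might look
(C++11).  Every name below is invented for illustration; no patch was posted
in this thread, so this is only a guess at the general shape, not the
proposed design.

```cpp
#include <cassert>
#include <string>

// Hypothetical: a DOM string type parameterized on the character type.
template <class CharT>
struct SketchDomString {
    typedef std::basic_string<CharT> type;
};

// Hypothetical: a pure virtual node interface templated on the string type.
template <class StringT>
class SketchNode {
public:
    virtual ~SketchNode() {}
    virtual StringT nodeName() const = 0;
};

// A trivial concrete node, only to show the interface being implemented.
template <class StringT>
class SketchElement : public SketchNode<StringT> {
public:
    explicit SketchElement(const StringT& name) : name_(name) {
        assert(!name.empty());  // element names are never empty
    }
    StringT nodeName() const override { return name_; }
private:
    StringT name_;
};
```

An application could then pick SketchDomString<char>::type for UTF-8 data or
SketchDomString<char16_t>::type for UTF-16 data, at the cost of the whole
interface hierarchy becoming templates.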

Joe Polastre wrote:

> ----- Original Message -----
> From: "Gianni Mariani" <ma...@orconet.com>
> > > We *do* define our own type, which is XMLCh. On those platforms where
> > > wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> > > defined to what is appropriate.
> > Well, that is not configured correctly for Linux.
>
> alright, this is getting ridiculous!
>
> this is an open source project!  if everyone wants the change (as you
> advocate) and it is incorrectly configured for linux, fix it!  and send me
> or dean the patch and we'll throw it into CVS for you...  you can complain
> until you're blue in the face on the mailing list, but nothing is going to
> happen unless you do it.
>
> -Joe Polastre  (jpolast@apache.org)
> IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Joe Polastre <jp...@apache.org>.

----- Original Message -----
From: "Gianni Mariani" <ma...@orconet.com>
> > We *do* define our own type, which is XMLCh. On those platforms where
> > wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> > defined to what is appropriate.
> Well, that is not configured correctly for Linux.

alright, this is getting ridiculous!

this is an open source project!  if everyone wants the change (as you
advocate) and it is incorrectly configured for linux, fix it!  and send me
or dean the patch and we'll throw it into CVS for you...  you can complain
until you're blue in the face on the mailing list, but nothing is going to
happen unless you do it.

-Joe Polastre  (jpolast@apache.org)
IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Dean Roddey wrote:

> We *do* define our own type, which is XMLCh. On those platforms where
> wchar_t *is* Unicode, we define XMLCh to be wchar_t. If its not, XMLCh is
> defined to what is appropriate.

Well, that is not configured correctly for Linux.

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

> You cannot AND MUST NOT assume that wchar_t is any particular
> encoding, it is non-portable, even though some vendors encourage
> it's use.
>
> You are left with defining your own type if you expect to be portable.
>

We *do* define our own type, which is XMLCh. On those platforms where
wchar_t *is* Unicode, we define XMLCh to be wchar_t. If it's not, XMLCh is
defined to what is appropriate.
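
A per-platform typedef of this kind might be sketched as follows.  This is
not the contents of any Xerces header: SketchCh is a made-up name, and using
_WIN32 and __STDC_ISO_10646__ as the test for "wchar_t is Unicode" is an
assumption chosen for the example.

```cpp
#include <cassert>
#include <climits>
#include <cwchar>

// Hypothetical stand-in for XMLCh; the platform test below is an assumption.
#if defined(_WIN32) || defined(__STDC_ISO_10646__)
// wchar_t is assumed to carry Unicode values on this platform.
typedef wchar_t SketchCh;
#else
// Otherwise fall back to an unsigned type of at least 16 bits.
typedef unsigned short SketchCh;
#endif

// True when one SketchCh element is wide enough to hold any code point
// (up to 0x10FFFF, i.e. 21 bits) without needing surrogate pairs.
inline bool sketchChHoldsAnyCodePoint() {
    return sizeof(SketchCh) * CHAR_BIT >= 21;
}
```

Where the function returns true, the question debated in this thread is
whether those wide elements should hold whole code points or UTF-16 units.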

> iconv is available and on IRIX and Linux at least, it knows how to convert
> to and from utf-16 (and utf-8).
>

But ICU is something that we have to deal with. It supports a huge number of
encodings and we would always want to support it on all platforms as an
alternative to the local transcoding services.

> > I said "if you want". If you want, you can. If that's what some platform has
> > done, then they did it. Perhaps they forgot to ask you, but they did it. So
> > we have little choice but deal with it.
>
> Xerces can't do it.
>

Well, it does. It's a completely practical matter that supersedes any
technicalities of the specs.

> > Even if that is technically, supposedly required,
> > it doesn't mean that its always done. Its just UTF-16 code points that
> > happen to be stored in something larger. And, in reality, since surrogated
> > values seldom occur in the real world, most UCS-4 streams *are* just UTF-16
> > stored in 32 bit values.
>
> Then Xerces is not compliant with the XML spec.
>

The people on those platforms where their wide character APIs take Unicode
would not be happy if they could not pass our output to local wide character
APIs, and I doubt very seriously if they give a flying whatever that it's not
strictly compliant with the spec in this regard. If that platform decides to
define a 32 bit wchar_t and use UTF-16 code points in 32 bit values, we'll
conform to that because it's the only practical thing to do.

The parser is compliant where it's important. Basically no one but you has
ever brought this up, probably because we've taken a completely logical and
practical approach that works. And you aren't bringing it up because it's
causing you any problem, for that matter. You are just nitpicking because
you're pissed off.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

A UTF-16 string must be stored in an array of elements where each
element is ONLY the size of 2 bytes.  PERIOD.  Otherwise IT IS NOT
UTF-16.

You cannot AND MUST NOT assume that wchar_t is any particular
encoding; it is non-portable, even though some vendors encourage
its use.

You are left with defining your own type if you expect to be portable.
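
A minimal sketch of what "defining your own type" could look like, assuming
a C++11 compiler (newer than anything discussed in this thread).
PortableXMLCh and codeUnitLength are hypothetical names used only for
illustration, not anything Xerces ships:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical name; not the typedef Xerces actually uses.
// std::uint16_t is exactly 16 bits wherever it exists, whatever wchar_t is.
typedef std::uint16_t PortableXMLCh;

static_assert(sizeof(PortableXMLCh) * CHAR_BIT == 16,
              "a UTF-16 code unit must be exactly 16 bits");

// Count the code units (not characters) before the terminating zero unit.
// A surrogate pair therefore counts as two.
std::size_t codeUnitLength(const PortableXMLCh* text)
{
    assert(text != nullptr);
    std::size_t n = 0;
    while (text[n] != 0) {
        ++n;
    }
    return n;
}
```

The cost of this choice is that such strings can no longer be handed
directly to the platform's wide character APIs unless wchar_t happens to be
the same size, which is the practical concern raised on the other side of
this thread.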

Dean Roddey wrote:

> > > The size of wchar_t has nothing to do with which encoding is used.
> >
> > The Unicode standard is very clear on this.  UTF-16 must be encoded
> > in 2 octet pairs.  I don't have a copy handy, I'd welcome you to track it
> > down.
> >
>
> It is *encoded* in two octet pairs. It's just stored in something larger.
> Look, you can talk about it all you want. But the fact is, that some
> platforms have set wchar_t to 32 bits so that it can hold any kind of wide
> character that someone might need to. If they then choose to have their wide
> character APIs work in terms of UTF-16 code points, then that's just the way
> it is. You can complain about it, but it makes no sense for us not to adjust
> to that fact. If you don't like it, take it up with them. We are just doing
> what is necessary to make the text we spit out passable to local wide
> character APIs.

I don't wish to complain - I'm trying to state fact.

>
>
> >
> > > Just because it's 4 bytes doesn't mean it's UCS-4. They are often just
> > > allowing it to hold any kind of wide character someone might want to use.
> > > But the wide character APIs on most of those are probably still expecting
> > > UTF-16, though I could be wrong and if someone wants to confirm or deny
> > > that, it would be useful.
> >
> > It's been a couple of years, but the plans were for DEC, IRIX and Solaris
> > to use UCS-4 for wide characters.  I don't think IRIX has converted to it
> > yet - I may be wrong.
> >
>
> If that is true, then the work should be done to deal with that in the
> parser, which wouldn't be too hard overall. The biggest problem will be ICU.
> For the encodings that we do ourselves, we can do it trivially with a
> conditional that is set per-platform. For the ones that we hand off to ICU,
> we'd have to get them to do this also, or post-process what we get back from
> them.

iconv is available and on IRIX and Linux at least, it knows how to convert
to and from utf-16 (and utf-8).

>
>
> > > if you want to, and you can
> > > store surrogated UTF-16 characters in them as well.
> >
> > Nope - it's not a standard Unicode format, unless you're inventing new
> > standards, and I assure you there are a lot of people who have put much more
> > energy into the Unicode standard than you or I who would really like you not
> > to do that!
> >
>
> I said "if you want". If you want, you can. If that's what some platform has
> done, then they did it. Perhaps they forgot to ask you, but they did it. So
> we have little choice but deal with it.

Xerces can't do it.

>
>
> > > And I suspect that at
> > > least some of the platforms you mention do so.
> >
> > Actually - according to the ISO-10646 standard you cannot store surrogates
> > in a UCS-4 stream - you must not.  Similarly, you cannot, must not, encode
> > surrogates in a utf-8 stream.  This is strictly disallowed, even Microsoft
> > gets this right (as of recent OS's).
> >
>
> It's not a UCS-4 stream. Again, you assume that just because the wchar_t is
> 32 bits, that it's UCS-4.

That's what it is by the Unicode standard.

> Even if that is technically, supposedly required,
> it doesn't mean that it's always done. It's just UTF-16 code points that
> happen to be stored in something larger. And, in reality, since surrogated
> values seldom occur in the real world, most UCS-4 streams *are* just UTF-16
> stored in 32 bit values.

Then Xerces is not compliant with the XML spec.

>
>
> --------------------------
> Dean Roddey
> The CIDLib C++ Frameworks
> Charmed Quark Software
> droddey@charmedquark.com
> http://www.charmedquark.com
>
> "You young, and you gotcha health. Whatchoo wanna job fer?"
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-c-dev-help@xml.apache.org

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

> > The size of wchar_t has nothing to do with which encoding is used.
>
> The Unicode standard is very clear on this.  UTF-16 must be encoded
> in 2 octet pairs.  I don't have a copy handy, I'd welcome you to track it
> down.
>

It is *encoded* in two octet pairs. It's just stored in something larger.
Look, you can talk about it all you want. But the fact is, that some
platforms have set wchar_t to 32 bits so that it can hold any kind of wide
character that someone might need to. If they then choose to have their wide
character APIs work in terms of UTF-16 code points, then that's just the way
it is. You can complain about it, but it makes no sense for us not to adjust
to that fact. If you don't like it, take it up with them. We are just doing
what is necessary to make the text we spit out passable to local wide
character APIs.

>
> > Just because it's 4 bytes doesn't mean it's UCS-4. They are often just
> > allowing it to hold any kind of wide character someone might want to use.
> > But the wide character APIs on most of those are probably still expecting
> > UTF-16, though I could be wrong and if someone wants to confirm or deny
> > that, it would be useful.
>
> It's been a couple of years, but the plans were for DEC, IRIX and Solaris
> to use UCS-4 for wide characters.  I don't think IRIX has converted to it
> yet - I may be wrong.
>

If that is true, then the work should be done to deal with that in the
parser, which wouldn't be too hard overall. The biggest problem will be ICU.
For the encodings that we do ourselves, we can do it trivially with a
conditional that is set per-platform. For the ones that we hand off to ICU,
we'd have to get them to do this also, or post-process what we get back from
them.

> > if you want to, and you can
> > store surrogated UTF-16 characters in them as well.
>
> Nope - it's not a standard Unicode format, unless you're inventing new
> standards, and I assure you there are a lot of people who have put much more
> energy into the Unicode standard than you or I who would really like you not
> to do that!
>

I said "if you want". If you want, you can. If that's what some platform has
done, then they did it. Perhaps they forgot to ask you, but they did it. So
we have little choice but deal with it.

> > And I suspect that at
> > least some of the platforms you mention do so.
>
> Actually - according to the ISO-10646 standard you cannot store surrogates
> in a UCS-4 stream - you must not.  Similarly, you cannot, must not, encode
> surrogates in a utf-8 stream.  This is strictly disallowed, even Microsoft
> gets this right (as of recent OS's).
>

It's not a UCS-4 stream. Again, you assume that just because the wchar_t is
32 bits, that it's UCS-4. Even if that is technically, supposedly required,
it doesn't mean that it's always done. It's just UTF-16 code points that
happen to be stored in something larger. And, in reality, since surrogated
values seldom occur in the real world, most UCS-4 streams *are* just UTF-16
stored in 32 bit values.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Dean Roddey wrote:

> The size of wchar_t has nothing to do with which encoding is used.

The Unicode standard is very clear on this.  UTF-16 must be encoded
in 2 octet pairs.  I don't have a copy handy, I'd welcome you to track it
down.


> Just because it's 4 bytes doesn't mean it's UCS-4. They are often just
> allowing it to hold any kind of wide character someone might want to use.
> But the wide character APIs on most of those are probably still expecting
> UTF-16, though I could be wrong and if someone wants to confirm or deny
> that, it would be useful.

It's been a couple of years, but the plans were for DEC, IRIX and Solaris to
use UCS-4 for wide characters.  I don't think IRIX has converted to it
yet - I may be wrong.

>
> Anyway, there is a difference between storage format and encoding. You can
> store ASCII code points in 32 bit characters ...

That happens to be UCS-4 standards compliant - spelled out.

> if you want to, and you can
> store surrogated UTF-16 characters in them as well.

Nope - it's not a standard Unicode format, unless you're inventing new
standards, and I assure you there are a lot of people who have put much more
energy into the Unicode standard than you or I who would really like you not
to do that!

> And I suspect that at
> least some of the platforms you mention do so.

Actually - according to the ISO-10646 standard you cannot store surrogates
in a UCS-4 stream - you must not.  Similarly, you cannot, must not, encode
surrogates in a utf-8 stream.  This is strictly disallowed, even Microsoft
gets this right (as of recent OS's).
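
To make the distinction concrete, here is a rough sketch of turning UTF-16
code units into UCS-4 values by combining surrogate pairs, so that no
surrogate ever lands in the 32-bit output. The function name is made up for
illustration, and replacing an unpaired surrogate with U+FFFD is an
assumption of this sketch, not something taken from Xerces:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert UTF-16 code units to UCS-4 code points.
std::vector<std::uint32_t> utf16ToUcs4(const std::vector<std::uint16_t>& in)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint32_t unit = in[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            const std::uint32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            // Unpaired surrogate: emit U+FFFD instead (sketch's assumption).
            out.push_back(0xFFFD);
        } else {
            // Anything else maps to the same numeric value.
            out.push_back(unit);
        }
    }
    return out;
}
```

Copying each 16-bit unit into a 32-bit slot unchanged would give the same
result for every input without surrogates, which is why the two layouts are
so easy to confuse in practice.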

>
> --------------------------
> Dean Roddey
> The CIDLib C++ Frameworks
> Charmed Quark Software
> droddey@charmedquark.com
> http://www.charmedquark.com
>
> "You young, and you gotcha health. Whatchoo wanna job fer?"
>
> ----- Original Message -----
> From: "Gianni Mariani" <ma...@orconet.com>
> To: <xe...@xml.apache.org>
> Sent: Monday, July 24, 2000 10:49 AM
> Subject: Re: UTF-16 or UTF-8 or UCS-4
>
> >
> > Joe - do some research - UTF-16 is a 2 byte encoding - by definition.
> > There are a whole lot more platforms out there that use a 4 byte wchar_t
> > including Solaris, IRIX, FreeBSD and Linux.
> >
> > Joe Polastre wrote:
> >
> > > From: "Gianni Mariani" <ma...@orconet.com>
> > > > My point exactly - so why is it using wchar_t - which is definitely
> > > > not defined as UTF-16 except on IBM and NT platforms ?
> > >
> > > It is my understanding that wchar_t is UTF-16 on just about every
> > > platform except for HPUX.  Or at least that XMLCh is typedef'd to
> > > something UTF-16 on most platforms.
> > >
> > > This is in no way an "IBM-centric" thing.
> > >
> > > Dean & Arundhati probably know much more about this than I do though.
> > >
> > > -Joe Polastre  (jpolast@apache.org)
> > > IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

The size of wchar_t has nothing to do with which encoding is used. Just
because it's 4 bytes doesn't mean it's UCS-4. They are often just allowing it
to hold any kind of wide character someone might want to use. But the wide
character APIs on most of those are probably still expecting UTF-16, though
I could be wrong and if someone wants to confirm or deny that, it would be
useful.

Anyway, there is a difference between storage format and encoding. You can
store ASCII code points in 32 bit characters if you want to, and you can
store surrogated UTF-16 characters in them as well. And I suspect that at
least some of the platforms you mention do so.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

----- Original Message -----
From: "Gianni Mariani" <ma...@orconet.com>
To: <xe...@xml.apache.org>
Sent: Monday, July 24, 2000 10:49 AM
Subject: Re: UTF-16 or UTF-8 or UCS-4

>
> Joe - do some research - UTF-16 is a 2 byte encoding - by definition.
> There are a whole lot more platforms out there that use a 4 byte wchar_t
> including Solaris, IRIX, FreeBSD and Linux.
>
> Joe Polastre wrote:
>
> > From: "Gianni Mariani" <ma...@orconet.com>
> > > My point exactly - so why is it using wchar_t - which is definitely
> > > not defined as UTF-16 except on IBM and NT platforms ?
> >
> > It is my understanding that wchar_t is UTF-16 on just about every
> > platform except for HPUX.  Or at least that XMLCh is typedef'd to
> > something UTF-16 on most platforms.
> >
> > This is in no way an "IBM-centric" thing.
> >
> > Dean & Arundhati probably know much more about this than I do though.
> >
> > -Joe Polastre  (jpolast@apache.org)
> > IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Joe - do some research - UTF-16 is a 2 byte encoding - by definition.
There are a whole lot more platforms out there that use a 4 byte wchar_t
including Solaris, IRIX, FreeBSD and Linux.

Joe Polastre wrote:

> From: "Gianni Mariani" <ma...@orconet.com>
> > My point exactly - so why is it using wchar_t - which is definitely not
> > defined as UTF-16 except on IBM and NT platforms ?
>
> It is my understanding that wchar_t is UTF-16 on just about every platform
> except for HPUX.  Or at least that XMLCh is typedef'd to something UTF-16 on
> most platforms.
>
> This is in no way an "IBM-centric" thing.
>
> Dean & Arundhati probably know much more about this than I do though.
>
> -Joe Polastre  (jpolast@apache.org)
> IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Joe Polastre <jp...@apache.org>.

From: "Gianni Mariani" <ma...@orconet.com>
> My point exactly - so why is it using wchar_t - which is definitely not
> defined as UTF-16 except on IBM and NT platforms ?

It is my understanding that wchar_t is UTF-16 on just about every platform
except for HPUX.  Or at least that XMLCh is typedef'd to something UTF-16 on
most platforms.

This is in no way an "IBM-centric" thing.

Dean & Arundhati probably know much more about this than I do though.

-Joe Polastre  (jpolast@apache.org)
IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

My point exactly - so why is it using wchar_t - which is definitely not defined
as UTF-16 except on IBM and NT platforms ?

Secondly, it is really silly for the XML standard to specify the DOMString
encoding; it should specify the character set, but it's really silly to
specify the encoding.

It would be much more convenient to C++ developers to use the C++ standard
"string" class.  It would be so much more compatible with everything else.

So I'll never use Xerces (in its native form).

Joe Polastre wrote:

> > Excerpt from the w3c idl ...
> >
> > #pragma prefix "w3c.org"
> > module dom
> > {
> >   typedef sequence<unsigned short> DOMString;
>
> while you're in the process of quoting W3C, why don't you read this:
>
> "Applications must encode DOMString using UTF-16 (defined in Appendix C.3 of
> [UNICODE] and Amendment 1 of [ISO-10646]).  The UTF-16 encoding was chosen
> because of its widespread industry practice.  Please note that for both HTML
> and XML, the document character set (and therefore the notation of numeric
> character references) is based on UCS-4."
>
> so it doesn't matter, UTF-8 will _never_ be supported as the default type by
> Xerces.
>
> www.w3.org/TR/1998/REC-DOM-Level-1-19981001/level-one-core.html
>
> -Joe Polastre  (jpolast@apache.org)
> IBM Cupertino, XML Technology Group

Re: reuse validator

Posted by Dean Roddey <dr...@charmedquark.com>.

The deal with reusing the parse is:

1. You parse once with the flag set to false, to load the DTD.
2. You parse after that with it set to true, to reuse the DTD.
3. The target XML files cannot have any internal subset.
4. They may or may not have an external subset; that subset will be ignored.
5. In order to get another DTD loaded up, you must parse something that uses
that new DTD and set the flag to false again.
6. That new DTD then becomes the DTD that will be reused.

If you think that you are following all those rules, and it's not working,
then something must be broken.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"


----- Original Message -----
From: "Drew Besser" <db...@aego.com>
To: <xe...@xml.apache.org>
Sent: Thursday, July 27, 2000 2:42 AM
Subject: reuse validator


> On xerces-c 1.2 (whatever new version it is) for Windows,
> I'm continually parsing hundreds of pages, all with the same DTD.
> One would think I would want to reuse the validator.
> However, when I set the reuse parameter to TRUE, even after I've
> successfully parsed 1 .wml page, I get the error that it doesn't
> understand the wml doctype, which is defined in the DTD of the first page
> that I parsed.
> Do I have my terms confused here, or do I need to be parsing using
> different methods?
> Any help would be appreciated.
>
> ALSO, many many thanks to those that solved the DTDs and such thread and
> fixed the http retrieval code.
>
> Cheers,
> Drew Besser

reuse validator

Posted by Drew Besser <db...@aego.com>.

On xerces-c 1.2 (whatever new version it is) for Windows,
I'm continually parsing hundreds of pages, all with the same DTD.
One would think I would want to reuse the validator.
However, when I set the reuse parameter to TRUE, even after I've
successfully parsed 1 .wml page, I get the error that it doesn't understand
the wml doctype, which is defined in the DTD of the first page that I
parsed.
Do I have my terms confused here, or do I need to be parsing using different
methods?
Any help would be appreciated.

ALSO, many many thanks to those that solved the DTDs and such thread and
fixed the http retrieval code.

Cheers,
Drew Besser

Re: UTF-16 or UTF-8 or UCS-4

Posted by Joe Polastre <jp...@apache.org>.

> Excerpt from the w3c idl ...
>
> #pragma prefix "w3c.org"
> module dom
> {
>   typedef sequence<unsigned short> DOMString;

while you're in the process of quoting W3C, why don't you read this:

"Applications must encode DOMString using UTF-16 (defined in Appendix C.3 of
[UNICODE] and Amendment 1 of [ISO-10646]).  The UTF-16 encoding was chosen
because of its widespread industry practice.  Please note that for both HTML
and XML, the document character set (and therefore the notation of numeric
character references) is based on UCS-4."

so it doesn't matter, UTF-8 will _never_ be supported as the default type by
Xerces.

www.w3.org/TR/1998/REC-DOM-Level-1-19981001/level-one-core.html

-Joe Polastre  (jpolast@apache.org)
IBM Cupertino, XML Technology Group

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Here is code that will implement it for you.

ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C
ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.H

You can get a full description of the utf-8 transformation from the
Unicode Standard.

Basically it goes like this

Unicode code point
0-0x7f               -> 0xxxxxxx    (1 byte ascii)
0x80-0x7ff           -> 110xxxxx 10xxxxxx
0x800-0xffff         -> 1110xxxx 10xxxxxx 10xxxxxx
0x10000-0x1fffff     -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x200000-0x3ffffff   -> 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x4000000-0x7fffffff -> 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The algorithm in CVTUTF.C also deals with surrogates correctly for UTF-16 and
it optimizes fairly well.
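
As a rough illustration of the table above (and not a substitute for
CVTUTF.C), the 1 to 4 byte rows can be sketched like this. encodeUtf8 is a
hypothetical name, and stopping at 0x10FFFF is a choice of this sketch: that
is the range UTF-16 can also reach, and the 5 and 6 byte forms were later
dropped from UTF-8 by RFC 3629.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 using the bit patterns from the
// table: the leading byte carries the length, each following byte is 10xxxxxx.
std::string encodeUtf8(std::uint32_t cp)
{
    // Surrogate values (0xD800-0xDFFF) must not be encoded in UTF-8.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {                         // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                 // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {               // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, 0x20AC falls in the third row and comes out as the three bytes
E2 82 AC. Input that arrives as UTF-16 needs its surrogate pairs combined
into a single code point before being passed to a function like this.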


Yen Trinh wrote:

> I'm currently looking at converting our XML document into UTF-8
> unicode.  Where can I find definition of UTF-8 character set?
>
> Thanks,
> Yen
>
> Dean Roddey wrote:
>
> > > I am contemplating writing a utf-8  dom library - yes, either by hacking
> > > Xerces or patching one together.  I have yet to decide.  I would sure
> > > use it .
> > >
> >
> > All you have to do is to transcode the data you get from the current DOM
> > into UTF-8. Otherwise, you won't gain a whole lot. You'll save some DOM
> > memory, but that's it. The rest of the parser will still spit out Unicode in
> > the local wide string format. You'll just transcode it further upstream.
> >

Re: UTF-16 or UTF-8 or UCS-4

Posted by Yen Trinh <yt...@interlog.com>.

I'm currently looking at converting our XML document into UTF-8
unicode.  Where can I find definition of UTF-8 character set?

Thanks,
Yen

Dean Roddey wrote:

> > I am contemplating writing a utf-8  dom library - yes, either by hacking
> > Xerces or patching one together.  I have yet to decide.  I would sure
> > use it .
> >
> 
> All you have to do is to transcode the data you get from the current DOM
> into UTF-8. Otherwise, you won't gain a whole lot. You'll save some DOM
> memory, but that's it. The rest of the parser will still spit out Unicode in
> the local wide string format. You'll just transcode it further upstream.
>

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Dean Roddey wrote:

> > > Most applications, safely, can ignore surrogates and they rarely occur.
> > > If they are worried about them, they can deal with them, but very few
> > > things generate them in the real world.
> >
> > Well sure - just like in most code I can ignore everything but ascii !
> >
>
> That's a completely bogus argument. You can ignore Unicode surrogates for
> almost every language on the planet. You can ignore everything but ASCII for
> almost nothing but English.

It's not a bogus argument - it's true.  It's not true for all multibyte
encodings, but it is true for utf-8 !  Think a little harder.

Most byte based parsing algorithms - like finding fields in a tab delimited
list, finding fields in a ":" delimited list, or finding the elements of a
path name - work fine if the encoding is utf-8.  This is not true for SJIS or
BIG-5.
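
A small sketch of why that holds, assuming nothing beyond the standard
library: every byte of a multibyte utf-8 sequence has its high bit set, so
an ASCII delimiter byte can never turn up inside one. splitBytes is a
hypothetical helper, not code from Xerces:

```cpp
#include <string>
#include <vector>

// Split a byte string on a single ASCII delimiter without decoding it.
// For utf-8 input this cannot cut a multibyte character in half, because
// no lead or continuation byte is ever in the ASCII range.
std::vector<std::string> splitBytes(const std::string& s, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(s.substr(start));  // last (or only) field
            break;
        }
        fields.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

Splitting the utf-8 bytes of "café:€:x" on ':' this way yields three fields
with the two and three byte characters intact.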

I challenge you to find code in Xerces that really benefits from being
utf-16 based.

>
> > >
> > > We only spit out surrogate pairs at this time, so we don't create 32 bit
> > > characters even if the local character size is 32 bits. This has been
> > > discussed as we could do so, but we wouldn't drive it purely off the
> > > size of the wide char.
> >
> > Well, that's not going to work if the host wchar_t is UCS-4  !
> >
>
> Just because the host wchar_t is 32 bits doesn't mean it's UCS-4 necessarily.
> But, if it is, the parser could be changed to deal with this relatively
> easily. It hasn't been done yet, because we've yet to port to a platform in
> which this is the case. Linux has a wchar_t that is 32 bits, but it's still
> UTF-16, AFAIK.
>
> > >
> > > We know that wchar_t is not necessarily Unicode, which is why we use
> > > XMLCh instead. On those platforms where it's not Unicode, and HP is our
> > > only one, they can choose to deal with it however they feel most
> > > appropriate. But, any system which does not use Unicode for its wide char
> > > representation is kind of out of the mainstream and they just have to deal
> > > with it.
> >
> > I believe Solaris is UCS-4 as well. IRIX is probably some old unix
> > encoding. Really, it's a mish mash - you can't rely on it ...
> >
>
> I don't think Solaris is, because we already run on Solaris, though I don't get
> involved in the ports personally.

The point is you can't rely on it !  Your uncertainty makes my point.


>
>
> > > > Oh we forgot to mention how you're supposed to deal with BOM
> > > > (or ISO-10646 signatures), so now you have a stateful encoding as well
> > > > while utf-8 is endian independent.
> > > >
> > >
> > > We deal with BOM's for you, and in the data we spit out, it's a non
> > > issue, since it's not there.
> >
> > That's nice - you wouldn't have to concern yourself at all with utf-8.
> >
>
> Ok, that's 2 lines of code or so. That should reduce the complexity of the
> parser considerably.

Actually it's a lot more complex than just 2 lines of code.  In a heterogeneous
environment it's a nightmare.

You said intel won the war - well, there were way more MIPS processors
shipped last year than intel processors, most of them running big endian
in embedded environments.

Doing

cat file1 file2 > file3

is no longer guaranteed to work.

Or 2 programs on a networked file system appending to a "log" file can get
their messages all mixed up.

Heterogeneous environments work fine in utf-8.
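
The cat case can be sketched as follows, assuming each file is UTF-16 and
starts with its own byte order mark. bomAt and Utf16Order are hypothetical
names; the point is only that the second file's mark ends up in the middle
of the combined stream, where a reader that checks just the start will not
look:

```cpp
#include <cassert>
#include <string>

enum class Utf16Order { None, BigEndian, LittleEndian };

// Report which UTF-16 byte order mark, if any, sits at byte offset pos.
Utf16Order bomAt(const std::string& bytes, std::string::size_type pos)
{
    assert(pos <= bytes.size());
    if (pos + 1 >= bytes.size()) {
        return Utf16Order::None;  // fewer than two bytes left
    }
    const unsigned char b0 = static_cast<unsigned char>(bytes[pos]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[pos + 1]);
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian;
    return Utf16Order::None;
}
```

If file1 was written big endian and file2 little endian, the concatenation
reports BigEndian at offset 0 and LittleEndian at the offset where file2
begins, so everything after that point is misread by a decoder that fixed
its byte order from the first mark.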


>
>
> > > UTF-8 is in no way superior to Unicode, unless you just want it for
> > > some reason. But, it's the lingua franca of encodings these days and it's
> > > what we will continue to use I'm sure.
> >
> > I think you're missing the point - UTF-8 is a normative part of the
> > Unicode standard - It IS Unicode.  When you refer to Unicode you may need
> > to be more specific.  UTF-16 is an abortion, but also a normative part of
> > the standard.
> >
>
> I meant to say UTF-16, I was typing too fast. Your opinion of UTF-16 is not
> shared by most of the world. If you have some deep issues with UTF-16, there
> are probably drugs that can help with that. In the meantime, the rest of the
> world will continue to move forward.

Unicode the character set is the lingua franca - no questions.  UTF-16 has
a long long way to go before it can be deemed the lingua franca.


>
>
> > > Get used to it. It's the way the entire world will be before long. We
> > > deal with all the surrogate pairs. So what we spit out has dealt with
> > > them. If you choose to worry about them, which mostly you won't need to
> > > do, you can.
> >
> > I beg to differ.  I managed internationalization projects for 5 years.
> > Never needed to do a whole hog conversion to utf-16 and it is the wrong
> > answer because it is too expensive in every respect.  A simple string class
> > that has both interfaces does the trick and gets product out sooner without
> > having months of re-development.
> >
>
> That's your opinion, but I think you are flat wrong.

Nice argument.  I'm sorry if you feel cornered and have no valid response
but some meat supporting your perspective would be good.


>
>
> > > And, by everyone continuing to use UTF-8 you expect this to change?
> >
> > utf-16 can die for all I care.  utf-8 and utf-16 cover exactly the same
> > character set - actually, utf-8 covers more code-points.
> >
>
> Obviously you do have deep issues. Did a UTF encoding hurt you as a child or
> something?

Let's not get personal.

>
>
> > It sounds like you've been drinking the Microsoft Kool-Aid for
> > too long.  The support for utf-16 is a political one, it would be
> > disaster for Microsoft & IBM if it was not.  No one else cares that
> > much since most other implementations of wchar_t are 4 byte.
> >
>
> I could care less what MS thinks. *I* think that Unicode, in either a UTF-16
> or UCS-4 format, is where the world is going to end up. UTF-8 sucks. One of
> the major reasons you say UTF-16 sucks (surrogates) is an order of magnitude
> worse in UTF-8. *Any* standard that gets accepted must get away from
> variable byte encodings. UTF-16 is that for basically anything you are
> likely to deal with in reality. If UCS-4 becomes more widely used, fine.
> Memory is disposable these days compared to the stupidities of using
> variable bytes string representations.

Dealing with the real world, unfortunately, Unicode is much more complex
than ISO-8859, SJIS etc.  You may need to deal with combining characters and
bidirectional marks, all of which can cause even the most benign code to become
completely unmanageable. Gone are the days when a wide character is a character,
even for UCS-4.

So if you need to deal with that complexity, you may as well not have to
force people to rewrite all the code they have now and live with the lesser
of 2 evils - utf-8.

>
>
> > Anyway, there are over 100,000 Chinese characters still to be
> > encoded and they will break the barrier alone.  Plus many many
> > languages still to be encoded.
> >
>
> Given that hardly any application on the planet comes close to straining
> the support provided by UTF-16, I'm not going to lose sleep about it. And
> any application that does, already has wide enough applicability that it will
> be saleable almost anywhere anyone has enough money to know what software is
> and why they would want to use it.

Same argument applies for utf-8 !  In fact, anybody who has written code
based on byte based strings will likely have very little to do to support
Unicode the character set.  Much less investment - same basic return.


>
>
> > > I don't know if the DOM deals with this or not. It should be updated if
> > > required. But the issue is that we would deal with it, not the customer
> > > for the most part. Most of the DOM doesn't care because it generally
> > > doesn't split up strings, or when it does it's on a known character like
> > > the colon.
> > ':' is also an ASCII character - works for utf-8 too !
> >
>
> That's hardly the point. The point was that your arguments against UTF-16
> don't make much sense since any splits that DOM does are not subject to any
> surrogate confusion.

Xerces is not Unicode compliant.


>
>
> > > Again, only in rare cases where you are dealing with data that will
> > > create them. You are trying to use the rare possibility of a problem as
> > > an argument because it's one of the few you've got.
> >
> > To be Unicode (TM) compliant you have to deal with them.  If you have to
> > deal with them - you may as well stick with utf-8.
> >
>
> We do deal with them. But the point is that most applications know that they
> will never see them, or they can safely explicitly indicate that they will
> not deal with them specifically. I don't think that most people would
> consider this much of a limitation.

Your assumption is invalid - this is the "World Wide Web" with the order
of a billion documents.  XML will be the interchange mechanism - Xerces
for better or worse will need to deal with them correctly sooner or later.
Once you do, you're in the same boat as if you had used utf-8 from the
beginning.


> > > And we have two completely different builds, which have to be tested
> > > separately on every platform, for every compiler. If you want to
> > > volunteer for that work, feel free to do so.  And some of our compilers
> > > probably still don't support the STL stuff.
> >
> > I am contemplating writing a utf-8  dom library - yes, either by hacking
> > Xerces or patching one together.  I have yet to decide.  I would sure
> > use it .
> >
>
> All you have to do is to transcode the data you get from the current DOM
> into UTF-8. Otherwise, you won't gain a whole lot. You'll save some DOM
> memory, but that's it. The rest of the parser will still spit out Unicode in
> the local wide string format. You'll just transcode it further upstream.

I could also change the string implementation.

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

> > Most applications, safely, can ignore surrogates and they rarely occur.
> > If they are worried about them, they can deal with them, but very few
> > things generate them in the real world.
>
> Well sure - just like in most code I can ignore everything but ascii !
>

That's a completely bogus argument. You can ignore Unicode surrogates for
almost every language on the planet. You can ignore everything but ASCII for
almost nothing but English.

> >
> > We only spit out surrogate pairs at this time, so we don't create 32 bit
> > characters even if the local character size is 32 bits. This has been
> > discussed as we could do so, but we wouldn't drive it purely off the
> > size of the wide char.
>
> Well, that's not going to work if the host wchar_t is UCS-4  !
>

Just because the host wchar_t is 32 bits doesn't mean it's necessarily UCS-4.
But, if it is, the parser could be changed to deal with this relatively
easily. It hasn't been done yet, because we've yet to port to a platform on
which this is the case. Linux has a wchar_t that is 32 bits, but it's still
UTF-16, AFAIK.

> >
> > We know that wchar_t is not necessarily Unicode, which is why we use
> > XMLCh instead. On those platforms where it's not Unicode, and HP is our
> > only one, they can choose to deal with it however they feel most
> > appropriate. But, any system which does not use Unicode for its wide char
> > representation is kind of out of the mainstream and they just have to
> > deal with it.
>
> I believe Solaris is UCS-4 as well. IRIX is probably some old unix
> encoding. Really, it's a mish mash - you can't rely on it ...
>

I don't think Solaris is, because we already run on Solaris, though I don't
get involved in the ports personally.

> > > Oh we forgot to mention how you're supposed to deal with BOM
> > > (or ISO-10646 signatures), so now you have a stateful encoding as well
> > > while utf-8 is endian independant.
> > >
> >
> > We deal with BOM's for you, and in the data we spit out, it's a non
> > issue, since it's not there.
>
> That's nice - you wouldn't have to concern yourself at all with utf-8.
>

Ok, that's 2 lines of code or so. That should reduce the complexity of the
parser considerably.

> > UTF-8 is in no way superior to Unicode, unless you just want it for
> > some reason. But, it's the lingua franca of encodings these days and it's
> > what we will continue to use I'm sure.
>
> I think you're missing the point - UTF-8 is a normative part of the
> Unicode standard - It IS Unicode.  When you refer to Unicode you may need
> to be more specific.  UTF-16 is an abortion, but also a normative part of
> the standard.
>

I meant to say UTF-16, I was typing too fast. Your opinion of UTF-16 is not
shared by most of the world. If you have some deep issues with UTF-16, there
are probably drugs that can help with that. In the meantime, the rest of the
world will continue to move forward.

> > Get used to it. It's the way the entire world will be before long. We
> > deal with all the surrogate pairs. So what we spit out has dealt with
> > them. If you choose to worry about them, which mostly you won't need to
> > do, you can.
>
> I beg to differ.  I managed internationalization projects for 5 years.
> Never needed to do a whole hog conversion to utf-16 and it is the wrong
> answer because it is too expensive in every respect.  A simple string class
> that has both interfaces does the trick and gets product out sooner without
> having months of re-development.
>

That's your opinion, but I think you are flat wrong.

> > And, by everyone continuing to use UTF-8 you expect this to change?
>
> utf-16 can die for all I care.  utf-8 and utf-16 cover exactly the same
> character set - actually, utf-8 covers more code-points.
>

Obviously you do have deep issues. Did a UTF encoding hurt you as a child or
something?

> It sounds like you've been drinking the Microsoft Kool-Aid for
> too long.  The support for utf-16 is a political one, it would be
> disaster for Microsoft & IBM if it was not.  No one else cares that
> much since most other implementations of wchar_t are 4 byte.
>

I could care less what MS thinks. *I* think that Unicode, in either a UTF-16
or UCS-4 format, is where the world is going to end up. UTF-8 sucks. One of
the major reasons you say UTF-16 sucks (surrogates) is an order of magnitude
worse in UTF-8. *Any* standard that gets accepted must get away from
variable byte encodings. UTF-16 is that for basically anything you are
likely to deal with in reality. If UCS-4 becomes more widely used, fine.
Memory is disposable these days compared to the stupidities of using
variable byte string representations.

> Anyway, there are over 100,000 Chinese characters still to be
> encoded and they will break the barrier alone.  Plus many many
> languages still to be encoded.
>

Given that hardly any application on the planet comes close to straining
the support provided by UTF-16, I'm not going to lose sleep about it. And
any application that does already has such wide applicability that it will be
saleable almost anywhere anyone has enough money to know what software is
and why they would want to use it.

> > I don't know if the DOM deals with this or not. It should be updated if
> > required. But the issue is that we would deal with it, not the customer
> > for the most part. Most of the DOM doesn't care because it generally
> > doesn't split up strings, or when it does it's on a known character like
> > the colon.
>
> ':' is also an ASCII character - works for utf-8 too !
>

That's hardly the point. The point was that your arguments against UTF-16
don't make much sense since any splits that DOM does are not subject to any
surrogate confusion.

> > Again, only in rare cases where you are dealing with data that will
> > create them. You are trying to use the rare possibility of a problem as
> > an argument because it's one of the few you've got.
>
> To be Unicode (TM) compliant you have to deal with them.  If you have to
> deal with them - you may as well stick with utf-8.
>

We do deal with them. But the point is that most applications know that they
will never see them, or they can safely explicitly indicate that they will
not deal with them specifically. I don't think that most people would
consider this much of a limitation.

> > And we have two completely different builds, which have to be tested
> > separately on every platform, for every compiler. If you want to
> > volunteer for that work, feel free to do so.  And some of our compilers
> > probably still don't support the STL stuff.
>
> I am contemplating writing a utf-8  dom library - yes, either by hacking
> Xerces or patching one together.  I have yet to decide.  I would sure
> use it .
>

All you have to do is to transcode the data you get from the current DOM
into UTF-8. Otherwise, you won't gain a whole lot. You'll save some DOM
memory, but that's it. The rest of the parser will still spit out Unicode in
the local wide string format. You'll just transcode it further upstream.
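
As a rough illustration of the transcoding step mentioned above, here is a
minimal sketch of turning UTF-16 code units into UTF-8 bytes. It is not the
Xerces transcoder: utf16ToUtf8 is a hypothetical helper, it assumes the data
is available as a C++11 std::u16string rather than as XMLCh, and it replaces
any unpaired surrogate with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces API): UTF-16 code units -> UTF-8 bytes.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      // unpaired low surrogate
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The surrogate branch is the only place where two input units map to one
character; everything else is a one-to-one unit conversion.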

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Dean Roddey wrote:

> > > Personally, I think you are wrong on all counts.
> > >
> > > Unicode is the future. UTF-8 is a convenient interchange format, but
> > > it's pretty useless as a live string format because of its variable
> > > byte nature. And, by spitting out Unicode in the local wchar_t format,
> > > we allow our stuff to be passed to local wide character APIs directly.
> > > This is very, very important and we took this step very deliberately.
> > > You can easily transcode it to a local format if you need to.
> >
> > utf-16 is just as hard as utf-8 since you need to deal with surrogates -
> > UTF-16 is a MULTI-CHARACTER format.  Now, on some machines
> > (e.g. Solaris, IRIX, Linux), wchar_t is UCS-4 and not UTF-16 which
> > means that semantics of wchar_t are different on different platforms.
> > Secondly, wchar_t is not necessarily unicode (and it is not on many
> > platforms) so having the same data type is misleading at best.
> >
>
> Most applications, safely, can ignore surrogates and they rarely occur. If
> they are worried about them, they can deal with them, but very few things
> generate them in the real world.

Well sure - just like in most code I can ignore everything but ascii !

>
> We only spit out surrogate pairs at this time, so we don't create 32 bit
> characters even if the local character size is 32 bits. This has been
> discussed as we could do so, but we wouldn't drive it purely off the size of
> the wide char.

Well, that's not going to work if the host wchar_t is UCS-4  !

>
> We know that wchar_t is not necessarily Unicode, which is why we use XMLCh
> instead. On those platforms where its not Unicode, and HP is our only one,
> they can choose to deal with it however they feel most appropriate. But, any
> system which does not use Unicode for its wide char representation is kind
> of out of the mainstream and they just have to deal with it.

I believe Solaris is UCS-4 as well. IRIX is probably some old unix encoding.
Really, it's a mish mash - you can't rely on it ...

>
> > Oh we forgot to mention how you're supposed to deal with BOM
> > (or ISO-10646 signatures), so now you have a stateful encoding as well
> > while utf-8 is endian independant.
> >
>
> We deal with BOM's for you, and in the data we spit out, its a non issue,
> since its not there.

That's nice - you wouldn't have to concern yourself at all with utf-8.


>
>
> > utf-8 is superior, it's compatible with many of the API's that are
> > available NOW - no need to have to re-implement things just because of
> > Xerces and it probably leads to more efficient and faster code since you
> > deal with fewer cache misses and use less memory.
> >
>
> UTF-8 is in no way superior to Unicode, unless you just want it for some
> reason. But, its the lingua franca of encodings these days and its what we
> will continue to use I'm sure.

I think you're missing the point - UTF-8 is a normative part of the Unicode
standard - It IS Unicode.  When you refer to Unicode you may need to be more
specific.  UTF-16 is an abortion, but also a normative part of the standard.

>
>
> > I have ported many applications to use wchar_t only to find that the
> > performance degraded significantly.  Some well placed mbslen() or
> > utf8_len macros (or inlines) does just as well and supports the
> > entire Unicode character set.  You WILL need to do the same kind
> > of thing for utf-16 since it special cases "Surrogate Pairs".
> >
>
> Get used to it. Its the way the entire world will be before long. We deal
> with all the surrogate pairs. So what we spit out has dealt with them. If
> you choose to worry about them, which mostly you won't need to do, you can.

I beg to differ.  I managed internationalization projects for 5 years.  Never
needed to do a whole hog conversion to utf-16 and it is the wrong answer
because it is too expensive in every respect.  A simple string class that
has both interfaces does the trick and gets product out sooner without
having months of re-development.

>
> > utf-8 already has a lot of support while utf-16 is implemented poorly
> > at best ( even by good ol' Microsoft - I know, I used to work for
> > them ).
> >
>
> And, by everyone continuing to use UTF-8 you expect this to change?

utf-16 can die for all I care.  utf-8 and utf-16 cover exactly the same
character set - actually, utf-8 covers more code-points.

>
> > >
> > > What exactly do you think that the DOM spec says about how
> > > implementations represent their Unicode characters? I think you may be
> > > reading more into it than really exists.
> >
> > Excerpt from the w3c idl ...
> >
> > #pragma prefix "w3c.org"
> > module dom
> > {
> >   typedef sequence<unsigned short> DOMString;
> >
> >
>
> And how exactly does this help you? It doesn't get you UTF-8, and in the
> meantime we spit out characters that aren't passable to local wide string
> APIs. Pretty stupid if you ask me.

never mind - the point is not worth the effort.

>
> > >
> > >
> > > If you think it's wasteful, talk to the Linux folks, not us. Since
> > > Unicode is the future, if Linux's wchar_t is really that wasteful, it's
> > > going to have a hard time in the future. But, as long as they define
> > > wchar_t that way, we need to spit it out in that format.
> >
> > I was a member of the Unicode consortium, I know Unicode the character
> > set is the way of the future, however Unicode the encoding or UTF-16 is
> > an abortion at best.  The whole reason to go to a 16 bit code was to not
> > have to deal with multicharacter sequences, well they got that wrong !
> >
>
> Only in pretty rare circumstances. What real world data do you run into that
> really requires surrogates? If you have any questions, you can talk to the
> President of the consortium, who works there at JTC with the XML parser
> team.

Let's see what's above the bmp ?

Language tags http://www.unicode.org/unicode/reports/tr7/

- this is an Approved report even the W3C recognizes these !

Wait a minute - here are some code points that are not
even representable in UTF-16 - but representable
in utf-8 !

http://www.unicode.org/unicode/reports/tr19/

This sounds like the old argument - sure 16 bits is enough.

It sounds like you've been drinking the Microsoft Kool-Aid for
too long.  The support for utf-16 is a political one, it would be
disaster for Microsoft & IBM if it was not.  No one else cares that
much since most other implementations of wchar_t are 4 byte.

A proposal from IBM to do surrogates in utf-8 (yep 2 multibyte
characters to one Unicode character) was rejected with prejudice.
Pollution for bad mistakes.

Anyway, there are over 100,000 Chinese characters still to be
encoded and they will break the barrier alone.  Plus many many
languages still to be encoded.

Yep, they are rare but that's what was said when characters
were 6 bits.

>
> > Let's see - DOMString::substringData is able to split a string between
> > surrogate pairs - oops !
> >
>
> I don't know if the DOM deals with this or not. It should be updated if
> required. But the issue is that we would deal with it, not the customer for
> the most part. Most of the DOM doesn't care because it generally doesn't
> spit up strings, or when it does its on a known character like the colon.

':' is also an ASCII character - works for utf-8 too !

>
>
> > So is length supposed to be the number of Unicode characters ?  Well
> > it's not !
> >
>
> Again, only in rare case where you are dealing with data that will create
> them. You are trying to use the rare possibility of a problem as an argument
> because its one of the few you've got.

To be Unicode (TM) compliant you have to deal with them.  If you have to
deal with them - you may as well stick with utf-8.

>
> > >
> > > The standard C++ string also supports wide characters, which will be
> > > (surprise) in the local wchar_t format, unless they are really stupid
> > > in any particular implementation of the standard. If we didn't use the
> > > standard string, and that would mean that all the compilers we support
> > > have to support this 'standard', I'd imagine we'd use the wide
> > > character version, because that's all that would make sense for XML.
> >
> > The stdb++ string is based on the "basic_string" template ... wow, you
> > get the best of both worlds.  It would be nice if Xerces also supported
> > both worlds, heck, even Microsoft supports char * API's.
> >
>
> And we have two completely different builds, which have to be tested
> separated on every platform, for every compiler. If you want to volunteer
> for that work, feel free to do so.  And some of our compilers probably still
> don't support the STL stuff.

I am contemplating writing a utf-8  dom library - yes, either by hacking
Xerces or patching one together.  I have yet to decide.  I would sure
use it .

>
> --------------------------
> Dean Roddey
> The CIDLib C++ Frameworks
> Charmed Quark Software
> droddey@charmedquark.com
> http://www.charmedquark.com
>
> "You young, and you gotcha health. Whatchoo wanna job fer?"
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-c-dev-help@xml.apache.org

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

> > Personally, I think you are wrong on all counts.
> >
> > Unicode is the future. UTF-8 is a convenient interchange format, but
> > it's pretty useless as a live string format because of its variable byte
> > nature. And, by spitting out Unicode in the local wchar_t format, we
> > allow our stuff to be passed to local wide character APIs directly. This
> > is very, very important and we took this step very deliberately. You can
> > easily transcode it to a local format if you need to.
>
> utf-16 is just as hard as utf-8 since you need to deal with surrogates -
> UTF-16 is a MULTI-CHARACTER format.  Now, on some machines
> (e.g. Solaris, IRIX, Linux), wchar_t is UCS-4 and not UTF-16 which
> means that semantics of wchar_t are different on different platforms.
> Secondly, wchar_t is not necessarily unicode (and it is not on many
> platforms) so having the same data type is misleading at best.
>

Most applications, safely, can ignore surrogates and they rarely occur. If
they are worried about them, they can deal with them, but very few things
generate them in the real world.

We only spit out surrogate pairs at this time, so we don't create 32 bit
characters even if the local character size is 32 bits. This has been
discussed as we could do so, but we wouldn't drive it purely off the size of
the wide char.

We know that wchar_t is not necessarily Unicode, which is why we use XMLCh
instead. On those platforms where it's not Unicode, and HP is our only one,
they can choose to deal with it however they feel most appropriate. But, any
system which does not use Unicode for its wide char representation is kind
of out of the mainstream and they just have to deal with it.

> Oh we forgot to mention how you're supposed to deal with BOM
> (or ISO-10646 signatures), so now you have a stateful encoding as well
> while utf-8 is endian independant.
>

We deal with BOM's for you, and in the data we spit out, it's a non issue,
since it's not there.

> utf-8 is superior, it's compatible with many of the API's that are
> available NOW - no need to have to re-implement things just because of
> Xerces and it probably leads to more efficient and faster code since you
> deal with fewer cache misses and use less memory.
>

UTF-8 is in no way superior to Unicode, unless you just want it for some
reason. But, it's the lingua franca of encodings these days and it's what we
will continue to use I'm sure.

> I have ported many applications to use wchar_t only to find that the
> performance degraded significantly.  Some well placed mbslen() or
> utf8_len macros (or inlines) does just as well and supports the
> entire Unicode character set.  You WILL need to do the same kind
> of thing for utf-16 since it special cases "Surrogate Pairs".
>

Get used to it. It's the way the entire world will be before long. We deal
with all the surrogate pairs. So what we spit out has dealt with them. If
you choose to worry about them, which mostly you won't need to do, you can.

> utf-8 already has a lot of support while utf-16 is implemented poorly
> at best ( even by good ol' Microsoft - I know, I used to work for
> them ).
>

And, by everyone continuing to use UTF-8 you expect this to change?

> >
> > What exactly do you think that the DOM spec says about how
> > implementations represent their Unicode characters? I think you may be
> > reading more into it than really exists.
>
> Excerpt from the w3c idl ...
>
> #pragma prefix "w3c.org"
> module dom
> {
>   typedef sequence<unsigned short> DOMString;
>
>

And how exactly does this help you? It doesn't get you UTF-8, and in the
meantime we spit out characters that aren't passable to local wide string
APIs. Pretty stupid if you ask me.

> >
> >
> > If you think it's wasteful, talk to the Linux folks, not us. Since
> > Unicode is the future, if Linux's wchar_t is really that wasteful, it's
> > going to have a hard time in the future. But, as long as they define
> > wchar_t that way, we need to spit it out in that format.
>
> I was a member of the Unicode consortium, I know Unicode the character
> set is the way of the future, however Unicode the encoding or UTF-16 is
> an abortion at best.  The whole reason to go to a 16 bit code was to not
> have to deal with multicharacter sequences, well they got that wrong !
>

Only in pretty rare circumstances. What real world data do you run into that
really requires surrogates? If you have any questions, you can talk to the
President of the consortium, who works there at JTC with the XML parser
team.

> Let's see - DOMString::substringData is able to split a string between
> surrogate pairs - oops !
>

I don't know if the DOM deals with this or not. It should be updated if
required. But the issue is that we would deal with it, not the customer for
the most part. Most of the DOM doesn't care because it generally doesn't
split up strings, or when it does it's on a known character like the colon.

> So is length supposed to be the number of Unicode characters ?  Well
> it's not !
>

Again, only in rare cases where you are dealing with data that will create
them. You are trying to use the rare possibility of a problem as an argument
because it's one of the few you've got.

> >
> > The standard C++ string also supports wide characters, which will be
> > (surprise) in the local wchar_t format, unless they are really stupid in
> > any particular implementation of the standard. If we didn't use the
> > standard string, and that would mean that all the compilers we support
> > have to support this 'standard', I'd imagine we'd use the wide character
> > version, because that's all that would make sense for XML.
>
> The stdb++ string is based on the "basic_string" template ... wow, you
> get the best of both worlds.  It would be nice if Xerces also supported
> both worlds, heck, even Microsoft supports char * API's.
>

And we have two completely different builds, which have to be tested
separately on every platform, for every compiler. If you want to volunteer
for that work, feel free to do so.  And some of our compilers probably still
don't support the STL stuff.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

Re: UTF-16 or UTF-8 or UCS-4

Posted by Gianni Mariani <ma...@orconet.com>.

Dean Roddey wrote:

> > I've noticed that Xerces defines XMLCh as a 4 byte size type for
> > Linux/GCC.
> >
> > util/Compilers/GCCDefs.hpp:118:typedef wchar_t XMLCh;
> >
> > Apart from
> >
> >     a) Not being as per the DOM Spec
> >     b) Exceptionally wasteful of memory on Linux
> >
> > It also is less useful for most implementors than pure UTF-8.
> >
> > What are the plans for handling a more standard string type
> > (like the C++ standard "string") as part of the interface and assuming
> > that it is utf-8 ?
> >
>
> Personally, I think you are wrong on all counts.
>
> Unicode is the future. UTF-8 is a convenient interchange format, but its
> pretty useless as a live string format because of its variable byte nature.
> And, by spitting out Unicode in the local wchar_t format, we allow our stuff
> to be passed to local wide character APIs directly. This is very, very
> important and we took this step very deliberately. You can easily transcode
> it to a local format if you need to.

utf-16 is just as hard as utf-8 since you need to deal with surrogates -
UTF-16 is a MULTI-CHARACTER format.  Now, on some machines
(e.g. Solaris, IRIX, Linux), wchar_t is UCS-4 and not UTF-16 which
means that semantics of wchar_t are different on different platforms.
Secondly, wchar_t is not necessarily unicode (and it is not on many
platforms) so having the same data type is misleading at best.

Oh we forgot to mention how you're supposed to deal with BOM
(or ISO-10646 signatures), so now you have a stateful encoding as well
while utf-8 is endian independent.
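
To make the byte order point concrete, here is a minimal sketch of the check
a consumer of serialized UTF-16 has to make before reading a single
character. The names ByteOrder and detectUtf16Bom are hypothetical and are
not taken from Xerces.

```cpp
#include <cassert>
#include <cstddef>

enum class ByteOrder { Unknown, BigEndian, LittleEndian };

// Hypothetical helper: inspects the first two bytes of a buffer for the
// UTF-16 byte order mark (U+FEFF). Note that FF FE followed by 00 00 is
// also the UTF-32 little-endian signature, which this sketch ignores.
ByteOrder detectUtf16Bom(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    if (len >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    }
    return ByteOrder::Unknown;  // no BOM: the order must come from elsewhere
}
```

A UTF-8 byte stream needs no such step, since its unit is a single byte.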

utf-8 is superior, it's compatible with many of the API's that are available
NOW - no need to have to re-implement things just because of Xerces
and it probably leads to more efficient and faster code since you deal
with fewer cache misses and use less memory.

I have ported many applications to use wchar_t only to find that the
performance degraded significantly.  Some well placed mbslen() or
utf8_len macros (or inlines) does just as well and supports the
entire Unicode character set.  You WILL need to do the same kind
of thing for utf-16 since it special cases "Surrogate Pairs".
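
For what it's worth, the kind of utf8_len inline meant here can be sketched
in a few lines. This is only an illustration: it assumes a NUL-terminated,
well-formed UTF-8 string and does no validation.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a utf8_len inline: counts Unicode code points by counting every
// byte that is NOT a continuation byte (continuation bytes look like
// 10xxxxxx). Assumes well-formed, NUL-terminated UTF-8 input.
inline std::size_t utf8_len(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Existing byte-oriented code keeps working, and only the places that really
need a character count call the helper.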

utf-8 already has a lot of support while utf-16 is implemented poorly
at best ( even by good ol' Microsoft - I know, I used to work for
them ).

>
> What exactly do you think that the DOM spec says about how implementations
> represent their Unicode characters? I think you may be reading more into it
> than really exists.

Excerpt from the w3c idl ...

#pragma prefix "w3c.org"
module dom
{
  typedef sequence<unsigned short> DOMString;

>
>
> If you think its wasteful, talk to the Linux folks, not us. Since Unicode is
> the future, if Linux's wchar_t is really that wasteful, its going to have a
> hard time in the future. But, as long as they define wchar_t that way, we
> need to spit it out in that format.

I was a member of the Unicode consortium, I know Unicode the character
set is the way of the future, however Unicode the encoding or UTF-16 is
an abortion at best.  The whole reason to go to a 16 bit code was to not
have to deal with multicharacter sequences, well they got that wrong !

Let's see - DOMString::substringData is able to split a string between
surrogate pairs - oops !

So is length supposed to be the number of Unicode characters ?  Well
it's not !
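
A small sketch of both problems, assuming the string is held as a C++11
std::u16string of 16-bit units as in the IDL above. The helper names
codePointCount and splitsSurrogatePair are hypothetical and are not part of
the DOM or of Xerces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Number of Unicode characters, as opposed to s.size(), which counts
// 16-bit units. A low surrogate that follows a high one is not counted.
std::size_t codePointCount(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        bool secondHalf = i > 0 && isLowSurrogate(s[i]) && isHighSurrogate(s[i - 1]);
        if (!secondHalf) ++n;
    }
    return n;
}

// True if cutting the string at this unit offset would land between the
// two halves of a surrogate pair.
bool splitsSurrogatePair(const std::u16string& s, std::size_t offset) {
    assert(offset <= s.size());
    return offset > 0 && offset < s.size() &&
           isHighSurrogate(s[offset - 1]) && isLowSurrogate(s[offset]);
}
```

For the four units 'a', 0xD83D, 0xDE00, 'b' the unit length is 4 while the
character count is 3, and a substring taken at unit offset 2 cuts the pair
in half.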

>
> The standard C++ string also supports wide characters, which will be
> (surprise) in the local wchar_t format, unless they are really stupid in any
> particular implementation of the standard. If we didn't use the standard
> string, and that would mean that all the compilers we support have to
> support this 'standard', I'd imagine we'd use the wide character version,
> because that's all that would make sense for XML.

The libstdc++ string is based on the "basic_string" template ... wow, you
get the best of both worlds.  It would be nice if Xerces also supported
both worlds, heck, even Microsoft supports char * API's.

>
>
> --------------------------
> Dean Roddey
> The CIDLib C++ Frameworks
> Charmed Quark Software
> droddey@charmedquark.com
> http://www.charmedquark.com
>
> "You young, and you gotcha health. Whatchoo wanna job fer?"

Re: UTF-16 or UTF-8 or UCS-4

Posted by Dean Roddey <dr...@charmedquark.com>.

> I've noticed that Xerces defines XMLCh as a 4 byte size type for
> Linux/GCC.
>
> util/Compilers/GCCDefs.hpp:118:typedef wchar_t XMLCh;
>
> Apart from
>
>     a) Not being as per the DOM Spec
>     b) Exceptionally wasteful of memory on Linux
>
> It also is less useful for most implementors than pure UTF-8.
>
> What are the plans for handling a more standard string type
> (like the C++ standard "string") as part of the interface and assuming
> that it is utf-8 ?
>

Personally, I think you are wrong on all counts.

Unicode is the future. UTF-8 is a convenient interchange format, but it's
pretty useless as a live string format because of its variable byte nature.
And, by spitting out Unicode in the local wchar_t format, we allow our stuff
to be passed to local wide character APIs directly. This is very, very
important and we took this step very deliberately. You can easily transcode
it to a local format if you need to.

What exactly do you think that the DOM spec says about how implementations
represent their Unicode characters? I think you may be reading more into it
than really exists.

If you think it's wasteful, talk to the Linux folks, not us. Since Unicode is
the future, if Linux's wchar_t is really that wasteful, it's going to have a
hard time in the future. But, as long as they define wchar_t that way, we
need to spit it out in that format.

The standard C++ string also supports wide characters, which will be
(surprise) in the local wchar_t format, unless they are really stupid in any
particular implementation of the standard. If we didn't use the standard
string, and that would mean that all the compilers we support have to
support this 'standard', I'd imagine we'd use the wide character version,
because that's all that would make sense for XML.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"