Posted to c-dev@xerces.apache.org by Mu...@GDM.DE on 2000/07/24 18:33:30 UTC

Xerces UTF-16 vs Windows UCS2


The discussion about Unicode support in Xerces as well as being highly
entertaining has got me thinking about how this stuff should work in
Windows:

As I understand it, Xerces uses UTF-16, which has variable numbers of bytes
per character, and Windows uses UCS2 which is always 2 bytes per character.
I'll forget about Windows' MBCS because I try to avoid it.

I know that I can get an 8-bit character array from a DOMString by using
DOMString::transcode(), but how do I get a UCS2 string from a DOM_String()?
Transcode converts to the 'local code page'. Presumably I need to tell it
what the local code page is, but how should I do that?

Murray Cumming
murrayc@usa.net
www.murrayc.com

Re: Xerces UTF-16 vs Windows UCS2

Posted by Andy Heninger <an...@jtcsv.com>.

A UTF-16 string that contains no surrogate pairs is exactly a UCS2 string.

A UTF-16 string that contains surrogate pairs can not be represented in
UCS2.
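
As a rough illustration of the two statements above, the check below scans a
buffer of 16-bit code units and reports whether it can be handed to a UCS-2
consumer unchanged. It is only a sketch: the names isSurrogate and isPlainUcs2
are made up for this example and are not part of Xerces, and it assumes the
text is already available as an array of 16-bit units.

```cpp
#include <cstddef>
#include <cstdint>

// True if the code unit lies in the surrogate range U+D800..U+DFFF.
// (Hypothetical helper, not a Xerces function.)
inline bool isSurrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

// True if the UTF-16 buffer contains no surrogate code units at all,
// in which case the same units are also a valid UCS-2 string.
bool isPlainUcs2(const std::uint16_t* text, std::size_t length)
{
    for (std::size_t i = 0; i < length; ++i)
    {
        if (isSurrogate(text[i]))
            return false;
    }
    return true;
}
```

If the check returns false, the string holds at least one character outside
the BMP and has no UCS-2 representation.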

Dean wrote,
> For the most part, UTF-16 and UCS2 are kind of the same. UTF-16 will only
> use surrogates if required for any characters outside of the BMP. Otherwise,
> it's not variable byte. If UCS-2 doesn't have surrogates, then I assume it
> maps straight to the BMP of UTF-16? I haven't looked at it lately so I'm
> talking straight off the top of my head and could be wrong.
>
> > I know that I can get an 8-bit character array from a DOMString by using
> > DOMString::transcode(), but how do I get a UCS2 string from a DOM_String()?
> > Transcode converts to the 'local code page'. Presumably I need to tell it
> > what the local code page is, but how should I do that?


Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com

Re: Xerces UTF-16 vs Windows UCS2

Posted by Dean Roddey <dr...@charmedquark.com>.

> The discussion about Unicode support in Xerces as well as being highly
> entertaining has got me thinking about how this stuff should work in
> Windows:
>
> As I understand it, Xerces uses UTF-16, which has variable numbers of bytes
> per character, and Windows uses UCS2 which is always 2 bytes per character.
> I'll forget about Windows' MBCS because I try to avoid it.
>

For the most part, UTF-16 and UCS-2 are kind of the same. UTF-16 only uses
surrogates for characters outside of the BMP. Otherwise, it's not variable
length: every character is a single 16-bit unit. If UCS-2 doesn't have
surrogates, then I assume it maps straight to the BMP of UTF-16? I haven't
looked at it lately, so I'm talking straight off the top of my head and could
be wrong.
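
To make the "surrogates only outside the BMP" point concrete, here is a small
sketch of how one code point turns into UTF-16 code units. The function name
appendUtf16 is invented for this example and is not a Xerces call. The
arithmetic is the standard UTF-16 encoding rule.

```cpp
#include <cstdint>
#include <vector>

// Append the UTF-16 encoding of one Unicode code point to 'out'.
// Returns the number of 16-bit units written: 1 for a BMP character,
// 2 (a surrogate pair) for anything above U+FFFF, 0 for invalid input.
// (Hypothetical helper written for illustration only.)
int appendUtf16(std::uint32_t codePoint, std::vector<std::uint16_t>& out)
{
    // Surrogate values are not characters, and nothing exists above U+10FFFF.
    if ((codePoint >= 0xD800 && codePoint <= 0xDFFF) || codePoint > 0x10FFFF)
        return 0;

    if (codePoint < 0x10000)
    {
        // BMP: the code unit is the code point, exactly as in UCS-2.
        out.push_back(static_cast<std::uint16_t>(codePoint));
        return 1;
    }

    // Outside the BMP: split the 20-bit offset across a high and low surrogate.
    const std::uint32_t offset = codePoint - 0x10000;
    out.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));
    out.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)));
    return 2;
}
```

So U+20AC stays a single unit, while U+1F600 becomes the pair 0xD83D 0xDE00.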

> I know that I can get an 8-bit character array from a DOMString by using
> DOMString::transcode(), but how do I get a UCS2 string from a DOM_String()?
> Transcode converts to the 'local code page'. Presumably I need to tell it
> what the local code page is, but how should I do that?
>

You can use the transcoding support to transcode the DOMString to whatever
encodings are supported by the transcoding system your version of the parser
uses. Just create a transcoder, give it an encoding name, and then pass it
the wide character content of the DOMString to transcode.
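
For readers who want to see what such a transcoding step does underneath, the
sketch below converts 16-bit UTF-16 units to UTF-8 by hand. It is not the
Xerces transcoder API. The function utf16ToUtf8 is a made-up stand-in using
only the standard library, and UTF-8 is just one example target encoding. A
real transcoder selected by encoding name would replace it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-16 code units to a UTF-8 byte string.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16ToUtf8(const std::uint16_t* src, std::size_t length)
{
    std::string out;
    for (std::size_t i = 0; i < length; ++i)
    {
        std::uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < length && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                ++i;
            }
            else
                cp = 0xFFFD;
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
            cp = 0xFFFD; // low surrogate with no preceding high surrogate

        if (cp < 0x80)
            out += static_cast<char>(cp);
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The source pointer and length would come from the wide character content of
the string being transcoded.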

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"You young, and you gotcha health. Whatchoo wanna job fer?"

Re: Xerces UTF-16 vs Windows UCS2

Posted by Gianni Mariani <ma...@orconet.com>.

UCS-2 is deprecated and is subsumed by UTF-16.  For all intents and purposes
there is no transformation that goes from UTF-16 to UCS-2.  However, every
UCS-2 string that avoids the two 1024-code-point ranges reserved for
surrogates is already exactly a UTF-16 string, in the same way that every
pure ASCII string is also a UTF-8 string.

Murray.Cumming@GDM.DE wrote:

> The discussion about Unicode support in Xerces as well as being highly
> entertaining has got me thinking about how this stuff should work in
> Windows:
>
> As I understand it, Xerces uses UTF-16, which has variable numbers of bytes
> per character, and Windows uses UCS2 which is always 2 bytes per character.
> I'll forget about Windows' MBCS because I try to avoid it.
>
> I know that I can get an 8-bit character array from a DOMString by using
> DOMString::transcode(), but how do I get a UCS2 string from a DOM_String()?
> Transcode converts to the 'local code page'. Presumably I need to tell it
> what the local code page is, but how should I do that?
>
> Murray Cumming
> murrayc@usa.net
> www.murrayc.com

Output of a DOM

Posted by Jérôme Lecomte <jl...@ifrance.com>.

Hi all,

I parsed an XHTML file with Xerces 1.2 and I am trying to write the DOM
document back out to a file. I have two questions:

- I tried unsuccessfully to make Xerces stream out the entities as they
are defined in the DTD. Xerces handles them correctly on input but seems
to discard that knowledge and outputs every non-representable character
as a hexadecimal character reference. Is there a way to get the entities
written back on output?

- Assuming I have to write my own output routines, what is the proper
way to output a DOM document? Is deriving an XHTMLFormatTarget from
XMLFormatTarget (to hook the function WriteTo and replace the
hexadecimal references with entities) the proper way to go?
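
To show the kind of replacement meant in the second question, here is a rough
standard-library sketch, independent of Xerces. The function
escapeWithEntities and the entity table passed to it are hypothetical. In
practice the table would have to be filled from the DTD's entity declarations,
and the routine would be called from whatever output hook is used.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Write 'text' as ASCII. A non-ASCII character becomes a named entity if the
// table knows one, otherwise a hexadecimal character reference.
// ASCII characters are copied as-is; markup escaping is out of scope here.
// (Hypothetical helper written for illustration only.)
std::string escapeWithEntities(const std::u32string& text,
                               const std::map<char32_t, std::string>& entityNames)
{
    std::string out;
    for (char32_t ch : text)
    {
        if (ch < 0x80)
        {
            out += static_cast<char>(ch);
            continue;
        }

        auto found = entityNames.find(ch);
        if (found != entityNames.end())
        {
            // Named entity, e.g. &eacute;
            out += '&';
            out += found->second;
            out += ';';
        }
        else
        {
            // Fallback: hexadecimal character reference, e.g. &#x20AC;
            char buffer[16];
            std::snprintf(buffer, sizeof buffer, "&#x%lX;",
                          static_cast<unsigned long>(ch));
            out += buffer;
        }
    }
    return out;
}
```

With a table that maps U+00E9 to "eacute", the text "café" followed by a
space and a euro sign would come out as "caf&eacute; &#x20AC;".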

Thanks.


 