Posted to c-users@xerces.apache.org by Ben Griffin <be...@redsnapper.net> on 2010/01/20 14:45:28 UTC

Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

We use many literal XMLCh* string declarations in our codebase.

I am still not sure what is the safest, but most efficient way of declaring these WITHOUT RELYING UPON A TRANSCODE.

With the compilation setting of -fshort-wchar (we are only interested in gcc) are there any problems or caveats with using:

std::basic_string<XMLCh> my_string =  (const XMLCh*)(L"the string that I wish to declare");

Are there neater ways of  doing the same?

I know of the alternative of using the chXX characters, etc:

std::basic_string<XMLCh> my_string = {'t','h','e',' ','s','t','r','i','n', ...  ,chNull};

I don't really find this acceptable - the code is more or less unreadable.  It seems crazy to have to use a transcoder because there isn't a tidy way to define a string literal.

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Webber <da...@musical.demon.co.uk>.

From: "Ben Griffin" <be...@redsnapper.net>

> We use many literal XMLCh* string declarations in our codebase.
> I am still not sure what is the safest, but most efficient way of 
> declaring these WITHOUT RELYING UPON A TRANSCODE.
>...

I must admit that for strings containing only the usual (strict) ASCII 
character set, I have just been doing

const XMLCh *szDataString = L"The string";

It works fine.  I should add that I'm using Microsoft Visual Studio 2008, 
and have no intention of porting to any non Windows platform (or pre XP 
platform).   All my text is UTF-16.

I should also say that I'm currently parsing well-defined XML files where 
node names etc are all composed of a-z, 0-9 and hyphen.    But at some point 
I am going to have to consider saving text entries in XML files which, in 
the originating software, can be arbitrary UTF-16 text, and at that point I 
suppose I may have to worry about this sort of thing.    For a start, I'd 
like to have a firmer idea of what "transcode" actually does?

Dave
David Webber
Mozart Music Software
http://www.mozart.co.uk
For discussion and support see
http://www.mozart.co.uk/mozartists/mailinglist.htm

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Webber <da...@musical.demon.co.uk>.

From: "Ben Griffin" <be...@redsnapper.net>

> * About the only common platform we don't compile under is Microsoft - our 
> code base is for unix-flavours. :D.
> * AFAIK UTF-*  formats are not fixed length encodings and, AFAIK, the 
> wchar_t is always fixed length.
>     - (Are you sure that Microsoft wchar_t IS UTF-16 (LE) and not UCS-2 
> (LE)?  I do not know - just curious).

It's definitely UTF-16.    That said, how many of the Windows APIs actually 
handle surrogate pairs (rather than leaving it to the programmer) is not a 
question I'd like to comment on :-)    I suspect we'll find out when fonts 
containing symbols at code points requiring surrogate pairs start to become 
commonplace.

> * Under a default gcc compile on Linux, L defines a four byte character, 
> not a two byte one.  It isn't UTF-32.

Not UTF-32?   If you have 4-byte characters then I'd have thought UTF-32 
would be the reason.  Many people think Microsoft's choice of UTF-16 should 
have been UTF-32 (because every character is a single unit - no 
surrogate pairs).   [Don't know what will happen if we discover that humans 
have invented more than 2^32 different symbols and we need a font with all 
of them :-)  ]

> * XMLCh is not always defined as wchar_t - as you discovered. Eg. on Mac 
> OS X it's uint16_t by default. I need to allow for that.

I am starting to appreciate that.

> * Yes, const wchar_t szSysName[] = L"System font";  is legal - AFAIK, 
> difficulties arise when you need a static cast over the declaration.

> Regarding your suggestion of a class derivation for the STL template 
> instance std::basic_string<XMLCh> , at some point it may well be 
> worthwhile for us to define an internal string class for dealing with all 
> these issues, but currently, I am quite happy to continue to use 
> std::basic_string<XMLCh> or a typdef.  My main issue is with the way of 
> declaring literals.

> The preprocessor directive I am using at the moment is
>
> #define UCS2(x) (const XMLCh*)(x)
>
> So that I can declare literals as follows:
>
> const XMLCh* myAnyString= UCS2(L"ANY");   //not perfect but better than 
> const XMLCh myAnyString[] = { chLatin_A, chLatin_N, chLatin_Y, chNull }; 
> // "ANY"
> I am happy enough with a static cast over L - especially as it seems the 
> two XMLCh options will work - it will either be redundant or it will be a 
> reliable cast.

That looks good.    In fact you could include the 'L' in the definition of 
the macro, and it would be very similar to Microsoft's  _T("xyz") which 
evaluates to L"xyz" or "xyz"  according to a definition or otherwise in the 
project.   (I am not a big fan of everything Microsoft does, but this one 
was immensely helpful in converting a very big old project to Unicode.)
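
A minimal sketch of folding the L prefix into the macro, in the style of _T. The macro name XSTR, the helper functions and the XMLCh typedef below are stand-ins for illustration, not part of Xerces-C. The cast only yields usable data when wchar_t and XMLCh have the same width (for example GCC with -fshort-wchar), and reading a wchar_t array through an XMLCh pointer is, strictly speaking, not something the C++ standard guarantees.

```cpp
// Stand-in for the Xerces-C typedef on non-Windows builds (assumption).
typedef unsigned short XMLCh;

// Hypothetical macro: pastes the L prefix onto the literal, then casts.
// XSTR("ANY") expands to reinterpret_cast<const XMLCh*>(L"ANY").
#define XSTR(s) (reinterpret_cast<const XMLCh*>(L##s))

// The result is only meaningful when both types have the same width,
// e.g. when everything is compiled with -fshort-wchar.
inline bool xstrIsUsable() {
    return sizeof(wchar_t) == sizeof(XMLCh);
}

// Example declaration using the macro.
inline const XMLCh* anyString() {
    return XSTR("ANY");
}
```

Checking xstrIsUsable() once at start-up (or turning it into a compile-time check in builds that always use -fshort-wchar) would catch a translation unit that was built without the flag.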

I appreciate this discussion.  I'm starting to feel more confident of mixing 
XMLCh and wchar_t  with Microsoft Visual Studio.   I'm not proposing to 
abandon the Microsoft compiler as I have too many shares in the MFC library 
for that.   But I'm starting to get a better overview of the portability 
issues too.

Dave
David Webber

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by Ben Griffin <be...@redsnapper.net>.

Hi Dave (W),
Thanks for your notes and observations.

Just for clarity ^_^

* About the only common platform we don't compile under is Microsoft - our code base is for unix-flavours. :D.
* AFAIK UTF-*  formats are not fixed length encodings and, AFAIK, the wchar_t is always fixed length.
     - (Are you sure that Microsoft wchar_t IS UTF-16 (LE) and not UCS-2 (LE)?  I do not know - just curious).
* Under a default gcc compile on Linux, L defines a four byte character, not a two byte one.  It isn't UTF-32.
* XMLCh is not always defined as wchar_t - as you discovered. Eg. on Mac OS X it's uint16_t by default. I need to allow for that.
* Yes, const wchar_t szSysName[] = L"System font";  is legal - AFAIK, difficulties arise when you need a static cast over the declaration.

Regarding your suggestion of a class derivation for the STL template instance std::basic_string<XMLCh> , at some point it may well be worthwhile for us to define an internal string class for dealing with all these issues, but currently, I am quite happy to continue to use std::basic_string<XMLCh> or a typedef.  My main issue is with the way of declaring literals.

The preprocessor directive I am using at the moment is

#define UCS2(x) (const XMLCh*)(x)

So that I can declare literals as follows: 

const XMLCh* myAnyString= UCS2(L"ANY");   //not perfect but better than const XMLCh myAnyString[] = { chLatin_A, chLatin_N, chLatin_Y, chNull }; // "ANY"

I am happy enough with a static cast over L - especially as it seems the two XMLCh options will work - it will either be redundant or it will be a reliable cast.

-B

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Webber <da...@musical.demon.co.uk>.

From: "Ben Griffin" <be...@redsnapper.net>

>> For those with the L operator, then
>> const XMLCh XMLUni::fgAnyString[] = { L'A', L'N', L'Y', L'\0' }
>> const XMLCh XMLUni::fgAnyString[] = L"ANY";
>
> As I understand it, the two things above may not generate the same
> results, as the width of L is sometimes more than two bytes, hence the
> need (in my OP) of the compile time flag -fshort-wchar

You mean L might define UTF-32??   Ok.  (I'm so ensconced in Microsoft's 
UTF-16 (LE) that I hadn't thought of that.)

Just some observations in case they might be helpful:

> Likewise on GCC,
> std::basic_string<XMLCh> my_string = L"the string that I wish to declare";
> (ie, without a static cast ) will generate an error message:  invalid
> conversion from 'const wchar_t*'  to 'const short unsigned int*'

I now see that XMLCh is actually defined as wchar_t  when used with Visual 
Studio 2008, so maybe I'm at an advantage here.   Does GCC not have wchar_t?

> And the example above
>> const XMLCh XMLUni::fgAnyString[] = L"ANY";
>
> generates the error "array must be initialized with a brace-enclosed
> initializer" - which is understandable.

Well maybe again if XMLCh is defined as a 2-byte integer, but character 
array initialisation in the form

const wchar_t szSysName[] = L"System font";

has, I believe, always been legal with wchar_t in the C++ spec (and is 
explicitly allowed with char in my 1991 copy of Stroustrup).

> Not usign a basic_string construct still generates the same invalid
> conversion error.
>> const XMLCh* XMLUni::fgAnyString = L"ANY";
> Produces the same effect (invalid conversion)
>
> This is why I need to use a static cast as follows:
> std::basic_string<XMLCh> my_string =  (const XMLCh*)(L"the string that I
> wish to declare");
> Using preprocessor macros (yechh) I can tidy that up somewhat of course.

It would be neater to derive a class from

std::basic_string<XMLCh>

and give it appropriate constructors and assignment operators.

> Dave (Bertoni), your question regarding if short-wchar guarantees UTF-16
> code points is a good one; albeit that we are using the short-wchar flag.
>
> I was not aware that XercescC XMLCh implementation was UTF-16;  I guess I
> erroneously thought that it was UCS-2.
> (The UCS-2 encoding form is identical to that of UTF-16, except that it
> does not support surrogate pairs and therefore can only encode characters
> in the BMP range U+0000 through U+FFFF. As a consequence it is a
> fixed-length encoding that always encodes characters into a single 16-bit
> value.)

[The only case where I have found that I personally would have to worry 
about the difference is in the collection of Unicode music symbols.   But as 
fonts don't usually have them, even that is a bit academic.]

> My string declarations only use characters that are in the UCS-2 / BMP
> range, so I am not so concerned about the need to encode surrogate pairs
> as constants. Regardless, the proposal of using the method in
> src/xercesc/util/XMLUni.cpp does not support non BMP characters.

That's what I found curious.

> More to the point of your question though; regarding the GCC C++
> flag -fshort-wchar
> http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Code-Gen-Options.html#Code%20Gen%20Options
> tells us this flag "overrides the underlying type for wchar_t to be short
> unsigned int instead of the default for the target. This option is useful
> for building programs to run under WINE."

My software runs well under wine, just using Microsoft's in-built wchar_t. 
I don't know if that is a useful observation though.

> What is salient to us is that IIRC (by default) XMLCh is defined to be a
> short unsigned int also.
>
> Therefore XMLCh == short unsigned int == wchar_t  (when the -fshort-wchar
> flag is used in GCC).
> If this is the case then, as I understand it, using the static cast (const
> XMLCh*)(L"the string that I wish to declare") should be perfectly fine.

In my version I have

typedef XERCES_XMLCH_T  XMLCh;

#ifdef _NATIVE_WCHAR_T_DEFINED
#define XERCES_XMLCH_T      wchar_t
#else
#define XERCES_XMLCH_T      unsigned short
#endif

and somewhere

_NATIVE_WCHAR_T_DEFINED

is indeed defined.

Dave
David Webber

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Bertoni <db...@apache.org>.

On 1/21/2010 7:03 PM, Ben Griffin wrote:
> On 21 Jan 2010, at 19:43, David Bertoni wrote:
>
>> But you'll need to rebuild Xerces-C with the -fshort-wchar, so why can't you have the header file changed at the same time?
> No, no we will not.
Now you've got me really confused, because I thought this whole thread 
was an issue with using this GCC option with Xerces-C. You must compile 
all binaries with this option, because it changes the ABI.

>
>> you can stop using the "L" prefix and save us all a great deal of time.
>
> You really can be rude sometimes.
Sorry, I wasn't trying to be rude. I was simply trying to reiterate what 
I stated in my previous email about the portability of wchar_t and wide 
character string constants in C++.

That said, I think it's important you understand this is a volunteer 
support forum. If you think I'm not answering your questions reasonably, 
you can always ignore my responses.

Dave

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by Ben Griffin <be...@redsnapper.net>.

On 21 Jan 2010, at 19:43, David Bertoni wrote:

> But you'll need to rebuild Xerces-C with the -fshort-wchar, so why can't you have the header file changed at the same time?
No, no we will not.

> you can stop using the "L" prefix and save us all a great deal of time.

You really can be rude sometimes.

It's of no matter. The solutions that I mentioned work fine for me with no recompile of Xerces-C, across all our target operating systems.

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Bertoni <db...@apache.org>.

On 1/21/2010 4:18 AM, Ben Griffin wrote:
> Hi Dave(s),
> First of all, regarding the point about changing the configuration of the xercesc headers; unfortunately we do not control that part of the environment.
But you'll need to rebuild Xerces-C with the -fshort-wchar, so why can't 
you have the header file changed at the same time?

>
>> For those with the L operator, then
>> const XMLCh XMLUni::fgAnyString[] = { L'A', L'N', L'Y', L'\0' }
>> const XMLCh XMLUni::fgAnyString[] = L"ANY";
>
> As I understand it, the two things above may not generate the same results, as the width of L is sometimes more than two bytes, hence the need (in my OP) of the compile time flag -fshort-wchar
I never suggested you use the "L" prefix, as it's not very portable. 
Neither of these snippets will work unless XMLCh is a typedef for 
wchar_t, and that is the case only on Windows.

>
> Likewise on GCC,
>   std::basic_string<XMLCh>  my_string = L"the string that I wish to declare";
> (ie, without a static cast ) will generate an error message:  invalid conversion from 'const wchar_t*'  to 'const short unsigned int*'

Yes, because wchar_t is a distinct type. That's why I suggested you
change the definition of XMLCh to wchar_t if you plan to use the
-fshort-wchar switch.

>
> And the example above
>> const XMLCh XMLUni::fgAnyString[] = L"ANY";
>
> generates the error "array must be initialized with a brace-enclosed initializer" - which is understandable.
Again, unless you plan to make XMLCh a typedef for wchar_t, you can stop 
using the "L" prefix and save us all a great deal of time.

This will work:

const wchar_t foo[] = { L"ANY" };

However:

const XMLCh foo[] = { L"ANY" };

won't work because wchar_t and XMLCh are distinct types.

>
> Not usign a basic_string construct still generates the same invalid conversion error.
>> const XMLCh* XMLUni::fgAnyString = L"ANY";
> Produces the same effect (invalid conversion)
>
> This is why I need to use a static cast as follows:
> std::basic_string<XMLCh>  my_string =  (const XMLCh*)(L"the string that I wish to declare");
> Using preprocessor macros (yechh) I can tidy that up somewhat of course.
That's not a static cast, it's a C-style cast, which is effectively a 
reinterpret_cast in C++. Again, that's why I suggested you change the 
typedef for XMLCh.

>
> Dave (Bertoni), your question regarding if short-wchar guarantees UTF-16 code points is a good one; albeit that we are using the short-wchar flag.
>
> I was not aware that XercescC XMLCh implementation was UTF-16;  I guess I erroneously thought that it was UCS-2.
> (The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value.)
Yes, I'm well aware of the differences between UTF-16 and UCS-2. 
Xerces-C has always supported UTF-16, since the XML recommendation 
requires support for characters outside of the BMP.

>
> My string declarations only use characters that are in the UCS-2 / BMP range, so I am not so concerned about the need to encode surrogate pairs as constants. Regardless, the proposal of using the method in src/xercesc/util/XMLUni.cpp does not support non BMP characters.
You can always initialize such surrogate pairs with integer constants. 
There just aren't any mnemonics for them, since they're not needed in 
the parser.
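
As a sketch of what initializing with integer constants might look like, here is U+1D11E (MUSICAL SYMBOL G CLEF) written out as its UTF-16 surrogate pair, plus a small helper that computes the pair for any code point above the BMP. The names kGClef, SurrogatePair and toSurrogates are hypothetical, and the XMLCh typedef is a stand-in for the Xerces-C one.

```cpp
#include <cstdint>

// Stand-in for the Xerces-C typedef (assumption: a 16-bit unsigned type).
typedef std::uint16_t XMLCh;

// U+1D11E as a null-terminated UTF-16 string: high surrogate, low surrogate.
const XMLCh kGClef[] = { 0xD834, 0xDD1E, 0x0000 };

struct SurrogatePair {
    XMLCh high;
    XMLCh low;
};

// Splits a code point in the range U+10000..U+10FFFF into its surrogates.
// The caller is responsible for passing a code point in that range.
inline SurrogatePair toSurrogates(std::uint32_t codePoint) {
    const std::uint32_t offset = codePoint - 0x10000u;
    SurrogatePair pair;
    pair.high = static_cast<XMLCh>(0xD800u + (offset >> 10));    // top 10 bits
    pair.low  = static_cast<XMLCh>(0xDC00u + (offset & 0x3FFu)); // low 10 bits
    return pair;
}
```

The helper is only there to show where the two constants come from; for a handful of fixed strings the literal values can simply be written into the array.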

>
> More to the point of your question though; regarding the GCC C++ flag -fshort-wchar
> http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Code-Gen-Options.html#Code%20Gen%20Options
> tells us this flag "overrides the underlying type for wchar_t to be short unsigned int instead of the default for the target. This option is useful for building programs to run under WINE."
Yes, I read that. However, it's important to understand that, even 
though wchar_t will be the same size as short unsigned int, it will be a 
distinct type, so this comment is misleading. This is a difference 
between C and C++, because C considers wchar_t to be a typedef.

>
> What is salient to us is that IIRC (by default) XMLCh is defined to be a short unsigned int also.
Yes, short unsigned int, which is not the same type as wchar_t.
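
The distinction between "same size" and "same type" can be checked directly. A minimal sketch, assuming XMLCh is unsigned short (the typedef below is a stand-in, and sameWidthAsWchar is a hypothetical helper name):

```cpp
#include <type_traits>

// Stand-in for the Xerces-C typedef on non-Windows builds (assumption).
typedef unsigned short XMLCh;

// In standard C++ wchar_t is a distinct built-in type, so this holds
// whatever its width is, with or without -fshort-wchar.
static_assert(!std::is_same<wchar_t, XMLCh>::value,
              "wchar_t and unsigned short are different types in C++");

// The widths, on the other hand, depend on the platform and on flags:
// typically true with -fshort-wchar, false with GCC's Linux default.
inline bool sameWidthAsWchar() {
    return sizeof(wchar_t) == sizeof(XMLCh);
}
```

This is why the uncast assignment is rejected even when the two types have identical size and representation.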

>
> Therefore XMLCh == short unsigned int == wchar_t  (when the -fshort-wchar flag is used in GCC).
> If this is the case then, as I understand it, using the static cast (const XMLCh*)(L"the string that I wish to declare") should be perfectly fine.
I think you meant to say "XMLCh == short unsigned int" and 
"sizeof(XMLCh) == sizeof(wchar_t)" when the -fshort-wchar flag is used.

Dave

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by Ben Griffin <be...@redsnapper.net>.

Hi Dave(s), 
First of all, regarding the point about changing the configuration of the xercesc headers; unfortunately we do not control that part of the environment.

> For those with the L operator, then
> const XMLCh XMLUni::fgAnyString[] = { L'A', L'N', L'Y', L'\0' }
> const XMLCh XMLUni::fgAnyString[] = L"ANY";

As I understand it, the two things above may not generate the same results, as the width of L is sometimes more than two bytes, hence the need (in my OP) of the compile time flag -fshort-wchar

Likewise on GCC, 
 std::basic_string<XMLCh> my_string = L"the string that I wish to declare";
(ie, without a static cast ) will generate an error message:  invalid conversion from 'const wchar_t*'  to 'const short unsigned int*'

And the example above 
> const XMLCh XMLUni::fgAnyString[] = L"ANY";

generates the error "array must be initialized with a brace-enclosed initializer" - which is understandable.

Not using a basic_string construct still generates the same invalid conversion error.
> const XMLCh* XMLUni::fgAnyString = L"ANY";
Produces the same effect (invalid conversion)

This is why I need to use a static cast as follows:
std::basic_string<XMLCh> my_string =  (const XMLCh*)(L"the string that I wish to declare");
Using preprocessor macros (yechh) I can tidy that up somewhat of course.

Dave (Bertoni), your question regarding if short-wchar guarantees UTF-16 code points is a good one; albeit that we are using the short-wchar flag.

I was not aware that the Xerces-C XMLCh implementation was UTF-16;  I guess I erroneously thought that it was UCS-2. 
(The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value.)

My string declarations only use characters that are in the UCS-2 / BMP range, so I am not so concerned about the need to encode surrogate pairs as constants. Regardless, the proposal of using the method in src/xercesc/util/XMLUni.cpp does not support non BMP characters.

More to the point of your question though; regarding the GCC C++ flag -fshort-wchar
http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Code-Gen-Options.html#Code%20Gen%20Options
tells us this flag "overrides the underlying type for wchar_t to be short unsigned int instead of the default for the target. This option is useful for building programs to run under WINE."

What is salient to us is that IIRC (by default) XMLCh is defined to be a short unsigned int also.

Therefore XMLCh == short unsigned int == wchar_t  (when the -fshort-wchar flag is used in GCC).
If this is the case then, as I understand it, using the static cast (const XMLCh*)(L"the string that I wish to declare") should be perfectly fine.

Ben

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Webber <da...@musical.demon.co.uk>.

From: "David Bertoni" <db...@apache.org>

> On 1/20/2010 5:45 AM, Ben Griffin wrote:
>> We use many literal XMLCh* string declarations in our codebase.
>>
>> I am still not sure what is the safest, but most efficient way of 
>> declaring these WITHOUT RELYING UPON A TRANSCODE.
> Take a look at src/xercesc/util/XMLUni.cpp:
>
> const XMLCh XMLUni::fgAnyString[] =
> {
>     chLatin_A, chLatin_N, chLatin_Y, chNull
> };
>
> You could make this more readable by adding a literal string as a comment:
>
> // "ANY"
> const XMLCh XMLUni::fgAnyString[] =
> {
>     chLatin_A, chLatin_N, chLatin_Y, chNull
> };

I must admit I looked at code like that, and wondered what it was for. 
The chLatin...  definitions are in XMLUniDefs.hpp and cover  0-9, A-Z and 
a-z and very few others.   An example is:

const XMLCh chLatin_A               = 0x41;

This is not going to come as a surprise on any machine with ASCII, UTF-16 or 
any Windows (and probably other) code page.    It would be (for example) a 
bit radical on an IBM360  which (in my day <g>) used the EBCDIC character 
encoding, which is completely different from ASCII (and designed IIRC to 
reduce the frequency of getting too many holes punched too close together on 
the old punched card system - happy days).    So it leaves me wondering what 
sort of extreme portability is aimed at here.

const XMLCh XMLUni::fgAnyString[] = { 'A', 'N', 'Y', '\0' };

(casting each character to a word) would surely do the same thing on any 
system of practical interest?
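
A small sketch of that comparison. The chLatin_A value is the one quoted above; the other constants, the typedef and the helper name sameContents are stand-ins that follow the same pattern and are assumptions here, not copied from the Xerces-C headers.

```cpp
#include <cstddef>

// Stand-in for the Xerces-C typedef (assumption).
typedef unsigned short XMLCh;

// Fixed Unicode code points, independent of the compiler's character set.
const XMLCh chNull    = 0x00;
const XMLCh chLatin_A = 0x41;
const XMLCh chLatin_N = 0x4E;
const XMLCh chLatin_Y = 0x59;

const XMLCh anyByCodePoint[]   = { chLatin_A, chLatin_N, chLatin_Y, chNull };

// Character literals take their values from the execution character set.
const XMLCh anyByCharLiteral[] = { 'A', 'N', 'Y', '\0' };

// True when the execution character set agrees with ASCII for these
// letters; it would be false on an EBCDIC system.
inline bool sameContents() {
    for (std::size_t i = 0; i < 4; ++i) {
        if (anyByCodePoint[i] != anyByCharLiteral[i]) {
            return false;
        }
    }
    return true;
}
```

On any ASCII-based platform the two arrays are identical, which is the point being made; the numeric constants only matter where the execution character set is something else.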

For those with the L operator, then

const XMLCh XMLUni::fgAnyString[] = { L'A', L'N', L'Y', L'\0' };
const XMLCh XMLUni::fgAnyString[] = L"ANY";

would surely be the same again.   But I've never used the gcc compiler and 
maybe I'm missing something?

>The C++ standard doesn't define a portable way to indicate UTF-16 string 
>constants, so it's not surprising it's a problem. This should change in the 
>next version of the standard, but it will be a long time before compilers 
>that support it are widely available.

Are there compilers that use something other than L"ANY"?    Is the standard 
likely to be something else?

Dave
David Webber

Re: Shorthand safe way of declaring literal const XMLCh* for GCC compiler?

Posted by David Bertoni <db...@apache.org>.

On 1/20/2010 5:45 AM, Ben Griffin wrote:
> We use many literal XMLCh* string declarations in our codebase.
>
> I am still not sure what is the safest, but most efficient way of declaring these WITHOUT RELYING UPON A TRANSCODE.
Take a look at src/xercesc/util/XMLUni.cpp:

const XMLCh XMLUni::fgAnyString[] =
{
     chLatin_A, chLatin_N, chLatin_Y, chNull
};

You could make this more readable by adding a literal string as a comment:

// "ANY"
const XMLCh XMLUni::fgAnyString[] =
{
     chLatin_A, chLatin_N, chLatin_Y, chNull
};

>
> With the compilation setting of -fshort-wchar (we are only interested in gcc) are there any problems or caveats with using:
>
> std::basic_string<XMLCh>  my_string =  (const XMLCh*)(L"the string that I wish to declare");
I'm not sure if short-wchar guarantees UTF-16 code points, although it 
seems to be implied, since the manual mentions it as useful for creating 
programs that run under WINE. If you do this, I would also recommend you 
modify src/xercesc/util/Xerces_autoconf_config.hpp to reflect that 
wchar_t should be the type for XMLCh:

#define XERCES_XMLCH_T wchar_t

You should make this change after you configure Xerces-C, but before you 
build it. If you do this, you won't need to cast between wchar_t and XMLCh.

If you use this option, then any other object code you link with must be 
built with it, and anyone who uses your code will need it as well.

>
> Are there neater ways of  doing the same?
>
> I know of the alternative of using the chXX characters, etc:
>
> std::basic_string<XMLCh>  my_string = {'t','h','e',' ','s','t','r','i','n', ...  ,chNull};
>
> I don't really find this acceptable - the code is more or less unreadable.  It seems crazy to have to use a transcoder because there isn't a tidy way to define a string literal.
The C++ standard doesn't define a portable way to indicate UTF-16 string 
constants, so it's not surprising it's a problem. This should change in 
the next version of the standard, but it will be a long time before 
compilers that support it are widely available.
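
For reference, the feature alluded to here arrived in C++11 as the char16_t type and the u"..." literal prefix. A minimal sketch, assuming a build in which XMLCh is char16_t: the typedef below is a stand-in rather than the Xerces-C header, and whether a given Xerces-C build defines XMLCh this way has to be checked against its configuration. The names kAnyString, kGClef and makeMyString are hypothetical.

```cpp
#include <string>

// Stand-in typedef; ASSUMES a configuration where XMLCh is char16_t.
typedef char16_t XMLCh;

// u"..." produces a UTF-16 encoded array of char16_t, whatever the width
// of wchar_t is, so no cast and no -fshort-wchar are involved.
const XMLCh kAnyString[] = u"ANY";

// A character outside the BMP is stored as a surrogate pair.
const XMLCh kGClef[] = u"\U0001D11E";

// std::basic_string<char16_t> is the standard std::u16string.
inline std::basic_string<XMLCh> makeMyString() {
    return u"the string that I wish to declare";
}
```

With XMLCh defined as unsigned short instead, these literals would again need a cast, for the same distinct-type reason that applies to wchar_t.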

Dave