Posted to c-users@xerces.apache.org by Rudolfs Mazurs <ru...@gmail.com> on 2019/01/11 10:23:32 UTC

What is a correct way to set a locale for xerces?

Hi,
I have a service that uses xerces-c and has to be started under the C
locale for LANG and LC_*. I need xerces to be able to parse XML with UTF-8
characters, so I used this workaround:

setlocale(LC_CTYPE,"en_US.UTF-8");
XMLPlatformUtils::Initialize();

And while it seems to work, I noticed that the Initialize function has a
parameter “const char *const locale”, which I assume [1] overrides any
system variables. However,

XMLPlatformUtils::Initialize("en_US.UTF-8");

this code compiles, but still throws exceptions when it encounters UTF-8
characters.

Am I correct to assume that the “locale” parameter of “Initialize”
overrides the LC_CTYPE, LC_ALL and LANG variables when choosing how to
interpret characters? Or is there a bug in the function, or is my
“locale” argument written wrong?

Is my current workaround "good practice"?

[1] http://xerces.apache.org/xerces-c/apiDocs-3/classXMLPlatformUtils.html

Re: What is a correct way to set a locale for xerces?

Posted by Rudolfs Mazurs <ru...@gmail.com>.
On Fri, 11 Jan 2019 at 13:25, Roger Leigh (<
rleigh@codelibre.net>) wrote:

> My understanding is that this locale parameter only affects the
> selection of the message catalogue used for printing messages.  Since
> there is only a single en_US message catalogue, overriding it won't do
> anything useful.  So in terms of UTF-8 processing, I think this is a red
> herring.
>

That is a shame. It looked like a good option.


> Which transcoder have you configured Xerces-C to use?  I notice that GNU
> iconv does some querying of the current charset with setlocale (but
> doesn't use the simpler and more correct nl_langinfo).  If you're using
> gnuiconv, maybe try ICU instead?
>

I am using gnuiconv. I tried to use iconv, but that one silently drops
non-ASCII characters. My build system doesn't have ICU components. I guess
I'll have to have a conversation with the responsible colleagues.

> For the software I maintain, we were forced to mandate the use of UTF-8
> locale for correct operation.
>

Thanks for the advice!

Re: What is a correct way to set a locale for xerces?

Posted by Roger Leigh <rl...@codelibre.net>.
On 11/01/2019 10:23, Rudolfs Mazurs wrote:
> Hi,
> I have a service that uses xerces-c and has to be started under the C
> locale for LANG and LC_*. I need xerces to be able to parse XML with UTF-8
> characters, so I used this workaround:
> 
> setlocale(LC_CTYPE,"en_US.UTF-8");
> XMLPlatformUtils::Initialize();
> 
> And while it seems to work, I noticed that the Initialize function has a
> parameter “const char *const locale”, which I assume [1] overrides any
> system variables. However,

My understanding is that this locale parameter only affects the 
selection of the message catalogue used for printing messages.  Since 
there is only a single en_US message catalogue, overriding it won't do 
anything useful.  So in terms of UTF-8 processing, I think this is a red 
herring.

I would have hoped that Xerces-C would behave in a locale-independent 
manner and work the same in all locales except maybe with respect to the 
locale-defined stream encoding (which might be part of the problem).

Which transcoder have you configured Xerces-C to use?  I notice that GNU 
iconv does some querying of the current charset with setlocale (but 
doesn't use the simpler and more correct nl_langinfo).  If you're using 
gnuiconv, maybe try ICU instead?

For the software I maintain, we were forced to mandate the use of UTF-8 
locale for correct operation.


Regards,
Roger