Posted to c-dev@xerces.apache.org by Francesco Pretto <ce...@gmail.com> on 2020/07/16 09:52:50 UTC

New features for xerces-c 4.0.0?

Hello,

I notice there is some work towards a 4.0.0 version. Can you point me
to features that you are planning for this release? Since XML has been
losing momentum for years, I suggest that the next release be truly
innovative and helpful for developers, to justify the major version
step. This is my wishlist:
- First-class support for UTF-8 as the internal encoding: to my
knowledge xerces-c supports only UTF-16 as its internal encoding, which
was a good choice in the 90s. Today some frameworks are moving their
internal string encoding to UTF-8, meaning that using xerces-c will
always require expensive bi-directional conversions. Notable
frameworks are Swift[1] (NSString should already support UTF-8 storage
when interoperating with Objective-C and C++) and Qt 6, which may also
choose UTF-8 storage for QString. xerces-c could be reworked so that
XMLCh maps to char or char8_t (C++20), implying an internal UTF-8
encoding;
- A non-cached, forward-only parser API (also called a pull parser
API) similar to XmlReader[2]: to my knowledge xerces-c still doesn't
have one, while libxml2 wisely implemented it a long time ago[3]. A
pull parser API is intended to be more resource-efficient and easier
to use than the event-based SAX API, at the cost of possibly being a
little more verbose.
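
To make the pull model concrete, here is a minimal sketch of a
forward-only reader driven by a read() loop, in the spirit of the
XmlReader style mentioned above. Everything in it (ToyPullReader,
NodeType, collectText) is a hypothetical toy written for illustration;
it is not an API of xerces-c, libxml2 or .NET, and it only understands
<name>, </name> and plain text (no attributes, entities, comments or
namespaces).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical toy pull reader; NOT a xerces-c or libxml2 API.
enum class NodeType { StartElement, EndElement, Text };

class ToyPullReader {
public:
    explicit ToyPullReader(const std::string& input) : in_(input) {}

    // Advance to the next node. Returns false at end of input or on a
    // tag that is never closed. The caller drives the parse, so only
    // the current node is held in memory.
    bool read() {
        valid_ = false;
        if (pos_ >= in_.size()) return false;
        if (in_[pos_] == '<') {
            const std::size_t close = in_.find('>', pos_);
            if (close == std::string::npos) return false;  // malformed tag
            const bool isEnd = (pos_ + 1 < close) && in_[pos_ + 1] == '/';
            const std::size_t begin = pos_ + (isEnd ? 2 : 1);
            type_ = isEnd ? NodeType::EndElement : NodeType::StartElement;
            value_ = in_.substr(begin, close - begin);
            pos_ = close + 1;
        } else {
            const std::size_t next = in_.find('<', pos_);
            const std::size_t end = (next == std::string::npos) ? in_.size() : next;
            type_ = NodeType::Text;
            value_ = in_.substr(pos_, end - pos_);
            pos_ = end;
        }
        valid_ = true;
        return true;
    }

    // Both accessors are only meaningful after read() returned true.
    NodeType type() const { assert(valid_); return type_; }
    const std::string& value() const { assert(valid_); return value_; }

private:
    std::string in_;
    std::size_t pos_ = 0;
    NodeType type_ = NodeType::Text;
    std::string value_;
    bool valid_ = false;
};

// Example consumer: concatenate all text nodes with a plain loop,
// with no handler classes or callbacks involved.
std::string collectText(const std::string& xml) {
    ToyPullReader reader(xml);
    std::string out;
    while (reader.read()) {
        if (reader.type() == NodeType::Text) out += reader.value();
    }
    return out;
}
```

The point of the sketch is the control flow: with SAX the parser calls
into handler objects, whereas here the application keeps its own loop
and local state and simply asks for the next node.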

Cheers,
Francesco

[1] https://swift.org/blog/utf8-string/
[2] https://docs.microsoft.com/en-us/dotnet/api/system.xml.xmlreader
[3] http://www.xmlsoft.org/xmlreader.html

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org


Re: New features for xerces-c 4.0.0?

Posted by "Cantor, Scott" <ca...@osu.edu>.
On 7/16/20, 9:15 AM, "Francesco Pretto" <ce...@gmail.com> wrote:

> Are you really sure you want to pursue this direction?

Well, if there isn't the will to actually convert the internals to UTF-8 *now*, I think there's very little chance, bordering on zero, that there ever will be.

Even so, all it takes is a 5.0 to move to UTF-8 if there's suddenly the will to do that. I don't see any doors closing forever. Heck, you could define XMLCh yourself and keep using it as a synonym at any touchpoints with Xerces if you really wanted to.

If we don't commit to char16_t, then it's impossible to write non-Windows code that can assume string literals and char16_t will work with the library, which effectively makes the change pointless. (Obviously, for custom builds one can make that assumption now, since it's already possible to configure Xerces to use char16_t.)

However, since I'm not the one doing any of the work, I'll let people speak for themselves.

-- Scott



Re: New features for xerces-c 4.0.0?

Posted by Roger Leigh <rl...@codelibre.net>.
Hi Francesco,


To clarify a few points in your message: Xerces-C++ has always been 
effectively UTF-16-only.  Previously the 16-bit type used to represent 
XMLCh was one of several possible types, including char16_t, wchar_t, 
uint16_t, unsigned short or unsigned int or possibly others, depending 
upon your platform.  The change I proposed for 4.0.0 was to switch 
to C++11/14 and use char16_t, i.e. to standardise upon the standard type 
for a UTF-16 code unit.  It doesn't change any assumption about UTF-16 
usage internally: those assumptions were already in place.  The purpose 
of the change is as Scott stated: it's to improve interoperability and 
allow for the use of UTF-16 character and string literals, and to remove 
platform-specific variation in favour of a single type that's consistent 
across platforms.
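
As a small illustration of the literal support described above, the
sketch below assumes XMLCh is char16_t.  The using-declaration is only
a stand-in so the example compiles on its own (real code would take
XMLCh from the Xerces-C++ headers), and kElementName and isItemElement
are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <string>

// Stand-in for the library typedef when Xerces-C++ is configured to
// use char16_t; real code would get XMLCh from the Xerces-C++ headers.
using XMLCh = char16_t;

// With XMLCh == char16_t, a u"" literal already is an XMLCh string,
// so no run-time transcoding or hand-built character array is needed.
static const XMLCh kElementName[] = u"item";

// Hypothetical helper: compare an XMLCh name against the literal
// using the standard std::u16string type.
bool isItemElement(const XMLCh* name) {
    assert(name != nullptr);
    return std::u16string(name) == kElementName;
}
```

With wchar_t, uint16_t or unsigned short as the underlying type, the
same u"item" literal would not convert to const XMLCh* portably, which
is the platform variation the change is meant to remove.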


Personally, I would absolutely prefer for Xerces-C++ to use UTF-8 in its 
external interfaces and its internal representation.  Like most people, 
I have input in UTF-8, output in UTF-8, and all of the parameters I want 
to pass into Xerces-C++ like element and attribute names, text content 
etc. are UTF-8.  All of this needs transcoding to and from UTF-16.  This 
accounted for a hugely significant fraction (~50%) of CPU utilisation when 
I profiled it in the past, and it makes using Xerces-C++ unnecessarily 
painful.  But Xerces-C++ is a product of its time.  Back then, before 
widespread UTF-8 adoption, it was likely seen as forward-looking.  ICU 
and other libraries of the same era are all using UTF-16, or 
Unicode/UCS-2 as it was then.
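
To give a feel for the work hidden behind each transcoding step, here
is a minimal UTF-8 to UTF-16 conversion in standard C++.  It is a
sketch under the assumption of well-formed input, not the Xerces-C++
transcoder (the library ships its own), and utf8ToUtf16 is a
hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal UTF-8 -> UTF-16 conversion, assuming well-formed input.
// Illustrative only; it performs no validation beyond the asserts.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length.
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        if (i + len > in.size()) break;  // truncated sequence: stop
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            assert((b & 0xC0) == 0x80);  // continuation byte expected
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // outside the BMP: emit a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

An application that keeps its data in UTF-8 pays for a loop like this
(plus an allocation) on the way in, and for the reverse conversion on
the way out, for every name and value that crosses the API.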


However, such a change would be massively breaking.  I don't have the 
time or resources to do the work, and even if I did it would be hard to 
justify such a breaking change unless it could be introduced without 
breaking compatibility with the UTF-16 interfaces.  If such a change 
could be made compatibly, I would be in favour of it, but I doubt I 
could spare the required time and effort to do it myself.  I no longer 
get paid to work on Xerces-C++-related projects, and I have a full-time 
job to do which is of much greater priority.  What time I can contribute 
as part of Xerces-C++-using open source project work is limited and as 
such I need to make sure that the work I do is tightly-focussed and 
realistic in its objectives.  The above char16_t work is an example of 
such a change.  It takes great care to avoid a compatibility break (you 
can already opt into it with Xerces-C++ 3.2.x).


If you would like to investigate the changes which would be needed to 
change the internal representation from UTF-16 to UTF-8 and/or 
supplement the external interfaces with UTF-8 alternatives to the UTF-16 
interfaces we have at present, I'm sure we would all be very interested 
to hear your proposals.  As a long-term project goal, I think it would 
be beneficial.  For myself, the question isn't whether the change is 
desirable, it's whether it's realistic and achievable while not breaking 
all the existing projects which have invested time and money into using 
Xerces-C++.  Xerces-C++ has a long history at this point, and breaking 
changes are not something which I think any of us would countenance.  
(The current interfaces do expose some internal details; maybe 
hiding/changing some of them might be justifiable; but certainly not the 
core user-facing APIs.)


Kind regards,

Roger


On 16/07/2020 14:15, Francesco Pretto wrote:
> Migrating XMLCh to char16_t basically means setting in stone,
> forever, that xerces-c is a UTF-16-only library. That is a radically
> different direction from the one I was suggesting, so I'm not very
> happy to hear about it. I think it can be safely stated that this move
> actually closes more doors than it opens. While simplifying the code
> base and test grid is great, as I'm reading in [1], I would have
> chosen to drop support for wchar_t and int16_t but keep XMLCh for
> the future possibility of supporting UTF-8 as the internal encoding.
> Are you really sure you want to pursue this direction?
>
> [1] https://issues.apache.org/jira/browse/XERCESC-2206
>
>
>
> On Thu, 16 Jul 2020 at 14:22, Cantor, Scott <ca...@osu.edu> wrote:
>> On 7/16/20, 8:07 AM, "Francesco Pretto" <ce...@gmail.com> wrote:
>>
>>>     Thank you, and thank you for your frankness! Of the two, UTF-8 as the
>>>     internal encoding is probably the one more in line with the C++
>>>     modernization changes you mentioned, but it is probably also a big
>>>     change touching the whole code base.
>> It's massive. What Roger is doing is making XMLCh work as char16_t. That eliminates a lot of problems with literals and STL integration, but it doesn’t change the fact that virtually every other C or older C++ library still won't integrate well.
>>
>> -- Scott

Re: New features for xerces-c 4.0.0?

Posted by Francesco Pretto <ce...@gmail.com>.
Migrating XMLCh to char16_t basically means setting in stone,
forever, that xerces-c is a UTF-16-only library. That is a radically
different direction from the one I was suggesting, so I'm not very
happy to hear about it. I think it can be safely stated that this move
actually closes more doors than it opens. While simplifying the code
base and test grid is great, as I'm reading in [1], I would have
chosen to drop support for wchar_t and int16_t but keep XMLCh for
the future possibility of supporting UTF-8 as the internal encoding.
Are you really sure you want to pursue this direction?

[1] https://issues.apache.org/jira/browse/XERCESC-2206



On Thu, 16 Jul 2020 at 14:22, Cantor, Scott <ca...@osu.edu> wrote:
>
> On 7/16/20, 8:07 AM, "Francesco Pretto" <ce...@gmail.com> wrote:
>
> >    Thank you, and thank you for your frankness! Of the two, UTF-8 as the
> >    internal encoding is probably the one more in line with the C++
> >    modernization changes you mentioned, but it is probably also a big
> >    change touching the whole code base.
>
> It's massive. What Roger is doing is making XMLCh work as char16_t. That eliminates a lot of problems with literals and STL integration, but it doesn’t change the fact that virtually every other C or older C++ library still won't integrate well.
>
> -- Scott


Re: New features for xerces-c 4.0.0?

Posted by "Cantor, Scott" <ca...@osu.edu>.
On 7/16/20, 8:07 AM, "Francesco Pretto" <ce...@gmail.com> wrote:

>    Thank you, and thank you for your frankness! Of the two, UTF-8 as the
>    internal encoding is probably the one more in line with the C++
>    modernization changes you mentioned, but it is probably also a big
>    change touching the whole code base.

It's massive. What Roger is doing is making XMLCh work as char16_t. That eliminates a lot of problems with literals and STL integration, but it doesn’t change the fact that virtually every other C or older C++ library still won't integrate well.

-- Scott





Re: New features for xerces-c 4.0.0?

Posted by Francesco Pretto <ce...@gmail.com>.
On Thu, 16 Jul 2020 at 13:52, Cantor, Scott <ca...@osu.edu> wrote:
> Look in JIRA if you want to see what's scheduled for it. [...] As for your suggestions, unless you're volunteering, you probably need to recalibrate your expectations.
>

Thank you, and thank you for your frankness! Of the two, UTF-8 as the
internal encoding is probably the one more in line with the C++
modernization changes you mentioned, but it is probably also a big
change touching the whole code base.

Regards,
Francesco



Re: New features for xerces-c 4.0.0?

Posted by "Cantor, Scott" <ca...@osu.edu>.
On 7/16/20, 5:53 AM, "Francesco Pretto" <ce...@gmail.com> wrote:

>    I notice there is some work towards a 4.0.0 version. Can you point me
>    to features that you are planning for this release?

Look in JIRA if you want to see what's scheduled for it. It's primarily C++ modernization changes.

As for your suggestions, unless you're volunteering, you probably need to recalibrate your expectations. We can't even get security bugs fixed.

-- Scott


