Posted to c-dev@xerces.apache.org by Roger Leigh <rl...@codelibre.net> on 2020/06/16 18:20:55 UTC

Proposal for future development of Xerces-C 4.0.0

Dear all,


To follow up to the suggestion Boris made in the discussion on 
https://issues.apache.org/jira/browse/XERCESC-2204, I would like to 
outline a proposal for the development and release of a version 4.0.0 of 
Xerces-C.  Being a new major version, this would allow co-installation 
with the 3.x library.

Some of the suggested changes already have issues as part of the 3.3.0 
release (https://issues.apache.org/jira/projects/XERCESC/versions/12346666).


In using Xerces-C++ for the past 10 years, I have encountered quite a 
few compatibility, usability and performance problems.  Most of these 
can be worked around to some degree, but improving the situation would 
be desirable.  One of the issues I encountered was difficulty in 
building on modern platforms, Windows in particular, which was the 
impetus for developing the CMake build now incorporated officially in 
the Xerces-C 3.2.x releases.


One problem is the use of UTF-16 character strings in our APIs.  They 
are deeply unfriendly to use with typical C++ code, and they also impose 
a performance penalty when the input to Xerces API calls needs 
transcoding.  These are too deeply entrenched to consider removing at 
this point, but C++11 brings native char16_t and char32_t character 
types which could alleviate some of the problems.  Depending upon the 
platform and compiler, Xerces-C currently supports various types as 
XMLCh, including unsigned short, uint16_t, wchar_t and char16_t.  With 
C++11, it is possible to use UTF-16 string literals directly with 
u"string", and use these transparently with the Xerces-C API.  While 
this is possible with the 3.2.x releases if you build with char16_t 
support, as an application developer the availability of char16_t 
support in Xerces-C can't be guaranteed.  By making XMLCh char16_t, we 
become directly interoperable with the current C++ language types, 
library features and so on.  It means any application developer can use 
UTF-16 character literals and string literals freely.
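
As an illustration of the point above, here is a minimal sketch of what application code looks like once XMLCh is char16_t.  It does not include any Xerces-C header: the XMLCh alias stands in for the library typedef, and countUnits() and qualify() are hypothetical helpers invented for this example, not Xerces-C functions.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Assumption: stands in for the Xerces-C typedef when XMLCh is char16_t.
using XMLCh = char16_t;

static_assert(std::is_same<XMLCh, char16_t>::value,
              "this sketch assumes XMLCh is char16_t");

// Hypothetical stand-in for an API that takes a null-terminated XMLCh string.
std::size_t countUnits(const XMLCh* s)
{
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

// A u"" literal converts to const XMLCh* directly: no cast, no transcoding.
const XMLCh* const kElementName = u"element";

// std::u16string interoperates with the same character type.
std::u16string qualify(const XMLCh* prefix, const XMLCh* local)
{
    std::u16string out(prefix);
    out += u':';
    out += local;
    return out;
}
```

With these definitions, qualify(u"xs", u"schema") yields the UTF-16 string "xs:schema" without any intermediate conversion.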

An additional benefit of C++11 is the guaranteed availability of 
standard sized integer types.

The change in PR#21 
(https://github.com/apache/xerces-c/pull/21/files#diff-6fc894653e06e51bde4bbba985b1b340R126) 
makes the basic primitive types use C++11 integer types and character 
types.  It remains API compatible with previous Xerces-C releases.

With a switch of XMLCh to char16_t, this would enable direct use of 
unicode string literals.  I have developed a complete switchover here 
(https://github.com/rleigh-codelibre/xerces-c/compare/xerces-XERCESC-2208_Use_cstdint...rleigh-codelibre:XERCESC-2206_unicode_literals?expand=1#diff-5feef1625e289192e9a2d9b2c1a1308bR147) 
but am still proofreading and reviewing the result before I submit 
it.  This is a complete replacement of most use of XMLUniDefs constants 
in the source tree with C++11 literals. You can see the result is vastly 
more readable and maintainable, and you'll also see in the commit 
messages that I uncovered three separate bugs in the existing strings 
while doing the conversion which I'll fix separately on the 3.2 branch 
after I'm sure there are no other bugs there.  The code of applications 
using Xerces-C becomes similarly more readable.  This one is a bit big 
and intimidating, but almost all the changes were made with search and 
replace with sed or other tools.


On the interoperability side, Xerces has never really integrated well 
with the standard library.  Whether you want to catch exceptions or use 
strings and streams, each requires extra effort.  I'd like to 
refactor some of the classes to ease use with applications using the 
standard library.  This includes:

* having exceptions derive from std::runtime_error; existing types can 
remain compatible with wide strings

* support use of streams, including stringstreams, directly with Xerces 
e.g. InputSource, perhaps as a set of adaptors

* where the C++ language and standard replace functionality in Xerces, 
it would be worth considering replacement where there is a benefit; 
language thread support might come under this category
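
To make the first item concrete, the following is a rough sketch of an exception type that derives from std::runtime_error while still carrying its UTF-16 message.  ParseError, message16() and the ASCII-only narrowing are all assumptions made for illustration; none of them is an existing Xerces-C class or function, and a real implementation would transcode properly rather than substitute '?'.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type, not part of Xerces-C.
class ParseError : public std::runtime_error
{
public:
    explicit ParseError(const std::u16string& message)
        : std::runtime_error(toNarrow(message)), message_(message)
    {
    }

    // The original UTF-16 message remains available to existing callers.
    const std::u16string& message16() const { return message_; }

private:
    // Lossy placeholder: keeps ASCII, replaces everything else with '?'.
    // Real code would transcode to UTF-8 or the local code page.
    static std::string toNarrow(const std::u16string& s)
    {
        std::string out;
        out.reserve(s.size());
        for (char16_t c : s)
            out += (c < 0x80) ? static_cast<char>(c) : '?';
        return out;
    }

    std::u16string message_;
};

// Generic handlers can now catch the error as a std::exception.
std::string describeFailure()
{
    try {
        throw ParseError(u"unexpected end of input");
    } catch (const std::exception& e) {
        return e.what();
    }
}
```

The benefit is that code which only knows about std::exception still gets a usable what() string, while code written against the wide-string interface keeps working.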


On the maintainability side, I'd like to reduce the number of 
configuration options to keep testing and support within reason.  
Adopting C++11 removes a lot of complexity and configuration variants.  
In addition:

* We have three message loaders and three sets of translations for 
en_US, but no other translations.  Some or all of them might be worth 
thinking about dropping given the complete lack of utility these 
provide.  Does anyone actually use the translation functionality or have 
any message catalogues other than en_US?

* We have several network accessors.  But with the modern push for using 
HTTPS everywhere, should Xerces be providing its own or should we simply 
require CURL or platform-specific functionality?

* Building with a modern compiler or using a modern IDE flags up tens of 
thousands of warnings.  Some can be fixed by using features like 
"mutable" which previously were not universally available.  It would be 
worth running the codebase through clang-format and clang-tidy to clean 
it up stepwise.  It could well fix quite a few bugs and help improve 
performance and correctness.


Finally, I should note that while the above might look quite disruptive, 
I'm not suggesting any sort of API breakage at this point.  It may be 
the case that there are others who would like to propose such changes, 
or existing issues which can only be resolved with a breaking change.


Kind regards,

Roger


Re: Proposal for future development of Xerces-C 4.0.0

Posted by Boris Kolpackov <bo...@codesynthesis.com>.
Roger Leigh <rl...@codelibre.net> writes:

> I'm not entirely sure how to class code written using L"". It's not really
> portable, being Windows-only as you say (Windows being the only platform
> where wchar_t is 16-bit and usable as XMLCh). And it's not strictly portable
> even to different builds of Xerces-C, given that XMLCh is configurable and
> that wchar_t isn't even the default (char16_t is the default for C++11 and
> above).

From experience dealing with clients, there are applications that only
need to target Windows and they use L"" strings liberally.

Still, I think the benefits of switching to char16_t outweigh this
drawback.


> These both make sense. For the networking side, I think we could also argue
> the case for the MacOS accessor as well (cfurl), so long as it supports
> SSL.

The problem with the MacOS-specific accessor is that, I believe, it requires
linking a framework, which makes things messy (static linking on user side,
pkg-config, etc). But if there is strong desire to have it, I am fine
with that, especially if the default is "no network".


> >In another project I am working on (build2) we had good results with
> >picking GCC 4.9, Clang 3.7, and MSVC 14u3 and getting a very usable
> >subset of C++14 (including move capture and generic lambdas).
>
> I'm on slightly more recent versions, as are AppVeyor and Travis CI, but
> so long as we can agree on the required C++ subset we can identify the
> minimum version requirement.

That would probably require quite a bit of effort. As in, are we going
to create a document listing every C++11/14 feature we are allowed to
use (with some of them having multiple semantic revisions)?

In this sense, picking the minimum versions of the three major compilers
and saying that any commit that compiles and works with these is fair
game is a much simpler and crisper criterion. Though that would mean we
need to have these versions always available for CI (not sure how doable
that is with AppVeyor and/or Travis CI).
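
One possible way to back such a criterion with the compiler itself is a preprocessor guard in a central header.  The sketch below uses the build2 minimums mentioned earlier in the thread (GCC 4.9, Clang 3.7, MSVC 14) purely as example values; no minimums have been agreed for Xerces-C.  It checks only the MSVC toolset version, not the update level, and Apple's Clang uses its own version numbering, so that case would need separate handling.  toolchainAccepted() is a hypothetical helper added so the guard has something observable.

```cpp
// Example minimums only, taken from the build2 figures quoted in this thread.
// Clang is tested first because it also defines __GNUC__.
#if defined(__clang__)
#  if __clang_major__ < 3 || (__clang_major__ == 3 && __clang_minor__ < 7)
#    error "Clang 3.7 or later is required"
#  endif
#elif defined(__GNUC__)
#  if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 9)
#    error "GCC 4.9 or later is required"
#  endif
#elif defined(_MSC_VER)
#  if _MSC_VER < 1900
#    error "MSVC 14 (Visual Studio 2015) or later is required"
#  endif
#endif
// Any other compiler passes unchecked in this sketch.

// Hypothetical helper: if this translation unit compiled at all,
// the toolchain satisfied the guard above.
inline bool toolchainAccepted()
{
    return true;
}
```

A guard like this does not replace CI on the minimum versions; it only turns an unsupported toolchain into an immediate, readable error.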


> >Another issue that we will need to decide on is which standard we
> >are going to build for (at least by default) and whether we will
> >be making it configurable. The problem here is that there is no
> >guarantee that code built for different standards is ABI-compatible
> >(and there are cases where the C++ standard itself broke this
> >compatibility). As a result, the only sure way to avoid surprises
> >is to build everything (Xerces-C++ and the application that uses
> >it) for the same standard.
> >
> >For example, in build2 by default we use the latest available
> >standard for any given compiler/version but there is also a way to
> >override it for the entire build configuration. I am not sure if
> >CMake has anything like this.
> 
> It does, see https://cmake.org/cmake/help/latest/prop_tgt/CXX_STANDARD.html
> 
> We currently have it set like this:
> https://github.com/apache/xerces-c/blob/master/CMakeLists.txt#L37 :
> 
>    # Try C++17, then fall back to C++14, C++11 then C++98.  Used for
>    # feature tests for optional features.
>    set(CMAKE_CXX_STANDARD 17)
> 
> CMAKE_CXX_STANDARD sets the requested standard version.  If unavailable, it
> will fall back to earlier standards.  Since Xerces-C++ doesn't have a
> minimum standard at present, we allow it to fall back without restriction
> and then do specific feature tests to see what's available.  If we require
> C++14, we will need to add
> 
>    set(CMAKE_CXX_STANDARD 17)
> 
>    set(CMAKE_CXX_STANDARD_REQUIRED ON)
> 
> in order to mandate C++14. It will then fail if it is not possible to
> achieve this.

You probably meant to write `set(CMAKE_CXX_STANDARD 14)` here, right?


> What I do in my projects is something like this:
> 
>    # Prefer C++17 support
>    if(NOT CMAKE_CXX_STANDARD)
>      set(CMAKE_CXX_STANDARD 17)
>      set(CMAKE_CXX_STANDARD_REQUIRED FALSE)
>    endif()
> 
> This is to permit the user to override it, and only default if unset.

There is no special `latest` value or some such for CMAKE_CXX_STANDARD?

If not, I guess something like this will have to do:

if(NOT CMAKE_CXX_STANDARD)
  set(CMAKE_CXX_STANDARD 20)
  set(CMAKE_CXX_STANDARD_REQUIRED FALSE)
endif()

We will then have to keep updating it to make sure Xerces-C++ is compatible
with the latest standard.

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org


Re: Proposal for future development of Xerces-C 4.0.0

Posted by Roger Leigh <rl...@codelibre.net>.
On 19/06/2020 13:44, Boris Kolpackov wrote:

> Hi Roger,
>
> Thanks for getting the ball rolling. See my comments below.
>
> Roger Leigh <rl...@codelibre.net> writes:
>
>> One of the issues I encountered was difficulty in building on modern
>> platforms, Windows in particular, which was the impetus for developing
>> the CMake build now incorporated officially in the Xerces-C 3.2.x
>> releases.
> I would suggest we drop support for autotools in 4.0.0. I personally
> view both (autotools and CMake) as pretty bad and I don't see a reason
> to maintain two bad options.

I wouldn't object to this.  It halves the testing required on Unix 
platforms.  I can't say I adore CMake: it's a big improvement over the 
autotools, though still not as nice as I would like.  It serves its 
purpose, and my use of it is primarily pragmatic (I have submitted a 
fair few upstream contributions though, including FindXercesC and 
FindXalanC).

>> These are too deeply entrenched to consider removing at this point, but
>> C++11 brings native char16_t and char32_t character types which could
>> alleviate some of the problems.
> Agree. Do you know what's the story with char16_t vs wchar_t on Windows?
> Specifically, will (Windows-only) codebases that pass L""-strings to
> Xerces-C++ API need to be changed?

They are not directly compatible, but a simple typecast is sufficient to 
convert them.  We already do this in the Win32 transcoder if I recall 
correctly.

I'm not entirely sure how to class code written using L"".  It's not 
really portable, being Windows-only as you say (Windows being the only 
platform where wchar_t is 16-bit and usable as XMLCh). And it's not 
strictly portable even to different builds of Xerces-C, given that XMLCh 
is configurable and that wchar_t isn't even the default (char16_t is the 
default for C++11 and above).

As a result, I think that for any users who are using L"", they would 
have three options:

1. Replace L"" with u"".  This would work with Xerces-C 4.0 and C++11, 
but would not be backward compatible with older Xerces-C versions.

2. Add reinterpret_cast<const XMLCh *>() around L"" strings when passing 
them to Xerces-C (a static_cast will not compile, since wchar_t and 
char16_t are distinct types).  This is portable and works even with older 
Xerces versions, so would be useful as a transitional step for codebases 
which want to support 4.0 and 3.2 and earlier.

3. Add reinterpret_cast<const XMLCh *>() around u"" strings when passing 
them to Xerces-C.  As for (2) this provides the same portability to old 
and new versions (it's a no-op when XMLCh is char16_t).  This might be a 
better option when you want to use C++11 but you also need to support 
3.2 and earlier.

So long as we had these clearly documented, I think that would provide 
reasonable guidance and an effective means to transition while retaining 
compatibility with older versions.

Personally, I'd go with (3) for my own projects which are already using 
C++11 and UTF-8-encoded source files and then switch to (1) once all the 
target platforms are using Xerces-C 4.x.
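
The three options can be sketched side by side as follows.  This is an illustration only: XMLCh is assumed to be char16_t, and unitCount() is a hypothetical stand-in for any Xerces-C call taking a const XMLCh *.  The casts are written as reinterpret_cast because wchar_t and char16_t are distinct types, so a static_cast between pointers to them is rejected by the compiler.  Option (2) is compiled only on Windows, where wchar_t is 16 bits wide.

```cpp
#include <cstddef>
#include <string>

// Assumption: XMLCh as it would be defined in a char16_t build.
using XMLCh = char16_t;

// Hypothetical stand-in for a Xerces-C function taking const XMLCh *.
std::size_t unitCount(const XMLCh* s)
{
    return std::char_traits<XMLCh>::length(s);
}

// Option (1): plain u"" literal, valid only when XMLCh is char16_t.
std::size_t option1()
{
    return unitCount(u"root");
}

// Option (3): cast around a u"" literal; a no-op when XMLCh is char16_t,
// and still accepted when XMLCh is another 16-bit type.
std::size_t option3()
{
    return unitCount(reinterpret_cast<const XMLCh*>(u"root"));
}

#if defined(_WIN32)
// Option (2): cast around an L"" literal; only meaningful where
// wchar_t has the same size as XMLCh.
static_assert(sizeof(wchar_t) == sizeof(XMLCh),
              "L\"\" strings are only usable as XMLCh with 16-bit wchar_t");

std::size_t option2()
{
    return unitCount(reinterpret_cast<const XMLCh*>(L"root"));
}
#endif
```

Wrapping the cast in a small project-local macro or inline function would make the later move from (3) to (1) a mechanical search and replace.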

>> * having exceptions derive from std::runtime_error; existing types can
>> remain compatible with wide strings
>>
>> * support use of streams, including stringstreams, directly with Xerces e.g.
>> InputSource, perhaps as a set of adaptors
>>
>> * where the C++ language and standard replace functionality in Xerces, it
>> would be worth considering replacement where there is a benefit; language
>> thread support might come under this category
> Sounds great! If we require C++11 (or later), I see no reason not to
> switch to std::thread & friends.

I think here we could also look to Xalan-C for examples.  It has 
historically made use of C++98 features including some support for 
strings, streams etc., and we might be able to directly copy some of its 
implementation choices (providing they make sense).

One of the nice aspects of Xerces-C is that it compiles like greased 
lightning due to not making heavy use of standard library headers.  It 
would be nice to retain that if possible.  We could possibly constrain 
use of streams to specific modules, for example.

>> On the maintainability side, I'd like to reduce the number of configuration
>> options to keep testing and support within reason.
> Agree. One thing that I would like to keep is the ability to build
> Xerces-C++ without any third-party library dependencies.

Yes, I think that point is well made, particularly when it comes to the 
network accessors, transcoders and the like.  There should be a 
self-contained option or none at all as possibilities.

>> * We have three message loaders and three sets of translations for en_US,
>> but no other translations. Some or all of them might be worth thinking
>> about dropping given the complete lack of utility these provide.
> To me keeping only ICU and inmemory sounds like the way to go.

Agreed.

>> * We have several network accessors. But with the modern push for using
>> HTTPS everywhere, should Xerces be providing its own or should we simply
>> require CURL or platform-specific functionality?
> Yes, I believe there should be only two options: no network support and
> CURL.
>
> There are also several transcoder options. Again, I think we should
> only keep the built-in stuff and ICU.

These both make sense.  For the networking side, I think we could also 
argue the case for the MacOS accessor as well (cfurl), so long as it 
supports SSL.  The ones I think we should drop are the two direct socket 
implementations (unix socket and win32 socket).

>> Finally, I should note that while the above might look quite disruptive, I'm
>> not suggesting any sort of API breakage at this point.
> I agree. I think we should be mindful of migration efforts that will be
> required on the user's side.

Absolutely.  Many users of Xerces-C have well established codebases, and 
we should take care not to break them.  I think every change suggested 
here should be possible without making any API break.

The only possible source of breakage is the XMLCh switch, and that's not 
really a break at all given how flexible that type is today: portable 
code already handles it being of varying type.

> Another thing worth discussing is which C++ standard we should target. I
> think at a minimum C++11 but perhaps we should be bold and aim a bit
> higher?
>
> In fact, IMO, talking about targeting a C++ standard like C++11 or C++14
> is not very useful since every major C++ compiler (GCC, Clang, MSVC)
> completes support for the next standard over multiple releases. As
> a result, while compilers may not have complete support, they often
> include a perfectly usable subset of the features.
>
> In this light, what we found more useful is to specify the minimum
> versions of the three major compilers that we are willing to support
> and any features that are available in all three are fair game.

I would consider C++11 a minimum, but I would prefer C++14 as the 
baseline if possible.  In practice, most compilers supporting C++11 also 
support a useful subset of C++14.

I agree that picking the minimum version of MSVC, GCC and LLVM is a 
useful means of determining the subset of features which are permitted.

I should note that when Xerces-C 3.2 was released, and included in the 
FreeBSD ports, I built it with C++11 and XMLCh=char16_t and ensured that 
every (packaged) open source user of Xerces-C++ was capable of being 
built both with C++11 and with char16_t [amberfish, 
apache-xml-security-c, cegui, enigma, freecad, gdal, glest, kdepim, 
libepp-nicbr, libkolabxml, opensaml, passwordsafe, pktanon, qbox, qgis, 
qgis-ltr, shibboleth-sp, sumo, traingame, xalan-c, xmlcopyeditor, 
xmltooling, xsd, zorba].  Every patch has been submitted and 
incorporated upstream as far as I'm aware. This means that making the 
switch to C++11 or C++14 should be completely transparent for most 
Xerces-C users.

> In another project I am working on (build2) we had good results with
> picking GCC 4.9, Clang 3.7, and MSVC 14u3 and getting a very usable
> subset of C++14 (including move capture and generic lambdas).

I'm on slightly more recent versions, as are AppVeyor and Travis CI, 
but so long as we can agree on the required C++ subset we can identify 
the minimum version requirement.

> Another issue that we will need to decide on is which standard we
> are going to build for (at least by default) and whether we will
> be making it configurable. The problem here is that there is no
> guarantee that code built for different standards is ABI-compatible
> (and there are cases where the C++ standard itself broke this
> compatibility). As a result, the only sure way to avoid surprises
> is to build everything (Xerces-C++ and the application that uses
> it) for the same standard.
>
> For example, in build2 by default we use the latest available
> standard for any given compiler/version but there is also a way to
> override it for the entire build configuration. I am not sure if
> CMake has anything like this.

It does, see https://cmake.org/cmake/help/latest/prop_tgt/CXX_STANDARD.html

We currently have it set like this: 
https://github.com/apache/xerces-c/blob/master/CMakeLists.txt#L37 :

    # Try C++17, then fall back to C++14, C++11 then C++98.  Used for
    # feature tests for optional features.
    set(CMAKE_CXX_STANDARD 17)

CMAKE_CXX_STANDARD sets the requested standard version.  If unavailable, 
it will fall back to earlier standards.  Since Xerces-C++ doesn't have a 
minimum standard at present, we allow it to fall back without 
restriction and then do specific feature tests to see what's available.  
If we require C++14, we will need to add

    set(CMAKE_CXX_STANDARD 17)

    set(CMAKE_CXX_STANDARD_REQUIRED ON)

in order to mandate C++14.  It will then fail if it is not possible to 
achieve this.

What I do in my projects is something like this:

    # Prefer C++17 support
    if(NOT CMAKE_CXX_STANDARD)
      set(CMAKE_CXX_STANDARD 17)
      set(CMAKE_CXX_STANDARD_REQUIRED FALSE)
    endif()

This is to permit the user to override it, and only default if unset.


Kind regards,

Roger


Re: Proposal for future development of Xerces-C 4.0.0

Posted by Boris Kolpackov <bo...@codesynthesis.com>.
Hi Roger,

Thanks for getting the ball rolling. See my comments below.

Roger Leigh <rl...@codelibre.net> writes:

> One of the issues I encountered was difficulty in building on modern
> platforms, Windows in particular, which was the impetus for developing
> the CMake build now incorporated officially in the Xerces-C 3.2.x
> releases.

I would suggest we drop support for autotools in 4.0.0. I personally
view both (autotools and CMake) as pretty bad and I don't see a reason
to maintain two bad options.


> These are too deeply entrenched to consider removing at this point, but
> C++11 brings native char16_t and char32_t character types which could
> alleviate some of the problems.

Agree. Do you know what's the story with char16_t vs wchar_t on Windows?
Specifically, will (Windows-only) codebases that pass L""-strings to
Xerces-C++ API need to be changed?


> With a switch of XMLCh to char16_t, this would enable direct use of unicode
> string literals. [...] This is a complete replacement of most use of
> XMLUniDefs constants in the source tree with C++11 literals.

Sounds great!


> * having exceptions derive from std::runtime_error; existing types can
> remain compatible with wide strings
> 
> * support use of streams, including stringstreams, directly with Xerces e.g.
> InputSource, perhaps as a set of adaptors
> 
> * where the C++ language and standard replace functionality in Xerces, it
> would be worth considering replacement where there is a benefit; language
> thread support might come under this category

Sounds great! If we require C++11 (or later), I see no reason not to
switch to std::thread & friends.
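
For illustration, this is roughly what relying on the standard library's synchronisation primitives looks like in place of a hand-written mutex wrapper.  SharedCounter is a hypothetical class made up for this sketch, not code from Xerces-C, and the example exercises only std::mutex and std::lock_guard; it does not spawn threads.

```cpp
#include <mutex>

// Hypothetical class showing std::mutex and std::lock_guard in use.
class SharedCounter
{
public:
    void increment()
    {
        std::lock_guard<std::mutex> lock(mutex_);  // released on scope exit
        ++value_;
    }

    int value() const
    {
        // The mutex is declared mutable so that a const reader can lock it.
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Single-threaded driver; each call takes and releases the lock.
int countTo(int n)
{
    SharedCounter counter;
    for (int i = 0; i < n; ++i)
        counter.increment();
    return counter.value();
}
```

The locking here is scope-based, so an exception thrown while the lock is held still releases it without platform-specific code.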


> On the maintainability side, I'd like to reduce the number of configuration
> options to keep testing and support within reason.

Agree. One thing that I would like to keep is the ability to build
Xerces-C++ without any third-party library dependencies.


> * We have three message loaders and three sets of translations for en_US,
> but no other translations. Some or all of them might be worth thinking
> about dropping given the complete lack of utility these provide.

To me keeping only ICU and inmemory sounds like the way to go.



> * We have several network accessors. But with the modern push for using
> HTTPS everywhere, should Xerces be providing its own or should we simply
> require CURL or platform-specific functionality?

Yes, I believe there should be only two options: no network support and
CURL.

There are also several transcoder options. Again, I think we should
only keep the built-in stuff and ICU.


> * Building with a modern compiler or using a modern IDE flags up tens of
> thousands of warnings. Some can be fixed by using features like "mutable"
> which previously were not universally available. It would be worth running
> the codebase through clang-format and clang-tidy to clean it up stepwise.
> It could well fix quite a few bugs and help improve performance and
> correctness.

Don't think anyone will object to that.


> Finally, I should note that while the above might look quite disruptive, I'm
> not suggesting any sort of API breakage at this point.

I agree. I think we should be mindful of migration efforts that will be
required on the user's side.

Another thing worth discussing is which C++ standard we should target. I
think at a minimum C++11 but perhaps we should be bold and aim a bit
higher?

In fact, IMO, talking about targeting a C++ standard like C++11 or C++14
is not very useful since every major C++ compiler (GCC, Clang, MSVC)
completes support for the next standard over multiple releases. As
a result, while compilers may not have complete support, they often
include a perfectly usable subset of the features.

In this light, what we found more useful is to specify the minimum
versions of the three major compilers that we are willing to support
and any features that are available in all three are fair game.

In another project I am working on (build2) we had good results with
picking GCC 4.9, Clang 3.7, and MSVC 14u3 and getting a very usable
subset of C++14 (including move capture and generic lambdas).

Another issue that we will need to decide on is which standard we
are going to build for (at least by default) and whether we will
be making it configurable. The problem here is that there is no
guarantee that code built for different standards is ABI-compatible
(and there are cases where the C++ standard itself broke this
compatibility). As a result, the only sure way to avoid surprises
is to build everything (Xerces-C++ and the application that uses
it) for the same standard.

For example, in build2 by default we use the latest available
standard for any given compiler/version but there is also a way to
override it for the entire build configuration. I am not sure if
CMake has anything like this.
