Posted to c-users@xerces.apache.org by Stefan de Konink <st...@konink.de> on 2020/05/12 15:03:56 UTC
Large XSD-schema, speed and identity constraint validation
Hi,
I am part of the standardisation group that works on a public transport
standard for network and timetable exchange. It is available as XSD on
github <https://github.com/NeTEx-CEN/NeTEx> under a GPL license.
One of the main problems that we face is the syntax validation of 100MB+
XML documents against this schema, and especially identity constraint
validation. In practice I am looking for something faster than
libxml2/xmllint, and I notice that many - if not all - tools have a
single-threaded performance bottleneck. I am trying to find a generic way to
overcome this, and I am surprised that it is difficult to find one. In
practice, parallel syntax validation using sharding could work for us, but
identity constraint validation needs all parts of the document, hence I
would expect there to be a "better way".
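To illustrate what I mean by sharding plus a whole-document step, here is a minimal sketch, not taken from any existing validator. It assumes each shard can be syntax-validated independently and can report the key and keyref values it saw; ShardResult and unresolved_keyrefs are hypothetical names, and composite key values (such as id plus version) are assumed to be flattened into one string. The merge step is then only a hash lookup over the collected values.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical record of what one shard of the document contributes.
struct ShardResult
{
  std::vector<std::string> keys;     // values selected by an xs:key
  std::vector<std::string> keyrefs;  // values selected by the matching xs:keyref
};

// Merge phase: returns every keyref value that has no matching key in any
// shard. Simplification: duplicate keys, which xs:key also forbids, are
// not reported here.
std::vector<std::string>
unresolved_keyrefs (const std::vector<ShardResult>& shards)
{
  // First pass: collect the keys from all shards into one hash set.
  std::unordered_set<std::string> keys;
  for (const ShardResult& s : shards)
    keys.insert (s.keys.begin (), s.keys.end ());

  // Second pass: look up every keyref, keeping its value for the report.
  std::vector<std::string> missing;
  for (const ShardResult& s : shards)
    for (const std::string& r : s.keyrefs)
      if (keys.find (r) == keys.end ())
        missing.push_back (r);

  return missing;
}
```

Only the final lookup needs the whole document; the per-shard collection could in principle run in parallel, provided the selectors can be evaluated per shard.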
From the Codesynthesis XSD mailing list I arrived here. I am specifically
interested in any effort that can make identity constraint validation
faster, or "XML Screamer" like approaches.
Now to my question. When I compare Xerces Java and Xerces C++ on the same
file, I notice the following.
The Java version is capable of doing this:
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'StopArea_KeyRef' with value 'SYNTUS:StopArea:60103,20200422' not found for
identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'ScheduledStopPoint_KeyRef' with value
'SYNTUS:ScheduledStoppoint:50203005,20200422' not found for identity
constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'TransportAdministrativeZone_KeyRef' with value
'NL:AdministrativeZone:AL,any' not found for identity constraint of element
'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key
'Operator_KeyRef' with value 'SYNTUS,20200422' not found for identity
constraint of element 'PublicationDelivery'.
While the C++ version does:
/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:1499081:23 error:
identity constraint key for element 'PublicationDelivery' not found
(duplicated: 1196 times)
So I am missing the "Key/Value" report and instead get an ocean of
duplicates from which I can't work out the cause. Could anyone help me
resolve this?
I am currently using this reference code from the XSD project.
int main()
{
  xml_schema::properties props;
  props.schema_location ("http://www.netex.org.uk/netex",
                         "file:///home/skinkie/Sources/NeTEx-NL/xsd/netex-nl.xsd");
  try
  {
    //
    // Parse, work with object model, serialize.
    //
    netex::PublicationDelivery_ (
      "/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml", 0, props);
  }
  catch (const xml_schema::exception& e)
  {
    cerr << e << endl;
    return 1;
  }
  catch (const xml_schema::properties::argument&)
  {
    cerr << "invalid property argument (empty namespace or location)" << endl;
    return 1;
  }
  catch (const xsd::cxx::xml::invalid_utf16_string&)
  {
    cerr << "invalid UTF-16 text in DOM model" << endl;
    return 1;
  }
  catch (const xsd::cxx::xml::invalid_utf8_string&)
  {
    cerr << "invalid UTF-8 text in object model" << endl;
    return 1;
  }
  return 0;
}
--
Stefan