Posted to c-users@xerces.apache.org by Stefan de Konink <st...@konink.de> on 2020/05/12 15:03:56 UTC

Large XSD-schema, speed and identity constraint validation

Hi,

I am part of the standardisation group that works on a public transport 
standard for network and timetable exchange. It is available as XSD on 
GitHub <https://github.com/NeTEx-CEN/NeTEx> under a GPL license.

One of the main problems that we face is the syntax validation of 100MB+ 
XML documents with this schema, and especially identity constraint 
validation. In practice I am looking for something faster than 
libxml2/xmllint, and I notice that many, if not all, tools are bottlenecked 
on single-threaded performance. I am trying to find a generic way to 
overcome this, and I am surprised that it is difficult to find one. 
Parallel syntax validation using sharding could work for us, but identity 
constraint validation needs all parts of the document, hence I would expect 
a "better way".
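
To illustrate why sharding alone is not enough for identity constraints, here 
is a minimal sketch, assuming each shard has already been parsed into the 
key values it defines and the keyref values it uses. The ShardResult type 
and the check_keyrefs function are hypothetical names made up for this 
example. They are not part of Xerces, libxml2 or XSD, and real identity 
constraints are also scoped by selector and field XPaths, which is ignored 
here.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-shard summary: (key name, value) pairs defined in the
// shard, and (referenced key name, value) pairs used by keyrefs in it.
struct ShardResult {
    std::vector<std::pair<std::string, std::string>> keys;
    std::vector<std::pair<std::string, std::string>> keyrefs;
};

// Resolve every keyref against the keys of ALL shards.
// Returns one message per keyref that has no matching key.
std::vector<std::string> check_keyrefs(const std::vector<ShardResult>& shards)
{
    // Phase 1: merge the keys of all shards into one set. A keyref in
    // one shard may point at a key defined in another, so this has to
    // finish before any keyref can be judged.
    std::set<std::pair<std::string, std::string>> all_keys;
    for (const auto& shard : shards)
        all_keys.insert(shard.keys.begin(), shard.keys.end());

    // Phase 2: look up each keyref. all_keys is only read from here on,
    // so the per-shard loops are independent of each other and could be
    // handed to separate threads.
    std::vector<std::string> errors;
    for (const auto& shard : shards)
        for (const auto& ref : shard.keyrefs)
            if (all_keys.find(ref) == all_keys.end())
                errors.push_back("keyref to '" + ref.first +
                                 "' with value '" + ref.second +
                                 "' not found");
    return errors;
}
```

Calling check_keyrefs on a single shard reports every cross-shard reference 
as missing, which is why the lookup has to wait until the keys of all shards 
are known. Collecting the keys and keyrefs per shard is the part that can 
run in parallel.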

I arrived here from the CodeSynthesis XSD mailing list. I am specifically 
interested in any effort that can make identity constraint validation 
faster, or in "XML Screamer"-like approaches.


Now to my question. When I compare Xerces Java and Xerces C++ on the same 
file, I notice the following.

The Java version is capable of doing this:

org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'StopArea_KeyRef' with value 'SYNTUS:StopArea:60103,20200422' not found for 
identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'ScheduledStopPoint_KeyRef' with value 
'SYNTUS:ScheduledStoppoint:50203005,20200422' not found for identity 
constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'TransportAdministrativeZone_KeyRef' with value 
'NL:AdministrativeZone:AL,any' not found for identity constraint of element 
'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 
'Operator_KeyRef' with value 'SYNTUS,20200422' not found for identity 
constraint of element 'PublicationDelivery'.


While the C++ version does:

/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:1499081:23 error: 
identity constraint key for element 'PublicationDelivery' not found
(duplicated: 1196 times)


So I am missing the "Key/Value" report and instead get an ocean of 
duplicates from which I can't find out the cause. Could anyone help me 
work out how I can resolve this?


I am currently using this reference code from the XSD project.


#include <iostream>
// Also include the header that XSD generated from the NeTEx schema; it
// declares xml_schema and netex::PublicationDelivery_.

using std::cerr;
using std::endl;

int main()
{
  try
  {
    // Set up inside the try block, so that the properties::argument
    // handler below can actually catch a bad namespace or location.
    xml_schema::properties props;
    props.schema_location (
      "http://www.netex.org.uk/netex",
      "file:///home/skinkie/Sources/NeTEx-NL/xsd/netex-nl.xsd");

    //
    // Parse, work with object model, serialize.
    //
    netex::PublicationDelivery_ (
      "/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml", 0, props);
  }
  catch (const xml_schema::exception& e)
  {
    // Parsing and validation errors end up here.
    cerr << e << endl;
    return 1;
  }
  catch (const xml_schema::properties::argument&)
  {
    cerr << "invalid property argument (empty namespace or location)"
         << endl;
    return 1;
  }
  catch (const xsd::cxx::xml::invalid_utf16_string&)
  {
    cerr << "invalid UTF-16 text in DOM model" << endl;
    return 1;
  }
  catch (const xsd::cxx::xml::invalid_utf8_string&)
  {
    cerr << "invalid UTF-8 text in object model" << endl;
    return 1;
  }

  return 0;
}

-- 
Stefan