Posted to j-users@xerces.apache.org by "Zimmel, Daniel" <D....@ESVmedien.de> on 2021/03/25 11:43:18 UTC

Java Heap Space problems with XSD 1.1 validation, asserts and large files

Hi,

I ran into some serious performance issues with Xerces Java 2.12.1.
They seem to be related to assertions and can be reproduced easily.

My XML file is deeply nested and has 440.000 lines when indented.

When I start the validation with a random freely available XSD 1.0 (https://jats.nlm.nih.gov/extensions/bits/2.0/xsd.html), the validation results are returned quite fast (1s). This might be because my file is not valid against the XSD in any case.
Anyhow, when I change the XSD version to 1.1 and insert a sample assertion (xsd:assert test="false()") in the content model for my root element, my CPU and memory are filling up quite fast, even giving me a Heap Space Error.

This is strange because the assertion is only defined on the root element (and it does not matter where I place the assertion: it always gives Xerces a hard time).

The behaviour is always the same, no matter whether I run it on the command line (java .... jaxp.SourceValidator -xsd11 -f -fx) or via another validation implementation that integrates Xerces (I am using the XML editor Oxygen and the XML database BaseX).
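
For readers who want to reproduce this from code rather than from the command line, here is a minimal sketch using the standard JAXP javax.xml.validation API. The class name, the toy schema and the sample documents are made up for illustration. The XSD 1.1 schema language URI is an assumption that only holds when the Xerces-J XSD 1.1 jars are on the classpath; without them, SchemaFactory.newInstance rejects it, so the sketch defaults to XSD 1.0.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class JaxpValidation {
    // Assumption: this schema language URI is only recognized when the
    // Xerces-J XSD 1.1 jars are on the classpath; a plain JDK rejects it.
    static final String XSD_1_1 = "http://www.w3.org/XML/XMLSchema/v1.1";

    // Hypothetical toy schema (XSD 1.0 only, so it also runs on a plain JDK).
    static final String SAMPLE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='book'><xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Validates xml against xsd and returns every reported error message. */
    static List<String> validate(String schemaLanguage, String xsd, String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(schemaLanguage);
            Validator validator = factory
                .newSchema(new StreamSource(new StringReader(xsd)))
                .newValidator();
            // Collect validation errors instead of stopping at the first one.
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException("validation could not be completed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Pass the XSD_1_1 URI as the first argument to request XSD 1.1.
        String language = args.length > 0 ? args[0] : XMLConstants.W3C_XML_SCHEMA_NS_URI;
        String[] docs = {"<book><title>T</title></book>", "<book><chapter/></book>"};
        for (String xml : docs) {
            System.out.println(xml + " -> " + validate(language, SAMPLE_XSD, xml));
        }
    }
}
```

With the XSD 1.1 jars available, the same method could be fed a schema containing an xsd:assert to observe the memory behaviour described above.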

Should I file a JIRA bug issue?

Can you confirm this behaviour?

Should I send you a file zip?

Thanks, Daniel






AW: Java Heap Space problems with XSD 1.1 validation, asserts and large files

Posted by "Zimmel, Daniel" <D....@ESVmedien.de>.
Thanks for the insight, I appreciate it a lot (I have no idea how the Xerces implementation works!)

I also hope that there will be better tool support for 1.1 in the future (there is not much movement in that direction from commercial vendors, but as you say the features *are* significant improvements).
I certainly won’t be going back to 1.0 – but perhaps I will need to be more pragmatic in some edge use cases.

Best,
Daniel

From: Mukul Gandhi <mu...@apache.org>
Sent: Saturday, 3 April 2021 11:26
To: Zimmel, Daniel <D....@ESVmedien.de>
Cc: j-users@xerces.apache.org
Subject: Re: Java Heap Space problems with XSD 1.1 validation, asserts and large files


Re: Java Heap Space problems with XSD 1.1 validation, asserts and large files

Posted by Mukul Gandhi <mu...@apache.org>.
On Tue, Mar 30, 2021 at 1:13 PM Zimmel, Daniel <D....@esvmedien.de>
wrote:

> Duplicating the tree does indeed explain a lot
>

I don't think that the Xerces XSD 1.1 <assert> implementation duplicates
XML fragment trees. The main validation logic of the Xerces XSD 1.0 and
1.1 implementations runs in a streaming fashion (it's termed XNI [Xerces
Native Interface] within Xerces, and is similar to XML SAX events). Only
when an <assert> is encountered during XSD 1.1 validation does Xerces
build a DOM tree and hand it over to the XPath 2.0 engine. The Eclipse
XPath 2.0 engine (on which the Xerces XSD 1.1 <assert> implementation is
based, and which is a third-party dependency of the Xerces XSD 1.1
implementation) requires the XDM (XPath data model) tree to be constructed
as a DOM tree.
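
The difference between event-based processing and tree building can be sketched with the JDK's own SAX and DOM parsers. This is only an illustration of the two processing styles, not of the XNI pipeline or Xerces internals; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingVsTree {
    /** Builds a small nested test document: depth levels of <n> elements. */
    static String nested(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("<n>");
        for (int i = 0; i < depth; i++) sb.append("</n>");
        return sb.toString();
    }

    /** Streaming: counts elements from SAX events; no node is retained. */
    static int countStreaming(String xml) {
        int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        count[0]++;
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    /** Tree: the whole document is materialized as DOM nodes before counting. */
    static int countTree(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("*").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = nested(50);
        System.out.println("streaming count: " + countStreaming(xml));
        System.out.println("tree count:      " + countTree(xml));
    }
}
```

Both methods see the same elements, but only the second one has to hold all of them in memory at once, which is the cost that appears when an assert forces a DOM tree to be built.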


> In general I always feel that XSD 1.1 adoption (and using assertions) is
> not that widespread when I talk to other XML users/devs so I can understand
> the incentive for improving this is quite non-existent.
>

IMHO, I differ with you somewhat on this point.

The latest specs for XSLT (3.0), XQuery (3.1) and XPath (3.1) state that
implementors of these languages can use either XSD 1.0 or 1.1 (this aspect
is implementation-defined) as the language for their type system. I think
this emphasizes the importance of XSD 1.1 among the main XML-based
standards.

The XSD 1.1 language has lots of new features (other than <assert>)
compared to XSD 1.0, which are significant improvements over XSD 1.0, and
it would certainly be prudent for XSD 1.0 users to consider adopting
XSD 1.1.



-- 
Regards,
Mukul Gandhi

AW: Java Heap Space problems with XSD 1.1 validation, asserts and large files

Posted by "Zimmel, Daniel" <D....@ESVmedien.de>.
Thanks Mukul for the implementation insights.
Duplicating the tree does indeed explain a lot. This is something that Saxon EE apparently handles differently in its implementation (which is quite fast) when I compare the two directly.

In general I always feel that XSD 1.1 adoption (and using assertions) is not that widespread when I talk to other XML users/devs so I can understand the incentive for improving this is quite non-existent.

I will see if I can find a way around this limitation.

Thanks, Daniel


From: Mukul Gandhi <mu...@apache.org>
Sent: Saturday, 27 March 2021 06:32
To: j-users@xerces.apache.org
Subject: Re: Java Heap Space problems with XSD 1.1 validation, asserts and large files


Re: Java Heap Space problems with XSD 1.1 validation, asserts and large files

Posted by Mukul Gandhi <mu...@apache.org>.
On Thu, Mar 25, 2021 at 5:13 PM Zimmel, Daniel <D....@esvmedien.de>
wrote:

> My XML file is deeply nested and has 440.000 lines when indented.
>

I hope you mean that your XML file has 440000 lines.


> Anyhow, when I change the XSD version to 1.1 and insert a sample assertion
> (xsd:assert test="false()") in the content model for my root element, my
> CPU and memory are filling up quite fast, even giving me a Heap Space Error.
>

That's expected behaviour with Xerces. For each <xsd:assert>, the Xerces
XSD 1.1 implementation constructs an in-memory XML DOM/XDM tree, rooted at
the XML instance element that is validated by the xsd:complexType
containing the <xsd:assert>. In other words, the <xsd:assert>
implementation is memory hungry for large XML instance documents when the
asserts apply to elements at or near the root of the XML instance tree,
and particularly when the instance tree evaluated by the <xsd:assert> is
deeply nested.
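
To get a feel for why the position of the assert matters, the following sketch counts the nodes in the subtree rooted at a given element, using the JDK's DOM parser. It only illustrates how subtree size grows towards the root; it does not reproduce what Xerces does internally, and the class name and sample document are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AssertSubtreeSize {
    /** Counts the given node plus all of its descendants (attributes excluded). */
    static int subtreeSize(Node root) {
        int size = 1;
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            size += subtreeSize(child);
        }
        return size;
    }

    /** Parses a string into a DOM document; checked exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<book><chapter><p>a</p><p>b</p></chapter><chapter><p>c</p></chapter></book>");
        Node book = doc.getDocumentElement();
        Node firstP = doc.getElementsByTagName("p").item(0);
        // A tree rooted at <book> covers the whole document; one rooted at <p> is tiny.
        System.out.println("rooted at <book>: " + subtreeSize(book) + " nodes");
        System.out.println("rooted at <p>:    " + subtreeSize(firstP) + " nodes");
    }
}
```

An assert on the type of the root element therefore implies a tree as large as the document itself, while an assert on a leaf-level type only needs a small fragment at a time.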

Some measures that I can suggest for the issues you describe:
1) If possible, use IDCs (identity constraints) or CTA (conditional type
assignment) instead of <xsd:assert>. Or use any other non-<xsd:assert>
XSD constructs for validation.
2) Do part of the XML instance validation within your client code that
invokes the Xerces XSD 1.1 validation.
3) Try using the JVM options -Xms and -Xmx to tune the heap size as far as
possible. If possible (if it's a production, profit-making project), add
more RAM to the machine where the XSD 1.1 validation takes place.
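
Measure 2 could look roughly like the following sketch, which checks one rule in client code with the JDK's StAX API while streaming, so no tree is built. The rule itself ("count(item) = @count" on every <list> element) and all element, attribute and class names are hypothetical; a real rule would replace them.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamingRuleCheck {
    /** State kept for each currently open element. */
    private static final class Frame {
        final String name;
        final String declaredCount; // value of @count, or null if absent
        int items;                  // direct <item> children seen so far

        Frame(String name, String declaredCount) {
            this.name = name;
            this.declaredCount = declaredCount;
        }
    }

    /** Returns one message per <list> whose @count differs from its number of <item> children. */
    static List<String> check(String xml) {
        List<String> violations = new ArrayList<>();
        Deque<Frame> open = new ArrayDeque<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Credit an <item> to its parent before pushing its own frame.
                    if (reader.getLocalName().equals("item") && !open.isEmpty()) {
                        open.peek().items++;
                    }
                    open.push(new Frame(reader.getLocalName(),
                                        reader.getAttributeValue(null, "count")));
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    Frame f = open.pop();
                    // Plain string comparison: "03" would not match 3 in this sketch.
                    if (f.name.equals("list") && f.declaredCount != null
                            && !f.declaredCount.trim().equals(Integer.toString(f.items))) {
                        violations.add("list declares count=" + f.declaredCount
                                + " but has " + f.items + " item(s)");
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(check("<doc><list count='2'><item/><item/></list></doc>"));
        System.out.println(check("<doc><list count='3'><item/></list></doc>"));
    }
}
```

Memory use here is bounded by the nesting depth (one frame per open element) rather than by document size. The trade-off is that each rule has to be hand-written instead of being declared in the schema.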

> Should I file a JIRA bug issue?
>

It's up to you. From my point of view, this issue is unlikely to result in
improvements to the Xerces XSD 1.1 implementation code.



-- 
Regards,
Mukul Gandhi