Posted to j-dev@xerces.apache.org by "Mukul Gandhi (Jira)" <xe...@xml.apache.org> on 2021/11/20 10:52:00 UTC

[jira] [Comment Edited] (XERCESJ-1716) Validating XML against XSD is slow for long strings if pattern restrictions are defined, even if maxLength is restricted.

    [ https://issues.apache.org/jira/browse/XERCESJ-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445809#comment-17445809 ] 

Mukul Gandhi edited comment on XERCESJ-1716 at 11/20/21, 10:51 AM:
-------------------------------------------------------------------

I got a chance to look at the original bug report in this thread.

Instead of,

<xs:simpleType name="SimpleText255NotBlankType">
        <xs:annotation>
            <xs:documentation xml:lang="en">String of maximum 255 characters, not blank</xs:documentation>
        </xs:annotation>
        <xs:restriction base="xs:string">
            <xs:minLength value="1"/>
            <xs:maxLength value="255"/>
            <xs:pattern value=".*[^\s].*"/>            
        </xs:restriction>
</xs:simpleType>

We can write (and that runs very fast on the provided XML document long_string.xml),

<xs:simpleType name="SimpleText255NotBlankType">
        <xs:annotation>
            <xs:documentation xml:lang="en">String of maximum 255 characters, not blank</xs:documentation>
        </xs:annotation>
        <xs:restriction base="xs:string">            
            <xs:pattern value="[^\s]{1,255}"/>
        </xs:restriction>
</xs:simpleType>
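
As a quick sanity check of the rewrite above, the sketch below compares the two definitions using java.util.regex as a stand-in for the XSD regex engine. That is an assumption: the two dialects differ in details (for example, XSD patterns are implicitly anchored, which is emulated here with matches(), and length is counted in UTF-16 units rather than characters). The class and method names are illustrative only. Under these assumptions, a value with internal whitespace such as "hello world" is accepted by the first definition but rejected by the second, so the two types may not be interchangeable for every input.

```java
import java.util.regex.Pattern;

// Illustrative comparison only; java.util.regex approximates, but is not, the XSD regex dialect.
final class NotBlankPatternCheck {
    // Pattern from the original type (used together with minLength=1, maxLength=255).
    static final Pattern ORIGINAL = Pattern.compile(".*[^\\s].*");
    // Pattern from the rewritten type (length limits folded into the quantifier).
    static final Pattern PROPOSED = Pattern.compile("[^\\s]{1,255}");

    // Emulates minLength/maxLength plus the original pattern facet.
    static boolean originalAccepts(String value) {
        return value.length() >= 1
                && value.length() <= 255
                && ORIGINAL.matcher(value).matches();
    }

    // Emulates the rewritten type, which has only a pattern facet.
    static boolean proposedAccepts(String value) {
        return PROPOSED.matcher(value).matches();
    }

    public static void main(String[] args) {
        for (String value : new String[] {"hello", "hello world", "   ", ""}) {
            System.out.printf("%-15s original=%b proposed=%b%n",
                    "\"" + value + "\"", originalAccepts(value), proposedAccepts(value));
        }
    }
}
```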

I think that the Xerces XSD processor, in general, should not evaluate the xs:minLength and xs:maxLength facets before the xs:pattern facet. The XSD specification doesn't prescribe any such ordering, so the order of facet evaluation within a simple type is implementation-dependent.
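
To illustrate why the evaluation order can be left to the implementation: a value is valid only if every facet holds, so the order of the checks affects the cost but not the verdict. The following is a minimal sketch with hypothetical names, using java.util.regex in place of the XSD regex engine; it is not how Xerces implements facet checking.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Xerces code: two orderings of the same facet checks.
final class FacetOrderSketch {
    static final Pattern NOT_BLANK = Pattern.compile(".*[^\\s].*");

    // Counts characters (code points) rather than UTF-16 units.
    static boolean lengthOk(String value) {
        int length = value.codePointCount(0, value.length());
        return length >= 1 && length <= 255;
    }

    // Cheap length facets first; the regex never runs for over-long input.
    static boolean lengthThenPattern(String value) {
        return lengthOk(value) && NOT_BLANK.matcher(value).matches();
    }

    // Regex first; the whole input is scanned even if it is far too long.
    static boolean patternThenLength(String value) {
        return NOT_BLANK.matcher(value).matches() && lengthOk(value);
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            builder.append('a');
        }
        String longValue = builder.toString();
        // Both orderings reach the same verdict for the over-long value.
        System.out.println(lengthThenPattern(longValue) == patternThenLength(longValue));
    }
}
```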


> Validating XML against XSD is slow for long strings if pattern restrictions are defined, even if maxLength is restricted.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: XERCESJ-1716
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1716
>             Project: Xerces2-J
>          Issue Type: Improvement
>            Reporter: Márk Petrényi
>            Assignee: Mukul Gandhi
>            Priority: Major
>         Attachments: long_string.xml, unsafe.xsd, workaround.xsd
>
>
> Validating XML against XSD is slow for long strings if pattern restrictions are defined, even if maxLength is restricted.
> We have the following simple type defined in our xsd (unsafe.xsd):
> {code:xml}
> <xsd:simpleType name="SimpleText255NotBlankType">
>  <xsd:annotation>
>  <xsd:documentation xml:lang="en">String of maximum 255 characters, not blank</xsd:documentation>
>  </xsd:annotation>
>  <xsd:restriction base="xsd:string">
>  <xsd:minLength value="1"/>
>  <xsd:maxLength value="255"/>
>  <xsd:pattern value=".*[^\s].*"/>
>  </xsd:restriction>
> </xsd:simpleType>
> {code}
> The problem is that when a really long string (ca. 1000000 characters) is provided as a value in the input XML, we would expect it to be rejected quickly because of its length. In fact, validation takes several minutes, since the regex is evaluated before the maxLength restriction.
> We found a workaround for the issue if we define the simpleType this way (workaround.xsd):
> {code:xml}
>  <xsd:simpleType name="SimpleText255Type">
>  <xsd:annotation>
>  <xsd:documentation xml:lang="en">String of maximum 255 characters</xsd:documentation>
>  </xsd:annotation>
>  <xsd:restriction base="xsd:string">
>  <xsd:minLength value="1"/>
>  <xsd:maxLength value="255"/>
>  <xsd:pattern value=".{1,255}"/>
>  </xsd:restriction>
>  </xsd:simpleType>
>  <xsd:simpleType name="SimpleText255NotBlankType">
>  <xsd:annotation>
>  <xsd:documentation xml:lang="en">String of maximum 255 characters, not blank</xsd:documentation>
>  </xsd:annotation>
>  <xsd:restriction base="SimpleText255Type">
>  <xsd:pattern value=".*[^\s].*"/>
>  </xsd:restriction>
>  </xsd:simpleType>
> {code}
> The workaround only works because the implementation of XSSimpleType builds a Vector of the regex patterns, so the {{.{1,255}}} pattern is evaluated first; it fails relatively quickly, and the time-consuming second regex is never checked.
> It would be great to have the regex pattern checked after the other XSD restrictions (minLength, maxLength, etc.) have been validated, or to have control over the validation order, thus avoiding unnecessarily slow validations and the use of a workaround based on undocumented features.
> I attached the XSDs referenced above and an XML file containing a long string value. The problem can be reproduced using the SourceValidator from the Xerces2-J samples:
> The original xsd with slow validation:
> {code:java}
> java jaxp.SourceValidator -a unsafe.xsd -i long_string.xml
> {code}
> The workaround xsd with normal run-time:
> {code:java}
> java jaxp.SourceValidator -a workaround.xsd -i long_string.xml
> {code}
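
The SourceValidator runs quoted above can also be approximated programmatically through the standard javax.xml.validation API. The following is a minimal sketch; the class name TimedValidation and its argument handling are illustrative assumptions, and whether the JDK's bundled schema validator or Apache Xerces2-J is used depends on the classpath, so timings may differ from those in the report.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Illustrative helper: validates one document against one schema and reports the elapsed time.
final class TimedValidation {
    // Returns true if the document is valid; a broken schema is reported as an exception.
    static boolean isValid(Source schemaSource, Source document) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(schemaSource);
        } catch (SAXException e) {
            throw new IllegalArgumentException("Schema could not be compiled", e);
        }
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(document);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // args[0]: schema file (e.g. unsafe.xsd), args[1]: instance file (e.g. long_string.xml)
        long start = System.nanoTime();
        boolean valid = isValid(new StreamSource(new File(args[0])),
                                new StreamSource(new File(args[1])));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println((valid ? "valid" : "invalid") + " in " + elapsedMs + " ms");
    }
}
```

With the attachments in the working directory, it could be invoked as: java TimedValidation unsafe.xsd long_string.xml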
