Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2022/10/07 14:09:20 UTC

Can't have a DFDL schema that accepts well-formed but invalid data and always produces correct XML

Hi Folks,

My input contains a social security number (SSN), e.g.,

123-45-6789

If I declare the SSN element like this:

<xs:element name="SSN"
            dfdl:lengthKind="explicit"
            dfdl:length="11">
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>

then the parser will accept well-formed but invalid data such as this:

xxx-45-6789

If I want to be notified that the data is not valid, then I can use the -V limited option. Then the parser will both generate XML and notify me that the input is not valid.

If I add checkConstraints:

<xs:element name="SSN"
            dfdl:lengthKind="explicit"
            dfdl:length="11">
    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
        </xs:appinfo>
    </xs:annotation>
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>

then the parser no longer accepts well-formed but invalid data. No XML is generated.

Lesson Learned: Don't use checkConstraints if you want parsing to accept well-formed but invalid input.

But, but, but, ........

Things aren't that simple.

Suppose SSN is part of a choice. The choice has two branches. The first branch specifies RealID space SSN, the second branch specifies SSN space RealID.
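
A minimal sketch of a choice like the one described above. It assumes that RealID, Space, and SSN are global elements declared elsewhere and that the remaining DFDL format properties come from a default format that is not shown. The element names are taken from the XML in this message; the actual schema may differ:

<xs:element name="PersonID">
    <xs:complexType>
        <xs:choice>
            <!-- Branch 1: RealID space SSN -->
            <xs:sequence>
                <xs:element ref="RealID"/>
                <xs:element ref="Space"/>
                <xs:element ref="SSN"/>
            </xs:sequence>
            <!-- Branch 2: SSN space RealID -->
            <xs:sequence>
                <xs:element ref="SSN"/>
                <xs:element ref="Space"/>
                <xs:element ref="RealID"/>
            </xs:sequence>
        </xs:choice>
    </xs:complexType>
</xs:element>

The parser tries the branches in order and keeps the first one that parses without error, so if nothing in branch 1 fails, branch 2 is never attempted.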

Consider this valid input:

123-45-6789 A12345678

If the DFDL does not use checkConstraints, then this incorrect XML is generated:

  <PersonID>
    <RealID>123-45-6789</RealID>
    <Space> </Space>
    <SSN>A12345678</SSN>
  </PersonID>

Notice that the <RealID> value is the SSN and the <SSN> value is the RealID.

If we want to get correct XML, then we must use checkConstraints.

Lesson Learned: Use checkConstraints if you want parsing to generate correct XML.

Overall Lesson Learned: You can't have a DFDL schema that both accepts well-formed but invalid data and always produces correct XML.

Do you agree?

/Roger

Re: Can't have a DFDL schema that accepts well-formed but invalid data and always produces correct XML

Posted by Steve Lawrence <sl...@apache.org>.
This is correct, but your overall lesson feels a bit strong, or
at least could be interpreted as saying that well-formed but invalid
data can *never* create correct XML. I might reword it slightly to
something like:

 > In some cases data validity must be tested (e.g. via checkConstraints) while parsing to discriminate data to get the correct XML.

The issue with this particular example is that there is ambiguity in the 
data, and the only way to discriminate which element to parse is to 
actually inspect/validate the data.

As a counter example, imagine the data format looked like this:

   SSN:123-45-6789 REALID:A12345678

Then our choice can look like this:

   <choice>
     <element ref="RealID" dfdl:initiator="REALID:"/>
     <element ref="SSN" dfdl:initiator="SSN:"/>
   </choice>

The RealID and SSN elements now do not need the checkConstraints 
assertion because the initiator discriminates which element to parse. 
And now it is possible to have invalid data while still being considered 
well-formed. For example, this would parse successfully but would be 
invalid:

   REALID:123-45-6789 SSN:A12345678

Another alternative could be to use assert pattern to sort of "guess" 
which choice to take. For example, we know SSN must start with a number 
and RealID must start with a letter. So we could do something like this:

   <element name="SSN" dfdl:lengthKind="explicit" dfdl:length="11">
     <annotation>
       <appinfo source="http://www.ogf.org/dfdl/">
         <dfdl:assert testKind="pattern" testPattern="[0-9]" />
       </appinfo>
     </annotation>
     <simpleType>
       <restriction base="xs:string">
         <pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
       </restriction>
     </simpleType>
   </element>

RealID would look similar but have testPattern="[A-Z]".
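
As a rough sketch, the RealID declaration might look like the following. The thread never gives RealID's length or facet pattern, so dfdl:length="9" and the [A-Z][0-9]{8} facet are assumptions guessed from the sample value A12345678; only the testPattern comes from the text above:

   <element name="RealID" dfdl:lengthKind="explicit" dfdl:length="9">
     <annotation>
       <appinfo source="http://www.ogf.org/dfdl/">
         <!-- Parse-time check: only the first character is inspected -->
         <dfdl:assert testKind="pattern" testPattern="[A-Z]" />
       </appinfo>
     </annotation>
     <simpleType>
       <restriction base="xs:string">
         <!-- Validation-time facet (assumed): a letter followed by 8 digits -->
         <pattern value="[A-Z][0-9]{8}"/>
       </restriction>
     </simpleType>
   </element>

The assert decides well-formedness and so drives branch selection, while the facet pattern is only checked when validation is enabled.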

This way the appropriate choice branch is selected based on whether the 
first character is a letter or digit. Now it's possible to have 
well-formed but invalid data, as well as get the expected XML. For 
example, the following would parse successfully with the right XML, but
would cause validation errors for both SSN and RealID:

   123-XX-6789 A123456XX

Note however, that if you had this:

   A23-45-6789

then this approach would consider this a RealID since it starts with a 
letter, even though it looks almost like an SSN. And you would end up 
with a validation error saying this isn't a valid RealID, which in 
practice maybe you would have preferred it to say it's not a valid SSN.

Also note that with this approach, if the data was

   xxx-45-6789

then this would be considered not well-formed because it doesn't start
with a digit or upper case letter.

