Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2022/10/07 14:09:20 UTC
Can't have a DFDL schema that accepts well-formed but invalid data and always produces correct XML
Hi Folks,
My input contains a social security number (SSN), e.g.,
123-45-6789
If I declare the SSN element like this:
<xs:element name="SSN"
dfdl:lengthKind="explicit"
dfdl:length="11">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
then the parser will accept well-formed but invalid data such as this:
xxx-45-6789
If I want to be notified that the data is not valid, then I can use the -V limited option. Then the parser will both generate XML and notify me that the input is not valid.
If I add checkConstraints:
<xs:element name="SSN"
dfdl:lengthKind="explicit"
dfdl:length="11">
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert>{ dfdl:checkConstraints(.) }</dfdl:assert>
</xs:appinfo>
</xs:annotation>
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
then the parser no longer accepts well-formed but invalid data. No XML is generated.
Lesson Learned: Don't use checkConstraints if you want parsing to accept well-formed but invalid input.
But, but, but, ........
Things aren't that simple.
Suppose SSN is part of a choice. The choice has two branches. The first branch specifies RealID space SSN, the second branch specifies SSN space RealID.
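A minimal sketch of such a choice, for illustration only. It assumes that RealID, Space, and SSN are global element declarations elsewhere in the schema, and that a default dfdl:format supplies the remaining DFDL properties; neither assumption comes from the actual schema.

```xml
<xs:element name="PersonID">
  <xs:complexType>
    <xs:choice>
      <!-- Branch 1: RealID, then a space, then SSN -->
      <xs:sequence>
        <xs:element ref="RealID"/>
        <xs:element ref="Space"/>
        <xs:element ref="SSN"/>
      </xs:sequence>
      <!-- Branch 2: SSN, then a space, then RealID -->
      <xs:sequence>
        <xs:element ref="SSN"/>
        <xs:element ref="Space"/>
        <xs:element ref="RealID"/>
      </xs:sequence>
    </xs:choice>
  </xs:complexType>
</xs:element>
```

The parser tries the branches in order and keeps the first one that parses without error, so unless something in the first branch fails on the data, the second branch is never attempted.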
Consider this valid input:
123-45-6789 A12345678
If the DFDL does not use checkConstraints, then this incorrect XML is generated:
<PersonID>
<RealID>123-45-6789</RealID>
<Space> </Space>
<SSN>A12345678</SSN>
</PersonID>
Notice that the <RealID> value is the SSN and the <SSN> value is the RealID.
If we want to get correct XML, then we must use checkConstraints.
Lesson Learned: Use checkConstraints if you want parsing to generate correct XML.
Overall Lesson Learned: You can't have a DFDL schema that both accepts well-formed but invalid data and always produces correct XML.
Do you agree?
/Roger
Re: Can't have a DFDL schema that accepts well-formed but invalid data and always produces correct XML
Posted by Steve Lawrence <sl...@apache.org>.
This is correct, but your second overall lesson feels a bit strong, or
at least could maybe be interpreted as well-formed but invalid data can
*never* create correct XML. I might slightly reword it to something like:
> In some cases data validity must be tested (e.g. via
> checkConstraints) while parsing to discriminate data to get the correct XML.
The issue with this particular example is that there is ambiguity in the
data, and the only way to discriminate which element to parse is to
actually inspect/validate the data.
As a counter example, imagine the data format looked like this:
SSN:123-45-6789 REALID:A12345678
Then our choice can look like this:
<choice>
<element ref="RealID" dfdl:initiator="REALID:" />
<element ref="SSN" dfdl:initiator="SSN:" />
</choice>
The RealID and SSN elements now do not need the checkConstraints
assertion because the initiator discriminates which element to parse.
And now it is possible to have invalid data while still being considered
well-formed. For example, this would parse successfully but would be
invalid:
REALID:123-45-6789 SSN:A12345678
Another alternative could be to use assert pattern to sort of "guess"
which choice to take. For example, we know SSN must start with a number
and RealID must start with a letter. So we could do something like this:
<element name="SSN" dfdl:lengthKind="explicit" dfdl:length="11">
<annotation>
<appinfo source="http://www.ogf.org/dfdl/">
<dfdl:assert testKind="pattern" testPattern="[0-9]" />
</appinfo>
</annotation>
<simpleType>
<restriction base="xs:string">
<pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
</restriction>
</simpleType>
</element>
RealID would look similar but have testPattern="[A-Z]".
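As a rough sketch, the RealID declaration might look like the following. The length of 9 and the validity pattern are assumptions inferred from the sample value A12345678, not taken from a real schema; only the testPattern is as described above.

```xml
<element name="RealID" dfdl:lengthKind="explicit" dfdl:length="9">
  <annotation>
    <appinfo source="http://www.ogf.org/dfdl/">
      <!-- Branch guess: a RealID is expected to start with an upper-case letter -->
      <dfdl:assert testKind="pattern" testPattern="[A-Z]" />
    </appinfo>
  </annotation>
  <simpleType>
    <restriction base="xs:string">
      <!-- Hypothetical validity rule: one letter followed by eight digits -->
      <pattern value="[A-Z][0-9]{8}"/>
    </restriction>
  </simpleType>
</element>
```

The assert only looks at the first character, so it decides well-formedness, while the pattern facet is left for validation to check.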
This way the appropriate choice branch is selected based on whether the
first character is a letter or digit. Now it's possible to have
well-formed but invalid data, as well as get the expected XML. For
example, the following would parse successfully with the right XML, but
would cause validation errors for both SSN and RealID:
123-XX-6789 A123456XX
Note however, that if you had this:
A23-45-6789
then this approach would consider this a RealID since it starts with a
letter, even though it looks almost like an SSN. And you would end up
with a validation error saying this isn't a valid RealID, when in
practice you might have preferred it to say it's not a valid SSN.
Also note that with this approach, if the data was
xxx-45-6789
Then this would be considered not well-formed because it doesn't start
with a digit or upper case letter.