
Posted to users@daffodil.apache.org by Patrick Grandjean <p....@gmail.com> on 2020/08/27 15:30:49 UTC

Strange behavior with separators

Hi all!

I am seeing strange behavior with Apache Daffodil and would like to check
with you whether this is normal.

Here is the XSD (named Foo.dfdl.xsd):
START ===================

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:foo="https://www.foo.com"
           targetNamespace="https://www.foo.com"
           elementFormDefault="unqualified">

  <xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="foo:GeneralFormat" />
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="ret" type="foo:Foo" />

  <xs:complexType name="Foo">
    <xs:sequence dfdl:separator="%NL;">
      <xs:element name="a" type="xs:string" minOccurs="0" dfdl:length="1" dfdl:lengthKind="explicit">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="a" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="bar" type="foo:Bar" minOccurs="0" maxOccurs="2" />
      <xs:element name="z" type="xs:string" minOccurs="0" dfdl:length="1" dfdl:lengthKind="explicit">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="z" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="Bar">
    <xs:sequence dfdl:separator="%NL;">
      <xs:element name="b" type="xs:string" minOccurs="0" dfdl:length="1" dfdl:lengthKind="explicit">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="b" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="c" type="xs:string" minOccurs="0" dfdl:length="1" dfdl:lengthKind="explicit">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="c" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
      <xs:element name="d" type="xs:string" minOccurs="0" dfdl:length="1" dfdl:lengthKind="explicit">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:discriminator testKind="pattern" testPattern="d" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

</xs:schema>

END ===================

Here is a sample file to be parsed (named foo.txt):
START ===================

a
b
c
z

END ===================

Here is the final output of the command:
daffodil -t parse --schema Foo.dfdl.xsd foo.txt

START ===================
  <?xml version="1.0" encoding="UTF-8" ?>
  <foo:ret xmlns:foo="https://www.foo.com">
    <a>a</a>
    <bar>
      <b>b</b>
      <c>c</c>
    </bar>
  </foo:ret>
diff:
  No differences
failure:
  Parse Error: Separator '%NL;' not found
  Schema context: sequence[1] Location line 19 column 6 in file:/Users/pgrandjean/Development/ti-scalaxb-ticketing/src/main/resources/dfdl/Foo.dfdl.xsd
  Data location was preceding byte 6
----------------------------------------------------------------- 21
<?xml version="1.0" encoding="UTF-8" ?>
<foo:ret xmlns:foo="https://www.foo.com">
  <a>a</a>
  <bar>
    <b>b</b>
    <c>c</c>
  </bar>
</foo:ret>
[warning] Left over data. Consumed 48 bit(s) with at least 16 bit(s) remaining.
END ===================

The line with "z" is not parsed, and Apache Daffodil complains that it cannot
find the separator %NL; at the Foo level (line 19 of the schema).

What I would expect:
- Bar.d cannot be parsed, which is OK since it is optional; only one instance
of Bar is parsed (instead of 2)
- Bar parsing ends and the parser goes back to Foo
- parse %NL;
- parse z

If I remove maxOccurs="2" from element Foo.bar, then it works.
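
For reference, a sketch of the one declaration that differs in the variant that works, assuming nothing else in Foo.dfdl.xsd is touched. Under XML Schema rules an omitted maxOccurs defaults to 1, so bar becomes a single optional element rather than an array:

```xml
<!-- Foo.bar without maxOccurs="2"; maxOccurs then defaults to 1 -->
<xs:element name="bar" type="foo:Bar" minOccurs="0" />
```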

Could you please help me understand?

Patrick.

Re: Strange behavior with separators

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.

I didn't see additional messages about this. Did this get addressed?

In general, daffodil traces don't really make clear what is going on when formats use delimiter scanning. It's a known deficiency I hope we'll get to fix soon.

I suggest you change your schema and example data to use a printing character as the separator, e.g., "$".

In that case your data would be a$b$c$z
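
A minimal sketch of that change, assuming only the dfdl:separator attribute on the two sequences is edited and the element declarations stay exactly as posted (the comments below stand in for them):

```xml
<xs:complexType name="Foo">
  <xs:sequence dfdl:separator="$">
    <!-- elements a, bar, z unchanged from the posted schema -->
  </xs:sequence>
</xs:complexType>

<xs:complexType name="Bar">
  <xs:sequence dfdl:separator="$">
    <!-- elements b, c, d unchanged from the posted schema -->
  </xs:sequence>
</xs:complexType>
```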

What makes this case very tricky is that your bar element is a fully optional array of complex type, all the content of which is also optional.

Hence, after it parses <bar><b>b</b><c>c</c></bar>, the parser is looking at "$z". It now tries to parse the bar element again, so a separator is consumed. It then tries to find child b (which fails), then c (which fails), then d (which fails). All of those are optional elements, so the parse of <bar/> succeeds.

The dfdl:separatorSuppressionPolicy is 'anyEmpty' and the definition of that in DFDL says:

Non-positional sequence where any occurrences that have zero length representation MAY be omitted from the data, along with their associated separator. It must be possible for speculative parsing to identify which elements are present.

Note the use of the word MAY. This wording is a little problematic. By using MAY, it suggests that a zero-length representation of your element bar is also allowed to appear in the data.

That's in fact what daffodil is finding. It is finding a zero-length representation of bar.

Furthermore, empty complex elements like <bar/> aren't added to the infoset if they are themselves optional array elements. They are only retained if they are required elements (at an occurs index <= minOccurs).
In your case, all bar instances are optional since minOccurs="0".

But the $ separator is consumed because the parse was successful. That consuming of the separator provides "forward progress" and is why your schema is not stuck in an infinite loop here. Forward progress is required for arrays in DFDL.

The parser is then looking at "z", but wants the separator before the next bar element. It doesn't find one, so terminates the bar array.

Then it wants to find element z, but it needs a separator first, and that fails. Since element z is optional, it backtracks out of the need for the separator and ends the infoset. So you are left with an infoset containing just one bar element, with the parser still looking at the "z" in the data.

Hence, the test creates an infoset, but has left over data.

I *think* that's what's happening, and I think that's compliant with the DFDL spec.

In general, optional arrays of complex type with all optional content lead to all sorts of conundrums like this.

...mikeb


Mike Beckerle | Principal Engineer
OWL Cyber Defense
P +1-781-330-0412
W http://www.owlcyberdefense.com