Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/07/31 18:27:12 UTC

Seek your recommendation on modeling this input

Hello DFDL community,


  *   The input is binary.
  *   There is a sequence of sections.
  *   Section 1 is identified by the 2-bit code 01. There are zero or more of Section 1. The content of each Section 1 is a 3-bit unsigned int followed by a 1-bit unsigned int.
  *   Section 2 is identified by the 2-bit code 10. There are zero or more of Section 2. The content of each Section 2 is a 4-bit unsigned int followed by a 2-bit unsigned int.
  *   Section 3 is identified by the 2-bit code 11. There are zero or more of Section 3. The content of each Section 3 is a 4-bit unsigned int followed by a 5-bit unsigned int.
  *   Afterwards, there is padding to fill the last byte.

One way to model the input is as a sequence of arrays:

<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="Section_1" type="Section_1_type"
                        minOccurs="0" maxOccurs="unbounded"
                        dfdl:occursCountKind="implicit"/>
            <xs:element name="Section_2" type="Section_2_type"
                        minOccurs="0" maxOccurs="unbounded"
                        dfdl:occursCountKind="implicit"/>
            <xs:element name="Section_3" type="Section_3_type"
                        minOccurs="0" maxOccurs="unbounded"
                        dfdl:occursCountKind="implicit"/>
            <xs:sequence dfdl:hiddenGroupRef="padToByteBoundary" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

A second way to model the input is as a repeatable section that contains a choice:

<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="Section" maxOccurs="unbounded"
                        dfdl:occursCountKind="implicit">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="code" type="unsignedint2" />
                        <xs:choice dfdl:choiceDispatchKey="{xs:string(./code)}"
                                    dfdl:choiceLengthKind="implicit">
                            <xs:element name="Section_1" type="Section_1_type"
                                    dfdl:choiceBranchKey="1" />
                            <xs:element name="Section_2" type="Section_2_type"
                                    dfdl:choiceBranchKey="2" />
                            <xs:element name="Section_3" type="Section_3_type"
                                    dfdl:choiceBranchKey="3" />
                        </xs:choice>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:sequence dfdl:hiddenGroupRef="padToByteBoundary" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

Which way do you recommend and why?

/Roger

Re: Seek your recommendation on modeling this input

Posted by "Beckerle, Mike" <mb...@tresys.com>.

I have a preference for the series of arrays approach.


Given the way you have described the format, data whose sections are out of order is not well formed. The array-of-choice schema, however, accepts completely unordered data as well formed.


Given the dense binary nature of the format you describe, I believe a very strict DFDL schema is the right choice. There's very little redundancy in this data format. If anything is wrong with the data, the unordered schema would quite likely still accept it, misleading you into thinking gibberish is in fact well-formed data. Because the format consists almost entirely of binary int fields, validation would catch almost none of this either.


An important property of a DFDL schema is not only that it parses well-formed data, but also that it fails to parse non-well-formed data, with a useful diagnostic. For this specific format, section order is almost the only property that can drive any diagnostic behavior.


There is also another problem with your format. I realize it is contrived for exposition purposes, but it is not specific enough to be a realistic format. You need an overall length determination, because there is no way, from what you've described so far, to properly identify the end of one 'input' element and the start of the next.


Consider this data: 010000 010000 010000 010000 010000


That is 5 copies of section 1. Is that a single instance of the 'input' element with five section 1 occurrences in it, or is it one 'input' element with four section 1 occurrences followed by another 'input' element with a single section 1 occurrence? The format provides no way to tell the difference.


Now consider this data: 010000 010000 010000 10000000 010000


This shows some advantage of the series of arrays schema design, vs. array-of-choice design.


It is completely ambiguous here (if the unordered array-of-choice schema is used), whether those final 6 bits are another 01 section, or whether those are padding bits to be skipped.


If the series of arrays schema is used, it is unambiguous. They are padding because a 01 section cannot follow a 10 section without padding first to a byte boundary and ending the element.
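
To make the difference concrete, here is a minimal Python sketch that simulates the two models on the bit string above. It is not DFDL and does not reflect how Daffodil works internally; the function name parse_sections, the FIELD_WIDTHS table, and the greedy stopping rules are assumptions made purely for illustration. The field widths come from the format description in the original post.

```python
# Bit widths of the two content fields for each 2-bit section code,
# taken from the format description in the original post.
FIELD_WIDTHS = {1: (3, 1), 2: (4, 2), 3: (4, 5)}


def parse_sections(bits, ordered):
    """Greedily parse sections from a string of '0'/'1' characters.

    Returns (section_codes, padding_bit_count). With ordered=True, a
    section whose code is lower than the previous one ends the parse,
    imitating the series-of-arrays model. With ordered=False, any valid
    code is accepted, imitating the array-of-choice model.
    (Hypothetical helper, not part of any DFDL tool.)
    """
    pos = 0
    last_code = 0
    codes = []
    while pos + 2 <= len(bits):
        code = int(bits[pos:pos + 2], 2)
        if code not in FIELD_WIDTHS:
            break  # code 00 identifies no section
        if ordered and code < last_code:
            break  # out-of-order section: the rest is left as padding
        size = 2 + sum(FIELD_WIDTHS[code])
        if pos + size > len(bits):
            break  # not enough bits left for a whole section
        codes.append(code)
        last_code = code
        pos += size
    # Padding runs from the end of the last section to the next byte boundary.
    padding = -pos % 8
    return codes, padding


data = "010000 010000 010000 10000000 010000".replace(" ", "")
print(parse_sections(data, ordered=True))   # ([1, 1, 1, 2], 6)
print(parse_sections(data, ordered=False))  # ([1, 1, 1, 2, 1], 0)
```

Under these assumptions the ordered parse stops after the 10 section and leaves the final 6 bits as padding, while the unordered parse consumes the same 6 bits as a fifth section. Both results are accepted by the unordered model, which is exactly the ambiguity described above.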





