Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2021/08/12 17:01:19 UTC

Minimalist DFDL, part II

Hi Folks,

A couple of weeks ago Mike Beckerle pointed out that many data formats contain things like this:

A number, N
N occurrences of something

For example, 3 followed by the names of three students:

3
John Doe
Sally Smith
Judy Jones

How should that be parsed? Using the DFDL properties occursCount and occursCountKind="expression", together with hiddenGroup, you can parse the input to ensure that exactly three student names are consumed. The count itself is hidden, so it does not appear in the output. The output is this XML:

<Students>
    <name>John Doe</name>
    <name>Sally Smith</name>
    <name>Judy Jones</name>
</Students>

But is it really the job of the parser to "ensure that exactly three student names are consumed"?

I raised this question to the compiler experts on the compilers Usenet list. Here's what one person wrote:

> I would contend that in your example the /syntax/ of lists is really a number 
> followed by zero or more strings (number string*), and that verifying the string 
> count is semantics, not syntax.  I believe that, whenever possible, semantics are 
> best left until after parsing is finished.

In other words, keep your DFDL schema simple: forget occursCountKind="expression" and hiddenGroup; just parse the number and the following strings. The output should be this:

<number>3</number>
<Students>
    <name>John Doe</name>
    <name>Sally Smith</name>
    <name>Judy Jones</name>
</Students>

If you need to "ensure that there are 3 student names" you can do that check *after* parsing.
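
As a rough illustration of checking after parsing, here is a Python sketch (not DFDL). The names parse_loosely and check_count are made up for this example, and it assumes one value per line:

```python
def parse_loosely(text):
    """Parse a number followed by zero or more strings (number string*).

    Hypothetical helper: assumes one value per line and skips blank lines.
    It deliberately does not compare the number with how many names follow.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {"number": int(lines[0]), "names": lines[1:]}


def check_count(parsed):
    """Semantic check, run after parsing: does the number match the names?"""
    return parsed["number"] == len(parsed["names"])


parsed = parse_loosely("3\nJohn Doe\nSally Smith\nJudy Jones\n")
print(parsed["names"])
print(check_count(parsed))
```

The parse step accepts any number followed by any quantity of names; only check_count cares whether they agree.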

This is the Minimalist DFDL philosophy.

/Roger

Re: Minimalist DFDL, part II

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.

A further comment on this thread:

Text formats are often full of redundancy, but binary formats are usually far less redundant.

So we have to be cautious about relying on contrived text examples that contain lots of redundancy, since they will not be illustrative of binary data.

One way to address this is to make our expository examples textual (to avoid having to use hex dumps) but binary-like in behavior.
E.g., this data is all text characters, but it behaves like the many binary data formats that store length and count values.

"003008abcd efg0061234ef002xx001000"

This is a 3-digit occurs count, followed by 3 variable-length strings, each consisting of a 3-digit length prefix followed by the characters of the string.
The first string has length 8 and contents "abcd efg". The second has length 6 and contents "1234ef", and the third has length 2 and contents "xx".
Then there is a second 3-digit occurs count of 1, followed by one variable-length string of length 0.

The only way to know that 001 is an occurs count, and not the length of another string, is from the 003 occurs count that appears at the beginning. You need the count just to parse this data properly. There is no redundancy of lengths or counts here, and there are no delimiters or escape schemes in this format, which is much more typical of binary data.

I would say the logically equivalent XML is something like:

<stringArray>
  <count>3</count>
  <str>abcd efg</str>
  <str>1234ef</str>
  <str>xx</str>
</stringArray>
<stringArray>
  <count>1</count>
  <str/>
</stringArray>
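
To make the dependence on the count concrete, here is a small Python sketch (not DFDL) that walks the example string. The function name parse_string_arrays is made up, and the sketch assumes every count and length is a fixed-width 3-digit decimal field, as in the example:

```python
def parse_string_arrays(data, width=3):
    """Parse repeated groups of: a count, then `count` length-prefixed strings.

    Assumption for this sketch: every count and length is a fixed-width
    decimal field of `width` characters.
    """
    pos = 0
    arrays = []
    while pos < len(data):
        # Read the occurs count for this group.
        count = int(data[pos:pos + width])
        pos += width
        strings = []
        for _ in range(count):
            # Each string is a length prefix followed by that many characters.
            length = int(data[pos:pos + width])
            pos += width
            strings.append(data[pos:pos + length])
            pos += length
        arrays.append(strings)
    return arrays


print(parse_string_arrays("003008abcd efg0061234ef002xx001000"))
# prints [['abcd efg', '1234ef', 'xx'], ['']]
```

Without the loop bound taken from the count, nothing in the data would say whether 001 starts a new group or is the length of a fourth string.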

-mikeb

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Thursday, August 12, 2021 2:05 PM
To: users@daffodil.apache.org <us...@daffodil.apache.org>
Subject: Re: Minimalist DFDL, part II

Yep, this is related to the discussion about well-formed vs valid. It's
not uncommon, and often preferred, to model only the syntax of the data
so that you can parse data that is syntactically correct (i.e.,
well-formed) but isn't semantically correct (i.e., not valid), and then do
the validation later.

That would, for example, let you parse data with the number "4" but only 3
students listed, and then use XSLT/Schematron afterwards to detect that
the counts don't match up.
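
For illustration only, the same kind of after-the-fact check can be sketched in Python instead of XSLT/Schematron. The Class wrapper element and the count_mismatch function are assumptions made for this example (the thread does not name a root element):

```python
import xml.etree.ElementTree as ET

# Hypothetical infoset: the count says 4, but only 3 students are listed.
# The <Class> root element is an assumption added so the XML is well-formed.
infoset = """<Class>
  <number>4</number>
  <Students>
    <name>John Doe</name>
    <name>Sally Smith</name>
    <name>Judy Jones</name>
  </Students>
</Class>"""


def count_mismatch(xml_text):
    """Return (declared, actual) if the counts disagree, else None."""
    root = ET.fromstring(xml_text)
    declared = int(root.findtext("number"))
    actual = len(root.findall("Students/name"))
    return None if declared == actual else (declared, actual)


print(count_mismatch(infoset))
# prints (4, 3)
```

The data parsed fine (it is well-formed); the mismatch is reported as a separate validation step.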

That said, I think you'll still often need occursCountKind="expression".
Once you start modeling more complicated data formats, you almost always
start seeing repetitions of types, and you often can't use speculative
parsing to differentiate between the types. The only solution then is
to use expressions to figure out the number of occurrences.

For example, say we had this data:

  3
  2
  John Doe
  Sally Smith
  Judy Jones
  Richard Roe
  Bob Barker

We really don't want to think of this as two numbers followed by 5
strings. That just isn't going to be useful. We instead want to think of
this as two numbers that specify the number of students and the number
of teachers, followed by a list of the student names and a list of the
teacher names. And so we really want an infoset that looks like this:

  <People>
    <NumStudents>3</NumStudents>
    <NumTeachers>2</NumTeachers>
    <Students>
      <name>John Doe</name>
      <name>Sally Smith</name>
      <name>Judy Jones</name>
    </Students>
    <Teachers>
      <name>Richard Roe</name>
      <name>Bob Barker</name>
    </Teachers>
  </People>

Notice this data doesn't allow speculative parsing to differentiate
student names from teacher names--the names have the exact same form.
So the only way to know where one list ends and the other begins is by using
occursCountKind="expression" and an expression to reach back into the
parsed numbers to figure out the number of occurrences.
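
A rough Python analogy (not DFDL; parse_people is a made-up name, and one value per line is assumed) shows the same "reach back into the parsed numbers" idea:

```python
def parse_people(text):
    """Split a flat list of names into students and teachers using the counts.

    Hypothetical helper: assumes the first two lines are the student and
    teacher counts, followed by one name per line.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    num_students = int(lines[0])
    num_teachers = int(lines[1])
    names = lines[2:]
    # The already-parsed counts are the only thing that marks the boundary.
    students = names[:num_students]
    teachers = names[num_students:num_students + num_teachers]
    return {
        "NumStudents": num_students,
        "NumTeachers": num_teachers,
        "Students": students,
        "Teachers": teachers,
    }


people = parse_people(
    "3\n2\nJohn Doe\nSally Smith\nJudy Jones\nRichard Roe\nBob Barker\n"
)
print(people["Students"])
print(people["Teachers"])
```

Here the slice bounds play the role that the occursCount expression plays in a schema.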

- Steve
