Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2021/09/01 13:47:05 UTC

What theory underpins DFDL?

Hi Folks,
Lately I have been learning to create parsers using the tools Flex & Bison (Flex generates lexers; Bison generates parsers). I want to see how Flex & Bison parsers compare to DFDL parsers.
I learned that Flex & Bison parsers are built on solid theory:
The earliest parsers, back in the 1950s, used utterly ad hoc techniques to analyze the syntax of the source code of the programs they were parsing. During the 1960s, the field got a lot of academic attention, and by the early 1970s parsing was no longer an arcane art. In the 1970s Aho, Ullman, Knuth, and many others put parsing techniques solidly on their theoretical feet.
The book that I am reading said that one of the first techniques they (Aho, Ullman, Knuth, and others) espoused was to separate lexing (aka scanning, tokenizing) from parsing. Lexing built upon regular expressions, which in turn rest on the theory of Finite Automata (FA) and Nondeterministic Finite Automata (NFA); the famous Kleene Theorem ties them together, showing that regular expressions and finite automata describe exactly the same class of languages. Parsing built on top of a rich theory of grammars - Context-Free Grammars, Context-Sensitive Grammars, etc. - that Chomsky formulated. Here's a graphic I created depicting the foundation upon which Flex & Bison parsers are built:
[Graphic omitted: the theoretical foundations upon which Flex & Bison parsers are built]
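To make the lexing/parsing split described above concrete, here is a minimal sketch in Python (not Flex & Bison themselves, which generate C): a regular-expression-based lexer feeds tokens to a hand-written recursive-descent parser for a toy grammar of sums and products. The token names and grammar are invented for illustration.

```python
import re

# Token patterns, tried in order by a single master regex.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Lexing phase: turn raw characters into (kind, value) tokens."""
    for m in MASTER_RE.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
    yield ("EOF", "")

def parse(tokens):
    """Parsing phase: recursive descent over the token stream for the
    grammar  expr -> term ('+' term)* ;  term -> NUMBER ('*' NUMBER)*"""
    toks = list(tokens)
    pos = 0

    def peek():
        return toks[pos][0]

    def eat(kind):
        nonlocal pos
        assert toks[pos][0] == kind, f"expected {kind}, got {toks[pos]}"
        value = toks[pos][1]
        pos += 1
        return value

    def term():
        result = int(eat("NUMBER"))
        while peek() == "TIMES":
            eat("TIMES")
            result *= int(eat("NUMBER"))
        return result

    def expr():
        result = term()
        while peek() == "PLUS":
            eat("PLUS")
            result += term()
        return result

    value = expr()
    eat("EOF")
    return value

print(parse(lex("2 + 3 * 4")))  # 14
```

The point of the separation is visible here: the lexer knows only regular languages (what Flex handles), while the parser handles the nesting and precedence that require a context-free grammar (what Bison handles).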
If we leave aside XML Schema, which hosts DFDL, what theory underpins the set of DFDL properties - separator, initiator, terminator, separatorPosition, ignoreCase, lengthKind, etc.?
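For readers unfamiliar with these properties, the following toy sketch shows the kind of parsing behavior they name. The property names mirror DFDL's vocabulary, but the logic is a deliberate simplification for illustration only - it is not how Daffodil or any DFDL processor actually implements them.

```python
# Illustrative only: a toy record parser driven by DFDL-like properties.
def parse_record(data, props):
    # initiator: literal text that must open the record.
    if props.get("initiator"):
        init = props["initiator"]
        assert data.startswith(init), "missing initiator"
        data = data[len(init):]
    # terminator: literal text that must close the record.
    if props.get("terminator"):
        term = props["terminator"]
        assert data.endswith(term), "missing terminator"
        data = data[: -len(term)]
    # lengthKind: how field boundaries are found.
    if props.get("lengthKind") == "delimited":
        fields = data.split(props["separator"])   # separator marks boundaries
    else:  # "explicit": every field has a fixed length
        n = props["length"]
        fields = [data[i : i + n] for i in range(0, len(data), n)]
    return fields

# A delimited record bracketed by '{' ... '}' with ';' separators:
print(parse_record("{a;b;c}", {
    "initiator": "{", "terminator": "}",
    "separator": ";", "lengthKind": "delimited",
}))  # ['a', 'b', 'c']
```

Even this toy version hints at why the real property set is large: each property is an operational knob over concrete bytes, not a production in a grammar.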
/Roger

Re: What theory underpins DFDL?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
No theory underlies these properties to my knowledge.

No theory underlies the variety of data formats that people invented in ad-hoc ways over several decades either.

They did not evolve from computer scientists studying parsing of languages.

The properties in DFDL were gathered from a broad set of industry tools for data description that existed in the 1990s and beyond.
DFDL just standardizes them, clarifies them (hopefully), renames some of them, but does not impose new structure or order on them.
(The exceptions to this are dfdl:inputValueCalc, dfdl:outputValueCalc, and DFDL hidden groups, which are really new things.)

dfdl:separatorSuppressionPolicy is an example of one very complex property that clearly just kept getting cases added to it to accommodate more and more different data behaviors as they were encountered. I have no doubt that a future DFDL enhancement will add at least one more enumeration value to this property.

One of the most influential systems we drew from came from a company called Mercator, based in Florida, USA, which was subsequently acquired by IBM. Their property set was copied by numerous data description tools, Microsoft BizTalk being one example. Engineers from Mercator worked on the DFDL working group for a few years.

There is no theoretical basis to any of this because the property set evolved organically, as needed, to handle more and more data formats. Those data formats were invented in ad-hoc fashion by individuals exchanging data over time, then enhanced with new information that had to be accommodated by extending the format (again in some ad-hoc way); data size/space usage was also often very important.




RE: What theory underpins DFDL?

Posted by "Interrante, John A (GE Research, US)" <Jo...@ge.com>.
I'm very interested in learning what theory underpins DFDL's separator, initiator, terminator, separatorPosition, ignoreCase, lengthKind, et al. properties too. My impression is that DFDL is a language for describing data formats using these properties as annotations on XML Schemas, and Daffodil is a DFDL processor which uses a schema compiler and a parser/unparser generator to generate a set of parser/unparser combinator objects in memory.
Parser combinator - Wikipedia<https://en.wikipedia.org/wiki/Parser_combinator> offers some insight on the evolution of parser combinators and the theory (recursive descent parsing with memoization and backtracking) which enables their functionality.
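As a quick illustration of the combinator idea mentioned above, here is a minimal sketch in Python (illustrative only; it is not Daffodil's internal API). Each parser is a function from an input string to a list of (result, remaining-input) pairs, where an empty list means failure; combinators build larger parsers from smaller ones, and returning all alternatives gives backtracking for free.

```python
# Minimal parser combinators: parser(s) -> list of (result, rest) pairs.
def char(c):
    """Parser that matches exactly the character c."""
    def p(s):
        return [(c, s[1:])] if s.startswith(c) else []
    return p

def seq(p1, p2):
    """Run p1, then p2 on whatever input p1 left over."""
    def p(s):
        return [((r1, r2), rest2)
                for r1, rest1 in p1(s)
                for r2, rest2 in p2(rest1)]
    return p

def alt(p1, p2):
    """Try both alternatives; keeping all results enables backtracking."""
    def p(s):
        return p1(s) + p2(s)
    return p

# Grammar: 'a' followed by ('b' | 'c')
ab_or_ac = seq(char("a"), alt(char("b"), char("c")))
print(ab_or_ac("ac!"))  # [(('a', 'c'), '!')]
```

This is the shape of thing Daffodil builds in memory (per my impression above), except that its combinators consume bits and bytes under the control of the DFDL properties rather than characters under the control of a grammar.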
Introduction to Parsers. Parsing is a surprisingly challenging... | by Chet Corcos | Medium<https://medium.com/@chetcorcos/introduction-to-parsers-644d1b5d7f3d> also talks a little about the theory of formal grammars (I would skip the second half, which talks about how to write parser combinators in JavaScript).
parsing - When to use a Parser Combinator? When to use a Parser Generator? - Software Engineering Stack Exchange<https://softwareengineering.stackexchange.com/questions/338665/when-to-use-a-parser-combinator-when-to-use-a-parser-generator> has a good discussion of the pros and cons between parser combinators and parser generators.
I have used parser generators and they are very useful.  If you find you need a parser generator, I recommend ANTLR<https://www.antlr.org/> as a better parser generator than Flex & Bison.  It can parse structured text or binary, generate Java or C parser code, and is widely used in academia and industry.
LL(*): the foundation of the ANTLR parser generator: ACM SIGPLAN Notices: Vol 46, No 6<https://dl.acm.org/doi/10.1145/1993316.1993548> describes the formal LL(*) theory underlying the ANTLR parser generator.
John
