You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by GitBox <gi...@apache.org> on 2020/01/21 21:21:36 UTC

[GitHub] [incubator-daffodil-site] mbeckerle opened a new pull request #17: WIP: design note on schema compiler space/speed issue

mbeckerle opened a new pull request #17: WIP: design note on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17
 
 
   First cut at design note on how to fix the schema compiler space/speed issue, which is JIRA
   DAFFODIL-1444
   
   This design note for review is written in Asciidoc.
   
   Here's a workflow for how to review:
   
   Fetch this PR into your local sandbox git repo for daffodil-site.
   
   Now you can navigate using a file explorer to the daffodil-site/site/dev directory, and open in your firefox or chrome the file:
   
   daffodil-site/site/dev/design-notes/term-sharing-in-schema-compiler.adoc
   
   You should see a formatted page with a number of diagrams in it. You can read this copy.
   
   - This assumes you have setup your chrome or firefox to process asciidoc with the asciidoctor plugin and turned on the diagrams feature.  If you don't see nicely formatted documentation with diagrams, then you need to install the plugin and enable the diagrams feature of the plugin.
   
   Then, as you have thoughts, comments on it to share, you can find the corresponding text lines, and add them to this PR, by putting them as comments in the normal code-like PR way, on this asciidoc (.adoc) source file. 
   
   The diagrams were drawn via ditaa, so are legible, if coarse, in the source file as well. So comments on the diagrams work also.  I have been finding the asciiflow.com (which is both online hosted, and open source) tool very easy to use for drawing these diagrams. 
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r376431267
 
 

 ##########
 File path: site/dev/design-notes/namespace-binding-minimization.adoc
 ##########
 @@ -0,0 +1,341 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Namespace Binding Minimization
+
+=== Introduction
+
+DFDL schemas are XML schemas and so DFDL inherits the namespace system of XML and XML Schema for composing large schemas from smaller ones, for reusing schema files, and for managing name conflicts. 
+
+A DFDL Infoset isn't necessarily represented as XML however. 
+Some representations won't have any ability to deal with namespaces (JSON for example), and so Daffodil will sometimes issue warnings when compiling a schema if the namespace usage will not allow unambiguous representation without namespaces. 
+
+Most representations of DFDL Infosets will, like XML, use some representation of the namespaces of elements, and in textual forms this will most commonly be by way of namespace prefixes. 
+XML is not the only representation that uses namespaces, however, so this should not be taken as an entirely XML-specific discussion.
+
+There are these goals for namespace-binding minimization. 
+
+. Clarity: Infosets that have redundant namespace bindings are very hard to read and understand, and require namespace-binding-aware tooling to compare them, or clumsy post-processing to remove the excess bindings. 
+
+. Performance: Attaching an element to the infoset at runtime should take constant time.
+
+. Consistency: The prefix-to-namespace bindings used should be drawn from those expressed on the DFDL schema by the schema author, and the prefixes used and bindings introduced when an element is attached to the infoset should be consistent with the set of namespace prefix definitions in place at the point where the element's declaration lexically appears in the DFDL schema. 
+
+These goals are in some tension. 
+Consider 4 elements named A, B, C, and Q.
+Suppose element A contains element B, which contains element Q.
+Suppose elsewhere in the same infoset element A contains element C which contains element Q. 
+From the perspective of element Q, the set of namespace bindings surrounding it are those from (A, B) or those from (A, C). 
+Suppose element Q requires, and introduces, a namespace with prefix "qns" bound to namespace "urn:Q_Namespace". 
+Suppose element C also introduces this same namespace binding.
+Then when element Q appears inside element B, its namespace binding for "qns" is needed. 
+But when element Q appears inside element C, its namespace binding for "qns" is redundant with one already provided by element C.
+
+The conclusion is that the minmal set of namespace bindings introduced by an element depends on the nesting of elements. 
+
+The basic technique is to store, on the runtime element data structure (DPathElementCompileInfo), the complete set of lexical namespace bindings present for the element declaration in the DFDL schema document. 
+==== Namespace Bindings come from the Element Declarations, not Element References
+
+Consider two schema files:
+
+```xml
+<!-- file foo.dfdl.xsd -->
+<schema 
+   xmlns:pre1="namespace1"
+   xmlns:ns1="differentNamespace">
+  <import namespace="namespace1" schemaFileLocation="bar.dfdl.xsd"/>
+  ...
+  <element name="root">
+    <complexType>
+      <sequence>
+        <element ref="pre1:foo" maxOccurs="unbounded"/>
+      </sequence>
+    </complexType>
+  </element>
+
+</schema>
+
+<!-- file bar.dfdl.xsd -->
+<schema targetNamespace="namespace1"
+   xmlns:ns1="namespace1"
+   xmlns:pre1="someOtherNamespace">
+
+  <element name="foo" ..../>
+</schema>
+```
+In the above we have a conflict over the use of the prefix "pre1".
+Now consider an XML document corresponding to this with element 'foo' inside the 'root' element:
+
+```xml
+<root xmlns:pre1="namespace1"
+      xmlns:ns1="differentNamespace">
+  ...
+  <ns1:foo 
+    xmlns:ns1="namespace1" 
+    xmlns:pre1="someOtherNamespace">
+    ...
+  </ns1:foo>
+  ...
+</root>
+```
+
+Notice that element 'foo' appears inside 'root' using the "ns1" prefix but it also introduces a binding for that prefix which supercedes that of the enclosing environment.
+The prefix "pre1" cannot be used for element 'foo' because in the namespace bindings of the bar.dfdl.xsd schema document, the "pre1" prefix is bound to "someOtherNamespace". 
+
+This example illustrates that each element must use, and introduce, the lexically defined prefixes from the point where the element is declared.
+Not from the point of element reference.
+
+Since element 'foo' is recurring, it's unfortunate, but every single instance will, textualized, carry these namespace bindings.
+E.g.,
+
+```xml
+<root xmlns:pre1="namespace1"
+      xmlns:ns1="differentNamespace">
+  ...
+  <ns1:foo 
+    xmlns:ns1="namespace1" 
+    xmlns:pre1="someOtherNamespace">
+    ...
+  </ns1:foo>
+  <ns1:foo 
+    xmlns:ns1="namespace1" 
+    xmlns:pre1="someOtherNamespace">
+    ...
+  </ns1:foo>
+  <ns1:foo 
+    xmlns:ns1="namespace1" 
+    xmlns:pre1="someOtherNamespace">
+    ...
+  </ns1:foo>
+
+  ...
+</root>
+```
+
+This problem is not one Daffodil strives to solve. 
+The schema author can simply avoid these sorts of name clashes and this problem will not occur.
+Automatic renaming of prefixes to avoid this problem is considered unwarranted, as it will confuse users. 
+
+
+
+=== Namespace Minimization
+
+==== Only Element Namespace Prefix Bindings
+
+Only namespace definitions associated with element declarations need to ever be considered for the infoset.
+Namespace definitions that define prefixes used for type, group, format, or escapeScheme references are not included
+in the namespace definitions carried on infoset elements.
+
+==== Avoid Prefix "tns" (or Other Common Ambiguous Names) When Possible
+
+Many DFDL schemas will define prefix "tns" to be that schema document's target namespace. 
+
+This same problem could occur for other prefixes. The "tns" convention is just a common one. 
+
+If the prefix "tns" is ambiguous across the schema set (also used by other schema documents, but for different namespaces), 
+then its use is undesirable. 
+
+If a schema document defines both "tns" and other prefixes for the target namespace, then another prefix is preferred for 
+use as the prefix of elements created from declarations in that schema document.
+
+This situation arises commonly for the default namespace (no prefix, defined by `xmlns="namespaceURI"`). If 
+this is ambiguous across the schema set (highly likely), then an available alternative prefix (from that schema document) 
+is preferred. 
+There is actually no difference between using "tns" and the default namespace. Both are just commonly used, and frequently ambiguous across the schema set.
+
+(This all generalizes to any prefix which is ambiguous across the schema set.)
+
+==== Corner Cases
+
+There are numerous ways schema authors can use and reuse namespace prefixes that can lead to cluttered infosets.
+
+Other than minor heuristics to choose among alternative available prefix definitions, Daffodil does not try to improve on the 
+namespace prefix problem on behalf of schema authors. 
+
+===== No Prefix At All
+A legal schema document can define a target namespace and define no prefix for it at all. 
+
+In this case, the only way elements of that schema document can be used is some other schema document must provide a prefix definition. 
+Daffodil chooses a prefix from those available in the schema set (deterministically - e.g., shortest prefix, with ties resolved by alphanumeric order, avoiding ambiguous prefixes like "tns").
+
 
 Review comment:
   The above won't work if the schema defines  targetNamespace="target1", no prefix for the target namspace, but it does provide a namespace binding for xmlns:foo="someNamespace", Then outside, another schema imports this one, but defines xmlns:foo="target1".  
   
   So we can't use namespace prefix "foo:" because it conflicts with the internal version. 
   
   This isn't likely, but could happen. So we have to check and either fail in this case, generate our own prefix, or just SDE on this whole class of issue i.e., just say no to schemas with target namespaces that have no prefix. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle merged pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle merged pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] bsloane1650 commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
bsloane1650 commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r375333404
 
 

 ##########
 File path: site/dev/design-notes/term-sharing-in-schema-compiler.adoc
 ##########
 @@ -0,0 +1,562 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length.
+// It's the starting on a new line that matters.
+//
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Term Sharing in the Schema Compiler
+
+=== Introduction
+
+The DFDL language has a composition property known as _referential transparency_. 
+It is not supposed to matter whether you lift a part of the schema out
+and create a reusable group, reusable type, or reusable element from
+it.
+
+Because DFDL (version 1.0) does not allow recursive definitions of any
+kind, it is theoretically possible to implement this by treating all
+reusable definitions in the language like macros.
+I.e., inline-expand all definitions at their point of use. 
+This will work for schemas up to some size. However, it leads to an
+unacceptable expansion in schema compilation time and space, as the
+size of the schema once all these inline substitutions have been
+performed can be exponentially larger than the original
+non-substituted schema.
+
+To avoid this undesirable combinatorial explosion, the Daffodil schema
+compiler arranges to achieve the same macro-like semantics, while
+still sharing the core parts of the reusable components so they need
+only be compiled once for each unique way the component is used.
+
+This is also part of eventually enabling an extension of the DFDL
+language to allow recursive definitions.
+
+The dfdl:alignment property is one of the largest contributors to
+complexity in sharing objects by the schema compiler.
+
+Alignment will be used as the example in this design note to
+illustrate the schema compiler behavior.
+
+=== DSOM Term Objects
+
+DSOM is the Daffodil Schema Object Model.
+Within this model, the entities from a DFDL schema that actually have
+representations in the data stream are called _terms_ and they are
+represented by the subclasses of the class Term.
+There are terms which have no representation at all such as computed elements.
+
+For this discussion we are concerned only with terms that have representation. 
+
+The classes in Daffodil that are used for concrete terms are:
 
 Review comment:
   "concrete term" is never defined.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r375056108
 
 

 ##########
 File path: site/dev/design-notes/term-sharing-in-schema-compiler.adoc
 ##########
 @@ -0,0 +1,562 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length.
+// It's the starting on a new line that matters.
+//
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Term Sharing in the Schema Compiler
+
+=== Introduction
+
+The DFDL language has a composition property known as _referential transparency_. 
+It is not supposed to matter whether you lift a part of the schema out
+and create a reusable group, reusable type, or reusable element from
+it.
+
+Because DFDL (version 1.0) does not allow recursive definitions of any
+kind, it is theoretically possible to implement this by treating all
+reusable definitions in the language like macros.
+I.e., inline-expand all definitions at their point of use. 
+This will work for schemas up to some size. However, it leads to an
+unacceptable expansion in schema compilation time and space, as the
+size of the schema once all these inline substitutions have been
+performed can be exponentially larger than the original
+non-substituted schema.
+
+To avoid this undesirable combinatorial explosion, the Daffodil schema
+compiler arranges to achieve the same macro-like semantics, while
+still sharing the core parts of the reusable components so they need
+only be compiled once for each unique way the component is used.
+
+This is also part of eventually enabling an extension of the DFDL
+language to allow recursive definitions.
+
+The dfdl:alignment property is one of the largest contributors to
+complexity in sharing objects by the schema compiler.
+
+Alignment will be used as the example in this design note to
+illustrate the schema compiler behavior.
+
+=== DSOM Term Objects
+
+DSOM is the Daffodil Schema Object Model.
+Within this model, the entities from a DFDL schema that actually have
+representations in the data stream are called _terms_ and they are
+represented by the subclasses of the class Term.
+There are terms which have no representation at all such as computed elements.
+
+For this discussion we are concerned only with terms that have representation. 
+
+The classes in Daffodil that are used for concrete terms are:
+
+* ElementRef
+* Root (A degenerate element ref to the root element)
+* LocalElementDecl
+* the quasi-elements:
+
+** PrefixLengthQuasiElementDecl (Fake local element decl used for the
+   length fields of Prefixed-length types)
+
+** RepTypeQuasiElementDecl (Fake local element decl used as the
+   representation of elements that are computed via the dfdl:repType
+   mechanism.)
+
+* Sequence
+* SequenceGroupRef
+
+* ChoiceBranchImpliedSequence (The sequence that is inserted when a
+  choice branch is a single element decl/ref)
+
+* Choice
+* ChoiceGroupRef
+
+=== Term Representation Regions
+
+Consider the representation of a term, which we call the term's
+_region_, as appearing between two other term regions in the data
+stream as illustrated below:
+
+[ditaa]
+....
+---+ +-----------------------------------------------+ +---
+   | |                    term                       | |
+   | |                                               | |
+   | | +---------------+ +--------+ +--------------+ | |
+...| | | unique before | | shared | | shared after | | |...
+   | | +---------------+ +--------+ +--------------+ | |
+ --+ +-----------------------------------------------+ +--
+....
+In the above, lower position bits are to the left. 
+Higher position bits to the right. 
+In the diagram, we see that the term's region consists of 3
+sub-regions, named _unique before_, _shared_, and _shared after_. Each
+term appears in a context of possible additional terms before it and
+after it which are shown with ellipsis above.
+
+A term can be an element or a model-group (sequence or choice). 
+By far the most common situation is that a term appears inside a model group. 
+There are two exceptions. The _root_, and the model-group of a complex type. 
+
+The core idea here is that we are separating the representation of a
+term into unique and shared parts.
+The unique region is affected by the surrounding model group context.
+The shared regions are, in many cases, sharable across instances of
+this term and is independent of the surrounding model-group context.
+So if the term was defined by way of a reusable global definition,
+then the goal is that there need be only one implementation of the
+shared part(s).
+
+There are some limits to this sharing. 
+A single implementation is not always achievable, but generally one or
+a small number of implementations are possible.
+
+=== Limits to Sharing
+
+Consider if this term above was defined by way of a XSD group
+reference. 
+The DFDL properties expressed on the group reference must
+be combined with those expressed on the group definition, to form the
+complete set of properties associated with the term. 
+This combining creates situations where a given shared group definition can have
+wildly different representations for different uses. 
+Consider this:
+
+```xml
+<group name="g">
+  <sequence>
+    <element name="a" type="xs:int"/>
+    <element name="b" type="xs:int"/>
+    <element name="c" type="xs:int"/>
+  </sequence>
+</group>
+
+....
+
+<element name="e">
+  <complexType>
+    <sequence>
+      ... some stuff here ...
+      <group ref="tns:g" dfdl:separator="|"/> <!-- separated -->
+      ... more stuff here ...
+      <group ref="tns:g"/> <!-- not separated -->
+      ... yet more stuff here ...
+    </sequence>
+  </complexType>
+</element>
+```
+
+The two uses of the group definition for 'g' are wildly different, in
+that one has separators, the other does not.
+
+This is a rather extreme example, but DFDL implementations must allow
+for this.
+
+Experience with DFDL schemas to-date (2020-01-21) is that this is very
+rare and appears only in test cases designed to exercise it.
+However, less extreme versions of this are possible, such as different
+uses all being separated and delimited textual data, but where the
+distinct uses do not all use the same separator characters, so the
+separator to be used would be specified differently on each group
+reference. Another expected variation could be separated sequence
+groups where some uses are infix separator and others are prefix or
+postfix separator.
+
+Anecdotally, the vast majority of schemas seen to-date reuse groups
+entirely, that is specifying no properties at all on the group
+reference.
+
+==== Property Environment or PropEnv
+
+We define the _property environment_ or _PropEnv_ of a term, as the
+complete set of properties for that term, combining them from all the
+schema components that define the term. This can include element
+references, group references, local or global element declarations,
+global group definitions, and local and global type definitions.
+
+A key observation is that the PropEnv of a term is entirely defined by:
+
+* local properties (including any dfdl:ref property) on the schema component for the term.
+** E.g., for an element reference, the local properties expressed on the element reference itself.
+* default format object at top level of the schema where the term lexically appears. 
+* definition object being referenced.
+** E.g., for an element reference, the global element declaration being referenced. 
+
+So if two terms, say group references, have the same PropEnv, then we
+can share much about their definitions when implementing them.
+
+Hence, when constructing the Daffodil schema compiler's representation
+for a term, we can maintain a cache for each global definition, with
+the key of the cache being the PropEnv. 
+When a given term uses a global definition, if the point of use has
+the same PropEnv as another, both can share the part of the term that
+is context independent, that is, the shared region.
+
+The actual components of the PropEnv of a term are slighly more
+complicated than described above.
+The table below gives the components of the PropEnv for the various
+kinds of terms. Note that the "local properties" below includes any
+dfdl:ref property if it appears, but equality of any property which
+has as its value a QName, like dfdl:ref, the property value is based
+on a resolved QName, not the QName string.
+
+[cols="2,6a"]
+.PropEnv Components
+|===
+|Term Definition (one PropEnv subtype per) | PropEnv Components
+
+|Local Element Decl with type reference
+|
+* Local properties expressed on the Local Element Decl (Set of property name + value pairs)
+* Lexically enclosing default format (object ref)
+* Type definition (object ref) or primitive type (object ref)
+| Local Element Decl with anonymous Simple Type Definition
+| 
+* Combined set of local properties expressed on the Local Element Decl and the anoymous Simple Type definition.
+* Lexically enclosing default format (object ref)
+* Base simple type definition (object ref), or primitive type (object ref)
+|Local Element Decl with anonymous complex type definition
+|
+* Local properties expressed on the local element decl
+* Lexically enclosing default format (object ref)
+* Anonymous complex type (object ref)
+|Element Reference to Global Element Decl 
+|
+* Local properties expressed on the element reference
+* Lexically enclosing default format (object ref) 
+* Global Element Decl (object ref)
+|===
+
+Lookups by PropEnv compare sets by equality of members, and object
+references by pointer equality i.e, eq comparison.
+
+This definition of PropEnv can be improved by memoizing the
+construction of default format objects, so that equivalent default
+formats from different schema documents are represented by the same
+object.
+That is, so that multiple schema documents each containing:
+```xml
+  <dfdl:format ref="prefix:formatName"/>
+```
+are instantiated as the same object when the QName resolves to the same format object.  
+
+==== Details of Unique and Shared Regions
+
+The division of the reprsentation into the unique part which appears
+before the shared parts indicates where we can share implementation,
+and where we must have unique context-specific implementation for the
+term.
+
+To understand this better, the below breaks down the sub-regions of a term further:
+
+[ditaa]
+....
+----+ +------------------------------------------------------------------------------+ +------+ +---------------------------------------------------+ +----
 
 Review comment:
   Yuck. The actual file of course the lines are not wrapped to hell like this. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] bsloane1650 commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
bsloane1650 commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r375332899
 
 

 ##########
 File path: site/dev/design-notes/term-sharing-in-schema-compiler.adoc
 ##########
 @@ -0,0 +1,562 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length.
+// It's the starting on a new line that matters.
+//
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Term Sharing in the Schema Compiler
+
+=== Introduction
+
+The DFDL language has a composition property known as _referential transparency_. 
+It is not supposed to matter whether you lift a part of the schema out
+and create a reusable group, reusable type, or reusable element from
+it.
+
+Because DFDL (version 1.0) does not allow recursive definitions of any
+kind, it is theoretically possible to implement this by treating all
+reusable definitions in the language like macros.
+I.e., inline-expand all definitions at their point of use. 
+This will work for schemas up to some size. However, it leads to an
+unacceptable expansion in schema compilation time and space, as the
+size of the schema once all these inline substitutions have been
+performed can be exponentially larger than the original
+non-substituted schema.
+
+To avoid this undesirable combinatorial explosion, the Daffodil schema
+compiler arranges to achieve the same macro-like semantics, while
+still sharing the core parts of the reusable components so they need
+only be compiled once for each unique way the component is used.
+
+This is also part of eventually enabling an extension of the DFDL
+language to allow recursive definitions.
+
+The dfdl:alignment property is one of the largest contributors to
+complexity in sharing objects by the schema compiler.
+
+Alignment will be used as the example in this design note to
+illustrate the schema compiler behavior.
+
+=== DSOM Term Objects
+
+DSOM is the Daffodil Schema Object Model.
+Within this model, the entities from a DFDL schema that actually have
+representations in the data stream are called _terms_ and they are
+represented by the subclasses of the class Term.
+There are terms which have no representation at all such as computed elements.
 
 Review comment:
   Wording is confusing here. We just defined terms as entities that "actually have representations", and here are saying that they need not have representations.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r376115369
 
 

 ##########
 File path: site/dev/aboutAsciiDoc.adoc
 ##########
 @@ -69,23 +69,9 @@ image::http://daffodil.apache.org/assets/themes/apache/img/apache-daffodil-logo.
 == Example Diagrams
 
 Review comment:
   Please ignore this file in your review. I'm just updating it with useful asciidoc tricks/tips as I learn them.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r376058795
 
 

 ##########
 File path: site/dev/design-notes/namespace-binding-minimization.adoc
 ##########
 @@ -0,0 +1,142 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Namespace Binding Minimization
+
+=== Introduction
+
+DFDL schemas are XML schemas and so DFDL inherits the namespace system of XML and XML Schema for composing large schemas from smaller ones, for reusing schema files, and for managing name conflicts. 
+
+A DFDL Infoset isn't necessarily represented as XML however. 
+Some representations won't have any ability to deal with namespaces (JSON for example), and so Daffodil will sometimes issue warnings when compiling a schema if the namespace usage will not allow unambiguous representation without namespaces. 
+
+Most representations of DFDL Infosets will, like XML, use some representation of the namespaces of elements, and in textual forms this will most commonly be by way of namespace prefixes. 
+XML is not the only representation that uses namespaces, however, so this should not be taken as an entirely XML-specific discussion.
+
+There are two goals for namespace-binding minimization. 
+
+. Clarity: Infosets that have redundant namespace bindings are very hard to read and understand, and require namespace-binding-aware tooling to compare them, or clumsy post-processing to remove the excess bindings. 
+
+. Performance: Attaching an element to the infoset at runtime should take constant time.
+
+The most straightforward way to achieve both these objectives is to express namespace definitions only on the root element of the infoset, and to change the namespace prefixes (or use of default namespace) for some qualified-name elements if that is necessary to enable this.  
 
 Review comment:
   Note that technically, the XML Infoset (ignoring XML Schema) says that the specific characters in namespace prefixes ARE significant and are part of the XML infoset. 
   
   The convention that says "prefixes don't matter, namespaces do", is exactly that, a convention. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r376057942
 
 

 ##########
 File path: site/dev/design-notes/namespace-binding-minimization.adoc
 ##########
 @@ -0,0 +1,142 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Namespace Binding Minimization
+
+=== Introduction
+
+DFDL schemas are XML schemas and so DFDL inherits the namespace system of XML and XML Schema for composing large schemas from smaller ones, for reusing schema files, and for managing name conflicts. 
+
+A DFDL Infoset isn't necessarily represented as XML however. 
+Some representations won't have any ability to deal with namespaces (JSON for example), and so Daffodil will sometimes issue warnings when compiling a schema if the namespace usage will not allow unambiguous representation without namespaces. 
+
+Most representations of DFDL Infosets will, like XML, use some representation of the namespaces of elements, and in textual forms this will most commonly be by way of namespace prefixes. 
+XML is not the only representation that uses namespaces, however, so this should not be taken as an entirely XML-specific discussion.
+
+There are two goals for namespace-binding minimization. 
+
+. Clarity: Infosets that have redundant namespace bindings are very hard to read and understand, and require namespace-binding-aware tooling to compare them, or clumsy post-processing to remove the excess bindings. 
+
+. Performance: Attaching an element to the infoset at runtime should take constant time.
+
+The most straightforward way to achieve both these objectives is to express namespace definitions only on the root element of the infoset, and to change the namespace prefixes (or use of default namespace) for some qualified-name elements if that is necessary to enable this.  
 
 Review comment:
   From team discussion: messing with prefix definitions like this creates many problems. Even if all the algorithms are deterministic, moving around include files or the order of namespace bindings could produce different namespace prefixes for the same exact logical data. 
   
   Consider just pre-computing a data structure that  will answer the question of what local namespace bindings to introduce, and what prefix to use (hopefully NOT tns). This can't be constant time, though it usuallly would be, but worst case one needs different bindings based on knowing the full nest from root->A->B->C when appending D, the right namespace bindings can depend on the whole set from root on down.  
   
   Usually it won't be that bad. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r376111928
 
 

 ##########
 File path: site/dev/design-notes/term-sharing-in-schema-compiler.adoc
 ##########
 @@ -0,0 +1,562 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length.
+// It's the starting on a new line that matters.
+//
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Term Sharing in the Schema Compiler
+
+=== Introduction
+
+The DFDL language has a composition property known as _referential transparency_. 
+It is not supposed to matter whether you lift a part of the schema out
+and create a reusable group, reusable type, or reusable element from
+it.
+
+Because DFDL (version 1.0) does not allow recursive definitions of any
+kind, it is theoretically possible to implement this by treating all
+reusable definitions in the language like macros.
+I.e., inline-expand all definitions at their point of use. 
+This will work for schemas up to some size. However, it leads to an
+unacceptable expansion in schema compilation time and space, as the
+size of the schema once all these inline substitutions have been
+performed can be exponentially larger than the original
+non-substituted schema.
+
+To avoid this undesirable combinatorial explosion, the Daffodil schema
+compiler arranges to achieve the same macro-like semantics, while
+still sharing the core parts of the reusable components so they need
+only be compiled once for each unique way the component is used.
+
+This is also part of eventually enabling an extension of the DFDL
+language to allow recursive definitions.
+
+The dfdl:alignment property is one of the largest contributors to
+complexity in sharing objects by the schema compiler.
+
+Alignment will be used as the example in this design note to
+illustrate the schema compiler behavior.
+
+=== DSOM Term Objects
+
+DSOM is the Daffodil Schema Object Model.
+Within this model, the entities from a DFDL schema that actually have
+representations in the data stream are called _terms_ and they are
+represented by the subclasses of the class Term.
+There are terms which have no representation at all such as computed elements.
+
+For this discussion we are concerned only with terms that have representation. 
+
+The classes in Daffodil that are used for concrete terms are:
+
+* ElementRef
+* Root (A degenerate element ref to the root element)
+* LocalElementDecl
+* the quasi-elements:
+
+** PrefixLengthQuasiElementDecl (Fake local element decl used for the
+   length fields of Prefixed-length types)
+
+** RepTypeQuasiElementDecl (Fake local element decl used as the
+   representation of elements that are computed via the dfdl:repType
+   mechanism.)
+
+* Sequence
+* SequenceGroupRef
+
+* ChoiceBranchImpliedSequence (The sequence that is inserted when a
+  choice branch is a single element decl/ref)
+
+* Choice
+* ChoiceGroupRef
+
+=== Term Representation Regions
+
+Consider the representation of a term, which we call the term's
+_region_, as appearing between two other term regions in the data
+stream as illustrated below:
+
+[ditaa]
+....
+---+ +-----------------------------------------------+ +---
+   | |                    term                       | |
+   | |                                               | |
+   | | +---------------+ +--------+ +--------------+ | |
+...| | | unique before | | shared | | shared after | | |...
+   | | +---------------+ +--------+ +--------------+ | |
+ --+ +-----------------------------------------------+ +--
+....
+In the above, lower position bits are to the left. 
+Higher position bits to the right. 
+In the diagram, we see that the term's region consists of 3
+sub-regions, named _unique before_, _shared_, and _shared after_. Each
+term appears in a context of possible additional terms before it and
+after it which are shown with ellipsis above.
+
+A term can be an element or a model-group (sequence or choice). 
+By far the most common situation is that a term appears inside a model group. 
+There are two exceptions. The _root_, and the model-group of a complex type. 
+
+The core idea here is that we are separating the representation of a
+term into unique and shared parts.
+The unique region is affected by the surrounding model group context.
+The shared regions are, in many cases, sharable across instances of
+this term and is independent of the surrounding model-group context.
+So if the term was defined by way of a reusable global definition,
+then the goal is that there need be only one implementation of the
+shared part(s).
+
+There are some limits to this sharing. 
+A single implementation is not always achievable, but generally one or
+a small number of implementations are possible.
+
+=== Limits to Sharing
+
+Consider if this term above was defined by way of a XSD group
+reference. 
+The DFDL properties expressed on the group reference must
+be combined with those expressed on the group definition, to form the
+complete set of properties associated with the term. 
+This combining creates situations where a given shared group definition can have
+wildly different representations for different uses. 
+Consider this:
+
+```xml
+<group name="g">
+  <sequence>
+    <element name="a" type="xs:int"/>
+    <element name="b" type="xs:int"/>
+    <element name="c" type="xs:int"/>
+  </sequence>
+</group>
+
+....
+
+<element name="e">
+  <complexType>
+    <sequence>
+      ... some stuff here ...
+      <group ref="tns:g" dfdl:separator="|"/> <!-- separated -->
+      ... more stuff here ...
+      <group ref="tns:g"/> <!-- not separated -->
+      ... yet more stuff here ...
+    </sequence>
+  </complexType>
+</element>
+```
+
+The two uses of the group definition for 'g' are wildly different, in
+that one has separators, the other does not.
+
+This is a rather extreme example, but DFDL implementations must allow
+for this.
+
+Experience with DFDL schemas to-date (2020-01-21) is that this is very
+rare and appears only in test cases designed to exercise it.
+However, less extreme versions of this are possible, such as different
+uses all being separated and delimited textual data, but where the
+distinct uses do not all use the same separator characters, so the
+separator to be used would be specified differently on each group
+reference. Another expected variation could be separated sequence
+groups where some uses are infix separator and others are prefix or
+postfix separator.
+
+Anecdotally, the vast majority of schemas seen to-date reuse groups
+entirely, that is specifying no properties at all on the group
+reference.
+
+==== Property Environment or PropEnv
+
+We define the _property environment_ or _PropEnv_ of a term, as the
+complete set of properties for that term, combining them from all the
+schema components that define the term. This can include element
+references, group references, local or global element declarations,
+global group definitions, and local and global type definitions.
+
+A key observation is that the PropEnv of a term is entirely defined by:
+
+* local properties (including any dfdl:ref property) on the schema component for the term.
+** E.g., for an element reference, the local properties expressed on the element reference itself.
+* default format object at top level of the schema where the term lexically appears. 
+* definition object being referenced.
+** E.g., for an element reference, the global element declaration being referenced. 
+
+So if two terms, say group references, have the same PropEnv, then we
+can share much about their definitions when implementing them.
+
+Hence, when constructing the Daffodil schema compiler's representation
+for a term, we can maintain a cache for each global definition, with
+the key of the cache being the PropEnv. 
+When a given term uses a global definition, if the point of use has
+the same PropEnv as another, both can share the part of the term that
+is context independent, that is, the shared region.
+
+The actual components of the PropEnv of a term are slighly more
+complicated than described above.
+The table below gives the components of the PropEnv for the various
+kinds of terms. Note that the "local properties" below includes any
+dfdl:ref property if it appears, but equality of any property which
+has as its value a QName, like dfdl:ref, the property value is based
+on a resolved QName, not the QName string.
+
+[cols="2,6a"]
+.PropEnv Components
+|===
+|Term Definition (one PropEnv subtype per) | PropEnv Components
+
+|Local Element Decl with type reference
+|
+* Local properties expressed on the Local Element Decl (Set of property name + value pairs)
+* Lexically enclosing default format (object ref)
+* Type definition (object ref) or primitive type (object ref)
+| Local Element Decl with anonymous Simple Type Definition
+| 
+* Combined set of local properties expressed on the Local Element Decl and the anoymous Simple Type definition.
+* Lexically enclosing default format (object ref)
+* Base simple type definition (object ref), or primitive type (object ref)
+|Local Element Decl with anonymous complex type definition
+|
+* Local properties expressed on the local element decl
+* Lexically enclosing default format (object ref)
+* Anonymous complex type (object ref)
+|Element Reference to Global Element Decl 
+|
+* Local properties expressed on the element reference
+* Lexically enclosing default format (object ref) 
+* Global Element Decl (object ref)
+|===
+
+Lookups by PropEnv compare sets by equality of members, and object
+references by pointer equality i.e, eq comparison.
+
+This definition of PropEnv can be improved by memoizing the
+construction of default format objects, so that equivalent default
+formats from different schema documents are represented by the same
+object.
+That is, so that multiple schema documents each containing:
+```xml
+  <dfdl:format ref="prefix:formatName"/>
+```
+are instantiated as the same object when the QName resolves to the same format object.  
+
+==== Details of Unique and Shared Regions
+
+The division of the reprsentation into the unique part which appears
+before the shared parts indicates where we can share implementation,
+and where we must have unique context-specific implementation for the
+term.
+
+To understand this better, the below breaks down the sub-regions of a term further:
+
+[ditaa]
+....
+----+ +------------------------------------------------------------------------------+ +------+ +---------------------------------------------------+ +----
+    | | +-------------------+                    unique before                       | |shared| |       shared after        +---------------------+ | |
+    | | |  sequence before  |                                                        | |      | |                           |    sequence after   | | |
+    | | | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ | |      | | +------+ +------+ +-----+ | +-------+ +-------+ | | |
+    | | | |prefix| |prefix| | |lSkip| |alignFill| |init'r| |init'r| |prefix| +value| | |      | | |term'r| |term'r| |tSkip| | |postfix| |postfix| | | |
+    | | | |infix | |infix | | |     | |         | | MTA  | |      | |length| | MTA | | |      | | | MTA  | |      | |     | | | sep'r | | sep'r | | | |
+    | | | |sep'r | |sep'r | | |     | |         | |      | |      | |      | |     | | |      | | |      | |      | |     | | |  MTA  | |       | | | |
+... | | | | MTA  | |      | | |     | |         | |      | |      | |      | |     | | |      | | |      | |      | |     | | |       | |       | | | |...
+    | | | +------+ +------+ | +-----+ +---------+ +------+ +------+ +------+ +-----+ | |      | | +------+ +------+ +-----+ | +-------+ +-------+ | | |
+    | | +-------------------+                                                        | |      | |                           +---------------------+ | |
+----+ +------------------------------------------------------------------------------+ +------+ +---------------------------------------------------+ +----
+
+                                                                                       ^
+                                                                                       |
+                                                                                       |
+                                                                                       +
+                                                                                  Known Alignment Point
+
+....
+
+The central idea here is that all the regions to the left (before) the known alignment point are context dependent.
+That is, whether these alignment fill and mandatory text alignment regions can be statically determined 
+in size is dependent on where the prior term ended.
+
+In contrast, the known alignment point is a fresh start for alignment. The alignment as required by the alignment properties
 
 Review comment:
   This doesn't have exactly macro semantics then. 
   (Brandon's observation).
   The difference is super subtle and involves the interior-alignment problem and won't be detectable except when unparsing. 
   Add description.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-daffodil-site] mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue

Posted by GitBox <gi...@apache.org>.
mbeckerle commented on a change in pull request #17: Design notes on schema compiler space/speed issue
URL: https://github.com/apache/incubator-daffodil-site/pull/17#discussion_r375056717
 
 

 ##########
 File path: site/dev/design-notes/term-sharing-in-schema-compiler.adoc
 ##########
 @@ -0,0 +1,562 @@
+:page-layout: page
+:keywords: schema-compiler performance alignment optimization
+// ///////////////////////////////////////////////////////////////////////////
+//
+// This file is written in AsciiDoc.
+//
+// If you can read this comment, your browser is not rendering asciidoc automatically.
+//
+// You need to install the asciidoc plugin to Chrome or Firefox
+// so that this page will be properly rendered for your viewing pleasure.
+//
+// You can get the plugins by searching the web for 'asciidoc plugin'
+//
+// You will want to change plugin settings to enable diagrams (they're off by default.)
+//
+// You need to view this page with Chrome or Firefox.
+//
+// ///////////////////////////////////////////////////////////////////////////
+//
+// When editing, please start each sentence on a new line.
+// See https://asciidoctor.org/docs/asciidoc-recommended-practices/#one-sentence-per-line[one sentence-per-line writing technique.]
+// It is acceptable to start each sentence on a new line, but then wrap the lines to a reasonable length.
+// It's the starting on a new line that matters.
+//
+// This makes textual diffs of this file useful in a similar way to the way they work for code.
+//
+// //////////////////////////////////////////////////////////////////////////
+
+== Term Sharing in the Schema Compiler
+
+=== Introduction
+
+The DFDL language has a composition property known as _referential transparency_. 
+It is not supposed to matter whether you lift a part of the schema out
+and create a reusable group, reusable type, or reusable element from
+it.
+
+Because DFDL (version 1.0) does not allow recursive definitions of any
+kind, it is theoretically possible to implement this by treating all
+reusable definitions in the language like macros.
+I.e., inline-expand all definitions at their point of use. 
+This will work for schemas up to some size. However, it leads to an
+unacceptable expansion in schema compilation time and space, as the
+size of the schema once all these inline substitutions have been
+performed can be exponentially larger than the original
+non-substituted schema.
+
+To avoid this undesirable combinatorial explosion, the Daffodil schema
+compiler arranges to achieve the same macro-like semantics, while
+still sharing the core parts of the reusable components so they need
+only be compiled once for each unique way the component is used.
+
+This is also part of eventually enabling an extension of the DFDL
+language to allow recursive definitions.
+
+The dfdl:alignment property is one of the largest contributors to
+complexity in sharing objects by the schema compiler.
+
+Alignment will be used as the example in this design note to
+illustrate the schema compiler behavior.
+
+=== DSOM Term Objects
+
+DSOM is the Daffodil Schema Object Model.
+Within this model, the entities from a DFDL schema that actually have
+representations in the data stream are called _terms_ and they are
+represented by the subclasses of the class Term.
+There are terms which have no representation at all such as computed elements.
+
+For this discussion we are concerned only with terms that have representation. 
+
+The classes in Daffodil that are used for concrete terms are:
+
+* ElementRef
+* Root (A degenerate element ref to the root element)
+* LocalElementDecl
+* the quasi-elements:
+
+** PrefixLengthQuasiElementDecl (Fake local element decl used for the
+   length fields of Prefixed-length types)
+
+** RepTypeQuasiElementDecl (Fake local element decl used as the
+   representation of elements that are computed via the dfdl:repType
+   mechanism.)
+
+* Sequence
+* SequenceGroupRef
+
+* ChoiceBranchImpliedSequence (The sequence that is inserted when a
+  choice branch is a single element decl/ref)
+
+* Choice
+* ChoiceGroupRef
+
+=== Term Representation Regions
+
+Consider the representation of a term, which we call the term's
+_region_, as appearing between two other term regions in the data
+stream as illustrated below:
+
+[ditaa]
+....
+---+ +-----------------------------------------------+ +---
+   | |                    term                       | |
+   | |                                               | |
+   | | +---------------+ +--------+ +--------------+ | |
+...| | | unique before | | shared | | shared after | | |...
+   | | +---------------+ +--------+ +--------------+ | |
+ --+ +-----------------------------------------------+ +--
+....
+In the above, lower position bits are to the left. 
+Higher position bits to the right. 
+In the diagram, we see that the term's region consists of 3
+sub-regions, named _unique before_, _shared_, and _shared after_. Each
+term appears in a context of possible additional terms before it and
+after it which are shown with ellipsis above.
+
+A term can be an element or a model-group (sequence or choice). 
+By far the most common situation is that a term appears inside a model group. 
+There are two exceptions. The _root_, and the model-group of a complex type. 
+
+The core idea here is that we are separating the representation of a
+term into unique and shared parts.
+The unique region is affected by the surrounding model group context.
+The shared regions are, in many cases, sharable across instances of
+this term and is independent of the surrounding model-group context.
+So if the term was defined by way of a reusable global definition,
+then the goal is that there need be only one implementation of the
+shared part(s).
+
+There are some limits to this sharing. 
+A single implementation is not always achievable, but generally one or
+a small number of implementations are possible.
+
+=== Limits to Sharing
+
+Consider if this term above was defined by way of a XSD group
+reference. 
+The DFDL properties expressed on the group reference must
+be combined with those expressed on the group definition, to form the
+complete set of properties associated with the term. 
+This combining creates situations where a given shared group definition can have
+wildly different representations for different uses. 
+Consider this:
+
+```xml
+<group name="g">
+  <sequence>
+    <element name="a" type="xs:int"/>
+    <element name="b" type="xs:int"/>
+    <element name="c" type="xs:int"/>
+  </sequence>
+</group>
+
+....
+
+<element name="e">
+  <complexType>
+    <sequence>
+      ... some stuff here ...
+      <group ref="tns:g" dfdl:separator="|"/> <!-- separated -->
+      ... more stuff here ...
+      <group ref="tns:g"/> <!-- not separated -->
+      ... yet more stuff here ...
+    </sequence>
+  </complexType>
+</element>
+```
+
+The two uses of the group definition for 'g' are wildly different, in
+that one has separators, the other does not.
+
+This is a rather extreme example, but DFDL implementations must allow
+for this.
+
+Experience with DFDL schemas to-date (2020-01-21) is that this is very
+rare and appears only in test cases designed to exercise it.
+However, less extreme versions of this are possible, such as different
+uses all being separated and delimited textual data, but where the
+distinct uses do not all use the same separator characters, so the
+separator to be used would be specified differently on each group
+reference. Another expected variation could be separated sequence
+groups where some uses are infix separator and others are prefix or
+postfix separator.
+
+Anecdotally, the vast majority of schemas seen to-date reuse groups
+entirely, that is specifying no properties at all on the group
+reference.
+
+==== Property Environment or PropEnv
+
+We define the _property environment_ or _PropEnv_ of a term, as the
+complete set of properties for that term, combining them from all the
+schema components that define the term. This can include element
+references, group references, local or global element declarations,
+global group definitions, and local and global type definitions.
+
+A key observation is that the PropEnv of a term is entirely defined by:
+
+* local properties (including any dfdl:ref property) on the schema component for the term.
+** E.g., for an element reference, the local properties expressed on the element reference itself.
+* default format object at top level of the schema where the term lexically appears. 
+* definition object being referenced.
+** E.g., for an element reference, the global element declaration being referenced. 
+
+So if two terms, say group references, have the same PropEnv, then we
+can share much about their definitions when implementing them.
+
+Hence, when constructing the Daffodil schema compiler's representation
+for a term, we can maintain a cache for each global definition, with
+the key of the cache being the PropEnv. 
+When a given term uses a global definition, if the point of use has
+the same PropEnv as another, both can share the part of the term that
+is context independent, that is, the shared region.
+
+The actual components of the PropEnv of a term are slighly more
+complicated than described above.
+The table below gives the components of the PropEnv for the various
+kinds of terms. Note that the "local properties" below includes any
+dfdl:ref property if it appears, but equality of any property which
+has as its value a QName, like dfdl:ref, the property value is based
+on a resolved QName, not the QName string.
+
+[cols="2,6a"]
+.PropEnv Components
+|===
+|Term Definition (one PropEnv subtype per) | PropEnv Components
+
+|Local Element Decl with type reference
+|
+* Local properties expressed on the Local Element Decl (Set of property name + value pairs)
+* Lexically enclosing default format (object ref)
+* Type definition (object ref) or primitive type (object ref)
+| Local Element Decl with anonymous Simple Type Definition
+| 
+* Combined set of local properties expressed on the Local Element Decl and the anoymous Simple Type definition.
+* Lexically enclosing default format (object ref)
+* Base simple type definition (object ref), or primitive type (object ref)
+|Local Element Decl with anonymous complex type definition
+|
+* Local properties expressed on the local element decl
+* Lexically enclosing default format (object ref)
+* Anonymous complex type (object ref)
+|Element Reference to Global Element Decl 
+|
+* Local properties expressed on the element reference
+* Lexically enclosing default format (object ref) 
+* Global Element Decl (object ref)
+|===
+
+Lookups by PropEnv compare sets by equality of members, and object
+references by pointer equality i.e, eq comparison.
+
+This definition of PropEnv can be improved by memoizing the
+construction of default format objects, so that equivalent default
+formats from different schema documents are represented by the same
+object.
+That is, so that multiple schema documents each containing:
+```xml
+  <dfdl:format ref="prefix:formatName"/>
+```
+are instantiated as the same object when the QName resolves to the same format object.  
+
+==== Details of Unique and Shared Regions
+
+The division of the reprsentation into the unique part which appears
+before the shared parts indicates where we can share implementation,
+and where we must have unique context-specific implementation for the
+term.
+
+To understand this better, the below breaks down the sub-regions of a term further:
+
+[ditaa]
+....
+----+ +------------------------------------------------------------------------------+ +------+ +---------------------------------------------------+ +----
 
 Review comment:
   Choose "display the rich diff" on the icon buttons at the top of the file, and you get a more legible view. 
   The problem is that this diagram is quite wide. But ... it needs to be. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services