Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@tresys.com> on 2017/10/21 01:43:38 UTC

Notes on Daffodil Schema Compiler redesign - improve footprint and speed

Discussion topics around reducing the memory footprint of the Daffodil Schema Compiler.


Issue: a combinatorial explosion of duplicated objects in the compiler, all of which have to be compiled. It is a somewhat silly mistake: because the Daffodil schema compiler is effectively a big functional program, every copy computes the same result regardless of the copying. We just recompute the same thing over and over.


Fix: stop doing that. Share objects. The maximum number of DSOM objects should equal, 1-to-1, the number of XML elements, types, and groups in the DFDL schema. It can be fewer than that given optimizations.
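As an illustration of the intended sharing, a memoizing factory along these lines would guarantee one DSOM object per global definition, so each is compiled exactly once. This is a minimal sketch; the class names (RefQName, GlobalElementDecl, SchemaSet) are hypothetical stand-ins, not the actual DSOM API.

```scala
import scala.collection.mutable

final case class RefQName(namespace: String, local: String)

final class GlobalElementDecl(val name: RefQName)

final class SchemaSet {
  private val decls = mutable.Map[RefQName, GlobalElementDecl]()

  // Every reference resolving the same QName gets the identical shared
  // object, so the number of objects is bounded by the number of
  // definitions, not the number of references.
  def globalElementDecl(name: RefQName): GlobalElementDecl =
    decls.getOrElseUpdate(name, new GlobalElementDecl(name))
}
```

Two lookups of the same QName then return the very same instance (eq), which is what makes the compiled output shareable.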


This sharing requires undoing the unique backpointer that objects have to their referencing context. The lexical parent is still unique, but the containing references are not: for a shared object there is a set of such referencing places.
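In sketch form, replacing the unique backpointer with a set of referencing contexts might look like the following. Class and member names here are illustrative only, not the actual DSOM classes.

```scala
import scala.collection.mutable

final class SchemaComponent(val lexicalParent: Option[SchemaComponent]) {
  // The one lexical parent stays unique; the referencing contexts do not.
  // A shared component accumulates a set of back-refs instead of holding
  // a single backpointer.
  private val referringContexts = mutable.LinkedHashSet[SchemaComponent]()

  def addReference(ref: SchemaComponent): Unit = referringContexts += ref

  // Any analysis must now consider every referencing context rather than
  // following "the" backpointer.
  def references: Seq[SchemaComponent] = referringContexts.toSeq
}
```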


Any calculations in the Schema Compiler that depend on traversing "the" unique backpointer have to be removed. If you just delete the backpointer, then delete all the code that breaks, then the code that removal breaks, and so on, you will eventually remove ALL the code, because it all boils down to properties, and today even property resolution uses the backpointer. So fixing property resolution is job 1 in the compiler.


Any runtime mechanisms that depend on such calculations must be removed. E.g., slots; alignment inference (are we pre-aligned based on what precedes us, or do we have to align?); change parsers/unparsers that set the encoding into the I/O layer only when the encoding may have changed. Those kinds of analyses would have to be generalized to take all contexts into account, since shared objects will have multiple back-refs to the contexts that share them. For the most part this is not worth it.


A side benefit of fixing this sharing in the compiler is that recursive schemas should work. This is a useful feature (though an extension to DFDL) enabling description of document formats and other kinds of data with recursive structure. We will still want a flag to turn it on/off and a check to insist on non-recursive schemas.



Runtime Changes


  *   Slots - remove and use QNames
     *   Note: an optimization is possible by interning QNames, which would avoid string comparisons of element names. Namespaces are already interned, so they can be compared via EQ comparisons.
     *   The unordered sequence combinator can keep state in PState/UState, enabling gathering of incoming elements into the proper buckets.
     *   We considered whether QNames alone were enough, or whether a QName plus a number (to make it unique within a parent element) is needed. Concluded that just the QName is correct.
     *   DPath expressions are effectively lists of upward moves and downward moves specified by QNames.
     *   Taylor Wise has code for this almost done.
  *   Change-parsers/unparsers - eliminate in favor of I/O layer callback to PState/UState to obtain properties needed.
     *   bitOrderChangeUnparser - still needed - because it behaves like alignment unparser, but see below - we're getting rid of alignment parsers/unparsers too.
     *   Mike Beckerle has code for this working except for bitOrderChangeUnparser and alignment unparsers are still there.
  *   Alignment
     *   SDE on outputValueCalc (OVC) elements w/o specified length - so much complexity arises when the OVC element's length isn't known that it's just not worth it. A restriction requiring specified length massively simplifies the unparser implementation. The only suspensions are then for the outputValueCalc elements themselves, whose expressions may block until later infoset objects arrive at the unparser, so that their values or lengths can be used in the OVC calculation.
     *   If you do this, alignment operations cannot suspend (nor can bitOrderChangeUnparser)
     *   Alignment operations can all move to Term and text (mandatory text alignment) parsers/unparsers.
     *   Aligning is basically one line of code, so the overhead is negligible.
  *   Delimiters(?)
     *   Because delimiters are now pushed on stacks, they aren't incorporated directly into the definition of a parser. That is, a parser doesn't have to look outward through all its containing separators and terminators to get the complete set of things that might delimit it. This is done at runtime by pushing them onto a stack as scopes are entered.
  *   Smaller things: there aren't unique paths from root to every element in the ERD graph any more.
     *   So we can't use those paths as unique identifiers anymore because ERDs will be shared now.
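The QName-interning optimization mentioned under Slots could be sketched roughly as below. This is not the runtime's actual QName type, just an illustration of how interning makes eq (reference) comparison valid in shared ERDs.

```scala
import scala.collection.mutable

// Private constructor: instances can only be created via the interning
// factory, so two structurally equal QNames are always the same object.
final class QName private (val namespace: String, val local: String)

object QName {
  private val pool = mutable.Map[(String, String), QName]()

  // Interning factory: the first request creates the instance, every
  // later request with the same (namespace, local) pair returns it.
  def apply(namespace: String, local: String): QName =
    pool.getOrElseUpdate((namespace, local), new QName(namespace, local))
}
```

With this in place, name matching during parsing is a single reference comparison instead of two string comparisons.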


Compiler Changes


  *   DSOM - refactor object model.
     *   See UML diagram sketch (Attached.)
     *   The part of this sketch around ModelGroups isn't right. It should be analogous to the structure for Elements and ElementRefs.
  *   DSOM - Remove back pointers. Replace with shared objects having set of back-refs.
     *   The algorithm to construct all objects and backpointer sets will be a first pass over everything, creating this linked tree with sharing.
     *   After that, no more DSOM objects will be created.
  *   Property resolution w/o using backpointers
     *   Right now, elements reach over to the referencing element ref via the backpointer (if there is one) to grab properties.
     *   This needs to change so that the ElementRef is the place where properties are resolved.
     *   Mike Beckerle already has this code working - it has to be separated from other, unsuccessful changes in a branch.
  *   Expressions with up-and-out relative paths must be type-checked for all back-refs
     *   means all back-refs must be in place before such compilation is begun
     *   The actual compiled representation of the path is invariant of context location.
     *   Issue: same path names, but different types requiring different casts/conversions - or do we say one-path means one type signature? There's a possibility an up-and-out path would be used polymorphically for different types, but still be type-correct in each situation.
  *   Delimiters (?)
  *   Hoisting of property-dependent code off of type-objects and global-element-decl and global group def
     *   All property-dependent code goes onto Term (or TermCodeGen) class.
     *   Note: An element cannot ask its complex type for children, and then generate code for them, as that would just put back all the copying, just in the parser/unparser generation phase. Instead an element must ask for the shared parsers/unparsers of the children of its type.
        *   This works because complex types and their model groups do not combine properties with the element referencing them. So they can always be shared.
        *   However, this lack of combining for complex types with their elements is asymmetric with how simple types and elements work, and it has been proposed to lift this restriction. In that case, one must cache compiled DSOM objects (parsers/unparsers) based on the property environment they are in. Whenever an object is compiled in an identical property environment, the compiled object can be shared, because any difference in the compiler output must derive from the properties.
  *   Optimization: ElementRef/TypeRef/GroupRef consolidation
     *   If two ElementRefs have the same ref, the same properties, and the same default properties, then they can share a representation.
     *   The new schema compiler should not generate element-ref location specific code.
     *   Similarly for Elements (aka TypeRefs), and GroupRefs.
     *   This can result in the number of DSOM objects actually being substantially smaller than the number of XML elements expressed textually in the DFDL schema.
  *   Feature Needed: Compile all elements as root in one compiled object for the schema. No need to pre-identify the root element before compiling.
     *   Except those that have upward-and-out expressions - those cannot be root.
        *   Need to throw specific exceptions that can be caught to disqualify an element for root use while still allowing it for use from element references.
     *   A Root really behaves like an element reference.
     *   In the new UML design, ElementRefs and LocalElementDecls can generate parser/unparsers, but GlobalElementDecls cannot, so one needs to create a Root for each GlobalElementDecl to allow it to be used as a Root.
  *   Paths - can't trace backpointers anymore - so cannot be used as unique identifiers of a context any more.
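The property-environment-keyed cache suggested under "Hoisting" might be sketched as follows. All names here (PropertyEnv, Parser, ParserCache) are hypothetical; the point is only that identical resolved properties imply identical compiler output, so the compiled object can be reused.

```scala
import scala.collection.mutable

// The fully-resolved property bindings in effect at a use site.
final case class PropertyEnv(props: Map[String, String])

final class Parser(val description: String)

final class ParserCache {
  private val cache = mutable.Map[(String, PropertyEnv), Parser]()

  // Compile at most once per (type, property environment) pair; a second
  // use site with the same environment shares the compiled parser.
  def compiled(typeName: String, env: PropertyEnv)(compile: => Parser): Parser =
    cache.getOrElseUpdate((typeName, env), compile)
}
```

The by-name `compile` argument ensures the (possibly expensive) compilation only runs on a cache miss.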


Re: Notes on Daffodil Schema Compiler redesign - improve footprint and speed

Posted by Mike Beckerle <mb...@tresys.com>.
So I set out to finish the "property resolving" aspect of this refactoring.


It got much more involved than I expected. I have 80+ daffodil-test failures right now.


Turns out one of the assumptions we made, that we can restrict dfdl:outputValueCalc elements to require specified length, isn't valid.


Turns out dfdl:lengthKind 'pattern' is used heavily in conjunction with outputValueCalc for 7-bit ASCII strings in common Mil formats like VMF.


Now, in these formats there is no need for alignment - everything is 1-bit aligned. So there shouldn't be any alignment unparsers anyway; but the absolute position of anything appearing after an OVC element isn't going to be known.

So the specified-length restriction was too simplistic.


So we need to allow dfdl:lengthKind 'pattern' as well. Unless we add the restriction that everything must then be 1-bit aligned, we will need alignment unparsers that can block waiting for the absolute position to become known.


I don't like putting in a special case for non-8-bit-aligned characters. So I'm inclined to leave the unparser architecture as is, where alignment unparsers may still, possibly, be present.


This means that we still have to compute a compile-time approximation of the provided alignment, so as to determine whether the required alignment is satisfied and the alignment unparser can be optimized out. This computation has to take all the possible contexts (i.e., all N back-refs) into account.
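A rough sketch of such an approximation, in illustrative Scala: combining per-context alignment knowledge via gcd is my assumption about a safe way to merge what the N back-refs each guarantee; none of these names are Daffodil's actual API.

```scala
import scala.annotation.tailrec

object AlignmentCheck {
  @tailrec
  def gcd(a: Long, b: Long): Long = if (b == 0) a else gcd(b, a % b)

  // Each referencing context guarantees the bit position is a multiple of
  // some alignment. The gcd is the strongest alignment guaranteed in ALL
  // contexts, i.e., a safe combined approximation for the shared term.
  def providedAlignment(contextAlignments: Seq[Long]): Long =
    contextAlignments.reduce(gcd)

  // The alignment unparser can be optimized out only if the combined
  // provided alignment already satisfies the required alignment.
  def canOptimizeOutAlignment(requiredBits: Long, contexts: Seq[Long]): Boolean =
    providedAlignment(contexts) % requiredBits == 0
}
```

For example, a term required to be 8-bit aligned whose contexts guarantee 8-, 16-, and 32-bit alignment needs no alignment unparser, but if even one context only guarantees 12-bit alignment, the combined guarantee drops to 4 bits and the unparser must stay.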


