You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2017/10/03 14:05:43 UTC

Re: Remove concept of slots from InfosetImpl Design Discussion

On 09/22/2017 04:25 PM, Taylor Wise wrote:
> Goal: Remove concept of slots from InfosetImpl
> 
> Reasons:
>     1. So that element/node lookup can be name based.
>     2. Ease re-enabling of Unordered Sequences.
>     3. Ease enabling of 'query-style' expressions that return more than
> one node.
> 
> An Infoset is a document (DIDocument) and the document contains exactly
> one element (DIElement) as its root.  These elements can be simple
> (DISimple) or complex (DIComplex).  Simple elements do not contain
> children, complex elements do.  DIArray represents an array (maxOccurs
>>1). A DIArray shall always be a child of DIComplex.  All of these
> inherit from DINode.
> 
> Currently DIComplex models its children as an Array of DINodes. The
> problem with the current implementation is that there's no way for us to
> just ask for an element by name without walking the structure.
> 
> The solution involves modifying DIComplex to represent its children both
> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
> insertion ordering information we need regarding the children as well as
> constant time indexing access to the child in question.  The ArrayBuffer
> also has the added benefit of growing and shrinking as objects are added
> and removed.  While the HashMap will allow us to associate a name
> (String) with DINodes that have that name thus providing a fast named
> lookup.
> 
> This named lookup can be optimized further by only inserting into the
> HashMap those elements that we know are used in expressions. These are
> the elements that would most likely benefit from a fast named lookup and
> will allow us to avoid additional overhead of hashing strings and
> allocating/appending to a sequence.
> 
> The reason for having the HashMap contain a mutable.Seq[DINode] is that
> it's possible for the following:
> 
>     <element name="a" .../>
>     <element name="a" maxOccurs="unbounded"..../>
>     <element name="b".../>
>     <element name="a" ..../>
> 
> Each element named 'a' would have an associated DINode.  Because there
> are multiple DINodes named 'a', we need to represent them in the HashMap
> using the same key 'a'.  To do this, we add all instances of 'a' to a
> Seq[DINode].  In this particular case, the HashMap would look as follows:
> 
>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>     "b" -> mutable.Seq(DIElement)
> 
> The mutable.ArrayBuffer maintains the insertion order and would look as
> follows:
> 
>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
> 
> Back tracking shall be accomplished by going to the last item in the
> DIComplex ArrayBuffer and removing it.
>     1. If the last item is a DIElement, we simply delete it from the
> ArrayBuffer and then remove it also from the HashMap.
>     2. If the last item is a DIArray, we then have to look at the
> contents of the DIArray.
>         a. If the DIArray has more than one entry, remove the last
> entry.  Then update the associated entry in the ArrayBuffer and HashMap
> with the modified DIArray.

I think the second part of this is actually unnecessary. Since the
DIArray is mutable, we only need to remove its last child. The entry in
the ArrayBuffer and HashMap will still point to the same DIArray. The
only difference is the contents of the DIArray will be one smaller.

>         b. If the DIArray has a single entry then we simply remove
> DIArray from ArrayBuffer and the HashMap.
> 
>     <element name="root>
>          <sequence ../>
>          <element name="a" .../>
>          <element name="a" maxOccurs="unbounded">
>            ...
>            <element name="b" .../>
>            <element name="c" .../>
>            ...
>          </element>
>        </sequence>
>     </element>
> 
> If the HashMap of DIComplex-root looks like the following:
>      "a" -> mutable.Seq(DIElement, DIArray)
> 
> ArrayBuffer of DIComplex-root looks like the following:
>      0 -> DIElement("a")
>      1 -> DIArray("a")
> 
> Where DIArray children ArrayBuffer looks like:
>      0 -> DIElement("b")
>      1 -> DIElement("c")
> 
> Given the initial example if we backtrack(1):  // purge one item
>      DIArray:
>          0 -> DIElement("b")  // removed one item from DIArray
> 
>      HashMap DIComplex-root:
>          "a" -> mutable.Seq(DIElement, DIArray)
> 
>      ArrayBuffer DIComplex-root:
>          0 -> DIElement("a")
>          1 -> DIArray("a")
> 
> Given the initial example if we backtrack(2): // purge two items
>      DIArray:
>          Empty
> 
>      HashMap DIComplex-root:
>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>                                         // removed and is now empty
> 
>      ArrayBuffer DIComplex-root:
>          0 -> DIElement("a")
> 
> ===
> 
> class DIComplex {
>   def nSlots = erd.nChildSlots

I think the concept of nSlots and nChildSlots should go away completely.
The children ArrayBuffer should just grow as elements are added and
shrink as they are backtracked. It might make sense to have a tunable
the determines the initial size of the children Array, but the Scala
defaults for initializing and growing a mutable array might already be
reasonable.

>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]

One thing that might make this change easier is to not worry about the
mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
really only necessary to support query style expressions, but that comes
with a whole other slew of concerns (e.g. fn:count, same name but
different types, etc.). There currently exist restrictions that do not
allow query style expressions, so this Seq should always be a size of 1
until remove those restrictions.

> 
>   def addChild(node: DINode) {
>     children += node
>     fastLookup(node.name).append(node)
>   }
> 
>   def getChild(name: String): immutable.Seq[DINode] {
>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>   }
> 
>   def backtrack(int n) {
>     for (i = 0; i < n; i++) {
>       val endIdx = children.size -1
>       val child = children.remove(endIdx) // children.pop()
> 
>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>         case None => // do nothing
>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>         case Some(nodesSeq) => {
>       val lastEntry = nodesSeq.last
>           lastEntry match {
>         case arr: DIArray => {
>               // If the array has >= 2 items, remove last one and update
> nodesSeq
>               // with this array:
>               //  val lastIdx = nodesSeq.length - 1
>           //  arr.remove(arr.size - 1)
>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
> arr))
>               //
>               // If the array has 1 item, remove the array from nodesSeq
>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>         }
>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>           }
> 
>         }
>       }
>     }
>   }
> 
> }
> 
> class DIArray {
>     private val initialSize =
> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
> 
>     def append(ie: InfosetElement): Unit = {
>         _contents.append(ie.asInstanceOf[DIElement])
>         ie.setArray(this)
>     }
> 
>     //
>     // For backtracking purposes
>     //
>     def remove(n: Int): DIElement = {
>         _contents.remove(n)
>     }
> }
> 
> ===
> 
> To facillitate a breadth-first ordering scheme we shall have to

That this is only necessary for unordered sequences, where the insertion
order in the ArrayBuffer differs from the display order, correct? If
that's true, we definitely want to make sure to only do the below
sorting when an element contains an unordred sequence.

> construct a 'global' that lives on the Element Runtime Data (ERD).  It
> shall be called childElementDeclPosition and pertains to the static
> order of the element children of a complex type element.  When sorting
> DINodes, childElementDeclPosition can be obtained by way of reaching
> from the DINode to the coresponding ERD.  Any array shall be a single
> entry in the sequence that needs to be sorted.  All elements in this
> array will have the same childElementDeclPosition.
> 
> The difficulty in implementing this is that an Element's
> dsom.ComplexType can have groups within the groups before finding actual
> element children.  Therefore in the Document Schema Object Model (DSOM)
> the model group object shall have a startingChildElementDeclPosition and
> an endingChildElementDeclPosition.
> 
> 1. The startingChildElementPosition shall be passed to the constructor
> for the Term.
> 2. If the term is an element, the endingChildElementDeclPosition is
> equal to the startingChildElementDeclPosition.  3. If the term is a
> model group, then its first child term's
> startingChildElementDeclPosition is equal to its own
> startingChildElementDeclPosition, the 2nd to Nth child term's
> startingChildElementDeclPosition is the prior term's
> endingChildElementDeclPosition + 1.
> 4. The endingChildElementDeclPosition for a model-group shall be the
> endingChildElmentDeclPosition of its last child term.
> 

Looks good to me. +1. This should definitely make implementation of
unordered sequences and query style expressions easier. Removal of the
slot concept should also help remove concepts that prevent faster schema
compilation.

- Steve

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

After seeing a performance degredation with this recent change, I did
some profiling with PCAP to investigate the causes. I immediately saw a
lot of HashMap related allocations and insertions. I quickly hacked
together something to only create and insert into the HashMap if it was
used in expressions to see how it affected performance.

PCAP first, in a single parse, PCAP calls addChild about 1600 times, and
about 1000 of those times added a child that was used in an expression.
So this is pretty close to a worst case, where almost 2/3's of elements
need to be inserted into the HashMap. Fixing a couple other minor things
that showed up in profiling that I doubt had much of an affect, the
performance numbers I saw were:

With Slots:
   688.387 files/sec

Without Slots (unoptimized):
   553.145 files/sec

Without Slots (optimized):
   614.983 files/sec

So unoptimized is about 20% slower than slots, and optimized is about
10% slower than slots.


Looking at ATO, it calls addChild about 2600 times, with no elements
used in an expression, so HashMap allocations/lookups should be zero in
the optimized case. Performance numbers with this are:

With Slots:
   68.271 files/sec

Without Slots (unoptimized):
   64.623 files/sec

Without Slots (optimized):
   69.662 files/sec

So in this case, without slots/optimized is actually a hair faster than
with slots. Not sure why that would be the case, maybe there were lots
of empty slots in ATO taking up space? Regardless, this seems like a
good win.

We might be able to pick up that extra 10% in PCAP by interning QNames
to speed up HashMap lookups. Though, I didn't see anything in profiling
related to equality checks in the hash look ups, but sometimes
things are hard to see.

I've created DAFFODIL-1860 to track the issue for reducing HashMap
allocations and DAFFODIL-1861 to track the QName interning.

- Steve


On 10/23/2017 01:25 PM, Mike Beckerle wrote:
> No I think 2.1.0 as we don't understand the performance implications fully yet.
> 
> 
> This should fix bug....
> 
> 
> https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773
> 
> [DFDL-1773] Choice ambiguous element name results in ... 
> <https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773>
> opensource.ncsa.illinois.edu
> Daffodil; DFDL-1773; Choice ambiguous element name results in failed expression
> 
> 
> --------------------------------------------------------------------------------
> *From:* Taylor Wise
> *Sent:* Monday, October 23, 2017 12:43:36 PM
> *To:* Mike Beckerle; Steve Lawrence; dev@daffodil.apache.org
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> Should this be pushed to 2.0.0?
> 
> --------------------------------------------------------------------------------
> *From:* Mike Beckerle
> *Sent:* Monday, October 23, 2017 12:36:57 PM
> *To:* Taylor Wise; Steve Lawrence; dev@daffodil.apache.org
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> Yes. The one giant ticket about the space problem compiling (DFDL-1444) is not 
> specific enough.
> 
> --------------------------------------------------------------------------------
> *From:* Taylor Wise
> *Sent:* Monday, October 23, 2017 12:28:04 PM
> *To:* Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> I don't believe there's a ticket associated with these changes.  Should I go 
> ahead and create one?
> 
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Wednesday, October 18, 2017 8:21:02 AM
> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> On 10/17/2017 07:57 PM, Taylor Wise wrote:
>> Been debugging after implementing your suggested changes.
>> 
>> 
>> First thing to note, is that by moving the creation of the array out of 
>> getChildArray that I had to make additional calls to 'addChild' in the 
>> InfosetInputter.  This corrected things on the 'parser' side.  However, on the 
>> unparser side it became evident that this 'addChild' function cannot always set
>> the parent.  There's an invariant expecting parent to be null along the 
>> execution branch of the unparser side.  As a result, I modified addChild to also
>> accept a default boolean called 'setParent'.  It defaults to true unless 
>> specified.  So now when we call 'addChild' in InfosetInputter on the unparser 
>> side, we actually call 'addChild(e, false)'.
> 
> This sounds to me like addChild is getting called twice on the same
> element and so it's parent is trying to be set twice. Might need some
> more investigation. Maybe the InfosetInputter needs more tweaks to
> support not using getChildArray? Hard to say without seeing your
> InfosetInputter changes.
> 
> Regarding the below tests, I think your assessment is correct. The
> password element doesn't exist and so we should never even try to get
> the -1 index. Instead we should get something like "password element
> does not exist" or something. It would be useful to have that test to
> ensure we get that error message.
> 
> An index of -1 (or anything less than 1) can definitely be caught
> statically at schema compile time. Worth creating a bug for that. But
> there will be times where the index is calculated so we'll still need
> this runtime check. I think it's still a test worth having.
> 
> I think you're modifications are reasonable based on what the tests are
> trying to tests. Probably to proper coverage of tests we need is:
> 
> - trying to access an array that does not exist with both valid and
> invalid indicies (less than 1 and greater than array size)
> - trying to access an array thta does exist but with invalid indicies
> (less than 1 and greater than array size)
> 
> So it's probably worth keeping the existing tests for the 'array does
> not exist' tests and ensure we get a reasonable error message not
> talking about indices. And then just make your modifications new tests.
> 
>> 
>> The above correction reduced the errors from 30+ to just 2.  The only two errors
>> remaining have to do with an array index out of bounds error that's expected.  I
>> believe that it's actually an issue with the test and not the code.
>> 
>> 
>> Error:
>> 
>> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message 
>> "Value -1 is out of range" in any of the actual diagnostic messages:
>> Runtime Schema Definition Error: Expression Evaluation Error: Child element 
>> {http://example.com}password does not exist.
>> Schema context: hideme Location line 267 column 12 in 
>> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
>> Data location was preceding byte 0 limit(bytes) 8
>> 
>> Well, password doesn't exist because we are missing the 'pw:' initiator in the 
>> tdml:document content.  The code now doesn't know to create it because we 
>> haven't seen the initiator.  At least that's what I surmise.
>> 
>> 
>> Existing test:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_02
>>          Schema: hiddenElem
>>            Root: e1b
>>         Purpose: This test demonstrates that a proper error occurs if an array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value -1 is out of range</tdml:error>
>>          <tdml:error>with length 0</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> Schema:
>>      <xs:group name="hiddenData">
>>        <xs:sequence dfdl:separator="|">
>>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:" 
>> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>>        </xs:sequence>
>>      </xs:group>
>> 
>>      <xs:element name="e1">
>>        <xs:complexType>
>>          <xs:sequence>
>>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
>> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>>          </xs:sequence>
>>        </xs:complexType>
>>      </xs:element>
>>      <xs:element name="e1b">
>>        <xs:complexType>
>>          <xs:sequence>
>>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
>> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>>          </xs:sequence>
>>        </xs:complexType>
>>      </xs:element>
>> 
>> 
>> My modification:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_02
>>          Schema: hiddenElem
>>            Root: e1b
>>         Purpose: This test demonstrates that a proper error occurs if an array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>pw:p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value -1 is out of range</tdml:error>
>>          <tdml:error>with length 1</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> But honestly, should we really need to do this?  Shouldn't the code know that -1
>> is not a valid index without having to force construction of the password array?
>> 
>> I'm similarly going to have to modify this other test as well by constructing a
>> different root that attempts to access a character in a position greater than 
>> the length of the array, unless there's something I'm missing:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_01
>>          Schema: hiddenElem
>>            Root: e1
>>         Purpose: This test demonstrates that a proper error occurs if an array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value 1 is out of range</tdml:error>
>>          <tdml:error>with length 0</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> Proposed addition to defineSchema:
>> 
>>      <xs:element name="e1c">
>>        <xs:complexType>
>>          <xs:sequence>
>>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
>> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>>          </xs:sequence>
>>        </xs:complexType>
>>      </xs:element>
>> 
>> Proposed modification to test:
>> 
>>    <!--
>>       Test Name: arrayIndexOutOfBounds_01
>>          Schema: hiddenElem
>>            Root: e1
>>         Purpose: This test demonstrates that a proper error occurs if an array 
>> index in an expression is out of bounds
>>    -->
>> 
>>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>> 
>>      <tdml:document>pw:p455w0rd</tdml:document>
>>        <tdml:errors>
>>          <tdml:error>Schema Definition Error</tdml:error>
>>          <tdml:error>expression evaluation error</tdml:error>
>>          <tdml:error>Value 10 is out of range</tdml:error>
>>          <tdml:error>with length 1</tdml:error>
>>        </tdml:errors>
>>    </tdml:parserTestCase>
>> 
>> 
>> 
>> 
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
>> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> I don't think a DIArray could be created earlier and things appended to
>> it. But, currently, I think an empty DIArray could be created earlier.
>> 
>> And I think this is really just a problem with getChildArray always
>> creating DIArrays. What I suspect happened is that we had an array
>> following by something that accessed that array, e.g.:
>> 
>> <xs:sequence>
>>    <xs:element name="arr" maxOccurs="unbounded" ... />
>>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
>> </xs:sequence>
>> 
>> arr was slot 1, count was slot 2.
>> 
>> The setChildArray function used to just do something like
>> 
>>    _lastSlotAdded = slot
>> 
>> Which worked in the case of adding a new array--we add a new array, so
>> the slot it was added in was the last slot added.
>> 
>> But when fn:count was called, _lastSlotAdded was 2 (from the count
>> element). fn:count ended up calling getChildArray, which created a
>> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
>> slot of the array, i.e. 1. So even though the array was finished and
>> couldn't be modified, the use of getChildArray in DPath ended up setting
>> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
>> couldn't set it to an earlier value.
>> 
>> So, now, because getChildArray create elements, it sounds like things
>> could be created out of order. Rather than inserting an empty DIArray
>> into a slot, it will added it to the end and potentially mess things up.
>> This says to me we definitely want to modify getChildArray to not create
>> DIArrays. It should either return the array if it exists, or throw an
>> Infoset exception, very similar to getChild. Then we don't have to worry
>> about DIArray's being created at the wrong time. The only thing that
>> should create a DIArray is addChild.
>> 
>> 
>> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>>> Hmm, very interesting.  This would greatly simplify things.  But what about this
>>> code snippet from before my changes:
>>> 
>>> 
>>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>>    //    Assert.invariant(_slots(slot) eq null)
>>>    //    _slots(slot) = arr
>>>    //    //
>>>    //    // Because arrays are created if not existing, an expression like ../A[1]
>>>    //    // can cause array A to come into existence (though empty) and A might be
>>>    //    // part of an expression that is reaching backward into earlier data (when
>>>    //    // parsing) so that we might be creating an array earlier in the slots
>>>    //    // of this element. Hence, the slot being passed here is not necessarily
>>>    //    // the last slot added. It might be before it.
>>>    //    //
>>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>>    //  }
>>> Given your statement:
>>> 
>>> 
>>> "the issue is determining if the DIArray exists or not. Everything
>>> after that is the same. And since children are inserted in the order
>>> that we see them, a DIArray will always be at the end of the children
>>> ArrayBuffer if it exists. If the DINode at the end of the children
>>> ArrayBuffer is not this array, it needs to be created."
>>> 
>>> 
>>> To me it seems to imply that a DIArray can come into existence without any 
>>> contents.  But then be appended to at a later time.  In which case it's possible
>>> that this DIArray then is not the last item appended.  It could be somewhere 
>>> earlier and then need to be added to.
>>> 
>>> --------------------------------------------------------------------------------
>>> *From:* Steve Lawrence <sl...@apache.org>
>>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>>> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>> Actually, I think we're over complicating the array situation. I think
>>> we can avoid needing a unique ID and I think we can also avoid needing
>>> to always insert arrays into a HashMap.
>>> 
>>> When addChild is called to add an DIElement to an array, there are two
>>> cases we need to think about:
>>> 
>>> 1) The DIArray does not exist and it must be created. The
>>> DIElement is appended to its contents.
>>> 
>>> 2) The DIArray does exist and we just need to find it. The
>>> DIElement is appended to its contents.
>>> 
>>> So the issue is determining if the DIArray exists or not. Everything
>>> after that is the same. And since children are inserted in the order
>>> that we see them, a DIArray will always be at the end of the children
>>> ArrayBuffer if it exists. If the DINode at the end of the children
>>> ArrayBuffer is not this array, it needs to be created. So addChild
>>> just becomes something like this:
>>> 
>>>    def addChild(e: DIElement) {
>>>      if (e.erd.isArray) {
>>>        if (children.isEmpty || children.last.erd != e.erd) {
>>>          // no children, or last child is not a DIArray for
>>>          // this element, create the DIArray
>>>          children.append(new DIArray(...))
>>>        }
>>>        // the array is now always last, add the new child to it
>>>        children.last.append(e)
>>>      } else {
>>>        children.append(e)
>>>      }
>>>    }
>>> 
>>> So we no longer need a unique ID or a tuple for the HashMap key. The
>>> fast HashMap lookup only needs to work on the name of the element, which
>>> allows query-style expressions to work in the future. And we now only
>>> need to insert DIArray's into the fast HashMap if they are used in
>>> expressions addChild no longer needs getChildArray, minimizing hash
>>> look ups.
>>> 
>>> Related, I'm wondering if getChildArray should change as well so that it
>>> does not create a DIArray if it doesn't exist? I think it seems
>>> reasonable that only addChild should create new DIArrays. And if
>>> something like DPath calls getChildArray, we probably don't want to end
>>> up creating an empty DIArray if it doesn't exist. It should probably
>>> just throw an InfosetNoSuchChildElementException if the array doesn't
>>> exist, just like getChild.
>>> 
>>> - Steve
>>> 
>>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>>> The ID might cause some issues when we want to support query-style
>>>> expressions. For example, say we have a schema like this:
>>>> 
>>>> <sequence>
>>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>>   <xs:element name="element" ... />
>>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>> </xs:sequence>
>>>> 
>>>> So two arrays with the same name separated by a single element. Each
>>>> element would have a unique ID, so would be inserted in the hash table
>>>> as something like
>>>> 
>>>> ("array", 1) -> Seq(DIArray)
>>>> ("element", 2) -> Seq(DISimple)
>>>> ("array", 3) -> Seq(DIArray)
>>>> 
>>>> A query style expression that accesses the "array" elements (e.g.
>>>> fn:count(./array)) would have trouble finding both arrays to get the
>>>> full count since the keys for the two arrays are the same.
>>>> 
>>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>>> up with something like this:
>>>> 
>>>> "array" -> Seq(DIArray, DIArray)
>>>> "element" -> Seq(DISimple)
>>>> 
>>>> And then just iterator of the Seq until we find DIArray with the right
>>>> ID? That's could have a performance hit if there were lots of things
>>>> with the same name, but I suspect that won't be super common.
>>>> 
>>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>>> when an DIArray is created, it's store in the PState for quick access,
>>>> and getChildArray would essentially go away.
>>>> 
>>>> - Steve
>>>> 
>>>> 
>>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>>>
>>>>>
>>>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>>>
>>>>>
>>>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>>>
>>>>>
>>>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>>>
>>>>> ________________________________
>>>>> From: Taylor Wise
>>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>>
>>>>> I'm down to a single error: test_dfdlPosition3
>>>>>
>>>>>
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>>> Actual
>>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>>> Differences were (path, expected, actual):
>>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>     <xs:element name="f">
>>>>>       <xs:complexType>
>>>>>         <xs:sequence>
>>>>>           <xs:element name="cnt1" type="xs:int"
>>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>>>             <xs:complexType>
>>>>>               <xs:sequence>
>>>>>                 <xs:element name="v" type="xs:int"
>>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>>               </xs:sequence>
>>>>>             </xs:complexType>
>>>>>           </xs:element>
>>>>>           <xs:element name="cnt2" type="xs:int"
>>>>>             dfdl:lengthKind="delimited" />
>>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>>>             <xs:complexType>
>>>>>               <xs:sequence>
>>>>>                 <xs:element name="v" type="xs:int"
>>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>>               </xs:sequence>
>>>>>             </xs:complexType>
>>>>>           </xs:element>
>>>>>         </xs:sequence>
>>>>>       </xs:complexType>
>>>>>     </xs:element>
>>>>>
>>>>>
>>>>>
>>>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>>>
>>>>> ChildNodes aka Insertion Order
>>>>> ====
>>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>>>
>>>>> FastLookup
>>>>> ====
>>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>>>
>>>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion  order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
>>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>>
>>>>> How do we fix this?
>>>>>
>>>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this  shouldn't matter.  We shouldn't run into two or more members with the same
>>> name and namespace.  Correct?
>>>>>
>>>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>>
>>>>>>
>>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>>>> z
>>>>>>
>>>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>>>
>>>>>>
>>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>>     c.restoreFrom(this)
>>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>>     super.restoreInto(e)
>>>>>>   }
>>>>>>
>>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>>   // Shouldn't I be able to just say something like...
>>>>>>  //
>>>>>>  childNodes.clear()
>>>>>>  childNodes.append(cs.childNodes)
>>>>>>  fastLookup.clear()
>>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>>        fastLookup.put(name, seqDINodes) )
>>>>>> }
>>>>>
>>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>>> remove the elements that are being backtracked.
>>>>>
>>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>>> something like this (very similar to the existing restoreFrom):
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   var i = childNodes.length - 1
>>>>>
>>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>>   // this should always be the last element in the hashmap seq
>>>>>   while (i >= cs._numChildren) {
>>>>>     var childToRemove = childNodes(i)
>>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>>     if (fastSeq.length == 1)
>>>>>       // last one, remove the whole key entry
>>>>>       fastLookup.remove(childToRemove.name)
>>>>>     else
>>>>>       // not the last one, just drop the end
>>>>>       fastSeq.dropRight(1)
>>>>>     i -= 1
>>>>>   }
>>>>>
>>>>>   // now just quickly remove all those children
>>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>>   _numChildren = cs._numChildren
>>>>>
>>>>>   if (cs._arraySize.isDefined) {
>>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>>   }
>>>>> }
>>>>>
>>>>> The logic here is very similar to the CURRENT code, just is a little
>>>>> more simple since it can make certain assumptions. Like removing
>>>>> something form the fastSeq just means removing the last one
>>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>>> DIArrays to simplify that logic.
>>>>>
>>>>> Note that we might want to rethink about the type fastLookup value.
>>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>>
>>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> // ORIGINAL
>>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>>   //
>>>>>>   //    var i = _lastSlotAdded
>>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>>   //      _slots(i) = null
>>>>>>   //      i -= 1
>>>>>>   //    }
>>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>>   //
>>>>>>   //    // check invariant. All slots after last slot added
>>>>>>   //    // should be null.
>>>>>>   //    {
>>>>>>   //      var i = _lastSlotAdded + 1
>>>>>>   //      while (i < _slots.length) {
>>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>>   //        i = i + 1
>>>>>>   //      }
>>>>>>   //    }
>>>>>>   //
>>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>>   //    }
>>>>>>   //  }
>>>>>>
>>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>>>> //
>>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>>
>>>>>>     var i = _lastSlotAdded
>>>>>>     while (i > cs._lastSlotAdded) {
>>>>>>       val lastNode = childNodes.last
>>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>>
>>>>>>       lastNode match {
>>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>>           // No need to delete the array, just the last DINode in it
>>>>>>           arr.pop()
>>>>>>         }
>>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>>           // Last DINode in array, remove the array entirely
>>>>>>           //
>>>>>>           arr.pop()
>>>>>>           //
>>>>>>           // Update insertion order list
>>>>>>           //
>>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>>           //
>>>>>>           // Update Fast Lookup
>>>>>>           //
>>>>>>           val seq = fastLookup(name)
>>>>>>           if (seq.length <= 1) {
>>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>>             //
>>>>>>             fastLookup.remove(name)
>>>>>>           } else {
>>>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>>>             //
>>>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>>>             val result = seq diff Seq(lastNode)
>>>>>>             fastLookup.update(name, result)
>>>>>>           }
>>>>>>         }
>>>>>>         case _ => {
>>>>>>           //
>>>>>>           // Update insertion order list
>>>>>>           //
>>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>>           //
>>>>>>           // Update Fast Lookup
>>>>>>           //
>>>>>>           val seq = fastLookup(name)
>>>>>>           if (seq.length <= 1) {
>>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>>             //
>>>>>>             fastLookup.remove(name)
>>>>>>           } else {
>>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>>             //
>>>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>>>             val result = seq diff Seq(lastNode)
>>>>>>             fastLookup.update(name, result)
>>>>>>           }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>>   }
>>>>>>
>>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>>>> java.lang.Exception:
>>>>>> Comparison failed.
>>>>>> Expected
>>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>>> Actual
>>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>>> Differences were (path, expected, actual):
>>>>>>  (hb_root01,List(),List(<text/>))
>>>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no  way to avoid the hash per each append.
>>>>>>
>>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>>> that are used in expressions. We came up with that optimization so that
>>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>>> much this will affect performance, but there are ways to avoid the need
>>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>>
>>>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them.  But I would not make this change as yet. First things first. Don't optimize
>>> without profiling info, etc.
>>>>>>>
>>>>>>>
>>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>>>
>>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>>> where array starts and ends. From there, we can support any model that
>>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>>> get it to a different model, but other things are more complicated.
>>>>>>
>>>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This  was the original intent for DIArray, but was punted to get it working sooner.)
>>> The opportunity to do this is entirely lost if you don't have DIArray, but just
>>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>>
>>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>>> hashmap.
>>>>>>
>>>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity  of this trim behavior and the underlying representation it requires.
>>>>>>
>>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>>> trimming of unused Infoset elements is going to have some complex
>>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>>> we would still know elements are part of an array, it just takes a
>>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>>
>>>>>>
>>>>>>
>>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>>> around, but it might be worth thinking a little bit about what changes
>>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>>> for large arrays, which are not uncommon.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>>
>>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>>
>>>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>>>> every time we want to append to a DIArray?
>>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>>> limited to just DPath.
>>>>>>>
>>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>>> still need to know if a DIElement is part of an array for the
>>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>>
>>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>>> though.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Current Implementation:
>>>>>>>>
>>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>>
>>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>>
>>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>>
>>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>>
>>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>>
>>>>>>>>      arr.append(e)
>>>>>>>>
>>>>>>>>    } else {
>>>>>>>>
>>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>>
>>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>>
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    e.setParent(this)
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>>>      Assert.usage(info.isArray)
>>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>>      getChildArray(slot)
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>>      val slotVal = _slots(slot)
>>>>>>>>      if (slotVal ne null)
>>>>>>>>        slotVal match {
>>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>>        }
>>>>>>>>      else {
>>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>>        // create one (it will have zero entries)
>>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>>        setChildArray(slot, ia)
>>>>>>>>        ia
>>>>>>>>      }
>>>>>>>>    }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>>
>>>>>>>>> Reasons:
>>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>>>> one node.
>>>>>>>>>
>>>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>>> inherit from DINode.
>>>>>>>>>
>>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>>
>>>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>>> lookup.
>>>>>>>>>
>>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>>> allocating/appending to a sequence.
>>>>>>>>>
>>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>>>> it's possible for the following:
>>>>>>>>>
>>>>>>>>>     <element name="a" .../>
>>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>>     <element name="b".../>
>>>>>>>>>     <element name="a" ..../>
>>>>>>>>>
>>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>>>
>>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>>
>>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>>>> follows:
>>>>>>>>>
>>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>>
>>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>>> contents of the DIArray.
>>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>>>> with the modified DIArray.
>>>>>>>>
>>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>>
>>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>>
>>>>>>>>>     <element name="root>
>>>>>>>>>          <sequence ../>
>>>>>>>>>          <element name="a" .../>
>>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>>            ...
>>>>>>>>>            <element name="b" .../>
>>>>>>>>>            <element name="c" .../>
>>>>>>>>>            ...
>>>>>>>>>          </element>
>>>>>>>>>        </sequence>
>>>>>>>>>     </element>
>>>>>>>>>
>>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>>
>>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>>      0 -> DIElement("a")
>>>>>>>>>      1 -> DIArray("a")
>>>>>>>>>
>>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>>      0 -> DIElement("b")
>>>>>>>>>      1 -> DIElement("c")
>>>>>>>>>
>>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>>      DIArray:
>>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>>
>>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>>
>>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>>          0 -> DIElement("a")
>>>>>>>>>          1 -> DIArray("a")
>>>>>>>>>
>>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>>      DIArray:
>>>>>>>>>          Empty
>>>>>>>>>
>>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>>>                                         // removed and is now empty
>>>>>>>>>
>>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>>          0 -> DIElement("a")
>>>>>>>>>
>>>>>>>>> ===
>>>>>>>>>
>>>>>>>>> class DIComplex {
>>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>>
>>>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>>> reasonable.
>>>>>>>>
>>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>>
>>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>>> really only necessary to support query style expressions, but that comes
>>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>>> until remove those restrictions.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>>     children += node
>>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   def backtrack(int n) {
>>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>>       val endIdx = children.size -1
>>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>>
>>>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>>>         case None => // do nothing
>>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>>           lastEntry match {
>>>>>>>>>         case arr: DIArray => {
>>>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>>>> nodesSeq
>>>>>>>>>               // with this array:
>>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>>>> arr))
>>>>>>>>>               //
>>>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>>         }
>>>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>>           }
>>>>>>>>>
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> class DIArray {
>>>>>>>>>     private val initialSize =
>>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>>>
>>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>>         ie.setArray(this)
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     //
>>>>>>>>>     // For backtracking purposes
>>>>>>>>>     //
>>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>>         _contents.remove(n)
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> ===
>>>>>>>>>
>>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>>
>>>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>>
>>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>>
>>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>>
>>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>>> for the Term.
>>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>>> model group, then its first child term's
>>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>>>> compilation.
>>>>>>>>
>>>>>>>> - Steve
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> 
>>> 
>> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Mike Beckerle <mb...@tresys.com>.

No I think 2.1.0 as we don't understand the performance implications fully yet.


This should fix bug....


https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773

[DFDL-1773] Choice ambiguous element name results in ...<https://opensource.ncsa.illinois.edu/jira/browse/DFDL-1773>
opensource.ncsa.illinois.edu
Daffodil; DFDL-1773; Choice ambiguous element name results in failed expression



________________________________
From: Taylor Wise
Sent: Monday, October 23, 2017 12:43:36 PM
To: Mike Beckerle; Steve Lawrence; dev@daffodil.apache.org
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


Should this be pushed to 2.0.0?

________________________________
From: Mike Beckerle
Sent: Monday, October 23, 2017 12:36:57 PM
To: Taylor Wise; Steve Lawrence; dev@daffodil.apache.org
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


Yes. The one giant ticket about the space problem compiling (DFDL-1444) is not specific enough.

________________________________
From: Taylor Wise
Sent: Monday, October 23, 2017 12:28:04 PM
To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


I don't believe there's a ticket associated with these changes.  Should I go ahead and create one?

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 18, 2017 8:21:02 AM
To: Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/17/2017 07:57 PM, Taylor Wise wrote:
> Been debugging after implementing your suggested changes.
>
>
> First thing to note, is that by moving the creation of the array out of
> getChildArray that I had to make additional calls to 'addChild' in the
> InfosetInputter.  This corrected things on the 'parser' side.  However, on the
> unparser side it became evident that this 'addChild' function cannot always set
> the parent.  There's an invariant expecting parent to be null along the
> execution branch of the unparser side.  As a result, I modified addChild to also
> accept a default boolean called 'setParent'.  It defaults to true unless
> specified.  So now when we call 'addChild' in InfosetInputter on the unparser
> side, we actually call 'addChild(e, false)'.

This sounds to me like addChild is getting called twice on the same
element and so it's parent is trying to be set twice. Might need some
more investigation. Maybe the InfosetInputter needs more tweaks to
support not using getChildArray? Hard to say without seeing your
InfosetInputter changes.

Regarding the below tests, I think your assessment is correct. The
password element doesn't exist and so we should never even try to get
the -1 index. Instead we should get something like "password element
does not exist" or something. It would be useful to have that test to
ensure we get that error message.

An index of -1 (or anything less than 1) can definitely be caught
statically at schema compile time. Worth creating a bug for that. But
there will be times where the index is calculated so we'll still need
this runtime check. I think it's still a test worth having.

I think you're modifications are reasonable based on what the tests are
trying to tests. Probably to proper coverage of tests we need is:

- trying to access an array that does not exist with both valid and
invalid indicies (less than 1 and greater than array size)
- trying to access an array thta does exist but with invalid indicies
(less than 1 and greater than array size)

So it's probably worth keeping the existing tests for the 'array does
not exist' tests and ensure we get a reasonable error message not
talking about indices. And then just make your modifications new tests.

>
> The above correction reduced the errors from 30+ to just 2.  The only two errors
> remaining have to do with an array index out of bounds error that's expected.  I
> believe that it's actually an issue with the test and not the code.
>
>
> Error:
>
> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message
> "Value -1 is out of range" in any of the actual diagnostic messages:
> Runtime Schema Definition Error: Expression Evaluation Error: Child element
> {http://example.com}password does not exist.
> Schema context: hideme Location line 267 column 12 in
> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
> Data location was preceding byte 0 limit(bytes) 8
>
> Well, password doesn't exist because we are missing the 'pw:' initiator in the
> tdml:document content.  The code now doesn't know to create it because we
> haven't seen the initiator.  At least that's what I surmise.
>
>
> Existing test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Schema:
>      <xs:group name="hiddenData">
>        <xs:sequence dfdl:separator="|">
>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:"
> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>        </xs:sequence>
>      </xs:group>
>
>      <xs:element name="e1">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>      <xs:element name="e1b">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
>
> My modification:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> But honestly, should we really need to do this?  Shouldn't the code know that -1
> is not a valid index without having to force construction of the password array?
>
> I'm similarly going to have to modify this other test as well by constructing a
> different root that attempts to access a character in a position greater than
> the length of the array, unless there's something I'm missing:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Proposed addition to defineSchema:
>
>      <xs:element name="e1c">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
> Proposed modification to test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 10 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
>
>
>
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> I don't think a DIArray could be created earlier and things appended to
> it. But, currently, I think an empty DIArray could be created earlier.
>
> And I think this is really just a problem with getChildArray always
> creating DIArrays. What I suspect happened is that we had an array
> following by something that accessed that array, e.g.:
>
> <xs:sequence>
>    <xs:element name="arr" maxOccurs="unbounded" ... />
>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
> </xs:sequence>
>
> arr was slot 1, count was slot 2.
>
> The setChildArray function used to just do something like
>
>    _lastSlotAdded = slot
>
> Which worked in the case of adding a new array--we add a new array, so
> the slot it was added in was the last slot added.
>
> But when fn:count was called, _lastSlotAdded was 2 (from the count
> element). fn:count ended up calling getChildArray, which created a
> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
> slot of the array, i.e. 1. So even though the array was finished and
> couldn't be modified, the use of getChildArray in DPath ended up setting
> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
> couldn't set it to an earlier value.
>
> So, now, because getChildArray create elements, it sounds like things
> could be created out of order. Rather than inserting an empty DIArray
> into a slot, it will added it to the end and potentially mess things up.
> This says to me we definitely want to modify getChildArray to not create
> DIArrays. It should either return the array if it exists, or throw an
> Infoset exception, very similar to getChild. Then we don't have to worry
> about DIArray's being created at the wrong time. The only thing that
> should create a DIArray is addChild.
>
>
> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>> Hmm, very interesting.  This would greatly simplify things.  But what about this
>> code snippet from before my changes:
>>
>>
>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>    //    Assert.invariant(_slots(slot) eq null)
>>    //    _slots(slot) = arr
>>    //    //
>>    //    // Because arrays are created if not existing, an expression like ../A[1]
>>    //    // can cause array A to come into existence (though empty) and A might be
>>    //    // part of an expression that is reaching backward into earlier data (when
>>    //    // parsing) so that we might be creating an array earlier in the slots
>>    //    // of this element. Hence, the slot being passed here is not necessarily
>>    //    // the last slot added. It might be before it.
>>    //    //
>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>    //  }
>> Given your statement:
>>
>>
>> "the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created."
>>
>>
>> To me it seems to imply that a DIArray can come into existence without any
>> contents.  But then be appended to at a later time.  In which case it's possible
>> that this DIArray then is not the last item appended.  It could be somewhere
>> earlier and then need to be added to.
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> Actually, I think we're over complicating the array situation. I think
>> we can avoid needing a unique ID and I think we can also avoid needing
>> to always insert arrays into a HashMap.
>>
>> When addChild is called to add an DIElement to an array, there are two
>> cases we need to think about:
>>
>> 1) The DIArray does not exist and it must be created. The
>> DIElement is appended to its contents.
>>
>> 2) The DIArray does exist and we just need to find it. The
>> DIElement is appended to its contents.
>>
>> So the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created. So addChild
>> just becomes something like this:
>>
>>    def addChild(e: DIElement) {
>>      if (e.erd.isArray) {
>>        if (children.isEmpty || children.last.erd != e.erd) {
>>          // no children, or last child is not a DIArray for
>>          // this element, create the DIArray
>>          children.append(new DIArray(...))
>>        }
>>        // the array is now always last, add the new child to it
>>        children.last.append(e)
>>      } else {
>>        children.append(e)
>>      }
>>    }
>>
>> So we no longer need a unique ID or a tuple for the HashMap key. The
>> fast HashMap lookup only needs to work on the name of the element, which
>> allows query-style expressions to work in the future. And we now only
>> need to insert DIArray's into the fast HashMap if they are used in
>> expressions addChild no longer needs getChildArray, minimizing hash
>> look ups.
>>
>> Related, I'm wondering if getChildArray should change as well so that it
>> does not create a DIArray if it doesn't exist? I think it seems
>> reasonable that only addChild should create new DIArrays. And if
>> something like DPath calls getChildArray, we probably don't want to end
>> up creating an empty DIArray if it doesn't exist. It should probably
>> just throw an InfosetNoSuchChildElementException if the array doesn't
>> exist, just like getChild.
>>
>> - Steve
>>
>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>> The ID might cause some issues when we want to support query-style
>>> expressions. For example, say we have a schema like this:
>>>
>>> <sequence>
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>   <xs:element name="element" ... />
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>> </xs:sequence>
>>>
>>> So two arrays with the same name separated by a single element. Each
>>> element would have a unique ID, so would be inserted in the hash table
>>> as something like
>>>
>>> ("array", 1) -> Seq(DIArray)
>>> ("element", 2) -> Seq(DISimple)
>>> ("array", 3) -> Seq(DIArray)
>>>
>>> A query style expression that accesses the "array" elements (e.g.
>>> fn:count(./array)) would have trouble finding both arrays to get the
>>> full count since the keys for the two arrays are the same.
>>>
>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>> up with something like this:
>>>
>>> "array" -> Seq(DIArray, DIArray)
>>> "element" -> Seq(DISimple)
>>>
>>> And then just iterator of the Seq until we find DIArray with the right
>>> ID? That's could have a performance hit if there were lots of things
>>> with the same name, but I suspect that won't be super common.
>>>
>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>> when an DIArray is created, it's store in the PState for quick access,
>>> and getChildArray would essentially go away.
>>>
>>> - Steve
>>>
>>>
>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>>
>>>>
>>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>>
>>>>
>>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>>
>>>>
>>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>>
>>>> ________________________________
>>>> From: Taylor Wise
>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>>
>>>> I'm down to a single error: test_dfdlPosition3
>>>>
>>>>
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>> Actual
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>> Differences were (path, expected, actual):
>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>     <xs:element name="f">
>>>>       <xs:complexType>
>>>>         <xs:sequence>
>>>>           <xs:element name="cnt1" type="xs:int"
>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>           <xs:element name="cnt2" type="xs:int"
>>>>             dfdl:lengthKind="delimited" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>         </xs:sequence>
>>>>       </xs:complexType>
>>>>     </xs:element>
>>>>
>>>>
>>>>
>>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>>
>>>> ChildNodes aka Insertion Order
>>>> ====
>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>>
>>>> FastLookup
>>>> ====
>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>>
>>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>
>>>> How do we fix this?
>>>>
>>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences,  this  shouldn't matter.  We shouldn't run into two or more members with the same
>> name and namespace.  Correct?
>>>>
>>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>
>>>>>
>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>>> z
>>>>>
>>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>>
>>>>>
>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>     c.restoreFrom(this)
>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>     super.restoreInto(e)
>>>>>   }
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>   // Shouldn't I be able to just say something like...
>>>>>  //
>>>>>  childNodes.clear()
>>>>>  childNodes.append(cs.childNodes)
>>>>>  fastLookup.clear()
>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>        fastLookup.put(name, seqDINodes) )
>>>>> }
>>>>
>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>> remove the elements that are being backtracked.
>>>>
>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>> something like this (very similar to the existing restoreFrom):
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   var i = childNodes.length - 1
>>>>
>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>   // this should always be the last element in the hashmap seq
>>>>   while (i >= cs._numChildren) {
>>>>     var childToRemove = childNodes(i)
>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>     if (fastSeq.length == 1)
>>>>       // last one, remove the whole key entry
>>>>       fastLookup.remove(childToRemove.name)
>>>>     else
>>>>       // not the last one, just drop the end
>>>>       fastSeq.dropRight(1)
>>>>     i -= 1
>>>>   }
>>>>
>>>>   // now just quickly remove all those children
>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>   _numChildren = cs._numChildren
>>>>
>>>>   if (cs._arraySize.isDefined) {
>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>   }
>>>> }
>>>>
>>>> The logic here is very similar to the CURRENT code, just is a little
>>>> more simple since it can make certain assumptions. Like removing
>>>> something form the fastSeq just means removing the last one
>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>> DIArrays to simplify that logic.
>>>>
>>>> Note that we might want to rethink about the type fastLookup value.
>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>
>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>
>>>>
>>>>
>>>>
>>>>> // ORIGINAL
>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>   //
>>>>>   //    var i = _lastSlotAdded
>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>   //      _slots(i) = null
>>>>>   //      i -= 1
>>>>>   //    }
>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>   //
>>>>>   //    // check invariant. All slots after last slot added
>>>>>   //    // should be null.
>>>>>   //    {
>>>>>   //      var i = _lastSlotAdded + 1
>>>>>   //      while (i < _slots.length) {
>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>   //        i = i + 1
>>>>>   //      }
>>>>>   //    }
>>>>>   //
>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>   //    }
>>>>>   //  }
>>>>>
>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>>> //
>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>
>>>>>     var i = _lastSlotAdded
>>>>>     while (i > cs._lastSlotAdded) {
>>>>>       val lastNode = childNodes.last
>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>
>>>>>       lastNode match {
>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>           // No need to delete the array, just the last DINode in it
>>>>>           arr.pop()
>>>>>         }
>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>           // Last DINode in array, remove the array entirely
>>>>>           //
>>>>>           arr.pop()
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>         case _ => {
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>
>>>>>   }
>>>>>
>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>> Actual
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>> Differences were (path, expected, actual):
>>>>>  (hb_root01,List(),List(<text/>))
>>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's  no  way to avoid the hash per each append.
>>>>>
>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>> that are used in expressions. We came up with that optimization so that
>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>> much this will affect performance, but there are ways to avoid the need
>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>
>>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize
>> without profiling info, etc.
>>>>>>
>>>>>>
>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>>
>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>> where array starts and ends. From there, we can support any model that
>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>> get it to a different model, but other things are more complicated.
>>>>>
>>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.)
>> The opportunity to do this is entirely lost if you don't have DIArray, but just
>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>
>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>> hashmap.
>>>>>
>>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>>>
>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>> trimming of unused Infoset elements is going to have some complex
>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>> we would still know elements are part of an array, it just takes a
>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>
>>>>>
>>>>>
>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>> around, but it might be worth thinking a little bit about what changes
>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>> for large arrays, which are not uncommon.
>>>>>
>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>
>>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>>> every time we want to append to a DIArray?
>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>> limited to just DPath.
>>>>>>
>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>> still need to know if a DIElement is part of an array for the
>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>
>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>> though.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Current Implementation:
>>>>>>>
>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>
>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>
>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>
>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>
>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>
>>>>>>>      arr.append(e)
>>>>>>>
>>>>>>>    } else {
>>>>>>>
>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>
>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>
>>>>>>>    }
>>>>>>>
>>>>>>>    e.setParent(this)
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>    }
>>>>>>>
>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>>      Assert.usage(info.isArray)
>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>      getChildArray(slot)
>>>>>>>    }
>>>>>>>
>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>      val slotVal = _slots(slot)
>>>>>>>      if (slotVal ne null)
>>>>>>>        slotVal match {
>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>        }
>>>>>>>      else {
>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>        // create one (it will have zero entries)
>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>        setChildArray(slot, ia)
>>>>>>>        ia
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>
>>>>>>>> Reasons:
>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>>> one node.
>>>>>>>>
>>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>> inherit from DINode.
>>>>>>>>
>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>
>>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>> lookup.
>>>>>>>>
>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>> allocating/appending to a sequence.
>>>>>>>>
>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>>> it's possible for the following:
>>>>>>>>
>>>>>>>>     <element name="a" .../>
>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>     <element name="b".../>
>>>>>>>>     <element name="a" ..../>
>>>>>>>>
>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>>
>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>
>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>
>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>> contents of the DIArray.
>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>>> with the modified DIArray.
>>>>>>>
>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>
>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>
>>>>>>>>     <element name="root>
>>>>>>>>          <sequence ../>
>>>>>>>>          <element name="a" .../>
>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>            ...
>>>>>>>>            <element name="b" .../>
>>>>>>>>            <element name="c" .../>
>>>>>>>>            ...
>>>>>>>>          </element>
>>>>>>>>        </sequence>
>>>>>>>>     </element>
>>>>>>>>
>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>      0 -> DIElement("a")
>>>>>>>>      1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>      0 -> DIElement("b")
>>>>>>>>      1 -> DIElement("c")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>      DIArray:
>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>          1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>      DIArray:
>>>>>>>>          Empty
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>>                                         // removed and is now empty
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> class DIComplex {
>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>
>>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>> reasonable.
>>>>>>>
>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>
>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>> really only necessary to support query style expressions, but that comes
>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>> until remove those restrictions.
>>>>>>>
>>>>>>>>
>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>     children += node
>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def backtrack(int n) {
>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>       val endIdx = children.size -1
>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>
>>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>>         case None => // do nothing
>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>           lastEntry match {
>>>>>>>>         case arr: DIArray => {
>>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>>> nodesSeq
>>>>>>>>               // with this array:
>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>>> arr))
>>>>>>>>               //
>>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>         }
>>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>           }
>>>>>>>>
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> class DIArray {
>>>>>>>>     private val initialSize =
>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>>
>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>         ie.setArray(this)
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     //
>>>>>>>>     // For backtracking purposes
>>>>>>>>     //
>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>         _contents.remove(n)
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>
>>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>
>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>
>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>
>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>> for the Term.
>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>> model group, then its first child term's
>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>
>>>>>>>
>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>>> compilation.
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

Should this be pushed to 2.0.0?

________________________________
From: Mike Beckerle
Sent: Monday, October 23, 2017 12:36:57 PM
To: Taylor Wise; Steve Lawrence; dev@daffodil.apache.org
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


Yes. The one giant ticket about the space problem compiling (DFDL-1444) is not specific enough.

________________________________
From: Taylor Wise
Sent: Monday, October 23, 2017 12:28:04 PM
To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


I don't believe there's a ticket associated with these changes.  Should I go ahead and create one?

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 18, 2017 8:21:02 AM
To: Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/17/2017 07:57 PM, Taylor Wise wrote:
> Been debugging after implementing your suggested changes.
>
>
> First thing to note, is that by moving the creation of the array out of
> getChildArray that I had to make additional calls to 'addChild' in the
> InfosetInputter.  This corrected things on the 'parser' side.  However, on the
> unparser side it became evident that this 'addChild' function cannot always set
> the parent.  There's an invariant expecting parent to be null along the
> execution branch of the unparser side.  As a result, I modified addChild to also
> accept a default boolean called 'setParent'.  It defaults to true unless
> specified.  So now when we call 'addChild' in InfosetInputter on the unparser
> side, we actually call 'addChild(e, false)'.

This sounds to me like addChild is getting called twice on the same
element and so it's parent is trying to be set twice. Might need some
more investigation. Maybe the InfosetInputter needs more tweaks to
support not using getChildArray? Hard to say without seeing your
InfosetInputter changes.

Regarding the below tests, I think your assessment is correct. The
password element doesn't exist and so we should never even try to get
the -1 index. Instead we should get something like "password element
does not exist" or something. It would be useful to have that test to
ensure we get that error message.

An index of -1 (or anything less than 1) can definitely be caught
statically at schema compile time. Worth creating a bug for that. But
there will be times where the index is calculated so we'll still need
this runtime check. I think it's still a test worth having.

I think you're modifications are reasonable based on what the tests are
trying to tests. Probably to proper coverage of tests we need is:

- trying to access an array that does not exist with both valid and
invalid indicies (less than 1 and greater than array size)
- trying to access an array thta does exist but with invalid indicies
(less than 1 and greater than array size)

So it's probably worth keeping the existing tests for the 'array does
not exist' tests and ensure we get a reasonable error message not
talking about indices. And then just make your modifications new tests.

>
> The above correction reduced the errors from 30+ to just 2.  The only two errors
> remaining have to do with an array index out of bounds error that's expected.  I
> believe that it's actually an issue with the test and not the code.
>
>
> Error:
>
> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message
> "Value -1 is out of range" in any of the actual diagnostic messages:
> Runtime Schema Definition Error: Expression Evaluation Error: Child element
> {http://example.com}password does not exist.
> Schema context: hideme Location line 267 column 12 in
> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
> Data location was preceding byte 0 limit(bytes) 8
>
> Well, password doesn't exist because we are missing the 'pw:' initiator in the
> tdml:document content.  The code now doesn't know to create it because we
> haven't seen the initiator.  At least that's what I surmise.
>
>
> Existing test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Schema:
>      <xs:group name="hiddenData">
>        <xs:sequence dfdl:separator="|">
>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:"
> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>        </xs:sequence>
>      </xs:group>
>
>      <xs:element name="e1">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>      <xs:element name="e1b">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
>
> My modification:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> But honestly, should we really need to do this?  Shouldn't the code know that -1
> is not a valid index without having to force construction of the password array?
>
> I'm similarly going to have to modify this other test as well by constructing a
> different root that attempts to access a character in a position greater than
> the length of the array, unless there's something I'm missing:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Proposed addition to defineSchema:
>
>      <xs:element name="e1c">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
> Proposed modification to test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 10 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
>
>
>
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> I don't think a DIArray could be created earlier and things appended to
> it. But, currently, I think an empty DIArray could be created earlier.
>
> And I think this is really just a problem with getChildArray always
> creating DIArrays. What I suspect happened is that we had an array
> following by something that accessed that array, e.g.:
>
> <xs:sequence>
>    <xs:element name="arr" maxOccurs="unbounded" ... />
>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
> </xs:sequence>
>
> arr was slot 1, count was slot 2.
>
> The setChildArray function used to just do something like
>
>    _lastSlotAdded = slot
>
> Which worked in the case of adding a new array--we add a new array, so
> the slot it was added in was the last slot added.
>
> But when fn:count was called, _lastSlotAdded was 2 (from the count
> element). fn:count ended up calling getChildArray, which created a
> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
> slot of the array, i.e. 1. So even though the array was finished and
> couldn't be modified, the use of getChildArray in DPath ended up setting
> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
> couldn't set it to an earlier value.
>
> So, now, because getChildArray create elements, it sounds like things
> could be created out of order. Rather than inserting an empty DIArray
> into a slot, it will added it to the end and potentially mess things up.
> This says to me we definitely want to modify getChildArray to not create
> DIArrays. It should either return the array if it exists, or throw an
> Infoset exception, very similar to getChild. Then we don't have to worry
> about DIArray's being created at the wrong time. The only thing that
> should create a DIArray is addChild.
>
>
> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>> Hmm, very interesting.  This would greatly simplify things.  But what about this
>> code snippet from before my changes:
>>
>>
>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>    //    Assert.invariant(_slots(slot) eq null)
>>    //    _slots(slot) = arr
>>    //    //
>>    //    // Because arrays are created if not existing, an expression like ../A[1]
>>    //    // can cause array A to come into existence (though empty) and A might be
>>    //    // part of an expression that is reaching backward into earlier data (when
>>    //    // parsing) so that we might be creating an array earlier in the slots
>>    //    // of this element. Hence, the slot being passed here is not necessarily
>>    //    // the last slot added. It might be before it.
>>    //    //
>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>    //  }
>> Given your statement:
>>
>>
>> "the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created."
>>
>>
>> To me it seems to imply that a DIArray can come into existence without any
>> contents.  But then be appended to at a later time.  In which case it's possible
>> that this DIArray then is not the last item appended.  It could be somewhere
>> earlier and then need to be added to.
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> Actually, I think we're over complicating the array situation. I think
>> we can avoid needing a unique ID and I think we can also avoid needing
>> to always insert arrays into a HashMap.
>>
>> When addChild is called to add an DIElement to an array, there are two
>> cases we need to think about:
>>
>> 1) The DIArray does not exist and it must be created. The
>> DIElement is appended to its contents.
>>
>> 2) The DIArray does exist and we just need to find it. The
>> DIElement is appended to its contents.
>>
>> So the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created. So addChild
>> just becomes something like this:
>>
>>    def addChild(e: DIElement) {
>>      if (e.erd.isArray) {
>>        if (children.isEmpty || children.last.erd != e.erd) {
>>          // no children, or last child is not a DIArray for
>>          // this element, create the DIArray
>>          children.append(new DIArray(...))
>>        }
>>        // the array is now always last, add the new child to it
>>        children.last.append(e)
>>      } else {
>>        children.append(e)
>>      }
>>    }
>>
>> So we no longer need a unique ID or a tuple for the HashMap key. The
>> fast HashMap lookup only needs to work on the name of the element, which
>> allows query-style expressions to work in the future. And we now only
>> need to insert DIArray's into the fast HashMap if they are used in
>> expressions addChild no longer needs getChildArray, minimizing hash
>> look ups.
>>
>> Related, I'm wondering if getChildArray should change as well so that it
>> does not create a DIArray if it doesn't exist? I think it seems
>> reasonable that only addChild should create new DIArrays. And if
>> something like DPath calls getChildArray, we probably don't want to end
>> up creating an empty DIArray if it doesn't exist. It should probably
>> just throw an InfosetNoSuchChildElementException if the array doesn't
>> exist, just like getChild.
>>
>> - Steve
>>
>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>> The ID might cause some issues when we want to support query-style
>>> expressions. For example, say we have a schema like this:
>>>
>>> <sequence>
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>   <xs:element name="element" ... />
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>> </xs:sequence>
>>>
>>> So two arrays with the same name separated by a single element. Each
>>> element would have a unique ID, so would be inserted in the hash table
>>> as something like
>>>
>>> ("array", 1) -> Seq(DIArray)
>>> ("element", 2) -> Seq(DISimple)
>>> ("array", 3) -> Seq(DIArray)
>>>
>>> A query style expression that accesses the "array" elements (e.g.
>>> fn:count(./array)) would have trouble finding both arrays to get the
>>> full count since the keys for the two arrays are the same.
>>>
>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>> up with something like this:
>>>
>>> "array" -> Seq(DIArray, DIArray)
>>> "element" -> Seq(DISimple)
>>>
>>> And then just iterator of the Seq until we find DIArray with the right
>>> ID? That's could have a performance hit if there were lots of things
>>> with the same name, but I suspect that won't be super common.
>>>
>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>> when an DIArray is created, it's store in the PState for quick access,
>>> and getChildArray would essentially go away.
>>>
>>> - Steve
>>>
>>>
>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>>
>>>>
>>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>>
>>>>
>>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>>
>>>>
>>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>>
>>>> ________________________________
>>>> From: Taylor Wise
>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>>
>>>> I'm down to a single error: test_dfdlPosition3
>>>>
>>>>
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>> Actual
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>> Differences were (path, expected, actual):
>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>     <xs:element name="f">
>>>>       <xs:complexType>
>>>>         <xs:sequence>
>>>>           <xs:element name="cnt1" type="xs:int"
>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>           <xs:element name="cnt2" type="xs:int"
>>>>             dfdl:lengthKind="delimited" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>         </xs:sequence>
>>>>       </xs:complexType>
>>>>     </xs:element>
>>>>
>>>>
>>>>
>>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>>
>>>> ChildNodes aka Insertion Order
>>>> ====
>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>>
>>>> FastLookup
>>>> ====
>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>>
>>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>
>>>> How do we fix this?
>>>>
>>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences,  this  shouldn't matter.  We shouldn't run into two or more members with the same
>> name and namespace.  Correct?
>>>>
>>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>
>>>>>
>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>>> z
>>>>>
>>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>>
>>>>>
>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>     c.restoreFrom(this)
>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>     super.restoreInto(e)
>>>>>   }
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>   // Shouldn't I be able to just say something like...
>>>>>  //
>>>>>  childNodes.clear()
>>>>>  childNodes.append(cs.childNodes)
>>>>>  fastLookup.clear()
>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>        fastLookup.put(name, seqDINodes) )
>>>>> }
>>>>
>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>> remove the elements that are being backtracked.
>>>>
>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>> something like this (very similar to the existing restoreFrom):
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   var i = childNodes.length - 1
>>>>
>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>   // this should always be the last element in the hashmap seq
>>>>   while (i >= cs._numChildren) {
>>>>     var childToRemove = childNodes(i)
>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>     if (fastSeq.length == 1)
>>>>       // last one, remove the whole key entry
>>>>       fastLookup.remove(childToRemove.name)
>>>>     else
>>>>       // not the last one, just drop the end
>>>>       fastSeq.dropRight(1)
>>>>     i -= 1
>>>>   }
>>>>
>>>>   // now just quickly remove all those children
>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>   _numChildren = cs._numChildren
>>>>
>>>>   if (cs._arraySize.isDefined) {
>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>   }
>>>> }
>>>>
>>>> The logic here is very similar to the CURRENT code, just is a little
>>>> more simple since it can make certain assumptions. Like removing
>>>> something form the fastSeq just means removing the last one
>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>> DIArrays to simplify that logic.
>>>>
>>>> Note that we might want to rethink about the type fastLookup value.
>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>
>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>
>>>>
>>>>
>>>>
>>>>> // ORIGINAL
>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>   //
>>>>>   //    var i = _lastSlotAdded
>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>   //      _slots(i) = null
>>>>>   //      i -= 1
>>>>>   //    }
>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>   //
>>>>>   //    // check invariant. All slots after last slot added
>>>>>   //    // should be null.
>>>>>   //    {
>>>>>   //      var i = _lastSlotAdded + 1
>>>>>   //      while (i < _slots.length) {
>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>   //        i = i + 1
>>>>>   //      }
>>>>>   //    }
>>>>>   //
>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>   //    }
>>>>>   //  }
>>>>>
>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>>> //
>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>
>>>>>     var i = _lastSlotAdded
>>>>>     while (i > cs._lastSlotAdded) {
>>>>>       val lastNode = childNodes.last
>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>
>>>>>       lastNode match {
>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>           // No need to delete the array, just the last DINode in it
>>>>>           arr.pop()
>>>>>         }
>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>           // Last DINode in array, remove the array entirely
>>>>>           //
>>>>>           arr.pop()
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>         case _ => {
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>
>>>>>   }
>>>>>
>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>> Actual
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>> Differences were (path, expected, actual):
>>>>>  (hb_root01,List(),List(<text/>))
>>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's  no  way to avoid the hash per each append.
>>>>>
>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>> that are used in expressions. We came up with that optimization so that
>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>> much this will affect performance, but there are ways to avoid the need
>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>
>>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize
>> without profiling info, etc.
>>>>>>
>>>>>>
>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>>
>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>> where array starts and ends. From there, we can support any model that
>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>> get it to a different model, but other things are more complicated.
>>>>>
>>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.)
>> The opportunity to do this is entirely lost if you don't have DIArray, but just
>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>
>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>> hashmap.
>>>>>
>>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>>>
>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>> trimming of unused Infoset elements is going to have some complex
>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>> we would still know elements are part of an array, it just takes a
>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>
>>>>>
>>>>>
>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>> around, but it might be worth thinking a little bit about what changes
>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>> for large arrays, which are not uncommon.
>>>>>
>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>
>>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>>> every time we want to append to a DIArray?
>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>> limited to just DPath.
>>>>>>
>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>> still need to know if a DIElement is part of an array for the
>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>
>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>> though.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Current Implementation:
>>>>>>>
>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>
>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>
>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>
>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>
>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>
>>>>>>>      arr.append(e)
>>>>>>>
>>>>>>>    } else {
>>>>>>>
>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>
>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>
>>>>>>>    }
>>>>>>>
>>>>>>>    e.setParent(this)
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>    }
>>>>>>>
>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>>      Assert.usage(info.isArray)
>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>      getChildArray(slot)
>>>>>>>    }
>>>>>>>
>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>      val slotVal = _slots(slot)
>>>>>>>      if (slotVal ne null)
>>>>>>>        slotVal match {
>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>        }
>>>>>>>      else {
>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>        // create one (it will have zero entries)
>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>        setChildArray(slot, ia)
>>>>>>>        ia
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>
>>>>>>>> Reasons:
>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>>> one node.
>>>>>>>>
>>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>> inherit from DINode.
>>>>>>>>
>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>
>>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>> lookup.
>>>>>>>>
>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>> allocating/appending to a sequence.
>>>>>>>>
>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>>> it's possible for the following:
>>>>>>>>
>>>>>>>>     <element name="a" .../>
>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>     <element name="b".../>
>>>>>>>>     <element name="a" ..../>
>>>>>>>>
>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>>
>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>
>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>
>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>> contents of the DIArray.
>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>>> with the modified DIArray.
>>>>>>>
>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>
>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>
>>>>>>>>     <element name="root>
>>>>>>>>          <sequence ../>
>>>>>>>>          <element name="a" .../>
>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>            ...
>>>>>>>>            <element name="b" .../>
>>>>>>>>            <element name="c" .../>
>>>>>>>>            ...
>>>>>>>>          </element>
>>>>>>>>        </sequence>
>>>>>>>>     </element>
>>>>>>>>
>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>      0 -> DIElement("a")
>>>>>>>>      1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>      0 -> DIElement("b")
>>>>>>>>      1 -> DIElement("c")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>      DIArray:
>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>          1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>      DIArray:
>>>>>>>>          Empty
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>>                                         // removed and is now empty
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> class DIComplex {
>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>
>>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>> reasonable.
>>>>>>>
>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>
>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>> really only necessary to support query style expressions, but that comes
>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>> until remove those restrictions.
>>>>>>>
>>>>>>>>
>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>     children += node
>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def backtrack(int n) {
>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>       val endIdx = children.size -1
>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>
>>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>>         case None => // do nothing
>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>           lastEntry match {
>>>>>>>>         case arr: DIArray => {
>>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>>> nodesSeq
>>>>>>>>               // with this array:
>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>>> arr))
>>>>>>>>               //
>>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>         }
>>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>           }
>>>>>>>>
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> class DIArray {
>>>>>>>>     private val initialSize =
>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>>
>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>         ie.setArray(this)
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     //
>>>>>>>>     // For backtracking purposes
>>>>>>>>     //
>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>         _contents.remove(n)
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>
>>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>
>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>
>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>
>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>> for the Term.
>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>> model group, then its first child term's
>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>
>>>>>>>
>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>>> compilation.
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Mike Beckerle <mb...@tresys.com>.

Yes. The one giant ticket about the space problem compiling (DFDL-1444) is not specific enough.

________________________________
From: Taylor Wise
Sent: Monday, October 23, 2017 12:28:04 PM
To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


I don't believe there's a ticket associated with these changes.  Should I go ahead and create one?

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 18, 2017 8:21:02 AM
To: Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/17/2017 07:57 PM, Taylor Wise wrote:
> Been debugging after implementing your suggested changes.
>
>
> First thing to note, is that by moving the creation of the array out of
> getChildArray that I had to make additional calls to 'addChild' in the
> InfosetInputter.  This corrected things on the 'parser' side.  However, on the
> unparser side it became evident that this 'addChild' function cannot always set
> the parent.  There's an invariant expecting parent to be null along the
> execution branch of the unparser side.  As a result, I modified addChild to also
> accept a default boolean called 'setParent'.  It defaults to true unless
> specified.  So now when we call 'addChild' in InfosetInputter on the unparser
> side, we actually call 'addChild(e, false)'.

This sounds to me like addChild is getting called twice on the same
element and so it's parent is trying to be set twice. Might need some
more investigation. Maybe the InfosetInputter needs more tweaks to
support not using getChildArray? Hard to say without seeing your
InfosetInputter changes.

Regarding the below tests, I think your assessment is correct. The
password element doesn't exist and so we should never even try to get
the -1 index. Instead we should get something like "password element
does not exist" or something. It would be useful to have that test to
ensure we get that error message.

An index of -1 (or anything less than 1) can definitely be caught
statically at schema compile time. Worth creating a bug for that. But
there will be times where the index is calculated so we'll still need
this runtime check. I think it's still a test worth having.

I think you're modifications are reasonable based on what the tests are
trying to tests. Probably to proper coverage of tests we need is:

- trying to access an array that does not exist with both valid and
invalid indicies (less than 1 and greater than array size)
- trying to access an array thta does exist but with invalid indicies
(less than 1 and greater than array size)

So it's probably worth keeping the existing tests for the 'array does
not exist' tests and ensure we get a reasonable error message not
talking about indices. And then just make your modifications new tests.

>
> The above correction reduced the errors from 30+ to just 2.  The only two errors
> remaining have to do with an array index out of bounds error that's expected.  I
> believe that it's actually an issue with the test and not the code.
>
>
> Error:
>
> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message
> "Value -1 is out of range" in any of the actual diagnostic messages:
> Runtime Schema Definition Error: Expression Evaluation Error: Child element
> {http://example.com}password does not exist.
> Schema context: hideme Location line 267 column 12 in
> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
> Data location was preceding byte 0 limit(bytes) 8
>
> Well, password doesn't exist because we are missing the 'pw:' initiator in the
> tdml:document content.  The code now doesn't know to create it because we
> haven't seen the initiator.  At least that's what I surmise.
>
>
> Existing test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Schema:
>      <xs:group name="hiddenData">
>        <xs:sequence dfdl:separator="|">
>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:"
> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>        </xs:sequence>
>      </xs:group>
>
>      <xs:element name="e1">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>      <xs:element name="e1b">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
>
> My modification:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> But honestly, should we really need to do this?  Shouldn't the code know that -1
> is not a valid index without having to force construction of the password array?
>
> I'm similarly going to have to modify this other test as well by constructing a
> different root that attempts to access a character in a position greater than
> the length of the array, unless there's something I'm missing:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Proposed addition to defineSchema:
>
>      <xs:element name="e1c">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
> Proposed modification to test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 10 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
>
>
>
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> I don't think a DIArray could be created earlier and things appended to
> it. But, currently, I think an empty DIArray could be created earlier.
>
> And I think this is really just a problem with getChildArray always
> creating DIArrays. What I suspect happened is that we had an array
> following by something that accessed that array, e.g.:
>
> <xs:sequence>
>    <xs:element name="arr" maxOccurs="unbounded" ... />
>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
> </xs:sequence>
>
> arr was slot 1, count was slot 2.
>
> The setChildArray function used to just do something like
>
>    _lastSlotAdded = slot
>
> Which worked in the case of adding a new array--we add a new array, so
> the slot it was added in was the last slot added.
>
> But when fn:count was called, _lastSlotAdded was 2 (from the count
> element). fn:count ended up calling getChildArray, which created a
> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
> slot of the array, i.e. 1. So even though the array was finished and
> couldn't be modified, the use of getChildArray in DPath ended up setting
> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
> couldn't set it to an earlier value.
>
> So, now, because getChildArray create elements, it sounds like things
> could be created out of order. Rather than inserting an empty DIArray
> into a slot, it will added it to the end and potentially mess things up.
> This says to me we definitely want to modify getChildArray to not create
> DIArrays. It should either return the array if it exists, or throw an
> Infoset exception, very similar to getChild. Then we don't have to worry
> about DIArray's being created at the wrong time. The only thing that
> should create a DIArray is addChild.
>
>
> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>> Hmm, very interesting.  This would greatly simplify things.  But what about this
>> code snippet from before my changes:
>>
>>
>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>    //    Assert.invariant(_slots(slot) eq null)
>>    //    _slots(slot) = arr
>>    //    //
>>    //    // Because arrays are created if not existing, an expression like ../A[1]
>>    //    // can cause array A to come into existence (though empty) and A might be
>>    //    // part of an expression that is reaching backward into earlier data (when
>>    //    // parsing) so that we might be creating an array earlier in the slots
>>    //    // of this element. Hence, the slot being passed here is not necessarily
>>    //    // the last slot added. It might be before it.
>>    //    //
>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>    //  }
>> Given your statement:
>>
>>
>> "the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created."
>>
>>
>> To me it seems to imply that a DIArray can come into existence without any
>> contents.  But then be appended to at a later time.  In which case it's possible
>> that this DIArray then is not the last item appended.  It could be somewhere
>> earlier and then need to be added to.
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> Actually, I think we're over complicating the array situation. I think
>> we can avoid needing a unique ID and I think we can also avoid needing
>> to always insert arrays into a HashMap.
>>
>> When addChild is called to add an DIElement to an array, there are two
>> cases we need to think about:
>>
>> 1) The DIArray does not exist and it must be created. The
>> DIElement is appended to its contents.
>>
>> 2) The DIArray does exist and we just need to find it. The
>> DIElement is appended to its contents.
>>
>> So the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created. So addChild
>> just becomes something like this:
>>
>>    def addChild(e: DIElement) {
>>      if (e.erd.isArray) {
>>        if (children.isEmpty || children.last.erd != e.erd) {
>>          // no children, or last child is not a DIArray for
>>          // this element, create the DIArray
>>          children.append(new DIArray(...))
>>        }
>>        // the array is now always last, add the new child to it
>>        children.last.append(e)
>>      } else {
>>        children.append(e)
>>      }
>>    }
>>
>> So we no longer need a unique ID or a tuple for the HashMap key. The
>> fast HashMap lookup only needs to work on the name of the element, which
>> allows query-style expressions to work in the future. And we now only
>> need to insert DIArray's into the fast HashMap if they are used in
>> expressions addChild no longer needs getChildArray, minimizing hash
>> look ups.
>>
>> Related, I'm wondering if getChildArray should change as well so that it
>> does not create a DIArray if it doesn't exist? I think it seems
>> reasonable that only addChild should create new DIArrays. And if
>> something like DPath calls getChildArray, we probably don't want to end
>> up creating an empty DIArray if it doesn't exist. It should probably
>> just throw an InfosetNoSuchChildElementException if the array doesn't
>> exist, just like getChild.
>>
>> - Steve
>>
>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>> The ID might cause some issues when we want to support query-style
>>> expressions. For example, say we have a schema like this:
>>>
>>> <sequence>
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>   <xs:element name="element" ... />
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>> </xs:sequence>
>>>
>>> So two arrays with the same name separated by a single element. Each
>>> element would have a unique ID, so would be inserted in the hash table
>>> as something like
>>>
>>> ("array", 1) -> Seq(DIArray)
>>> ("element", 2) -> Seq(DISimple)
>>> ("array", 3) -> Seq(DIArray)
>>>
>>> A query style expression that accesses the "array" elements (e.g.
>>> fn:count(./array)) would have trouble finding both arrays to get the
>>> full count since the keys for the two arrays are the same.
>>>
>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>> up with something like this:
>>>
>>> "array" -> Seq(DIArray, DIArray)
>>> "element" -> Seq(DISimple)
>>>
>>> And then just iterator of the Seq until we find DIArray with the right
>>> ID? That's could have a performance hit if there were lots of things
>>> with the same name, but I suspect that won't be super common.
>>>
>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>> when an DIArray is created, it's store in the PState for quick access,
>>> and getChildArray would essentially go away.
>>>
>>> - Steve
>>>
>>>
>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>>
>>>>
>>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>>
>>>>
>>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>>
>>>>
>>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>>
>>>> ________________________________
>>>> From: Taylor Wise
>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>>
>>>> I'm down to a single error: test_dfdlPosition3
>>>>
>>>>
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>> Actual
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>> Differences were (path, expected, actual):
>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>     <xs:element name="f">
>>>>       <xs:complexType>
>>>>         <xs:sequence>
>>>>           <xs:element name="cnt1" type="xs:int"
>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>           <xs:element name="cnt2" type="xs:int"
>>>>             dfdl:lengthKind="delimited" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>         </xs:sequence>
>>>>       </xs:complexType>
>>>>     </xs:element>
>>>>
>>>>
>>>>
>>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>>
>>>> ChildNodes aka Insertion Order
>>>> ====
>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>>
>>>> FastLookup
>>>> ====
>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>>
>>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>
>>>> How do we fix this?
>>>>
>>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences,  this  shouldn't matter.  We shouldn't run into two or more members with the same
>> name and namespace.  Correct?
>>>>
>>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>
>>>>>
>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>>> z
>>>>>
>>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>>
>>>>>
>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>     c.restoreFrom(this)
>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>     super.restoreInto(e)
>>>>>   }
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>   // Shouldn't I be able to just say something like...
>>>>>  //
>>>>>  childNodes.clear()
>>>>>  childNodes.append(cs.childNodes)
>>>>>  fastLookup.clear()
>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>        fastLookup.put(name, seqDINodes) )
>>>>> }
>>>>
>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>> remove the elements that are being backtracked.
>>>>
>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>> something like this (very similar to the existing restoreFrom):
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   var i = childNodes.length - 1
>>>>
>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>   // this should always be the last element in the hashmap seq
>>>>   while (i >= cs._numChildren) {
>>>>     var childToRemove = childNodes(i)
>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>     if (fastSeq.length == 1)
>>>>       // last one, remove the whole key entry
>>>>       fastLookup.remove(childToRemove.name)
>>>>     else
>>>>       // not the last one, just drop the end
>>>>       fastSeq.dropRight(1)
>>>>     i -= 1
>>>>   }
>>>>
>>>>   // now just quickly remove all those children
>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>   _numChildren = cs._numChildren
>>>>
>>>>   if (cs._arraySize.isDefined) {
>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>   }
>>>> }
>>>>
>>>> The logic here is very similar to the CURRENT code, just is a little
>>>> more simple since it can make certain assumptions. Like removing
>>>> something form the fastSeq just means removing the last one
>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>> DIArrays to simplify that logic.
>>>>
>>>> Note that we might want to rethink about the type fastLookup value.
>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>
>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>
>>>>
>>>>
>>>>
>>>>> // ORIGINAL
>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>   //
>>>>>   //    var i = _lastSlotAdded
>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>   //      _slots(i) = null
>>>>>   //      i -= 1
>>>>>   //    }
>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>   //
>>>>>   //    // check invariant. All slots after last slot added
>>>>>   //    // should be null.
>>>>>   //    {
>>>>>   //      var i = _lastSlotAdded + 1
>>>>>   //      while (i < _slots.length) {
>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>   //        i = i + 1
>>>>>   //      }
>>>>>   //    }
>>>>>   //
>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>   //    }
>>>>>   //  }
>>>>>
>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>>> //
>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>
>>>>>     var i = _lastSlotAdded
>>>>>     while (i > cs._lastSlotAdded) {
>>>>>       val lastNode = childNodes.last
>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>
>>>>>       lastNode match {
>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>           // No need to delete the array, just the last DINode in it
>>>>>           arr.pop()
>>>>>         }
>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>           // Last DINode in array, remove the array entirely
>>>>>           //
>>>>>           arr.pop()
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>         case _ => {
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>
>>>>>   }
>>>>>
>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>> Actual
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>> Differences were (path, expected, actual):
>>>>>  (hb_root01,List(),List(<text/>))
>>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's  no  way to avoid the hash per each append.
>>>>>
>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>> that are used in expressions. We came up with that optimization so that
>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>> much this will affect performance, but there are ways to avoid the need
>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>
>>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize
>> without profiling info, etc.
>>>>>>
>>>>>>
>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>>
>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>> where array starts and ends. From there, we can support any model that
>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>> get it to a different model, but other things are more complicated.
>>>>>
>>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.)
>> The opportunity to do this is entirely lost if you don't have DIArray, but just
>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>
>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>> hashmap.
>>>>>
>>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>>>
>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>> trimming of unused Infoset elements is going to have some complex
>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>> we would still know elements are part of an array, it just takes a
>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>
>>>>>
>>>>>
>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>> around, but it might be worth thinking a little bit about what changes
>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>> for large arrays, which are not uncommon.
>>>>>
>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>
>>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>>> every time we want to append to a DIArray?
>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>> limited to just DPath.
>>>>>>
>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>> still need to know if a DIElement is part of an array for the
>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>
>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>> though.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Current Implementation:
>>>>>>>
>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>
>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>
>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>
>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>
>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>
>>>>>>>      arr.append(e)
>>>>>>>
>>>>>>>    } else {
>>>>>>>
>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>
>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>
>>>>>>>    }
>>>>>>>
>>>>>>>    e.setParent(this)
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>    }
>>>>>>>
>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>>      Assert.usage(info.isArray)
>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>      getChildArray(slot)
>>>>>>>    }
>>>>>>>
>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>      val slotVal = _slots(slot)
>>>>>>>      if (slotVal ne null)
>>>>>>>        slotVal match {
>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>        }
>>>>>>>      else {
>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>        // create one (it will have zero entries)
>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>        setChildArray(slot, ia)
>>>>>>>        ia
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>
>>>>>>>> Reasons:
>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>>> one node.
>>>>>>>>
>>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>> inherit from DINode.
>>>>>>>>
>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>
>>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>> lookup.
>>>>>>>>
>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>> allocating/appending to a sequence.
>>>>>>>>
>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>>> it's possible for the following:
>>>>>>>>
>>>>>>>>     <element name="a" .../>
>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>     <element name="b".../>
>>>>>>>>     <element name="a" ..../>
>>>>>>>>
>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>>
>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>
>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>
>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>> contents of the DIArray.
>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>>> with the modified DIArray.
>>>>>>>
>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>
>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>
>>>>>>>>     <element name="root>
>>>>>>>>          <sequence ../>
>>>>>>>>          <element name="a" .../>
>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>            ...
>>>>>>>>            <element name="b" .../>
>>>>>>>>            <element name="c" .../>
>>>>>>>>            ...
>>>>>>>>          </element>
>>>>>>>>        </sequence>
>>>>>>>>     </element>
>>>>>>>>
>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>      0 -> DIElement("a")
>>>>>>>>      1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>      0 -> DIElement("b")
>>>>>>>>      1 -> DIElement("c")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>      DIArray:
>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>          1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>      DIArray:
>>>>>>>>          Empty
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>>                                         // removed and is now empty
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> class DIComplex {
>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>
>>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>> reasonable.
>>>>>>>
>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>
>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>> really only necessary to support query style expressions, but that comes
>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>> until remove those restrictions.
>>>>>>>
>>>>>>>>
>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>     children += node
>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def backtrack(int n) {
>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>       val endIdx = children.size -1
>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>
>>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>>         case None => // do nothing
>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>           lastEntry match {
>>>>>>>>         case arr: DIArray => {
>>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>>> nodesSeq
>>>>>>>>               // with this array:
>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>>> arr))
>>>>>>>>               //
>>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>         }
>>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>           }
>>>>>>>>
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> class DIArray {
>>>>>>>>     private val initialSize =
>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>>
>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>         ie.setArray(this)
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     //
>>>>>>>>     // For backtracking purposes
>>>>>>>>     //
>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>         _contents.remove(n)
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>
>>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>
>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>
>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>
>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>> for the Term.
>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>> model group, then its first child term's
>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>
>>>>>>>
>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>>> compilation.
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

I don't believe there's a ticket associated with these changes.  Should I go ahead and create one?

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 18, 2017 8:21:02 AM
To: Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/17/2017 07:57 PM, Taylor Wise wrote:
> Been debugging after implementing your suggested changes.
>
>
> First thing to note, is that by moving the creation of the array out of
> getChildArray that I had to make additional calls to 'addChild' in the
> InfosetInputter.  This corrected things on the 'parser' side.  However, on the
> unparser side it became evident that this 'addChild' function cannot always set
> the parent.  There's an invariant expecting parent to be null along the
> execution branch of the unparser side.  As a result, I modified addChild to also
> accept a default boolean called 'setParent'.  It defaults to true unless
> specified.  So now when we call 'addChild' in InfosetInputter on the unparser
> side, we actually call 'addChild(e, false)'.

This sounds to me like addChild is getting called twice on the same
element and so it's parent is trying to be set twice. Might need some
more investigation. Maybe the InfosetInputter needs more tweaks to
support not using getChildArray? Hard to say without seeing your
InfosetInputter changes.

Regarding the below tests, I think your assessment is correct. The
password element doesn't exist and so we should never even try to get
the -1 index. Instead we should get something like "password element
does not exist" or something. It would be useful to have that test to
ensure we get that error message.

An index of -1 (or anything less than 1) can definitely be caught
statically at schema compile time. Worth creating a bug for that. But
there will be times where the index is calculated so we'll still need
this runtime check. I think it's still a test worth having.

I think you're modifications are reasonable based on what the tests are
trying to tests. Probably to proper coverage of tests we need is:

- trying to access an array that does not exist with both valid and
invalid indicies (less than 1 and greater than array size)
- trying to access an array thta does exist but with invalid indicies
(less than 1 and greater than array size)

So it's probably worth keeping the existing tests for the 'array does
not exist' tests and ensure we get a reasonable error message not
talking about indices. And then just make your modifications new tests.

>
> The above correction reduced the errors from 30+ to just 2.  The only two errors
> remaining have to do with an array index out of bounds error that's expected.  I
> believe that it's actually an issue with the test and not the code.
>
>
> Error:
>
> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message
> "Value -1 is out of range" in any of the actual diagnostic messages:
> Runtime Schema Definition Error: Expression Evaluation Error: Child element
> {http://example.com}password does not exist.
> Schema context: hideme Location line 267 column 12 in
> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
> Data location was preceding byte 0 limit(bytes) 8
>
> Well, password doesn't exist because we are missing the 'pw:' initiator in the
> tdml:document content.  The code now doesn't know to create it because we
> haven't seen the initiator.  At least that's what I surmise.
>
>
> Existing test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Schema:
>      <xs:group name="hiddenData">
>        <xs:sequence dfdl:separator="|">
>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:"
> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>        </xs:sequence>
>      </xs:group>
>
>      <xs:element name="e1">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>      <xs:element name="e1b">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
>
> My modification:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> But honestly, should we really need to do this?  Shouldn't the code know that -1
> is not a valid index without having to force construction of the password array?
>
> I'm similarly going to have to modify this other test as well by constructing a
> different root that attempts to access a character in a position greater than
> the length of the array, unless there's something I'm missing:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
> Proposed addition to defineSchema:
>
>      <xs:element name="e1c">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{
> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>
> Proposed modification to test:
>
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array
> index in an expression is out of bounds
>    -->
>
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
>
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 10 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
>
>
>
>
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> I don't think a DIArray could be created earlier and things appended to
> it. But, currently, I think an empty DIArray could be created earlier.
>
> And I think this is really just a problem with getChildArray always
> creating DIArrays. What I suspect happened is that we had an array
> following by something that accessed that array, e.g.:
>
> <xs:sequence>
>    <xs:element name="arr" maxOccurs="unbounded" ... />
>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
> </xs:sequence>
>
> arr was slot 1, count was slot 2.
>
> The setChildArray function used to just do something like
>
>    _lastSlotAdded = slot
>
> Which worked in the case of adding a new array--we add a new array, so
> the slot it was added in was the last slot added.
>
> But when fn:count was called, _lastSlotAdded was 2 (from the count
> element). fn:count ended up calling getChildArray, which created a
> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
> slot of the array, i.e. 1. So even though the array was finished and
> couldn't be modified, the use of getChildArray in DPath ended up setting
> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
> couldn't set it to an earlier value.
>
> So, now, because getChildArray create elements, it sounds like things
> could be created out of order. Rather than inserting an empty DIArray
> into a slot, it will added it to the end and potentially mess things up.
> This says to me we definitely want to modify getChildArray to not create
> DIArrays. It should either return the array if it exists, or throw an
> Infoset exception, very similar to getChild. Then we don't have to worry
> about DIArray's being created at the wrong time. The only thing that
> should create a DIArray is addChild.
>
>
> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>> Hmm, very interesting.  This would greatly simplify things.  But what about this
>> code snippet from before my changes:
>>
>>
>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>    //    Assert.invariant(_slots(slot) eq null)
>>    //    _slots(slot) = arr
>>    //    //
>>    //    // Because arrays are created if not existing, an expression like ../A[1]
>>    //    // can cause array A to come into existence (though empty) and A might be
>>    //    // part of an expression that is reaching backward into earlier data (when
>>    //    // parsing) so that we might be creating an array earlier in the slots
>>    //    // of this element. Hence, the slot being passed here is not necessarily
>>    //    // the last slot added. It might be before it.
>>    //    //
>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>    //  }
>> Given your statement:
>>
>>
>> "the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created."
>>
>>
>> To me it seems to imply that a DIArray can come into existence without any
>> contents.  But then be appended to at a later time.  In which case it's possible
>> that this DIArray then is not the last item appended.  It could be somewhere
>> earlier and then need to be added to.
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> Actually, I think we're over complicating the array situation. I think
>> we can avoid needing a unique ID and I think we can also avoid needing
>> to always insert arrays into a HashMap.
>>
>> When addChild is called to add an DIElement to an array, there are two
>> cases we need to think about:
>>
>> 1) The DIArray does not exist and it must be created. The
>> DIElement is appended to its contents.
>>
>> 2) The DIArray does exist and we just need to find it. The
>> DIElement is appended to its contents.
>>
>> So the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created. So addChild
>> just becomes something like this:
>>
>>    def addChild(e: DIElement) {
>>      if (e.erd.isArray) {
>>        if (children.isEmpty || children.last.erd != e.erd) {
>>          // no children, or last child is not a DIArray for
>>          // this element, create the DIArray
>>          children.append(new DIArray(...))
>>        }
>>        // the array is now always last, add the new child to it
>>        children.last.append(e)
>>      } else {
>>        children.append(e)
>>      }
>>    }
>>
>> So we no longer need a unique ID or a tuple for the HashMap key. The
>> fast HashMap lookup only needs to work on the name of the element, which
>> allows query-style expressions to work in the future. And we now only
>> need to insert DIArray's into the fast HashMap if they are used in
>> expressions addChild no longer needs getChildArray, minimizing hash
>> look ups.
>>
>> Related, I'm wondering if getChildArray should change as well so that it
>> does not create a DIArray if it doesn't exist? I think it seems
>> reasonable that only addChild should create new DIArrays. And if
>> something like DPath calls getChildArray, we probably don't want to end
>> up creating an empty DIArray if it doesn't exist. It should probably
>> just throw an InfosetNoSuchChildElementException if the array doesn't
>> exist, just like getChild.
>>
>> - Steve
>>
>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>> The ID might cause some issues when we want to support query-style
>>> expressions. For example, say we have a schema like this:
>>>
>>> <sequence>
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>   <xs:element name="element" ... />
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>> </xs:sequence>
>>>
>>> So two arrays with the same name separated by a single element. Each
>>> element would have a unique ID, so would be inserted in the hash table
>>> as something like
>>>
>>> ("array", 1) -> Seq(DIArray)
>>> ("element", 2) -> Seq(DISimple)
>>> ("array", 3) -> Seq(DIArray)
>>>
>>> A query style expression that accesses the "array" elements (e.g.
>>> fn:count(./array)) would have trouble finding both arrays to get the
>>> full count since the keys for the two arrays are the same.
>>>
>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>> up with something like this:
>>>
>>> "array" -> Seq(DIArray, DIArray)
>>> "element" -> Seq(DISimple)
>>>
>>> And then just iterator of the Seq until we find DIArray with the right
>>> ID? That's could have a performance hit if there were lots of things
>>> with the same name, but I suspect that won't be super common.
>>>
>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>> when an DIArray is created, it's store in the PState for quick access,
>>> and getChildArray would essentially go away.
>>>
>>> - Steve
>>>
>>>
>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>>
>>>>
>>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>>
>>>>
>>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>>
>>>>
>>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>>
>>>> ________________________________
>>>> From: Taylor Wise
>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>>
>>>> I'm down to a single error: test_dfdlPosition3
>>>>
>>>>
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>> Actual
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>> Differences were (path, expected, actual):
>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>     <xs:element name="f">
>>>>       <xs:complexType>
>>>>         <xs:sequence>
>>>>           <xs:element name="cnt1" type="xs:int"
>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>           <xs:element name="cnt2" type="xs:int"
>>>>             dfdl:lengthKind="delimited" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>         </xs:sequence>
>>>>       </xs:complexType>
>>>>     </xs:element>
>>>>
>>>>
>>>>
>>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>>
>>>> ChildNodes aka Insertion Order
>>>> ====
>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>>
>>>> FastLookup
>>>> ====
>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>>
>>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>
>>>> How do we fix this?
>>>>
>>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences,  this  shouldn't matter.  We shouldn't run into two or more members with the same
>> name and namespace.  Correct?
>>>>
>>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>
>>>>>
>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>>> z
>>>>>
>>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>>
>>>>>
>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>     c.restoreFrom(this)
>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>     super.restoreInto(e)
>>>>>   }
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>   // Shouldn't I be able to just say something like...
>>>>>  //
>>>>>  childNodes.clear()
>>>>>  childNodes.append(cs.childNodes)
>>>>>  fastLookup.clear()
>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>        fastLookup.put(name, seqDINodes) )
>>>>> }
>>>>
>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>> remove the elements that are being backtracked.
>>>>
>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>> something like this (very similar to the existing restoreFrom):
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   var i = childNodes.length - 1
>>>>
>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>   // this should always be the last element in the hashmap seq
>>>>   while (i >= cs._numChildren) {
>>>>     var childToRemove = childNodes(i)
>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>     if (fastSeq.length == 1)
>>>>       // last one, remove the whole key entry
>>>>       fastLookup.remove(childToRemove.name)
>>>>     else
>>>>       // not the last one, just drop the end
>>>>       fastSeq.dropRight(1)
>>>>     i -= 1
>>>>   }
>>>>
>>>>   // now just quickly remove all those children
>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>   _numChildren = cs._numChildren
>>>>
>>>>   if (cs._arraySize.isDefined) {
>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>   }
>>>> }
>>>>
>>>> The logic here is very similar to the CURRENT code, just is a little
>>>> more simple since it can make certain assumptions. Like removing
>>>> something form the fastSeq just means removing the last one
>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>> DIArrays to simplify that logic.
>>>>
>>>> Note that we might want to rethink about the type fastLookup value.
>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>
>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>
>>>>
>>>>
>>>>
>>>>> // ORIGINAL
>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>   //
>>>>>   //    var i = _lastSlotAdded
>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>   //      _slots(i) = null
>>>>>   //      i -= 1
>>>>>   //    }
>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>   //
>>>>>   //    // check invariant. All slots after last slot added
>>>>>   //    // should be null.
>>>>>   //    {
>>>>>   //      var i = _lastSlotAdded + 1
>>>>>   //      while (i < _slots.length) {
>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>   //        i = i + 1
>>>>>   //      }
>>>>>   //    }
>>>>>   //
>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>   //    }
>>>>>   //  }
>>>>>
>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>>> //
>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>
>>>>>     var i = _lastSlotAdded
>>>>>     while (i > cs._lastSlotAdded) {
>>>>>       val lastNode = childNodes.last
>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>
>>>>>       lastNode match {
>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>           // No need to delete the array, just the last DINode in it
>>>>>           arr.pop()
>>>>>         }
>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>           // Last DINode in array, remove the array entirely
>>>>>           //
>>>>>           arr.pop()
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>         case _ => {
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>
>>>>>   }
>>>>>
>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>> Actual
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>> Differences were (path, expected, actual):
>>>>>  (hb_root01,List(),List(<text/>))
>>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's  no  way to avoid the hash per each append.
>>>>>
>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>> that are used in expressions. We came up with that optimization so that
>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>> much this will affect performance, but there are ways to avoid the need
>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>
>>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize
>> without profiling info, etc.
>>>>>>
>>>>>>
>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>>
>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>> where array starts and ends. From there, we can support any model that
>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>> get it to a different model, but other things are more complicated.
>>>>>
>>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.)
>> The opportunity to do this is entirely lost if you don't have DIArray, but just
>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>
>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>> hashmap.
>>>>>
>>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>>>
>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>> trimming of unused Infoset elements is going to have some complex
>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>> we would still know elements are part of an array, it just takes a
>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>
>>>>>
>>>>>
>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>> around, but it might be worth thinking a little bit about what changes
>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>> for large arrays, which are not uncommon.
>>>>>
>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>
>>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>>> every time we want to append to a DIArray?
>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>> limited to just DPath.
>>>>>>
>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>> still need to know if a DIElement is part of an array for the
>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>
>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>> though.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Current Implementation:
>>>>>>>
>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>
>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>
>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>
>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>
>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>
>>>>>>>      arr.append(e)
>>>>>>>
>>>>>>>    } else {
>>>>>>>
>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>
>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>
>>>>>>>    }
>>>>>>>
>>>>>>>    e.setParent(this)
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>    }
>>>>>>>
>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>>      Assert.usage(info.isArray)
>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>      getChildArray(slot)
>>>>>>>    }
>>>>>>>
>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>      val slotVal = _slots(slot)
>>>>>>>      if (slotVal ne null)
>>>>>>>        slotVal match {
>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>        }
>>>>>>>      else {
>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>        // create one (it will have zero entries)
>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>        setChildArray(slot, ia)
>>>>>>>        ia
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>
>>>>>>>> Reasons:
>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>>> one node.
>>>>>>>>
>>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>> inherit from DINode.
>>>>>>>>
>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>
>>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>> lookup.
>>>>>>>>
>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>> allocating/appending to a sequence.
>>>>>>>>
>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>>> it's possible for the following:
>>>>>>>>
>>>>>>>>     <element name="a" .../>
>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>     <element name="b".../>
>>>>>>>>     <element name="a" ..../>
>>>>>>>>
>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>>
>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>
>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>
>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>> contents of the DIArray.
>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>>> with the modified DIArray.
>>>>>>>
>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>
>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>
>>>>>>>>     <element name="root>
>>>>>>>>          <sequence ../>
>>>>>>>>          <element name="a" .../>
>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>            ...
>>>>>>>>            <element name="b" .../>
>>>>>>>>            <element name="c" .../>
>>>>>>>>            ...
>>>>>>>>          </element>
>>>>>>>>        </sequence>
>>>>>>>>     </element>
>>>>>>>>
>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>      0 -> DIElement("a")
>>>>>>>>      1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>      0 -> DIElement("b")
>>>>>>>>      1 -> DIElement("c")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>      DIArray:
>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>          1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>      DIArray:
>>>>>>>>          Empty
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>>                                         // removed and is now empty
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> class DIComplex {
>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>
>>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>> reasonable.
>>>>>>>
>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>
>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>> really only necessary to support query style expressions, but that comes
>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>> until remove those restrictions.
>>>>>>>
>>>>>>>>
>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>     children += node
>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def backtrack(int n) {
>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>       val endIdx = children.size -1
>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>
>>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>>         case None => // do nothing
>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>           lastEntry match {
>>>>>>>>         case arr: DIArray => {
>>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>>> nodesSeq
>>>>>>>>               // with this array:
>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>>> arr))
>>>>>>>>               //
>>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>         }
>>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>           }
>>>>>>>>
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> class DIArray {
>>>>>>>>     private val initialSize =
>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>>
>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>         ie.setArray(this)
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     //
>>>>>>>>     // For backtracking purposes
>>>>>>>>     //
>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>         _contents.remove(n)
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>
>>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>
>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>
>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>
>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>> for the Term.
>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>> model group, then its first child term's
>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>
>>>>>>>
>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>>> compilation.
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

On 10/17/2017 07:57 PM, Taylor Wise wrote:
> Been debugging after implementing your suggested changes.
> 
> 
> First thing to note, is that by moving the creation of the array out of 
> getChildArray that I had to make additional calls to 'addChild' in the 
> InfosetInputter.  This corrected things on the 'parser' side.  However, on the 
> unparser side it became evident that this 'addChild' function cannot always set 
> the parent.  There's an invariant expecting parent to be null along the 
> execution branch of the unparser side.  As a result, I modified addChild to also 
> accept a default boolean called 'setParent'.  It defaults to true unless 
> specified.  So now when we call 'addChild' in InfosetInputter on the unparser 
> side, we actually call 'addChild(e, false)'.

This sounds to me like addChild is getting called twice on the same
element and so it's parent is trying to be set twice. Might need some
more investigation. Maybe the InfosetInputter needs more tweaks to
support not using getChildArray? Hard to say without seeing your
InfosetInputter changes.

Regarding the below tests, I think your assessment is correct. The
password element doesn't exist and so we should never even try to get
the -1 index. Instead we should get something like "password element
does not exist" or something. It would be useful to have that test to
ensure we get that error message.

An index of -1 (or anything less than 1) can definitely be caught
statically at schema compile time. Worth creating a bug for that. But
there will be times where the index is calculated so we'll still need
this runtime check. I think it's still a test worth having.

I think you're modifications are reasonable based on what the tests are
trying to tests. Probably to proper coverage of tests we need is:

- trying to access an array that does not exist with both valid and
invalid indicies (less than 1 and greater than array size)
- trying to access an array thta does exist but with invalid indicies
(less than 1 and greater than array size)

So it's probably worth keeping the existing tests for the 'array does
not exist' tests and ensure we get a reasonable error message not
talking about indices. And then just make your modifications new tests.

> 
> The above correction reduced the errors from 30+ to just 2.  The only two errors 
> remaining have to do with an array index out of bounds error that's expected.  I 
> believe that it's actually an issue with the test and not the code.
> 
> 
> Error:
> 
> edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message 
> "Value -1 is out of range" in any of the actual diagnostic messages:
> Runtime Schema Definition Error: Expression Evaluation Error: Child element 
> {http://example.com}password does not exist.
> Schema context: hideme Location line 267 column 12 in 
> file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
> Data location was preceding byte 0 limit(bytes) 8
> 
> Well, password doesn't exist because we are missing the 'pw:' initiator in the 
> tdml:document content.  The code now doesn't know to create it because we 
> haven't seen the initiator.  At least that's what I surmise.
> 
> 
> Existing test:
> 
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array 
> index in an expression is out of bounds
>    -->
> 
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
> 
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
> 
> Schema:
>      <xs:group name="hiddenData">
>        <xs:sequence dfdl:separator="|">
>          <xs:element name="password" type="xs:string" dfdl:initiator="pw:" 
> minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
>        </xs:sequence>
>      </xs:group>
> 
>      <xs:element name="e1">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
> fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
>      <xs:element name="e1b">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
> fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
> 
> 
> My modification:
> 
>    <!--
>       Test Name: arrayIndexOutOfBounds_02
>          Schema: hiddenElem
>            Root: e1b
>         Purpose: This test demonstrates that a proper error occurs if an array 
> index in an expression is out of bounds
>    -->
> 
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
> 
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value -1 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
> 
> But honestly, should we really need to do this?  Shouldn't the code know that -1 
> is not a valid index without having to force construction of the password array?
> 
> I'm similarly going to have to modify this other test as well by constructing a 
> different root that attempts to access a character in a position greater than 
> the length of the array, unless there's something I'm missing:
> 
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array 
> index in an expression is out of bounds
>    -->
> 
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
> 
>      <tdml:document>p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 1 is out of range</tdml:error>
>          <tdml:error>with length 0</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
> 
> Proposed addition to defineSchema:
> 
>      <xs:element name="e1c">
>        <xs:complexType>
>          <xs:sequence>
>            <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
>            <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ 
> fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
>          </xs:sequence>
>        </xs:complexType>
>      </xs:element>
> 
> Proposed modification to test:
> 
>    <!--
>       Test Name: arrayIndexOutOfBounds_01
>          Schema: hiddenElem
>            Root: e1
>         Purpose: This test demonstrates that a proper error occurs if an array 
> index in an expression is out of bounds
>    -->
> 
>    <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
>      model="hiddenElem" description="Section 23 - DFDL Expressions">
> 
>      <tdml:document>pw:p455w0rd</tdml:document>
>        <tdml:errors>
>          <tdml:error>Schema Definition Error</tdml:error>
>          <tdml:error>expression evaluation error</tdml:error>
>          <tdml:error>Value 10 is out of range</tdml:error>
>          <tdml:error>with length 1</tdml:error>
>        </tdml:errors>
>    </tdml:parserTestCase>
> 
> 
> 
> 
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 11:04:03 AM
> *To:* Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> I don't think a DIArray could be created earlier and things appended to
> it. But, currently, I think an empty DIArray could be created earlier.
> 
> And I think this is really just a problem with getChildArray always
> creating DIArrays. What I suspect happened is that we had an array
> following by something that accessed that array, e.g.:
> 
> <xs:sequence>
>    <xs:element name="arr" maxOccurs="unbounded" ... />
>    <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
> </xs:sequence>
> 
> arr was slot 1, count was slot 2.
> 
> The setChildArray function used to just do something like
> 
>    _lastSlotAdded = slot
> 
> Which worked in the case of adding a new array--we add a new array, so
> the slot it was added in was the last slot added.
> 
> But when fn:count was called, _lastSlotAdded was 2 (from the count
> element). fn:count ended up calling getChildArray, which created a
> DIArray and called setChildArray, which overwrote _lastSlotAdded to the
> slot of the array, i.e. 1. So even though the array was finished and
> couldn't be modified, the use of getChildArray in DPath ended up setting
> _lastSlotAdded to the wrong value. Using max was a way to fix this so we
> couldn't set it to an earlier value.
> 
> So, now, because getChildArray create elements, it sounds like things
> could be created out of order. Rather than inserting an empty DIArray
> into a slot, it will added it to the end and potentially mess things up.
> This says to me we definitely want to modify getChildArray to not create
> DIArrays. It should either return the array if it exists, or throw an
> Infoset exception, very similar to getChild. Then we don't have to worry
> about DIArray's being created at the wrong time. The only thing that
> should create a DIArray is addChild.
> 
> 
> On 10/17/2017 10:13 AM, Taylor Wise wrote:
>> Hmm, very interesting.  This would greatly simplify things.  But what about this
>> code snippet from before my changes:
>> 
>> 
>>    //  final def setChildArray(slot: Int, arr: DIArray) {
>>    //    Assert.invariant(_slots(slot) eq null)
>>    //    _slots(slot) = arr
>>    //    //
>>    //    // Because arrays are created if not existing, an expression like ../A[1]
>>    //    // can cause array A to come into existence (though empty) and A might be
>>    //    // part of an expression that is reaching backward into earlier data (when
>>    //    // parsing) so that we might be creating an array earlier in the slots
>>    //    // of this element. Hence, the slot being passed here is not necessarily
>>    //    // the last slot added. It might be before it.
>>    //    //
>>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>>    //  }
>> Given your statement:
>> 
>> 
>> "the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created."
>> 
>> 
>> To me it seems to imply that a DIArray can come into existence without any 
>> contents.  But then be appended to at a later time.  In which case it's possible
>> that this DIArray then is not the last item appended.  It could be somewhere 
>> earlier and then need to be added to.
>> 
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
>> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> Actually, I think we're over complicating the array situation. I think
>> we can avoid needing a unique ID and I think we can also avoid needing
>> to always insert arrays into a HashMap.
>> 
>> When addChild is called to add an DIElement to an array, there are two
>> cases we need to think about:
>> 
>> 1) The DIArray does not exist and it must be created. The
>> DIElement is appended to its contents.
>> 
>> 2) The DIArray does exist and we just need to find it. The
>> DIElement is appended to its contents.
>> 
>> So the issue is determining if the DIArray exists or not. Everything
>> after that is the same. And since children are inserted in the order
>> that we see them, a DIArray will always be at the end of the children
>> ArrayBuffer if it exists. If the DINode at the end of the children
>> ArrayBuffer is not this array, it needs to be created. So addChild
>> just becomes something like this:
>> 
>>    def addChild(e: DIElement) {
>>      if (e.erd.isArray) {
>>        if (children.isEmpty || children.last.erd != e.erd) {
>>          // no children, or last child is not a DIArray for
>>          // this element, create the DIArray
>>          children.append(new DIArray(...))
>>        }
>>        // the array is now always last, add the new child to it
>>        children.last.append(e)
>>      } else {
>>        children.append(e)
>>      }
>>    }
>> 
>> So we no longer need a unique ID or a tuple for the HashMap key. The
>> fast HashMap lookup only needs to work on the name of the element, which
>> allows query-style expressions to work in the future. And we now only
>> need to insert DIArray's into the fast HashMap if they are used in
>> expressions addChild no longer needs getChildArray, minimizing hash
>> look ups.
>> 
>> Related, I'm wondering if getChildArray should change as well so that it
>> does not create a DIArray if it doesn't exist? I think it seems
>> reasonable that only addChild should create new DIArrays. And if
>> something like DPath calls getChildArray, we probably don't want to end
>> up creating an empty DIArray if it doesn't exist. It should probably
>> just throw an InfosetNoSuchChildElementException if the array doesn't
>> exist, just like getChild.
>> 
>> - Steve
>> 
>> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>>> The ID might cause some issues when we want to support query-style
>>> expressions. For example, say we have a schema like this:
>>> 
>>> <sequence>
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>>   <xs:element name="element" ... />
>>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>> </xs:sequence>
>>> 
>>> So two arrays with the same name separated by a single element. Each
>>> element would have a unique ID, so would be inserted in the hash table
>>> as something like
>>> 
>>> ("array", 1) -> Seq(DIArray)
>>> ("element", 2) -> Seq(DISimple)
>>> ("array", 3) -> Seq(DIArray)
>>> 
>>> A query style expression that accesses the "array" elements (e.g.
>>> fn:count(./array)) would have trouble finding both arrays to get the
>>> full count since the keys for the two arrays are the same.
>>> 
>>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>>> up with something like this:
>>> 
>>> "array" -> Seq(DIArray, DIArray)
>>> "element" -> Seq(DISimple)
>>> 
>>> And then just iterator of the Seq until we find DIArray with the right
>>> ID? That's could have a performance hit if there were lots of things
>>> with the same name, but I suspect that won't be super common.
>>> 
>>> Alternatively, we could relook at chaning the InfosetImpl API so that
>>> when an DIArray is created, it's store in the PState for quick access,
>>> and getChildArray would essentially go away.
>>> 
>>> - Steve
>>> 
>>> 
>>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>>
>>>>
>>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>>
>>>>
>>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>>
>>>>
>>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>>
>>>> ________________________________
>>>> From: Taylor Wise
>>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>>
>>>> I'm down to a single error: test_dfdlPosition3
>>>>
>>>>
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>>> Actual
>>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>>> Differences were (path, expected, actual):
>>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>     <xs:element name="f">
>>>>       <xs:complexType>
>>>>         <xs:sequence>
>>>>           <xs:element name="cnt1" type="xs:int"
>>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>           <xs:element name="cnt2" type="xs:int"
>>>>             dfdl:lengthKind="delimited" />
>>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>>             <xs:complexType>
>>>>               <xs:sequence>
>>>>                 <xs:element name="v" type="xs:int"
>>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>>               </xs:sequence>
>>>>             </xs:complexType>
>>>>           </xs:element>
>>>>         </xs:sequence>
>>>>       </xs:complexType>
>>>>     </xs:element>
>>>>
>>>>
>>>>
>>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>>
>>>> ChildNodes aka Insertion Order
>>>> ====
>>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>>
>>>> FastLookup
>>>> ====
>>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>>
>>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
>> Seq(DIArray(a, ...), DIArray(a, ...))
>>>>
>>>> How do we fix this?
>>>>
>>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences,  this  shouldn't matter.  We shouldn't run into two or more members with the same
>> name and namespace.  Correct?
>>>>
>>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>>
>>>>>
>>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>>> z
>>>>>
>>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>>
>>>>>
>>>>>   abstract override def restoreInto(e: DIElement) {
>>>>>     val c = e.asInstanceOf[DIComplex]
>>>>>     c.restoreFrom(this)
>>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>>     super.restoreInto(e)
>>>>>   }
>>>>>
>>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>>   // Shouldn't I be able to just say something like...
>>>>>  //
>>>>>  childNodes.clear()
>>>>>  childNodes.append(cs.childNodes)
>>>>>  fastLookup.clear()
>>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>>        fastLookup.put(name, seqDINodes) )
>>>>> }
>>>>
>>>> This is going to be somewhat inefficient. This essentially deletes and
>>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>>> we backtrack. This could be pretty back if we childNodes had 1000
>>>> elements and we just wanted to backtrack 1. Instead we just want to
>>>> remove the elements that are being backtracked.
>>>>
>>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>>> something like this (very similar to the existing restoreFrom):
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   var i = childNodes.length - 1
>>>>
>>>>   // for each child we will remove, remove it from the fastLookup map.
>>>>   // this should always be the last element in the hashmap seq
>>>>   while (i >= cs._numChildren) {
>>>>     var childToRemove = childNodes(i)
>>>>     var fastSeq = fastLookup(childToRemove.name)
>>>>     if (fastSeq.length == 1)
>>>>       // last one, remove the whole key entry
>>>>       fastLookup.remove(childToRemove.name)
>>>>     else
>>>>       // not the last one, just drop the end
>>>>       fastSeq.dropRight(1)
>>>>     i -= 1
>>>>   }
>>>>
>>>>   // now just quickly remove all those children
>>>>   childNodes.reduceToSize(cs._numChildren)
>>>>   _numChildren = cs._numChildren
>>>>
>>>>   if (cs._arraySize.isDefined) {
>>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>>   }
>>>> }
>>>>
>>>> The logic here is very similar to the CURRENT code, just is a little
>>>> more simple since it can make certain assumptions. Like removing
>>>> something form the fastSeq just means removing the last one
>>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>>> DIArrays to simplify that logic.
>>>>
>>>> Note that we might want to rethink about the type fastLookup value.
>>>> Rather than the value being a Seq, it probably wants to be some sort of
>>>> mutable Seq that has constant time append dropRight. Maybe another
>>>> ArrayBuf is the right data structure here? That would allow something like
>>>>
>>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>>
>>>>
>>>>
>>>>
>>>>> // ORIGINAL
>>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>   //
>>>>>   //    var i = _lastSlotAdded
>>>>>   //    while (i > cs._lastSlotAdded) {
>>>>>   //      _slots(i) = null
>>>>>   //      i -= 1
>>>>>   //    }
>>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>>   //
>>>>>   //    // check invariant. All slots after last slot added
>>>>>   //    // should be null.
>>>>>   //    {
>>>>>   //      var i = _lastSlotAdded + 1
>>>>>   //      while (i < _slots.length) {
>>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>>   //        i = i + 1
>>>>>   //      }
>>>>>   //    }
>>>>>   //
>>>>>   //    if (cs._arraySize.isDefined) {
>>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>>   //    }
>>>>>   //  }
>>>>>
>>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>>> //
>>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>>
>>>>>     var i = _lastSlotAdded
>>>>>     while (i > cs._lastSlotAdded) {
>>>>>       val lastNode = childNodes.last
>>>>>       val name = lastNode.namedQName.toQNameString
>>>>>
>>>>>       lastNode match {
>>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>>           // No need to delete the array, just the last DINode in it
>>>>>           arr.pop()
>>>>>         }
>>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>>           // Last DINode in array, remove the array entirely
>>>>>           //
>>>>>           arr.pop()
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely.
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>         case _ => {
>>>>>           //
>>>>>           // Update insertion order list
>>>>>           //
>>>>>           childNodes.remove(childNodes.length - 1)
>>>>>           //
>>>>>           // Update Fast Lookup
>>>>>           //
>>>>>           val seq = fastLookup(name)
>>>>>           if (seq.length <= 1) {
>>>>>             // Last DINode with THIS name, remove it entirely
>>>>>             //
>>>>>             fastLookup.remove(name)
>>>>>           } else {
>>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>>             //
>>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>>             val result = seq diff Seq(lastNode)
>>>>>             fastLookup.update(name, result)
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>
>>>>>   }
>>>>>
>>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>>> java.lang.Exception:
>>>>> Comparison failed.
>>>>> Expected
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>>> Actual
>>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>>> Differences were (path, expected, actual):
>>>>>  (hb_root01,List(),List(<text/>))
>>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's  no  way to avoid the hash per each append.
>>>>>
>>>>> The hashmap and hashlookups should really only be needed for elements
>>>>> that are used in expressions. We came up with that optimization so that
>>>>> we could remove the vast majority of hash lookups. Due to how
>>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>>> much this will affect performance, but there are ways to avoid the need
>>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>>> DIArray was one. Maybe API/parser changes is better.
>>>>>
>>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize
>> without profiling info, etc.
>>>>>>
>>>>>>
>>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>>
>>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>>> current DIElements ERD to the previous and next is enough to determine
>>>>> where array starts and ends. From there, we can support any model that
>>>>> requires a concept of an array. I'm not saying we should remove the
>>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>>> get it to a different model, but other things are more complicated.
>>>>>
>>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.)
>> The opportunity to do this is entirely lost if you don't have DIArray, but just
>> a bunch of adjacent DIElements that happen to share ERD.
>>>>>
>>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>>> hashmap.
>>>>>
>>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>>>
>>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>>> trimming of unused Infoset elements is going to have some complex
>>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>>> necessarily make this easier. But this is just speculation. And again,
>>>>> we would still know elements are part of an array, it just takes a
>>>>> couple extra comparisons to determine start/end. But I do agree that
>>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>>
>>>>>
>>>>>
>>>>> It seems to me these are good enough arguments for keeping DIArray
>>>>> around, but it might be worth thinking a little bit about what changes
>>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>>> for large arrays, which are not uncommon.
>>>>>
>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>
>>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>>
>>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>>> every time we want to append to a DIArray?
>>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>>> limited to just DPath.
>>>>>>
>>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>>> still need to know if a DIElement is part of an array for the
>>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>>
>>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>>> Parser) and appends straight to that without having to ask the parent
>>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>>> though.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Current Implementation:
>>>>>>>
>>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>>
>>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>>
>>>>>>>    // Make sure there is an array to accept the child
>>>>>>>
>>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>>
>>>>>>>      Assert.invariant(arr ne null)
>>>>>>>
>>>>>>>      arr.append(e)
>>>>>>>
>>>>>>>    } else {
>>>>>>>
>>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>>
>>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>>
>>>>>>>    }
>>>>>>>
>>>>>>>    e.setParent(this)
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>>      Assert.usage(childERD.isArray)
>>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>>    }
>>>>>>>
>>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>>      Assert.usage(info.isArray)
>>>>>>>      val slot = info.slotIndexInParent
>>>>>>>      getChildArray(slot)
>>>>>>>    }
>>>>>>>
>>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>>      val slotVal = _slots(slot)
>>>>>>>      if (slotVal ne null)
>>>>>>>        slotVal match {
>>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>>        }
>>>>>>>      else {
>>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>>        // create one (it will have zero entries)
>>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>>        // no array there yet. So we have to create one.
>>>>>>>        setChildArray(slot, ia)
>>>>>>>        ia
>>>>>>>      }
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>>
>>>>>>>> Reasons:
>>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>>> one node.
>>>>>>>>
>>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>>> inherit from DINode.
>>>>>>>>
>>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>>
>>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>>> lookup.
>>>>>>>>
>>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>>> allocating/appending to a sequence.
>>>>>>>>
>>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>>> it's possible for the following:
>>>>>>>>
>>>>>>>>     <element name="a" .../>
>>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>>     <element name="b".../>
>>>>>>>>     <element name="a" ..../>
>>>>>>>>
>>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>>
>>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>>
>>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>>> follows:
>>>>>>>>
>>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>>
>>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>>> contents of the DIArray.
>>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>>> with the modified DIArray.
>>>>>>>
>>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>>
>>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>>
>>>>>>>>     <element name="root>
>>>>>>>>          <sequence ../>
>>>>>>>>          <element name="a" .../>
>>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>>            ...
>>>>>>>>            <element name="b" .../>
>>>>>>>>            <element name="c" .../>
>>>>>>>>            ...
>>>>>>>>          </element>
>>>>>>>>        </sequence>
>>>>>>>>     </element>
>>>>>>>>
>>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>>      0 -> DIElement("a")
>>>>>>>>      1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>>      0 -> DIElement("b")
>>>>>>>>      1 -> DIElement("c")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>>      DIArray:
>>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>          1 -> DIArray("a")
>>>>>>>>
>>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>>      DIArray:
>>>>>>>>          Empty
>>>>>>>>
>>>>>>>>      HashMap DIComplex-root:
>>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>>                                         // removed and is now empty
>>>>>>>>
>>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>>          0 -> DIElement("a")
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> class DIComplex {
>>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>>
>>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>>> reasonable.
>>>>>>>
>>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>>
>>>>>>> One thing that might make this change easier is to not worry about the
>>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>>> really only necessary to support query style expressions, but that comes
>>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>>> until remove those restrictions.
>>>>>>>
>>>>>>>>
>>>>>>>>   def addChild(node: DINode) {
>>>>>>>>     children += node
>>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   def backtrack(int n) {
>>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>>       val endIdx = children.size -1
>>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>>
>>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>>         case None => // do nothing
>>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>>         case Some(nodesSeq) => {
>>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>>           lastEntry match {
>>>>>>>>         case arr: DIArray => {
>>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>>> nodesSeq
>>>>>>>>               // with this array:
>>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>>> arr))
>>>>>>>>               //
>>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>         }
>>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>>           }
>>>>>>>>
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> class DIArray {
>>>>>>>>     private val initialSize =
>>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>>
>>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>>         ie.setArray(this)
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     //
>>>>>>>>     // For backtracking purposes
>>>>>>>>     //
>>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>>         _contents.remove(n)
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> ===
>>>>>>>>
>>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>>
>>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>>> sorting when an element contains an unordred sequence.
>>>>>>>
>>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>>
>>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>>> an endingChildElementDeclPosition.
>>>>>>>>
>>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>>> for the Term.
>>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>>> model group, then its first child term's
>>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>>
>>>>>>>
>>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>>> compilation.
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>> 
>> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

Been debugging after implementing your suggested changes.


First thing to note, is that by moving the creation of the array out of getChildArray that I had to make additional calls to 'addChild' in the InfosetInputter.  This corrected things on the 'parser' side.  However, on the unparser side it became evident that this 'addChild' function cannot always set the parent.  There's an invariant expecting parent to be null along the execution branch of the unparser side.  As a result, I modified addChild to also accept a default boolean called 'setParent'.  It defaults to true unless specified.  So now when we call 'addChild' in InfosetInputter on the unparser side, we actually call 'addChild(e, false)'.


The above correction reduced the errors from 30+ to just 2.  The only two errors remaining have to do with an array index out of bounds error that's expected.  I believe that it's actually an issue with the test and not the code.


Error:

edu.illinois.ncsa.daffodil.tdml.TDMLException: Did not find diagnostic message "Value -1 is out of range" in any of the actual diagnostic messages:
Runtime Schema Definition Error: Expression Evaluation Error: Child element {http://example.com}password does not exist.
Schema context: hideme Location line 267 column 12 in file:/tmp/hiddenElem2658119394211244932.dfdl.xsd
Data location was preceding byte 0 limit(bytes) 8

Well, password doesn't exist because we are missing the 'pw:' initiator in the tdml:document content.  The code now doesn't know to create it because we haven't seen the initiator.  At least that's what I surmise.


Existing test:

  <!--
     Test Name: arrayIndexOutOfBounds_02
        Schema: hiddenElem
          Root: e1b
       Purpose: This test demonstrates that a proper error occurs if an array index in an expression is out of bounds
  -->

  <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
    model="hiddenElem" description="Section 23 - DFDL Expressions">

    <tdml:document>p455w0rd</tdml:document>
      <tdml:errors>
        <tdml:error>Schema Definition Error</tdml:error>
        <tdml:error>expression evaluation error</tdml:error>
        <tdml:error>Value -1 is out of range</tdml:error>
        <tdml:error>with length 0</tdml:error>
      </tdml:errors>
  </tdml:parserTestCase>

Schema:
    <xs:group name="hiddenData">
      <xs:sequence dfdl:separator="|">
        <xs:element name="password" type="xs:string" dfdl:initiator="pw:" minOccurs="0" maxOccurs="1" dfdl:lengthKind="delimited"/>
      </xs:sequence>
    </xs:group>

    <xs:element name="e1">
      <xs:complexType>
        <xs:sequence>
          <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
          <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ fn:concat(fn:substring(/ex:e1/ex:password[1], 1, 1), '***') }"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

    <xs:element name="e1b">
      <xs:complexType>
        <xs:sequence>
          <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
          <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ fn:concat(fn:substring(/ex:e1b/ex:password[-1], 1, 1), '***') }"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>


My modification:

  <!--
     Test Name: arrayIndexOutOfBounds_02
        Schema: hiddenElem
          Root: e1b
       Purpose: This test demonstrates that a proper error occurs if an array index in an expression is out of bounds
  -->

  <tdml:parserTestCase name="arrayIndexOutOfBounds_02" root="e1b"
    model="hiddenElem" description="Section 23 - DFDL Expressions">

    <tdml:document>pw:p455w0rd</tdml:document>
      <tdml:errors>
        <tdml:error>Schema Definition Error</tdml:error>
        <tdml:error>expression evaluation error</tdml:error>
        <tdml:error>Value -1 is out of range</tdml:error>
        <tdml:error>with length 1</tdml:error>
      </tdml:errors>
  </tdml:parserTestCase>

But honestly, should we really need to do this?  Shouldn't the code know that -1 is not a valid index without having to force construction of the password array?

I'm similarly going to have to modify this other test as well by constructing a different root that attempts to access a character in a position greater than the length of the array, unless there's something I'm missing:

  <!--
     Test Name: arrayIndexOutOfBounds_01
        Schema: hiddenElem
          Root: e1
       Purpose: This test demonstrates that a proper error occurs if an array index in an expression is out of bounds
  -->

  <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1"
    model="hiddenElem" description="Section 23 - DFDL Expressions">

    <tdml:document>p455w0rd</tdml:document>
      <tdml:errors>
        <tdml:error>Schema Definition Error</tdml:error>
        <tdml:error>expression evaluation error</tdml:error>
        <tdml:error>Value 1 is out of range</tdml:error>
        <tdml:error>with length 0</tdml:error>
      </tdml:errors>
  </tdml:parserTestCase>

Proposed addition to defineSchema:

    <xs:element name="e1c">
      <xs:complexType>
        <xs:sequence>
          <xs:sequence dfdl:hiddenGroupRef="ex:hiddenData"/>
          <xs:element name="hideme" type="xs:string" dfdl:inputValueCalc="{ fn:concat(fn:substring(/ex:e1c/ex:password[10], 1, 1), '***') }"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

Proposed modification to test:

  <!--
     Test Name: arrayIndexOutOfBounds_01
        Schema: hiddenElem
          Root: e1
       Purpose: This test demonstrates that a proper error occurs if an array index in an expression is out of bounds
  -->

  <tdml:parserTestCase name="arrayIndexOutOfBounds_01" root="e1c"
    model="hiddenElem" description="Section 23 - DFDL Expressions">

    <tdml:document>pw:p455w0rd</tdml:document>
      <tdml:errors>
        <tdml:error>Schema Definition Error</tdml:error>
        <tdml:error>expression evaluation error</tdml:error>
        <tdml:error>Value 10 is out of range</tdml:error>
        <tdml:error>with length 1</tdml:error>
      </tdml:errors>
  </tdml:parserTestCase>





________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Tuesday, October 17, 2017 11:04:03 AM
To: Taylor Wise; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

I don't think a DIArray could be created earlier and things appended to
it. But, currently, I think an empty DIArray could be created earlier.

And I think this is really just a problem with getChildArray always
creating DIArrays. What I suspect happened is that we had an array
following by something that accessed that array, e.g.:

<xs:sequence>
  <xs:element name="arr" maxOccurs="unbounded" ... />
  <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
</xs:sequence>

arr was slot 1, count was slot 2.

The setChildArray function used to just do something like

  _lastSlotAdded = slot

Which worked in the case of adding a new array--we add a new array, so
the slot it was added in was the last slot added.

But when fn:count was called, _lastSlotAdded was 2 (from the count
element). fn:count ended up calling getChildArray, which created a
DIArray and called setChildArray, which overwrote _lastSlotAdded to the
slot of the array, i.e. 1. So even though the array was finished and
couldn't be modified, the use of getChildArray in DPath ended up setting
_lastSlotAdded to the wrong value. Using max was a way to fix this so we
couldn't set it to an earlier value.

So, now, because getChildArray create elements, it sounds like things
could be created out of order. Rather than inserting an empty DIArray
into a slot, it will added it to the end and potentially mess things up.
This says to me we definitely want to modify getChildArray to not create
DIArrays. It should either return the array if it exists, or throw an
Infoset exception, very similar to getChild. Then we don't have to worry
about DIArray's being created at the wrong time. The only thing that
should create a DIArray is addChild.


On 10/17/2017 10:13 AM, Taylor Wise wrote:
> Hmm, very interesting.  This would greatly simplify things.  But what about this
> code snippet from before my changes:
>
>
>    //  final def setChildArray(slot: Int, arr: DIArray) {
>    //    Assert.invariant(_slots(slot) eq null)
>    //    _slots(slot) = arr
>    //    //
>    //    // Because arrays are created if not existing, an expression like ../A[1]
>    //    // can cause array A to come into existence (though empty) and A might be
>    //    // part of an expression that is reaching backward into earlier data (when
>    //    // parsing) so that we might be creating an array earlier in the slots
>    //    // of this element. Hence, the slot being passed here is not necessarily
>    //    // the last slot added. It might be before it.
>    //    //
>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>    //  }
> Given your statement:
>
>
> "the issue is determining if the DIArray exists or not. Everything
> after that is the same. And since children are inserted in the order
> that we see them, a DIArray will always be at the end of the children
> ArrayBuffer if it exists. If the DINode at the end of the children
> ArrayBuffer is not this array, it needs to be created."
>
>
> To me it seems to imply that a DIArray can come into existence without any
> contents.  But then be appended to at a later time.  In which case it's possible
> that this DIArray then is not the last item appended.  It could be somewhere
> earlier and then need to be added to.
>
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> Actually, I think we're over complicating the array situation. I think
> we can avoid needing a unique ID and I think we can also avoid needing
> to always insert arrays into a HashMap.
>
> When addChild is called to add an DIElement to an array, there are two
> cases we need to think about:
>
> 1) The DIArray does not exist and it must be created. The
> DIElement is appended to its contents.
>
> 2) The DIArray does exist and we just need to find it. The
> DIElement is appended to its contents.
>
> So the issue is determining if the DIArray exists or not. Everything
> after that is the same. And since children are inserted in the order
> that we see them, a DIArray will always be at the end of the children
> ArrayBuffer if it exists. If the DINode at the end of the children
> ArrayBuffer is not this array, it needs to be created. So addChild
> just becomes something like this:
>
>    def addChild(e: DIElement) {
>      if (e.erd.isArray) {
>        if (children.isEmpty || children.last.erd != e.erd) {
>          // no children, or last child is not a DIArray for
>          // this element, create the DIArray
>          children.append(new DIArray(...))
>        }
>        // the array is now always last, add the new child to it
>        children.last.append(e)
>      } else {
>        children.append(e)
>      }
>    }
>
> So we no longer need a unique ID or a tuple for the HashMap key. The
> fast HashMap lookup only needs to work on the name of the element, which
> allows query-style expressions to work in the future. And we now only
> need to insert DIArray's into the fast HashMap if they are used in
> expressions addChild no longer needs getChildArray, minimizing hash
> look ups.
>
> Related, I'm wondering if getChildArray should change as well so that it
> does not create a DIArray if it doesn't exist? I think it seems
> reasonable that only addChild should create new DIArrays. And if
> something like DPath calls getChildArray, we probably don't want to end
> up creating an empty DIArray if it doesn't exist. It should probably
> just throw an InfosetNoSuchChildElementException if the array doesn't
> exist, just like getChild.
>
> - Steve
>
> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>> The ID might cause some issues when we want to support query-style
>> expressions. For example, say we have a schema like this:
>>
>> <sequence>
>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>   <xs:element name="element" ... />
>>   <xs:element name="array" maxOccurs="unbounded" ... />
>> </xs:sequence>
>>
>> So two arrays with the same name separated by a single element. Each
>> element would have a unique ID, so would be inserted in the hash table
>> as something like
>>
>> ("array", 1) -> Seq(DIArray)
>> ("element", 2) -> Seq(DISimple)
>> ("array", 3) -> Seq(DIArray)
>>
>> A query style expression that accesses the "array" elements (e.g.
>> fn:count(./array)) would have trouble finding both arrays to get the
>> full count since the keys for the two arrays are the same.
>>
>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>> up with something like this:
>>
>> "array" -> Seq(DIArray, DIArray)
>> "element" -> Seq(DISimple)
>>
>> And then just iterator of the Seq until we find DIArray with the right
>> ID? That's could have a performance hit if there were lots of things
>> with the same name, but I suspect that won't be super common.
>>
>> Alternatively, we could relook at chaning the InfosetImpl API so that
>> when an DIArray is created, it's store in the PState for quick access,
>> and getChildArray would essentially go away.
>>
>> - Steve
>>
>>
>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>
>>>
>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>
>>>
>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>
>>>
>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>
>>> ________________________________
>>> From: Taylor Wise
>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>>
>>> I'm down to a single error: test_dfdlPosition3
>>>
>>>
>>> java.lang.Exception:
>>> Comparison failed.
>>> Expected
>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>> Actual
>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>> Differences were (path, expected, actual):
>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>
>>>     <xs:element name="f">
>>>       <xs:complexType>
>>>         <xs:sequence>
>>>           <xs:element name="cnt1" type="xs:int"
>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>             <xs:complexType>
>>>               <xs:sequence>
>>>                 <xs:element name="v" type="xs:int"
>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>               </xs:sequence>
>>>             </xs:complexType>
>>>           </xs:element>
>>>           <xs:element name="cnt2" type="xs:int"
>>>             dfdl:lengthKind="delimited" />
>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>             <xs:complexType>
>>>               <xs:sequence>
>>>                 <xs:element name="v" type="xs:int"
>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>               </xs:sequence>
>>>             </xs:complexType>
>>>           </xs:element>
>>>         </xs:sequence>
>>>       </xs:complexType>
>>>     </xs:element>
>>>
>>>
>>>
>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>
>>> ChildNodes aka Insertion Order
>>> ====
>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>
>>> FastLookup
>>> ====
>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>
>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion  order list.  And inserting it after the DIArray in the fastLookup Hash: a ->
> Seq(DIArray(a, ...), DIArray(a, ...))
>>>
>>> How do we fix this?
>>>
>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this  shouldn't matter.  We shouldn't run into two or more members with the same
> name and namespace.  Correct?
>>>
>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>
>>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>
>>>>
>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>> z
>>>>
>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>
>>>>
>>>>   abstract override def restoreInto(e: DIElement) {
>>>>     val c = e.asInstanceOf[DIComplex]
>>>>     c.restoreFrom(this)
>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>     super.restoreInto(e)
>>>>   }
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>   // Shouldn't I be able to just say something like...
>>>>  //
>>>>  childNodes.clear()
>>>>  childNodes.append(cs.childNodes)
>>>>  fastLookup.clear()
>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>        fastLookup.put(name, seqDINodes) )
>>>> }
>>>
>>> This is going to be somewhat inefficient. This essentially deletes and
>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>> we backtrack. This could be pretty back if we childNodes had 1000
>>> elements and we just wanted to backtrack 1. Instead we just want to
>>> remove the elements that are being backtracked.
>>>
>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>> something like this (very similar to the existing restoreFrom):
>>>
>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>   var i = childNodes.length - 1
>>>
>>>   // for each child we will remove, remove it from the fastLookup map.
>>>   // this should always be the last element in the hashmap seq
>>>   while (i >= cs._numChildren) {
>>>     var childToRemove = childNodes(i)
>>>     var fastSeq = fastLookup(childToRemove.name)
>>>     if (fastSeq.length == 1)
>>>       // last one, remove the whole key entry
>>>       fastLookup.remove(childToRemove.name)
>>>     else
>>>       // not the last one, just drop the end
>>>       fastSeq.dropRight(1)
>>>     i -= 1
>>>   }
>>>
>>>   // now just quickly remove all those children
>>>   childNodes.reduceToSize(cs._numChildren)
>>>   _numChildren = cs._numChildren
>>>
>>>   if (cs._arraySize.isDefined) {
>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>   }
>>> }
>>>
>>> The logic here is very similar to the CURRENT code, just is a little
>>> more simple since it can make certain assumptions. Like removing
>>> something form the fastSeq just means removing the last one
>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>> DIArrays to simplify that logic.
>>>
>>> Note that we might want to rethink about the type fastLookup value.
>>> Rather than the value being a Seq, it probably wants to be some sort of
>>> mutable Seq that has constant time append dropRight. Maybe another
>>> ArrayBuf is the right data structure here? That would allow something like
>>>
>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>
>>>
>>>
>>>
>>>> // ORIGINAL
>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>   //
>>>>   //    var i = _lastSlotAdded
>>>>   //    while (i > cs._lastSlotAdded) {
>>>>   //      _slots(i) = null
>>>>   //      i -= 1
>>>>   //    }
>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>   //
>>>>   //    // check invariant. All slots after last slot added
>>>>   //    // should be null.
>>>>   //    {
>>>>   //      var i = _lastSlotAdded + 1
>>>>   //      while (i < _slots.length) {
>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>   //        i = i + 1
>>>>   //      }
>>>>   //    }
>>>>   //
>>>>   //    if (cs._arraySize.isDefined) {
>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>   //    }
>>>>   //  }
>>>>
>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>> //
>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>
>>>>     var i = _lastSlotAdded
>>>>     while (i > cs._lastSlotAdded) {
>>>>       val lastNode = childNodes.last
>>>>       val name = lastNode.namedQName.toQNameString
>>>>
>>>>       lastNode match {
>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>           // No need to delete the array, just the last DINode in it
>>>>           arr.pop()
>>>>         }
>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>           // Last DINode in array, remove the array entirely
>>>>           //
>>>>           arr.pop()
>>>>           //
>>>>           // Update insertion order list
>>>>           //
>>>>           childNodes.remove(childNodes.length - 1)
>>>>           //
>>>>           // Update Fast Lookup
>>>>           //
>>>>           val seq = fastLookup(name)
>>>>           if (seq.length <= 1) {
>>>>             // Last DINode with THIS name, remove it entirely.
>>>>             //
>>>>             fastLookup.remove(name)
>>>>           } else {
>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>             //
>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>             val result = seq diff Seq(lastNode)
>>>>             fastLookup.update(name, result)
>>>>           }
>>>>         }
>>>>         case _ => {
>>>>           //
>>>>           // Update insertion order list
>>>>           //
>>>>           childNodes.remove(childNodes.length - 1)
>>>>           //
>>>>           // Update Fast Lookup
>>>>           //
>>>>           val seq = fastLookup(name)
>>>>           if (seq.length <= 1) {
>>>>             // Last DINode with THIS name, remove it entirely
>>>>             //
>>>>             fastLookup.remove(name)
>>>>           } else {
>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>             //
>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>             val result = seq diff Seq(lastNode)
>>>>             fastLookup.update(name, result)
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>
>>>>   }
>>>>
>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>> Actual
>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>> Differences were (path, expected, actual):
>>>>  (hb_root01,List(),List(<text/>))
>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no  way to avoid the hash per each append.
>>>>
>>>> The hashmap and hashlookups should really only be needed for elements
>>>> that are used in expressions. We came up with that optimization so that
>>>> we could remove the vast majority of hash lookups. Due to how
>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>> much this will affect performance, but there are ways to avoid the need
>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>> DIArray was one. Maybe API/parser changes is better.
>>>>
>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them.  But I would not make this change as yet. First things first. Don't optimize
> without profiling info, etc.
>>>>>
>>>>>
>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>
>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>> current DIElements ERD to the previous and next is enough to determine
>>>> where array starts and ends. From there, we can support any model that
>>>> requires a concept of an array. I'm not saying we should remove the
>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>> get it to a different model, but other things are more complicated.
>>>>
>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This  was the original intent for DIArray, but was punted to get it working sooner.)
> The opportunity to do this is entirely lost if you don't have DIArray, but just
> a bunch of adjacent DIElements that happen to share ERD.
>>>>
>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>> hashmap.
>>>>
>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity  of this trim behavior and the underlying representation it requires.
>>>>
>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>> trimming of unused Infoset elements is going to have some complex
>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>> necessarily make this easier. But this is just speculation. And again,
>>>> we would still know elements are part of an array, it just takes a
>>>> couple extra comparisons to determine start/end. But I do agree that
>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>
>>>>
>>>>
>>>> It seems to me these are good enough arguments for keeping DIArray
>>>> around, but it might be worth thinking a little bit about what changes
>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>> for large arrays, which are not uncommon.
>>>>
>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>
>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>> every time we want to append to a DIArray?
>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>> limited to just DPath.
>>>>>
>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>> still need to know if a DIElement is part of an array for the
>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>
>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>> Parser) and appends straight to that without having to ask the parent
>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>> though.
>>>>>
>>>>>>
>>>>>>
>>>>>> Current Implementation:
>>>>>>
>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>
>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>
>>>>>>    // Make sure there is an array to accept the child
>>>>>>
>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>
>>>>>>      Assert.invariant(arr ne null)
>>>>>>
>>>>>>      arr.append(e)
>>>>>>
>>>>>>    } else {
>>>>>>
>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>
>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>
>>>>>>    }
>>>>>>
>>>>>>    e.setParent(this)
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>      Assert.usage(childERD.isArray)
>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>    }
>>>>>>
>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>      Assert.usage(info.isArray)
>>>>>>      val slot = info.slotIndexInParent
>>>>>>      getChildArray(slot)
>>>>>>    }
>>>>>>
>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>      val slotVal = _slots(slot)
>>>>>>      if (slotVal ne null)
>>>>>>        slotVal match {
>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>        }
>>>>>>      else {
>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>        // create one (it will have zero entries)
>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>        // no array there yet. So we have to create one.
>>>>>>        setChildArray(slot, ia)
>>>>>>        ia
>>>>>>      }
>>>>>>    }
>>>>>>
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------------
>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>
>>>>>>> Reasons:
>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>> one node.
>>>>>>>
>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>> inherit from DINode.
>>>>>>>
>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>
>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>> lookup.
>>>>>>>
>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>> allocating/appending to a sequence.
>>>>>>>
>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>> it's possible for the following:
>>>>>>>
>>>>>>>     <element name="a" .../>
>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>     <element name="b".../>
>>>>>>>     <element name="a" ..../>
>>>>>>>
>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>
>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>
>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>> follows:
>>>>>>>
>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>
>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>> contents of the DIArray.
>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>> with the modified DIArray.
>>>>>>
>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>
>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>
>>>>>>>     <element name="root>
>>>>>>>          <sequence ../>
>>>>>>>          <element name="a" .../>
>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>            ...
>>>>>>>            <element name="b" .../>
>>>>>>>            <element name="c" .../>
>>>>>>>            ...
>>>>>>>          </element>
>>>>>>>        </sequence>
>>>>>>>     </element>
>>>>>>>
>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>
>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>      0 -> DIElement("a")
>>>>>>>      1 -> DIArray("a")
>>>>>>>
>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>      0 -> DIElement("b")
>>>>>>>      1 -> DIElement("c")
>>>>>>>
>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>      DIArray:
>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>
>>>>>>>      HashMap DIComplex-root:
>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>
>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>          0 -> DIElement("a")
>>>>>>>          1 -> DIArray("a")
>>>>>>>
>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>      DIArray:
>>>>>>>          Empty
>>>>>>>
>>>>>>>      HashMap DIComplex-root:
>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>                                         // removed and is now empty
>>>>>>>
>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>          0 -> DIElement("a")
>>>>>>>
>>>>>>> ===
>>>>>>>
>>>>>>> class DIComplex {
>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>
>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>> reasonable.
>>>>>>
>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>
>>>>>> One thing that might make this change easier is to not worry about the
>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>> really only necessary to support query style expressions, but that comes
>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>> until remove those restrictions.
>>>>>>
>>>>>>>
>>>>>>>   def addChild(node: DINode) {
>>>>>>>     children += node
>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>   }
>>>>>>>
>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>   }
>>>>>>>
>>>>>>>   def backtrack(int n) {
>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>       val endIdx = children.size -1
>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>
>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>         case None => // do nothing
>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>         case Some(nodesSeq) => {
>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>           lastEntry match {
>>>>>>>         case arr: DIArray => {
>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>> nodesSeq
>>>>>>>               // with this array:
>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>> arr))
>>>>>>>               //
>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>         }
>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>           }
>>>>>>>
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> class DIArray {
>>>>>>>     private val initialSize =
>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>
>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>         ie.setArray(this)
>>>>>>>     }
>>>>>>>
>>>>>>>     //
>>>>>>>     // For backtracking purposes
>>>>>>>     //
>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>         _contents.remove(n)
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> ===
>>>>>>>
>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>
>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>> sorting when an element contains an unordred sequence.
>>>>>>
>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>
>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>> an endingChildElementDeclPosition.
>>>>>>>
>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>> for the Term.
>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>> model group, then its first child term's
>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>
>>>>>>
>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>> compilation.
>>>>>>
>>>>>> - Steve
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

I don't think a DIArray could be created earlier and things appended to
it. But, currently, I think an empty DIArray could be created earlier.

And I think this is really just a problem with getChildArray always
creating DIArrays. What I suspect happened is that we had an array
following by something that accessed that array, e.g.:

<xs:sequence>
  <xs:element name="arr" maxOccurs="unbounded" ... />
  <xs:element name="count" dfdl:inputValueCalc="{ fn:count(../arr) }" />
</xs:sequence>

arr was slot 1, count was slot 2.

The setChildArray function used to just do something like

  _lastSlotAdded = slot

Which worked in the case of adding a new array--we add a new array, so
the slot it was added in was the last slot added.

But when fn:count was called, _lastSlotAdded was 2 (from the count
element). fn:count ended up calling getChildArray, which created a
DIArray and called setChildArray, which overwrote _lastSlotAdded to the
slot of the array, i.e. 1. So even though the array was finished and
couldn't be modified, the use of getChildArray in DPath ended up setting
_lastSlotAdded to the wrong value. Using max was a way to fix this so we
couldn't set it to an earlier value.

So, now, because getChildArray create elements, it sounds like things
could be created out of order. Rather than inserting an empty DIArray
into a slot, it will added it to the end and potentially mess things up.
This says to me we definitely want to modify getChildArray to not create
DIArrays. It should either return the array if it exists, or throw an
Infoset exception, very similar to getChild. Then we don't have to worry
about DIArray's being created at the wrong time. The only thing that
should create a DIArray is addChild.


On 10/17/2017 10:13 AM, Taylor Wise wrote:
> Hmm, very interesting.  This would greatly simplify things.  But what about this 
> code snippet from before my changes:
> 
> 
>    //  final def setChildArray(slot: Int, arr: DIArray) {
>    //    Assert.invariant(_slots(slot) eq null)
>    //    _slots(slot) = arr
>    //    //
>    //    // Because arrays are created if not existing, an expression like ../A[1]
>    //    // can cause array A to come into existence (though empty) and A might be
>    //    // part of an expression that is reaching backward into earlier data (when
>    //    // parsing) so that we might be creating an array earlier in the slots
>    //    // of this element. Hence, the slot being passed here is not necessarily
>    //    // the last slot added. It might be before it.
>    //    //
>    //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
>    //  }
> Given your statement:
> 
> 
> "the issue is determining if the DIArray exists or not. Everything
> after that is the same. And since children are inserted in the order
> that we see them, a DIArray will always be at the end of the children
> ArrayBuffer if it exists. If the DINode at the end of the children
> ArrayBuffer is not this array, it needs to be created."
> 
> 
> To me it seems to imply that a DIArray can come into existence without any 
> contents.  But then be appended to at a later time.  In which case it's possible 
> that this DIArray then is not the last item appended.  It could be somewhere 
> earlier and then need to be added to.
> 
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 17, 2017 7:20:42 AM
> *To:* dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> Actually, I think we're over complicating the array situation. I think
> we can avoid needing a unique ID and I think we can also avoid needing
> to always insert arrays into a HashMap.
> 
> When addChild is called to add an DIElement to an array, there are two
> cases we need to think about:
> 
> 1) The DIArray does not exist and it must be created. The
> DIElement is appended to its contents.
> 
> 2) The DIArray does exist and we just need to find it. The
> DIElement is appended to its contents.
> 
> So the issue is determining if the DIArray exists or not. Everything
> after that is the same. And since children are inserted in the order
> that we see them, a DIArray will always be at the end of the children
> ArrayBuffer if it exists. If the DINode at the end of the children
> ArrayBuffer is not this array, it needs to be created. So addChild
> just becomes something like this:
> 
>    def addChild(e: DIElement) {
>      if (e.erd.isArray) {
>        if (children.isEmpty || children.last.erd != e.erd) {
>          // no children, or last child is not a DIArray for
>          // this element, create the DIArray
>          children.append(new DIArray(...))
>        }
>        // the array is now always last, add the new child to it
>        children.last.append(e)
>      } else {
>        children.append(e)
>      }
>    }
> 
> So we no longer need a unique ID or a tuple for the HashMap key. The
> fast HashMap lookup only needs to work on the name of the element, which
> allows query-style expressions to work in the future. And we now only
> need to insert DIArray's into the fast HashMap if they are used in
> expressions addChild no longer needs getChildArray, minimizing hash
> look ups.
> 
> Related, I'm wondering if getChildArray should change as well so that it
> does not create a DIArray if it doesn't exist? I think it seems
> reasonable that only addChild should create new DIArrays. And if
> something like DPath calls getChildArray, we probably don't want to end
> up creating an empty DIArray if it doesn't exist. It should probably
> just throw an InfosetNoSuchChildElementException if the array doesn't
> exist, just like getChild.
> 
> - Steve
> 
> On 10/16/2017 12:02 PM, Steve Lawrence wrote:
>> The ID might cause some issues when we want to support query-style
>> expressions. For example, say we have a schema like this:
>> 
>> <sequence>
>>   <xs:element name="array" maxOccurs="unbounded" ... />
>>   <xs:element name="element" ... />
>>   <xs:element name="array" maxOccurs="unbounded" ... />
>> </xs:sequence>
>> 
>> So two arrays with the same name separated by a single element. Each
>> element would have a unique ID, so would be inserted in the hash table
>> as something like
>> 
>> ("array", 1) -> Seq(DIArray)
>> ("element", 2) -> Seq(DISimple)
>> ("array", 3) -> Seq(DIArray)
>> 
>> A query style expression that accesses the "array" elements (e.g.
>> fn:count(./array)) would have trouble finding both arrays to get the
>> full count since the keys for the two arrays are the same.
>> 
>> Perhaps one option is to keep the name -> DINode map type, so we'd end
>> up with something like this:
>> 
>> "array" -> Seq(DIArray, DIArray)
>> "element" -> Seq(DISimple)
>> 
>> And then just iterator of the Seq until we find DIArray with the right
>> ID? That's could have a performance hit if there were lots of things
>> with the same name, but I suspect that won't be super common.
>> 
>> Alternatively, we could relook at chaning the InfosetImpl API so that
>> when an DIArray is created, it's store in the PState for quick access,
>> and getChildArray would essentially go away.
>> 
>> - Steve
>> 
>> 
>> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>>
>>>
>>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>>
>>>
>>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>>
>>>
>>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>>
>>> ________________________________
>>> From: Taylor Wise
>>> Sent: Friday, October 13, 2017 1:55:22 PM
>>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>>
>>> I'm down to a single error: test_dfdlPosition3
>>>
>>>
>>> java.lang.Exception:
>>> Comparison failed.
>>> Expected
>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>>> Actual
>>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>>> Differences were (path, expected, actual):
>>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>
>>>     <xs:element name="f">
>>>       <xs:complexType>
>>>         <xs:sequence>
>>>           <xs:element name="cnt1" type="xs:int"
>>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>>             <xs:complexType>
>>>               <xs:sequence>
>>>                 <xs:element name="v" type="xs:int"
>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>               </xs:sequence>
>>>             </xs:complexType>
>>>           </xs:element>
>>>           <xs:element name="cnt2" type="xs:int"
>>>             dfdl:lengthKind="delimited" />
>>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>>             <xs:complexType>
>>>               <xs:sequence>
>>>                 <xs:element name="v" type="xs:int"
>>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>>               </xs:sequence>
>>>             </xs:complexType>
>>>           </xs:element>
>>>         </xs:sequence>
>>>       </xs:complexType>
>>>     </xs:element>
>>>
>>>
>>>
>>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>>
>>> ChildNodes aka Insertion Order
>>> ====
>>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>>
>>> FastLookup
>>> ====
>>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>>
>>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion  order list.  And inserting it after the DIArray in the fastLookup Hash: a -> 
> Seq(DIArray(a, ...), DIArray(a, ...))
>>>
>>> How do we fix this?
>>>
>>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this  shouldn't matter.  We shouldn't run into two or more members with the same 
> name and namespace.  Correct?
>>>
>>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>>
>>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Friday, October 13, 2017 8:45:06 AM
>>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>>
>>>>
>>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>>> z
>>>>
>>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>>
>>>>
>>>>   abstract override def restoreInto(e: DIElement) {
>>>>     val c = e.asInstanceOf[DIComplex]
>>>>     c.restoreFrom(this)
>>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>>     super.restoreInto(e)
>>>>   }
>>>>
>>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>>   // Shouldn't I be able to just say something like...
>>>>  //
>>>>  childNodes.clear()
>>>>  childNodes.append(cs.childNodes)
>>>>  fastLookup.clear()
>>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>>        fastLookup.put(name, seqDINodes) )
>>>> }
>>>
>>> This is going to be somewhat inefficient. This essentially deletes and
>>> recreates the entire childNodes array and fastLookup hashmap everytime
>>> we backtrack. This could be pretty back if we childNodes had 1000
>>> elements and we just wanted to backtrack 1. Instead we just want to
>>> remove the elements that are being backtracked.
>>>
>>> So I think you still need a concept of _lastSlotAdded. But rather than
>>> the lastSlotAdded, it's just the length of the children (maybe call it
>>> _numChildren?). I would expect restoreFrom for DIComplex to look
>>> something like this (very similar to the existing restoreFrom):
>>>
>>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>   var i = childNodes.length - 1
>>>
>>>   // for each child we will remove, remove it from the fastLookup map.
>>>   // this should always be the last element in the hashmap seq
>>>   while (i >= cs._numChildren) {
>>>     var childToRemove = childNodes(i)
>>>     var fastSeq = fastLookup(childToRemove.name)
>>>     if (fastSeq.length == 1)
>>>       // last one, remove the whole key entry
>>>       fastLookup.remove(childToRemove.name)
>>>     else
>>>       // not the last one, just drop the end
>>>       fastSeq.dropRight(1)
>>>     i -= 1
>>>   }
>>>
>>>   // now just quickly remove all those children
>>>   childNodes.reduceToSize(cs._numChildren)
>>>   _numChildren = cs._numChildren
>>>
>>>   if (cs._arraySize.isDefined) {
>>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>>   }
>>> }
>>>
>>> The logic here is very similar to the CURRENT code, just is a little
>>> more simple since it can make certain assumptions. Like removing
>>> something form the fastSeq just means removing the last one
>>> (dropRight(1)), and the existing _arraySize handles the special case for
>>> DIArrays to simplify that logic.
>>>
>>> Note that we might want to rethink about the type fastLookup value.
>>> Rather than the value being a Seq, it probably wants to be some sort of
>>> mutable Seq that has constant time append dropRight. Maybe another
>>> ArrayBuf is the right data structure here? That would allow something like
>>>
>>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>>
>>>
>>>
>>>
>>>> // ORIGINAL
>>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>   //
>>>>   //    var i = _lastSlotAdded
>>>>   //    while (i > cs._lastSlotAdded) {
>>>>   //      _slots(i) = null
>>>>   //      i -= 1
>>>>   //    }
>>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>>   //
>>>>   //    // check invariant. All slots after last slot added
>>>>   //    // should be null.
>>>>   //    {
>>>>   //      var i = _lastSlotAdded + 1
>>>>   //      while (i < _slots.length) {
>>>>   //        Assert.invariant(_slots(i) eq null)
>>>>   //        i = i + 1
>>>>   //      }
>>>>   //    }
>>>>   //
>>>>   //    if (cs._arraySize.isDefined) {
>>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>>   //    }
>>>>   //  }
>>>>
>>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>>> //
>>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>>
>>>>     var i = _lastSlotAdded
>>>>     while (i > cs._lastSlotAdded) {
>>>>       val lastNode = childNodes.last
>>>>       val name = lastNode.namedQName.toQNameString
>>>>
>>>>       lastNode match {
>>>>         case arr: DIArray if (arr.length > 1) => {
>>>>           // No need to delete the array, just the last DINode in it
>>>>           arr.pop()
>>>>         }
>>>>         case arr: DIArray if (arr.length == 1) => {
>>>>           // Last DINode in array, remove the array entirely
>>>>           //
>>>>           arr.pop()
>>>>           //
>>>>           // Update insertion order list
>>>>           //
>>>>           childNodes.remove(childNodes.length - 1)
>>>>           //
>>>>           // Update Fast Lookup
>>>>           //
>>>>           val seq = fastLookup(name)
>>>>           if (seq.length <= 1) {
>>>>             // Last DINode with THIS name, remove it entirely.
>>>>             //
>>>>             fastLookup.remove(name)
>>>>           } else {
>>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>>             //
>>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>>             val result = seq diff Seq(lastNode)
>>>>             fastLookup.update(name, result)
>>>>           }
>>>>         }
>>>>         case _ => {
>>>>           //
>>>>           // Update insertion order list
>>>>           //
>>>>           childNodes.remove(childNodes.length - 1)
>>>>           //
>>>>           // Update Fast Lookup
>>>>           //
>>>>           val seq = fastLookup(name)
>>>>           if (seq.length <= 1) {
>>>>             // Last DINode with THIS name, remove it entirely
>>>>             //
>>>>             fastLookup.remove(name)
>>>>           } else {
>>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>>             //
>>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>>             val result = seq diff Seq(lastNode)
>>>>             fastLookup.update(name, result)
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>
>>>>   }
>>>>
>>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>>> java.lang.Exception:
>>>> Comparison failed.
>>>> Expected
>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>>> Actual
>>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>>> Differences were (path, expected, actual):
>>>>  (hb_root01,List(),List(<text/>))
>>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no  way to avoid the hash per each append.
>>>>
>>>> The hashmap and hashlookups should really only be needed for elements
>>>> that are used in expressions. We came up with that optimization so that
>>>> we could remove the vast majority of hash lookups. Due to how
>>>> getChildArray works, that optimization can't occur. I'm not sure how
>>>> much this will affect performance, but there are ways to avoid the need
>>>> for a hash lookup for arrays not used in expressions. The removal of
>>>> DIArray was one. Maybe API/parser changes is better.
>>>>
>>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them.  But I would not make this change as yet. First things first. Don't optimize 
> without profiling info, etc.
>>>>>
>>>>>
>>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>>
>>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>>> concept of an arrays. Every element added to Complex child arrayBuf
>>>> still knows if it part of an array via ERD.isArray. And comparing the
>>>> current DIElements ERD to the previous and next is enough to determine
>>>> where array starts and ends. From there, we can support any model that
>>>> requires a concept of an array. I'm not saying we should remove the
>>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>>> have a special DIArray. I'll agree that it does make the logic easier to
>>>> get it to a different model, but other things are more complicated.
>>>>
>>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This  was the original intent for DIArray, but was punted to get it working sooner.) 
> The opportunity to do this is entirely lost if you don't have DIArray, but just 
> a bunch of adjacent DIElements that happen to share ERD.
>>>>
>>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>>> get around to doing this optimization, we're taking multiple hits. We're
>>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>>> hashmap.
>>>>
>>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity  of this trim behavior and the underlying representation it requires.
>>>>
>>>> It might simplify this, but I'm not entirely convinced. I think the
>>>> trimming of unused Infoset elements is going to have some complex
>>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>>> necessarily make this easier. But this is just speculation. And again,
>>>> we would still know elements are part of an array, it just takes a
>>>> couple extra comparisons to determine start/end. But I do agree that
>>>> removal of an an entire array is much easier with a DIArray concept.
>>>>
>>>>
>>>>
>>>> It seems to me these are good enough arguments for keeping DIArray
>>>> around, but it might be worth thinking a little bit about what changes
>>>> would be necessary to the API/parser to not need this hash lookup. If
>>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>>> for large arrays, which are not uncommon.
>>>>
>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Steve Lawrence <sl...@apache.org>
>>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>
>>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>>
>>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>>> every time we want to append to a DIArray?
>>>>> I think you're right. The way things work right now, anytime we add to a
>>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>>> limited to just DPath.
>>>>>
>>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>>> still need to know if a DIElement is part of an array for the
>>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>>> element should just means creating a new DISimple/DIComplex and adding
>>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>>
>>>>> Alternatively, we could modify the parser logic so that when it's
>>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>>> Parser) and appends straight to that without having to ask the parent
>>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>>> though.
>>>>>
>>>>>>
>>>>>>
>>>>>> Current Implementation:
>>>>>>
>>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>>
>>>>>>   if (e.runtimeData.isArray)  {
>>>>>>
>>>>>>    // Make sure there is an array to accept the child
>>>>>>
>>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>>
>>>>>>      Assert.invariant(arr ne null)
>>>>>>
>>>>>>      arr.append(e)
>>>>>>
>>>>>>    } else {
>>>>>>
>>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>>
>>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>>
>>>>>>    }
>>>>>>
>>>>>>    e.setParent(this)
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>>      Assert.usage(childERD.isArray)
>>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>>    }
>>>>>>
>>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>>      Assert.usage(info.isArray)
>>>>>>      val slot = info.slotIndexInParent
>>>>>>      getChildArray(slot)
>>>>>>    }
>>>>>>
>>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>>      val slotVal = _slots(slot)
>>>>>>      if (slotVal ne null)
>>>>>>        slotVal match {
>>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>>          case _ => Assert.usageError("not an array")
>>>>>>        }
>>>>>>      else {
>>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>>        // slot is null. There isn't even an array object yet.
>>>>>>        // create one (it will have zero entries)
>>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>>        // no array there yet. So we have to create one.
>>>>>>        setChildArray(slot, ia)
>>>>>>        ia
>>>>>>      }
>>>>>>    }
>>>>>>
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------------
>>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>>
>>>>>>> Reasons:
>>>>>>>     1. So that element/node lookup can be name based.
>>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>>> one node.
>>>>>>>
>>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>>> inherit from DINode.
>>>>>>>
>>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>>> problem with the current implementation is that there's no way for us to
>>>>>>> just ask for an element by name without walking the structure.
>>>>>>>
>>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>>> insertion ordering information we need regarding the children as well as
>>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>>> lookup.
>>>>>>>
>>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>>> allocating/appending to a sequence.
>>>>>>>
>>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>>> it's possible for the following:
>>>>>>>
>>>>>>>     <element name="a" .../>
>>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>>     <element name="b".../>
>>>>>>>     <element name="a" ..../>
>>>>>>>
>>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>>
>>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>>
>>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>>> follows:
>>>>>>>
>>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>>
>>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>>> contents of the DIArray.
>>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>>> with the modified DIArray.
>>>>>>
>>>>>> I think the second part of this is actually unnecessary. Since the
>>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>>
>>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>>
>>>>>>>     <element name="root>
>>>>>>>          <sequence ../>
>>>>>>>          <element name="a" .../>
>>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>>            ...
>>>>>>>            <element name="b" .../>
>>>>>>>            <element name="c" .../>
>>>>>>>            ...
>>>>>>>          </element>
>>>>>>>        </sequence>
>>>>>>>     </element>
>>>>>>>
>>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>
>>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>>      0 -> DIElement("a")
>>>>>>>      1 -> DIArray("a")
>>>>>>>
>>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>>      0 -> DIElement("b")
>>>>>>>      1 -> DIElement("c")
>>>>>>>
>>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>>      DIArray:
>>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>>
>>>>>>>      HashMap DIComplex-root:
>>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>>
>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>          0 -> DIElement("a")
>>>>>>>          1 -> DIArray("a")
>>>>>>>
>>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>>      DIArray:
>>>>>>>          Empty
>>>>>>>
>>>>>>>      HashMap DIComplex-root:
>>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>>                                         // removed and is now empty
>>>>>>>
>>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>>          0 -> DIElement("a")
>>>>>>>
>>>>>>> ===
>>>>>>>
>>>>>>> class DIComplex {
>>>>>>>   def nSlots = erd.nChildSlots
>>>>>>
>>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>>> the determines the initial size of the children Array, but the Scala
>>>>>> defaults for initializing and growing a mutable array might already be
>>>>>> reasonable.
>>>>>>
>>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>>
>>>>>> One thing that might make this change easier is to not worry about the
>>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>>> really only necessary to support query style expressions, but that comes
>>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>>> different types, etc.). There currently exist restrictions that do not
>>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>>> until remove those restrictions.
>>>>>>
>>>>>>>
>>>>>>>   def addChild(node: DINode) {
>>>>>>>     children += node
>>>>>>>     fastLookup(node.name).append(node)
>>>>>>>   }
>>>>>>>
>>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>>   }
>>>>>>>
>>>>>>>   def backtrack(int n) {
>>>>>>>     for (i = 0; i < n; i++) {
>>>>>>>       val endIdx = children.size -1
>>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>>
>>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>>         case None => // do nothing
>>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>>         case Some(nodesSeq) => {
>>>>>>>       val lastEntry = nodesSeq.last
>>>>>>>           lastEntry match {
>>>>>>>         case arr: DIArray => {
>>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>>> nodesSeq
>>>>>>>               // with this array:
>>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>>> arr))
>>>>>>>               //
>>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>         }
>>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>>           }
>>>>>>>
>>>>>>>         }
>>>>>>>       }
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> class DIArray {
>>>>>>>     private val initialSize =
>>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>>
>>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>>         ie.setArray(this)
>>>>>>>     }
>>>>>>>
>>>>>>>     //
>>>>>>>     // For backtracking purposes
>>>>>>>     //
>>>>>>>     def remove(n: Int): DIElement = {
>>>>>>>         _contents.remove(n)
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> ===
>>>>>>>
>>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>>
>>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>>> that's true, we definitely want to make sure to only do the below
>>>>>> sorting when an element contains an unordred sequence.
>>>>>>
>>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>>> order of the element children of a complex type element.  When sorting
>>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>>> array will have the same childElementDeclPosition.
>>>>>>>
>>>>>>> The difficulty in implementing this is that an Element's
>>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>>> an endingChildElementDeclPosition.
>>>>>>>
>>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>>> for the Term.
>>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>>> model group, then its first child term's
>>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>>> endingChildElementDeclPosition + 1.
>>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>>
>>>>>>
>>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>>> compilation.
>>>>>>
>>>>>> - Steve
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

Hmm, very interesting.  This would greatly simplify things.  But what about this code snippet from before my changes:


  //  final def setChildArray(slot: Int, arr: DIArray) {
  //    Assert.invariant(_slots(slot) eq null)
  //    _slots(slot) = arr
  //    //
  //    // Because arrays are created if not existing, an expression like ../A[1]
  //    // can cause array A to come into existence (though empty) and A might be
  //    // part of an expression that is reaching backward into earlier data (when
  //    // parsing) so that we might be creating an array earlier in the slots
  //    // of this element. Hence, the slot being passed here is not necessarily
  //    // the last slot added. It might be before it.
  //    //
  //    _lastSlotAdded = math.max(slot, _lastSlotAdded)
  //  }
Given your statement:


"the issue is determining if the DIArray exists or not. Everything
after that is the same. And since children are inserted in the order
that we see them, a DIArray will always be at the end of the children
ArrayBuffer if it exists. If the DINode at the end of the children
ArrayBuffer is not this array, it needs to be created."


To me it seems to imply that a DIArray can come into existence without any contents.  But then be appended to at a later time.  In which case it's possible that this DIArray then is not the last item appended.  It could be somewhere earlier and then need to be added to.

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Tuesday, October 17, 2017 7:20:42 AM
To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

Actually, I think we're over complicating the array situation. I think
we can avoid needing a unique ID and I think we can also avoid needing
to always insert arrays into a HashMap.

When addChild is called to add an DIElement to an array, there are two
cases we need to think about:

1) The DIArray does not exist and it must be created. The
DIElement is appended to its contents.

2) The DIArray does exist and we just need to find it. The
DIElement is appended to its contents.

So the issue is determining if the DIArray exists or not. Everything
after that is the same. And since children are inserted in the order
that we see them, a DIArray will always be at the end of the children
ArrayBuffer if it exists. If the DINode at the end of the children
ArrayBuffer is not this array, it needs to be created. So addChild
just becomes something like this:

  def addChild(e: DIElement) {
    if (e.erd.isArray) {
      if (children.isEmpty || children.last.erd != e.erd) {
        // no children, or last child is not a DIArray for
        // this element, create the DIArray
        children.append(new DIArray(...))
      }
      // the array is now always last, add the new child to it
      children.last.append(e)
    } else {
      children.append(e)
    }
  }

So we no longer need a unique ID or a tuple for the HashMap key. The
fast HashMap lookup only needs to work on the name of the element, which
allows query-style expressions to work in the future. And we now only
need to insert DIArray's into the fast HashMap if they are used in
expressions addChild no longer needs getChildArray, minimizing hash
look ups.

Related, I'm wondering if getChildArray should change as well so that it
does not create a DIArray if it doesn't exist? I think it seems
reasonable that only addChild should create new DIArrays. And if
something like DPath calls getChildArray, we probably don't want to end
up creating an empty DIArray if it doesn't exist. It should probably
just throw an InfosetNoSuchChildElementException if the array doesn't
exist, just like getChild.

- Steve

On 10/16/2017 12:02 PM, Steve Lawrence wrote:
> The ID might cause some issues when we want to support query-style
> expressions. For example, say we have a schema like this:
>
> <sequence>
>   <xs:element name="array" maxOccurs="unbounded" ... />
>   <xs:element name="element" ... />
>   <xs:element name="array" maxOccurs="unbounded" ... />
> </xs:sequence>
>
> So two arrays with the same name separated by a single element. Each
> element would have a unique ID, so would be inserted in the hash table
> as something like
>
> ("array", 1) -> Seq(DIArray)
> ("element", 2) -> Seq(DISimple)
> ("array", 3) -> Seq(DIArray)
>
> A query style expression that accesses the "array" elements (e.g.
> fn:count(./array)) would have trouble finding both arrays to get the
> full count since the keys for the two arrays are the same.
>
> Perhaps one option is to keep the name -> DINode map type, so we'd end
> up with something like this:
>
> "array" -> Seq(DIArray, DIArray)
> "element" -> Seq(DISimple)
>
> And then just iterator of the Seq until we find DIArray with the right
> ID? That's could have a performance hit if there were lots of things
> with the same name, but I suspect that won't be super common.
>
> Alternatively, we could relook at chaning the InfosetImpl API so that
> when an DIArray is created, it's store in the PState for quick access,
> and getChildArray would essentially go away.
>
> - Steve
>
>
> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>
>>
>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>
>>
>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>
>>
>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>
>> ________________________________
>> From: Taylor Wise
>> Sent: Friday, October 13, 2017 1:55:22 PM
>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>>
>> I'm down to a single error: test_dfdlPosition3
>>
>>
>> java.lang.Exception:
>> Comparison failed.
>> Expected
>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>> Actual
>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>> Differences were (path, expected, actual):
>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>
>>     <xs:element name="f">
>>       <xs:complexType>
>>         <xs:sequence>
>>           <xs:element name="cnt1" type="xs:int"
>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>             <xs:complexType>
>>               <xs:sequence>
>>                 <xs:element name="v" type="xs:int"
>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>               </xs:sequence>
>>             </xs:complexType>
>>           </xs:element>
>>           <xs:element name="cnt2" type="xs:int"
>>             dfdl:lengthKind="delimited" />
>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>             <xs:complexType>
>>               <xs:sequence>
>>                 <xs:element name="v" type="xs:int"
>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>               </xs:sequence>
>>             </xs:complexType>
>>           </xs:element>
>>         </xs:sequence>
>>       </xs:complexType>
>>     </xs:element>
>>
>>
>>
>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>
>> ChildNodes aka Insertion Order
>> ====
>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>
>> FastLookup
>> ====
>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>
>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a -> Seq(DIArray(a, ...), DIArray(a, ...))
>>
>> How do we fix this?
>>
>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this shouldn't matter.  We shouldn't run into two or more members with the same name and namespace.  Correct?
>>
>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Friday, October 13, 2017 8:45:06 AM
>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>
>>>
>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>> z
>>>
>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>
>>>
>>>   abstract override def restoreInto(e: DIElement) {
>>>     val c = e.asInstanceOf[DIComplex]
>>>     c.restoreFrom(this)
>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>     super.restoreInto(e)
>>>   }
>>>
>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>   // Shouldn't I be able to just say something like...
>>>  //
>>>  childNodes.clear()
>>>  childNodes.append(cs.childNodes)
>>>  fastLookup.clear()
>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>        fastLookup.put(name, seqDINodes) )
>>> }
>>
>> This is going to be somewhat inefficient. This essentially deletes and
>> recreates the entire childNodes array and fastLookup hashmap everytime
>> we backtrack. This could be pretty back if we childNodes had 1000
>> elements and we just wanted to backtrack 1. Instead we just want to
>> remove the elements that are being backtracked.
>>
>> So I think you still need a concept of _lastSlotAdded. But rather than
>> the lastSlotAdded, it's just the length of the children (maybe call it
>> _numChildren?). I would expect restoreFrom for DIComplex to look
>> something like this (very similar to the existing restoreFrom):
>>
>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>   var i = childNodes.length - 1
>>
>>   // for each child we will remove, remove it from the fastLookup map.
>>   // this should always be the last element in the hashmap seq
>>   while (i >= cs._numChildren) {
>>     var childToRemove = childNodes(i)
>>     var fastSeq = fastLookup(childToRemove.name)
>>     if (fastSeq.length == 1)
>>       // last one, remove the whole key entry
>>       fastLookup.remove(childToRemove.name)
>>     else
>>       // not the last one, just drop the end
>>       fastSeq.dropRight(1)
>>     i -= 1
>>   }
>>
>>   // now just quickly remove all those children
>>   childNodes.reduceToSize(cs._numChildren)
>>   _numChildren = cs._numChildren
>>
>>   if (cs._arraySize.isDefined) {
>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>   }
>> }
>>
>> The logic here is very similar to the CURRENT code, just is a little
>> more simple since it can make certain assumptions. Like removing
>> something form the fastSeq just means removing the last one
>> (dropRight(1)), and the existing _arraySize handles the special case for
>> DIArrays to simplify that logic.
>>
>> Note that we might want to rethink about the type fastLookup value.
>> Rather than the value being a Seq, it probably wants to be some sort of
>> mutable Seq that has constant time append dropRight. Maybe another
>> ArrayBuf is the right data structure here? That would allow something like
>>
>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>
>>
>>
>>
>>> // ORIGINAL
>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>   //
>>>   //    var i = _lastSlotAdded
>>>   //    while (i > cs._lastSlotAdded) {
>>>   //      _slots(i) = null
>>>   //      i -= 1
>>>   //    }
>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>   //
>>>   //    // check invariant. All slots after last slot added
>>>   //    // should be null.
>>>   //    {
>>>   //      var i = _lastSlotAdded + 1
>>>   //      while (i < _slots.length) {
>>>   //        Assert.invariant(_slots(i) eq null)
>>>   //        i = i + 1
>>>   //      }
>>>   //    }
>>>   //
>>>   //    if (cs._arraySize.isDefined) {
>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>   //    }
>>>   //  }
>>>
>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>> //
>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>
>>>     var i = _lastSlotAdded
>>>     while (i > cs._lastSlotAdded) {
>>>       val lastNode = childNodes.last
>>>       val name = lastNode.namedQName.toQNameString
>>>
>>>       lastNode match {
>>>         case arr: DIArray if (arr.length > 1) => {
>>>           // No need to delete the array, just the last DINode in it
>>>           arr.pop()
>>>         }
>>>         case arr: DIArray if (arr.length == 1) => {
>>>           // Last DINode in array, remove the array entirely
>>>           //
>>>           arr.pop()
>>>           //
>>>           // Update insertion order list
>>>           //
>>>           childNodes.remove(childNodes.length - 1)
>>>           //
>>>           // Update Fast Lookup
>>>           //
>>>           val seq = fastLookup(name)
>>>           if (seq.length <= 1) {
>>>             // Last DINode with THIS name, remove it entirely.
>>>             //
>>>             fastLookup.remove(name)
>>>           } else {
>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>             //
>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>             val result = seq diff Seq(lastNode)
>>>             fastLookup.update(name, result)
>>>           }
>>>         }
>>>         case _ => {
>>>           //
>>>           // Update insertion order list
>>>           //
>>>           childNodes.remove(childNodes.length - 1)
>>>           //
>>>           // Update Fast Lookup
>>>           //
>>>           val seq = fastLookup(name)
>>>           if (seq.length <= 1) {
>>>             // Last DINode with THIS name, remove it entirely
>>>             //
>>>             fastLookup.remove(name)
>>>           } else {
>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>             //
>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>             val result = seq diff Seq(lastNode)
>>>             fastLookup.update(name, result)
>>>           }
>>>         }
>>>       }
>>>     }
>>>
>>>   }
>>>
>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>> java.lang.Exception:
>>> Comparison failed.
>>> Expected
>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>> Actual
>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>> Differences were (path, expected, actual):
>>>  (hb_root01,List(),List(<text/>))
>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.
>>>
>>> The hashmap and hashlookups should really only be needed for elements
>>> that are used in expressions. We came up with that optimization so that
>>> we could remove the vast majority of hash lookups. Due to how
>>> getChildArray works, that optimization can't occur. I'm not sure how
>>> much this will affect performance, but there are ways to avoid the need
>>> for a hash lookup for arrays not used in expressions. The removal of
>>> DIArray was one. Maybe API/parser changes is better.
>>>
>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>>>>
>>>>
>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>
>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>> concept of an arrays. Every element added to Complex child arrayBuf
>>> still knows if it part of an array via ERD.isArray. And comparing the
>>> current DIElements ERD to the previous and next is enough to determine
>>> where array starts and ends. From there, we can support any model that
>>> requires a concept of an array. I'm not saying we should remove the
>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>> have a special DIArray. I'll agree that it does make the logic easier to
>>> get it to a different model, but other things are more complicated.
>>>
>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.
>>>
>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>> get around to doing this optimization, we're taking multiple hits. We're
>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>> hashmap.
>>>
>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>
>>> It might simplify this, but I'm not entirely convinced. I think the
>>> trimming of unused Infoset elements is going to have some complex
>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>> necessarily make this easier. But this is just speculation. And again,
>>> we would still know elements are part of an array, it just takes a
>>> couple extra comparisons to determine start/end. But I do agree that
>>> removal of an an entire array is much easier with a DIArray concept.
>>>
>>>
>>>
>>> It seems to me these are good enough arguments for keeping DIArray
>>> around, but it might be worth thinking a little bit about what changes
>>> would be necessary to the API/parser to not need this hash lookup. If
>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>> for large arrays, which are not uncommon.
>>>
>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>
>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>> every time we want to append to a DIArray?
>>>> I think you're right. The way things work right now, anytime we add to a
>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>> limited to just DPath.
>>>>
>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>> still need to know if a DIElement is part of an array for the
>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>> element should just means creating a new DISimple/DIComplex and adding
>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>
>>>> Alternatively, we could modify the parser logic so that when it's
>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>> Parser) and appends straight to that without having to ask the parent
>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>> though.
>>>>
>>>>>
>>>>>
>>>>> Current Implementation:
>>>>>
>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>
>>>>>   if (e.runtimeData.isArray)  {
>>>>>
>>>>>    // Make sure there is an array to accept the child
>>>>>
>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>
>>>>>      Assert.invariant(arr ne null)
>>>>>
>>>>>      arr.append(e)
>>>>>
>>>>>    } else {
>>>>>
>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>
>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>
>>>>>    }
>>>>>
>>>>>    e.setParent(this)
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>      Assert.usage(childERD.isArray)
>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>    }
>>>>>
>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>      Assert.usage(info.isArray)
>>>>>      val slot = info.slotIndexInParent
>>>>>      getChildArray(slot)
>>>>>    }
>>>>>
>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>      val slotVal = _slots(slot)
>>>>>      if (slotVal ne null)
>>>>>        slotVal match {
>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>          case _ => Assert.usageError("not an array")
>>>>>        }
>>>>>      else {
>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>        // slot is null. There isn't even an array object yet.
>>>>>        // create one (it will have zero entries)
>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>        // no array there yet. So we have to create one.
>>>>>        setChildArray(slot, ia)
>>>>>        ia
>>>>>      }
>>>>>    }
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------
>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>
>>>>>> Reasons:
>>>>>>     1. So that element/node lookup can be name based.
>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>> one node.
>>>>>>
>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>> inherit from DINode.
>>>>>>
>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>> problem with the current implementation is that there's no way for us to
>>>>>> just ask for an element by name without walking the structure.
>>>>>>
>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>> insertion ordering information we need regarding the children as well as
>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>> lookup.
>>>>>>
>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>> allocating/appending to a sequence.
>>>>>>
>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>> it's possible for the following:
>>>>>>
>>>>>>     <element name="a" .../>
>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>     <element name="b".../>
>>>>>>     <element name="a" ..../>
>>>>>>
>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>
>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>
>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>> follows:
>>>>>>
>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>
>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>> contents of the DIArray.
>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>> with the modified DIArray.
>>>>>
>>>>> I think the second part of this is actually unnecessary. Since the
>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>
>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>
>>>>>>     <element name="root>
>>>>>>          <sequence ../>
>>>>>>          <element name="a" .../>
>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>            ...
>>>>>>            <element name="b" .../>
>>>>>>            <element name="c" .../>
>>>>>>            ...
>>>>>>          </element>
>>>>>>        </sequence>
>>>>>>     </element>
>>>>>>
>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>
>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>      0 -> DIElement("a")
>>>>>>      1 -> DIArray("a")
>>>>>>
>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>      0 -> DIElement("b")
>>>>>>      1 -> DIElement("c")
>>>>>>
>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>      DIArray:
>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>
>>>>>>      HashMap DIComplex-root:
>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>
>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>          0 -> DIElement("a")
>>>>>>          1 -> DIArray("a")
>>>>>>
>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>      DIArray:
>>>>>>          Empty
>>>>>>
>>>>>>      HashMap DIComplex-root:
>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>                                         // removed and is now empty
>>>>>>
>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>          0 -> DIElement("a")
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> class DIComplex {
>>>>>>   def nSlots = erd.nChildSlots
>>>>>
>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>> the determines the initial size of the children Array, but the Scala
>>>>> defaults for initializing and growing a mutable array might already be
>>>>> reasonable.
>>>>>
>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>
>>>>> One thing that might make this change easier is to not worry about the
>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>> really only necessary to support query style expressions, but that comes
>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>> different types, etc.). There currently exist restrictions that do not
>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>> until remove those restrictions.
>>>>>
>>>>>>
>>>>>>   def addChild(node: DINode) {
>>>>>>     children += node
>>>>>>     fastLookup(node.name).append(node)
>>>>>>   }
>>>>>>
>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>   }
>>>>>>
>>>>>>   def backtrack(int n) {
>>>>>>     for (i = 0; i < n; i++) {
>>>>>>       val endIdx = children.size -1
>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>
>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>         case None => // do nothing
>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>         case Some(nodesSeq) => {
>>>>>>       val lastEntry = nodesSeq.last
>>>>>>           lastEntry match {
>>>>>>         case arr: DIArray => {
>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>> nodesSeq
>>>>>>               // with this array:
>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>> arr))
>>>>>>               //
>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>         }
>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>           }
>>>>>>
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> class DIArray {
>>>>>>     private val initialSize =
>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>
>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>         ie.setArray(this)
>>>>>>     }
>>>>>>
>>>>>>     //
>>>>>>     // For backtracking purposes
>>>>>>     //
>>>>>>     def remove(n: Int): DIElement = {
>>>>>>         _contents.remove(n)
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>
>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>> that's true, we definitely want to make sure to only do the below
>>>>> sorting when an element contains an unordred sequence.
>>>>>
>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>> order of the element children of a complex type element.  When sorting
>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>> array will have the same childElementDeclPosition.
>>>>>>
>>>>>> The difficulty in implementing this is that an Element's
>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>> an endingChildElementDeclPosition.
>>>>>>
>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>> for the Term.
>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>> model group, then its first child term's
>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>> endingChildElementDeclPosition + 1.
>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>
>>>>>
>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>> compilation.
>>>>>
>>>>> - Steve
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

Actually, I think we're over complicating the array situation. I think
we can avoid needing a unique ID and I think we can also avoid needing
to always insert arrays into a HashMap.

When addChild is called to add an DIElement to an array, there are two
cases we need to think about:

1) The DIArray does not exist and it must be created. The
DIElement is appended to its contents.

2) The DIArray does exist and we just need to find it. The
DIElement is appended to its contents.

So the issue is determining if the DIArray exists or not. Everything
after that is the same. And since children are inserted in the order
that we see them, a DIArray will always be at the end of the children
ArrayBuffer if it exists. If the DINode at the end of the children
ArrayBuffer is not this array, it needs to be created. So addChild
just becomes something like this:

  def addChild(e: DIElement) {
    if (e.erd.isArray) {
      if (children.isEmpty || children.last.erd != e.erd) {
        // no children, or last child is not a DIArray for
        // this element, create the DIArray
        children.append(new DIArray(...))
      }
      // the array is now always last, add the new child to it
      children.last.append(e)
    } else {
      children.append(e)
    }
  }

So we no longer need a unique ID or a tuple for the HashMap key. The
fast HashMap lookup only needs to work on the name of the element, which
allows query-style expressions to work in the future. And we now only
need to insert DIArray's into the fast HashMap if they are used in
expressions addChild no longer needs getChildArray, minimizing hash
look ups.

Related, I'm wondering if getChildArray should change as well so that it
does not create a DIArray if it doesn't exist? I think it seems
reasonable that only addChild should create new DIArrays. And if
something like DPath calls getChildArray, we probably don't want to end
up creating an empty DIArray if it doesn't exist. It should probably
just throw an InfosetNoSuchChildElementException if the array doesn't
exist, just like getChild.

- Steve

On 10/16/2017 12:02 PM, Steve Lawrence wrote:
> The ID might cause some issues when we want to support query-style
> expressions. For example, say we have a schema like this:
> 
> <sequence>
>   <xs:element name="array" maxOccurs="unbounded" ... />
>   <xs:element name="element" ... />
>   <xs:element name="array" maxOccurs="unbounded" ... />
> </xs:sequence>
> 
> So two arrays with the same name separated by a single element. Each
> element would have a unique ID, so would be inserted in the hash table
> as something like
> 
> ("array", 1) -> Seq(DIArray)
> ("element", 2) -> Seq(DISimple)
> ("array", 3) -> Seq(DIArray)
> 
> A query style expression that accesses the "array" elements (e.g.
> fn:count(./array)) would have trouble finding both arrays to get the
> full count since the keys for the two arrays are the same.
> 
> Perhaps one option is to keep the name -> DINode map type, so we'd end
> up with something like this:
> 
> "array" -> Seq(DIArray, DIArray)
> "element" -> Seq(DISimple)
> 
> And then just iterator of the Seq until we find DIArray with the right
> ID? That's could have a performance hit if there were lots of things
> with the same name, but I suspect that won't be super common.
> 
> Alternatively, we could relook at chaning the InfosetImpl API so that
> when an DIArray is created, it's store in the PState for quick access,
> and getChildArray would essentially go away.
> 
> - Steve
> 
> 
> On 10/13/2017 02:14 PM, Mike Beckerle wrote:
>> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
>>
>>
>> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
>>
>>
>> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
>>
>>
>> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
>>
>> ________________________________
>> From: Taylor Wise
>> Sent: Friday, October 13, 2017 1:55:22 PM
>> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>>
>> I'm down to a single error: test_dfdlPosition3
>>
>>
>> java.lang.Exception:
>> Comparison failed.
>> Expected
>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
>> Actual
>>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
>> Differences were (path, expected, actual):
>>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>
>>     <xs:element name="f">
>>       <xs:complexType>
>>         <xs:sequence>
>>           <xs:element name="cnt1" type="xs:int"
>>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>>             <xs:complexType>
>>               <xs:sequence>
>>                 <xs:element name="v" type="xs:int"
>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>               </xs:sequence>
>>             </xs:complexType>
>>           </xs:element>
>>           <xs:element name="cnt2" type="xs:int"
>>             dfdl:lengthKind="delimited" />
>>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>>             <xs:complexType>
>>               <xs:sequence>
>>                 <xs:element name="v" type="xs:int"
>>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>>               </xs:sequence>
>>             </xs:complexType>
>>           </xs:element>
>>         </xs:sequence>
>>       </xs:complexType>
>>     </xs:element>
>>
>>
>>
>> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
>>
>> ChildNodes aka Insertion Order
>> ====
>> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
>>
>> FastLookup
>> ====
>> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
>>
>> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a -> Seq(DIArray(a, ...), DIArray(a, ...))
>>
>> How do we fix this?
>>
>> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this shouldn't matter.  We shouldn't run into two or more members with the same name and namespace.  Correct?
>>
>> This also means that the 'key' will have to change to include the namespace in addition to the element name.
>>
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Friday, October 13, 2017 8:45:06 AM
>> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>>
>>>
>>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>>> z
>>>
>>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>>
>>>
>>>   abstract override def restoreInto(e: DIElement) {
>>>     val c = e.asInstanceOf[DIComplex]
>>>     c.restoreFrom(this)
>>>     //c._lastSlotAdded = this._lastSlotAdded
>>>     super.restoreInto(e)
>>>   }
>>>
>>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>>   // Shouldn't I be able to just say something like...
>>>  //
>>>  childNodes.clear()
>>>  childNodes.append(cs.childNodes)
>>>  fastLookup.clear()
>>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>>        fastLookup.put(name, seqDINodes) )
>>> }
>>
>> This is going to be somewhat inefficient. This essentially deletes and
>> recreates the entire childNodes array and fastLookup hashmap everytime
>> we backtrack. This could be pretty back if we childNodes had 1000
>> elements and we just wanted to backtrack 1. Instead we just want to
>> remove the elements that are being backtracked.
>>
>> So I think you still need a concept of _lastSlotAdded. But rather than
>> the lastSlotAdded, it's just the length of the children (maybe call it
>> _numChildren?). I would expect restoreFrom for DIComplex to look
>> something like this (very similar to the existing restoreFrom):
>>
>> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>   var i = childNodes.length - 1
>>
>>   // for each child we will remove, remove it from the fastLookup map.
>>   // this should always be the last element in the hashmap seq
>>   while (i >= cs._numChildren) {
>>     var childToRemove = childNodes(i)
>>     var fastSeq = fastLookup(childToRemove.name)
>>     if (fastSeq.length == 1)
>>       // last one, remove the whole key entry
>>       fastLookup.remove(childToRemove.name)
>>     else
>>       // not the last one, just drop the end
>>       fastSeq.dropRight(1)
>>     i -= 1
>>   }
>>
>>   // now just quickly remove all those children
>>   childNodes.reduceToSize(cs._numChildren)
>>   _numChildren = cs._numChildren
>>
>>   if (cs._arraySize.isDefined) {
>>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>>   }
>> }
>>
>> The logic here is very similar to the CURRENT code, just is a little
>> more simple since it can make certain assumptions. Like removing
>> something form the fastSeq just means removing the last one
>> (dropRight(1)), and the existing _arraySize handles the special case for
>> DIArrays to simplify that logic.
>>
>> Note that we might want to rethink about the type fastLookup value.
>> Rather than the value being a Seq, it probably wants to be some sort of
>> mutable Seq that has constant time append dropRight. Maybe another
>> ArrayBuf is the right data structure here? That would allow something like
>>
>>   fastLookup.reduceToSize(fastSeq.length - 1)
>>
>>
>>
>>
>>> // ORIGINAL
>>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>>   //
>>>   //    var i = _lastSlotAdded
>>>   //    while (i > cs._lastSlotAdded) {
>>>   //      _slots(i) = null
>>>   //      i -= 1
>>>   //    }
>>>   //    _lastSlotAdded = cs._lastSlotAdded
>>>   //
>>>   //    // check invariant. All slots after last slot added
>>>   //    // should be null.
>>>   //    {
>>>   //      var i = _lastSlotAdded + 1
>>>   //      while (i < _slots.length) {
>>>   //        Assert.invariant(_slots(i) eq null)
>>>   //        i = i + 1
>>>   //      }
>>>   //    }
>>>   //
>>>   //    if (cs._arraySize.isDefined) {
>>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>>   //    }
>>>   //  }
>>>
>>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>>> //
>>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>>
>>>     var i = _lastSlotAdded
>>>     while (i > cs._lastSlotAdded) {
>>>       val lastNode = childNodes.last
>>>       val name = lastNode.namedQName.toQNameString
>>>
>>>       lastNode match {
>>>         case arr: DIArray if (arr.length > 1) => {
>>>           // No need to delete the array, just the last DINode in it
>>>           arr.pop()
>>>         }
>>>         case arr: DIArray if (arr.length == 1) => {
>>>           // Last DINode in array, remove the array entirely
>>>           //
>>>           arr.pop()
>>>           //
>>>           // Update insertion order list
>>>           //
>>>           childNodes.remove(childNodes.length - 1)
>>>           //
>>>           // Update Fast Lookup
>>>           //
>>>           val seq = fastLookup(name)
>>>           if (seq.length <= 1) {
>>>             // Last DINode with THIS name, remove it entirely.
>>>             //
>>>             fastLookup.remove(name)
>>>           } else {
>>>             // Not the last DINode with THIS name, just remove this particular one.
>>>             //
>>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>>             val result = seq diff Seq(lastNode)
>>>             fastLookup.update(name, result)
>>>           }
>>>         }
>>>         case _ => {
>>>           //
>>>           // Update insertion order list
>>>           //
>>>           childNodes.remove(childNodes.length - 1)
>>>           //
>>>           // Update Fast Lookup
>>>           //
>>>           val seq = fastLookup(name)
>>>           if (seq.length <= 1) {
>>>             // Last DINode with THIS name, remove it entirely
>>>             //
>>>             fastLookup.remove(name)
>>>           } else {
>>>             // Not the last DINode with THIS name.  Just remove this one!
>>>             //
>>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>>             val result = seq diff Seq(lastNode)
>>>             fastLookup.update(name, result)
>>>           }
>>>         }
>>>       }
>>>     }
>>>
>>>   }
>>>
>>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>>> java.lang.Exception:
>>> Comparison failed.
>>> Expected
>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>>> Actual
>>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>>> Differences were (path, expected, actual):
>>>  (hb_root01,List(),List(<text/>))
>>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.
>>>
>>> The hashmap and hashlookups should really only be needed for elements
>>> that are used in expressions. We came up with that optimization so that
>>> we could remove the vast majority of hash lookups. Due to how
>>> getChildArray works, that optimization can't occur. I'm not sure how
>>> much this will affect performance, but there are ways to avoid the need
>>> for a hash lookup for arrays not used in expressions. The removal of
>>> DIArray was one. Maybe API/parser changes is better.
>>>
>>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>>>>
>>>>
>>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>>
>>> The removal of a DIArray doesn't necessarily mean we don't have a
>>> concept of an arrays. Every element added to Complex child arrayBuf
>>> still knows if it part of an array via ERD.isArray. And comparing the
>>> current DIElements ERD to the previous and next is enough to determine
>>> where array starts and ends. From there, we can support any model that
>>> requires a concept of an array. I'm not saying we should remove the
>>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>>> have a special DIArray. I'll agree that it does make the logic easier to
>>> get it to a different model, but other things are more complicated.
>>>
>>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.
>>>
>>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>>> get around to doing this optimization, we're taking multiple hits. We're
>>> still allocating many DISimples plus a DIArray, plus the hit from the
>>> hashmap.
>>>
>>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>>
>>> It might simplify this, but I'm not entirely convinced. I think the
>>> trimming of unused Infoset elements is going to have some complex
>>> heuristics regardless. Adding an extra heuristic for arrays won't
>>> necessarily make this easier. But this is just speculation. And again,
>>> we would still know elements are part of an array, it just takes a
>>> couple extra comparisons to determine start/end. But I do agree that
>>> removal of an an entire array is much easier with a DIArray concept.
>>>
>>>
>>>
>>> It seems to me these are good enough arguments for keeping DIArray
>>> around, but it might be worth thinking a little bit about what changes
>>> would be necessary to the API/parser to not need this hash lookup. If
>>> it's a minor change, it could reduce a lot of hash lookups, especially
>>> for large arrays, which are not uncommon.
>>>
>>>
>>>>
>>>> ________________________________
>>>> From: Steve Lawrence <sl...@apache.org>
>>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>>> To: Taylor Wise; dev@daffodil.apache.org
>>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>
>>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>>
>>>>> What you'll notice is that the current implementation will construct the DIArray
>>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>>> every time we want to append to a DIArray?
>>>> I think you're right. The way things work right now, anytime we add to a
>>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>>> limited to just DPath.
>>>>
>>>> Maybe this is another argument for removing the concept of a DIArray? We
>>>> still need to know if a DIElement is part of an array for the
>>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>>> element should just means creating a new DISimple/DIComplex and adding
>>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>>
>>>> Alternatively, we could modify the parser logic so that when it's
>>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>>> Parser) and appends straight to that without having to ask the parent
>>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>>> though.
>>>>
>>>>>
>>>>>
>>>>> Current Implementation:
>>>>>
>>>>> override def addChild(e: InfosetElement): Unit = {
>>>>>
>>>>>   if (e.runtimeData.isArray)  {
>>>>>
>>>>>    // Make sure there is an array to accept the child
>>>>>
>>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>>
>>>>>      Assert.invariant(arr ne null)
>>>>>
>>>>>      arr.append(e)
>>>>>
>>>>>    } else {
>>>>>
>>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>>
>>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>>
>>>>>    }
>>>>>
>>>>>    e.setParent(this)
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>>      Assert.usage(childERD.isArray)
>>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>>    }
>>>>>
>>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>>      Assert.usage(info.isArray)
>>>>>      val slot = info.slotIndexInParent
>>>>>      getChildArray(slot)
>>>>>    }
>>>>>
>>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>>      val slotVal = _slots(slot)
>>>>>      if (slotVal ne null)
>>>>>        slotVal match {
>>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>>          case _ => Assert.usageError("not an array")
>>>>>        }
>>>>>      else {
>>>>>        val arrayERD = erd.childERDs(slot)
>>>>>        // slot is null. There isn't even an array object yet.
>>>>>        // create one (it will have zero entries)
>>>>>        val ia = new DIArray(arrayERD, this)
>>>>>        // no array there yet. So we have to create one.
>>>>>        setChildArray(slot, ia)
>>>>>        ia
>>>>>      }
>>>>>    }
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------
>>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>>
>>>>>> Reasons:
>>>>>>     1. So that element/node lookup can be name based.
>>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>>> one node.
>>>>>>
>>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>>> inherit from DINode.
>>>>>>
>>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>>> problem with the current implementation is that there's no way for us to
>>>>>> just ask for an element by name without walking the structure.
>>>>>>
>>>>>> The solution involves modifying DIComplex to represent its children both
>>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>>> insertion ordering information we need regarding the children as well as
>>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>>> lookup.
>>>>>>
>>>>>> This named lookup can be optimized further by only inserting into the
>>>>>> HashMap those elements that we know are used in expressions. These are
>>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>>> allocating/appending to a sequence.
>>>>>>
>>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>>> it's possible for the following:
>>>>>>
>>>>>>     <element name="a" .../>
>>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>>     <element name="b".../>
>>>>>>     <element name="a" ..../>
>>>>>>
>>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>>
>>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>>
>>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>>> follows:
>>>>>>
>>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>>
>>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>>> DIComplex ArrayBuffer and removing it.
>>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>>> contents of the DIArray.
>>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>>> with the modified DIArray.
>>>>>
>>>>> I think the second part of this is actually unnecessary. Since the
>>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>>> only difference is the contents of the DIArray will be one smaller.
>>>>>
>>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>>
>>>>>>     <element name="root>
>>>>>>          <sequence ../>
>>>>>>          <element name="a" .../>
>>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>>            ...
>>>>>>            <element name="b" .../>
>>>>>>            <element name="c" .../>
>>>>>>            ...
>>>>>>          </element>
>>>>>>        </sequence>
>>>>>>     </element>
>>>>>>
>>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>
>>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>>      0 -> DIElement("a")
>>>>>>      1 -> DIArray("a")
>>>>>>
>>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>>      0 -> DIElement("b")
>>>>>>      1 -> DIElement("c")
>>>>>>
>>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>>      DIArray:
>>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>>
>>>>>>      HashMap DIComplex-root:
>>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>>
>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>          0 -> DIElement("a")
>>>>>>          1 -> DIArray("a")
>>>>>>
>>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>>      DIArray:
>>>>>>          Empty
>>>>>>
>>>>>>      HashMap DIComplex-root:
>>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>>                                         // removed and is now empty
>>>>>>
>>>>>>      ArrayBuffer DIComplex-root:
>>>>>>          0 -> DIElement("a")
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> class DIComplex {
>>>>>>   def nSlots = erd.nChildSlots
>>>>>
>>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>>> The children ArrayBuffer should just grow as elements are added and
>>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>>> the determines the initial size of the children Array, but the Scala
>>>>> defaults for initializing and growing a mutable array might already be
>>>>> reasonable.
>>>>>
>>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>>
>>>>> One thing that might make this change easier is to not worry about the
>>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>>> really only necessary to support query style expressions, but that comes
>>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>>> different types, etc.). There currently exist restrictions that do not
>>>>> allow query style expressions, so this Seq should always be a size of 1
>>>>> until remove those restrictions.
>>>>>
>>>>>>
>>>>>>   def addChild(node: DINode) {
>>>>>>     children += node
>>>>>>     fastLookup(node.name).append(node)
>>>>>>   }
>>>>>>
>>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>>   }
>>>>>>
>>>>>>   def backtrack(int n) {
>>>>>>     for (i = 0; i < n; i++) {
>>>>>>       val endIdx = children.size -1
>>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>>
>>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>>         case None => // do nothing
>>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>>         case Some(nodesSeq) => {
>>>>>>       val lastEntry = nodesSeq.last
>>>>>>           lastEntry match {
>>>>>>         case arr: DIArray => {
>>>>>>               // If the array has >= 2 items, remove last one and update
>>>>>> nodesSeq
>>>>>>               // with this array:
>>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>>           //  arr.remove(arr.size - 1)
>>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>>> arr))
>>>>>>               //
>>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>         }
>>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>>           }
>>>>>>
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> }
>>>>>>
>>>>>> class DIArray {
>>>>>>     private val initialSize =
>>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>>
>>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>>         ie.setArray(this)
>>>>>>     }
>>>>>>
>>>>>>     //
>>>>>>     // For backtracking purposes
>>>>>>     //
>>>>>>     def remove(n: Int): DIElement = {
>>>>>>         _contents.remove(n)
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>>
>>>>> That this is only necessary for unordered sequences, where the insertion
>>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>>> that's true, we definitely want to make sure to only do the below
>>>>> sorting when an element contains an unordred sequence.
>>>>>
>>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>>> order of the element children of a complex type element.  When sorting
>>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>>> array will have the same childElementDeclPosition.
>>>>>>
>>>>>> The difficulty in implementing this is that an Element's
>>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>>> an endingChildElementDeclPosition.
>>>>>>
>>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>>> for the Term.
>>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>>> model group, then its first child term's
>>>>>> startingChildElementDeclPosition is equal to its own
>>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>>> startingChildElementDeclPosition is the prior term's
>>>>>> endingChildElementDeclPosition + 1.
>>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>>
>>>>>
>>>>> Looks good to me. +1. This should definitely make implementation of
>>>>> unordered sequences and query style expressions easier. Removal of the
>>>>> slot concept should also help remove concepts that prevent faster schema
>>>>> compilation.
>>>>>
>>>>> - Steve
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

The ID might cause some issues when we want to support query-style
expressions. For example, say we have a schema like this:

<sequence>
  <xs:element name="array" maxOccurs="unbounded" ... />
  <xs:element name="element" ... />
  <xs:element name="array" maxOccurs="unbounded" ... />
</xs:sequence>

So two arrays with the same name separated by a single element. Each
element would have a unique ID, so would be inserted in the hash table
as something like

("array", 1) -> Seq(DIArray)
("element", 2) -> Seq(DISimple)
("array", 3) -> Seq(DIArray)

A query style expression that accesses the "array" elements (e.g.
fn:count(./array)) would have trouble finding both arrays to get the
full count since the keys for the two arrays are the same.

Perhaps one option is to keep the name -> DINode map type, so we'd end
up with something like this:

"array" -> Seq(DIArray, DIArray)
"element" -> Seq(DISimple)

And then just iterator of the Seq until we find DIArray with the right
ID? That's could have a performance hit if there were lots of things
with the same name, but I suspect that won't be super common.

Alternatively, we could relook at chaning the InfosetImpl API so that
when an DIArray is created, it's store in the PState for quick access,
and getChildArray would essentially go away.

- Steve


On 10/13/2017 02:14 PM, Mike Beckerle wrote:
> I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.
> 
> 
> Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.
> 
> 
> However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.
> 
> 
> Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.
> 
> ________________________________
> From: Taylor Wise
> Sent: Friday, October 13, 2017 1:55:22 PM
> To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> 
> I'm down to a single error: test_dfdlPosition3
> 
> 
> java.lang.Exception:
> Comparison failed.
> Expected
>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
> Actual
>           <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
> Differences were (path, expected, actual):
>  (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
> 
>     <xs:element name="f">
>       <xs:complexType>
>         <xs:sequence>
>           <xs:element name="cnt1" type="xs:int"
>             dfdl:lengthKind="delimited" dfdl:terminator=";" />
>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
>             <xs:complexType>
>               <xs:sequence>
>                 <xs:element name="v" type="xs:int"
>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>               </xs:sequence>
>             </xs:complexType>
>           </xs:element>
>           <xs:element name="cnt2" type="xs:int"
>             dfdl:lengthKind="delimited" />
>           <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
>             dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
>             <xs:complexType>
>               <xs:sequence>
>                 <xs:element name="v" type="xs:int"
>                   dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
>               </xs:sequence>
>             </xs:complexType>
>           </xs:element>
>         </xs:sequence>
>       </xs:complexType>
>     </xs:element>
> 
> 
> 
> I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:
> 
> ChildNodes aka Insertion Order
> ====
> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))
> 
> FastLookup
> ====
> Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))
> 
> So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a -> Seq(DIArray(a, ...), DIArray(a, ...))
> 
> How do we fix this?
> 
> Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this shouldn't matter.  We shouldn't run into two or more members with the same name and namespace.  Correct?
> 
> This also means that the 'key' will have to change to include the namespace in addition to the element name.
> 
> 
> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Friday, October 13, 2017 8:45:06 AM
> To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> On 10/12/2017 08:19 PM, Taylor Wise wrote:
>> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>>
>>
>> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
>> z
>>
>> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>>
>>
>>   abstract override def restoreInto(e: DIElement) {
>>     val c = e.asInstanceOf[DIComplex]
>>     c.restoreFrom(this)
>>     //c._lastSlotAdded = this._lastSlotAdded
>>     super.restoreInto(e)
>>   }
>>
>> final def restoreFrom(cs: DIComplexSharedImplMixin){
>>   // Shouldn't I be able to just say something like...
>>  //
>>  childNodes.clear()
>>  childNodes.append(cs.childNodes)
>>  fastLookup.clear()
>>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>>        fastLookup.put(name, seqDINodes) )
>> }
> 
> This is going to be somewhat inefficient. This essentially deletes and
> recreates the entire childNodes array and fastLookup hashmap everytime
> we backtrack. This could be pretty back if we childNodes had 1000
> elements and we just wanted to backtrack 1. Instead we just want to
> remove the elements that are being backtracked.
> 
> So I think you still need a concept of _lastSlotAdded. But rather than
> the lastSlotAdded, it's just the length of the children (maybe call it
> _numChildren?). I would expect restoreFrom for DIComplex to look
> something like this (very similar to the existing restoreFrom):
> 
> final def restoreFrom(cs: DIComplexSharedImplMixin) {
>   var i = childNodes.length - 1
> 
>   // for each child we will remove, remove it from the fastLookup map.
>   // this should always be the last element in the hashmap seq
>   while (i >= cs._numChildren) {
>     var childToRemove = childNodes(i)
>     var fastSeq = fastLookup(childToRemove.name)
>     if (fastSeq.length == 1)
>       // last one, remove the whole key entry
>       fastLookup.remove(childToRemove.name)
>     else
>       // not the last one, just drop the end
>       fastSeq.dropRight(1)
>     i -= 1
>   }
> 
>   // now just quickly remove all those children
>   childNodes.reduceToSize(cs._numChildren)
>   _numChildren = cs._numChildren
> 
>   if (cs._arraySize.isDefined) {
>     childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
>   }
> }
> 
> The logic here is very similar to the CURRENT code, just is a little
> more simple since it can make certain assumptions. Like removing
> something form the fastSeq just means removing the last one
> (dropRight(1)), and the existing _arraySize handles the special case for
> DIArrays to simplify that logic.
> 
> Note that we might want to rethink about the type fastLookup value.
> Rather than the value being a Seq, it probably wants to be some sort of
> mutable Seq that has constant time append dropRight. Maybe another
> ArrayBuf is the right data structure here? That would allow something like
> 
>   fastLookup.reduceToSize(fastSeq.length - 1)
> 
> 
> 
> 
>> // ORIGINAL
>>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>>   //
>>   //    var i = _lastSlotAdded
>>   //    while (i > cs._lastSlotAdded) {
>>   //      _slots(i) = null
>>   //      i -= 1
>>   //    }
>>   //    _lastSlotAdded = cs._lastSlotAdded
>>   //
>>   //    // check invariant. All slots after last slot added
>>   //    // should be null.
>>   //    {
>>   //      var i = _lastSlotAdded + 1
>>   //      while (i < _slots.length) {
>>   //        Assert.invariant(_slots(i) eq null)
>>   //        i = i + 1
>>   //      }
>>   //    }
>>   //
>>   //    if (cs._arraySize.isDefined) {
>>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>>   //    }
>>   //  }
>>
>> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
>> //
>>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>>     Assert.invariant(_arraySize == MaybeInt.Nope)
>>
>>     var i = _lastSlotAdded
>>     while (i > cs._lastSlotAdded) {
>>       val lastNode = childNodes.last
>>       val name = lastNode.namedQName.toQNameString
>>
>>       lastNode match {
>>         case arr: DIArray if (arr.length > 1) => {
>>           // No need to delete the array, just the last DINode in it
>>           arr.pop()
>>         }
>>         case arr: DIArray if (arr.length == 1) => {
>>           // Last DINode in array, remove the array entirely
>>           //
>>           arr.pop()
>>           //
>>           // Update insertion order list
>>           //
>>           childNodes.remove(childNodes.length - 1)
>>           //
>>           // Update Fast Lookup
>>           //
>>           val seq = fastLookup(name)
>>           if (seq.length <= 1) {
>>             // Last DINode with THIS name, remove it entirely.
>>             //
>>             fastLookup.remove(name)
>>           } else {
>>             // Not the last DINode with THIS name, just remove this particular one.
>>             //
>>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>>             val result = seq diff Seq(lastNode)
>>             fastLookup.update(name, result)
>>           }
>>         }
>>         case _ => {
>>           //
>>           // Update insertion order list
>>           //
>>           childNodes.remove(childNodes.length - 1)
>>           //
>>           // Update Fast Lookup
>>           //
>>           val seq = fastLookup(name)
>>           if (seq.length <= 1) {
>>             // Last DINode with THIS name, remove it entirely
>>             //
>>             fastLookup.remove(name)
>>           } else {
>>             // Not the last DINode with THIS name.  Just remove this one!
>>             //
>>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>>             val result = seq diff Seq(lastNode)
>>             fastLookup.update(name, result)
>>           }
>>         }
>>       }
>>     }
>>
>>   }
>>
>> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
>> java.lang.Exception:
>> Comparison failed.
>> Expected
>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
>> Actual
>>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
>> Differences were (path, expected, actual):
>>  (hb_root01,List(),List(<text/>))
>> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
>> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
>> at scala.util.Either$RightProjection.foreach(Either.scala:468)
>> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
>> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
>> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
>> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
>> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
>> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
>> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>>
>>
>>
>>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Wednesday, October 11, 2017 4:07:51 PM
>> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.
>>
>> The hashmap and hashlookups should really only be needed for elements
>> that are used in expressions. We came up with that optimization so that
>> we could remove the vast majority of hash lookups. Due to how
>> getChildArray works, that optimization can't occur. I'm not sure how
>> much this will affect performance, but there are ways to avoid the need
>> for a hash lookup for arrays not used in expressions. The removal of
>> DIArray was one. Maybe API/parser changes is better.
>>
>>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>>>
>>>
>>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>>
>> The removal of a DIArray doesn't necessarily mean we don't have a
>> concept of an arrays. Every element added to Complex child arrayBuf
>> still knows if it part of an array via ERD.isArray. And comparing the
>> current DIElements ERD to the previous and next is enough to determine
>> where array starts and ends. From there, we can support any model that
>> requires a concept of an array. I'm not saying we should remove the
>> concept of arrays, just that our InfosetImpl doesn't necessarily need to
>> have a special DIArray. I'll agree that it does make the logic easier to
>> get it to a different model, but other things are more complicated.
>>
>>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.
>>
>> Fair point. Seems a good argument for keeping DIArray, but if we don't
>> get around to doing this optimization, we're taking multiple hits. We're
>> still allocating many DISimples plus a DIArray, plus the hit from the
>> hashmap.
>>
>>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>>
>> It might simplify this, but I'm not entirely convinced. I think the
>> trimming of unused Infoset elements is going to have some complex
>> heuristics regardless. Adding an extra heuristic for arrays won't
>> necessarily make this easier. But this is just speculation. And again,
>> we would still know elements are part of an array, it just takes a
>> couple extra comparisons to determine start/end. But I do agree that
>> removal of an an entire array is much easier with a DIArray concept.
>>
>>
>>
>> It seems to me these are good enough arguments for keeping DIArray
>> around, but it might be worth thinking a little bit about what changes
>> would be necessary to the API/parser to not need this hash lookup. If
>> it's a minor change, it could reduce a lot of hash lookups, especially
>> for large arrays, which are not uncommon.
>>
>>
>>>
>>> ________________________________
>>> From: Steve Lawrence <sl...@apache.org>
>>> Sent: Wednesday, October 11, 2017 12:21 PM
>>> To: Taylor Wise; dev@daffodil.apache.org
>>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>>
>>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>>
>>>> What you'll notice is that the current implementation will construct the DIArray
>>>> if it doesn't already exist and append to it if it does already exist.  But it
>>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>>> every time we want to append to a DIArray?
>>> I think you're right. The way things work right now, anytime we add to a
>>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>>> limited to just DPath.
>>>
>>> Maybe this is another argument for removing the concept of a DIArray? We
>>> still need to know if a DIElement is part of an array for the
>>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>>> element should just means creating a new DISimple/DIComplex and adding
>>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>>
>>> Alternatively, we could modify the parser logic so that when it's
>>> dealing with arrays it stores the DIArray locally somewhere (PState or
>>> Parser) and appends straight to that without having to ask the parent
>>> DIComplex for the DIArray? Not sure how big of a change that would be
>>> though.
>>>
>>>>
>>>>
>>>> Current Implementation:
>>>>
>>>> override def addChild(e: InfosetElement): Unit = {
>>>>
>>>>   if (e.runtimeData.isArray)  {
>>>>
>>>>    // Make sure there is an array to accept the child
>>>>
>>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>>
>>>>      Assert.invariant(arr ne null)
>>>>
>>>>      arr.append(e)
>>>>
>>>>    } else {
>>>>
>>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>>
>>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>>
>>>>    }
>>>>
>>>>    e.setParent(this)
>>>>
>>>> }
>>>>
>>>>
>>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>>      Assert.usage(childERD.isArray)
>>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>>    }
>>>>
>>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>>      Assert.usage(info.isArray)
>>>>      val slot = info.slotIndexInParent
>>>>      getChildArray(slot)
>>>>    }
>>>>
>>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>>      val slotVal = _slots(slot)
>>>>      if (slotVal ne null)
>>>>        slotVal match {
>>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>>          case _ => Assert.usageError("not an array")
>>>>        }
>>>>      else {
>>>>        val arrayERD = erd.childERDs(slot)
>>>>        // slot is null. There isn't even an array object yet.
>>>>        // create one (it will have zero entries)
>>>>        val ia = new DIArray(arrayERD, this)
>>>>        // no array there yet. So we have to create one.
>>>>        setChildArray(slot, ia)
>>>>        ia
>>>>      }
>>>>    }
>>>>
>>>>
>>>>
>>>> --------------------------------------------------------------------------------
>>>> *From:* Steve Lawrence <sl...@apache.org>
>>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>>> Goal: Remove concept of slots from InfosetImpl
>>>>>
>>>>> Reasons:
>>>>>     1. So that element/node lookup can be name based.
>>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>>> one node.
>>>>>
>>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>>> one element (DIElement) as its root.  These elements can be simple
>>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>>> inherit from DINode.
>>>>>
>>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>>> problem with the current implementation is that there's no way for us to
>>>>> just ask for an element by name without walking the structure.
>>>>>
>>>>> The solution involves modifying DIComplex to represent its children both
>>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>>> insertion ordering information we need regarding the children as well as
>>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>>> also has the added benefit of growing and shrinking as objects are added
>>>>> and removed.  While the HashMap will allow us to associate a name
>>>>> (String) with DINodes that have that name thus providing a fast named
>>>>> lookup.
>>>>>
>>>>> This named lookup can be optimized further by only inserting into the
>>>>> HashMap those elements that we know are used in expressions. These are
>>>>> the elements that would most likely benefit from a fast named lookup and
>>>>> will allow us to avoid additional overhead of hashing strings and
>>>>> allocating/appending to a sequence.
>>>>>
>>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>>> it's possible for the following:
>>>>>
>>>>>     <element name="a" .../>
>>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>>     <element name="b".../>
>>>>>     <element name="a" ..../>
>>>>>
>>>>> Each element named 'a' would have an associated DINode.  Because there
>>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>>
>>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>>     "b" -> mutable.Seq(DIElement)
>>>>>
>>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>>> follows:
>>>>>
>>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>>
>>>>> Back tracking shall be accomplished by going to the last item in the
>>>>> DIComplex ArrayBuffer and removing it.
>>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>>     2. If the last item is a DIArray, we then have to look at the
>>>>> contents of the DIArray.
>>>>>         a. If the DIArray has more than one entry, remove the last
>>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>>> with the modified DIArray.
>>>>
>>>> I think the second part of this is actually unnecessary. Since the
>>>> DIArray is mutable, we only need to remove its last child. The entry in
>>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>>> only difference is the contents of the DIArray will be one smaller.
>>>>
>>>>>         b. If the DIArray has a single entry then we simply remove
>>>>> DIArray from ArrayBuffer and the HashMap.
>>>>>
>>>>>     <element name="root>
>>>>>          <sequence ../>
>>>>>          <element name="a" .../>
>>>>>          <element name="a" maxOccurs="unbounded">
>>>>>            ...
>>>>>            <element name="b" .../>
>>>>>            <element name="c" .../>
>>>>>            ...
>>>>>          </element>
>>>>>        </sequence>
>>>>>     </element>
>>>>>
>>>>> If the HashMap of DIComplex-root looks like the following:
>>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>>
>>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>>      0 -> DIElement("a")
>>>>>      1 -> DIArray("a")
>>>>>
>>>>> Where DIArray children ArrayBuffer looks like:
>>>>>      0 -> DIElement("b")
>>>>>      1 -> DIElement("c")
>>>>>
>>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>>      DIArray:
>>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>>
>>>>>      HashMap DIComplex-root:
>>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>>
>>>>>      ArrayBuffer DIComplex-root:
>>>>>          0 -> DIElement("a")
>>>>>          1 -> DIArray("a")
>>>>>
>>>>> Given the initial example if we backtrack(2): // purge two items
>>>>>      DIArray:
>>>>>          Empty
>>>>>
>>>>>      HashMap DIComplex-root:
>>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>>                                         // removed and is now empty
>>>>>
>>>>>      ArrayBuffer DIComplex-root:
>>>>>          0 -> DIElement("a")
>>>>>
>>>>> ===
>>>>>
>>>>> class DIComplex {
>>>>>   def nSlots = erd.nChildSlots
>>>>
>>>> I think the concept of nSlots and nChildSlots should go away completely.
>>>> The children ArrayBuffer should just grow as elements are added and
>>>> shrink as they are backtracked. It might make sense to have a tunable
>>>> the determines the initial size of the children Array, but the Scala
>>>> defaults for initializing and growing a mutable array might already be
>>>> reasonable.
>>>>
>>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>>
>>>> One thing that might make this change easier is to not worry about the
>>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>>> really only necessary to support query style expressions, but that comes
>>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>>> different types, etc.). There currently exist restrictions that do not
>>>> allow query style expressions, so this Seq should always be a size of 1
>>>> until remove those restrictions.
>>>>
>>>>>
>>>>>   def addChild(node: DINode) {
>>>>>     children += node
>>>>>     fastLookup(node.name).append(node)
>>>>>   }
>>>>>
>>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>>   }
>>>>>
>>>>>   def backtrack(int n) {
>>>>>     for (i = 0; i < n; i++) {
>>>>>       val endIdx = children.size -1
>>>>>       val child = children.remove(endIdx) // children.pop()
>>>>>
>>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>>         case None => // do nothing
>>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>>         case Some(nodesSeq) => {
>>>>>       val lastEntry = nodesSeq.last
>>>>>           lastEntry match {
>>>>>         case arr: DIArray => {
>>>>>               // If the array has >= 2 items, remove last one and update
>>>>> nodesSeq
>>>>>               // with this array:
>>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>>           //  arr.remove(arr.size - 1)
>>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>>> arr))
>>>>>               //
>>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>         }
>>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>>           }
>>>>>
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>>
>>>>> }
>>>>>
>>>>> class DIArray {
>>>>>     private val initialSize =
>>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>>
>>>>>     def append(ie: InfosetElement): Unit = {
>>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>>         ie.setArray(this)
>>>>>     }
>>>>>
>>>>>     //
>>>>>     // For backtracking purposes
>>>>>     //
>>>>>     def remove(n: Int): DIElement = {
>>>>>         _contents.remove(n)
>>>>>     }
>>>>> }
>>>>>
>>>>> ===
>>>>>
>>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>>
>>>> That this is only necessary for unordered sequences, where the insertion
>>>> order in the ArrayBuffer differs from the display order, correct? If
>>>> that's true, we definitely want to make sure to only do the below
>>>> sorting when an element contains an unordred sequence.
>>>>
>>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>>> shall be called childElementDeclPosition and pertains to the static
>>>>> order of the element children of a complex type element.  When sorting
>>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>>> array will have the same childElementDeclPosition.
>>>>>
>>>>> The difficulty in implementing this is that an Element's
>>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>>> the model group object shall have a startingChildElementDeclPosition and
>>>>> an endingChildElementDeclPosition.
>>>>>
>>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>>> for the Term.
>>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>>> model group, then its first child term's
>>>>> startingChildElementDeclPosition is equal to its own
>>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>>> startingChildElementDeclPosition is the prior term's
>>>>> endingChildElementDeclPosition + 1.
>>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>>> endingChildElmentDeclPosition of its last child term.
>>>>>
>>>>
>>>> Looks good to me. +1. This should definitely make implementation of
>>>> unordered sequences and query style expressions easier. Removal of the
>>>> slot concept should also help remove concepts that prevent faster schema
>>>> compilation.
>>>>
>>>> - Steve
>>>>
>>>
>>>
>>
>>
> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Mike Beckerle <mb...@tresys.com>.

I think the key has to change to include name and namespace of element, and also an additional integer to disambiguate the first a element from the 2nd, from the Nth. This integer need not be 1, 2, 3. It can be any value at all, so long as they are distinct.


Let's make sure another case works. E.g.,  ordered sequence like the one you have here, another element to separate it from  an unordered sequence that also contains elements named a.


However, an unordered sequence can't have more than 1 element named a, so if every element a declaration gets an additional integer, the one in the unordered sequence will also get one, and that will behave properly.


Is this integer the same breadth-first ordering thing we discussed elsewhere? I expect it is.

________________________________
From: Taylor Wise
Sent: Friday, October 13, 2017 1:55:22 PM
To: Steve Lawrence; dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion


I'm down to a single error: test_dfdlPosition3


java.lang.Exception:
Comparison failed.
Expected
          <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
Actual
          <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
Differences were (path, expected, actual):
 (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

    <xs:element name="f">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="cnt1" type="xs:int"
            dfdl:lengthKind="delimited" dfdl:terminator=";" />
          <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="v" type="xs:int"
                  dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          <xs:element name="cnt2" type="xs:int"
            dfdl:lengthKind="delimited" />
          <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="v" type="xs:int"
                  dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>



I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:

ChildNodes aka Insertion Order
====
ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))

FastLookup
====
Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))

So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a -> Seq(DIArray(a, ...), DIArray(a, ...))

How do we fix this?

Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this shouldn't matter.  We shouldn't run into two or more members with the same name and namespace.  Correct?

This also means that the 'key' will have to change to include the namespace in addition to the element name.



________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Friday, October 13, 2017 8:45:06 AM
To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/12/2017 08:19 PM, Taylor Wise wrote:
> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>
>
> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
> z
>
> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>
>
>   abstract override def restoreInto(e: DIElement) {
>     val c = e.asInstanceOf[DIComplex]
>     c.restoreFrom(this)
>     //c._lastSlotAdded = this._lastSlotAdded
>     super.restoreInto(e)
>   }
>
> final def restoreFrom(cs: DIComplexSharedImplMixin){
>   // Shouldn't I be able to just say something like...
>  //
>  childNodes.clear()
>  childNodes.append(cs.childNodes)
>  fastLookup.clear()
>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>        fastLookup.put(name, seqDINodes) )
> }

This is going to be somewhat inefficient. This essentially deletes and
recreates the entire childNodes array and fastLookup hashmap everytime
we backtrack. This could be pretty back if we childNodes had 1000
elements and we just wanted to backtrack 1. Instead we just want to
remove the elements that are being backtracked.

So I think you still need a concept of _lastSlotAdded. But rather than
the lastSlotAdded, it's just the length of the children (maybe call it
_numChildren?). I would expect restoreFrom for DIComplex to look
something like this (very similar to the existing restoreFrom):

final def restoreFrom(cs: DIComplexSharedImplMixin) {
  var i = childNodes.length - 1

  // for each child we will remove, remove it from the fastLookup map.
  // this should always be the last element in the hashmap seq
  while (i >= cs._numChildren) {
    var childToRemove = childNodes(i)
    var fastSeq = fastLookup(childToRemove.name)
    if (fastSeq.length == 1)
      // last one, remove the whole key entry
      fastLookup.remove(childToRemove.name)
    else
      // not the last one, just drop the end
      fastSeq.dropRight(1)
    i -= 1
  }

  // now just quickly remove all those children
  childNodes.reduceToSize(cs._numChildren)
  _numChildren = cs._numChildren

  if (cs._arraySize.isDefined) {
    childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
  }
}

The logic here is very similar to the CURRENT code, just is a little
more simple since it can make certain assumptions. Like removing
something form the fastSeq just means removing the last one
(dropRight(1)), and the existing _arraySize handles the special case for
DIArrays to simplify that logic.

Note that we might want to rethink about the type fastLookup value.
Rather than the value being a Seq, it probably wants to be some sort of
mutable Seq that has constant time append dropRight. Maybe another
ArrayBuf is the right data structure here? That would allow something like

  fastLookup.reduceToSize(fastSeq.length - 1)




> // ORIGINAL
>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>   //
>   //    var i = _lastSlotAdded
>   //    while (i > cs._lastSlotAdded) {
>   //      _slots(i) = null
>   //      i -= 1
>   //    }
>   //    _lastSlotAdded = cs._lastSlotAdded
>   //
>   //    // check invariant. All slots after last slot added
>   //    // should be null.
>   //    {
>   //      var i = _lastSlotAdded + 1
>   //      while (i < _slots.length) {
>   //        Assert.invariant(_slots(i) eq null)
>   //        i = i + 1
>   //      }
>   //    }
>   //
>   //    if (cs._arraySize.isDefined) {
>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>   //    }
>   //  }
>
> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
> //
>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>     Assert.invariant(_arraySize == MaybeInt.Nope)
>
>     var i = _lastSlotAdded
>     while (i > cs._lastSlotAdded) {
>       val lastNode = childNodes.last
>       val name = lastNode.namedQName.toQNameString
>
>       lastNode match {
>         case arr: DIArray if (arr.length > 1) => {
>           // No need to delete the array, just the last DINode in it
>           arr.pop()
>         }
>         case arr: DIArray if (arr.length == 1) => {
>           // Last DINode in array, remove the array entirely
>           //
>           arr.pop()
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely.
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name, just remove this particular one.
>             //
>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>         case _ => {
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name.  Just remove this one!
>             //
>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>       }
>     }
>
>   }
>
> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
> java.lang.Exception:
> Comparison failed.
> Expected
>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
> Actual
>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
> Differences were (path, expected, actual):
>  (hb_root01,List(),List(<text/>))
> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
> at scala.util.Either$RightProjection.foreach(Either.scala:468)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>
>
>
>
>
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Wednesday, October 11, 2017 4:07:51 PM
> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>
> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.
>
> The hashmap and hashlookups should really only be needed for elements
> that are used in expressions. We came up with that optimization so that
> we could remove the vast majority of hash lookups. Due to how
> getChildArray works, that optimization can't occur. I'm not sure how
> much this will affect performance, but there are ways to avoid the need
> for a hash lookup for arrays not used in expressions. The removal of
> DIArray was one. Maybe API/parser changes is better.
>
>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>>
>>
>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>
> The removal of a DIArray doesn't necessarily mean we don't have a
> concept of an arrays. Every element added to Complex child arrayBuf
> still knows if it part of an array via ERD.isArray. And comparing the
> current DIElements ERD to the previous and next is enough to determine
> where array starts and ends. From there, we can support any model that
> requires a concept of an array. I'm not saying we should remove the
> concept of arrays, just that our InfosetImpl doesn't necessarily need to
> have a special DIArray. I'll agree that it does make the logic easier to
> get it to a different model, but other things are more complicated.
>
>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.
>
> Fair point. Seems a good argument for keeping DIArray, but if we don't
> get around to doing this optimization, we're taking multiple hits. We're
> still allocating many DISimples plus a DIArray, plus the hit from the
> hashmap.
>
>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>
> It might simplify this, but I'm not entirely convinced. I think the
> trimming of unused Infoset elements is going to have some complex
> heuristics regardless. Adding an extra heuristic for arrays won't
> necessarily make this easier. But this is just speculation. And again,
> we would still know elements are part of an array, it just takes a
> couple extra comparisons to determine start/end. But I do agree that
> removal of an an entire array is much easier with a DIArray concept.
>
>
>
> It seems to me these are good enough arguments for keeping DIArray
> around, but it might be worth thinking a little bit about what changes
> would be necessary to the API/parser to not need this hash lookup. If
> it's a minor change, it could reduce a lot of hash lookups, especially
> for large arrays, which are not uncommon.
>
>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Wednesday, October 11, 2017 12:21 PM
>> To: Taylor Wise; dev@daffodil.apache.org
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>
>>> What you'll notice is that the current implementation will construct the DIArray
>>> if it doesn't already exist and append to it if it does already exist.  But it
>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>> every time we want to append to a DIArray?
>> I think you're right. The way things work right now, anytime we add to a
>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>> limited to just DPath.
>>
>> Maybe this is another argument for removing the concept of a DIArray? We
>> still need to know if a DIElement is part of an array for the
>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>> element should just means creating a new DISimple/DIComplex and adding
>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>
>> Alternatively, we could modify the parser logic so that when it's
>> dealing with arrays it stores the DIArray locally somewhere (PState or
>> Parser) and appends straight to that without having to ask the parent
>> DIComplex for the DIArray? Not sure how big of a change that would be
>> though.
>>
>>>
>>>
>>> Current Implementation:
>>>
>>> override def addChild(e: InfosetElement): Unit = {
>>>
>>>   if (e.runtimeData.isArray)  {
>>>
>>>    // Make sure there is an array to accept the child
>>>
>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>
>>>      Assert.invariant(arr ne null)
>>>
>>>      arr.append(e)
>>>
>>>    } else {
>>>
>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>
>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>
>>>    }
>>>
>>>    e.setParent(this)
>>>
>>> }
>>>
>>>
>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>      Assert.usage(childERD.isArray)
>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>    }
>>>
>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>      Assert.usage(info.isArray)
>>>      val slot = info.slotIndexInParent
>>>      getChildArray(slot)
>>>    }
>>>
>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>      val slotVal = _slots(slot)
>>>      if (slotVal ne null)
>>>        slotVal match {
>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>          case _ => Assert.usageError("not an array")
>>>        }
>>>      else {
>>>        val arrayERD = erd.childERDs(slot)
>>>        // slot is null. There isn't even an array object yet.
>>>        // create one (it will have zero entries)
>>>        val ia = new DIArray(arrayERD, this)
>>>        // no array there yet. So we have to create one.
>>>        setChildArray(slot, ia)
>>>        ia
>>>      }
>>>    }
>>>
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> *From:* Steve Lawrence <sl...@apache.org>
>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>> Goal: Remove concept of slots from InfosetImpl
>>>>
>>>> Reasons:
>>>>     1. So that element/node lookup can be name based.
>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>> one node.
>>>>
>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>> one element (DIElement) as its root.  These elements can be simple
>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>> inherit from DINode.
>>>>
>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>> problem with the current implementation is that there's no way for us to
>>>> just ask for an element by name without walking the structure.
>>>>
>>>> The solution involves modifying DIComplex to represent its children both
>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>> insertion ordering information we need regarding the children as well as
>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>> also has the added benefit of growing and shrinking as objects are added
>>>> and removed.  While the HashMap will allow us to associate a name
>>>> (String) with DINodes that have that name thus providing a fast named
>>>> lookup.
>>>>
>>>> This named lookup can be optimized further by only inserting into the
>>>> HashMap those elements that we know are used in expressions. These are
>>>> the elements that would most likely benefit from a fast named lookup and
>>>> will allow us to avoid additional overhead of hashing strings and
>>>> allocating/appending to a sequence.
>>>>
>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>> it's possible for the following:
>>>>
>>>>     <element name="a" .../>
>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>     <element name="b".../>
>>>>     <element name="a" ..../>
>>>>
>>>> Each element named 'a' would have an associated DINode.  Because there
>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>
>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>     "b" -> mutable.Seq(DIElement)
>>>>
>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>> follows:
>>>>
>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>
>>>> Back tracking shall be accomplished by going to the last item in the
>>>> DIComplex ArrayBuffer and removing it.
>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>     2. If the last item is a DIArray, we then have to look at the
>>>> contents of the DIArray.
>>>>         a. If the DIArray has more than one entry, remove the last
>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>> with the modified DIArray.
>>>
>>> I think the second part of this is actually unnecessary. Since the
>>> DIArray is mutable, we only need to remove its last child. The entry in
>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>> only difference is the contents of the DIArray will be one smaller.
>>>
>>>>         b. If the DIArray has a single entry then we simply remove
>>>> DIArray from ArrayBuffer and the HashMap.
>>>>
>>>>     <element name="root>
>>>>          <sequence ../>
>>>>          <element name="a" .../>
>>>>          <element name="a" maxOccurs="unbounded">
>>>>            ...
>>>>            <element name="b" .../>
>>>>            <element name="c" .../>
>>>>            ...
>>>>          </element>
>>>>        </sequence>
>>>>     </element>
>>>>
>>>> If the HashMap of DIComplex-root looks like the following:
>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>      0 -> DIElement("a")
>>>>      1 -> DIArray("a")
>>>>
>>>> Where DIArray children ArrayBuffer looks like:
>>>>      0 -> DIElement("b")
>>>>      1 -> DIElement("c")
>>>>
>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>      DIArray:
>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>          1 -> DIArray("a")
>>>>
>>>> Given the initial example if we backtrack(2): // purge two items
>>>>      DIArray:
>>>>          Empty
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>                                         // removed and is now empty
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>
>>>> ===
>>>>
>>>> class DIComplex {
>>>>   def nSlots = erd.nChildSlots
>>>
>>> I think the concept of nSlots and nChildSlots should go away completely.
>>> The children ArrayBuffer should just grow as elements are added and
>>> shrink as they are backtracked. It might make sense to have a tunable
>>> the determines the initial size of the children Array, but the Scala
>>> defaults for initializing and growing a mutable array might already be
>>> reasonable.
>>>
>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>
>>> One thing that might make this change easier is to not worry about the
>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>> really only necessary to support query style expressions, but that comes
>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>> different types, etc.). There currently exist restrictions that do not
>>> allow query style expressions, so this Seq should always be a size of 1
>>> until remove those restrictions.
>>>
>>>>
>>>>   def addChild(node: DINode) {
>>>>     children += node
>>>>     fastLookup(node.name).append(node)
>>>>   }
>>>>
>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>   }
>>>>
>>>>   def backtrack(int n) {
>>>>     for (i = 0; i < n; i++) {
>>>>       val endIdx = children.size -1
>>>>       val child = children.remove(endIdx) // children.pop()
>>>>
>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>         case None => // do nothing
>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>         case Some(nodesSeq) => {
>>>>       val lastEntry = nodesSeq.last
>>>>           lastEntry match {
>>>>         case arr: DIArray => {
>>>>               // If the array has >= 2 items, remove last one and update
>>>> nodesSeq
>>>>               // with this array:
>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>           //  arr.remove(arr.size - 1)
>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>> arr))
>>>>               //
>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>         }
>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>           }
>>>>
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> }
>>>>
>>>> class DIArray {
>>>>     private val initialSize =
>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>
>>>>     def append(ie: InfosetElement): Unit = {
>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>         ie.setArray(this)
>>>>     }
>>>>
>>>>     //
>>>>     // For backtracking purposes
>>>>     //
>>>>     def remove(n: Int): DIElement = {
>>>>         _contents.remove(n)
>>>>     }
>>>> }
>>>>
>>>> ===
>>>>
>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>
>>> That this is only necessary for unordered sequences, where the insertion
>>> order in the ArrayBuffer differs from the display order, correct? If
>>> that's true, we definitely want to make sure to only do the below
>>> sorting when an element contains an unordred sequence.
>>>
>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>> shall be called childElementDeclPosition and pertains to the static
>>>> order of the element children of a complex type element.  When sorting
>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>> array will have the same childElementDeclPosition.
>>>>
>>>> The difficulty in implementing this is that an Element's
>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>> the model group object shall have a startingChildElementDeclPosition and
>>>> an endingChildElementDeclPosition.
>>>>
>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>> for the Term.
>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>> model group, then its first child term's
>>>> startingChildElementDeclPosition is equal to its own
>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>> startingChildElementDeclPosition is the prior term's
>>>> endingChildElementDeclPosition + 1.
>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>> endingChildElmentDeclPosition of its last child term.
>>>>
>>>
>>> Looks good to me. +1. This should definitely make implementation of
>>> unordered sequences and query style expressions easier. Removal of the
>>> slot concept should also help remove concepts that prevent faster schema
>>> compilation.
>>>
>>> - Steve
>>>
>>
>>
>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

I'm down to a single error: test_dfdlPosition3


java.lang.Exception:
Comparison failed.
Expected
          <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2><a><v>1</v></a><a><v>2</v></a></f>
Actual
          <f><cnt1>2</cnt1><a><v>1</v></a><a><v>2</v></a><a><v>1</v></a><a><v>2</v></a><cnt2>2</cnt2></f>
Differences were (path, expected, actual):
 (f,<cnt2>2</cnt2>,<a><v>1</v></a>)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

    <xs:element name="f">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="cnt1" type="xs:int"
            dfdl:lengthKind="delimited" dfdl:terminator=";" />
          <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt1) }">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="v" type="xs:int"
                  dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          <xs:element name="cnt2" type="xs:int"
            dfdl:lengthKind="delimited" />
          <xs:element name="a" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression" dfdl:occursCount="{ xs:int(../ex:cnt2) }">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="v" type="xs:int"
                  dfdl:inputValueCalc="{ dfdl:occursIndex() }" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>



I believe what's happening, from stepping through, is that the getChildArray code is confused as to where to add the second set of <a/> elements.  Below are the values of the structures at the point where <cnt2/> has been added:

ChildNodes aka Insertion Order
====
ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk)), DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a'))), DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk)))

FastLookup
====
Map(cnt2 -> ArrayBuffer(DISimple(name='cnt2' 2 valueLen(One(0)unk, One(8)unk))), cnt1 -> ArrayBuffer(DISimple(name='cnt1' 2 valueLen(One(0)unk, One(8)unk))), a -> ArrayBuffer(DIArray({http://example.com}a,ArrayBuffer(DIComplex(name='a'), DIComplex(name='a')))))

So when we go to add the next array 'a', it sees that 'a' is already in the fastLookup and appends it there.  This adds it to the existing DIArray in the insertion order list.  Really, it should create a new DIArray inserting it after <cnt2/> in the insertion order list.  And inserting it after the DIArray in the fastLookup Hash: a -> Seq(DIArray(a, ...), DIArray(a, ...))

How do we fix this?

Because this is an 'ordered' sequence, can we look at the slotIndexInParent?  Isn't this the position in the sequence definition as defined in the schema above?  This should differentiate between the two 'a' arrays.  And then for 'unordered' sequences, this shouldn't matter.  We shouldn't run into two or more members with the same name and namespace.  Correct?

This also means that the 'key' will have to change to include the namespace in addition to the element name.



________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Friday, October 13, 2017 8:45:06 AM
To: dev@daffodil.apache.org; Taylor Wise; Mike Beckerle
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/12/2017 08:19 PM, Taylor Wise wrote:
> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
>
>
> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
> z
>
> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
>
>
>   abstract override def restoreInto(e: DIElement) {
>     val c = e.asInstanceOf[DIComplex]
>     c.restoreFrom(this)
>     //c._lastSlotAdded = this._lastSlotAdded
>     super.restoreInto(e)
>   }
>
> final def restoreFrom(cs: DIComplexSharedImplMixin){
>   // Shouldn't I be able to just say something like...
>  //
>  childNodes.clear()
>  childNodes.append(cs.childNodes)
>  fastLookup.clear()
>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>        fastLookup.put(name, seqDINodes) )
> }

This is going to be somewhat inefficient. This essentially deletes and
recreates the entire childNodes array and fastLookup hashmap everytime
we backtrack. This could be pretty back if we childNodes had 1000
elements and we just wanted to backtrack 1. Instead we just want to
remove the elements that are being backtracked.

So I think you still need a concept of _lastSlotAdded. But rather than
the lastSlotAdded, it's just the length of the children (maybe call it
_numChildren?). I would expect restoreFrom for DIComplex to look
something like this (very similar to the existing restoreFrom):

final def restoreFrom(cs: DIComplexSharedImplMixin) {
  var i = childNodes.length - 1

  // for each child we will remove, remove it from the fastLookup map.
  // this should always be the last element in the hashmap seq
  while (i >= cs._numChildren) {
    var childToRemove = childNodes(i)
    var fastSeq = fastLookup(childToRemove.name)
    if (fastSeq.length == 1)
      // last one, remove the whole key entry
      fastLookup.remove(childToRemove.name)
    else
      // not the last one, just drop the end
      fastSeq.dropRight(1)
    i -= 1
  }

  // now just quickly remove all those children
  childNodes.reduceToSize(cs._numChildren)
  _numChildren = cs._numChildren

  if (cs._arraySize.isDefined) {
    childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
  }
}

The logic here is very similar to the CURRENT code, just is a little
more simple since it can make certain assumptions. Like removing
something form the fastSeq just means removing the last one
(dropRight(1)), and the existing _arraySize handles the special case for
DIArrays to simplify that logic.

Note that we might want to rethink about the type fastLookup value.
Rather than the value being a Seq, it probably wants to be some sort of
mutable Seq that has constant time append dropRight. Maybe another
ArrayBuf is the right data structure here? That would allow something like

  fastLookup.reduceToSize(fastSeq.length - 1)




> // ORIGINAL
>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>   //
>   //    var i = _lastSlotAdded
>   //    while (i > cs._lastSlotAdded) {
>   //      _slots(i) = null
>   //      i -= 1
>   //    }
>   //    _lastSlotAdded = cs._lastSlotAdded
>   //
>   //    // check invariant. All slots after last slot added
>   //    // should be null.
>   //    {
>   //      var i = _lastSlotAdded + 1
>   //      while (i < _slots.length) {
>   //        Assert.invariant(_slots(i) eq null)
>   //        i = i + 1
>   //      }
>   //    }
>   //
>   //    if (cs._arraySize.isDefined) {
>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>   //    }
>   //  }
>
> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
> //
>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>     Assert.invariant(_arraySize == MaybeInt.Nope)
>
>     var i = _lastSlotAdded
>     while (i > cs._lastSlotAdded) {
>       val lastNode = childNodes.last
>       val name = lastNode.namedQName.toQNameString
>
>       lastNode match {
>         case arr: DIArray if (arr.length > 1) => {
>           // No need to delete the array, just the last DINode in it
>           arr.pop()
>         }
>         case arr: DIArray if (arr.length == 1) => {
>           // Last DINode in array, remove the array entirely
>           //
>           arr.pop()
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely.
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name, just remove this particular one.
>             //
>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>         case _ => {
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name.  Just remove this one!
>             //
>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>       }
>     }
>
>   }
>
> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
> java.lang.Exception:
> Comparison failed.
> Expected
>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
> Actual
>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
> Differences were (path, expected, actual):
>  (hb_root01,List(),List(<text/>))
> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
> at scala.util.Either$RightProjection.foreach(Either.scala:468)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
>
>
>
>
>
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Wednesday, October 11, 2017 4:07:51 PM
> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>
> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.
>
> The hashmap and hashlookups should really only be needed for elements
> that are used in expressions. We came up with that optimization so that
> we could remove the vast majority of hash lookups. Due to how
> getChildArray works, that optimization can't occur. I'm not sure how
> much this will affect performance, but there are ways to avoid the need
> for a hash lookup for arrays not used in expressions. The removal of
> DIArray was one. Maybe API/parser changes is better.
>
>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>>
>>
>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
>
> The removal of a DIArray doesn't necessarily mean we don't have a
> concept of an arrays. Every element added to Complex child arrayBuf
> still knows if it part of an array via ERD.isArray. And comparing the
> current DIElements ERD to the previous and next is enough to determine
> where array starts and ends. From there, we can support any model that
> requires a concept of an array. I'm not saying we should remove the
> concept of arrays, just that our InfosetImpl doesn't necessarily need to
> have a special DIArray. I'll agree that it does make the logic easier to
> get it to a different model, but other things are more complicated.
>
>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.
>
> Fair point. Seems a good argument for keeping DIArray, but if we don't
> get around to doing this optimization, we're taking multiple hits. We're
> still allocating many DISimples plus a DIArray, plus the hit from the
> hashmap.
>
>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
>
> It might simplify this, but I'm not entirely convinced. I think the
> trimming of unused Infoset elements is going to have some complex
> heuristics regardless. Adding an extra heuristic for arrays won't
> necessarily make this easier. But this is just speculation. And again,
> we would still know elements are part of an array, it just takes a
> couple extra comparisons to determine start/end. But I do agree that
> removal of an an entire array is much easier with a DIArray concept.
>
>
>
> It seems to me these are good enough arguments for keeping DIArray
> around, but it might be worth thinking a little bit about what changes
> would be necessary to the API/parser to not need this hash lookup. If
> it's a minor change, it could reduce a lot of hash lookups, especially
> for large arrays, which are not uncommon.
>
>
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Wednesday, October 11, 2017 12:21 PM
>> To: Taylor Wise; dev@daffodil.apache.org
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>
>>> What you'll notice is that the current implementation will construct the DIArray
>>> if it doesn't already exist and append to it if it does already exist.  But it
>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>> every time we want to append to a DIArray?
>> I think you're right. The way things work right now, anytime we add to a
>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>> limited to just DPath.
>>
>> Maybe this is another argument for removing the concept of a DIArray? We
>> still need to know if a DIElement is part of an array for the
>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>> element should just means creating a new DISimple/DIComplex and adding
>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>
>> Alternatively, we could modify the parser logic so that when it's
>> dealing with arrays it stores the DIArray locally somewhere (PState or
>> Parser) and appends straight to that without having to ask the parent
>> DIComplex for the DIArray? Not sure how big of a change that would be
>> though.
>>
>>>
>>>
>>> Current Implementation:
>>>
>>> override def addChild(e: InfosetElement): Unit = {
>>>
>>>   if (e.runtimeData.isArray)  {
>>>
>>>    // Make sure there is an array to accept the child
>>>
>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>
>>>      Assert.invariant(arr ne null)
>>>
>>>      arr.append(e)
>>>
>>>    } else {
>>>
>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>
>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>
>>>    }
>>>
>>>    e.setParent(this)
>>>
>>> }
>>>
>>>
>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>      Assert.usage(childERD.isArray)
>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>    }
>>>
>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>      Assert.usage(info.isArray)
>>>      val slot = info.slotIndexInParent
>>>      getChildArray(slot)
>>>    }
>>>
>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>      val slotVal = _slots(slot)
>>>      if (slotVal ne null)
>>>        slotVal match {
>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>          case _ => Assert.usageError("not an array")
>>>        }
>>>      else {
>>>        val arrayERD = erd.childERDs(slot)
>>>        // slot is null. There isn't even an array object yet.
>>>        // create one (it will have zero entries)
>>>        val ia = new DIArray(arrayERD, this)
>>>        // no array there yet. So we have to create one.
>>>        setChildArray(slot, ia)
>>>        ia
>>>      }
>>>    }
>>>
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> *From:* Steve Lawrence <sl...@apache.org>
>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>> Goal: Remove concept of slots from InfosetImpl
>>>>
>>>> Reasons:
>>>>     1. So that element/node lookup can be name based.
>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>> one node.
>>>>
>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>> one element (DIElement) as its root.  These elements can be simple
>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>> inherit from DINode.
>>>>
>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>> problem with the current implementation is that there's no way for us to
>>>> just ask for an element by name without walking the structure.
>>>>
>>>> The solution involves modifying DIComplex to represent its children both
>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>> insertion ordering information we need regarding the children as well as
>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>> also has the added benefit of growing and shrinking as objects are added
>>>> and removed.  While the HashMap will allow us to associate a name
>>>> (String) with DINodes that have that name thus providing a fast named
>>>> lookup.
>>>>
>>>> This named lookup can be optimized further by only inserting into the
>>>> HashMap those elements that we know are used in expressions. These are
>>>> the elements that would most likely benefit from a fast named lookup and
>>>> will allow us to avoid additional overhead of hashing strings and
>>>> allocating/appending to a sequence.
>>>>
>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>> it's possible for the following:
>>>>
>>>>     <element name="a" .../>
>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>     <element name="b".../>
>>>>     <element name="a" ..../>
>>>>
>>>> Each element named 'a' would have an associated DINode.  Because there
>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>
>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>     "b" -> mutable.Seq(DIElement)
>>>>
>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>> follows:
>>>>
>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>
>>>> Back tracking shall be accomplished by going to the last item in the
>>>> DIComplex ArrayBuffer and removing it.
>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>     2. If the last item is a DIArray, we then have to look at the
>>>> contents of the DIArray.
>>>>         a. If the DIArray has more than one entry, remove the last
>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>> with the modified DIArray.
>>>
>>> I think the second part of this is actually unnecessary. Since the
>>> DIArray is mutable, we only need to remove its last child. The entry in
>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>> only difference is the contents of the DIArray will be one smaller.
>>>
>>>>         b. If the DIArray has a single entry then we simply remove
>>>> DIArray from ArrayBuffer and the HashMap.
>>>>
>>>>     <element name="root>
>>>>          <sequence ../>
>>>>          <element name="a" .../>
>>>>          <element name="a" maxOccurs="unbounded">
>>>>            ...
>>>>            <element name="b" .../>
>>>>            <element name="c" .../>
>>>>            ...
>>>>          </element>
>>>>        </sequence>
>>>>     </element>
>>>>
>>>> If the HashMap of DIComplex-root looks like the following:
>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>      0 -> DIElement("a")
>>>>      1 -> DIArray("a")
>>>>
>>>> Where DIArray children ArrayBuffer looks like:
>>>>      0 -> DIElement("b")
>>>>      1 -> DIElement("c")
>>>>
>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>      DIArray:
>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>          1 -> DIArray("a")
>>>>
>>>> Given the initial example if we backtrack(2): // purge two items
>>>>      DIArray:
>>>>          Empty
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>                                         // removed and is now empty
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>
>>>> ===
>>>>
>>>> class DIComplex {
>>>>   def nSlots = erd.nChildSlots
>>>
>>> I think the concept of nSlots and nChildSlots should go away completely.
>>> The children ArrayBuffer should just grow as elements are added and
>>> shrink as they are backtracked. It might make sense to have a tunable
>>> the determines the initial size of the children Array, but the Scala
>>> defaults for initializing and growing a mutable array might already be
>>> reasonable.
>>>
>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>
>>> One thing that might make this change easier is to not worry about the
>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>> really only necessary to support query style expressions, but that comes
>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>> different types, etc.). There currently exist restrictions that do not
>>> allow query style expressions, so this Seq should always be a size of 1
>>> until remove those restrictions.
>>>
>>>>
>>>>   def addChild(node: DINode) {
>>>>     children += node
>>>>     fastLookup(node.name).append(node)
>>>>   }
>>>>
>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>   }
>>>>
>>>>   def backtrack(int n) {
>>>>     for (i = 0; i < n; i++) {
>>>>       val endIdx = children.size -1
>>>>       val child = children.remove(endIdx) // children.pop()
>>>>
>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>         case None => // do nothing
>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>         case Some(nodesSeq) => {
>>>>       val lastEntry = nodesSeq.last
>>>>           lastEntry match {
>>>>         case arr: DIArray => {
>>>>               // If the array has >= 2 items, remove last one and update
>>>> nodesSeq
>>>>               // with this array:
>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>           //  arr.remove(arr.size - 1)
>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>> arr))
>>>>               //
>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>         }
>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>           }
>>>>
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> }
>>>>
>>>> class DIArray {
>>>>     private val initialSize =
>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>
>>>>     def append(ie: InfosetElement): Unit = {
>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>         ie.setArray(this)
>>>>     }
>>>>
>>>>     //
>>>>     // For backtracking purposes
>>>>     //
>>>>     def remove(n: Int): DIElement = {
>>>>         _contents.remove(n)
>>>>     }
>>>> }
>>>>
>>>> ===
>>>>
>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>
>>> That this is only necessary for unordered sequences, where the insertion
>>> order in the ArrayBuffer differs from the display order, correct? If
>>> that's true, we definitely want to make sure to only do the below
>>> sorting when an element contains an unordred sequence.
>>>
>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>> shall be called childElementDeclPosition and pertains to the static
>>>> order of the element children of a complex type element.  When sorting
>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>> array will have the same childElementDeclPosition.
>>>>
>>>> The difficulty in implementing this is that an Element's
>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>> the model group object shall have a startingChildElementDeclPosition and
>>>> an endingChildElementDeclPosition.
>>>>
>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>> for the Term.
>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>> model group, then its first child term's
>>>> startingChildElementDeclPosition is equal to its own
>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>> startingChildElementDeclPosition is the prior term's
>>>> endingChildElementDeclPosition + 1.
>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>> endingChildElmentDeclPosition of its last child term.
>>>>
>>>
>>> Looks good to me. +1. This should definitely make implementation of
>>> unordered sequences and query style expressions easier. Removal of the
>>> slot concept should also help remove concepts that prevent faster schema
>>> compilation.
>>>
>>> - Steve
>>>
>>
>>
>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

On 10/12/2017 08:19 PM, Taylor Wise wrote:
> Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.
> 
> 
> I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).
> z
> 
> I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.
> 
> 
>   abstract override def restoreInto(e: DIElement) {
>     val c = e.asInstanceOf[DIComplex]
>     c.restoreFrom(this)
>     //c._lastSlotAdded = this._lastSlotAdded
>     super.restoreInto(e)
>   }
> 
> final def restoreFrom(cs: DIComplexSharedImplMixin){
>   // Shouldn't I be able to just say something like...
>  //
>  childNodes.clear()
>  childNodes.append(cs.childNodes)
>  fastLookup.clear()
>  cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
>        fastLookup.put(name, seqDINodes) )
> }

This is going to be somewhat inefficient. This essentially deletes and
recreates the entire childNodes array and fastLookup hashmap everytime
we backtrack. This could be pretty back if we childNodes had 1000
elements and we just wanted to backtrack 1. Instead we just want to
remove the elements that are being backtracked.

So I think you still need a concept of _lastSlotAdded. But rather than
the lastSlotAdded, it's just the length of the children (maybe call it
_numChildren?). I would expect restoreFrom for DIComplex to look
something like this (very similar to the existing restoreFrom):

final def restoreFrom(cs: DIComplexSharedImplMixin) {
  var i = childNodes.length - 1

  // for each child we will remove, remove it from the fastLookup map.
  // this should always be the last element in the hashmap seq
  while (i >= cs._numChildren) {
    var childToRemove = childNodes(i)
    var fastSeq = fastLookup(childToRemove.name)
    if (fastSeq.length == 1)
      // last one, remove the whole key entry
      fastLookup.remove(childToRemove.name)
    else
      // not the last one, just drop the end
      fastSeq.dropRight(1)
    i -= 1
  }

  // now just quickly remove all those children
  childNodes.reduceToSize(cs._numChildren)
  _numChildren = cs._numChildren

  if (cs._arraySize.isDefined) {
    childNodes(_numChildren - 1).reduceToSize(cs._arraySize.get)
  }
}

The logic here is very similar to the CURRENT code, just is a little
more simple since it can make certain assumptions. Like removing
something form the fastSeq just means removing the last one
(dropRight(1)), and the existing _arraySize handles the special case for
DIArrays to simplify that logic.

Note that we might want to rethink about the type fastLookup value.
Rather than the value being a Seq, it probably wants to be some sort of
mutable Seq that has constant time append dropRight. Maybe another
ArrayBuf is the right data structure here? That would allow something like

  fastLookup.reduceToSize(fastSeq.length - 1)




> // ORIGINAL
>   //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
>   //    Assert.invariant(_arraySize == MaybeInt.Nope)
>   //
>   //    var i = _lastSlotAdded
>   //    while (i > cs._lastSlotAdded) {
>   //      _slots(i) = null
>   //      i -= 1
>   //    }
>   //    _lastSlotAdded = cs._lastSlotAdded
>   //
>   //    // check invariant. All slots after last slot added
>   //    // should be null.
>   //    {
>   //      var i = _lastSlotAdded + 1
>   //      while (i < _slots.length) {
>   //        Assert.invariant(_slots(i) eq null)
>   //        i = i + 1
>   //      }
>   //    }
>   //
>   //    if (cs._arraySize.isDefined) {
>   //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
>   //    }
>   //  }
> 
> // CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
> //
>   final def restoreFrom(cs: DIComplexSharedImplMixin) {
>     Assert.invariant(_arraySize == MaybeInt.Nope)
> 
>     var i = _lastSlotAdded
>     while (i > cs._lastSlotAdded) {
>       val lastNode = childNodes.last
>       val name = lastNode.namedQName.toQNameString
> 
>       lastNode match {
>         case arr: DIArray if (arr.length > 1) => {
>           // No need to delete the array, just the last DINode in it
>           arr.pop()
>         }
>         case arr: DIArray if (arr.length == 1) => {
>           // Last DINode in array, remove the array entirely
>           //
>           arr.pop()
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely.
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name, just remove this particular one.
>             //
>             val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>         case _ => {
>           //
>           // Update insertion order list
>           //
>           childNodes.remove(childNodes.length - 1)
>           //
>           // Update Fast Lookup
>           //
>           val seq = fastLookup(name)
>           if (seq.length <= 1) {
>             // Last DINode with THIS name, remove it entirely
>             //
>             fastLookup.remove(name)
>           } else {
>             // Not the last DINode with THIS name.  Just remove this one!
>             //
>             val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
>             val result = seq diff Seq(lastNode)
>             fastLookup.update(name, result)
>           }
>         }
>       }
>     }
> 
>   }
> 
> The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
> java.lang.Exception:
> Comparison failed.
> Expected
>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
> Actual
>           <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
> Differences were (path, expected, actual):
>  (hb_root01,List(),List(<text/>))
> at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
> at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
> at scala.util.Either$RightProjection.foreach(Either.scala:468)
> at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
> at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
> at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
> at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
> at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
> at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
> at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
> 
> 
> 
> 
> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Wednesday, October 11, 2017 4:07:51 PM
> To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> On 10/11/2017 01:01 PM, Mike Beckerle wrote:
>> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.
> 
> The hashmap and hashlookups should really only be needed for elements
> that are used in expressions. We came up with that optimization so that
> we could remove the vast majority of hash lookups. Due to how
> getChildArray works, that optimization can't occur. I'm not sure how
> much this will affect performance, but there are ways to avoid the need
> for a hash lookup for arrays not used in expressions. The removal of
> DIArray was one. Maybe API/parser changes is better.
> 
>> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>>
>>
>> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.
> 
> The removal of a DIArray doesn't necessarily mean we don't have a
> concept of an arrays. Every element added to Complex child arrayBuf
> still knows if it part of an array via ERD.isArray. And comparing the
> current DIElements ERD to the previous and next is enough to determine
> where array starts and ends. From there, we can support any model that
> requires a concept of an array. I'm not saying we should remove the
> concept of arrays, just that our InfosetImpl doesn't necessarily need to
> have a special DIArray. I'll agree that it does make the logic easier to
> get it to a different model, but other things are more complicated.
> 
>> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.
> 
> Fair point. Seems a good argument for keeping DIArray, but if we don't
> get around to doing this optimization, we're taking multiple hits. We're
> still allocating many DISimples plus a DIArray, plus the hit from the
> hashmap.
> 
>> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.
> 
> It might simplify this, but I'm not entirely convinced. I think the
> trimming of unused Infoset elements is going to have some complex
> heuristics regardless. Adding an extra heuristic for arrays won't
> necessarily make this easier. But this is just speculation. And again,
> we would still know elements are part of an array, it just takes a
> couple extra comparisons to determine start/end. But I do agree that
> removal of an an entire array is much easier with a DIArray concept.
> 
> 
> 
> It seems to me these are good enough arguments for keeping DIArray
> around, but it might be worth thinking a little bit about what changes
> would be necessary to the API/parser to not need this hash lookup. If
> it's a minor change, it could reduce a lot of hash lookups, especially
> for large arrays, which are not uncommon.
> 
> 
>>
>> ________________________________
>> From: Steve Lawrence <sl...@apache.org>
>> Sent: Wednesday, October 11, 2017 12:21 PM
>> To: Taylor Wise; dev@daffodil.apache.org
>> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>>
>> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>>
>>> What you'll notice is that the current implementation will construct the DIArray
>>> if it doesn't already exist and append to it if it does already exist.  But it
>>> needs the 'slot' number in order to check this.  The slot number is obtained
>>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>>> every time we want to append to a DIArray?
>> I think you're right. The way things work right now, anytime we add to a
>> a DIArray we would have to do a hash lookup on the DIComplex. It's not
>> limited to just DPath.
>>
>> Maybe this is another argument for removing the concept of a DIArray? We
>> still need to know if a DIElement is part of an array for the
>> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
>> element should just means creating a new DISimple/DIComplex and adding
>> it to the DIComplex child ArrayBuffer and appending to the HashLookup
>> Seq? This would also simplify some query style logic, e.g. fn:count.
>>
>> Alternatively, we could modify the parser logic so that when it's
>> dealing with arrays it stores the DIArray locally somewhere (PState or
>> Parser) and appends straight to that without having to ask the parent
>> DIComplex for the DIArray? Not sure how big of a change that would be
>> though.
>>
>>>
>>>
>>> Current Implementation:
>>>
>>> override def addChild(e: InfosetElement): Unit = {
>>>
>>>   if (e.runtimeData.isArray)  {
>>>
>>>    // Make sure there is an array to accept the child
>>>
>>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>>
>>>      Assert.invariant(arr ne null)
>>>
>>>      arr.append(e)
>>>
>>>    } else {
>>>
>>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>>
>>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>>
>>>    }
>>>
>>>    e.setParent(this)
>>>
>>> }
>>>
>>>
>>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>>      Assert.usage(childERD.isArray)
>>>      getChildArray(childERD.dpathElementCompileInfo)
>>>    }
>>>
>>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>>      Assert.usage(info.isArray)
>>>      val slot = info.slotIndexInParent
>>>      getChildArray(slot)
>>>    }
>>>
>>>    private def getChildArray(slot: Int): InfosetArray = {
>>>      val slotVal = _slots(slot)
>>>      if (slotVal ne null)
>>>        slotVal match {
>>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>>          case _ => Assert.usageError("not an array")
>>>        }
>>>      else {
>>>        val arrayERD = erd.childERDs(slot)
>>>        // slot is null. There isn't even an array object yet.
>>>        // create one (it will have zero entries)
>>>        val ia = new DIArray(arrayERD, this)
>>>        // no array there yet. So we have to create one.
>>>        setChildArray(slot, ia)
>>>        ia
>>>      }
>>>    }
>>>
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> *From:* Steve Lawrence <sl...@apache.org>
>>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>>> *To:* dev@daffodil.apache.org; Taylor Wise
>>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>>> Goal: Remove concept of slots from InfosetImpl
>>>>
>>>> Reasons:
>>>>     1. So that element/node lookup can be name based.
>>>>     2. Ease re-enabling of Unordered Sequences.
>>>>     3. Ease enabling of 'query-style' expressions that return more than
>>>> one node.
>>>>
>>>> An Infoset is a document (DIDocument) and the document contains exactly
>>>> one element (DIElement) as its root.  These elements can be simple
>>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>>> inherit from DINode.
>>>>
>>>> Currently DIComplex models its children as an Array of DINodes. The
>>>> problem with the current implementation is that there's no way for us to
>>>> just ask for an element by name without walking the structure.
>>>>
>>>> The solution involves modifying DIComplex to represent its children both
>>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>>> insertion ordering information we need regarding the children as well as
>>>> constant time indexing access to the child in question.  The ArrayBuffer
>>>> also has the added benefit of growing and shrinking as objects are added
>>>> and removed.  While the HashMap will allow us to associate a name
>>>> (String) with DINodes that have that name thus providing a fast named
>>>> lookup.
>>>>
>>>> This named lookup can be optimized further by only inserting into the
>>>> HashMap those elements that we know are used in expressions. These are
>>>> the elements that would most likely benefit from a fast named lookup and
>>>> will allow us to avoid additional overhead of hashing strings and
>>>> allocating/appending to a sequence.
>>>>
>>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>>> it's possible for the following:
>>>>
>>>>     <element name="a" .../>
>>>>     <element name="a" maxOccurs="unbounded"..../>
>>>>     <element name="b".../>
>>>>     <element name="a" ..../>
>>>>
>>>> Each element named 'a' would have an associated DINode.  Because there
>>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>>
>>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>>     "b" -> mutable.Seq(DIElement)
>>>>
>>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>>> follows:
>>>>
>>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>>
>>>> Back tracking shall be accomplished by going to the last item in the
>>>> DIComplex ArrayBuffer and removing it.
>>>>     1. If the last item is a DIElement, we simply delete it from the
>>>> ArrayBuffer and then remove it also from the HashMap.
>>>>     2. If the last item is a DIArray, we then have to look at the
>>>> contents of the DIArray.
>>>>         a. If the DIArray has more than one entry, remove the last
>>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>>> with the modified DIArray.
>>>
>>> I think the second part of this is actually unnecessary. Since the
>>> DIArray is mutable, we only need to remove its last child. The entry in
>>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>>> only difference is the contents of the DIArray will be one smaller.
>>>
>>>>         b. If the DIArray has a single entry then we simply remove
>>>> DIArray from ArrayBuffer and the HashMap.
>>>>
>>>>     <element name="root>
>>>>          <sequence ../>
>>>>          <element name="a" .../>
>>>>          <element name="a" maxOccurs="unbounded">
>>>>            ...
>>>>            <element name="b" .../>
>>>>            <element name="c" .../>
>>>>            ...
>>>>          </element>
>>>>        </sequence>
>>>>     </element>
>>>>
>>>> If the HashMap of DIComplex-root looks like the following:
>>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>> ArrayBuffer of DIComplex-root looks like the following:
>>>>      0 -> DIElement("a")
>>>>      1 -> DIArray("a")
>>>>
>>>> Where DIArray children ArrayBuffer looks like:
>>>>      0 -> DIElement("b")
>>>>      1 -> DIElement("c")
>>>>
>>>> Given the initial example if we backtrack(1):  // purge one item
>>>>      DIArray:
>>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>          1 -> DIArray("a")
>>>>
>>>> Given the initial example if we backtrack(2): // purge two items
>>>>      DIArray:
>>>>          Empty
>>>>
>>>>      HashMap DIComplex-root:
>>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>>                                         // removed and is now empty
>>>>
>>>>      ArrayBuffer DIComplex-root:
>>>>          0 -> DIElement("a")
>>>>
>>>> ===
>>>>
>>>> class DIComplex {
>>>>   def nSlots = erd.nChildSlots
>>>
>>> I think the concept of nSlots and nChildSlots should go away completely.
>>> The children ArrayBuffer should just grow as elements are added and
>>> shrink as they are backtracked. It might make sense to have a tunable
>>> the determines the initial size of the children Array, but the Scala
>>> defaults for initializing and growing a mutable array might already be
>>> reasonable.
>>>
>>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>>
>>> One thing that might make this change easier is to not worry about the
>>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>>> really only necessary to support query style expressions, but that comes
>>> with a whole other slew of concerns (e.g. fn:count, same name but
>>> different types, etc.). There currently exist restrictions that do not
>>> allow query style expressions, so this Seq should always be a size of 1
>>> until remove those restrictions.
>>>
>>>>
>>>>   def addChild(node: DINode) {
>>>>     children += node
>>>>     fastLookup(node.name).append(node)
>>>>   }
>>>>
>>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>>   }
>>>>
>>>>   def backtrack(int n) {
>>>>     for (i = 0; i < n; i++) {
>>>>       val endIdx = children.size -1
>>>>       val child = children.remove(endIdx) // children.pop()
>>>>
>>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>>         case None => // do nothing
>>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>>         case Some(nodesSeq) => {
>>>>       val lastEntry = nodesSeq.last
>>>>           lastEntry match {
>>>>         case arr: DIArray => {
>>>>               // If the array has >= 2 items, remove last one and update
>>>> nodesSeq
>>>>               // with this array:
>>>>               //  val lastIdx = nodesSeq.length - 1
>>>>           //  arr.remove(arr.size - 1)
>>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>>> arr))
>>>>               //
>>>>               // If the array has 1 item, remove the array from nodesSeq
>>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>         }
>>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>>           }
>>>>
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> }
>>>>
>>>> class DIArray {
>>>>     private val initialSize =
>>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>>
>>>>     def append(ie: InfosetElement): Unit = {
>>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>>         ie.setArray(this)
>>>>     }
>>>>
>>>>     //
>>>>     // For backtracking purposes
>>>>     //
>>>>     def remove(n: Int): DIElement = {
>>>>         _contents.remove(n)
>>>>     }
>>>> }
>>>>
>>>> ===
>>>>
>>>> To facillitate a breadth-first ordering scheme we shall have to
>>>
>>> That this is only necessary for unordered sequences, where the insertion
>>> order in the ArrayBuffer differs from the display order, correct? If
>>> that's true, we definitely want to make sure to only do the below
>>> sorting when an element contains an unordred sequence.
>>>
>>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>>> shall be called childElementDeclPosition and pertains to the static
>>>> order of the element children of a complex type element.  When sorting
>>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>>> entry in the sequence that needs to be sorted.  All elements in this
>>>> array will have the same childElementDeclPosition.
>>>>
>>>> The difficulty in implementing this is that an Element's
>>>> dsom.ComplexType can have groups within the groups before finding actual
>>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>>> the model group object shall have a startingChildElementDeclPosition and
>>>> an endingChildElementDeclPosition.
>>>>
>>>> 1. The startingChildElementPosition shall be passed to the constructor
>>>> for the Term.
>>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>>> model group, then its first child term's
>>>> startingChildElementDeclPosition is equal to its own
>>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>>> startingChildElementDeclPosition is the prior term's
>>>> endingChildElementDeclPosition + 1.
>>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>>> endingChildElmentDeclPosition of its last child term.
>>>>
>>>
>>> Looks good to me. +1. This should definitely make implementation of
>>> unordered sequences and query style expressions easier. Removal of the
>>> slot concept should also help remove concepts that prevent faster schema
>>> compilation.
>>>
>>> - Steve
>>>
>>
>>
> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

Been debugging using the TestSimpleTypes tests.  I'm down to just 1 failure.  Expected results are slightly off.  I've got an additional <text/> in the result.  So, getting closer.  Stopping for the night.


I'm not so sure that we need the _lastSlotAdded variable anymore.  I changed it to be just a def that reflects the current length of childNodes (ArrayBuffer).


I'm betting my current issue is within the restore functions I modified, but I'll continue looking at that in the morning.  I would think that 'restoring' would simply be copying the previous state... unless I'm misunderstanding something.


  abstract override def restoreInto(e: DIElement) {
    val c = e.asInstanceOf[DIComplex]
    c.restoreFrom(this)
    //c._lastSlotAdded = this._lastSlotAdded
    super.restoreInto(e)
  }

final def restoreFrom(cs: DIComplexSharedImplMixin){
  // Shouldn't I be able to just say something like...
 //
 childNodes.clear()
 childNodes.append(cs.childNodes)
 fastLookup.clear()
 cs.fastLookup.toSeq.foreach( (name, seqDINodes) =>
       fastLookup.put(name, seqDINodes) )
}

// ORIGINAL
  //    final def restoreFrom(cs: DIComplexSharedImplMixin) {
  //    Assert.invariant(_arraySize == MaybeInt.Nope)
  //
  //    var i = _lastSlotAdded
  //    while (i > cs._lastSlotAdded) {
  //      _slots(i) = null
  //      i -= 1
  //    }
  //    _lastSlotAdded = cs._lastSlotAdded
  //
  //    // check invariant. All slots after last slot added
  //    // should be null.
  //    {
  //      var i = _lastSlotAdded + 1
  //      while (i < _slots.length) {
  //        Assert.invariant(_slots(i) eq null)
  //        i = i + 1
  //      }
  //    }
  //
  //    if (cs._arraySize.isDefined) {
  //      _slots(_lastSlotAdded).asInstanceOf[DIArray].reduceToSize(cs._arraySize.get)
  //    }
  //  }

// CURRENT, lot of logic here.  Wondering if it's even necessary, thus the question above
//
  final def restoreFrom(cs: DIComplexSharedImplMixin) {
    Assert.invariant(_arraySize == MaybeInt.Nope)

    var i = _lastSlotAdded
    while (i > cs._lastSlotAdded) {
      val lastNode = childNodes.last
      val name = lastNode.namedQName.toQNameString

      lastNode match {
        case arr: DIArray if (arr.length > 1) => {
          // No need to delete the array, just the last DINode in it
          arr.pop()
        }
        case arr: DIArray if (arr.length == 1) => {
          // Last DINode in array, remove the array entirely
          //
          arr.pop()
          //
          // Update insertion order list
          //
          childNodes.remove(childNodes.length - 1)
          //
          // Update Fast Lookup
          //
          val seq = fastLookup(name)
          if (seq.length <= 1) {
            // Last DINode with THIS name, remove it entirely.
            //
            fastLookup.remove(name)
          } else {
            // Not the last DINode with THIS name, just remove this particular one.
            //
            val idx = seq.lastIndexWhere(node => node.isInstanceOf[DIArray])
            val result = seq diff Seq(lastNode)
            fastLookup.update(name, result)
          }
        }
        case _ => {
          //
          // Update insertion order list
          //
          childNodes.remove(childNodes.length - 1)
          //
          // Update Fast Lookup
          //
          val seq = fastLookup(name)
          if (seq.length <= 1) {
            // Last DINode with THIS name, remove it entirely
            //
            fastLookup.remove(name)
          } else {
            // Not the last DINode with THIS name.  Just remove this one!
            //
            val idx = seq.lastIndexWhere(node => !node.isInstanceOf[DIArray])
            val result = seq diff Seq(lastNode)
            fastLookup.update(name, result)
          }
        }
      }
    }

  }

The last failure in SimpleTypes, I haven't run the rest of the suite yet though.  Baby steps.
java.lang.Exception:
Comparison failed.
Expected
          <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text></hb_root01>
Actual
          <hb_root01><text>57652063616E277420626520636F6E73756D6564206279206F757220706574747920646966666572656E63657320616E79206D6F7265</text><text>2057652077696C6C20626520756E6974656420696E206F757220636F6D6D6F6E20696E746572657374</text><text>205065726861707320697427732066617465207468617420746F6461792069732074686520347468206F66204A756C79</text><text>20616E6420796F752077696C6C206F6E636520616761696E206265206669676874696E6720666F72206F75722066726565646F6D</text><text/></hb_root01>
Differences were (path, expected, actual):
 (hb_root01,List(),List(<text/>))
at edu.illinois.ncsa.daffodil.xml.XMLUtils$.compareAndReport(XMLUtils.scala:678)
at edu.illinois.ncsa.daffodil.tdml.VerifyTestCase$.verifyParserTestData(TDMLRunner.scala:1074)
at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runParseExpectSuccess(TDMLRunner.scala:776)
at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:675)
at edu.illinois.ncsa.daffodil.tdml.ParserTestCase$$anonfun$runProcessor$3.apply(TDMLRunner.scala:673)
at scala.util.Either$RightProjection.foreach(Either.scala:468)
at edu.illinois.ncsa.daffodil.tdml.ParserTestCase.runProcessor(TDMLRunner.scala:673)
at edu.illinois.ncsa.daffodil.tdml.TestCase.run(TDMLRunner.scala:623)
at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTestWithDataVolumes(TDMLRunner.scala:340)
at edu.illinois.ncsa.daffodil.tdml.DFDLTestSuite.runOneTest(TDMLRunner.scala:328)
at edu.illinois.ncsa.daffodil.tdml.Runner.runOneTest(RunnerFactory.scala:122)
at edu.illinois.ncsa.daffodil.section05.simple_types.TestSimpleTypes.test_hexBinary_Delimited_01(TestSimpleTypes.scala:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)





________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 11, 2017 4:07:51 PM
To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/11/2017 01:01 PM, Mike Beckerle wrote:
> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.

The hashmap and hashlookups should really only be needed for elements
that are used in expressions. We came up with that optimization so that
we could remove the vast majority of hash lookups. Due to how
getChildArray works, that optimization can't occur. I'm not sure how
much this will affect performance, but there are ways to avoid the need
for a hash lookup for arrays not used in expressions. The removal of
DIArray was one. Maybe API/parser changes is better.

> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>
>
> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.

The removal of a DIArray doesn't necessarily mean we don't have a
concept of an arrays. Every element added to Complex child arrayBuf
still knows if it part of an array via ERD.isArray. And comparing the
current DIElements ERD to the previous and next is enough to determine
where array starts and ends. From there, we can support any model that
requires a concept of an array. I'm not saying we should remove the
concept of arrays, just that our InfosetImpl doesn't necessarily need to
have a special DIArray. I'll agree that it does make the logic easier to
get it to a different model, but other things are more complicated.

> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.

Fair point. Seems a good argument for keeping DIArray, but if we don't
get around to doing this optimization, we're taking multiple hits. We're
still allocating many DISimples plus a DIArray, plus the hit from the
hashmap.

> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.

It might simplify this, but I'm not entirely convinced. I think the
trimming of unused Infoset elements is going to have some complex
heuristics regardless. Adding an extra heuristic for arrays won't
necessarily make this easier. But this is just speculation. And again,
we would still know elements are part of an array, it just takes a
couple extra comparisons to determine start/end. But I do agree that
removal of an an entire array is much easier with a DIArray concept.



It seems to me these are good enough arguments for keeping DIArray
around, but it might be worth thinking a little bit about what changes
would be necessary to the API/parser to not need this hash lookup. If
it's a minor change, it could reduce a lot of hash lookups, especially
for large arrays, which are not uncommon.


>
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Wednesday, October 11, 2017 12:21 PM
> To: Taylor Wise; dev@daffodil.apache.org
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>
> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>
>> What you'll notice is that the current implementation will construct the DIArray
>> if it doesn't already exist and append to it if it does already exist.  But it
>> needs the 'slot' number in order to check this.  The slot number is obtained
>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>> every time we want to append to a DIArray?
> I think you're right. The way things work right now, anytime we add to a
> a DIArray we would have to do a hash lookup on the DIComplex. It's not
> limited to just DPath.
>
> Maybe this is another argument for removing the concept of a DIArray? We
> still need to know if a DIElement is part of an array for the
> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
> element should just means creating a new DISimple/DIComplex and adding
> it to the DIComplex child ArrayBuffer and appending to the HashLookup
> Seq? This would also simplify some query style logic, e.g. fn:count.
>
> Alternatively, we could modify the parser logic so that when it's
> dealing with arrays it stores the DIArray locally somewhere (PState or
> Parser) and appends straight to that without having to ask the parent
> DIComplex for the DIArray? Not sure how big of a change that would be
> though.
>
>>
>>
>> Current Implementation:
>>
>> override def addChild(e: InfosetElement): Unit = {
>>
>>   if (e.runtimeData.isArray)  {
>>
>>    // Make sure there is an array to accept the child
>>
>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>
>>      Assert.invariant(arr ne null)
>>
>>      arr.append(e)
>>
>>    } else {
>>
>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>
>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>
>>    }
>>
>>    e.setParent(this)
>>
>> }
>>
>>
>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>      Assert.usage(childERD.isArray)
>>      getChildArray(childERD.dpathElementCompileInfo)
>>    }
>>
>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>      Assert.usage(info.isArray)
>>      val slot = info.slotIndexInParent
>>      getChildArray(slot)
>>    }
>>
>>    private def getChildArray(slot: Int): InfosetArray = {
>>      val slotVal = _slots(slot)
>>      if (slotVal ne null)
>>        slotVal match {
>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>          case _ => Assert.usageError("not an array")
>>        }
>>      else {
>>        val arrayERD = erd.childERDs(slot)
>>        // slot is null. There isn't even an array object yet.
>>        // create one (it will have zero entries)
>>        val ia = new DIArray(arrayERD, this)
>>        // no array there yet. So we have to create one.
>>        setChildArray(slot, ia)
>>        ia
>>      }
>>    }
>>
>>
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>> *To:* dev@daffodil.apache.org; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>> Goal: Remove concept of slots from InfosetImpl
>>>
>>> Reasons:
>>>     1. So that element/node lookup can be name based.
>>>     2. Ease re-enabling of Unordered Sequences.
>>>     3. Ease enabling of 'query-style' expressions that return more than
>>> one node.
>>>
>>> An Infoset is a document (DIDocument) and the document contains exactly
>>> one element (DIElement) as its root.  These elements can be simple
>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>> inherit from DINode.
>>>
>>> Currently DIComplex models its children as an Array of DINodes. The
>>> problem with the current implementation is that there's no way for us to
>>> just ask for an element by name without walking the structure.
>>>
>>> The solution involves modifying DIComplex to represent its children both
>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>> insertion ordering information we need regarding the children as well as
>>> constant time indexing access to the child in question.  The ArrayBuffer
>>> also has the added benefit of growing and shrinking as objects are added
>>> and removed.  While the HashMap will allow us to associate a name
>>> (String) with DINodes that have that name thus providing a fast named
>>> lookup.
>>>
>>> This named lookup can be optimized further by only inserting into the
>>> HashMap those elements that we know are used in expressions. These are
>>> the elements that would most likely benefit from a fast named lookup and
>>> will allow us to avoid additional overhead of hashing strings and
>>> allocating/appending to a sequence.
>>>
>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>> it's possible for the following:
>>>
>>>     <element name="a" .../>
>>>     <element name="a" maxOccurs="unbounded"..../>
>>>     <element name="b".../>
>>>     <element name="a" ..../>
>>>
>>> Each element named 'a' would have an associated DINode.  Because there
>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>
>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>     "b" -> mutable.Seq(DIElement)
>>>
>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>> follows:
>>>
>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>
>>> Back tracking shall be accomplished by going to the last item in the
>>> DIComplex ArrayBuffer and removing it.
>>>     1. If the last item is a DIElement, we simply delete it from the
>>> ArrayBuffer and then remove it also from the HashMap.
>>>     2. If the last item is a DIArray, we then have to look at the
>>> contents of the DIArray.
>>>         a. If the DIArray has more than one entry, remove the last
>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>> with the modified DIArray.
>>
>> I think the second part of this is actually unnecessary. Since the
>> DIArray is mutable, we only need to remove its last child. The entry in
>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>> only difference is the contents of the DIArray will be one smaller.
>>
>>>         b. If the DIArray has a single entry then we simply remove
>>> DIArray from ArrayBuffer and the HashMap.
>>>
>>>     <element name="root>
>>>          <sequence ../>
>>>          <element name="a" .../>
>>>          <element name="a" maxOccurs="unbounded">
>>>            ...
>>>            <element name="b" .../>
>>>            <element name="c" .../>
>>>            ...
>>>          </element>
>>>        </sequence>
>>>     </element>
>>>
>>> If the HashMap of DIComplex-root looks like the following:
>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>> ArrayBuffer of DIComplex-root looks like the following:
>>>      0 -> DIElement("a")
>>>      1 -> DIArray("a")
>>>
>>> Where DIArray children ArrayBuffer looks like:
>>>      0 -> DIElement("b")
>>>      1 -> DIElement("c")
>>>
>>> Given the initial example if we backtrack(1):  // purge one item
>>>      DIArray:
>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>          1 -> DIArray("a")
>>>
>>> Given the initial example if we backtrack(2): // purge two items
>>>      DIArray:
>>>          Empty
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>                                         // removed and is now empty
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>
>>> ===
>>>
>>> class DIComplex {
>>>   def nSlots = erd.nChildSlots
>>
>> I think the concept of nSlots and nChildSlots should go away completely.
>> The children ArrayBuffer should just grow as elements are added and
>> shrink as they are backtracked. It might make sense to have a tunable
>> the determines the initial size of the children Array, but the Scala
>> defaults for initializing and growing a mutable array might already be
>> reasonable.
>>
>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>
>> One thing that might make this change easier is to not worry about the
>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>> really only necessary to support query style expressions, but that comes
>> with a whole other slew of concerns (e.g. fn:count, same name but
>> different types, etc.). There currently exist restrictions that do not
>> allow query style expressions, so this Seq should always be a size of 1
>> until remove those restrictions.
>>
>>>
>>>   def addChild(node: DINode) {
>>>     children += node
>>>     fastLookup(node.name).append(node)
>>>   }
>>>
>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>   }
>>>
>>>   def backtrack(int n) {
>>>     for (i = 0; i < n; i++) {
>>>       val endIdx = children.size -1
>>>       val child = children.remove(endIdx) // children.pop()
>>>
>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>         case None => // do nothing
>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>         case Some(nodesSeq) => {
>>>       val lastEntry = nodesSeq.last
>>>           lastEntry match {
>>>         case arr: DIArray => {
>>>               // If the array has >= 2 items, remove last one and update
>>> nodesSeq
>>>               // with this array:
>>>               //  val lastIdx = nodesSeq.length - 1
>>>           //  arr.remove(arr.size - 1)
>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>> arr))
>>>               //
>>>               // If the array has 1 item, remove the array from nodesSeq
>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>         }
>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>           }
>>>
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>> }
>>>
>>> class DIArray {
>>>     private val initialSize =
>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>
>>>     def append(ie: InfosetElement): Unit = {
>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>         ie.setArray(this)
>>>     }
>>>
>>>     //
>>>     // For backtracking purposes
>>>     //
>>>     def remove(n: Int): DIElement = {
>>>         _contents.remove(n)
>>>     }
>>> }
>>>
>>> ===
>>>
>>> To facillitate a breadth-first ordering scheme we shall have to
>>
>> That this is only necessary for unordered sequences, where the insertion
>> order in the ArrayBuffer differs from the display order, correct? If
>> that's true, we definitely want to make sure to only do the below
>> sorting when an element contains an unordred sequence.
>>
>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>> shall be called childElementDeclPosition and pertains to the static
>>> order of the element children of a complex type element.  When sorting
>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>> entry in the sequence that needs to be sorted.  All elements in this
>>> array will have the same childElementDeclPosition.
>>>
>>> The difficulty in implementing this is that an Element's
>>> dsom.ComplexType can have groups within the groups before finding actual
>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>> the model group object shall have a startingChildElementDeclPosition and
>>> an endingChildElementDeclPosition.
>>>
>>> 1. The startingChildElementPosition shall be passed to the constructor
>>> for the Term.
>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>> model group, then its first child term's
>>> startingChildElementDeclPosition is equal to its own
>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>> startingChildElementDeclPosition is the prior term's
>>> endingChildElementDeclPosition + 1.
>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>> endingChildElmentDeclPosition of its last child term.
>>>
>>
>> Looks good to me. +1. This should definitely make implementation of
>> unordered sequences and query style expressions easier. Removal of the
>> slot concept should also help remove concepts that prevent faster schema
>> compilation.
>>
>> - Steve
>>
>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

Alright, currently debugging an out of bounds exception.  Yay... xD

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 11, 2017 4:07:51 PM
To: dev@daffodil.apache.org; Mike Beckerle; Taylor Wise
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/11/2017 01:01 PM, Mike Beckerle wrote:
> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.

The hashmap and hashlookups should really only be needed for elements
that are used in expressions. We came up with that optimization so that
we could remove the vast majority of hash lookups. Due to how
getChildArray works, that optimization can't occur. I'm not sure how
much this will affect performance, but there are ways to avoid the need
for a hash lookup for arrays not used in expressions. The removal of
DIArray was one. Maybe API/parser changes is better.

> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
>
>
> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.

The removal of a DIArray doesn't necessarily mean we don't have a
concept of an arrays. Every element added to Complex child arrayBuf
still knows if it part of an array via ERD.isArray. And comparing the
current DIElements ERD to the previous and next is enough to determine
where array starts and ends. From there, we can support any model that
requires a concept of an array. I'm not saying we should remove the
concept of arrays, just that our InfosetImpl doesn't necessarily need to
have a special DIArray. I'll agree that it does make the logic easier to
get it to a different model, but other things are more complicated.

> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.

Fair point. Seems a good argument for keeping DIArray, but if we don't
get around to doing this optimization, we're taking multiple hits. We're
still allocating many DISimples plus a DIArray, plus the hit from the
hashmap.

> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.

It might simplify this, but I'm not entirely convinced. I think the
trimming of unused Infoset elements is going to have some complex
heuristics regardless. Adding an extra heuristic for arrays won't
necessarily make this easier. But this is just speculation. And again,
we would still know elements are part of an array, it just takes a
couple extra comparisons to determine start/end. But I do agree that
removal of an an entire array is much easier with a DIArray concept.



It seems to me these are good enough arguments for keeping DIArray
around, but it might be worth thinking a little bit about what changes
would be necessary to the API/parser to not need this hash lookup. If
it's a minor change, it could reduce a lot of hash lookups, especially
for large arrays, which are not uncommon.


>
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Wednesday, October 11, 2017 12:21 PM
> To: Taylor Wise; dev@daffodil.apache.org
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
>
> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>
>> What you'll notice is that the current implementation will construct the DIArray
>> if it doesn't already exist and append to it if it does already exist.  But it
>> needs the 'slot' number in order to check this.  The slot number is obtained
>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>> every time we want to append to a DIArray?
> I think you're right. The way things work right now, anytime we add to a
> a DIArray we would have to do a hash lookup on the DIComplex. It's not
> limited to just DPath.
>
> Maybe this is another argument for removing the concept of a DIArray? We
> still need to know if a DIElement is part of an array for the
> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
> element should just means creating a new DISimple/DIComplex and adding
> it to the DIComplex child ArrayBuffer and appending to the HashLookup
> Seq? This would also simplify some query style logic, e.g. fn:count.
>
> Alternatively, we could modify the parser logic so that when it's
> dealing with arrays it stores the DIArray locally somewhere (PState or
> Parser) and appends straight to that without having to ask the parent
> DIComplex for the DIArray? Not sure how big of a change that would be
> though.
>
>>
>>
>> Current Implementation:
>>
>> override def addChild(e: InfosetElement): Unit = {
>>
>>   if (e.runtimeData.isArray)  {
>>
>>    // Make sure there is an array to accept the child
>>
>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>
>>      Assert.invariant(arr ne null)
>>
>>      arr.append(e)
>>
>>    } else {
>>
>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>
>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>
>>    }
>>
>>    e.setParent(this)
>>
>> }
>>
>>
>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>      Assert.usage(childERD.isArray)
>>      getChildArray(childERD.dpathElementCompileInfo)
>>    }
>>
>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>      Assert.usage(info.isArray)
>>      val slot = info.slotIndexInParent
>>      getChildArray(slot)
>>    }
>>
>>    private def getChildArray(slot: Int): InfosetArray = {
>>      val slotVal = _slots(slot)
>>      if (slotVal ne null)
>>        slotVal match {
>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>          case _ => Assert.usageError("not an array")
>>        }
>>      else {
>>        val arrayERD = erd.childERDs(slot)
>>        // slot is null. There isn't even an array object yet.
>>        // create one (it will have zero entries)
>>        val ia = new DIArray(arrayERD, this)
>>        // no array there yet. So we have to create one.
>>        setChildArray(slot, ia)
>>        ia
>>      }
>>    }
>>
>>
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>> *To:* dev@daffodil.apache.org; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>> Goal: Remove concept of slots from InfosetImpl
>>>
>>> Reasons:
>>>     1. So that element/node lookup can be name based.
>>>     2. Ease re-enabling of Unordered Sequences.
>>>     3. Ease enabling of 'query-style' expressions that return more than
>>> one node.
>>>
>>> An Infoset is a document (DIDocument) and the document contains exactly
>>> one element (DIElement) as its root.  These elements can be simple
>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>> inherit from DINode.
>>>
>>> Currently DIComplex models its children as an Array of DINodes. The
>>> problem with the current implementation is that there's no way for us to
>>> just ask for an element by name without walking the structure.
>>>
>>> The solution involves modifying DIComplex to represent its children both
>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>> insertion ordering information we need regarding the children as well as
>>> constant time indexing access to the child in question.  The ArrayBuffer
>>> also has the added benefit of growing and shrinking as objects are added
>>> and removed.  While the HashMap will allow us to associate a name
>>> (String) with DINodes that have that name thus providing a fast named
>>> lookup.
>>>
>>> This named lookup can be optimized further by only inserting into the
>>> HashMap those elements that we know are used in expressions. These are
>>> the elements that would most likely benefit from a fast named lookup and
>>> will allow us to avoid additional overhead of hashing strings and
>>> allocating/appending to a sequence.
>>>
>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>> it's possible for the following:
>>>
>>>     <element name="a" .../>
>>>     <element name="a" maxOccurs="unbounded"..../>
>>>     <element name="b".../>
>>>     <element name="a" ..../>
>>>
>>> Each element named 'a' would have an associated DINode.  Because there
>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>
>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>     "b" -> mutable.Seq(DIElement)
>>>
>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>> follows:
>>>
>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>
>>> Back tracking shall be accomplished by going to the last item in the
>>> DIComplex ArrayBuffer and removing it.
>>>     1. If the last item is a DIElement, we simply delete it from the
>>> ArrayBuffer and then remove it also from the HashMap.
>>>     2. If the last item is a DIArray, we then have to look at the
>>> contents of the DIArray.
>>>         a. If the DIArray has more than one entry, remove the last
>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>> with the modified DIArray.
>>
>> I think the second part of this is actually unnecessary. Since the
>> DIArray is mutable, we only need to remove its last child. The entry in
>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>> only difference is the contents of the DIArray will be one smaller.
>>
>>>         b. If the DIArray has a single entry then we simply remove
>>> DIArray from ArrayBuffer and the HashMap.
>>>
>>>     <element name="root>
>>>          <sequence ../>
>>>          <element name="a" .../>
>>>          <element name="a" maxOccurs="unbounded">
>>>            ...
>>>            <element name="b" .../>
>>>            <element name="c" .../>
>>>            ...
>>>          </element>
>>>        </sequence>
>>>     </element>
>>>
>>> If the HashMap of DIComplex-root looks like the following:
>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>> ArrayBuffer of DIComplex-root looks like the following:
>>>      0 -> DIElement("a")
>>>      1 -> DIArray("a")
>>>
>>> Where DIArray children ArrayBuffer looks like:
>>>      0 -> DIElement("b")
>>>      1 -> DIElement("c")
>>>
>>> Given the initial example if we backtrack(1):  // purge one item
>>>      DIArray:
>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>          1 -> DIArray("a")
>>>
>>> Given the initial example if we backtrack(2): // purge two items
>>>      DIArray:
>>>          Empty
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>                                         // removed and is now empty
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>
>>> ===
>>>
>>> class DIComplex {
>>>   def nSlots = erd.nChildSlots
>>
>> I think the concept of nSlots and nChildSlots should go away completely.
>> The children ArrayBuffer should just grow as elements are added and
>> shrink as they are backtracked. It might make sense to have a tunable
>> the determines the initial size of the children Array, but the Scala
>> defaults for initializing and growing a mutable array might already be
>> reasonable.
>>
>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>
>> One thing that might make this change easier is to not worry about the
>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>> really only necessary to support query style expressions, but that comes
>> with a whole other slew of concerns (e.g. fn:count, same name but
>> different types, etc.). There currently exist restrictions that do not
>> allow query style expressions, so this Seq should always be a size of 1
>> until remove those restrictions.
>>
>>>
>>>   def addChild(node: DINode) {
>>>     children += node
>>>     fastLookup(node.name).append(node)
>>>   }
>>>
>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>   }
>>>
>>>   def backtrack(int n) {
>>>     for (i = 0; i < n; i++) {
>>>       val endIdx = children.size -1
>>>       val child = children.remove(endIdx) // children.pop()
>>>
>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>         case None => // do nothing
>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>         case Some(nodesSeq) => {
>>>       val lastEntry = nodesSeq.last
>>>           lastEntry match {
>>>         case arr: DIArray => {
>>>               // If the array has >= 2 items, remove last one and update
>>> nodesSeq
>>>               // with this array:
>>>               //  val lastIdx = nodesSeq.length - 1
>>>           //  arr.remove(arr.size - 1)
>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>> arr))
>>>               //
>>>               // If the array has 1 item, remove the array from nodesSeq
>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>         }
>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>           }
>>>
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>> }
>>>
>>> class DIArray {
>>>     private val initialSize =
>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>
>>>     def append(ie: InfosetElement): Unit = {
>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>         ie.setArray(this)
>>>     }
>>>
>>>     //
>>>     // For backtracking purposes
>>>     //
>>>     def remove(n: Int): DIElement = {
>>>         _contents.remove(n)
>>>     }
>>> }
>>>
>>> ===
>>>
>>> To facillitate a breadth-first ordering scheme we shall have to
>>
>> That this is only necessary for unordered sequences, where the insertion
>> order in the ArrayBuffer differs from the display order, correct? If
>> that's true, we definitely want to make sure to only do the below
>> sorting when an element contains an unordred sequence.
>>
>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>> shall be called childElementDeclPosition and pertains to the static
>>> order of the element children of a complex type element.  When sorting
>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>> entry in the sequence that needs to be sorted.  All elements in this
>>> array will have the same childElementDeclPosition.
>>>
>>> The difficulty in implementing this is that an Element's
>>> dsom.ComplexType can have groups within the groups before finding actual
>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>> the model group object shall have a startingChildElementDeclPosition and
>>> an endingChildElementDeclPosition.
>>>
>>> 1. The startingChildElementPosition shall be passed to the constructor
>>> for the Term.
>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>> model group, then its first child term's
>>> startingChildElementDeclPosition is equal to its own
>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>> startingChildElementDeclPosition is the prior term's
>>> endingChildElementDeclPosition + 1.
>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>> endingChildElmentDeclPosition of its last child term.
>>>
>>
>> Looks good to me. +1. This should definitely make implementation of
>> unordered sequences and query style expressions easier. Removal of the
>> slot concept should also help remove concepts that prevent faster schema
>> compilation.
>>
>> - Steve
>>
>
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

On 10/11/2017 01:01 PM, Mike Beckerle wrote:
> You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.

The hashmap and hashlookups should really only be needed for elements
that are used in expressions. We came up with that optimization so that
we could remove the vast majority of hash lookups. Due to how
getChildArray works, that optimization can't occur. I'm not sure how
much this will affect performance, but there are ways to avoid the need
for a hash lookup for arrays not used in expressions. The removal of
DIArray was one. Maybe API/parser changes is better.

> But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.
> 
> 
> I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.

The removal of a DIArray doesn't necessarily mean we don't have a
concept of an arrays. Every element added to Complex child arrayBuf
still knows if it part of an array via ERD.isArray. And comparing the
current DIElements ERD to the previous and next is enough to determine
where array starts and ends. From there, we can support any model that
requires a concept of an array. I'm not saying we should remove the
concept of arrays, just that our InfosetImpl doesn't necessarily need to
have a special DIArray. I'll agree that it does make the logic easier to
get it to a different model, but other things are more complicated.

> Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.

Fair point. Seems a good argument for keeping DIArray, but if we don't
get around to doing this optimization, we're taking multiple hits. We're
still allocating many DISimples plus a DIArray, plus the hit from the
hashmap.

> More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.

It might simplify this, but I'm not entirely convinced. I think the
trimming of unused Infoset elements is going to have some complex
heuristics regardless. Adding an extra heuristic for arrays won't
necessarily make this easier. But this is just speculation. And again,
we would still know elements are part of an array, it just takes a
couple extra comparisons to determine start/end. But I do agree that
removal of an an entire array is much easier with a DIArray concept.



It seems to me these are good enough arguments for keeping DIArray
around, but it might be worth thinking a little bit about what changes
would be necessary to the API/parser to not need this hash lookup. If
it's a minor change, it could reduce a lot of hash lookups, especially
for large arrays, which are not uncommon.


> 
> ________________________________
> From: Steve Lawrence <sl...@apache.org>
> Sent: Wednesday, October 11, 2017 12:21 PM
> To: Taylor Wise; dev@daffodil.apache.org
> Subject: Re: Remove concept of slots from InfosetImpl Design Discussion
> 
> On 10/11/2017 12:11 PM, Taylor Wise wrote:
>>
>> What you'll notice is that the current implementation will construct the DIArray
>> if it doesn't already exist and append to it if it does already exist.  But it
>> needs the 'slot' number in order to check this.  The slot number is obtained
>> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
>> entirely, how can we accomplish this?  Will we have to do a named lookup then
>> every time we want to append to a DIArray?
> I think you're right. The way things work right now, anytime we add to a
> a DIArray we would have to do a hash lookup on the DIComplex. It's not
> limited to just DPath.
> 
> Maybe this is another argument for removing the concept of a DIArray? We
> still need to know if a DIElement is part of an array for the
> InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
> element should just means creating a new DISimple/DIComplex and adding
> it to the DIComplex child ArrayBuffer and appending to the HashLookup
> Seq? This would also simplify some query style logic, e.g. fn:count.
> 
> Alternatively, we could modify the parser logic so that when it's
> dealing with arrays it stores the DIArray locally somewhere (PState or
> Parser) and appends straight to that without having to ask the parent
> DIComplex for the DIArray? Not sure how big of a change that would be
> though.
> 
>>
>>
>> Current Implementation:
>>
>> override def addChild(e: InfosetElement): Unit = {
>>
>>   if (e.runtimeData.isArray)  {
>>
>>    // Make sure there is an array to accept the child
>>
>>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>>
>>      Assert.invariant(arr ne null)
>>
>>      arr.append(e)
>>
>>    } else {
>>
>>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>>
>>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>>
>>    }
>>
>>    e.setParent(this)
>>
>> }
>>
>>
>>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>>      Assert.usage(childERD.isArray)
>>      getChildArray(childERD.dpathElementCompileInfo)
>>    }
>>
>>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>>      Assert.usage(info.isArray)
>>      val slot = info.slotIndexInParent
>>      getChildArray(slot)
>>    }
>>
>>    private def getChildArray(slot: Int): InfosetArray = {
>>      val slotVal = _slots(slot)
>>      if (slotVal ne null)
>>        slotVal match {
>>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>>          case _ => Assert.usageError("not an array")
>>        }
>>      else {
>>        val arrayERD = erd.childERDs(slot)
>>        // slot is null. There isn't even an array object yet.
>>        // create one (it will have zero entries)
>>        val ia = new DIArray(arrayERD, this)
>>        // no array there yet. So we have to create one.
>>        setChildArray(slot, ia)
>>        ia
>>      }
>>    }
>>
>>
>>
>> --------------------------------------------------------------------------------
>> *From:* Steve Lawrence <sl...@apache.org>
>> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
>> *To:* dev@daffodil.apache.org; Taylor Wise
>> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
>> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>>> Goal: Remove concept of slots from InfosetImpl
>>>
>>> Reasons:
>>>     1. So that element/node lookup can be name based.
>>>     2. Ease re-enabling of Unordered Sequences.
>>>     3. Ease enabling of 'query-style' expressions that return more than
>>> one node.
>>>
>>> An Infoset is a document (DIDocument) and the document contains exactly
>>> one element (DIElement) as its root.  These elements can be simple
>>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>> 1). A DIArray shall always be a child of DIComplex.  All of these
>>> inherit from DINode.
>>>
>>> Currently DIComplex models its children as an Array of DINodes. The
>>> problem with the current implementation is that there's no way for us to
>>> just ask for an element by name without walking the structure.
>>>
>>> The solution involves modifying DIComplex to represent its children both
>>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>>> insertion ordering information we need regarding the children as well as
>>> constant time indexing access to the child in question.  The ArrayBuffer
>>> also has the added benefit of growing and shrinking as objects are added
>>> and removed.  While the HashMap will allow us to associate a name
>>> (String) with DINodes that have that name thus providing a fast named
>>> lookup.
>>>
>>> This named lookup can be optimized further by only inserting into the
>>> HashMap those elements that we know are used in expressions. These are
>>> the elements that would most likely benefit from a fast named lookup and
>>> will allow us to avoid additional overhead of hashing strings and
>>> allocating/appending to a sequence.
>>>
>>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>>> it's possible for the following:
>>>
>>>     <element name="a" .../>
>>>     <element name="a" maxOccurs="unbounded"..../>
>>>     <element name="b".../>
>>>     <element name="a" ..../>
>>>
>>> Each element named 'a' would have an associated DINode.  Because there
>>> are multiple DINodes named 'a', we need to represent them in the HashMap
>>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>>
>>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>>     "b" -> mutable.Seq(DIElement)
>>>
>>> The mutable.ArrayBuffer maintains the insertion order and would look as
>>> follows:
>>>
>>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>>
>>> Back tracking shall be accomplished by going to the last item in the
>>> DIComplex ArrayBuffer and removing it.
>>>     1. If the last item is a DIElement, we simply delete it from the
>>> ArrayBuffer and then remove it also from the HashMap.
>>>     2. If the last item is a DIArray, we then have to look at the
>>> contents of the DIArray.
>>>         a. If the DIArray has more than one entry, remove the last
>>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>>> with the modified DIArray.
>>
>> I think the second part of this is actually unnecessary. Since the
>> DIArray is mutable, we only need to remove its last child. The entry in
>> the ArrayBuffer and HashMap will still point to the same DIArray. The
>> only difference is the contents of the DIArray will be one smaller.
>>
>>>         b. If the DIArray has a single entry then we simply remove
>>> DIArray from ArrayBuffer and the HashMap.
>>>
>>>     <element name="root>
>>>          <sequence ../>
>>>          <element name="a" .../>
>>>          <element name="a" maxOccurs="unbounded">
>>>            ...
>>>            <element name="b" .../>
>>>            <element name="c" .../>
>>>            ...
>>>          </element>
>>>        </sequence>
>>>     </element>
>>>
>>> If the HashMap of DIComplex-root looks like the following:
>>>      "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>> ArrayBuffer of DIComplex-root looks like the following:
>>>      0 -> DIElement("a")
>>>      1 -> DIArray("a")
>>>
>>> Where DIArray children ArrayBuffer looks like:
>>>      0 -> DIElement("b")
>>>      1 -> DIElement("c")
>>>
>>> Given the initial example if we backtrack(1):  // purge one item
>>>      DIArray:
>>>          0 -> DIElement("b")  // removed one item from DIArray
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement, DIArray)
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>          1 -> DIArray("a")
>>>
>>> Given the initial example if we backtrack(2): // purge two items
>>>      DIArray:
>>>          Empty
>>>
>>>      HashMap DIComplex-root:
>>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>>                                         // removed and is now empty
>>>
>>>      ArrayBuffer DIComplex-root:
>>>          0 -> DIElement("a")
>>>
>>> ===
>>>
>>> class DIComplex {
>>>   def nSlots = erd.nChildSlots
>>
>> I think the concept of nSlots and nChildSlots should go away completely.
>> The children ArrayBuffer should just grow as elements are added and
>> shrink as they are backtracked. It might make sense to have a tunable
>> the determines the initial size of the children Array, but the Scala
>> defaults for initializing and growing a mutable array might already be
>> reasonable.
>>
>>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>>
>> One thing that might make this change easier is to not worry about the
>> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
>> really only necessary to support query style expressions, but that comes
>> with a whole other slew of concerns (e.g. fn:count, same name but
>> different types, etc.). There currently exist restrictions that do not
>> allow query style expressions, so this Seq should always be a size of 1
>> until remove those restrictions.
>>
>>>
>>>   def addChild(node: DINode) {
>>>     children += node
>>>     fastLookup(node.name).append(node)
>>>   }
>>>
>>>   def getChild(name: String): immutable.Seq[DINode] {
>>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>>   }
>>>
>>>   def backtrack(int n) {
>>>     for (i = 0; i < n; i++) {
>>>       val endIdx = children.size -1
>>>       val child = children.remove(endIdx) // children.pop()
>>>
>>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>>         case None => // do nothing
>>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>>         case Some(nodesSeq) => {
>>>       val lastEntry = nodesSeq.last
>>>           lastEntry match {
>>>         case arr: DIArray => {
>>>               // If the array has >= 2 items, remove last one and update
>>> nodesSeq
>>>               // with this array:
>>>               //  val lastIdx = nodesSeq.length - 1
>>>           //  arr.remove(arr.size - 1)
>>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>>> arr))
>>>               //
>>>               // If the array has 1 item, remove the array from nodesSeq
>>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>         }
>>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>>           }
>>>
>>>         }
>>>       }
>>>     }
>>>   }
>>>
>>> }
>>>
>>> class DIArray {
>>>     private val initialSize =
>>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>>
>>>     def append(ie: InfosetElement): Unit = {
>>>         _contents.append(ie.asInstanceOf[DIElement])
>>>         ie.setArray(this)
>>>     }
>>>
>>>     //
>>>     // For backtracking purposes
>>>     //
>>>     def remove(n: Int): DIElement = {
>>>         _contents.remove(n)
>>>     }
>>> }
>>>
>>> ===
>>>
>>> To facillitate a breadth-first ordering scheme we shall have to
>>
>> That this is only necessary for unordered sequences, where the insertion
>> order in the ArrayBuffer differs from the display order, correct? If
>> that's true, we definitely want to make sure to only do the below
>> sorting when an element contains an unordred sequence.
>>
>>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>>> shall be called childElementDeclPosition and pertains to the static
>>> order of the element children of a complex type element.  When sorting
>>> DINodes, childElementDeclPosition can be obtained by way of reaching
>>> from the DINode to the coresponding ERD.  Any array shall be a single
>>> entry in the sequence that needs to be sorted.  All elements in this
>>> array will have the same childElementDeclPosition.
>>>
>>> The difficulty in implementing this is that an Element's
>>> dsom.ComplexType can have groups within the groups before finding actual
>>> element children.  Therefore in the Document Schema Object Model (DSOM)
>>> the model group object shall have a startingChildElementDeclPosition and
>>> an endingChildElementDeclPosition.
>>>
>>> 1. The startingChildElementPosition shall be passed to the constructor
>>> for the Term.
>>> 2. If the term is an element, the endingChildElementDeclPosition is
>>> equal to the startingChildElementDeclPosition.  3. If the term is a
>>> model group, then its first child term's
>>> startingChildElementDeclPosition is equal to its own
>>> startingChildElementDeclPosition, the 2nd to Nth child term's
>>> startingChildElementDeclPosition is the prior term's
>>> endingChildElementDeclPosition + 1.
>>> 4. The endingChildElementDeclPosition for a model-group shall be the
>>> endingChildElmentDeclPosition of its last child term.
>>>
>>
>> Looks good to me. +1. This should definitely make implementation of
>> unordered sequences and query style expressions easier. Removal of the
>> slot concept should also help remove concepts that prevent faster schema
>> compilation.
>>
>> - Steve
>>
> 
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Mike Beckerle <mb...@tresys.com>.

You have to hash lookup the name to append to an array if you're just appending one thing at a time starting from the slot/hashID. The current Infoset API hides the construction of the DIArray, and only provides access via the slot/hashID, so there's no way to avoid the hash per each append.


But that's fixable if we want. Probably the API to infoset needs to allow appending to an array more efficently, so that a loop like a repeat combinator can be directly holding onto the DIArray's arrayBuffer appending objects as the parser returns them. But I would not make this change as yet. First things first. Don't optimize without profiling info, etc.


I'm very reluctant to remove DIArray because XML's data model isn't the only game in town. The XML decision to not have arrays, just adjacent things that happen to have the same name, has all sorts of negative performance implications.


Consider this: we could create DIArrays of elements of simple type that don't even create a DISimple node for each array element. The values could be stored in the DIArray's arrayBuffer directly, eliminating all that overhead of a DISimple node. (This was the original intent for DIArray, but was punted to get it working sooner.) The opportunity to do this is entirely lost if you don't have DIArray, but just a bunch of adjacent DIElements that happen to share ERD.


More speculatively, when we someday implement incremental array trimming so that streaming parse/unparse can drop most array elements once they've been delivered to consumer, well then we will definitely want a DIArray object to abstract the complexity of this trim behavior and the underlying representation it requires.


________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, October 11, 2017 12:21 PM
To: Taylor Wise; dev@daffodil.apache.org
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 10/11/2017 12:11 PM, Taylor Wise wrote:
>
> What you'll notice is that the current implementation will construct the DIArray
> if it doesn't already exist and append to it if it does already exist.  But it
> needs the 'slot' number in order to check this.  The slot number is obtained
> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots
> entirely, how can we accomplish this?  Will we have to do a named lookup then
> every time we want to append to a DIArray?
I think you're right. The way things work right now, anytime we add to a
a DIArray we would have to do a hash lookup on the DIComplex. It's not
limited to just DPath.

Maybe this is another argument for removing the concept of a DIArray? We
still need to know if a DIElement is part of an array for the
InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
element should just means creating a new DISimple/DIComplex and adding
it to the DIComplex child ArrayBuffer and appending to the HashLookup
Seq? This would also simplify some query style logic, e.g. fn:count.

Alternatively, we could modify the parser logic so that when it's
dealing with arrays it stores the DIArray locally somewhere (PState or
Parser) and appends straight to that without having to ask the parent
DIComplex for the DIArray? Not sure how big of a change that would be
though.

>
>
> Current Implementation:
>
> override def addChild(e: InfosetElement): Unit = {
>
>   if (e.runtimeData.isArray)  {
>
>    // Make sure there is an array to accept the child
>
>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
>
>      Assert.invariant(arr ne null)
>
>      arr.append(e)
>
>    } else {
>
>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
>
>      _lastSlotAdded = e.runtimeData.slotIndexInParent
>
>    }
>
>    e.setParent(this)
>
> }
>
>
>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>      Assert.usage(childERD.isArray)
>      getChildArray(childERD.dpathElementCompileInfo)
>    }
>
>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>      Assert.usage(info.isArray)
>      val slot = info.slotIndexInParent
>      getChildArray(slot)
>    }
>
>    private def getChildArray(slot: Int): InfosetArray = {
>      val slotVal = _slots(slot)
>      if (slotVal ne null)
>        slotVal match {
>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>          case _ => Assert.usageError("not an array")
>        }
>      else {
>        val arrayERD = erd.childERDs(slot)
>        // slot is null. There isn't even an array object yet.
>        // create one (it will have zero entries)
>        val ia = new DIArray(arrayERD, this)
>        // no array there yet. So we have to create one.
>        setChildArray(slot, ia)
>        ia
>      }
>    }
>
>
>
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
> *To:* dev@daffodil.apache.org; Taylor Wise
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>> Goal: Remove concept of slots from InfosetImpl
>>
>> Reasons:
>>     1. So that element/node lookup can be name based.
>>     2. Ease re-enabling of Unordered Sequences.
>>     3. Ease enabling of 'query-style' expressions that return more than
>> one node.
>>
>> An Infoset is a document (DIDocument) and the document contains exactly
>> one element (DIElement) as its root.  These elements can be simple
>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>1). A DIArray shall always be a child of DIComplex.  All of these
>> inherit from DINode.
>>
>> Currently DIComplex models its children as an Array of DINodes. The
>> problem with the current implementation is that there's no way for us to
>> just ask for an element by name without walking the structure.
>>
>> The solution involves modifying DIComplex to represent its children both
>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>> insertion ordering information we need regarding the children as well as
>> constant time indexing access to the child in question.  The ArrayBuffer
>> also has the added benefit of growing and shrinking as objects are added
>> and removed.  While the HashMap will allow us to associate a name
>> (String) with DINodes that have that name thus providing a fast named
>> lookup.
>>
>> This named lookup can be optimized further by only inserting into the
>> HashMap those elements that we know are used in expressions. These are
>> the elements that would most likely benefit from a fast named lookup and
>> will allow us to avoid additional overhead of hashing strings and
>> allocating/appending to a sequence.
>>
>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>> it's possible for the following:
>>
>>     <element name="a" .../>
>>     <element name="a" maxOccurs="unbounded"..../>
>>     <element name="b".../>
>>     <element name="a" ..../>
>>
>> Each element named 'a' would have an associated DINode.  Because there
>> are multiple DINodes named 'a', we need to represent them in the HashMap
>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>>
>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>     "b" -> mutable.Seq(DIElement)
>>
>> The mutable.ArrayBuffer maintains the insertion order and would look as
>> follows:
>>
>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>>
>> Back tracking shall be accomplished by going to the last item in the
>> DIComplex ArrayBuffer and removing it.
>>     1. If the last item is a DIElement, we simply delete it from the
>> ArrayBuffer and then remove it also from the HashMap.
>>     2. If the last item is a DIArray, we then have to look at the
>> contents of the DIArray.
>>         a. If the DIArray has more than one entry, remove the last
>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>> with the modified DIArray.
>
> I think the second part of this is actually unnecessary. Since the
> DIArray is mutable, we only need to remove its last child. The entry in
> the ArrayBuffer and HashMap will still point to the same DIArray. The
> only difference is the contents of the DIArray will be one smaller.
>
>>         b. If the DIArray has a single entry then we simply remove
>> DIArray from ArrayBuffer and the HashMap.
>>
>>     <element name="root>
>>          <sequence ../>
>>          <element name="a" .../>
>>          <element name="a" maxOccurs="unbounded">
>>            ...
>>            <element name="b" .../>
>>            <element name="c" .../>
>>            ...
>>          </element>
>>        </sequence>
>>     </element>
>>
>> If the HashMap of DIComplex-root looks like the following:
>>      "a" -> mutable.Seq(DIElement, DIArray)
>>
>> ArrayBuffer of DIComplex-root looks like the following:
>>      0 -> DIElement("a")
>>      1 -> DIArray("a")
>>
>> Where DIArray children ArrayBuffer looks like:
>>      0 -> DIElement("b")
>>      1 -> DIElement("c")
>>
>> Given the initial example if we backtrack(1):  // purge one item
>>      DIArray:
>>          0 -> DIElement("b")  // removed one item from DIArray
>>
>>      HashMap DIComplex-root:
>>          "a" -> mutable.Seq(DIElement, DIArray)
>>
>>      ArrayBuffer DIComplex-root:
>>          0 -> DIElement("a")
>>          1 -> DIArray("a")
>>
>> Given the initial example if we backtrack(2): // purge two items
>>      DIArray:
>>          Empty
>>
>>      HashMap DIComplex-root:
>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>                                         // removed and is now empty
>>
>>      ArrayBuffer DIComplex-root:
>>          0 -> DIElement("a")
>>
>> ===
>>
>> class DIComplex {
>>   def nSlots = erd.nChildSlots
>
> I think the concept of nSlots and nChildSlots should go away completely.
> The children ArrayBuffer should just grow as elements are added and
> shrink as they are backtracked. It might make sense to have a tunable
> the determines the initial size of the children Array, but the Scala
> defaults for initializing and growing a mutable array might already be
> reasonable.
>
>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
>
> One thing that might make this change easier is to not worry about the
> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
> really only necessary to support query style expressions, but that comes
> with a whole other slew of concerns (e.g. fn:count, same name but
> different types, etc.). There currently exist restrictions that do not
> allow query style expressions, so this Seq should always be a size of 1
> until remove those restrictions.
>
>>
>>   def addChild(node: DINode) {
>>     children += node
>>     fastLookup(node.name).append(node)
>>   }
>>
>>   def getChild(name: String): immutable.Seq[DINode] {
>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>   }
>>
>>   def backtrack(int n) {
>>     for (i = 0; i < n; i++) {
>>       val endIdx = children.size -1
>>       val child = children.remove(endIdx) // children.pop()
>>
>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>         case None => // do nothing
>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>         case Some(nodesSeq) => {
>>       val lastEntry = nodesSeq.last
>>           lastEntry match {
>>         case arr: DIArray => {
>>               // If the array has >= 2 items, remove last one and update
>> nodesSeq
>>               // with this array:
>>               //  val lastIdx = nodesSeq.length - 1
>>           //  arr.remove(arr.size - 1)
>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>> arr))
>>               //
>>               // If the array has 1 item, remove the array from nodesSeq
>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>         }
>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>           }
>>
>>         }
>>       }
>>     }
>>   }
>>
>> }
>>
>> class DIArray {
>>     private val initialSize =
>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>>
>>     def append(ie: InfosetElement): Unit = {
>>         _contents.append(ie.asInstanceOf[DIElement])
>>         ie.setArray(this)
>>     }
>>
>>     //
>>     // For backtracking purposes
>>     //
>>     def remove(n: Int): DIElement = {
>>         _contents.remove(n)
>>     }
>> }
>>
>> ===
>>
>> To facillitate a breadth-first ordering scheme we shall have to
>
> That this is only necessary for unordered sequences, where the insertion
> order in the ArrayBuffer differs from the display order, correct? If
> that's true, we definitely want to make sure to only do the below
> sorting when an element contains an unordred sequence.
>
>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>> shall be called childElementDeclPosition and pertains to the static
>> order of the element children of a complex type element.  When sorting
>> DINodes, childElementDeclPosition can be obtained by way of reaching
>> from the DINode to the coresponding ERD.  Any array shall be a single
>> entry in the sequence that needs to be sorted.  All elements in this
>> array will have the same childElementDeclPosition.
>>
>> The difficulty in implementing this is that an Element's
>> dsom.ComplexType can have groups within the groups before finding actual
>> element children.  Therefore in the Document Schema Object Model (DSOM)
>> the model group object shall have a startingChildElementDeclPosition and
>> an endingChildElementDeclPosition.
>>
>> 1. The startingChildElementPosition shall be passed to the constructor
>> for the Term.
>> 2. If the term is an element, the endingChildElementDeclPosition is
>> equal to the startingChildElementDeclPosition.  3. If the term is a
>> model group, then its first child term's
>> startingChildElementDeclPosition is equal to its own
>> startingChildElementDeclPosition, the 2nd to Nth child term's
>> startingChildElementDeclPosition is the prior term's
>> endingChildElementDeclPosition + 1.
>> 4. The endingChildElementDeclPosition for a model-group shall be the
>> endingChildElmentDeclPosition of its last child term.
>>
>
> Looks good to me. +1. This should definitely make implementation of
> unordered sequences and query style expressions easier. Removal of the
> slot concept should also help remove concepts that prevent faster schema
> compilation.
>
> - Steve
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Steve Lawrence <sl...@apache.org>.

On 10/11/2017 12:11 PM, Taylor Wise wrote:
> 
> What you'll notice is that the current implementation will construct the DIArray 
> if it doesn't already exist and append to it if it does already exist.  But it 
> needs the 'slot' number in order to check this.  The slot number is obtained 
> from DPathCompileInfo.slotIndexInParent. If we're removing this concept of slots 
> entirely, how can we accomplish this?  Will we have to do a named lookup then 
> every time we want to append to a DIArray?
I think you're right. The way things work right now, anytime we add to a
a DIArray we would have to do a hash lookup on the DIComplex. It's not
limited to just DPath.

Maybe this is another argument for removing the concept of a DIArray? We
still need to know if a DIElement is part of an array for the
InfosetOutputter (erd.isArray), but maybe adding an DIElement a DIArray
element should just means creating a new DISimple/DIComplex and adding
it to the DIComplex child ArrayBuffer and appending to the HashLookup
Seq? This would also simplify some query style logic, e.g. fn:count.

Alternatively, we could modify the parser logic so that when it's
dealing with arrays it stores the DIArray locally somewhere (PState or
Parser) and appends straight to that without having to ask the parent
DIComplex for the DIArray? Not sure how big of a change that would be
though.

> 
> 
> Current Implementation:
> 
> override def addChild(e: InfosetElement): Unit = {
> 
>   if (e.runtimeData.isArray)  {
> 
>    // Make sure there is an array to accept the child
> 
>    val arr = getChildArray(e.runtimeData) // creates if doesn't exist
> 
>      Assert.invariant(arr ne null)
> 
>      arr.append(e)
> 
>    } else {
> 
>      _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]
> 
>      _lastSlotAdded = e.runtimeData.slotIndexInParent
> 
>    }
> 
>    e.setParent(this)
> 
> }
> 
> 
>    final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
>      Assert.usage(childERD.isArray)
>      getChildArray(childERD.dpathElementCompileInfo)
>    }
> 
>    final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
>      Assert.usage(info.isArray)
>      val slot = info.slotIndexInParent
>      getChildArray(slot)
>    }
> 
>    private def getChildArray(slot: Int): InfosetArray = {
>      val slotVal = _slots(slot)
>      if (slotVal ne null)
>        slotVal match {
>          case arr: DIArray => slotVal.asInstanceOf[DIArray]
>          case _ => Assert.usageError("not an array")
>        }
>      else {
>        val arrayERD = erd.childERDs(slot)
>        // slot is null. There isn't even an array object yet.
>        // create one (it will have zero entries)
>        val ia = new DIArray(arrayERD, this)
>        // no array there yet. So we have to create one.
>        setChildArray(slot, ia)
>        ia
>      }
>    }
> 
> 
> 
> --------------------------------------------------------------------------------
> *From:* Steve Lawrence <sl...@apache.org>
> *Sent:* Tuesday, October 3, 2017 10:05:43 AM
> *To:* dev@daffodil.apache.org; Taylor Wise
> *Subject:* Re: Remove concept of slots from InfosetImpl Design Discussion
> On 09/22/2017 04:25 PM, Taylor Wise wrote:
>> Goal: Remove concept of slots from InfosetImpl
>> 
>> Reasons:
>>     1. So that element/node lookup can be name based.
>>     2. Ease re-enabling of Unordered Sequences.
>>     3. Ease enabling of 'query-style' expressions that return more than
>> one node.
>> 
>> An Infoset is a document (DIDocument) and the document contains exactly
>> one element (DIElement) as its root.  These elements can be simple
>> (DISimple) or complex (DIComplex).  Simple elements do not contain
>> children, complex elements do.  DIArray represents an array (maxOccurs
>>>1). A DIArray shall always be a child of DIComplex.  All of these
>> inherit from DINode.
>> 
>> Currently DIComplex models its children as an Array of DINodes. The
>> problem with the current implementation is that there's no way for us to
>> just ask for an element by name without walking the structure.
>> 
>> The solution involves modifying DIComplex to represent its children both
>> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
>> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
>> insertion ordering information we need regarding the children as well as
>> constant time indexing access to the child in question.  The ArrayBuffer
>> also has the added benefit of growing and shrinking as objects are added
>> and removed.  While the HashMap will allow us to associate a name
>> (String) with DINodes that have that name thus providing a fast named
>> lookup.
>> 
>> This named lookup can be optimized further by only inserting into the
>> HashMap those elements that we know are used in expressions. These are
>> the elements that would most likely benefit from a fast named lookup and
>> will allow us to avoid additional overhead of hashing strings and
>> allocating/appending to a sequence.
>> 
>> The reason for having the HashMap contain a mutable.Seq[DINode] is that
>> it's possible for the following:
>> 
>>     <element name="a" .../>
>>     <element name="a" maxOccurs="unbounded"..../>
>>     <element name="b".../>
>>     <element name="a" ..../>
>> 
>> Each element named 'a' would have an associated DINode.  Because there
>> are multiple DINodes named 'a', we need to represent them in the HashMap
>> using the same key 'a'.  To do this, we add all instances of 'a' to a
>> Seq[DINode].  In this particular case, the HashMap would look as follows:
>> 
>>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>>     "b" -> mutable.Seq(DIElement)
>> 
>> The mutable.ArrayBuffer maintains the insertion order and would look as
>> follows:
>> 
>>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>> 
>> Back tracking shall be accomplished by going to the last item in the
>> DIComplex ArrayBuffer and removing it.
>>     1. If the last item is a DIElement, we simply delete it from the
>> ArrayBuffer and then remove it also from the HashMap.
>>     2. If the last item is a DIArray, we then have to look at the
>> contents of the DIArray.
>>         a. If the DIArray has more than one entry, remove the last
>> entry.  Then update the associated entry in the ArrayBuffer and HashMap
>> with the modified DIArray.
> 
> I think the second part of this is actually unnecessary. Since the
> DIArray is mutable, we only need to remove its last child. The entry in
> the ArrayBuffer and HashMap will still point to the same DIArray. The
> only difference is the contents of the DIArray will be one smaller.
> 
>>         b. If the DIArray has a single entry then we simply remove
>> DIArray from ArrayBuffer and the HashMap.
>> 
>>     <element name="root>
>>          <sequence ../>
>>          <element name="a" .../>
>>          <element name="a" maxOccurs="unbounded">
>>            ...
>>            <element name="b" .../>
>>            <element name="c" .../>
>>            ...
>>          </element>
>>        </sequence>
>>     </element>
>> 
>> If the HashMap of DIComplex-root looks like the following:
>>      "a" -> mutable.Seq(DIElement, DIArray)
>> 
>> ArrayBuffer of DIComplex-root looks like the following:
>>      0 -> DIElement("a")
>>      1 -> DIArray("a")
>> 
>> Where DIArray children ArrayBuffer looks like:
>>      0 -> DIElement("b")
>>      1 -> DIElement("c")
>> 
>> Given the initial example if we backtrack(1):  // purge one item
>>      DIArray:
>>          0 -> DIElement("b")  // removed one item from DIArray
>> 
>>      HashMap DIComplex-root:
>>          "a" -> mutable.Seq(DIElement, DIArray)
>> 
>>      ArrayBuffer DIComplex-root:
>>          0 -> DIElement("a")
>>          1 -> DIArray("a")
>> 
>> Given the initial example if we backtrack(2): // purge two items
>>      DIArray:
>>          Empty
>> 
>>      HashMap DIComplex-root:
>>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>>                                         // removed and is now empty
>> 
>>      ArrayBuffer DIComplex-root:
>>          0 -> DIElement("a")
>> 
>> ===
>> 
>> class DIComplex {
>>   def nSlots = erd.nChildSlots
> 
> I think the concept of nSlots and nChildSlots should go away completely.
> The children ArrayBuffer should just grow as elements are added and
> shrink as they are backtracked. It might make sense to have a tunable
> the determines the initial size of the children Array, but the Scala
> defaults for initializing and growing a mutable array might already be
> reasonable.
> 
>>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]
> 
> One thing that might make this change easier is to not worry about the
> mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
> really only necessary to support query style expressions, but that comes
> with a whole other slew of concerns (e.g. fn:count, same name but
> different types, etc.). There currently exist restrictions that do not
> allow query style expressions, so this Seq should always be a size of 1
> until remove those restrictions.
> 
>> 
>>   def addChild(node: DINode) {
>>     children += node
>>     fastLookup(node.name).append(node)
>>   }
>> 
>>   def getChild(name: String): immutable.Seq[DINode] {
>>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>>   }
>> 
>>   def backtrack(int n) {
>>     for (i = 0; i < n; i++) {
>>       val endIdx = children.size -1
>>       val child = children.remove(endIdx) // children.pop()
>> 
>>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>>         case None => // do nothing
>>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>>         case Some(nodesSeq) => {
>>       val lastEntry = nodesSeq.last
>>           lastEntry match {
>>         case arr: DIArray => {
>>               // If the array has >= 2 items, remove last one and update
>> nodesSeq
>>               // with this array:
>>               //  val lastIdx = nodesSeq.length - 1
>>           //  arr.remove(arr.size - 1)
>>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
>> arr))
>>               //
>>               // If the array has 1 item, remove the array from nodesSeq
>>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>>         }
>>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>>           }
>> 
>>         }
>>       }
>>     }
>>   }
>> 
>> }
>> 
>> class DIArray {
>>     private val initialSize =
>> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>> 
>>     def append(ie: InfosetElement): Unit = {
>>         _contents.append(ie.asInstanceOf[DIElement])
>>         ie.setArray(this)
>>     }
>> 
>>     //
>>     // For backtracking purposes
>>     //
>>     def remove(n: Int): DIElement = {
>>         _contents.remove(n)
>>     }
>> }
>> 
>> ===
>> 
>> To facillitate a breadth-first ordering scheme we shall have to
> 
> That this is only necessary for unordered sequences, where the insertion
> order in the ArrayBuffer differs from the display order, correct? If
> that's true, we definitely want to make sure to only do the below
> sorting when an element contains an unordred sequence.
> 
>> construct a 'global' that lives on the Element Runtime Data (ERD).  It
>> shall be called childElementDeclPosition and pertains to the static
>> order of the element children of a complex type element.  When sorting
>> DINodes, childElementDeclPosition can be obtained by way of reaching
>> from the DINode to the coresponding ERD.  Any array shall be a single
>> entry in the sequence that needs to be sorted.  All elements in this
>> array will have the same childElementDeclPosition.
>> 
>> The difficulty in implementing this is that an Element's
>> dsom.ComplexType can have groups within the groups before finding actual
>> element children.  Therefore in the Document Schema Object Model (DSOM)
>> the model group object shall have a startingChildElementDeclPosition and
>> an endingChildElementDeclPosition.
>> 
>> 1. The startingChildElementPosition shall be passed to the constructor
>> for the Term.
>> 2. If the term is an element, the endingChildElementDeclPosition is
>> equal to the startingChildElementDeclPosition.  3. If the term is a
>> model group, then its first child term's
>> startingChildElementDeclPosition is equal to its own
>> startingChildElementDeclPosition, the 2nd to Nth child term's
>> startingChildElementDeclPosition is the prior term's
>> endingChildElementDeclPosition + 1.
>> 4. The endingChildElementDeclPosition for a model-group shall be the
>> endingChildElmentDeclPosition of its last child term.
>> 
> 
> Looks good to me. +1. This should definitely make implementation of
> unordered sequences and query style expressions easier. Removal of the
> slot concept should also help remove concepts that prevent faster schema
> compilation.
> 
> - Steve
>

Re: Remove concept of slots from InfosetImpl Design Discussion

Posted by Taylor Wise <tw...@tresys.com>.

What you'll notice is that the current implementation will construct the DIArray if it doesn't already exist and append to it if it does already exist.  But it needs the 'slot' number in order to check this.  The slot number is obtained from DPathCompileInfo.slotIndexInParent.  If we're removing this concept of slots entirely, how can we accomplish this?  Will we have to do a named lookup then every time we want to append to a DIArray?


Current Implementation:

override def addChild(e: InfosetElement): Unit = {

 if (e.runtimeData.isArray)  {

  // Make sure there is an array to accept the child

  val arr = getChildArray(e.runtimeData) // creates if doesn't exist

    Assert.invariant(arr ne null)

    arr.append(e)

  } else {

    _slots(e.runtimeData.slotIndexInParent) = e.asInstanceOf[DINode]

    _lastSlotAdded = e.runtimeData.slotIndexInParent

  }

  e.setParent(this)

}


  final def getChildArray(childERD: ElementRuntimeData): InfosetArray = {
    Assert.usage(childERD.isArray)
    getChildArray(childERD.dpathElementCompileInfo)
  }

  final def getChildArray(info: DPathElementCompileInfo): InfosetArray = {
    Assert.usage(info.isArray)
    val slot = info.slotIndexInParent
    getChildArray(slot)
  }


  private def getChildArray(slot: Int): InfosetArray = {
    val slotVal = _slots(slot)
    if (slotVal ne null)
      slotVal match {
        case arr: DIArray => slotVal.asInstanceOf[DIArray]
        case _ => Assert.usageError("not an array")
      }
    else {
      val arrayERD = erd.childERDs(slot)
      // slot is null. There isn't even an array object yet.
      // create one (it will have zero entries)
      val ia = new DIArray(arrayERD, this)
      // no array there yet. So we have to create one.
      setChildArray(slot, ia)
      ia
    }
  }



________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Tuesday, October 3, 2017 10:05:43 AM
To: dev@daffodil.apache.org; Taylor Wise
Subject: Re: Remove concept of slots from InfosetImpl Design Discussion

On 09/22/2017 04:25 PM, Taylor Wise wrote:
> Goal: Remove concept of slots from InfosetImpl
>
> Reasons:
>     1. So that element/node lookup can be name based.
>     2. Ease re-enabling of Unordered Sequences.
>     3. Ease enabling of 'query-style' expressions that return more than
> one node.
>
> An Infoset is a document (DIDocument) and the document contains exactly
> one element (DIElement) as its root.  These elements can be simple
> (DISimple) or complex (DIComplex).  Simple elements do not contain
> children, complex elements do.  DIArray represents an array (maxOccurs
>>1). A DIArray shall always be a child of DIComplex.  All of these
> inherit from DINode.
>
> Currently DIComplex models its children as an Array of DINodes. The
> problem with the current implementation is that there's no way for us to
> just ask for an element by name without walking the structure.
>
> The solution involves modifying DIComplex to represent its children both
> as a mutable.ArrayBuffer[DINode] and a HashMap[String,
> mutable.Seq[DINode]].  The ArrayBuffer shall give us the necessary
> insertion ordering information we need regarding the children as well as
> constant time indexing access to the child in question.  The ArrayBuffer
> also has the added benefit of growing and shrinking as objects are added
> and removed.  While the HashMap will allow us to associate a name
> (String) with DINodes that have that name thus providing a fast named
> lookup.
>
> This named lookup can be optimized further by only inserting into the
> HashMap those elements that we know are used in expressions. These are
> the elements that would most likely benefit from a fast named lookup and
> will allow us to avoid additional overhead of hashing strings and
> allocating/appending to a sequence.
>
> The reason for having the HashMap contain a mutable.Seq[DINode] is that
> it's possible for the following:
>
>     <element name="a" .../>
>     <element name="a" maxOccurs="unbounded"..../>
>     <element name="b".../>
>     <element name="a" ..../>
>
> Each element named 'a' would have an associated DINode.  Because there
> are multiple DINodes named 'a', we need to represent them in the HashMap
> using the same key 'a'.  To do this, we add all instances of 'a' to a
> Seq[DINode].  In this particular case, the HashMap would look as follows:
>
>     "a" -> mutable.Seq(DIElement, DIArray, DIElement)
>     "b" -> mutable.Seq(DIElement)
>
> The mutable.ArrayBuffer maintains the insertion order and would look as
> follows:
>
>     mutable.ArrayBuffer(DIElement, DIArray, DIElement, DIElement)
>
> Back tracking shall be accomplished by going to the last item in the
> DIComplex ArrayBuffer and removing it.
>     1. If the last item is a DIElement, we simply delete it from the
> ArrayBuffer and then remove it also from the HashMap.
>     2. If the last item is a DIArray, we then have to look at the
> contents of the DIArray.
>         a. If the DIArray has more than one entry, remove the last
> entry.  Then update the associated entry in the ArrayBuffer and HashMap
> with the modified DIArray.

I think the second part of this is actually unnecessary. Since the
DIArray is mutable, we only need to remove its last child. The entry in
the ArrayBuffer and HashMap will still point to the same DIArray. The
only difference is the contents of the DIArray will be one smaller.

>         b. If the DIArray has a single entry then we simply remove
> DIArray from ArrayBuffer and the HashMap.
>
>     <element name="root>
>          <sequence ../>
>          <element name="a" .../>
>          <element name="a" maxOccurs="unbounded">
>            ...
>            <element name="b" .../>
>            <element name="c" .../>
>            ...
>          </element>
>        </sequence>
>     </element>
>
> If the HashMap of DIComplex-root looks like the following:
>      "a" -> mutable.Seq(DIElement, DIArray)
>
> ArrayBuffer of DIComplex-root looks like the following:
>      0 -> DIElement("a")
>      1 -> DIArray("a")
>
> Where DIArray children ArrayBuffer looks like:
>      0 -> DIElement("b")
>      1 -> DIElement("c")
>
> Given the initial example if we backtrack(1):  // purge one item
>      DIArray:
>          0 -> DIElement("b")  // removed one item from DIArray
>
>      HashMap DIComplex-root:
>          "a" -> mutable.Seq(DIElement, DIArray)
>
>      ArrayBuffer DIComplex-root:
>          0 -> DIElement("a")
>          1 -> DIArray("a")
>
> Given the initial example if we backtrack(2): // purge two items
>      DIArray:
>          Empty
>
>      HashMap DIComplex-root:
>          "a" -> mutable.Seq(DIElement)  // Because DIArray had two items
>                                         // removed and is now empty
>
>      ArrayBuffer DIComplex-root:
>          0 -> DIElement("a")
>
> ===
>
> class DIComplex {
>   def nSlots = erd.nChildSlots

I think the concept of nSlots and nChildSlots should go away completely.
The children ArrayBuffer should just grow as elements are added and
shrink as they are backtracked. It might make sense to have a tunable
the determines the initial size of the children Array, but the Scala
defaults for initializing and growing a mutable array might already be
reasonable.

>   val children: mutable.ArrayBuffer[DINode](nSlots) // insertion order
>   lazy val fastLookup: HashMap[String, mutable.Seq[DINode]]

One thing that might make this change easier is to not worry about the
mutable.Seq, and just make this a HashMap[String, DINode]. The Seq is
really only necessary to support query style expressions, but that comes
with a whole other slew of concerns (e.g. fn:count, same name but
different types, etc.). There currently exist restrictions that do not
allow query style expressions, so this Seq should always be a size of 1
until remove those restrictions.

>
>   def addChild(node: DINode) {
>     children += node
>     fastLookup(node.name).append(node)
>   }
>
>   def getChild(name: String): immutable.Seq[DINode] {
>     fastLookup.getOrElse(name, mutable.Seq.empty) // fastLookup(name)
>   }
>
>   def backtrack(int n) {
>     for (i = 0; i < n; i++) {
>       val endIdx = children.size -1
>       val child = children.remove(endIdx) // children.pop()
>
>       fastLookup.get(child.name) match {  // fastLookup(child.name).pop()
>         case None => // do nothing
>     case Some(nodesSeq) if nodesSeq.size == 0 => // Shouldn't happen?
>         case Some(nodesSeq) => {
>       val lastEntry = nodesSeq.last
>           lastEntry match {
>         case arr: DIArray => {
>               // If the array has >= 2 items, remove last one and update
> nodesSeq
>               // with this array:
>               //  val lastIdx = nodesSeq.length - 1
>           //  arr.remove(arr.size - 1)
>               //  fastLookup.update(child.name, nodesSeq.update(lastIdx,
> arr))
>               //
>               // If the array has 1 item, remove the array from nodesSeq
>               //  fastLookup.update(child.name, nodesSeq.dropRight(1))
>         }
>             case _ => fastLookup.update(child.name, nodesSeq.dropRight(1))
>           }
>
>         }
>       }
>     }
>   }
>
> }
>
> class DIArray {
>     private val initialSize =
> parent.tunable.initialElementOccurrencesHint.toInt // set to 10
>     protected final val _contents = new ArrayBuffer[DIElement](initialSize)
>
>     def append(ie: InfosetElement): Unit = {
>         _contents.append(ie.asInstanceOf[DIElement])
>         ie.setArray(this)
>     }
>
>     //
>     // For backtracking purposes
>     //
>     def remove(n: Int): DIElement = {
>         _contents.remove(n)
>     }
> }
>
> ===
>
> To facillitate a breadth-first ordering scheme we shall have to

That this is only necessary for unordered sequences, where the insertion
order in the ArrayBuffer differs from the display order, correct? If
that's true, we definitely want to make sure to only do the below
sorting when an element contains an unordred sequence.

> construct a 'global' that lives on the Element Runtime Data (ERD).  It
> shall be called childElementDeclPosition and pertains to the static
> order of the element children of a complex type element.  When sorting
> DINodes, childElementDeclPosition can be obtained by way of reaching
> from the DINode to the coresponding ERD.  Any array shall be a single
> entry in the sequence that needs to be sorted.  All elements in this
> array will have the same childElementDeclPosition.
>
> The difficulty in implementing this is that an Element's
> dsom.ComplexType can have groups within the groups before finding actual
> element children.  Therefore in the Document Schema Object Model (DSOM)
> the model group object shall have a startingChildElementDeclPosition and
> an endingChildElementDeclPosition.
>
> 1. The startingChildElementPosition shall be passed to the constructor
> for the Term.
> 2. If the term is an element, the endingChildElementDeclPosition is
> equal to the startingChildElementDeclPosition.  3. If the term is a
> model group, then its first child term's
> startingChildElementDeclPosition is equal to its own
> startingChildElementDeclPosition, the 2nd to Nth child term's
> startingChildElementDeclPosition is the prior term's
> endingChildElementDeclPosition + 1.
> 4. The endingChildElementDeclPosition for a model-group shall be the
> endingChildElmentDeclPosition of its last child term.
>

Looks good to me. +1. This should definitely make implementation of
unordered sequences and query style expressions easier. Removal of the
slot concept should also help remove concepts that prevent faster schema
compilation.

- Steve