You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@daffodil.apache.org by Taylor Wise <tw...@tresys.com> on 2017/10/24 15:21:44 UTC

Fw: Unordered Sequences Thoughts

Forwarding discussion on this.


________________________________
From: Stephen Lawrence
Sent: Tuesday, October 24, 2017 11:20 AM
To: Taylor Wise
Subject: Re: Unordered Sequences Thoughts

On 10/24/2017 10:54 AM, Taylor Wise wrote:
> <sequence sequenceKind="unordered">
>   <a minOccurs="0" maxOccurs="unbounded" initiator="A:"/>
>   <b minOccurs="0" maxOccurs="unbounded" initiator="B:"/>
>   <c minOccurs="0" maxOccurs="unbounded" initiator="C:" outputValueCalc="{
> count( ../b) }"/>
> </sequence>
>
> A:1B:2A:3A:4B:5B:6C:check
> <a>1</a><b>2</b><a>3</a><a>4</a><b>5</b><b>6</b><c>3</c>
>
> A:1B:2C:checkB:3C:check
> <a>1</a><b>2</b><c>1</c><b>3</b><c>2</c>

Do you mean inputValueCalc instead of outputValueCalc? If so these both
look correct to me, and should work by sorting at the end. We should
also be able to do something like

  inputValueCalc="{ ../b[fn:count(../b) - 1] }"

or something to get the last b. Both of these will require query style
expression support (i.e. when the Seq() in the hash table contains more
than one element) though.

>
> I suppose that the above is legal.  Right?  So I guess there's no issue with
> expressions.  But if you were to say instead of doing an outputValueCalc, but do
> an Nth child lookup (however that is written), 'c' at each point would
> eventually be outdated and could lead to undesirable results.  And of course,
> like we mentioned in the meeting, people would be thinking of Nth child in the
> 'sorted' representation vs the insertion representation.

Agreed with this. If we supported Nth-child, things break down. BTW, to
do Nth child would be something like this:

/path/to/parent/*[position() = N]

DFDL does not support * in path steps or the position() function. I
suspect it will never support * since that breaks static type checking.

> Oooh.. question.
>
> <sequence sequenceKind="unordered">
>   <a minOccurs="0" maxOccurs="unbounded" initiator="A:"/>
>   <c minOccurs="0" maxOccurs="unbounded" initiator="C:" outputValueCalc="{
> count( ../b) }"/>
>   <b minOccurs="0" maxOccurs="unbounded" initiator="B:"/>
> </sequence>
>
> Is that now illegal?  I'm asking for the value of something that could
> potentially come after me or not at all.

Good question. I think that should be allowed, but I can't find anything
in the spec that says one way or the other. Note that outputValueCalc is
actually a bit weird, since on unparse the infoset will be sorted, so
everytime count(../b) is executed, it will always be the same.
inputValueCalc will be different though since that is executed before
things are parsed.



>
> --------------------------------------------------------------------------------
> *From:* Stephen Lawrence
> *Sent:* Tuesday, October 24, 2017 10:08:35 AM
> *To:* Taylor Wise
> *Subject:* Re: Unordered Sequences Thoughts
> Not sure I understand. Nth child isn't even part of the DFDL
> specification, seems perfectly reasonable to have an expression that
> works for all cases except for that. Maybe I'm misunderstanding?
>
> On 10/24/2017 10:03 AM, Taylor Wise wrote:
>> I think for the time being, it's probably best to just sort at the end.  Then at
>> least we have unordered sequences re-enabled and can make a decision at a later
>> date if we want to pursue the added benefit of finding the n'th item.  But, I
>> would argue that an unordered sequence isn't 'finished' until it has been
>> completed.  And so you shouldn't be able to run expressions on it until then.
>>
>> --------------------------------------------------------------------------------
>> *From:* Stephen Lawrence
>> *Sent:* Tuesday, October 24, 2017 8:18:29 AM
>> *To:* Taylor Wise
>> *Subject:* Unordered Sequences Thoughts
>> Taylor,
>>
>> I was thinking about the solutions that we brainstormed about unordered
>> sequences last week. I think the solutions were essentially to create a
>> new UnorderedSequence parser combinator and it can keep track of where
>> the infoset elements start and end and so it can handle sorting the
>> infoset elements, somehow.
>>
>> Two thoughts came up on how to do this:
>>
>> 1. Insert everything as normal. Upon completion of the new parser
>> combinator, it sorts all the elements that were added based on the
>> schema order. It might need to make a copy of the pre-sorted elements in
>> case they need to be restored do to backtracking (though, i'm not sure
>> this is necessary, I *think* once an unordered sequence is completed,
>> there's no way to backtrack to the middle of it. This *might* be wrong
>> though). The downside with this is that it doesn't work well with n-th
>> child, though this isn't really a requirement and changes a lot of
>> assumptions (for example, static type checking when compiling schemas
>> would be useless) so I'm not sure it really matters.
>>
>> 2. Insert new elements in the sorted position. This is essentially
>> "insertion sort". One drawback of this (and maybe a dealbreaker?) is
>> that we currently look at the end of the list to determine if we are
>> added to an existing array. So I *think* we need to maintain insertion
>> order so that we can append to the correct DIArray. There might be a way
>> around this, but I'm not sure.
>>
>> Point being, before you get to far into implementing unodered sequences,
>> make sure you think through this issue. I'm not immediately sure of a
>> workaround, and sorting at the end might be easier to implement, and
>> might actually be more efficient. Insertion sort isn't particularly
>> efficient, especially for large out of order lists. Sorting at the end
>> would allow for some other algorithm that might be better.
>>
>> Just some food for thought.
>>
>> Also, semi-related. I took a look at the final slot chnages. Looks
>> really good! Only thing to work on in the future is a commit message.
>> You breifly explain what changes were made, but the more important thing
>> to have in a commit message is "why" the changes were made. When someone
>> looks back on this code in a year, it's easy to figure out what was
>> changed, but the reasoning behind it may not be as clear, and that's
>> usually the question someone is asking when they are looking at really
>> old code.
>>
>> - Steve
>>
>


Re: Unordered Sequences Thoughts

Posted by Mike Beckerle <mb...@tresys.com>.
A few more comments inline in blue


________________________________
From: Taylor Wise <tw...@tresys.com>
Sent: Tuesday, October 24, 2017 11:21 AM
To: dev@daffodil.apache.org
Subject: Fw: Unordered Sequences Thoughts

Forwarding discussion on this.


________________________________
From: Stephen Lawrence
Sent: Tuesday, October 24, 2017 11:20 AM
To: Taylor Wise
Subject: Re: Unordered Sequences Thoughts

On 10/24/2017 10:54 AM, Taylor Wise wrote:
> <sequence sequenceKind="unordered">
>   <a minOccurs="0" maxOccurs="unbounded" initiator="A:"/>
>   <b minOccurs="0" maxOccurs="unbounded" initiator="B:"/>
>   <c minOccurs="0" maxOccurs="unbounded" initiator="C:" outputValueCalc="{
> count( ../b) }"/>
> </sequence>
>
> A:1B:2A:3A:4B:5B:6C:check
> <a>1</a><b>2</b><a>3</a><a>4</a><b>5</b><b>6</b><c>3</c>
>
> A:1B:2C:checkB:3C:check
> <a>1</a><b>2</b><c>1</c><b>3</b><c>2</c>

Do you mean inputValueCalc instead of outputValueCalc?

I think IVC doesn't make sense inside an unordered. Nor is it legal since element c has minOccurs 0, and IVC not allowed on arrays.

So let's imagine the min/max occurs on element c are gone. It's a scalar.

Unordered really does behave like an array of a choice of the elements. Choices when unparsing have some unresolved issues, e.g., if there is nothing in the infoset to represent an element of the choice, then how do you determine which element?  This is an open topic (for which we need a JIRA ticket with a write up of the issue - I did some work on this over last weekend, and have a bunch of notes to type up.)

This all looks too hairy to me, I think we should just say that the elements of an unordered sequence cannot have IVC nor OVC on them. I'd like to see use-cases for why these might be needed where this is the compelling way to model the data before undertaking this snarl.

If so these both
look correct to me, and should work by sorting at the end. We should
also be able to do something like

  inputValueCalc="{ ../b[fn:count(../b) - 1] }"

or something to get the last b. Both of these will require query style
expression support (i.e. when the Seq() in the hash table contains more
than one element) though.

>
> I suppose that the above is legal.  Right?  So I guess there's no issue with
> expressions.  But if you were to say instead of doing an outputValueCalc, but do
> an Nth child lookup (however that is written), 'c' at each point would
> eventually be outdated and could lead to undesirable results.  And of course,
> like we mentioned in the meeting, people would be thinking of Nth child in the
> 'sorted' representation vs the insertion representation.

Agreed with this. If we supported Nth-child, things break down. BTW, to
do Nth child would be something like this:

/path/to/parent/*[position() = N]

DFDL does not support * in path steps or the position() function. I
suspect it will never support * since that breaks static type checking.

There's a proposal to allow "*" in DFDL expressions in syntax like this.

Part of the unparsing+choices situation mentioned above.

Turns out to be needed to support computing things that are outside of a choice while reaching into things inside a choice. Comes up in Link16 - if you want to put the message label and sublabel outside the message so they can be examined by a dfdl:choiceDispatchKey expression, then when unparsing you have to compute those from the corresponding fields inside the choice branches. Each choice branch is a different element, but inside that element is a copy of the label and sublabel fields.

So you need a wildcard... ex:

<element name="msg">
  <complexType>
     <!-- might want label and sublabel in a hidden group -->
     <element name="label" dfdl:outputValueCalc="{ ../../msg/*/label }"/><!-- note wildcard * -->
     <element name="subLabel" dfdl:outputValueCalc="{ ../../msg/*/subLabel }"/><!-- note wildcard * -->

     <choice dfdl:choiceDispatchKey="{ fn:concat(label, '.', sublabel) }">
          <element name="J0.0" dfdl:choiceBranchKey="0.0" ...> <!-- REACH OVER WITH * wildcard -->
              <complexType>
                  <element name="label" dfdl:inputValueCalc="{../../label }"/>
                  <element name="subLabel" dfdl:inputValueCalc="{../../subLabel }"/>
                  ...
         <element name="J0.1" dfdl:choiceBranchKey="0.1" ...><!-- REACH OVER WITH * wildcard -->
               <complexType>
                  <element name="label" dfdl:inputValueCalc="{../../label }"/>
                  <element name="subLabel" dfdl:inputValueCalc="{../../subLabel }"/>
          ....

etc. Turns out you can still type-check the expression. It simply must be type-checked for every context, i.e.,
everything the "*" can match.  The type has to be the same also. E.g., in one branch you can't have the 'label' element be a string and in some other one a number... that would require paths to be compiled to include different conversions. but if they are
all the same simple type, then the path can be compiled.

Fortunately, since the new names-not-slots implementation just uses the element names, the compilation just is effectively a list of names and maybe a conversion function at the end, so is the same in all situations.

So this "*" seems necessary, but also supportable.

Though I'd be thrilled if there's a way to avoid this ... but I can't think of one.

BTW: the Euro Space Agency folks who created DFDL4S needed a wildcard * like this. They invented an interesting feature. Unlike the XPath wildcard "*" they instead made the * be part of a regex that can be a path step, so you could do a path step like

../(.*foo)/bar

where that path step would match elements ending in "foo" suffix.

I believe their usage can all be converted into regular xpath "*" by splitting off the "foo" suffix into a child element e.g.,
../*/foo/bar

This is a discussion topic in the DFDL Workgroup. Waiting for some feedback from the ESA folks on whether regular xpath * meets their needs.

> Oooh.. question.
>
> <sequence sequenceKind="unordered">
>   <a minOccurs="0" maxOccurs="unbounded" initiator="A:"/>
>   <c minOccurs="0" maxOccurs="unbounded" initiator="C:" outputValueCalc="{
> count( ../b) }"/>
>   <b minOccurs="0" maxOccurs="unbounded" initiator="B:"/>
> </sequence>
>
> Is that now illegal?  I'm asking for the value of something that could
> potentially come after me or not at all.

Good question. I think that should be allowed, but I can't find anything
in the spec that says one way or the other. Note that outputValueCalc is
actually a bit weird, since on unparse the infoset will be sorted, so
everytime count(../b) is executed, it will always be the same.
inputValueCalc will be different though since that is executed before
things are parsed.

Yikes. I would really rather just disallow IVC/OVC inside unordered groups.  Unless someone has a plausible use case that makes sense.

One of the things we owe back to the DFDL workgroup is an experience document on all these aspects of implementing IVC and OVC, where Daffodil is breaking new ground here. No prior format description language I've seen has anything like IVC/OVC, so there's learning involved here in what we need to do so as to avoid truly excessive complexity.

There's a start at this document on our NCSA wiki.
https://opensource.ncsa.illinois.edu/confluence/pages/viewpage.action?pageId=98861346

If we decide we must add a restriction (as I'm suggesting), we should add to this page. I've put in a placeholder there.
Experience with dfdl:inputValueCalc, dfdl:outputValueCalc ...<https://opensource.ncsa.illinois.edu/confluence/pages/viewpage.action?pageId=98861346>
opensource.ncsa.illinois.edu
Daffodil is the first DFDL implementation to provide the advanced features of. dfdl:inputValueCalc (IVC for short) dfdl:outputValueCalc (OVC for short)



Restricting IVC/OVC in contexts like element of an unordered sequence, and possibly some situations involving choices may be very reasonable.

>
> --------------------------------------------------------------------------------
> *From:* Stephen Lawrence
> *Sent:* Tuesday, October 24, 2017 10:08:35 AM
> *To:* Taylor Wise
> *Subject:* Re: Unordered Sequences Thoughts
> Not sure I understand. Nth child isn't even part of the DFDL
> specification, seems perfectly reasonable to have an expression that
> works for all cases except for that. Maybe I'm misunderstanding?
>
> On 10/24/2017 10:03 AM, Taylor Wise wrote:
>> I think for the time being, it's probably best to just sort at the end.  Then at
>> least we have unordered sequences re-enabled and can make a decision at a later
>> date if we want to pursue the added benefit of finding the n'th item.  But, I
>> would argue that an unordered sequence isn't 'finished' until it has been
>> completed.  And so you shouldn't be able to run expressions on it until then.
>>
>> --------------------------------------------------------------------------------
>> *From:* Stephen Lawrence
>> *Sent:* Tuesday, October 24, 2017 8:18:29 AM
>> *To:* Taylor Wise
>> *Subject:* Unordered Sequences Thoughts
>> Taylor,
>>
>> I was thinking about the solutions that we brainstormed about unordered
>> sequences last week. I think the solutions were essentially to create a
>> new UnorderedSequence parser combinator and it can keep track of where
>> the infoset elements start and end and so it can handle sorting the
>> infoset elements, somehow.
>>
>> Two thoughts came up on how to do this:
>>
>> 1. Insert everything as normal. Upon completion of the new parser
>> combinator, it sorts all the elements that were added based on the
>> schema order. It might need to make a copy of the pre-sorted elements in
>> case they need to be restored do to backtracking (though, i'm not sure
>> this is necessary, I *think* once an unordered sequence is completed,
>> there's no way to backtrack to the middle of it. This *might* be wrong
>> though). The downside with this is that it doesn't work well with n-th
>> child, though this isn't really a requirement and changes a lot of
>> assumptions (for example, static type checking when compiling schemas
>> would be useless) so I'm not sure it really matters.
>>
>> 2. Insert new elements in the sorted position. This is essentially
>> "insertion sort". One drawback of this (and maybe a dealbreaker?) is
>> that we currently look at the end of the list to determine if we are
>> added to an existing array. So I *think* we need to maintain insertion
>> order so that we can append to the correct DIArray. There might be a way
>> around this, but I'm not sure.
>>
>> Point being, before you get to far into implementing unodered sequences,
>> make sure you think through this issue. I'm not immediately sure of a
>> workaround, and sorting at the end might be easier to implement, and
>> might actually be more efficient. Insertion sort isn't particularly
>> efficient, especially for large out of order lists. Sorting at the end
>> would allow for some other algorithm that might be better.
>>
>> Just some food for thought.
>>
>> Also, semi-related. I took a look at the final slot chnages. Looks
>> really good! Only thing to work on in the future is a commit message.
>> You breifly explain what changes were made, but the more important thing
>> to have in a commit message is "why" the changes were made. When someone
>> looks back on this code in a year, it's easy to figure out what was
>> changed, but the reasoning behind it may not be as clear, and that's
>> usually the question someone is asking when they are looking at really
>> old code.
>>
>> - Steve
>>
>