
Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@tresys.com> on 2018/04/06 19:12:31 UTC

Simplified DFDL layering/base64 proposal

On looking into implementation complexity I've come up with simplifications that don't reduce expressive power at all, but massively simplify implementation (and documentation, and testing...) burdens.


https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Data+Layering+for+base64+-+Simplified

Re: Simplified DFDL layering/base64 proposal

Posted by Mike Beckerle <mb...@tresys.com>.

I've updated the VCalendar example to fix the typo, and I've wrapped an xs:sequence carrying the dfdl:ref="tns:folded" around the ProdID element.

You are correct: to create a reusable type that includes folding, you have to use a complex type, since only complex types can have the xs:sequence needed to carry the layering properties.

This problem is an artifact of the non-uniformity of simple/complex types in XSD, and there are lots of places in DFDL like this where you need a complex type in order to describe the representation of what ultimately one thinks of as a simple value, so you end up with the "value element problem".

This needs a general fix outside the scope of this layering proposal, along the lines of allowing a simple type to carry a dfdl:hiddenGroupRef property so that the simple element can have a sequence or choice group containing children elements to hold the complex representation of that simple type.

Your second observation is, I think, also correct: after running the decoding layering algorithm, one might have more data than is needed to satisfy the parsing.

When parsing, this excess data would be ignored/skipped.

Unparsing is a bit trickier, as this data may need to be provided - e.g., as padding - even though it is not carrying any information. It may just be an algorithm requirement. We certainly anticipate that data will have to be byte-oriented, that is, no final partial byte can be represented. So at least filling the final byte out with bits from fillByte may be necessary, but for many algorithms the requirement may be that the data is padded/filled to a certain byte boundary/alignment. It would be the schema author's responsibility to make sure unparsing the data provides a representation to the layering unparser that satisfies these requirements.
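
To make the two padding requirements above concrete, here is a minimal Python sketch. It is only an illustration, not part of the proposal: the helper names fill_to_byte_boundary and pad_to_alignment are hypothetical, and the choice of which fillByte bits complete a partial byte (the bits at the same positions within the byte) is an assumption.

```python
def fill_to_byte_boundary(bits: str, fill_byte: int = 0x00) -> bytes:
    """Complete a final partial byte using bits taken from fill_byte.

    `bits` is a string of '0'/'1' characters. Assumption: the missing
    bit positions are filled from the same positions of fill_byte.
    """
    fill_bits = format(fill_byte, "08b")
    used = len(bits) % 8
    if used:
        bits += fill_bits[used:]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def pad_to_alignment(data: bytes, alignment: int, fill_byte: int = 0x00) -> bytes:
    """Append whole fill bytes until len(data) is a multiple of alignment."""
    shortfall = -len(data) % alignment
    return data + bytes([fill_byte]) * shortfall


# Three data bits followed by five fill bits make one whole byte.
print(fill_to_byte_boundary("101", 0xFF).hex())
# Three bytes padded out to a 4-byte boundary.
print(pad_to_alignment(b"abc", 4))
```

Either step would have to happen before the bytes are handed to the layer's encoder, which is why the responsibility falls on the schema rather than on the layering unparser.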

I will add something to this effect to the proposal page.

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Monday, April 9, 2018 10:18:29 AM
To: dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Simplified DFDL layering/base64 proposal

I like this simplified version a lot! Some questions:

1) In the VCalendar example, ProdID is an element with a dfdl:ref (typo
of dfdl:formatRef) to tns:folded, which contains dfdl:layer* properties.
But layering properties are only allowed on xs:sequence's. I assume this
was just an example from the old proposal that wasn't fixed up, and
should be something like this instead:

  <xs:sequence dfdl:ref="tns:folded">
    <xs:element name="ProdID" type="xs:string" dfdl:initiator="PRODID:"
                minOccurs="0"/>
  </xs:sequence>

Which raises a small issue with simple types: This layering transform
now applies to the initiator/terminator of the simple type. If you do
not want a layer to apply to those but only to the value, you'd need
to make it a complex type with a "Value" element. I'm not sure this is a
big deal, but layering on simple types might get a little messier in
some cases if the initiators/terminators shouldn't be transformed.

2) What happens with unused data in an overlying layer? For example:

Say we have something like

  <dfdl:defineFormat name="base64">
    <dfdl:format layerTransform="base64" layerLengthKind="explicit"
                 layerLength="8" ... />
  </dfdl:defineFormat>

  <xs:sequence>
    <xs:sequence dfdl:ref="base64">
      <xs:element name="foo" type="xs:string" dfdl:length="3" />
    </xs:sequence>
    <xs:element name="bar" type="xs:string" dfdl:length="3" />
  </xs:sequence>

Assume the data is this:

  Zm9vWA==bar

The first 8 characters are base64 encoded, and decode to "fooX". The foo
element would only consume three of those characters, so the last "X"
character would not be consumed by foo.

The length of the layer transform was 8 characters, so bar would start
parsing after that and consume the "bar" letters.
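
The walk-through above can be checked with a small Python sketch. It uses only the standard base64 module and plain string slicing to stand in for the layer and the two fixed-length elements; it does not model Daffodil itself, and the constant names are just illustrative.

```python
import base64

data = "Zm9vWA==bar"
LAYER_LENGTH = 8  # layerLength from the example format
FOO_LENGTH = 3    # dfdl:length of foo
BAR_LENGTH = 3    # dfdl:length of bar

# Carve off the layer's 8 characters and decode them.
layer_region, rest = data[:LAYER_LENGTH], data[LAYER_LENGTH:]
decoded = base64.b64decode(layer_region).decode("ascii")

# foo parses against the decoded data; bar parses after the layer.
foo = decoded[:FOO_LENGTH]
unconsumed = decoded[FOO_LENGTH:]
bar = rest[:BAR_LENGTH]

print(repr(decoded), repr(foo), repr(unconsumed), repr(bar))
```

The value left in unconsumed is the "X" that the question below is about.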

So what happens to the unconsumed "X" character? Is it just thrown away?
This seems consistent with how we treat a complex element with explicit
length where the children do not consume the full length. Or is this a
Runtime SDE? Related, on the unparse side, when we base64 encode "foo"
it is only 4 characters, but the layerLength is 8. Are pad characters
inserted to fill that out to 8? Do we need a layerPadCharacter and other
related pad properties?
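
For the unparse side, the arithmetic can be sketched in Python with the standard base64 module. This only illustrates the length mismatch and says nothing about how the proposal should resolve it; encoded_length is a hypothetical helper name.

```python
import base64

LAYER_LENGTH = 8  # layerLength from the example format


def encoded_length(payload: bytes) -> int:
    """Number of base64 characters (including '=' padding) for payload."""
    return len(base64.b64encode(payload))


# Encoding just "foo" gives 4 characters, short of the 8 the layer expects.
short = base64.b64encode(b"foo").decode("ascii")
print(short, len(short))

# Payload sizes (in bytes) whose encoding is exactly LAYER_LENGTH characters.
fits = [n for n in range(10) if encoded_length(b"x" * n) == LAYER_LENGTH]
print(fits)
```

With standard padded base64, only payloads of 4 to 6 bytes fill an 8-character layer, so something has to add at least one byte after "foo" before encoding.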

- Steve

On 04/06/2018 04:10 PM, Mike Beckerle wrote:
> Never mind that one. I've simplified it even further:
>
>
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Data+Layering+for+base64+-+Super+Simplified


Re: Simplified DFDL layering/base64 proposal

Posted by Mike Beckerle <mb...@tresys.com>.

Never mind that one. I've simplified it even further:

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Data+Layering+for+base64+-+Super+Simplified
