Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2020/12/17 14:07:33 UTC

Correct behavior when unparse does not end on a byte boundary

I was looking at DAFFODIL-1565 thinking it could be closed with all the
recent streaming additions. But as I thought about it more, we have a
clearly asymmetrical behavior with parse and unparse that relates to
what this bug is talking about, so now I'm not so sure.

Say we have an element that parses a single bit with no alignment, e.g.:

  <xs:element name="onebit" type="xs:int"
    dfdl:representation="binary"
    dfdl:lengthKind="explicit"
    dfdl:lengthUnits="bits"
    dfdl:length="1"
    dfdl:alignmentUnits="bits"
    dfdl:alignment="1" />

Now say we parse a file that has a single 0xFF byte as its contents,
using the --stream option in the CLI, e.g.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin

The result is eight infosets, each with <onebit>1</onebit>. This is
because with the --stream option, when one parse completes, the next
parse continues at the exact bit position where the previous one left
off.
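
To make the bit-position carry-over concrete, here is a minimal Scala
sketch. BitCursor and StreamParseSketch are hypothetical names, not
Daffodil classes, and the sketch assumes most-significant-bit-first bit
order. Each loop iteration stands in for one streaming parse of the
onebit element:

```scala
// Hypothetical stand-in for a streaming bit source; not Daffodil's API.
final class BitCursor(data: Array[Byte]) {
  private var bitPos = 0 // 0-based position of the next unread bit

  def hasMore: Boolean = bitPos < data.length * 8

  // Reads one bit, most significant bit first (an assumption), and
  // advances the cursor by exactly one bit.
  def readBit(): Int = {
    val byte = data(bitPos / 8) & 0xFF
    val bit = (byte >> (7 - bitPos % 8)) & 1
    bitPos += 1
    bit
  }
}

object StreamParseSketch {
  // Each iteration models one "parse" that resumes where the last ended.
  def parseAll(data: Array[Byte]): List[Int] = {
    val cursor = new BitCursor(data)
    val out = List.newBuilder[Int]
    while (cursor.hasMore) out += cursor.readBit()
    out.result()
  }
}
```

Under these assumptions, a single 0xFF byte produces eight values, one
per bit.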

Now say we pipe this result to a call to daffodil unparse, i.e.:

  $ daffodil parse --stream -s onebit.dfdl.xsd ff_byte.bin \
      | daffodil unparse --stream -s onebit.dfdl.xsd -o res.bin

In this case, res.bin unexpectedly contains the following hex:

  80 80 80 80 80 80 80 80

So it contains eight bytes where the first bit of each byte is 1. This
is because at the end of each unparse call, we flush out the fragment
byte if it exists (in this case, it does--the single 1 bit) and in order
to do that we must write out a whole byte.
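
The per-call flush can be modeled in a few lines of Scala. This is only
a sketch of the behavior described above, not the actual unparser code:
FlushPerUnparseSketch is a hypothetical name, and it assumes
most-significant-bit-first order with zero padding of the unused bits.

```scala
import java.io.ByteArrayOutputStream

object FlushPerUnparseSketch {
  // Models one unparse of a single 1-bit element followed by an
  // immediate flush: the bit lands in the most significant position and
  // the other seven bits are zero so that a whole byte can be written.
  def unparseOneBit(bit: Int, out: ByteArrayOutputStream): Unit =
    out.write((bit & 1) << 7)

  // One output byte per unparse call, no matter how few bits it held.
  def unparseAll(bits: Seq[Int]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    bits.foreach(unparseOneBit(_, out))
    out.toByteArray
  }
}
```

Feeding it eight 1 bits yields eight 0x80 bytes, which matches res.bin.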

So the round trip is not symmetrical--parse a single byte, unparse to 8
bytes. This implies we are doing something wrong.

I think the change we need is either 1) starting a new parse should
automatically align to a byte boundary, or 2) the end of unparse should
not write fragment bytes unless we know no more unparses will occur.

My first instinct is that option 2 is the correct behavior, but it has
API implications, which I think is at the heart of DAFFODIL-1565.

For example, we would now need a way to carry state between unparse
calls that keeps track of things like bitPosition, fragment byte,
fragment length, and bitOrder. We also need some way to tell whatever
stores this state that we are actually done and that fragment data
should be flushed to the underlying stream.

For symmetry with the parse API, the logical name for this state carrier
is OutputSourceDataOutputStream. The API would probably look something
like this:

  val os = new OutputStream(...)
  val osdos = new OutputSourceDataOutputStream(os)
  dp.unparse(infoset1, osdos) // leaves state in osdos
  dp.unparse(infoset2, osdos) // uses osdos state for initialization
  osdos.close() // flushes fragment bytes stored in osdos

So the OutputSourceDataOutputStream wraps the underlying
OutputStream/WritableByteChannel/whatever we unparse to, and also
stores fragment information. This way, if osdos is used in a subsequent
unparse() call, it can continue where the previous call left off.

The close() method tells the OutputSourceDataOutputStream to write the
fragment byte (if it exists) to the underlying stream. It also says that
this OSDOS cannot be used in any future calls to unparse().
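
A minimal sketch of such a state carrier, to show the intended
semantics. FragmentCarryingOutput and putBit are hypothetical names
made up for illustration; this is not the proposed Daffodil class. It
assumes most-significant-bit-first order and zero padding on close():

```scala
import java.io.OutputStream

// Hypothetical sketch of the proposed state carrier; the class and
// method names are illustrative and are not Daffodil's actual API.
final class FragmentCarryingOutput(underlying: OutputStream) {
  private var fragment = 0     // bits collected so far, left-aligned
  private var fragmentBits = 0 // number of valid bits in fragment (0-7)
  private var closed = false

  // Appends one bit; a byte is written only once all eight bits are
  // known, so the fragment survives across separate unparse calls.
  def putBit(bit: Int): Unit = {
    require(!closed, "cannot write after close()")
    fragment |= (bit & 1) << (7 - fragmentBits)
    fragmentBits += 1
    if (fragmentBits == 8) {
      underlying.write(fragment)
      fragment = 0
      fragmentBits = 0
    }
  }

  // Flushes a pending fragment, zero-padded (an assumption), and
  // rejects any further writes.
  def close(): Unit = if (!closed) {
    if (fragmentBits > 0) underlying.write(fragment)
    closed = true
    underlying.flush()
  }
}
```

With this shape, eight one-bit writes followed by close() produce a
single 0xFF byte, and only a stream closed mid-byte gets padded.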

Some last thoughts about this approach:

1) Say the OSDOS has a fragment byte and close() is called. We must
write a full byte because the underlying OutputStream can only accept
full bytes, so we need to pad this fragment byte to a full byte. What
value do we use for this padding? The obvious choice is probably the
dfdl:fillByte property, but the OSDOS isn't tied to a particular schema
with a particular fillByte. For example, you could do this:

  dp1.unparse(infoset1, osdos)
  dp2.unparse(infoset2, osdos)

If dp1 and dp2 have different fillByte values, which do we use, if
either? Do we just use the fill byte from the last data processor that
wrote to this stream (so the fillByte from dp2)? Or is this a special
case, and we just always pad with zeros?

2) This now affects alignment. I believe we optimize alignment with the
assumption that a parse/unparse always starts byte aligned. But a
parse/unparse can start at any bitPosition, depending on where the
previous parse/unparse ended. So we should change our alignment
optimizations to treat the root element alignment as unknown rather
than known to be at the beginning of data.
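
As a small worked example of why the starting position matters, here is
a sketch of the alignment arithmetic. AlignmentSketch and
alignmentFillBits are hypothetical names, and bit positions are assumed
to be 0-based:

```scala
object AlignmentSketch {
  // Number of bits that must be skipped so that bitPos0b (0-based)
  // becomes a multiple of alignmentInBits.
  def alignmentFillBits(bitPos0b: Long, alignmentInBits: Int): Long = {
    require(alignmentInBits > 0, "alignment must be positive")
    val rem = bitPos0b % alignmentInBits
    if (rem == 0) 0L else alignmentInBits - rem
  }
}
```

If the root always started at bit 0, a byte-aligned root would never
need fill and the check could be dropped at compile time. If the
previous unparse ended at bit 3, the same root needs 5 bits of fill, so
the check has to happen at runtime.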

Thoughts?

Re: Correct behavior when unparse does not end on a byte boundary

Posted by Steve Lawrence <st...@gmail.com>.

Regarding the fillByte to use, I think we actually don't track this
right now. If an unparse ends on a non-byte boundary, the remaining bits
are filled with zeros. It seems like the fillByte doesn't affect this.
So this is definitely a bug.

Though, thinking some more, it's not clear to me what element the
fillByte property should be taken from. For example, say we have this:

  <xs:element name="root" dfdl:fillByte="%#xAA;">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="oneBit" dfdl:fillByte="%#xBB;" ... />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

If we unparsed the infoset:

  <root>
    <oneBit>1</oneBit>
  </root>

And we're not streaming or following this with anything, what's the
result when we close() the OSDOS? Do we fill with the 0xAA from the
root, or 0xBB from oneBit? I would assume you use the fillByte of the
root element since this region is essentially after the "root" element?
I'm not sure if the spec is clear on how this region is unparsed. The
regions that the spec says fillByte fills (RightFill, ElementUnused,
ChoiceUnused, LeadingSkip, AlignmentFill, and TrailingSkip) don't seem
to apply to this region.

Also, if we compose this with an element with byte alignment, e.g.:

  <xs:sequence>
    <xs:element ref="root" />
    <xs:element name="byteAligned" dfdl:fillByte="%#xCC;" ... />
  </xs:sequence>

Then those same bits that were either AA or BB are now definitely filled
with the CC fillByte from the byteAligned element. So this region feels
like it's the AlignmentFill region of whatever follows root. But without
this composition, nothing follows root. Still seems like root fillByte
is the most logical choice.

Maybe the solution is that the fillByte value from the root is stored
in the OSDOS each time unparse() is called, but that value is only used
when the OSDOS is close()'ed? If another unparse() is called rather
than close(), then we just overwrite the previous root fillByte?
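
Roughly, that idea could look like the following sketch. All names here
(FillOnCloseSketch, padFragment, FillTrackingState) are hypothetical.
How the fragment bits and the fill byte are merged is also an
assumption: the data bits are kept in the most significant positions
and the remaining positions take the corresponding bits of the fill
byte.

```scala
object FillOnCloseSketch {
  // Pads a left-aligned fragment holding fragmentBits valid bits with
  // the corresponding low-order bits of fillByte (assumed merge rule).
  def padFragment(fragment: Int, fragmentBits: Int, fillByte: Int): Int = {
    require(fragmentBits >= 0 && fragmentBits < 8)
    val keepMask = (0xFF << (8 - fragmentBits)) & 0xFF // bits owned by data
    (fragment & keepMask) | (fillByte & ~keepMask & 0xFF)
  }
}

// Remembers only the fill byte of the most recent unparse call.
final class FillTrackingState {
  private var lastRootFillByte = 0

  // Called at the start of each unparse; overwrites the previous value.
  def beginUnparse(rootFillByte: Int): Unit = {
    lastRootFillByte = rootFillByte & 0xFF
  }

  // The value that close() would use to pad a pending fragment.
  def fillForClose: Int = lastRootFillByte
}
```

Under this rule, a single 1 bit padded with a zero fill byte gives 0x80,
which is what we write today.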


On 12/17/20 10:05 AM, Beckerle, Mike wrote:
> What you call option 2 is definitely the right behavior, so this is a significant bug in the unparser streaming API.
> 
> There should never be automatic alignment to a byte boundary except when the output stream is closed.
> 
> The fillByte to use for a close is certainly the last fillByte of the last unparse call's schema. We should just capture this in the osdos (we probably already do, because of buffering output). Note that if the osdos is positioned somewhere in the middle of a byte, and you begin another unparse, which begins with alignment to a byte boundary, the fill byte used in that case is the NEW fill byte of the new unparse call.
> 
> You are correct that the alignment assumed at the start is only the starting bit position in the OSDOS, not zero (perfect alignment).
> 
> This is also true for parse. I.e., I think optimizations are wrong there as well, because the root element doesn't necessarily begin at bit 1.
> 
> So I think this is a second bug, in the parser. We've gotten away with this because most formats are byte-centric, I guess.
> 
> But since DFDL/Daffodil is supposed to be the tool that frees people from this byte-centric stuff - I claim these bugs are critical priority.
> 
> We should create a group of API unit tests that match your example of just a single-bit message being parsed and unparsed as a stream, with closes at various points and two different schemas with different fill bytes.
> 
> 
> -mikeb

Re: Correct behavior when unparse does not end on a byte boundary

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.

What you call option 2 is definitely the right behavior, so this is a significant bug in the unparser streaming API.

There should never be automatic alignment to a byte boundary except when the output stream is closed.

The fillByte to use for a close is certainly the last fillByte of the last unparse call's schema. We should just capture this in the osdos (we probably already do, because of buffering output). Note that if the osdos is positioned somewhere in the middle of a byte, and you begin another unparse, which begins with alignment to a byte boundary, the fill byte used in that case is the NEW fill byte of the new unparse call.

You are correct that the alignment assumed at the start is only the starting bit position in the OSDOS, not zero (perfect alignment).

This is also true for parse. I.e., I think optimizations are wrong there as well, because the root element doesn't necessarily begin at bit 1.

So I think this is a second bug, in the parser. We've gotten away with this because most formats are byte-centric, I guess.

But since DFDL/Daffodil is supposed to be the tool that frees people from this byte-centric stuff - I claim these bugs are critical priority.

We should create a group of API unit tests that match your example of just a single-bit message being parsed and unparsed as a stream, with closes at various points and two different schemas with different fill bytes.


-mikeb

