Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2018/01/02 16:30:10 UTC

Re: Please review & discuss - draft proposal for how to do base64, foldedLines, etc.

Comments inline.

> 
> This memo describes a proposed feature for expressing data stream pre/post processing operations.
> 
> Most of the discussion here will use parsing as context, but where the unparsing is not clearly symmetric, unparsing will also be described.
> 
> New DFDL schema annotations are shown in the "daf:" namespace so as to make clear which are DFDL standard and which are the new extensions.
> 
> 
> The core concept is a cluster of new properties.
> 
> * streamEncoding (literal string or DFDL expression)
> * streamLengthKind (can be explicit, delimited, pattern, endOfParent, prefixed) 
> * streamLength - used for streamLengthKind 'explicit'
> * streamLengthUnits (bits or bytes)
> * streamLengthPattern - used for streamLengthKind 'pattern'
> * streamTerminator - (literal string or DFDL expression) - used for streamLengthKind 'delimited' - not used nor allowed for other length kinds (TBD asymmetric with terminator on a non-delimited element)
> * streamEscapeSchemeRef - used for streamLengthKind 'delimited' to escape the streamTerminator when necessary.
> 
> Those properties are valid on the DFDL annotation elements dfdl:format, dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group.
> 
> An additional non-format property:
> 
> * streamTransform
> 
> This cannot appear on dfdl:format. Only on dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, dfdl:group.
> 
> Specifying the streamTransform property puts a stream transform into use. The streamTransform property is specifically not allowed on dfdl:format because it is not sensible to put a stream transformation into effect across a lexical scope. Stream transforms apply to the dynamic scope of the Term they are associated with.
> 
> (This might not work out. It may be of value to define streamTransform in a format, if that format is named, and only referenced from the term that defines the dynamic scope where that stream transform is to be used. If we allow streamTransform on dfdl:format annotations, there are just certain situations where we would want SDE errors to be detected, such as if streamTransform is in lexical scope over a file.)
>

What about making this similar to escape schemes with something like
dfdl:defineStreamTransform/dfdl:streamTransformRef? It seems kind of
similar in usage to escape schemes and maybe helps with the scoping issue?

> A data stream is conceptually a stream of bytes. It can be an input stream for parsing, an output stream for unparsing.
> Use of the term "stream" here is consistent with Java's use of stream, as in InputStream and OutputStream. These are sources and sinks of bytes. If one wants to decode characters from them, one must do so by specifying the encoding explicitly.
> 
> A stream transform is a layering to create one stream of bytes from another. An underlying stream is encapsulated by a transformation to create an overlying stream.
> 
> When parsing, reading from the overlying stream causes reading of data from the underlying stream, which data is then transformed and becomes the bytes of the overlying stream returned from the read.
> 
> The stream properties apply to the underlying stream data and indicate how to identify its bounds/length, and if a stream transform is textual, what encoding is used to interpret the underlying bytes.
> 
> Some transformations are naturally binary bytes to bytes. Data decompress/compress are the typical example here. When parsing, the overlying stream's bytes are the result of decompression of the underlying stream's bytes.
> 
> If a transform requires text, then a stream encoding must be defined. For example, base64 is a transform that creates bytes from text. Hence, a stream encoding is needed to convert the underlying stream of bytes into text, then the base64 decoding occurs on that text, which produces the bytes of the overlying stream.
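To make that two-step pipeline concrete, here's a rough Python sketch (the class name and shape are made up for illustration; this is not Daffodil's actual API):

```python
import base64
import io

class Base64DecodeLayer(io.BytesIO):
    """Hypothetical overlying stream: the underlying bytes are first
    decoded to text using the streamEncoding, then that base64 text is
    decoded to produce the bytes of the overlying stream."""
    def __init__(self, underlying, stream_encoding="us-ascii"):
        text = underlying.read().decode(stream_encoding)  # encoding step
        super().__init__(base64.b64decode(text))          # transform step

underlying = io.BytesIO(b"aGVsbG8=")       # base64 text, as raw bytes
overlying = Base64DecodeLayer(underlying)
print(overlying.read())                    # b'hello'
```

Reads from the overlying stream thus return bytes that never appear literally in the underlying data.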
> 
> We think of some transforms as text-to-text. Line folding/unfolding is one such. Lines of text that are too long are wrapped by inserting a line-ending and a space. As a DFDL stream transform this line folding transform requires an encoding. The underlying bytes are decoded into characters according to the encoding. Those characters are divided into lines, and the line unfolding (for parsing) is done to create longer lines of data, the resulting data is then encoded from characters back into bytes using the same encoding.
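A rough sketch of the unfold (parse) direction, assuming iCalendar-style folding where CRLF followed by a single space marks a fold (the function name is mine, not from the proposal):

```python
def unfold(underlying_bytes, encoding="utf-8"):
    # Decode the underlying bytes into characters...
    text = underlying_bytes.decode(encoding)
    # ...remove each fold marker (CRLF followed by one space)...
    text = text.replace("\r\n ", "")
    # ...and re-encode the longer lines back into bytes.
    return text.encode(encoding)

folded = b"DESCRIPTION:a line that was too lo\r\n ng and got folded\r\n"
print(unfold(folded))  # b'DESCRIPTION:a line that was too long and got folded\r\n'
```

The fold (unparse) direction would do the inverse: insert a CRLF plus space wherever a line exceeds the maximum length.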
> 
> (There may be opportunities to shortcut these transformations if the overlying stream is the data stream for an element with scannable text representation using the same character set encoding.)
> 
> DFDL can describe a mixture of character set decoding/encoding and binary value parsing/unparsing against the same underlying data representation; hence, the underlying data stream concept is always one of bytes.
> 
> (TBD: maybe it has to be bits? E.g., in mil-std-2045 headers, the VMF payload data can be compressed. I don't know that this payload data always begins on a byte boundary.)

A VMF message has two parts, an application header and user data. The
spec says, "The application header shall always be a multiple of 8 bits.
If an application header is not a multiple of 8 bits, it shall be zero
filled so that it becomes a multiple of 8 bits.". So the user data part
(the part that could be compressed) must always start on a byte
boundary. Similarly the user data field is also always filled to a byte
boundary. The supported compression algorithms are LZW and GZIP, which
both only work on bytes. So as far as MIL-STD-2045/VMF is concerned,
bits should not be necessary, which makes things much easier.

> Daffodil parsing begins with a default standard data input stream. Unparsing begins with a default standard data output stream.
> 
> When a DFDL schema wants to describe say, base64 decoding the DFDL annotations might look like this:
> 
> <element name="foo" daf:streamTransform="base64">
>   <complexType>
>     <sequence>
>       ....
>     </sequence>
>   </complexType>
> </element>
> 
> This annotation means: when parsing element foo, take whatever data stream is in effect, layer a base64 data stream on it, and use that until the end of element foo. The streamEncoding property would be taken from the lexically enclosing format. 
> 
> In this example, when element foo is being parsed, the current data input stream is augmented by being encapsulated in a base64 transformer. This transformer takes the data stream, decodes it to characters using the streamEncoding, then processes the resulting text converting base64 to binary data.
> 
> The APIs for defining the base64 or other transformers enable one to do these transformations in a streaming manner, on demand as data is pulled from the resulting data stream of bytes. Of course it is possible to just convert the entire data object, but we want to enable streaming behavior in case stream-encoded objects are large.
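To illustrate the streaming point, a sketch (not Daffodil's API): decoded bytes are produced only as they are pulled, assuming the base64 text contains no whitespace so any chunk size that is a multiple of 4 characters falls on a quantum boundary.

```python
import base64
import io

def streaming_b64_decode(underlying, chunk_chars=4096):
    """Decode base64 on demand, yielding bytes as they are pulled,
    without materializing the whole decoded object in memory."""
    assert chunk_chars % 4 == 0, "must align to the 4-char base64 quantum"
    while True:
        chunk = underlying.read(chunk_chars)
        if not chunk:
            break
        yield base64.b64decode(chunk)

src = io.BytesIO(base64.b64encode(b"x" * 10_000))
decoded = b"".join(streaming_b64_decode(src))
print(len(decoded))  # 10000
```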
> 
> We have just seen how the daf:streamEncoding property is used by element foo as part of the data stream transformation.
> 
> Let's consider how streamLength works.
> 
> There are two things we have to describe the length of now. One is the data that is to be transformed. The second is the length of the parsed element taken from the result of the transformation.
> 
> One may have a base64-encoded region with a streamLength of 1000 bytes; within that, once decoded, one will have only 750 or so bytes available. That data is limited by the 750-byte length of the decoded data. At the time parsing begins, neither of these numbers (1000 nor 750) may be known. 
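The 1000-to-750 ratio follows from base64 mapping every 3 payload bytes to 4 encoded characters; a quick check (assuming a single-byte streamEncoding, so characters and bytes coincide):

```python
import base64

decoded_len = 750
encoded = base64.b64encode(b"\x00" * decoded_len)
print(len(encoded))  # 1000 = 750 / 3 * 4
```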
> 
> <dfdl:defineFormat name="fooStreamFormat">
>   <dfdl:format streamEncoding="utf-16" streamLengthKind="explicit"/>
> </dfdl:defineFormat>
> 
> This data stream will decode utf-16 characters on the underlying data stream, then base64 decode that to get a stream of bytes.
> 
> <dfdl:defineFormat name="fooFormat">
>   <dfdl:format ref="tns:fooStreamFormat" encoding="utf-8" byteOrder="bigEndian"/>
> </dfdl:defineFormat>
> 
> Then the type 
> 
> <element name="len" type="xs:int".../>
> <element name="foo" dfdl:ref="tns:fooFormat" type="tns:fooType" dfdl:initiator="foo:"
> 	 daf:streamLength="{ ../len }" daf:streamTransform="base64"/>
> 
> Note how the property daf:streamLength is supplied where the expression is relevant, but the other properties controlling the stream processing are expressed reusably.
> 
> In this example, the dfdl:initiator for foo will be decoded as utf-8 characters from the byte stream produced by the base64 transform. However, that base64 data was itself decoded from a utf-16 decode of the underlying byte stream. 
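A worked sketch of that nesting in plain Python (the payload is made up; it just demonstrates the order of the decode steps):

```python
import base64

# Unparse direction (building the underlying bytes):
payload = "foo:hello".encode("utf-8")            # initiator + content, utf-8
b64_text = base64.b64encode(payload).decode("ascii")
underlying = b64_text.encode("utf-16")           # streamEncoding="utf-16"

# Parse direction (what the layered streams do):
text = underlying.decode("utf-16")               # 1. streamEncoding decode
overlying = base64.b64decode(text)               # 2. base64 transform
value = overlying.decode("utf-8")                # 3. element-level encoding
print(value)  # foo:hello
```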
> 
> For the unparse direction, this len element needs a dfdl:outputValueCalc. The calculation needs the length of the base64 encoded data.
> 
> This would be expressed as
> 
> <element name="len" type="xs:int" dfdl:outputValueCalc="{ daf:streamLength(../foo, 'bytes') }"/>
> 
> This function daf:streamLength is much like dfdl:valueLength and dfdl:contentLength, except that it accesses the
> underlying data stream representation. The units are 'bits', 'bytes' or 'characters'. If 'characters' is specified, then the value returned is the
> number of characters in the data stream's encoding of the data. In the example above, this would be the number of utf-16 characters
> in the underlying stream before base64 decoding takes place.

Is this last sentence correct? I would expect that it would output the
length of the base64 encoded data, and I would expect valueLength to
return the number of utf-16 characters before base64 encoding the data?

> ('characters' may not be needed.)
> 
> If the units are specified as 'bytes', then the length in bytes of the underlying data stream prior to transformation is provided.
> 
> ('bits' may or may not be needed, or if provided perhaps we get away with it just being like 'bytes' * 8 and require lengths to be multiple of a byte.)
> 
> Let's look at an example of two interacting data stream transforms.
> 
> <xs:sequence
>   daf:streamEncoding="utf-8" daf:streamTransform="foldedLines" daf:streamLengthKind="delimited">

streamLengthKind is delimited, but no streamTerminator is defined? How
does it know when to stop transforming folded lines? Does it look at
parent terminating markup? In which case, what is the purpose of
streamTerminator? Maybe this is just an omission? Or maybe
streamLengthKind should be endOfParent?

Related, how does parent terminating markup interact with a delimited
length stream. My assumption based on what you've said is that such
terminating markup is ignored and only applied after the transform? This
makes sense in the line folding case where we want to ignore terminating
markup until after the line folding is removed.

In that case, what if a stream is terminated by parent terminating
markup? Do you duplicate the delimiter in daf:streamTerminator? For
example, let's say we have an unbounded comma-separated array of base64
encoded utf-16 data. I would expect it to look like this:

<xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
  <xs:element name="utfString" type="xs:string" maxOccurs="unbounded"
    dfdl:encoding="utf-16" dfdl:occursCountKind="implicit"
    daf:streamTransform="base64"
    daf:streamEncoding="us-ascii"
    daf:streamLengthKind="delimited"
    daf:streamTerminator=","/>
</xs:sequence>

This raises some questions/issues:

1) separatorPosition is infix, so how does the base64 transform know
to stop for the last element that isn't followed by a comma if it
ignores parent terminating markup? Does streamTerminator need to be
modified to include all parent terminating markup? That seems doable but
difficult and might make reuse hard. So maybe my assumption was wrong
that parent terminating markup is ignored? However, I imagine there are
some cases where we do want to ignore parent terminating markup, like in
the line folding case. Maybe we need different delimited
daf:streamLengthKinds? One that ignores parent terminating markup and
one that doesn't?

2) Is it expected that the streamTerminator is not consumed by a
delimited streamTransform? And that it is the responsibility of the
surrounding data to consume it? That is inconsistent with
dfdl:terminator, but might make sense. So in the above example, the
transform will not consume the comma separators and stop short of it?
This makes sense to me, as otherwise both the utfString and the
surrounding sequence will want to consume the separator. But maybe the
parent terminator markup thing means the streamTerminator isn't set to a
comma?

>   ...
>   ... presumably everything here is textual, and utf-8 because foldedLines only applies sensibly to text.
>   ...
>   <xs:sequence daf:streamEncoding="us-ascii" daf:streamTransform="base64" daf:streamLengthKind="delimited" daf:streamTerminator="{ ../marker }">
>       ...
>       ... everything here is parsed against the bytes obtained from base64 decoding
>       ... which is itself decoding the output of the foldedLines transform
>       ... above. Base64 requires only us-ascii, which is a subset of utf-8.
>       ...
>   </xs:sequence>
> </xs:sequence>
> 
> Summary
> * allows stacking transforms one on top of another. So you can have base64 encoded compressed data as the payload representation of
> a child element within a larger element.
> * allows specifying properties of the underlying data stream separately from the properties of the logical data.
> * scopes the transforms over a term (model-group or element)
> * prevents inadvertent lexical scoping of a streamTransform from a lexically enclosing top level format annotation.
> 
> 
> Implementation Notes:
> 
> Introduction of a stream transform basically appears in the Term grammar as a combinator that surrounds the contained Term contents.
> 

I really like this proposal; it seems like the correct approach. Some
things that might be worth adding:

1) How are errors handled in a stream (e.g. trying to gunzip data that's
not a valid gzip stream)? Are they just Processing Errors, and do things
backtrack as usual using standard PoCs?
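For comparison, here's what the failure looks like at the raw-library level (Python's gzip module used purely as an analogy; how Daffodil should surface it is the open question):

```python
import gzip

try:
    gzip.decompress(b"definitely not gzip data")
except OSError as err:  # gzip.BadGzipFile is a subclass of OSError
    # Presumably a layered-stream implementation would map this kind of
    # exception onto a Processing Error so normal backtracking applies.
    print("decode failed:", err)
```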

2) What about options for a transform? For example, you might want to
specify a gzip stream to do something like --best or --fast to favor
compression size vs speed. Or what variation of base64 should be used.
Options might also be used to describe how errors should be handled
specific to a transform. For example, base64 can ignore garbage
characters when decoding, but that might want to be a processing error
in some cases.

I guess this could be a single option with space separated key/value
pairs, e.g.

  daf:streamTransformOptions="base64_ignore_garbage=yes
base64_variant=rfc1421"

That's very extensible, but might not be consistent with the rest of
DFDL. Maybe we need specific options for each stream transform, e.g.

  daf:streamTransformBase64IgnoreGarbage="yes"
  daf:streamTransformBase64Variant="rfc1421"
  ..
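For what it's worth, the ignore-garbage behavior I mean maps onto a strict/lenient switch like this (Python's base64 module shown only as an analogy; the daf: property names above are hypothetical):

```python
import base64
import binascii

data = b"aGVs bG8="   # the space is a non-alphabet "garbage" byte

# Lenient: non-alphabet characters are silently discarded before decoding.
lenient = base64.b64decode(data, validate=False)
print(lenient)  # b'hello'

# Strict: the same input is rejected, which would map to a processing error.
try:
    base64.b64decode(data, validate=True)
except binascii.Error:
    print("strict variant rejects garbage")
```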

- Steve

Re: Please review & discuss - draft proposal for how to do base64, foldedLines, etc.

Posted by Mike Beckerle <mb...@tresys.com>.
Received some excellent feedback on the proposal as presented on the Wiki from Steve Hanson of IBM.


The feedback was mostly very supportive of the proposal. He suggested this change:


He suggested that we avoid the term "streaming" and stick with "layering" in all the terminology as the behavior known as "streaming" already has strong connotations.


Throughout the long history of DFDL, the term layering was always used for these concepts where a transformation must be done to data before parsing (after unparsing).


We do use the term "data stream" or just "stream" as a direction-independent way of referring to the data being parsed (input stream) or unparsed (output stream), but "streaming" connotes processing in a manner consistent with an unbounded stream, using a small/finite memory footprint.


While layering can be done in a streaming manner or not, the point of layering is different, as it is about the algorithmic transformations, not the memory footprint nor length-boundedness of the data stream.


So I've updated the proposal to use the terms layer, layered, and layering, and to use the term stream minimally.


The updated proposal now lives on the Apache Daffodil Wiki here:


https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Data+Layer+Annotations+for+Base64+and+other+Layered+Transformations


-Mike Beckerle

Tresys


________________________________
From: Mike Beckerle
Sent: Friday, January 5, 2018 6:28:23 PM
To: dev@daffodil.apache.org
Subject: Re: Please review & discuss - draft proposal for how to do base64, foldedLines, etc.


Updated proposal attached.

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Thursday, January 4, 2018 9:28:51 AM
To: Mike Beckerle
Subject: Re: Please review & discuss - draft proposal for how to do base64, foldedLines, etc.

<-- snip -->

> 2) What about options for a transform? For example, you might want to
> specify a gzip stream to do something like --best or --fast to favor
> compression size vs speed. Or what variation of base64 should be used.
> Might also used to describe how errors should be handled specific to a
> transform. For example, base64 can ignore garbage characters when
> decoding, but that might want to be a processing error in some cases.
>
> I guess this could be a single option with space separated key/value
> pairs, e.g.
>
>    daf:streamTransformOptions="base64_ignore_garbage=yes
> base64_variant=rfc1421"
>
> That's very extensible, but might not be consistent with the rest of
> DFDL. Maybe we need specific options for each stream transform, e.g.
>
>    daf:streamTransformBase64IgnoreGarbage="yes"
>    daf:streamTransformBase64Variant="rfc1421"
>    ..
>
> MikeB: My suggestion would be to make these parameters part of the algorithm
> name for now. E.g.,
> daf:streamTransform="base64Best" or daf:streamTransform="base64_ignore_garbage".
>
> We're going to need a way to specify many of these stream transforms. Specifying
> gzip with options
> and naming it something new better not be very hard. So perhaps that is good
> enough for now.
>

My only (minor) concern with this is that if something had multiple
options, the combinations of names could expand quickly. But probably
not worth worrying about until that actually happens--it may not be an
issue in practice.

Everything else above sounds good.

