Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@tresys.com> on 2017/11/17 02:40:00 UTC

Proposal for Implementing Base64, folded lines, quoted-printable, compress/decompress, etc.

This email is to start a discussion of features to enable DFDL to express more data formats - particularly those that use some form of encoding (not charset encoding, algorithmic encoding) of part or all of the data.

IETF data formats make extensive use of base64 encoding of binary data for inclusion in textual data.

In addition, the textual formats make use of line folding: a line longer than 72 characters is continued on the next line by beginning that next line with a space (or tab? not sure).

There are many other schemes where part of a data representation has to be algorithmically decoded before the DFDL parsing can process it.

A good example comes from the MIL-STD-2045 message header format. This header has flags that indicate whether the message contents are compressed, and with what compression algorithm. Parsing needs to choose among several algorithms based on values computed from the data. Unparsing similarly must determine which compression algorithm to use to compress the message contents.

Our plan in implementing this feature in Daffodil would be to gain experience with it and, at such time as we're satisfied with it, propose the feature for inclusion in a future revision of the DFDL standard.

Perhaps there is a better name, but for this email we'll use the property dfdl:transferEncoding. This term comes from MIME where data can be transported encoded in a content transfer encoding designed to protect binary data from corruption, etc.

What is proposed is:

dfdl:transferEncoding takes a whitespace separated list of transfer encoding names. The empty string means no transfer encoding will be used. An expression can be used to evaluate to the whitespace separated list, or to the empty string.

A transfer encoding name identifies a transfer encoding algorithm. This algorithm can be

* bytes to bytes - example compress
* bytes to text - TBD (needed?)
* text to bytes - example base64, AIS
* text to text - TBD (needed?)

The whitespace-separated list must consist of compatible transfer encoding algorithms. The first named algorithm is applied first. So, assuming these identifiers are valid, dfdl:transferEncoding="base64 zip" would mean the data is text, which will be converted from text to bytes by the base64 decoder, and then from bytes to bytes by the unzip decoder. The inverse happens when unparsing.
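
As a rough sketch of the ordering rule above, the following Python uses the standard base64 and zlib modules. Here zlib is only a stand-in for the "zip" identifier, and the CODECS table and both function names are hypothetical, not part of Daffodil or DFDL.

```python
import base64
import zlib

# Hypothetical registry: transfer encoding name -> (decoder, encoder).
# zlib is an assumed stand-in for whatever "zip" would eventually mean.
CODECS = {
    "base64": (base64.b64decode, base64.b64encode),
    "zip": (zlib.decompress, zlib.compress),
}

def parse_side_decode(data: bytes, transfer_encoding: str) -> bytes:
    """Apply the decoders left to right, as when parsing."""
    for name in transfer_encoding.split():
        data = CODECS[name][0](data)
    return data

def unparse_side_encode(data: bytes, transfer_encoding: str) -> bytes:
    """Apply the encoders right to left, the inverse used when unparsing."""
    for name in reversed(transfer_encoding.split()):
        data = CODECS[name][1](data)
    return data

payload = b"hello DFDL " * 10
wire = unparse_side_encode(payload, "base64 zip")  # compress first, then base64
assert parse_side_decode(wire, "base64 zip") == payload
assert parse_side_decode(payload, "") == payload   # empty string: no transfer encoding
```

Reversing the list for the unparse direction is what makes the two directions exact inverses of each other.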

When a DFDL element has a dfdl:transferEncoding, the length of that element is the length of the transfer-encoded representation of the data.

For example, an element of complex type can have a prefixed length indicating that it is 16457 bytes long. If its transfer encoding specifies zip compression, then these 16457 bytes would be unzipped and the result would be larger; for example, it could expand to 50873 bytes. The content of the complex type would then be parsed from these 50873 bytes.
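
To make the length rule concrete, here is a small Python sketch. The 4-byte big-endian prefix, the use of zlib for compression, and the function name are all assumptions made for illustration; the proposal does not fix any of them.

```python
import struct
import zlib

def read_prefixed_compressed(buf: bytes, offset: int = 0):
    """Read a 4-byte big-endian length prefix, then that many compressed bytes.

    Returns (decoded_content, offset_after_element). The prefix layout is an
    assumption made for this sketch.
    """
    (encoded_len,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    encoded = buf[start:start + encoded_len]
    decoded = zlib.decompress(encoded)
    return decoded, start + encoded_len

content = b"A" * 50873
encoded = zlib.compress(content)
wire = struct.pack(">I", len(encoded)) + encoded + b"TRAILER"

decoded, next_offset = read_prefixed_compressed(wire)
assert len(decoded) == 50873             # children are parsed from the decoded bytes
assert next_offset == 4 + len(encoded)   # the element length counts only encoded bytes
assert wire[next_offset:] == b"TRAILER"  # parsing resumes right after the encoded region
```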

The implementation of transfer encodings generally involves Daffodil's parser and unparser combinators.

Consider parsing first. The combinator would take action before and after parsing the content of the element. In the before action, the Daffodil DataInputStream would be encapsulated by another implementation of DataInputStream, one that implements the transfer encoding decoder algorithm, reading data from the underlying DataInputStream. Multiple transfer encodings would result in multiple such encapsulations layered one upon the other.

After the content is parsed, the after action of the combinator is to unencapsulate the DataInputStream, returning to the original DataInputStream, from which some data will have been consumed.

The position of the original DataInputStream must be precise and exactly the position after the last bit of the transfer-encoded data.
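
The before/after actions might look roughly like the Python sketch below. It is not Daffodil code: Base64Layer and close_layer are hypothetical names, the encoded length is assumed to be known up front, and the layer decodes eagerly rather than streaming, which a real implementation would probably avoid.

```python
import base64
import io

class Base64Layer:
    """Hypothetical decoding layer over a seekable binary stream.

    It reads base64 text of a known encoded length from the underlying
    stream and exposes the decoded bytes through read().
    """

    def __init__(self, underlying, encoded_len: int):
        self.underlying = underlying
        # Remember where the encoded region ends in the underlying stream.
        self.end = underlying.tell() + encoded_len
        encoded = underlying.read(encoded_len)
        self.decoded = io.BytesIO(base64.b64decode(encoded))

    def read(self, n: int = -1) -> bytes:
        return self.decoded.read(n)

    def close_layer(self) -> None:
        # Leave the underlying stream exactly after the encoded region,
        # regardless of how much decoded data the content parser consumed.
        self.underlying.seek(self.end)

encoded = base64.b64encode(b"inner content")
source = io.BytesIO(encoded + b"|rest")

layer = Base64Layer(source, encoded_len=len(encoded))  # "before" action
assert layer.read(5) == b"inner"                       # content parsing reads decoded data
layer.close_layer()                                    # "after" action
assert source.read() == b"|rest"
```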

Some formats will require nested elements such that an outer element having a transfer encoding specified can have a text dfdl:encoding property specifying the text charset used in the transfer-encoded representation. The inner nested element can then have a different dfdl:encoding property - which is used to interpret the decoded data as text. For example suppose you have a large text string in UTF-8. This can be compressed to get bytes, and those bytes base64 encoded into the US-ASCII charset. This would be expressed by something like

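<element name="outer" dfdl:encoding="us-ascii" dfdl:transferEncoding="base64 compress">
   <complexType>
     <sequence>
       <element name="inner" type="xs:string" dfdl:encoding="utf-8" dfdl:lengthKind="delimited"/>
   ....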

About extensibility

It was a goal for this set of transfer encodings to be readily extensible. This is because many formats have specific encodings particular to them. AIS has one, ASN.1 BER has one (so called "object" encoding), and there are a wide variety of compression algorithms.

However, it is probably best to build some of these transfer encoders/decoders first, and then consider what is necessary to specify one without access to Daffodil internal classes and data structures.

About MIME names for encodings.

TBD: identifiers like base64 mean different things in different contexts. In the XML world it is just an algorithm for creating a single long string of characters. (Much like how hexBinary means a single long string of hex digits).

But in IETF Internet Message Format, base64 means a particular syntax with lines of a specific length. An IMF base64 encoded binary has a block structure with human-tolerable line-lengths (max 72) and a specific introduction and termination to indicate the start/end.
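
The difference between the two flavors can be illustrated with Python's standard base64 module. This is only a sketch: the width of 72 follows the figure mentioned above and is treated as a parameter, the CRLF line break is an assumption, and the introduction/termination markers are left out.

```python
import base64

raw = bytes(range(90))

# "XML world" flavor: one long string of characters with no line structure.
single_line = base64.b64encode(raw).decode("ascii")

# Line-structured flavor: the same characters broken into bounded lines.
WIDTH = 72  # assumed parameter, taken from the figure quoted above
lines = [single_line[i:i + WIDTH] for i in range(0, len(single_line), WIDTH)]
wrapped = "\r\n".join(lines)

assert "\n" not in single_line
assert all(len(line) <= WIDTH for line in lines)
# With the default validate=False, b64decode discards characters outside the
# base64 alphabet, so the line breaks do not change the decoded bytes.
assert base64.b64decode(wrapped) == raw
assert base64.b64decode(single_line) == raw
```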

Perhaps use QNames so that ietf:base64 or mime:base64 can provide the distinctions using normal namespace qualification.

TBD: parameters to transfer encoding algorithms.

We may need some way to express these. Perhaps a URL-style thing like

dfdl:transferEncoding='compress?method=bz2'
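
A URL-style value like this could be split with Python's urllib.parse, as in the sketch below. The function name and the returned (name, params) shape are hypothetical.

```python
from urllib.parse import parse_qsl

def parse_transfer_encoding(value: str):
    """Split a whitespace-separated list into (name, params) pairs."""
    result = []
    for item in value.split():
        # Everything after the first "?" is treated as a query string.
        name, _, query = item.partition("?")
        result.append((name, dict(parse_qsl(query))))
    return result

assert parse_transfer_encoding("base64 compress?method=bz2") == [
    ("base64", {}),
    ("compress", {"method": "bz2"}),
]
assert parse_transfer_encoding("") == []
```

One thing to watch: DFDL character entities such as %NL; contain "%" and ";", which already have meaning in URL query syntax, so a real parameter syntax would likely need its own escaping rules.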

...mike beckerle

Re: Proposal for Implementing Base64, folded lines, quoted-printable, compress/decompress, etc.

Posted by Mike Beckerle <mb...@tresys.com>.

Let's consider an example where lengths interact.

Suppose I have a base64 encoded thing, delimited by the string "----920902aeb929----".

Then inside that encoded region, we have a text string, delimited by %NL;

Because we need two lengths here with different delimiters, we will need two nested elements. The outer one is of complex type and has dfdl:transferEncoding="base64", dfdl:lengthKind="delimited", and dfdl:terminator="----920902aeb929----". The inner element inside the complex type also has dfdl:lengthKind="delimited", but dfdl:terminator="%NL;".

I think nesting like this can fix any situations where you would otherwise need two length specifications for the same element.
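
A rough Python sketch of that nesting follows. It assumes %NL; is simplified to a single "\n" and that the inner text is UTF-8; the function name is hypothetical.

```python
import base64

OUTER_TERMINATOR = b"----920902aeb929----"

def parse_outer(data: bytes):
    """Return (inner_string, remaining_bytes) for the nested layout above."""
    # Outer element: delimited length, located in the encoded representation.
    encoded, sep, rest = data.partition(OUTER_TERMINATOR)
    if not sep:
        raise ValueError("outer terminator not found")
    decoded = base64.b64decode(encoded)
    # Inner element: delimited by a newline inside the decoded data.
    inner, sep, _ = decoded.partition(b"\n")
    if not sep:
        raise ValueError("inner terminator not found")
    return inner.decode("utf-8"), rest

wire = base64.b64encode(b"hello\nmore") + OUTER_TERMINATOR + b"next"
assert parse_outer(wire) == ("hello", b"next")
```

The outer scan is safe here because "-" is not in the standard base64 alphabet, so the terminator cannot occur inside the encoded region.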

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Friday, November 17, 2017 8:59:06 AM
To: dev@daffodil.apache.org; Mike Beckerle
Subject: Re: Proposal for Implementing Base64, folded lines, quoted-printable, compress/decompress, etc.

On 11/16/2017 09:40 PM, Mike Beckerle wrote:
> Some formats will require nested elements such that an outer element having a transfer encoding specified can have a text dfdl:encoding property specifying the text charset used in the transfer-encoded representation. The inner nested element can then have a different dfdl:encoding property - which is used to interpret the decoded data as text.  For example suppose you have a large text string in UTF-8. This can be compressed to get bytes, and those bytes base64 encoded into the US-ASCII charset. This would be expressed by something like
>
>
> <element name="outer" dfdl:encoding="us-ascii" dfdl:transferEncoding="base64 compress">
>
>    <complexType>
>
>      <sequence>
>
>        <element name="inner" type="xs:string" dfdl:encoding="utf-8" dfdl:lengthKind="delimited"/>
>
>    ....

Initial thoughts, I like it, but how do lengths work with this? Does an
element with dfdl:transferEncoding need to have an explicit length? I
imagine any kind of speculative parsing with something like delimited
length would be difficult, but it does seem useful. For example, I could
imagine a format like:

  DATA=<base 64 data terminated by newline>

This would look something like:

  <element name="Data" dfdl:transferEncoding="base64"
    dfdl:lengthKind="delimited"
    dfdl:initiator="DATA=" dfdl:terminator="%NL;">
    <complexType>
      <sequence>
         ....

So this is kind of interesting. Normally, the complex element would just
push the %NL; delimiter on the delimiter stack and then start parsing
the children. But that doesn't work here. Instead, the complex element
with transfer encoding must scan for its local delimiter, decode all the
data, and then continue parsing children with decoded data as normal.

But line-folding seems a little different to me. Say we had something
like this:

  <element name="Data"
    dfdl:transferEncoding="fold?width=72&chars=%NL;%SP;%SP;"
    dfdl:lengthKind="delimited"
    dfdl:initiator="DATA=" dfdl:terminator="%NL;">
    <complexType>
      <sequence>
         ....

So this allows data to look like this:

  DATA=This is long data that is wrapped
    with a new-line followed by two spaces--the
    new-line and spaces should be removed when
    in the infoset

So in this case, I don't think we want to just scan the data for the
%NL; terminator and then treat it all as a decode, since the %NL;
terminator is part of the chars. Or maybe it's just a special case, and
the fold transfer encoding knows the difference between %NL;%SP;%SP; and
just %NL;?
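
One way to picture the "special case" option is the Python sketch below, where the decoder itself tells a fold from a terminator. It assumes %NL; is a single "\n", ignores the width parameter (which would only matter when unparsing), and uses a hypothetical function name.

```python
FOLD = "\n  "  # %NL;%SP;%SP; with %NL; simplified to "\n"

def unfold(text: str):
    """Return (unfolded_value, rest), where the value ends at the first
    newline that is not followed by two spaces."""
    i = 0
    while True:
        nl = text.find("\n", i)
        if nl == -1:
            raise ValueError("terminator not found")
        if text.startswith(FOLD, nl):
            # This newline starts a fold sequence, so keep scanning after it.
            i = nl + len(FOLD)
            continue
        # A bare newline: this is the element's terminator.
        return text[:nl].replace(FOLD, ""), text[nl + 1:]

data = "This is long data \n  that is wrapped\nNEXT"
assert unfold(data) == ("This is long data that is wrapped", "NEXT")
```

This still leaves an ambiguity: if whatever follows the terminator happens to begin with two spaces, the sketch treats it as a fold.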

Another thought about lengths: what if an element references the
content/valueLength of the transferEncoding element? Does it return the
original length or the decoded length? I guess a child can't reference
the length of its parent, so only siblings or parents can reference a
contentLength? In which case they probably just return the original
non-decoded length? Is there ever a time something would need the
decoded length?

- Steve
