Posted to dev@daffodil.apache.org by Mike Beckerle <mb...@apache.org> on 2022/07/18 20:16:24 UTC

EXI capability for Daffodil - requirements and design

This email thread is for discussion of EXI capabilities for Daffodil.

The primary requirement is improved performance by avoiding the processing
and size overhead of creating textual XML (or JSON) infoset output when
parsing, and of reading it as input when unparsing.
Users want to process large binary data files (think 800 GBytes) using
Daffodil. Textual XML can blow up the size of binary data by a factor of
100, which is infeasible in both space and processing overhead for input
data files this large.

Even for small data messages the overhead of XML text can be excessive and
have a major performance impact.

Users of Daffodil need to be able to create applications that never realize
textual XML in processing pipelines that parse data, transform it using
XSLT, validate it using XSD validation and/or schematron validation, and
unparse back to original format. Keeping the data as EXI as it moves
between these kinds of processing should provide substantial performance
benefits.

Phased approach: I believe the requirements can be done in phases, e.g., I
would be fine with requiring a specific open-source compatible EXI
library in our CLI as a first version even though ultimately we want it to
be pluggable. Also for phasing, schema-unaware EXI is a fine stepping stone
to schema-aware EXI.

Theoretically, at least, there is no need for Daffodil to support EXI
directly, i.e., no changes to Daffodil. This EXI-enabling effort could, in
theory, just be the creation of a couple of example applications that use
Daffodil and an EXI library through their APIs.

In practice there may be changes to Daffodil needed because:

* Daffodil APIs may need change to make use of various EXI libraries
possible or smoother/easier.
* CLI may want to expose EXI capability for easy user experience with it.
* Daffodil's unparser SAX API has some overhead we may want to bypass. The
unparser is naturally a pull/StAX style of API. If EXI libraries can
accommodate this then that may be substantially better in performance. EXI
is all about performance after all.

Some requirements:

1) support for multiple open and closed source EXI implementations that are
not incorporated into Daffodil as dependencies
I know we have users who want to see tests with at least Agile Delta EXI
(closed source) and EXIficient.

2) support for schema-unaware EXI encoding

3) support for schema-aware EXI encoding. This may introduce new
requirements - e.g., unlike XML text or schema-unaware EXI, one may (I have
a lack of EXI knowledge/experience here) need the schema (or some
EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
here.)

4) TDML runner support? (Is there any requirement here? Unclear.)

5) CLI support to output schema-unaware EXI.

5.5) CLI support to output schema-aware EXI. (TBD: is this needed for CLI?
Applications can do this from API, do we really also need to offer it from
the CLI?)

6) Enable EXI compression feature (or not) - EXI is all about
performance by improving the data density and hence reducing the handling
overhead. We should do experiments measuring the on/off of options such as
compression (a DEFLATE-based compression feature built into EXI
encoders/decoders) which is optionally enabled. If this improves compression
with low overhead we would just turn it on. If the benefits are small we
would not bother with it, but...  if it reduces size substantially, but has
real measurable cost, then we probably need a switch for on/off. An
interesting point would be the use of compression with non-schema-aware EXI
vs. schema-aware EXI (with or without compression).

7) Unparser - API Pull support - Speculation here - do we need to create a
standard StAX API for Daffodil unparsing so that EXI software supporting
StAX (or any other kind of StAX software) can be used with Daffodil more
easily?

8) Rich examples of Daffodil using EXI: Examples (openDFDL, not part of
Daffodil) should show how to parse, transform (simple XSLT thing), and
unparse data using Daffodil with EXI as the intermediate form between the
parse and transform, and between the transform and unparse. This should be
shown in schema-unaware and schema-aware variants. An important part of
this example is illustrating any added complexities that schema-aware EXI
imposes. These are effectively EXI versions of the openDFDL helloWorld
example.
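
For the on/off experiment in requirement 6, a minimal Scala sketch of the
measurement might look like the following. It does not use any EXI library:
java.util.zip's DEFLATE is applied to a synthetic textual infoset purely as a
stand-in (EXI's own compression option is DEFLATE-based), and the names
CompressionExperiment, sampleInfoset, deflate, and timed are made up for
illustration. A real measurement would swap in the EXI encoder with its
compression option toggled.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import java.util.zip.{Deflater, DeflaterOutputStream}

object CompressionExperiment {
  // Synthetic, repetitive XML text standing in for a parser's infoset output.
  def sampleInfoset(records: Int): Array[Byte] = {
    val body = (1 to records)
      .map(i => s"<record><id>$i</id><status>OK</status></record>")
      .mkString
    s"<records>$body</records>".getBytes(StandardCharsets.UTF_8)
  }

  // DEFLATE the bytes in memory (stand-in for an encoder with compression on).
  def deflate(bytes: Array[Byte]): Array[Byte] = {
    val deflater = new Deflater(Deflater.DEFAULT_COMPRESSION)
    val buffer = new ByteArrayOutputStream()
    val out = new DeflaterOutputStream(buffer, deflater)
    try out.write(bytes)
    finally { out.close(); deflater.end() }
    buffer.toByteArray
  }

  // Returns the result of `body` together with the elapsed nanoseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  def main(args: Array[String]): Unit = {
    val raw = sampleInfoset(10000)
    val (compressed, nanos) = timed(deflate(raw))
    println(s"raw=${raw.length} bytes, deflated=${compressed.length} bytes, " +
      s"cost=${nanos / 1000} us")
  }
}
```

The decision rule from requirement 6 then reduces to comparing the size ratio
against the measured cost for representative data.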

Re: EXI capability for Daffodil - requirements and design

Posted by Mike Beckerle <mb...@apache.org>.

On Tue, Jul 19, 2022 at 11:06 AM Adams, Joshua <ja...@owlcyberdefense.com>
wrote:

> Here are my notes based on my work on supporting Exificient so far:
>
> > * Daffodil APIs may need change to make use of various EXI libraries
> > possible or smoother/easier.
>
> I think this is definitely true.  Right now in my current pull request for
> adding Exificient to the CLI tool when I want to parse with Exificient
> using SAX it looks like this:
>
>     val saxXmlRdr = processor.newXMLReaderInstance
>     saxXmlRdr.setContentHandler(saxContentHandler)
>     saxXmlRdr.setProperty(XMLUtils.DAFFODIL_SAX_URN_BLOBDIRECTORY, blobDir)
>     saxXmlRdr.setProperty(XMLUtils.DAFFODIL_SAX_URN_BLOBSUFFIX, blobSuffix)
>     saxXmlRdr.parse(data)
>
>  I feel that it could be smoothed out greatly if we could do something
> like:
>
>      processor.parseWithSAX(data, saxContentHandler, saxProperties)
>
>  Similar improvements could be made on the unparse side of things as well.
>

I'm not sure this will be fewer lines of code when you consider that the
"saxProperties" argument is going to require multiple lines of code to
assemble.
I mean whether it's setProperty, or adding a name-value pair to a map/list,
it's still one line per property.
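
To make the trade-off concrete, here is a rough sketch of such a convenience
wrapper. It is written against the standard org.xml.sax.XMLReader interface
and the JDK's own SAX parser, not against Daffodil's actual API; parseWithSAX
is the hypothetical name from the proposal above, and SaxConvenience,
ElementCounter, and demo are made up for illustration. The lexical-handler
property is a standard SAX2 property used here only so the map is not empty.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, ContentHandler, InputSource, XMLReader}
import org.xml.sax.ext.DefaultHandler2
import org.xml.sax.helpers.DefaultHandler

object SaxConvenience {
  // Hypothetical wrapper: apply every property, attach the handler, then parse.
  def parseWithSAX(reader: XMLReader, input: InputSource, handler: ContentHandler,
                   properties: Map[String, AnyRef]): Unit = {
    properties.foreach { case (name, value) => reader.setProperty(name, value) }
    reader.setContentHandler(handler)
    reader.parse(input)
  }

  // Trivial handler that counts start-element events.
  final class ElementCounter extends DefaultHandler {
    var count = 0
    override def startElement(uri: String, localName: String, qName: String,
                              atts: Attributes): Unit = count += 1
  }

  def demo(): Int = {
    val reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader
    val counter = new ElementCounter
    // Assembling the map is still one line per property.
    val properties = Map[String, AnyRef](
      "http://xml.org/sax/properties/lexical-handler" -> new DefaultHandler2
    )
    parseWithSAX(reader, new InputSource(new StringReader("<a><b/><c/></a>")),
      counter, properties)
    counter.count
  }
}
```

The call site shrinks to one line, but the property map has to be built
somewhere, so the total line count is about the same.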

The important initialization cost is whatever is done to specify the XML
Schema to the EXI saxContentHandler itself so that it can do
EXI-schema-aware encoding. Loading up that file may take seconds (or more
if it is not somehow pre-compiled).


> > * Daffodil's unparser SAX API has some overhead we may want to bypass. The
> > unparser is naturally a pull/StAX style of API. If EXI libraries can
> > accommodate this then that may be substantially better in performance. EXI
> > is all about performance after all.
>
> I did some testing with the current SAX-based approach for EXI in my pull
> request and there doesn't seem to be any difference in performance for
> parsing between EXI and regular XML infosets, but for unparsing, EXI (using
> SAX) is about 3 times slower than normal XML.
>

For parsing, I wouldn't expect to see benefit until you have fairly large
files and/or long-running flows.

For unparsing, yeah, the SAX unparse API has to invert from callbacks to
queuing up events for a StAX-like pull by another "thread". This uses
coroutines that in principle should only context switch every 100 events,
and that should be pretty low overhead, but somehow the overhead is still
too high.
Future JVM versions will have lighter-weight thread implementations, so that
may help here, but I am not holding my breath for this.
Something may really be wrong with our SAX unparse inversion-of-control
implementation that we need to find/fix because at granularity 100 it
doesn't seem like the context switch overhead should matter. But regardless
of that it is always going to add some overhead vs. the basic
pull-orientation. So pull/StAX is better.
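
For anyone not familiar with the inversion being described, here is a small
Scala sketch of the general technique. It is not Daffodil's implementation:
it uses a plain thread and a blocking queue instead of coroutines, omits
error propagation, and the names PushToPull, Event, and pullFrom are made up.
The push-style producer hands events over in batches so that a thread
hand-off happens once per batch rather than once per event.

```scala
import java.util.concurrent.ArrayBlockingQueue

object PushToPull {
  sealed trait Event
  final case class Start(name: String) extends Event
  final case class End(name: String) extends Event
  case object Done extends Event

  // Runs a push-style producer on its own thread and exposes its events as a
  // pull-style Iterator. Events cross threads in batches of `batchSize`.
  def pullFrom(batchSize: Int)(produce: (Event => Unit) => Unit): Iterator[Event] = {
    val handoff = new ArrayBlockingQueue[Vector[Event]](1)
    val producer = new Thread(new Runnable {
      def run(): Unit = {
        var batch = Vector.empty[Event]
        def flush(): Unit =
          if (batch.nonEmpty) { handoff.put(batch); batch = Vector.empty }
        // The callback plays the role of the SAX ContentHandler methods.
        produce { event =>
          batch = batch :+ event
          if (batch.size >= batchSize) flush()
        }
        batch = batch :+ Done // end-of-stream marker for the consumer
        flush()
      }
    })
    producer.setDaemon(true)
    producer.start()
    Iterator.continually(handoff.take()).flatMap(_.iterator).takeWhile(_ != Done)
  }
}
```

Even with batching, every batch costs a blocking hand-off between two
threads, which a native pull API avoids entirely.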


>
> Exificient has a StAX API as well so this is probably worth
> investigating.  Does daffodil already support StAX or would we need to
> implement some sort of XMLStreamReader/XMLStreamWriter?
>

I think we want to implement an API for Daffodil that allows StAX to be
directly plugged in, with no knowledge of what the source of the StAX
events is: any variety of XML or EXI parser.

Then we would use that API to implement EXI unparsing, and possibly to
clean up and remove duplicate effort in the implementation of XML unparsing.
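
As a sketch of what "no knowledge of the source" means in practice, the
consumer below depends only on the standard javax.xml.stream.XMLStreamReader
interface. The demo feeds it from the JDK's text XML parser; the assumption
(not verified here) is that an EXI library's StAX reader could be passed in
the same way. StaxPull, describe, and fromText are made-up names, and a real
unparser input would build infoset events rather than strings.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamReader}

object StaxPull {
  // Pulls events from any XMLStreamReader, whatever produced it.
  def describe(reader: XMLStreamReader): List[String] = {
    val out = List.newBuilder[String]
    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          out += s"start ${reader.getLocalName}"
        case XMLStreamConstants.END_ELEMENT =>
          out += s"end ${reader.getLocalName}"
        case XMLStreamConstants.CHARACTERS if !reader.isWhiteSpace =>
          out += s"text ${reader.getText}"
        case _ => () // ignore comments, processing instructions, etc.
      }
    }
    out.result()
  }

  // Text XML source used for the demo only.
  def fromText(xml: String): XMLStreamReader =
    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
}
```

The consumer drives the loop itself, so no callback-to-pull inversion is
needed on the unparse side.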


>
> > 1) support for multiple open and closed source EXI implementations that
> > are not incorporated into Daffodil as dependencies
> > I know we have users who want to see tests with at least Agile Delta EXI
> > (closed source) and EXIficient.
>
> I'm looking into Agile Delta and have requested an evaluation copy through
> their website, but I'm not sure that will give us access to their SDK.
> Should allow us to at least verify that we can unparse an EXI file encoded
> by Agile Delta with our current implementation using Exificient though.
>
> > 2) support for schema-unaware EXI encoding
>
> This is how it is currently implemented in my pull request
>
> > 3) support for schema-aware EXI encoding. This may introduce new
> > requirements - e.g., unlike XML text or schema-unaware EXI, one may (I have
> > a lack of EXI knowledge/experience here) need the schema (or some
> > EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
> > here.)
>
> This hopefully wouldn't be too difficult to implement in the CLI, the only
> thing I'm not sure about is how well the EXI libraries would handle
> resolving our schemas spread out across several files.


This issue, that Daffodil uses its own unique class-path-based resolver,
and that we depend on this behavior to compose large DFDL schemas from
smaller ones in layers, is quite problematic.

This EXI feature for Daffodil gives us an opportunity to see/address this
problem directly, ourselves, and come up with solutions rather than others
using Daffodil having to struggle with it.

Anyone who uses both Daffodil and an XML validator (or any other
schema-aware processing) in the same application has the problem that the
DFDL schema, treated as an XML Schema for validation, is being handled by a
different code base, not Daffodil, and that XML validator may not be able
to make the include/import statements work to construct the schema the same
way as Daffodil.

The standard Apache XML Commons Resolver is an XML Catalog resolver. It is
unclear if a big/elaborate XML Catalog can be created which enables an XML
Schema to be assembled without the Daffodil classpath-based resolver.

Certainly a supported API for Daffodil should allow one to get the Daffodil
resolver for use in other software that is also handling the DFDL schema
(as an XSD). But this will only work for Java-based software. For C-based
software a custom resolver, or an equivalent of the Daffodil resolver, may
need to be created. The workaround is quite awful (rename lots of schema
files, clobber include/import schema locations in them, etc.).
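
For the Java-based case, a rough sketch of how a shared resolver could be
handed to the JDK's XSD validator through the standard LSResourceResolver
hook. This is not Daffodil's resolver: the lookup function here is an
in-memory map so the example is self-contained, and SharedResolver,
StringInput, resolver, and compile are made-up names. A classpath-based
lookup would go where `lookup` is called.

```scala
import java.io.{InputStream, Reader, StringReader}
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.{Schema, SchemaFactory}
import org.w3c.dom.ls.{LSInput, LSResourceResolver}

object SharedResolver {
  // Minimal LSInput backed by a string; setters are no-ops.
  final class StringInput(publicId: String, systemId: String, text: String)
      extends LSInput {
    def getCharacterStream(): Reader = new StringReader(text)
    def setCharacterStream(value: Reader): Unit = ()
    def getByteStream(): InputStream = null
    def setByteStream(value: InputStream): Unit = ()
    def getStringData(): String = null
    def setStringData(value: String): Unit = ()
    def getSystemId(): String = systemId
    def setSystemId(value: String): Unit = ()
    def getPublicId(): String = publicId
    def setPublicId(value: String): Unit = ()
    def getBaseURI(): String = null
    def setBaseURI(value: String): Unit = ()
    def getEncoding(): String = null
    def setEncoding(value: String): Unit = ()
    def getCertifiedText(): Boolean = false
    def setCertifiedText(value: Boolean): Unit = ()
  }

  // Resolves include/import locations through `lookup`; returning null lets
  // the validator fall back to its default resolution.
  def resolver(lookup: String => Option[String]): LSResourceResolver =
    new LSResourceResolver {
      def resolveResource(resourceType: String, namespaceURI: String,
                          publicId: String, systemId: String,
                          baseURI: String): LSInput =
        Option(systemId).flatMap(lookup)
          .map(text => new StringInput(publicId, systemId, text)).orNull
    }

  // Compiles a schema whose includes/imports are resolved via `lookup`.
  def compile(mainSchema: String, lookup: String => Option[String]): Schema = {
    val factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    factory.setResourceResolver(resolver(lookup))
    factory.newSchema(new StreamSource(new StringReader(mainSchema)))
  }
}
```

Whether an EXI library's schema loader accepts a resolver in a comparable way
is something that would have to be checked per implementation.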


> Should be a solvable problem though.  I'm thinking it would only be
> supported when the --schema option is present on the CLI (i.e. use
> schema-unaware if using saved parsers).
>

I expect one or all EXI implementations to provide a compiler that consumes
an XML schema and creates some pre-digested version of it for use by the
EXI processor.
The EXI standard may in fact specify this. Have to investigate this.  But
if this compiler exists, then it is the compiler that has to deal with the
XML schemaLocation resolver.


>
> > 4) TDML runner support? (Is there any requirement here? Unclear.)
>
> Only thing that might be nice to have here is a way to compare EXI
> infosets, but I'm not sure if this is really necessary.  There isn't much
> value in inspecting an EXI infoset, so long as you can verify that it
> unparses correctly, matching the original file.


Unless there is need for more than just comparing binary EXI files, I think
there is no requirement for anything new here.


>
> > 5) CLI support to output schema-unaware EXI.
> >
> > 5.5) CLI support to output schema-aware EXI. (TBD: is this needed for CLI?
> > Applications can do this from API, do we really also need to offer it from
> > the CLI?)
>
> Touched on these earlier.  Should be doable, but might be limited to
> schema-unaware for saved-parsers
>
> > 6) Enable EXI compression feature (or not) - EXI is all about
> > performance by improving the data density and hence reducing the handling
> > overhead. We should do experiments measuring the on/off of options such as
> > compression (a DEFLATE-based compression feature built into EXI
> > encoders/decoders) which is optionally enabled. If this improves compression
> > with low overhead we would just turn it on. If the benefits are small we
> > would not bother with it, but...  if it reduces size substantially, but has
> > real measurable cost, then we probably need a switch for on/off. An
> > interesting point would be the use of compression with non-schema-aware EXI
> > vs. schema-aware EXI (with or without compression).
>
> This would simply be a matter of setting the appropriate flags to the
> EXIFactory before creating the EXI SaxContentHandler.  Could easily be
> added to my existing pull request.
>
> > 7) Unparser - API Pull support - Speculation here - do we need to create a
> > standard StAX API for Daffodil unparsing so that EXI software supporting
> > StAX (or any other kind of StAX software) can be used with Daffodil more
> > easily?
>
> As touched on earlier, Exificient does have a StAX API, which is worth
> investigating because of the performance overhead of SAX when unparsing.
>
> * Examples should show both file-of-data mode, and streaming mode (many
> messages on unbounded input)
>
> * CLI EXI feature must support both single-file parse/unparse, and
> streaming mode.
>
> Exificient does have code in their sample program for streaming, but I'm
> not sure how easily it would integrate into our CLI tool without looking
> into it further.  Not to mention the NUL separator issue.
>
> Josh

Re: EXI capability for Daffodil - requirements and design

Posted by "Adams, Joshua" <ja...@owlcyberdefense.com>.

Here are my notes based on my work on supporting Exificient so far:

> * Daffodil APIs may need change to make use of various EXI libraries
> possible or smoother/easier.

I think this is definitely true.  Right now in my current pull request for adding Exificient to the CLI tool when I want to parse with Exificient using SAX it looks like this:

    val saxXmlRdr = processor.newXMLReaderInstance
    saxXmlRdr.setContentHandler(saxContentHandler)
    saxXmlRdr.setProperty(XMLUtils.DAFFODIL_SAX_URN_BLOBDIRECTORY, blobDir)
    saxXmlRdr.setProperty(XMLUtils.DAFFODIL_SAX_URN_BLOBSUFFIX, blobSuffix)
    saxXmlRdr.parse(data)

 I feel that it could be smoothed out greatly if we could do something like:

     processor.parseWithSAX(data, saxContentHandler, saxProperties)

 Similar improvements could be made on the unparse side of things as well.

> * Daffodil's unparser SAX API has some overhead we may want to bypass. The
> unparser is naturally a pull/StAX style of API. If EXI libraries can
> accommodate this then that may be substantially better in performance. EXI
> is all about performance after all.

I did some testing with the current SAX-based approach for EXI in my pull request and there doesn't seem to be any difference in performance for parsing between EXI and regular XML infosets, but for unparsing, EXI (using SAX) is about 3 times slower than normal XML.

Exificient has a StAX API as well so this is probably worth investigating.  Does daffodil already support StAX or would we need to implement some sort of XMLStreamReader/XMLStreamWriter?

> 1) support for multiple open and closed source EXI implementations that
> are not incorporated into Daffodil as dependencies
> I know we have users who want to see tests with at least Agile Delta EXI
> (closed source) and EXIficient.

I'm looking into Agile Delta and have requested an evaluation copy through their website, but I'm not sure that will give us access to their SDK.  Should allow us to at least verify that we can unparse an EXI file encoded by Agile Delta with our current implementation using Exificient though.

> 2) support for schema-unaware EXI encoding

This is how it is currently implemented in my pull request

> 3) support for schema-aware EXI encoding. This may introduce new
> requirements - e.g., unlike XML text or schema-unaware EXI, one may (I have
> a lack of EXI knowledge/experience here) need the schema (or some
> EXI-compiled flavor thereof) in order to consume such EXI. (Bunch of TBD
> here.)

This hopefully wouldn't be too difficult to implement in the CLI, the only thing I'm not sure about is how well the EXI libraries would handle resolving our schemas spread out across several files.  Should be a solvable problem though.  I'm thinking it would only be supported when the --schema option is present on the CLI (i.e. use schema-unaware if using saved parsers).

> 4) TDML runner support? (Is there any requirement here? Unclear.)

Only thing that might be nice to have here is a way to compare EXI infosets, but I'm not sure if this is really necessary.  There isn't much value in inspecting an EXI infoset, so long as you can verify that it unparses correctly, matching the original file.

> 5) CLI support to output schema-unaware EXI.
>
> 5.5) CLI support to output schema-aware EXI. (TBD: is this needed for CLI?
> Applications can do this from API, do we really also need to offer it from
> the CLI?)

Touched on these earlier.  Should be doable, but might be limited to schema-unaware for saved-parsers

> 6) Enable EXI compression feature (or not) - EXI is all about
> performance by improving the data density and hence reducing the handling
> overhead. We should do experiments measuring the on/off of options such as
> compression (a DEFLATE-based compression feature built into EXI
> encoders/decoders) which is optionally enabled. If this improves compression
> with low overhead we would just turn it on. If the benefits are small we
> would not bother with it, but...  if it reduces size substantially, but has
> real measurable cost, then we probably need a switch for on/off. An
> interesting point would be the use of compression with non-schema-aware EXI
> vs. schema-aware EXI (with or without compression).

This would simply be a matter of setting the appropriate flags to the EXIFactory before creating the EXI SaxContentHandler.  Could easily be added to my existing pull request.

> 7) Unparser - API Pull support - Speculation here - do we need to create a
> standard StAX API for Daffodil unparsing so that EXI software supporting
> StAX (or any other kind of StAX software) can be used with Daffodil more
> easily?

As touched on earlier, Exificient does have a StAX API, which is worth investigating because of the performance overhead of SAX when unparsing.

* Examples should show both file-of-data mode, and streaming mode (many
messages on unbounded input)

* CLI EXI feature must support both single-file parse/unparse, and
streaming mode.

Exificient does have code in their sample program for streaming, but I'm not sure how easily it would integrate into our CLI tool without looking into it further.  Not to mention the NUL separator issue.

Josh

Re: EXI capability for Daffodil - requirements and design

Posted by Mike Beckerle <mb...@apache.org>.

Additional requirements:

* Examples should show both file-of-data mode, and streaming mode (many
messages on unbounded input)

* CLI EXI feature must support both single-file parse/unparse, and
streaming mode.
Note that this raises the question of how messages are separated on the
stream, if at all. I know our streaming mode now uses NUL bytes between XML
text outputs from the parser, and the streaming unparser expects these NUL
bytes. This NUL between messages might need to become configurable, for
equivalence of XML-text streams with corresponding EXI streams.
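
A small Scala sketch of the two framing options this implies. Neither is
taken from Daffodil's CLI; Framing, splitOnSeparator, writeFramed, and
readFramed are made-up names. The separator approach works for XML text,
which cannot contain NUL. The working assumption for EXI is that the encoded
bytes are arbitrary binary and may contain any byte value, so a length prefix
is shown as one possible alternative.

```scala
import java.io.{ByteArrayOutputStream, DataInputStream, DataOutputStream, EOFException, InputStream}

object Framing {
  // Splits a stream into messages on a configurable separator byte
  // (NUL by default). Only safe when messages never contain that byte.
  def splitOnSeparator(in: InputStream, separator: Byte = 0): Vector[Array[Byte]] = {
    val messages = Vector.newBuilder[Array[Byte]]
    val current = new ByteArrayOutputStream()
    var b = in.read()
    while (b != -1) {
      if (b.toByte == separator) { messages += current.toByteArray; current.reset() }
      else current.write(b)
      b = in.read()
    }
    if (current.size > 0) messages += current.toByteArray // unterminated tail
    messages.result()
  }

  // Length-prefixed framing: a 4-byte big-endian length, then the payload.
  def writeFramed(out: DataOutputStream, message: Array[Byte]): Unit = {
    out.writeInt(message.length)
    out.write(message)
  }

  // Reads one framed message; None at end of stream (or on a truncated frame).
  def readFramed(in: DataInputStream): Option[Array[Byte]] =
    try {
      val message = new Array[Byte](in.readInt())
      in.readFully(message)
      Some(message)
    } catch { case _: EOFException => None }
}
```

With length prefixes the XML-text and EXI streams could share one framing,
at the cost of no longer being plain concatenated text.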

