Posted to users@daffodil.apache.org by "Ballard, Tom - US" <to...@caci.com> on 2021/09/22 15:58:46 UTC

Parsing formats with embedded XML -- recursion and/or layering required?

All,

I have a complex data format I am trying to implement a DFDL schema for, but I don't believe it's possible without support for recursion decomposition and/or layering.  The format in question has a subset of messages which consist of a binary "header" followed by an XML payload.  The messages begin with a handful of binary metadata fields, followed by a binary length field, and then an XML payload (whose length is given by the length field).  In some cases there may be binary data after the XML payload as well.  I assume I can pull the XML payload in as an opaque string blob, but the problem is I also need to validate that XML against a schema.
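
As a rough illustration of the layout described above, here is a minimal Python sketch that splits such a message into its parts. This is not Daffodil or DFDL code, and the header layout (a 16-bit type, a 32-bit sequence number, and a 32-bit length, all big-endian) and the UTF-8 payload encoding are assumptions made up for the example; the real format's fields will differ.

```python
import struct
from typing import NamedTuple


class HybridMessage(NamedTuple):
    msg_type: int      # hypothetical metadata field
    sequence: int      # hypothetical metadata field
    xml_payload: str   # XML text whose size comes from the length field
    trailer: bytes     # any binary data that follows the XML payload


# Hypothetical header: big-endian uint16 type, uint32 sequence,
# uint32 payload length in bytes.
HEADER = struct.Struct(">HII")


def split_message(data: bytes) -> HybridMessage:
    """Split one message into header fields, XML payload, and trailer."""
    if len(data) < HEADER.size:
        raise ValueError("message shorter than header")
    msg_type, sequence, length = HEADER.unpack_from(data, 0)
    start = HEADER.size
    end = start + length
    if end > len(data):
        raise ValueError("length field exceeds available data")
    # Assumes the payload is UTF-8; it is still just an opaque string here.
    payload = data[start:end].decode("utf-8")
    return HybridMessage(msg_type, sequence, payload, data[end:])
```

At this point the payload is only a string, which is exactly the limitation raised in the question: nothing has checked it against an XML schema yet.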

I know recursion and layering are on the project wish list, but is there a way to accomplish full parsing and validation of "hybrid" messages like the ones I described without them?

V/R,
Tom Ballard



________________________________

This electronic message contains information from CACI International Inc or subsidiary companies, which may be company sensitive, proprietary, privileged or otherwise protected from disclosure. The information is intended to be used solely by the recipient(s) named above. If you are not an intended recipient, be aware that any review, disclosure, copying, distribution or use of this transmission or its contents is prohibited. If you have received this transmission in error, please notify the sender immediately.

Re: Parsing formats with embedded XML -- recursion and/or layering required?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
Tom,


Any data format that is popular gets encapsulated and carried around in other data formats. The whole data game has a long history of this. E.g., you just want to aggregate multiple different pieces of data about a particular event together in a common data structure, but you have a constraint on the format you must use for this aggregate.



So generically,

  *   You are using Daffodil to convert data from native (e.g., binary) into format X.
  *   Typically format X is textual, but not necessarily.
  *   The native data also contains data that is already in format X.
  *   Many use cases will want the result to be Format X, not Format X with embedded, escaped Format X pieces.

Hence merging the translated with the encapsulated pieces is a natural need.



Format X could be XML, JSON, EXI (binary XML), S-expressions, SISL, or other things.



Daffodil has a built-in validation module, but when Format X is XML, that module cannot use the DFDL schema to validate the embedded "Format X" content. That's a corner case for XML.  If this really became important, we could add a validation feature that lets validation use a different XML schema than the DFDL schema. This is already needed if you want the validation to cover things like key/unique constraints, which are not allowed to appear in a DFDL schema. The feature is also almost already there, because schematron validation can use a separate schematron .sch file for the validation rules. So making the regular Xerces XML validator able to take a different XML schema for the validation seems like a small thing.


So I think this is a good generic capability to add to Daffodil.


We just need a motivated contributor to create it 🙂 (always recruiting new developers!)


-mike beckerle

________________________________
From: Steve Lawrence <sl...@apache.org>
Sent: Wednesday, September 22, 2021 12:21 PM
To: users@daffodil.apache.org <us...@daffodil.apache.org>
Subject: Re: Parsing formats with embedded XML -- recursion and/or layering required?

We actually recently added a feature that was intended to solve just
this problem of including XML payloads in the resulting infoset as XML
rather than a string, though it requires a custom InfosetInputter and
InfosetOutputter that have not been written yet.

The proposal is here:

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Runtime+Properties

The idea is that your payload element is just a normal xs:string, and
you annotate it with a custom runtime property like
treatStringAsXML=true. Then you can write a custom InfosetOutputter that
uses this annotation and outputs the string as XML during parse, and a
custom InfosetInputter that converts that XML back to a string during
unparse.
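
To make the round trip concrete, here is a minimal Python sketch of the
transformation itself, applied to an infoset that has already been
written out as XML. It does not use Daffodil's InfosetInputter or
InfosetOutputter interfaces (those are JVM APIs), and the function names
and the idea of passing in a set of element names in place of the
treatStringAsXML annotation are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Set


def embed_xml_strings(infoset: ET.Element, payload_names: Set[str]) -> None:
    """Parse direction: replace a string payload with the XML it contains."""
    # Snapshot the elements first so the tree can be modified safely.
    for elem in list(infoset.iter()):
        if elem.tag in payload_names and elem.text and len(elem) == 0:
            child = ET.fromstring(elem.text)  # raises ParseError if malformed
            elem.text = None
            elem.append(child)


def extract_xml_strings(infoset: ET.Element, payload_names: Set[str]) -> None:
    """Unparse direction: serialize the embedded XML back into a string."""
    for elem in list(infoset.iter()):
        if elem.tag in payload_names and len(elem) == 1:
            child = elem[0]
            elem.remove(child)
            elem.text = ET.tostring(child, encoding="unicode")
```

Note that serializing and re-parsing XML is not guaranteed to reproduce
the original bytes (whitespace, attribute quoting, and namespace
prefixes can change), so a real implementation would need to decide how
much fidelity the unparse direction requires.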

The Example Implementation discusses this exact use case and gives an
idea of how one might implement the custom InfosetInputter/Outputter.
This example uses Scala XML Nodes for simplicity, but could be done with
the standard text inputter/outputters as well.

One thing to point out though is that to Daffodil and its internals,
this payload element is still a string. Daffodil has no knowledge about
what the InfosetInputter/Outputters are doing, so Daffodil cannot
reference the XML payload in DFDL expressions, or validate the XML
against a schema. For validation, you would need to pipe the resulting
infoset to some other tool with a modified schema that does not treat
this payload as a string.

Since this is the second time I've come across this requirement, it
might be worth considering if this will be a more common technique, and
if maybe we should add some built-in mechanism to DFDL, one that would
work with both DFDL expressions and validation...

- Steve


