Posted to dev@daffodil.apache.org by "Beckerle, Mike" <mb...@owlcyberdefense.com> on 2021/04/02 16:49:45 UTC

XML String in Binary Data Question

I've started running into binary data containing XML strings.

If Daffodil is unparsing a piece of XML like this:

<bodyString><ns:well formed="piece">of arbitrary xml</ns:well></bodyString>

Suppose the DFDL schema for bodyString is:

<element name="bodyString" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="{....}"/>

So the notion here is that the data contains a string, which is a well-formed piece of XML.
For example, the overall format may be binary data that just happens to contain this string of XML in it.

I suspect that the Daffodil unparser is just going to explode on this, because it will be fed element events for the string contents. That is, unparsing converts the incoming XML text to infoset events by first parsing it as XML, and that process is schema-unaware, so it has no notion that the XML parse should NOT treat the contents of the body string as XML elements.
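To illustrate the concern, here is a minimal sketch (in Python, not Daffodil code) of what a schema-unaware XML parser reports for such input. The ns: prefix is dropped so the snippet does not need a namespace declaration; that simplification is an assumption of this example.

```python
import io
import xml.etree.ElementTree as ET

# The infoset document from above, minus the ns: prefix (assumption made
# so that no namespace declaration is needed).
doc = b'<bodyString><well formed="piece">of arbitrary xml</well></bodyString>'

# A generic XML parser knows nothing about the DFDL schema, so it reports
# the payload as nested element events rather than as string content.
events = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end"))
]

for event, tag in events:
    print(event, tag)
```

The nested start/end events for "well" are what an unparser expecting a simple xs:string value would receive.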

Does it make sense for Daffodil's XML-text infoset importer (used by unparsing) to recognize this case, and convert the <ns:well formed="piece">of arbitrary xml</ns:well> into an escapified XML string like:

&lt;ns:well formed=&quot;piece&quot;&gt;of arbitrary xml&lt;/ns:well&gt;

and then unparse it as if that string had arrived as this XML event to the unparser/XML-text Infoset inputter:

<bodyString>&lt;ns:well formed=&quot;piece&quot;&gt;of arbitrary xml&lt;/ns:well&gt;</bodyString>
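A minimal sketch of that escaping step, using only the Python standard library (an illustration of the idea, not Daffodil's infoset inputter). It escapes the payload, wraps it in bodyString, and checks that an ordinary XML parser then hands it back as plain text:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

payload = '<ns:well formed="piece">of arbitrary xml</ns:well>'

# escape() handles &, < and > by default; the extra mapping also turns
# double quotes into &quot; to match the escaped form shown above.
escaped = escape(payload, {'"': "&quot;"})

# Once escaped, the payload is ordinary character data inside bodyString.
wrapped = "<bodyString>" + escaped + "</bodyString>"
element = ET.fromstring(wrapped)
recovered = element.text

print(escaped)
print(recovered == payload)
```

Because the payload is character data here, the undeclared ns: prefix is harmless; it only becomes a problem if the payload is parsed as markup.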

So would an option to have this behavior be a reasonable thing to add to Daffodil?

The corresponding parse feature would be to emit the string not as escapified XML, but just as a string of text of well-formed XML.

I guess the notion is that strings are escaped because their contents may not be well-formed XML, but in this case, since they ARE well-formed pieces of XML, we can emit unescaped XML where a string is required, and also consume the same when unparsing and convert it back into a string.

Thoughts?


Re: XML String in Binary Data Question

Posted by Steve Lawrence <sl...@apache.org>.
One potential issue with this is if the payload XML isn't well-formed
XML and we just drop it in the infoset unescaped, then parse will
succeed but the infoset will not be well formed. One could always
perform validation, either in Daffodil or out, but it does feel a bit
weird to have a successful parse with an invalid infoset.
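One way to avoid that outcome would be to check the payload before deciding how to emit it. The sketch below is a hypothetical illustration in Python, not an existing Daffodil option; the function names are made up for this example, and the check only accepts a payload that is a single well-formed element.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed_fragment(text: str) -> bool:
    """Return True if text parses as a single well-formed XML element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def render_string_value(text: str) -> str:
    """Emit the payload unescaped only when it is well-formed XML;
    otherwise fall back to ordinary escaping so the infoset stays valid."""
    return text if is_well_formed_fragment(text) else escape(text)


print(render_string_value("<foo>bar</foo>"))
print(render_string_value("<foo>bar"))
```

Note that a payload using an undeclared namespace prefix would fail this check with a namespace-aware parser, so it would be escaped.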

Another downside to this is that it might be hard to support this kind
of thing for other infoset inputters/outputters (e.g. SAX, JDOM,
W3CDOM). It feels unfortunate to have a feature that only works with a
specific infoset type, especially since I imagine more and more users
will transition to the new SAX inputter/outputter.

Though, I personally would prefer this just be handled outside of
Daffodil. I've only done a little research but it seems XSLT is capable
of doing this.
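As a rough illustration of handling this outside of Daffodil, here is the unparse-side transformation sketched in Python rather than XSLT (a swapped-in technique, chosen only to keep the example short and runnable). The function name and the "record" wrapper element are assumptions of this sketch. It folds the child elements of a string element back into text before the infoset is handed to an unparser.

```python
import xml.etree.ElementTree as ET


def collapse_to_string(infoset_xml: str, element_name: str) -> str:
    """Replace the child elements of every element_name element with
    their serialized text, so the element holds a plain string again."""
    root = ET.fromstring(infoset_xml)
    for elem in list(root.iter(element_name)):
        children = list(elem)
        if not children:
            continue  # already a plain string
        # ET.tostring includes each child's tail text, so mixed content
        # is preserved in the rebuilt string.
        payload = (elem.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in children
        )
        for child in children:
            elem.remove(child)
        elem.text = payload
    # Serializing escapes the markup characters in the text.
    return ET.tostring(root, encoding="unicode")


source = '<record><bodyString><well formed="piece">of arbitrary xml</well></bodyString></record>'
collapsed = collapse_to_string(source, "bodyString")
print(collapsed)
```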

- Steve

On 4/5/21 12:16 PM, Beckerle, Mike wrote:
> I will create the test case as you suggest, illustrating the whole situation and what Daffodil does today.
> 
> What I'm seeking is a way for the string <foo>bar</foo> to be rendered as exactly those characters, so that we *fool* a subsequent XML validator into treating the string contents as a tree of well-formed XML elements.
> An XML schema for the resulting data would not have type xs:string for the myString element, but a complex type containing a "foo" child element. XPaths like myString/foo would be meaningful in this data.
> 
> Arguably, DFDL should not do this; rather, a post-processor of the XML-rendered infoset should do this XML-specific transformation.
> 
> The analogous situation does also occur for JSON. (Though nobody has asked for this as yet.)
> 
> The string { "foo" : "bar" } as a string value of a JSON field named "myString" would require a bunch of escaping. E.g., perhaps (I don't know JSON so well) like
> 
> "myString" : "{ \"foo\" : \"bar\" }"
> 
> This will be interesting to test.


Re: XML String in Binary Data Question

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
I will create the test case as you suggest, illustrating the whole situation and what Daffodil does today.

What I'm seeking is a way for the string <foo>bar</foo> to be rendered as exactly those characters, so that we *fool* a subsequent XML validator into treating the string contents as a tree of well-formed XML elements.
An XML schema for the resulting data would not have type xs:string for the myString element, but a complex type containing a "foo" child element. XPaths like myString/foo would be meaningful in this data.

Arguably, DFDL should not do this; rather, a post-processor of the XML-rendered infoset should do this XML-specific transformation.
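A minimal sketch of such a post-processor, in Python and under stated assumptions: the function name and the "record" wrapper element are hypothetical, and the payload is assumed to be a single element with no undeclared namespace prefixes. It re-parses the escaped string value and attaches the result as a child element, after which a path like myString/foo resolves.

```python
import xml.etree.ElementTree as ET


def expand_embedded_xml(infoset_xml: str, element_name: str) -> str:
    """Replace the string value of every element_name element with the
    element tree obtained by parsing that string as XML."""
    root = ET.fromstring(infoset_xml)
    for elem in list(root.iter(element_name)):
        if not elem.text or len(elem) != 0:
            continue  # nothing to expand
        try:
            child = ET.fromstring(elem.text)
        except ET.ParseError:
            continue  # not well-formed XML: leave it as an escaped string
        elem.text = None
        elem.append(child)
    return ET.tostring(root, encoding="unicode")


infoset = "<record><myString>&lt;foo&gt;bar&lt;/foo&gt;</myString></record>"
expanded = expand_embedded_xml(infoset, "myString")
print(expanded)
print(ET.fromstring(expanded).find("myString/foo").text)
```

Strings that are not well-formed XML are left untouched, so the output stays well-formed either way.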

The analogous situation does also occur for JSON. (Though nobody has asked for this as yet.)

The string { "foo" : "bar" } as a string value of a JSON field named "myString" would require a bunch of escaping. E.g., perhaps (I don't know JSON so well) like

"myString" : "{ \"foo\" : \"bar\" }"
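A quick way to see which characters JSON actually requires to be escaped is to serialize the string and look at the result. This is a small check with Python's standard json module, not Daffodil's JSON infoset outputter:

```python
import json

# The JSON text that appears as a string value inside the binary data.
payload = '{ "foo" : "bar" }'

# Serializing it as a string value escapes the inner double quotes;
# the braces need no escaping.
encoded = json.dumps({"myString": payload})
print(encoded)

# Decoding recovers the original string unchanged.
decoded = json.loads(encoded)["myString"]
print(decoded == payload)
```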

This will be interesting to test.


________________________________
From: Interrante, John A (GE Research, US) <Jo...@ge.com>
Sent: Monday, April 5, 2021 7:36 AM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: RE: XML String in Binary Data Question

I was waiting for someone to offer an opinion but it seems it's up to me. First of all, please write an actual test case of binary data with a well-formed piece of XML data inside a string. Please round trip it through Daffodil so we can actually find out how well both the parser and the unparser handle this data. I find it hard to believe Daffodil doesn't already use some escaping or quoting mechanism to handle this kind of situation, where the infoset (represented as XML) contains an element whose body itself looks like well-formed XML elements.

Even if this situation causes the Daffodil unparser to explode, what's to stop you from telling Daffodil to represent the infoset as JSON rather than XML?  Surely the Daffodil unparser wouldn't have a problem unparsing the JSON representation with XML elements inside a string element?

I also would be curious to find out whether the infoset's JSON representation has a similar problem handling an actual test case of binary data with a well-formed piece of JSON data inside a string.

Once we know what really happens (and we can also run the same JSON/XML test cases through IBM's DFDL processor to get more data points), we can start to discuss the best way to handle this kind of situation automatically for both JSON and XML infoset representations.

John



RE: XML String in Binary Data Question

Posted by "Interrante, John A (GE Research, US)" <Jo...@ge.com>.
I was waiting for someone to offer an opinion but it seems it's up to me. First of all, please write an actual test case of binary data with a well-formed piece of XML data inside a string. Please round trip it through Daffodil so we can actually find out how well both the parser and the unparser handle this data. I find it hard to believe Daffodil doesn't already use some escaping or quoting mechanism to handle this kind of situation, where the infoset (represented as XML) contains an element whose body itself looks like well-formed XML elements.
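To make the shape of such a round-trip test concrete, here is a stand-in sketch in plain Python. It does not call Daffodil, and the binary layout (a 2-byte big-endian length followed by a UTF-8 string), the "record" element, and the function names are all assumptions invented for this example. The property a real Daffodil test would check is the same: parse, then unparse, and get the original bytes back.

```python
import struct
import xml.etree.ElementTree as ET


def parse_record(data: bytes) -> str:
    """Hypothetical parse: read a length-prefixed string and emit an XML
    infoset in which the string is escaped character data."""
    (length,) = struct.unpack_from(">H", data, 0)
    body = data[2:2 + length].decode("utf-8")
    root = ET.Element("record")
    ET.SubElement(root, "bodyString").text = body
    return ET.tostring(root, encoding="unicode")


def unparse_record(infoset_xml: str) -> bytes:
    """Hypothetical unparse: read the string back out of the infoset and
    write it with its length prefix."""
    root = ET.fromstring(infoset_xml)
    body = (root.findtext("bodyString") or "").encode("utf-8")
    return struct.pack(">H", len(body)) + body


xml_payload = "<foo>bar</foo>"
original = struct.pack(">H", len(xml_payload)) + xml_payload.encode("utf-8")

infoset = parse_record(original)
round_tripped = unparse_record(infoset)
print(infoset)
print(round_tripped == original)
```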

Even if this situation causes the Daffodil unparser to explode, what's to stop you from telling Daffodil to represent the infoset as JSON rather than XML?  Surely the Daffodil unparser wouldn't have a problem unparsing the JSON representation with XML elements inside a string element?

I also would be curious to find out whether the infoset's JSON representation has a similar problem handling an actual test case of binary data with a well-formed piece of JSON data inside a string.  

Once we know what really happens (and we can also run the same JSON/XML test cases through IBM's DFDL processor to get more data points), we can start to discuss the best way to handle this kind of situation automatically for both JSON and XML infoset representations.

John
