Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/07/15 19:07:41 UTC

I will never use hidden groups again, here's why

Hello DFDL community,

Below I present an argument against the use of hidden groups. I welcome your counterargument.

[Inline images image011.png through image015.png, which carried the argument, are not included in this plain text version.]

Re: I will never use hidden groups again, here's why

Posted by "Sloane, Brandon" <bs...@tresys.com>.

A counter-argument to my counter-argument: if you are doing this with the typeCalc feature, you would need to go out of your way to include the raw value in the infoset in the first place. As such, instead of putting the raw value in a hidden group, you should just not include it in the infoset.
________________________________
From: Sloane, Brandon <bs...@tresys.com>
Sent: Monday, July 15, 2019 4:45:24 PM
To: users@daffodil.apache.org
Subject: Re: I will never use hidden groups again, here's why

The new (experimental) typeCalc feature (https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Feature+to+support+enumerations+and+typeValueCalc) should make this type of schema less painful to write (although your specific example would still be pretty painful).

At a higher level, there are a couple of good reasons to make the raw value hidden:

1) Not every hidden field is this complicated. For instance, consider a format with a field-presence indicator (FPI). When the indicator is 1, the controlled field is present; otherwise that field is not represented in the bitstream.

One way to write this schema is:

<xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="tns:FPI_True" />
    <xs:element ... />
  </xs:sequence>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="tns:FPI_False" />
  </xs:sequence>
</xs:choice>

That is, we parse either FPI=1 followed by the element, or FPI=0. In such a case, the OVC (dfdl:outputValueCalc) is trivial: 1 for FPI_True and 0 for FPI_False.
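
For illustration, the FPI_True group referenced above might be defined along these lines. This is only a sketch: the element name FPI, its one-bit length, and the reliance on a default binary format defined elsewhere in the schema are assumptions, not details of any real format.

  <xs:group name="FPI_True">
    <xs:sequence>
      <!-- Hidden indicator: must be 1 on parse, always written as 1 on unparse -->
      <xs:element name="FPI" type="xs:unsignedInt"
                  dfdl:lengthKind="explicit" dfdl:length="1" dfdl:lengthUnits="bits"
                  dfdl:outputValueCalc="{ 1 }">
        <xs:annotation>
          <xs:appinfo source="http://www.ogf.org/dfdl/">
            <!-- A failed assert makes the choice fall through to the next branch -->
            <dfdl:assert test="{ . eq 1 }" />
          </xs:appinfo>
        </xs:annotation>
      </xs:element>
    </xs:sequence>
  </xs:group>

Under the same assumptions, FPI_False would be identical except that both the assert and the outputValueCalc use 0 instead of 1.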

1.5)

A similar use case is length fields. Again, such an OVC is simple, for example { dfdl:contentLength(../data, 'bytes') }. However, in this case it is actually highly non-trivial for the user of the schema to compute the value themselves, as it requires first unparsing the referenced content to determine how much space it requires.

This would be even more difficult for a checksum/hash type field, although neither DFDL nor Daffodil currently offers any functions for computing checksums or hashes.
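
A hidden length field of this kind could be sketched as follows. The names len, data, and HiddenLength are hypothetical. The sketch also uses dfdl:valueLength rather than dfdl:contentLength, on the assumption that the data element's own explicit length refers back to the length field, so an expression based on the content length would be circular.

  <xs:group name="HiddenLength">
    <xs:sequence>
      <!-- On unparse, computed from the data; never supplied by the user -->
      <xs:element name="len" type="xs:unsignedInt"
                  dfdl:outputValueCalc="{ dfdl:valueLength(../data, 'bytes') }" />
    </xs:sequence>
  </xs:group>

  <!-- Content model of the enclosing element (assumed) -->
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="tns:HiddenLength" />
    <xs:element name="data" type="xs:string"
                dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes"
                dfdl:length="{ ../len }" />
  </xs:sequence>

With this arrangement the infoset contains only data, and the length is derived from it at unparse time.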

2) Someone needs to compute it.

If the user is planning on transforming the infoset, then not including an OVC will force the user to figure out the translation and compute it themselves. In some cases this is actually easier (e.g. when the difficulty in writing the OVC is a limitation of the DFDL expression language), but in many cases it is just as difficult. Forcing the user to compute it removes much of the benefit of DFDL, as you are still requiring the user to be aware of the details of the encoding.

3) Presenting the same information twice is error-prone.

If you expose the same value as both a raw field and a symbolic value, then a user may attempt to transform the infoset by modifying only one of the two values. When you unparse the data, you must ignore one of the two values.

An approach that I have had success with is to not use hidden groups at all, but still put an OVC on the raw element. The main benefits here are:

1) Easier to debug the schema

2) Easier to integrate with legacy systems that are designed to consume the raw value.
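
As a rough sketch of that approach, a visible raw element with an OVC might sit next to a symbolic element computed from it. The element names raw and status and the 0/1 to 'off'/'on' mapping are hypothetical.

  <xs:sequence>
    <!-- Raw value stays visible in the infoset, but is recomputed on unparse -->
    <xs:element name="raw" type="xs:unsignedByte"
                dfdl:outputValueCalc="{ if (../status eq 'on') then 1 else 0 }" />
    <!-- Symbolic value derived from raw on parse; it has no representation of its own -->
    <xs:element name="status" type="xs:string"
                dfdl:inputValueCalc="{ if (../raw eq 1) then 'on' else 'off' }" />
  </xs:sequence>

On unparse, only status drives the output here, so an edit made to raw alone would not take effect.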

I still do not have a solution to reason 3 above. If such a legacy system needs to mutate the data, it would need to be updated to convert the new form to the symbolic form so that Daffodil can unparse it. However, I have yet to run into a situation where a legacy system needed to make such a mutation.

I believe the core difficulty with your specific example is that the output of the parse is not in a particularly "computer-readable" form. In such a case, it probably does make sense to leave the raw value in. Depending on the context, another approach might be to have a hidden "raw" value, then output a computer-readable translated value as well as a human-readable translated value. In your example, the computer-readable value might be a choice of:

  *   <opt1>XXX</opt1>
  *   <illegal/>
  *   <groupA>XXX</groupA>
  *   <groupB>XXX</groupB>

(When I do this, I tend to make <illegal> carry the raw value forward, so that the parse-unparse round trip can be lossless.)

The experimental feature I mentioned earlier should make unparsing this type of schema relatively painless. Further, users attempting to programmatically consume the infoset would probably appreciate having the data in this form, instead of needing to parse the human-readable string you are currently generating.

In such an example, you can still generate the human-readable string as a convenience, and just ignore it on unparse.

