Posted to dev@daffodil.apache.org by "Interrante, John A (GE Research, US)" <Jo...@ge.com> on 2021/07/01 14:05:27 UTC

How to list offset and length of DFDL elements within native data?

I've been asked a Daffodil / DFDL question that I don't know how to answer.  The question is:

                How to implement a function like get_offset_len(data, schema, field_path) -> (offset, length) ?

                Do you know a good way (using Daffodil library functions or DFDL constructs) to pass some native data, a DFDL schema, an XPath or DPath expression referring to an element in the DFDL schema, and get the offset and length of that element's field within the native data?

                Alternatively, does Daffodil have a way to apply a DFDL schema to some native data, construct an infoset from the native data, and list all the elements in the infoset along with their DPath, offset, and length?

I searched the Daffodil codebase and wasn't able to find a specific API like that, although I may have missed something usable.  I scanned the DFDL specification and did find a DFDL function called "dfdl:contentLength" in section 18.5.3.  The function's signature is:

                dfdl:contentLength($node, $lengthUnits)

                Returns the length of the supplied node's SimpleContent region for elements of simple type, or ComplexContent region for elements of complex type. These regions are defined in Section 9.2 DFDL Data Syntax Grammar. The value is returned as an xs:unsignedLong.
                The second argument is of type xs:string and must be 'bytes', 'characters', or 'bits' (Schema Definition Error otherwise) and determines the units of length.

Being able to get each element's length looks like it could help, although a note in the same section says that the length returned by dfdl:contentLength() excludes any alignment filling as well as any leading or trailing skip bytes.  That is, the returned length tells you how long the content is, but not where the content sits in the native data stream, which is what I was asked to find.  Nevertheless, if the native data is binary rather than text and has fixed-size fields, listing each content field with its length might be sufficient to deduce the position of each content field as well.
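
To make the fixed-size case concrete, here is a minimal sketch in Python of deducing offsets from lengths alone. It assumes the fields are laid out back to back with no alignment fill or skip bytes, and that the per-field lengths have already been obtained somehow (for example from dfdl:contentLength). The function name offsets_from_lengths and the field paths are hypothetical and are not part of Daffodil.

```python
from typing import Dict, List, Tuple


def offsets_from_lengths(fields: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each field path to (offset, length).

    Assumes the fields are contiguous and listed in data order, and that
    every length uses the same unit (bits here).
    """
    result: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for path, length in fields:
        result[path] = (offset, length)
        offset += length  # the next field starts where this one ends
    return result


# Hypothetical field paths and bit lengths, in schema order.
table = offsets_from_lengths([
    ("/msg/header", 16),
    ("/msg/count", 8),
    ("/msg/payload", 32),
])
print(table["/msg/payload"])
```

This only works while nothing other than content occupies the data; once alignment fill or skip regions are involved, the running sum drifts away from the real positions.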

I wonder which would be easier to do?


  1.  Write a Scala program which calls some Daffodil API to parse some native data, construct an infoset from the native data, and list all the elements in the infoset along with their DPath, offset, and length.  This would require Daffodil to have an API to iterate over each element in the infoset and return each element's content length.
  2.  Add DFDL constructs to a DFDL schema which use dfdl:contentLength and dfdl:outputValueCalc to append the same information to the infoset.  This would require saving the infoset as XML and writing a program or command to read the information as a list.
  3.  Another way which I don't know about yet?
  4.  How would we handle any alignment filling as well as any leading or trailing skip bytes if the DFDL schema uses them?
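
One possible way to account for alignment fill and skip regions when deducing positions from lengths is sketched below in Python. It is only a sketch under stated assumptions: all quantities are in the same unit, each element's alignment, leading skip, and trailing skip are known from the schema, and the leading skip is applied before the alignment fill (my reading of the DFDL grammar, not something verified against Daffodil). The names Field, align_up, and layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Field:
    path: str
    length: int            # content length, as dfdl:contentLength would report it
    alignment: int = 1     # assumed known from the schema
    leading_skip: int = 0  # assumed known from the schema
    trailing_skip: int = 0


def align_up(pos: int, alignment: int) -> int:
    """Round pos up to the next multiple of alignment."""
    return -(-pos // alignment) * alignment


def layout(fields: List[Field]) -> Dict[str, Tuple[int, int]]:
    """Map each field path to (offset, length), allowing for skips and fill."""
    result: Dict[str, Tuple[int, int]] = {}
    pos = 0
    for f in fields:
        pos += f.leading_skip             # assumption: skip comes first
        pos = align_up(pos, f.alignment)  # then alignment fill
        result[f.path] = (pos, f.length)
        pos += f.length + f.trailing_skip
    return result


# Hypothetical record: a 1-byte flag, a 4-byte value aligned to 4,
# and a 2-byte tag preceded by 1 skipped byte.
positions = layout([
    Field("/rec/flag", length=1),
    Field("/rec/value", length=4, alignment=4),
    Field("/rec/tag", length=2, leading_skip=1),
])
print(positions)
```

The content lengths stay exactly what dfdl:contentLength would report; only the offsets absorb the fill and skip regions.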

Thanks,
John

Re: How to list offset and length of DFDL elements within native data?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
Daffodil doesn't currently have this ability.

The raw ingredients are largely there.

For example, the dfdl:valueLength and dfdl:contentLength functions can be used as rulers to measure how big something is.

So if you organized a DFDL schema as

<element name="measureThis" ..../>
<element name="thingIWantStartPositionOf" .../>

Then you can put an element in the schema with a dfdl:outputValueCalc expression and literally ask for dfdl:valueLength(../measureThis) in that expression.

The idea that we should be able to annotate every element with its start position and length, and carry this through as annotated Infoset output is a good one. The debugger hooks have this information and output it in the trace output.

