Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/07/09 19:07:36 UTC

A unified treatment of "no data"

Hello DFDL community,

A few weeks ago, while I was chatting with Mike, he tossed out an exciting challenge. He said, "We need a unified treatment of no data." I'd like to take up Mike's challenge, with your help.

As you know, DFDL was created to describe file formats of all kinds, both text and binary. Mike once speculated that there are probably tens of thousands of file formats. I will go further and speculate that every one of them probably has the notion of "no data." Mike's challenge is to characterize that notion. A good characterization would be very useful, given how broadly the notion is used.

I would like the characterization, at least initially, to be independent of DFDL. So, what are the ways that file formats throughout human history have expressed "no data"?

Below is a start at a characterization. Before I go too far, I wanted to check with you to see if I am on the right track.

A unified treatment of "no data"

  *   A region of a file may be treated as having no data in it. There are two possible reasons why a region has no data:
     *   No data was available when the file was created. We say the region has a nil (or null) value.
        *   Example: suppose a region of a file represents a person's middle name and when the file was created there was no information available about the person's middle name. The person has a middle name, but it was not known when the file was created. So, when the file was created, the region was given a nil middle name.
     *   Data is available, and the data is empty. We say the region has an empty value.
        *   Example: again, suppose a region of a file represents a person's middle name but this time the person does not have a middle name. When the file is created, the region is given an empty middle name.
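
The nil/empty distinction above can be sketched in code. This is only an illustration, not part of any format: the MiddleName class and its describe method are hypothetical names, and using Python's None for nil and "" for empty is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiddleName:
    # Hypothetical model of the "middle name" region.
    # value is None -> nil: no data was available when the file was created
    # value == ""   -> empty: data was available, and the data is empty
    value: Optional[str]

    def describe(self) -> str:
        if self.value is None:
            return "nil"
        if self.value == "":
            return "empty"
        return "data"


unknown = MiddleName(None)   # the person has a middle name, but it was not known
no_name = MiddleName("")     # the person does not have a middle name
known = MiddleName("Lee")    # ordinary data
```

The point of the sketch is that nil and empty are two different states, so a reader of the file should not collapse them into a single "missing" case.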

Two ways for a file to denote that a region has a nil value

  *   In-band nil: a symbol is inserted into the region to indicate nil. Thus, a part of the region's value space is reserved for indicating nil.
     *   Example: the string "N/A" is inserted into the region to indicate that data for the person's middle name is Not Available.
  *   Out-of-band nil: a symbol, separate from the region, is used to indicate that the region has a nil value.
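
A rough sketch of the two approaches, under stated assumptions: the "N/A" symbol comes from the example above, while the one-character flag used for the out-of-band case is a hypothetical layout invented for illustration, as are the function names. None stands for nil.

```python
from typing import Optional

NIL_SYMBOL = "N/A"  # in-band nil symbol from the example above


def parse_in_band(region: str) -> Optional[str]:
    """In-band: the reserved symbol inside the region means nil."""
    return None if region == NIL_SYMBOL else region


def parse_out_of_band(nil_flag: str, region: str) -> Optional[str]:
    """Out-of-band: a flag stored apart from the region means nil.

    Hypothetical layout: nil_flag is "1" for nil and "0" otherwise.
    """
    return None if nil_flag == "1" else region
```

One consequence shows up when the region legitimately contains the text "N/A": the in-band parser must treat it as nil, because that part of the value space is reserved, whereas the out-of-band parser can return it as ordinary data when the flag is "0".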
.......
Well, what do you think? Am I on the right track? Any criticisms or edits, big or small, are welcome. I want to get this right.
/Roger