Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2022/01/10 14:48:45 UTC

Why is the NUL character allowed in a DFDL schema?

Hi Folks,

XML Schema hosts the DFDL language, i.e., DFDL properties are added into an XML Schema.

XML Schema is XML.

XML does not allow the NUL character (among others). That means I cannot copy a NUL character from somewhere and paste it directly into an XML document, nor can I use the NUL character indirectly via an XML character reference such as &#0;.

DFDL allows the NUL character via the DFDL entity %NUL;

Isn't this a problem? Isn't DFDL violating a basic rule of XML?

/Roger

Re: Why is the NUL character allowed in a DFDL schema?

Posted by Mike Beckerle <mb...@apache.org>.

Not a problem. But it is a bit tricky to understand.

The "%NUL;" notation lets the schema talk about NUL characters in the data
without having any NUL characters in the schema.

The %NUL; does not turn into a NUL character in the DFDL schema (which is
an XML document). It remains the 5 characters "%NUL;" in the schema. Since
a DFDL schema is itself an XML document, it has an XML Infoset. That Infoset
does not contain a NUL character. It contains the 5-character string "%NUL;"
wherever the schema needs to talk about NUL characters that appear in the data.
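
To make the distinction concrete, here is a small Python sketch. It is only
an illustration: the one-element fragment is a toy, not a complete DFDL
schema, and the helper name is_well_formed is made up for this example. It
uses Python's standard xml.etree.ElementTree parser to show that "%NUL;" is
ordinary text to an XML parser, while the XML character reference &#0; is
rejected.

```python
import xml.etree.ElementTree as ET

# Namespace URI used for DFDL properties.
DFDL_NS = "http://www.ogf.org/dfdl/dfdl-1.0/"

# A toy element carrying a DFDL property. This is NOT a complete DFDL schema.
schema_fragment = f'<element xmlns:dfdl="{DFDL_NS}" dfdl:terminator="%NUL;"/>'

# The XML parser sees "%NUL;" as five ordinary characters.
elem = ET.fromstring(schema_fragment)
terminator = elem.get(f"{{{DFDL_NS}}}terminator")
print(repr(terminator), len(terminator))
print("contains a real NUL:", "\x00" in terminator)


def is_well_formed(xml_text: str) -> bool:
    """Hypothetical helper: True if the XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The fragment using the DFDL entity is well-formed XML.
print(is_well_formed(schema_fragment))
# A character reference to NUL is not allowed in XML 1.0.
print(is_well_formed("<element>&#0;</element>"))
```

Only a DFDL processor gives "%NUL;" any special meaning; the XML layer never
sees a NUL character at all.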

Now, if you use Apache Daffodil to convert data to XML, and the data
contains NUL characters, not only as delimiters, but right in the strings
of data, then we have to create XML and the XML we create needs to
represent that the underlying data had a NUL in it.

This is not a problem for DFDL or DFDL processors as the DFDL Infoset is
different from the XML Infoset and one way it is different is that the DFDL
Infoset is explicitly allowed to contain NUL and in fact is allowed to
contain all the other characters XML 1.0 disallows.

But that leaves the issue of what happens when the DFDL Infoset is
converted into say, an XML Infoset, or a JSON Infoset representation.
A string which has a NUL character in the middle of it must somehow
represent that NUL, but in XML, it cannot simply pass through the NUL, as
XML doesn't allow NUL.

This problem is not at all unique to DFDL. Lots of software has had to cope
with this XML restriction.

DFDL (the language) does not take any position on how DFDL implementations
cope with this. This is probably a flaw in the specification, as standard
conversions to/from common infoset representations should probably also be
standardized. Perhaps that will happen in the future.

Apache Daffodil uses a strategy adapted from what we found many other
pieces of software use. E.g., we do something similar to what Microsoft has
published as what they do for Microsoft Visio.

Apache Daffodil uses the Unicode Private Use Area (PUA) characters,
which are all legal as characters of XML documents. It converts characters
that are illegal in XML 1.0 to corresponding characters in the Unicode
Private Use Area. The mapping is one-to-one, so on unparsing the inverse
mapping restores the original characters.

A character is projected into the private use area by adding 0xE000 to its
Unicode code point. So NUL (0) becomes 0xE000. Ctrl-A (which is Unicode
code point 1) becomes 0xE001. And so on.
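
The offset arithmetic can be sketched in a few lines of Python. This is a
simplified illustration, not Daffodil's actual code: the names PUA_OFFSET,
XML_ILLEGAL_C0, to_xml_safe and from_xml_safe are invented here, and the
sketch assumes only the control characters below 0x20 that XML 1.0 forbids
(everything except tab, line feed and carriage return) need remapping.

```python
# Offset that projects a character into the Unicode Private Use Area.
PUA_OFFSET = 0xE000

# Control characters below 0x20 that XML 1.0 disallows:
# everything except TAB (0x09), LF (0x0A) and CR (0x0D).
# Assumption: other illegal characters are out of scope for this sketch.
XML_ILLEGAL_C0 = frozenset(range(0x00, 0x20)) - {0x09, 0x0A, 0x0D}


def to_xml_safe(text: str) -> str:
    """Replace XML-illegal control characters with PUA characters."""
    return "".join(
        chr(ord(ch) + PUA_OFFSET) if ord(ch) in XML_ILLEGAL_C0 else ch
        for ch in text
    )


def from_xml_safe(text: str) -> str:
    """Inverse mapping: turn the PUA characters back into the originals."""
    return "".join(
        chr(ord(ch) - PUA_OFFSET) if (ord(ch) - PUA_OFFSET) in XML_ILLEGAL_C0 else ch
        for ch in text
    )


original = "AB\x00CD\x01"
mapped = to_xml_safe(original)
print([hex(ord(ch)) for ch in mapped])
print(from_xml_safe(mapped) == original)
```

In this sketch the NUL ends up as code point 0xE000 and Ctrl-A as 0xE001,
and the inverse function recovers the original string.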

A different mapping is used for other disallowed Unicode characters, such as
isolated surrogate halves. The gory details are in the section called
"XML Illegal Characters" here:
https://daffodil.apache.org/infoset/

A problem remains that the fonts people use to look at data on their
computer screens often lack glyphs for the Unicode PUA characters, so a
0xE000 character often just shows up as a box with no ability to
distinguish whether that box represents 0xE000, or 0xE001, or 0xE002, etc.
But the underlying string of XML does have distinct code points in it. So
this "on screen printed" representation is only a loss of information for a
reader's eyes. XML Infoset data saved to a file will have the distinct
character codes 0xE000, or 0xE001, etc.
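
When the boxes on screen are indistinguishable, one way to see what is
really in a string is to print its code points instead of its glyphs. A
minimal Python sketch, with a made-up helper name code_points:

```python
def code_points(text: str) -> list[str]:
    """Hypothetical helper: list each character as a U+XXXX code point."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Three PUA characters that many fonts render as identical boxes.
mapped = "\ue000\ue001\ue002"
print(code_points(mapped))
```

The three characters may look the same in a terminal or editor, but the
listing shows three different code points.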

This is one of the reasons Apache Daffodil says you must use a computer set
up for Unicode to use Daffodil, because even if all your data is US-ASCII,
NUL is a legal US-ASCII character, and NUL will end up as an 0xE000
character, which cannot be depicted without Unicode font support.

This headache is one of the ones the DFDL workgroup accepted when we chose
XML Schema and the XML Infoset as starting points for DFDL way back in
2001.

JSON has a different set of restrictions on characters. E.g., a DFDL string
that contains a line-ending character can only be represented in a JSON
string if that character is converted to an escape such as "\n".
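
As an illustration of JSON string escaping in general, here is what Python's
standard json module does with a line ending and with a NUL. This shows the
behavior of that one library only; it is not a description of how Daffodil
produces its JSON infoset.

```python
import json

# A line ending inside a string is written as the two-character escape \n.
with_newline = json.dumps("line1\nline2")
print(with_newline)

# A NUL is written as a \u escape, so no raw NUL appears in the JSON text.
with_nul = json.dumps("a\x00b")
print(with_nul)
print("raw NUL in JSON text:", "\x00" in with_nul)

# Decoding restores the original characters.
print(json.loads(with_nul) == "a\x00b")
```

Unlike XML 1.0, JSON has an escape syntax that can name a NUL, so a JSON
representation does not need the Private Use Area trick for that character.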

I suspect we're missing documentation of the conversion of the DFDL Infoset
to the JSON Infoset.
(Created: https://issues.apache.org/jira/browse/DAFFODIL-2621)
