
Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/02/18 16:23:59 UTC

Is it a bad thing that the output of unparsing is different than the original input?

Hello DFDL community.

My input contains a series of rows, separated by one or more newlines/whitespaces:

Dear Sir: Thank you for your response.
Hello, world: How are you?

Sender: John Doe
Date: November 23, 2018



Foo: Bar
John: Doe

In my slides I show several ways to design the schema for parsing this input. Here is the first way that I show:

<xs:element name="input">
    <xs:complexType>
        <xs:sequence dfdl:separator="%NL;%WSP*;" dfdl:separatorPosition="infix">
            <xs:element name="row" maxOccurs="unbounded">
                <xs:complexType>
                    <xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
                        <xs:element name="label" type="xs:string" />
                        <xs:element name="message" type="xs:string" />
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

The input parses fine, but the output from unparsing differs from the original input:

[Image not preserved in this plain-text archive: comparison of the original input with the unparsed output]

Is it a bad thing that the output of unparsing is different than the original input?

Is it possible to make a small tweak to the schema (without altering the underlying schema design) so that the output of unparsing matches the original input?

/Roger

Re: Is it a bad thing that the output of unparsing is different than the original input?

Posted by "Beckerle, Mike" <mb...@tresys.com>.

It is often the case that the unparsed output does not exactly match the input. An exact match is often not even desirable.


What you want is for the unparsed output to be *equivalent* to the parsed input, but not necessarily identical.


One way to think of it is the parser should accept the data in any legal representation, but the unparser should produce the data in a canonical form - the preferred representation.
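
Applied to the schema in the question, this is roughly what happens with the row separator. As far as I understand the DFDL rules, on unparsing %NL; is written as the value of dfdl:outputNewLine and %WSP*; is written as nothing, so every run of blank lines in the input comes back out as a single newline. A sketch, assuming the "%LF;" value (it is chosen here only for illustration and may already be set differently by the default format):

```xml
<!-- Parsing accepts a newline followed by any amount of whitespace.
     Unparsing writes only the canonical form: one outputNewLine, with
     nothing emitted for %WSP*;. The %LF; value is an assumption. -->
<xs:sequence dfdl:separator="%NL;%WSP*;" dfdl:separatorPosition="infix"
             dfdl:outputNewLine="%LF;">
    <!-- row elements as in the original schema -->
</xs:sequence>
```

So the unparsed output would be expected to have exactly one newline between rows, whatever spacing the input had.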


Consider that many formats allow delimiters that are "one or more spaces". On unparsing, if you want the output to be identical to the input, the schema must view the number of such delimiters as significant data so that it is captured as part of the infoset. You have to model, in your schema, this syntactic aspect, using elements so that the number of them is captured.


This is possible. It's most often not desirable. What you want the infoset to contain, and the parser to do, is capture the significant aspects of the data, suppressing/ignoring that which is not significant.
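
For illustration only, here is an untested sketch of what modeling the whitespace as data might look like for the schema in the question. The "gap" element name and both regular expressions are hypothetical, and the sketch assumes the default format supplies the remaining properties (for example lengthKind "delimited" for label, and an empty separator on the outer sequences). Note that it changes the schema design rather than tweaking it:

```xml
<xs:element name="row" maxOccurs="unbounded">
    <xs:complexType>
        <xs:sequence>
            <xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
                <xs:element name="label" type="xs:string" />
                <!-- message runs to the end of the line -->
                <xs:element name="message" type="xs:string"
                            dfdl:lengthKind="pattern"
                            dfdl:lengthPattern="[^\r\n]*" />
            </xs:sequence>
            <!-- hypothetical element: the newlines and whitespace after a
                 row become infoset data, so unparsing writes them back -->
            <xs:element name="gap" type="xs:string"
                        dfdl:lengthKind="pattern"
                        dfdl:lengthPattern="\s*" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

The cost is visible in the infoset: every row now carries a gap element holding nothing but whitespace, which is exactly the kind of insignificant detail the infoset would normally leave out.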


So once you think about it, perfect round-trip behavior is not only not necessary, it's not even desirable for many formats.


The Daffodil TDML test language and runner explicitly accommodates this by allowing tests to be run in a variety of "round trip" modes.


The basic "onePass" mode is the one where the unparser output must exactly match the parser input. Many formats do work this way.


The "twoPass" mode specifies that the unparser output will NOT match the input, but if you re-parse that output, you get the same infoset as the first parse. Textual formats often work this way. E.g., they have multiple alternative delimiters, one of which is considered the canonical one.


There is even a "threePass" mode which specifies that the first unparser output will NOT match the input, if you re-parse that again you will NOT get the same infoset as the first parse, but if you unparse that infoset you will get the same data you got from the first unparse. This is needed only in some cases of formats which have some ambiguities - usually around nillable element representations being ambiguous with empty-string representations.


The twoPass and threePass modes must be used with care, because they can mask all sorts of errors. E.g., an unparse that produces nothing will succeed in threePass mode.
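
As a rough sketch, a TDML test for the schema in the question could select the mode with the roundTrip attribute on the test case. The suite name, test name, and schema file name (rows.dfdl.xsd) are hypothetical. The expected infoset assumes the schema has no target namespace and does no trimming, so each message keeps its leading space:

```xml
<tdml:testSuite suiteName="rows"
    xmlns:tdml="http://www.ibm.com/xmlns/dfdl/testData">

    <!-- twoPass: the unparsed data may differ from the document below,
         but re-parsing it must give the same infoset -->
    <tdml:parserTestCase name="rows_twoPass" root="input"
        model="rows.dfdl.xsd" roundTrip="twoPass">
        <tdml:document><![CDATA[Foo: Bar

John: Doe]]></tdml:document>
        <tdml:infoset>
            <tdml:dfdlInfoset>
                <input>
                    <row><label>Foo</label><message> Bar</message></row>
                    <row><label>John</label><message> Doe</message></row>
                </input>
            </tdml:dfdlInfoset>
        </tdml:infoset>
    </tdml:parserTestCase>
</tdml:testSuite>
```

With roundTrip="onePass" the same test would be expected to fail, since the blank line between the two rows is not reproduced on unparsing.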



