Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2019/04/05 13:25:16 UTC

Bug in Daffodil?

Hello DFDL community,

My input file consists of a prolog of known format and a payload surrounded by parentheses. The payload consists of a series of text fields separated by hyphens. In some cases a hyphen is preceded by a newline, which can be either a lone carriage return (CR) or a CRLF pair.

Here is a sample input file; I show it in a hex editor so you can see that some hyphens are preceded by CRLF and others by just a CR.

[Image: hex editor view of the sample input file; attachment image002.png is not included in this plain-text archive]

Here is my DFDL schema:

<xs:element name="input">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="prolog" type="xs:string" dfdl:terminator="%NL;" />
            <xs:element name="payload" dfdl:initiator="(" dfdl:terminator=")">
                <xs:complexType>
                    <xs:sequence dfdl:separator="-" dfdl:separatorPosition="infix">
                        <xs:element name="field" type="xs:string" maxOccurs="unbounded" />
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

When I parse the input file using the DFDL schema, I get this XML:

<input>
  <prolog>PROLOG</prolog>
  <payload>
    <field>A</field>
    <field>B</field>
    <field>C
</field>
    <field>D</field>
    <field>E
</field>
    <field>F</field>
  </payload>
</input>

That's perfect.

When I unparse the XML I get this (please note the bug (?) described in yellow):

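[Image: hex editor view of the unparsed output, with the suspected bug highlighted in yellow; attachment image006.png is not included in this plain-text archive]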




Re: Bug in Daffodil?

Posted by Steve Lawrence <sl...@apache.org>.
This is actually the expected behavior, though it's maybe not always
desired.

The issue here is that XML text cannot carry a literal CR: XML
line-ending normalization leaves only LFs. So when we output infoset
data, all CRLFs are converted to LF, and any lone CRs are also converted
to LF. Unfortunately, if your data fields contain a CR, it's going to
get lost. In a lot of cases this is fine, since lots of formats don't
care about CRLF vs LF. But there are definitely some places where it
matters.
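
To make that concrete, here is a small sketch of the two cases. The byte
values in the comment are illustrative assumptions (the hex dump from the
original message is not reproduced here), not a copy of the actual file:

```xml
<!-- Illustrative only: assume these bytes in the input data.
     field ending in CRLF:  43 0D 0A   ("C", CR, LF)
     field ending in CR:    45 0D      ("E", CR)
     In the XML infoset both line endings become a single LF (0A),
     so the original line endings can no longer be told apart. -->
<payload>
  <field>C
</field>
  <field>E
</field>
</payload>
```

Once the infoset looks like this, the unparser has no record of which
line ending the original data used.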

DAFFODIL-1559 [1] is the issue for allowing this behavior to be changed.
One option would be to convert CR characters in the data to private use
area characters, like we do with other illegal XML characters, but that
makes the infoset less useful. Another option might be to say that
whenever an LF appears in the data, we just always unparse it as a CRLF.
This means if your data mixes CRLF and LF, we'd always output CRLF, but
that's probably not a big deal if mixing is allowed in the format.

- Steve

[1] https://issues.apache.org/jira/browse/DAFFODIL-1559


Re: Bug in Daffodil?

Posted by "Beckerle, Mike" <mb...@tresys.com>.
Hmmm. Steve Lawrence is right. XML normalizes CR to LF.

But I'm wondering if you just need to use the dfdl:outputNewLine="%CR;%LF;" property. This would make the output line endings CRLF when unparsing.

This is not the default setting for this property. By default this property gets its value from the variable $dfdl:outputNewLine, which has a default value of %LF;.
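
As a rough sketch of where that property could go, here it is in attribute form on the prolog element from the schema above. Note the assumption: as I read the DFDL specification, dfdl:outputNewLine controls what is written for the %NL; entity in delimiters during unparsing, so it should affect the prolog terminator. Whether it changes newlines that are part of a field's value has not been confirmed in this thread.

```xml
<!-- Sketch: ask the unparser to write CRLF wherever %NL; appears in a
     delimiter of this element (here, the terminator). -->
<xs:element name="prolog" type="xs:string"
            dfdl:terminator="%NL;"
            dfdl:outputNewLine="%CR;%LF;" />
```

The same property can also be set once on the schema's dfdl:format annotation so that it applies to every element.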



