Posted to users@daffodil.apache.org by Claude Mamo <cl...@gmail.com> on 2020/03/03 09:02:22 UTC

EDIFACT missing new line

Hello Daffodil team,

Thank you for this fantastic open source library. We've integrated Daffodil
with Smooks (https://github.com/smooks/smooks-dfdl-cartridge) and so far it
looks awesome. At the moment, we're implementing support for EDIFACT. We're
very close, but not quite there yet. In particular, the EDIFACT schemas,
available at
https://github.com/DFDLSchemas/EDIFACT/tree/daffodil-dev/src/main/resources,
don't seem to behave as expected when it comes to new lines.

When unparsing the infoset
https://raw.githubusercontent.com/DFDLSchemas/EDIFACT/daffodil-dev/src/test/resources/EDIFACT-SupplyChain-D03B/TestInfosets/INVOIC_D.03B_Interchange_with_UNA.xml,
the new line is missing between the "UNA" and "UNB" segments:
https://gist.github.com/claudemamo/6e381738bb1fa21fd7e14c6867380308. I've
played around with the expression setting the "ibmEdiFmt:SegmentTerm"
variable but didn't have much luck (
https://github.com/claudemamo/smooks-edifact-cartridge/blob/master/schemas/src/main/resources/EDIFACT-Common/EDIFACT-Service-Segments-4.1.dfdl.xsd#L128-L139).


Any advice?

Claude

Re: EDIFACT missing new line

Posted by Claude Mamo <cl...@gmail.com>.
Steve, thank you for the detailed explanation. Much appreciated.

Claude


Re: EDIFACT missing new line

Posted by Steve Lawrence <sl...@apache.org>.
Smooks looks really interesting! Nice to hear that Daffodil is working
out well. Please keep us updated!

The issue you are seeing is caused by the way that EDI is described in
the schema (note that I'm not too familiar with the EDI format, but am
familiar with the schema).

The EDIFACT-SupplyChain-Messages-D.03B.xsd file describes the UNA
section like this:

  <xsd:element dfdl:terminator="%WSP*;" type="srv:UNA" ... />

The %WSP*; is a special character class that, when parsing data, will
match zero or more whitespace characters (spaces, tabs, newlines, etc.).
So this says that the UNA can be terminated by any whitespace characters
when parsing.

The problem with this is that when Daffodil creates an infoset from a
parse, it does not include which terminator was found at the end of the
data. It could have been a newline, but it also could have been a space,
or nothing at all.

So on unparsing, since the infoset doesn't contain information about
that matched terminator, Daffodil must make a decision on what value to
unparse. The rule for unparsing a WSP* is that Daffodil just unparses
the empty string, which is valid for WSP*. Although the unparsed data
might differ from the original, it is still valid EDI and semantically
the same according to the schema.

If you only wanted to support data with newlines, and thus force
unparsing to create a newline, you could change that %WSP*; to %NL;
which is the special character class to match newlines. Or
alternatively, if you wanted to support both newlines OR any whitespace
characters, but always wanted to unparse as a newline, you could provide
two terminators, e.g.:

  dfdl:terminator="%NL;%WSP*; %WSP*;"

On parsing, that will accept either a newline followed by zero or more
whitespace characters, or just zero or more whitespace characters. It
will always unparse as a newline, since that terminator appears first in
the list.
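
Applied to the UNA element shown earlier, that second option might look
roughly like the sketch below. The remaining attributes stay elided, as
in the original declaration, and %WSP*; is used after %NL; so that the
first alternative matches "a newline followed by zero or more whitespace
characters".

```xml
<!-- Sketch only: the other attributes are elided, as in the original.
     The first alternative (%NL;%WSP*;) is the one written on unparse.
     The second (%WSP*;) is only an extra form accepted on parse. -->
<xsd:element dfdl:terminator="%NL;%WSP*; %WSP*;" type="srv:UNA" ... />
```

The newline sequence actually written for %NL; on unparse is taken from
the dfdl:outputNewLine property in scope, so it is worth checking that
property if a specific line ending (LF vs. CRLF) is needed.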

Note that something similar happens in EDIFACT-Service-Segments-4.1.xsd
for the definition of SegmentTerminator. Fortunately, it has a comment
on what to do if you want newlines to appear when unparsing, which is
similar to the above suggestion. Also, the default value for SegmentTerm
(defined in IBM_EDI_Format.xsd) is "%WSP*; %NL;%WSP*;", so either zero or
more whitespace characters or a newline followed by zero or more
whitespace characters. To unparse a newline, swap the order of those
(like the above terminator) so that the NL is first and will be used for
unparsing.
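
As a rough illustration, the swapped default could look something like
the sketch below. The exact declaration in IBM_EDI_Format.xsd is not
reproduced here, so the attribute set is an assumption; only the
variable name and the order of the two alternatives come from the
description above, written with the % prefix that DFDL character class
entities require.

```xml
<!-- Hypothetical shape of the declaration, for illustration only.
     The newline alternative is listed first so that it is the one
     used when unparsing. -->
<dfdl:defineVariable name="SegmentTerm" type="xs:string"
    defaultValue="%NL;%WSP*; %WSP*;" />
```

If the schema sets SegmentTerm through an expression rather than relying
on the default, the same ordering would need to be applied there.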

- Steve

