Posted to dev@daffodil.apache.org by "Michael Beckerle (JIRA)" <ji...@apache.org> on 2019/04/10 11:53:00 UTC
[jira] [Commented] (DAFFODIL-1559) Add option to disable CRLF to LF XML canonicalization

    [ https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814382#comment-16814382 ] 

Michael Beckerle commented on DAFFODIL-1559:
--------------------------------------------

A freestanding CR also needs to be preservable, not just the CR of a CRLF pair.

Our current policy is described here: [https://daffodil.apache.org/infoset/#xml-illegal-characters]

So we need an option to convert all CR to the code point #xE00D whether they are isolated CR or part of CRLF pairs.

The code point #xE00D is in the Unicode Private Use Area (PUA), and XML processing should then preserve it.

This also needs to work for unparsing: #xE00D in the XML turns back into #xD (a CR character) in the DFDL Infoset, which is then unparsed as a regular #xD code point.
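
As a rough illustration of the two directions described above, here is a minimal sketch in Java. The class and method names (CrPuaMapping, crToPua, puaToCr) are hypothetical and are not part of Daffodil's API; the only facts taken from this comment are the two code points, #xD and #xE00D.

```java
// Hypothetical helper for illustration only; not Daffodil's actual API.
final class CrPuaMapping {
    static final char CR = '\r';          // U+000D, carriage return
    static final char PUA_CR = '\uE00D';  // Private Use Area stand-in for CR

    private CrPuaMapping() {}

    // Parse direction: applied to a string value from the DFDL Infoset
    // before it is written out as XML. Every CR is remapped, whether it
    // is isolated or the first half of a CRLF pair. LF is left alone.
    static String crToPua(String dfdlValue) {
        return dfdlValue.replace(CR, PUA_CR);
    }

    // Unparse direction: applied to a string value read from XML before
    // it is placed in the DFDL Infoset, restoring the original CR.
    static String puaToCr(String xmlValue) {
        return xmlValue.replace(PUA_CR, CR);
    }
}
```

With this sketch, a value containing "a", CR, LF, "b" would be written to XML as "a", #xE00D, LF, "b", and reading it back restores the CR. One consequence of any such mapping is that a genuine #xE00D in the original data would also come back as CR on unparse.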

This behavior should, ultimately, be the default behavior. Converting CR to LF and CRLF to LF not as delimiters, but in the data contents of an element, is probably just wrong.
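
To make the loss of information concrete, here is a small Java sketch of the CR-to-LF and CRLF-to-LF conversion that the issue describes. The names (LfNormalizationSketch, normalizeToLf) are made up for illustration and this is not the actual Daffodil code.

```java
// Hypothetical illustration of the lossy conversion; not Daffodil's actual code.
final class LfNormalizationSketch {
    private LfNormalizationSketch() {}

    static String normalizeToLf(String value) {
        // Replace CRLF first so each pair becomes a single LF,
        // then replace any remaining isolated CR.
        return value.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```

Three different data values, "a" CRLF "b", "a" CR "b", and "a" LF "b", all come out of normalizeToLf as "a" LF "b", so after unparsing there is no way to tell which one the original data contained.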

Most of the many Daffodil tests with CRLF-related behaviors use CR, LF, and CRLF in delimiters. Those would be unaffected by this change, which is only about CR or CRLF found in the data values of string data. So perhaps not many tests will be broken by this change.

This change should be isolated to some utility functions (in org.apache.daffodil.util), and to the InfosetInputters and InfosetOutputters that consume and produce XML, converting between Daffodil's DFDL Infoset and the XML Infoset.

Despite the fact that many people use Daffodil to convert data to/from XML, XML technology is isolated in Daffodil to just the infoset inputter and infoset outputters.

When parsing, Daffodil converts the data stream into a DFDL Infoset. Then an InfosetOutputter is called, which traverses the DFDL Infoset, creating an XML Infoset (as text, Scala XML objects, or JDOM objects - take your pick). This suggested change only affects these InfosetOutputters.

Similarly when unparsing, an InfosetInputter consumes XML and constructs the DFDL Infoset. This DFDL Infoset is then traversed by the Daffodil unparser to unparse data back to a data stream. This suggested change affects only these InfosetInputters.

 

> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
>                 Key: DAFFODIL-1559
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
>             Project: Daffodil
>          Issue Type: New Feature
>          Components: API
>            Reporter: Steve Lawrence
>            Priority: Minor
>              Labels: beginner
>
> See the review for more details. The short of it is that when converting parse results to XML, we convert CR to LF, and we convert CRLF to LF. This means that we lose the information that the data used to contain CRLF. This is similar to how we lose that information with delimiters if someone uses NL, but it's slightly different since it is actual data. However, it's most user friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this information can be maintained if someone needs it.


