Posted to commits@daffodil.apache.org by "Mike Beckerle (Jira)" <ji...@apache.org> on 2020/07/30 21:06:00 UTC
[jira] [Comment Edited] (DAFFODIL-1559) Add option to disable CRLF to LF XML canonicalization

    [ https://issues.apache.org/jira/browse/DAFFODIL-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814436#comment-16814436 ] 

Mike Beckerle edited comment on DAFFODIL-1559 at 7/30/20, 9:05 PM:
-------------------------------------------------------------------

PUA isn't necessary for CR, but only for the illegal code points. CR isn't "illegal".

But the PUA may be preferable, because otherwise there is this "quoting hell" issue.

Consider what happens if we just leave in the CR character code point when we create JDOM or Scala XML trees, and use &#xD; when creating XML text. If serialized to text, and then loaded by an XML loader, the CR character may disappear. I don't know that it does, because the CR character isn't illegal, but the XML specs I've read suggest that CR and CRLF are converted to a single LF by XML loaders. I don't know whether that conversion is suppressed if the CR is represented not as a literal character but as a character reference, nor whether it is suppressed if the CR character is found inside CDATA bracketing.
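
For reference, XML 1.0 section 2.11 (end-of-line handling) has the parser normalize literal CR and CRLF to LF on input, before anything else is parsed. That implies a character reference such as &#xD; is resolved afterwards and keeps its CR, while a literal CR inside a CDATA section is normalized like any other. A minimal sketch for checking this against a real parser, assuming Python's standard xml.etree.ElementTree (expat-based) as a stand-in; the JDOM and Scala XML loaders discussed here would need to be checked separately:

```python
import xml.etree.ElementTree as ET


def parsed_text(xml: str) -> str:
    """Parse a one-element document and return that element's text."""
    return ET.fromstring(xml).text


# Literal CRLF and a literal lone CR in element content.
literal_crlf = parsed_text("<a>x\r\ny</a>")
literal_cr = parsed_text("<a>x\ry</a>")
# CR written as a numeric character reference.
char_ref_cr = parsed_text("<a>x&#xD;y</a>")
# Literal CR inside a CDATA section.
cdata_cr = parsed_text("<a><![CDATA[x\ry]]></a>")

for name, value in [
    ("literal CRLF", literal_crlf),
    ("literal CR", literal_cr),
    ("&#xD; reference", char_ref_cr),
    ("CR inside CDATA", cdata_cr),
]:
    # repr() makes \r and \n visible in the output.
    print(f"{name}: {value!r}")
```

With this parser only the &#xD; form should come back with a CR in it; the other three come back with LF.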

In general, though, use of XML numeric character references like &#xD; can be problematic, because they aren't recognized inside a <![CDATA[ ... ]]> context. If you have a long string with lots of characters that need XML escaping, like "<" and "&", then rather than converting those to &lt; and &amp;, you might want the whole string wrapped in <![CDATA[ ... ]]> so that the "<" and "&" characters can stay as they are. But you can't get away with that if the string also needs a &#xD; in it. You'd have to convert this:

{code:java}
    <<< a & b>>>CR<<< c & d>>>
{code}

where CR is the x0D character, into:

{code:java}
   <![CDATA[<<< a & b>>>]]>&#xD;<![CDATA[<<< c & d>>>]]>
{code}

which is getting pretty ugly. Or you can do:

{code:java}
   &lt;&lt;&lt; a &amp; b>>>&#xD;&lt;&lt;&lt; c &amp; d>>>
{code}

which is similarly ugly as hell. Note that this isn't the same as just calling escapify on a string that has CR replaced by &#xD; in it, because the "&" of the reference would itself get escaped, giving &amp;#xD;, which an XML loader converts into literally an "&" character followed by the "#xD;" characters rather than a CR. I.e., you still have to recognize that there is a CR in the string, break the string at the CR, make two separate escapify calls, and concatenate the results with &#xD; in the middle.
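
The split-escape-rejoin step described above can be sketched as follows. This is only an illustration: Python's xml.sax.saxutils.escape stands in for escapify, and the helper names (naive_escape, escape_keeping_cr, cdata_keeping_cr, round_trip) are hypothetical, not anything in Daffodil.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape  # stand-in for "escapify" (assumption)

CR_REF = "&#xD;"


def naive_escape(s: str) -> str:
    """Replace CR first, then escape: the '&' of the reference gets escaped too."""
    return escape(s.replace("\r", CR_REF))


def escape_keeping_cr(s: str) -> str:
    """Split at each CR, escape the pieces, and rejoin them with &#xD;."""
    return CR_REF.join(escape(piece) for piece in s.split("\r"))


def cdata_keeping_cr(s: str) -> str:
    """Wrap each CR-free piece in CDATA and put &#xD; between the sections."""
    def cdata(piece: str) -> str:
        # "]]>" cannot appear inside CDATA, so split it across two sections.
        return "<![CDATA[" + piece.replace("]]>", "]]]]><![CDATA[>") + "]]>"
    return CR_REF.join(cdata(piece) for piece in s.split("\r"))


def round_trip(content: str) -> str:
    """Put the content in an element, parse it, and return the parsed text."""
    return ET.fromstring("<v>" + content + "</v>").text


data = "<<< a & b>>>\r<<< c & d>>>"

print(repr(round_trip(naive_escape(data))))       # CR is lost, "&#xD;" text appears
print(repr(round_trip(escape_keeping_cr(data))))  # original string, CR intact
print(repr(round_trip(cdata_keeping_cr(data))))   # original string, CR intact
```

Both of the working variants have to know where the CRs are before any escaping happens, which is the extra bookkeeping being objected to here.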

Honestly the PUA may be preferable.
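
A rough sketch of what the PUA route could look like, for comparison. The offset of 0xE000 (so CR, U+000D, becomes U+E00D) is an assumption made for illustration, as are the helper names cr_to_pua and pua_to_cr; the actual mapping would be whatever the option for this issue ends up defining.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PUA_OFFSET = 0xE000                  # assumed offset, for illustration only
PUA_CR = chr(PUA_OFFSET + 0x0D)      # U+E00D stands in for CR (U+000D)


def cr_to_pua(s: str) -> str:
    """Remap every CR to its PUA stand-in before building XML."""
    return s.replace("\r", PUA_CR)


def pua_to_cr(s: str) -> str:
    """Reverse the remapping after the XML has been loaded."""
    return s.replace(PUA_CR, "\r")


data = "<<< a & b>>>\r<<< c & d>>>"

# One ordinary escape call over the whole string: no splitting at CRs,
# no character references, and CDATA wrapping would work just as well.
xml_text = "<v>" + escape(cr_to_pua(data)) + "</v>"

recovered = pua_to_cr(ET.fromstring(xml_text).text)
print(recovered == data)
```

The PUA character is an ordinary legal XML character, so it passes through serializers and loaders untouched. The cost is that data which already contains U+E00D can no longer be distinguished from a remapped CR.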

> Add option to disable CRLF to LF XML canonicalization
> -----------------------------------------------------
>
>                 Key: DAFFODIL-1559
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1559
>             Project: Daffodil
>          Issue Type: Bug
>          Components: API
>            Reporter: Steve Lawrence
>            Priority: Major
>              Labels: beginner
>
> See the review for more details. The short of it is that when converting parse results to XML, we convert CR to LF, and we convert CRLF to LF. This means that we lose the information that the data used to contain CRLF. This is similar to how we lose that information with delimiters if someone uses NL, but it's slightly different since it is actual data. However, it's most user friendly and consistent with other XML technologies to have this behavior.
> Perhaps we need an option to convert CRLF to somewhere in PUA so that this information can be maintained if someone needs it.


