Posted to commits@daffodil.apache.org by sl...@apache.org on 2018/02/07 14:40:13 UTC

[incubator-daffodil-site] branch asf-site updated: Publishing from 20415cc46b561af68ac2d24b794cb8af804c2392

This is an automated email from the ASF dual-hosted git repository.

slawrence pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-daffodil-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new de7c02a  Publishing from 20415cc46b561af68ac2d24b794cb8af804c2392
de7c02a is described below

commit de7c02ac9d3e8a40cfde0afaaf142a840dbaa765
Author: Steve Lawrence <sl...@tresys.com>
AuthorDate: Wed Feb 7 09:38:05 2018 -0500

    Publishing from 20415cc46b561af68ac2d24b794cb8af804c2392
---
 content/tdml/index.html | 173 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)

diff --git a/content/tdml/index.html b/content/tdml/index.html
index d3a3f8d..271bd79 100644
--- a/content/tdml/index.html
+++ b/content/tdml/index.html
@@ -361,6 +361,179 @@ that warnings are considered non-fatal and so can appear alongside
 </code></pre>
 </div>
 
+<h3 id="using-cdata-regions">Using CDATA Regions</h3>
+
+<p>XML CDATA regions indicate XML data that should not be interpreted as XML.
+Although CDATA is generally used to easily include XML special characters in XML
+data, its use has other benefits in TDML files as well. Below are examples of
+scenarios in which CDATA regions should and should not be used.</p>
+
+<h4 id="-as-a-clear-way-represent-xml-special-characters"><i class="glyphicon glyphicon-ok" style="color: #00d000;"></i>  As a clear way to represent XML special characters</h4>
+
+<p>The characters <code class="highlighter-rouge">&lt;</code>, <code class="highlighter-rouge">&gt;</code>, <code class="highlighter-rouge">&amp;</code>, <code class="highlighter-rouge">'</code>, and <code class="highlighter-rouge">"</code> must be represented in XML
+with <code class="highlighter-rouge">&amp;lt;</code>, <code class="highlighter-rouge">&amp;gt;</code>, <code class="highlighter-rouge">&amp;amp;</code>, <code class="highlighter-rouge">&amp;apos;</code>, and <code class="highlighter-rouge">&amp;quot;</code>, respectively.
+These special characters do not need to be escaped inside CDATA regions, which can
+make the data clearer. For example, the following are equivalent:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;foo&gt;</span>abc<span class="ni">&amp;amp;&amp;amp;&amp;amp;</span>def<span class="nt">&lt;/foo&gt;</span>
+<span class="nt">&lt;foo&gt;</span>abc<span class="cp">&lt;![CDATA[&amp;&amp;&amp;]]&gt;</span>def<span class="nt">&lt;/foo&gt;</span>
+</code></pre>
+</div>
+
+<h4 id="-to-preserve-textual-formatting-within-tdml---for-clarity-reasons"><i class="glyphicon glyphicon-ok" style="color: #00d000;"></i>  To preserve textual formatting within TDML - for clarity reasons</h4>
+
+<p>IDEs and XML editors will often indent, wrap, and remove redundant
+whitespace in XML data. However, sometimes it is desirable to maintain such
+formatting for readability. Many tools refuse to modify
+CDATA regions, so they can be used as a way to maintain
+formatting. For example:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;tdml:documentPart</span> <span class="na">type=</span><span class="s">"byte"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[
+00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
+10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
+20 21    23 24 25    27 28 29 2a 2b 2c 2d 2e 2f
+30 31 32 33 34 35 36 37 38 39 3a 3b    3d    3f
+40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
+50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f
+]]&gt;</span><span class="nt">&lt;/tdml:documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>The data holes in the above matrix of hex would be hard to understand without
+the formatting. But logically, the whitespace is irrelevant when the
+documentPart type is “byte”. In effect, we have CDATA here so that tooling such as
+IDEs and XML editors will not alter the formatting of the content.</p>
+
+<h4 id="-to-avoid-insertion-of-whitespace-that-would-make-things-incorrect"><i class="glyphicon glyphicon-ok" style="color: #00d000;"></i>  To avoid insertion of whitespace that would make things incorrect</h4>
+
+<p>Let us assume that the input document should contain exactly two letters:
+<code class="highlighter-rouge">a年</code>. This might be represented as the following in a TDML file:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;document&gt;</span>
+  <span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span>a年<span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;/document&gt;</span>
+</code></pre>
+</div>
+
+<p>The problem is that an XML tool might reformat the XML like
+this:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;document&gt;</span>
+  <span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span>
+    a年
+  <span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;/document&gt;</span>
+</code></pre>
+</div>
+
+<p>But this is now a text documentPart containing the letters plus surrounding
+whitespace. Our test, in this case, expects data of length exactly 2
+characters, so the reformatting could cause a failure. CDATA can be used to prevent many XML
+tools from reformatting and inserting whitespace that could affect the test
+input data:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;document&gt;</span>
+  <span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[a年]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;/document&gt;</span>
+</code></pre>
+</div>
+<h4 id="-to-preserve-specific-line-endings"><i class="glyphicon glyphicon-remove" style="color: #d00000;"></i>  To preserve specific line endings</h4>
+
+<p>Using CDATA does NOT necessarily preserve line endings. Suppose you have a test
+containing this:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Text followed by a CR LF
+]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>If you edit that on a Windows machine, where CRLF is the usual text line
+ending, then the file will actually have a CRLF line ending in that text. If
+the test has, say, <code class="highlighter-rouge">dfdl:terminator="%CR;%LF;"</code>, then this should fail
+because, no matter what, XML parsers always normalize line endings to just one
+character: LF. CRLF is replaced with LF, as is an isolated CR. The net
+result: by the time a program is reading the XML data, it should only see LF
+line endings.</p>
+
+<p>It is possible to get a literal CR character into XML content, but ONLY by
+using the numeric character entity notation, i.e., &amp;#xD;. So one might try to
+write the above test as:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Text followed by a CR LF]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="ni">&amp;#xD;&amp;#xA;</span><span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>Even this, however, is not a sure thing, because re-indenting the XML might
+cause you to get:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Text followed by a CR LF]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span>
+   <span class="ni">&amp;#xD;&amp;#xA;</span>
+<span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>This would be broken because of the whitespace inserted around the
+<code class="highlighter-rouge">&amp;#xD;&amp;#xA;</code>.</p>
+
+<p>There are two good solutions to this problem. First, one can use type=”byte”
+document parts:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Text followed by a CR LF]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"byte"</span><span class="nt">&gt;</span>0D 0A<span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>This will always create exactly the bytes <code class="highlighter-rouge">0D</code> and <code class="highlighter-rouge">0A</code>, and documentParts
+are concatenated together with nothing between. However, this will break if the
+documentPart has an encoding where CR and LF are not exactly represented by the
+bytes 0D and 0A. For example, we currently support
+<code class="highlighter-rouge">encoding="us-ascii-7-bit-packed"</code>. In that encoding, CR and LF each take up
+only 7 bits, resulting in 14 bits rather than 2 full bytes.</p>
+
+<p>The best way to handle this problem is to use the documentPart
+replaceDFDLEntities attribute:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span> <span class="na">replaceDFDLEntities=</span><span class="s">"true"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Text followed by a CR LF%CR;%LF;]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>The line gets kind of long, but <code class="highlighter-rouge">%CR;</code> and <code class="highlighter-rouge">%LF;</code> are the DFDL entity
+syntax for those Unicode characters. These are translated into whatever
+encoding the documentPart specifies, so this will be robust even if the
+encoding is, say, UTF-16 or the 7-bit encoding.</p>
+
+<p>If you have a multi-line piece of data and need CRLFs in it, then this does get
+a bit clumsy, as each text line has to get its own
+documentPart:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span> <span class="na">replaceDFDLEntities=</span><span class="s">"true"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Of all the gin joints%CR;%LF;]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span> <span class="na">replaceDFDLEntities=</span><span class="s">"true"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[In all the towns in the world%CR;%LF;]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+<span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span> <span class="na">replaceDFDLEntities=</span><span class="s">"true"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[She walked into mine%CR;%LF;]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>So the general rule is that CDATA regions cannot be used to ensure that
+specific kinds of line endings will be preserved in a file.</p>
+
+<p>Some tests, however, are insensitive to the presence of whitespace. This is
+true of many tests for delimited text formats. In those cases you may want
+CDATA to preserve formatting of text (so it won’t be re-indented), and to
+preserve <em>some</em> line endings. If this same test example instead used
+<code class="highlighter-rouge">dfdl:terminator="%NL;"</code>, it would tolerate the normalization, because the NL entity matches CRLF, CR, or LF, and even
+some other obscure Unicode line ending characters. In that case, the original
+documentPart XML:</p>
+
+<div class="language-xml highlighter-rouge"><pre class="highlight"><code><span class="nt">&lt;documentPart</span> <span class="na">type=</span><span class="s">"text"</span><span class="nt">&gt;</span><span class="cp">&lt;![CDATA[Of all the gin joints
+In all the towns in the world
+She walked into mine
+]]&gt;</span><span class="nt">&lt;/documentPart&gt;</span>
+</code></pre>
+</div>
+
+<p>is fine, and will work and be robust.</p>
+
   </div>
 </div>
 

-- 
To stop receiving notification emails like this one, please contact
slawrence@apache.org.