Posted to users@daffodil.apache.org by Roger L Costello <co...@mitre.org> on 2021/05/19 17:22:25 UTC

Can you give me an example that motivates the need for extraEscapedCharacters please?

Hi Folks,

As I understand it, extraEscapedCharacters identifies characters that are to be escaped during unparsing. They are characters to be escaped above and beyond the characters identified by the escapeCharacter property.

Below is an example I created to illustrate the use of extraEscapedCharacters. Is it correct? The example is hokey; do you have a more compelling one?

Example: Suppose a data format contains a sequence of data items separated by forward slash. If a data item contains a separator, the separator is escaped by a backslash. An instance contains these three data items: "Yellow", "Lemon and/or Banana", and "6". The forward slash in the second data item needs escaping. Here is the instance:

Yellow/Lemon and\/or Banana/6

Parsing the instance produces this XML:

<FruitBasket>
    <Color>Yellow</Color>
    <Fruits>Lemon and/or Banana</Fruits>
    <Quantity>6</Quantity>
</FruitBasket>
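
To make the parsing step concrete, here is a rough Python sketch (not Daffodil code; parse_fields is a name invented for illustration). It assumes a single-character separator and escape character, and that the parser drops each escape character and keeps the character after it as literal data:

```python
def parse_fields(data, separator="/", escape_char="\\"):
    """Split data on unescaped separators and remove the escape characters."""
    fields = []
    current = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == escape_char and i + 1 < len(data):
            # Escaped character: keep the next character as plain data.
            current.append(data[i + 1])
            i += 2
        elif ch == separator:
            # Unescaped separator: the current field ends here.
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


print(parse_fields("Yellow/Lemon and\\/or Banana/6"))
```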

If extraEscapedCharacters="" (no additional characters to be escaped during unparsing), then unparsing produces:

Yellow/Lemon and\/or Banana/6

If extraEscapedCharacters="a e", then the unparser will also escape all a's and e's, to produce:

Y\ellow/L\emon \and\/or B\an\an\a/6
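
Here is how I picture the unparser arriving at those two outputs, as a rough Python sketch (not Daffodil code; escape_field and unparse are made-up names). It assumes escapeKind="escapeCharacter", single-character delimiters, and it ignores escapeEscapeCharacter:

```python
def escape_field(value, delimiters, extra_escaped, escape_char="\\"):
    """Put escape_char in front of every delimiter and every extra escaped character."""
    must_escape = set(delimiters) | set(extra_escaped)
    return "".join(escape_char + ch if ch in must_escape else ch for ch in value)


def unparse(fields, separator="/", extra_escaped=""):
    # extraEscapedCharacters is a whitespace-separated list, e.g. "a e".
    extras = extra_escaped.split()
    return separator.join(escape_field(f, [separator], extras) for f in fields)


basket = ["Yellow", "Lemon and/or Banana", "6"]
print(unparse(basket))                       # extraEscapedCharacters=""
print(unparse(basket, extra_escaped="a e"))  # extraEscapedCharacters="a e"
```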

Is that correct?

/Roger


Re: Can you give me an example that motivates the need for extraEscapedCharacters please?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
The case you highlight where extraEscapedCharacters="a e" looks like a correct interpretation of the DFDL spec, but this example doesn't motivate the feature, so we need to get you a better example.

I did find that IBM DFDL comes with an example that has extraEscapedCharacters="%#x0D; %#x0A;", with escapeKind="escapeBlock".

So here's example data illustrating why you want those extraEscapedCharacters:

emp(name=Joe Smith|addr="862 West North Place
Dept 23 M/S 77
Unit 3",Madison,WI, 99999)
emp(name=Joe Jones|addr=8 Lost Lake Rd, Glendon,MT,88888)

Note that the street element contains line endings in the first instance. The format requires that such multi-line entries have quotation marks around them, even if the data doesn't contain any of the delimiters (such as | or ,).

The second emp instance has no quotation marks around the street part.

To model this, you have a lengthKind 'delimited' format with this escape scheme:

<dfdl:escapeScheme escapeKind="escapeBlock" escapeBlockStart='"' escapeBlockEnd='"' escapeCharacter='"'
escapeEscapeCharacter='"' extraEscapedCharacters='%#x0D; %#x0A;' generateEscapeBlock="whenNeeded"/>

The extraEscapedCharacters ensure that the multi-line data is placed inside the quotes, even though it does NOT contain a comma, a "|", or a ")%CR;%LF;" that would conflict with the separators/terminators of the format. The DFDL schema is roughly like this:

<element name="employee" dfdl:initiator="emp(" dfdl:terminator=")%CR;%LF;">
   <complexType>
     <sequence dfdl:separator="|">
       <element name="name" dfdl:initiator="name=" type="xs:string"/>
       <element name="address" dfdl:initiator="addr=">
         <complexType>
           <sequence dfdl:separator=",">
             <element name="street" type="xs:string"/>
             <element name="city" type="xs:string"/>
             <element name="state" type="xs:string"/>
             <element name="postalCode" type="xs:string"/>
           </sequence>
         </complexType>
       </element>
     </sequence>
   </complexType>
</element>
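
A rough Python sketch of the generateEscapeBlock="whenNeeded" decision on unparsing may help show what extraEscapedCharacters changes. This is not Daffodil code: escape_block, DELIMITERS, and EXTRA are names made up for illustration, and the sketch assumes the block is generated whenever the value contains a delimiter or an extra escaped character, with embedded quotes doubled via escapeEscapeCharacter:

```python
DELIMITERS = ["|", ",", ")\r\n"]   # separators and terminator of the emp format
EXTRA = ["\r", "\n"]               # extraEscapedCharacters='%#x0D; %#x0A;'


def escape_block(value, delimiters, extra_escaped,
                 block_start='"', block_end='"', escape_escape='"'):
    """Wrap value in an escape block only when something in it needs protecting."""
    needs_block = any(d in value for d in delimiters) or \
                  any(c in value for c in extra_escaped)
    if not needs_block:
        return value
    # Protect an embedded block-end character with escapeEscapeCharacter.
    body = value.replace(block_end, escape_escape + block_end)
    return block_start + body + block_end


street = "862 West North Place\nDept 23 M/S 77\nUnit 3"
print(escape_block(street, DELIMITERS, EXTRA))            # quoted, contains line endings
print(escape_block("8 Lost Lake Rd", DELIMITERS, EXTRA))  # left as-is
print(escape_block(street, DELIMITERS, []))               # no extras: left unquoted
```

With an empty extraEscapedCharacters list, the multi-line street is written without quotes because nothing in it collides with a delimiter, which is exactly the case the property is there to cover.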

A similar example using escapeKind 'escapeCharacter' with escapeCharacter="\" would be this data:

emp(name=Joe Smith|addr=862 West North Place\
Dept 23 M/S 77\
Unit 3,Madison,WI, 99999)

That's an example format where the line-endings must be escaped, even though line endings are not delimiters in this format.

If that data contained CRLF line endings, it would actually appear with two escape characters, one for the CR, one for the LF:

emp(name=Joe Smith|addr=862 West North Place\←\
Dept 23 M/S 77\←\
Unit 3,Madison,WI, 99999)

(where ← represents the CR character; if you see a box character between the backslashes when reading this, it's because the Unicode left arrow U+2190 isn't rendering in your font)
