Posted to users@daffodil.apache.org by "Costello, Roger L." <co...@mitre.org> on 2018/11/20 15:58:20 UTC

I don't understand the difference between generateEscapeBlock="whenNeeded" and generateEscapeBlock="always"

Hello DFDL community,

I have an input file that uses a colon to separate (delimit) fields.

The backslash symbol is used to escape the colon.

Below is how I define the escape symbol. Everything makes sense to me except generateEscapeBlock. I get the same behavior regardless of whether I use generateEscapeBlock="whenNeeded" or generateEscapeBlock="always". I read the DFDL specification's description of generateEscapeBlock, but honestly it didn't help me understand the difference between whenNeeded and always. Would someone please explain the differences in simple, layman's terms? When do I use one versus the other? When would I see a difference in behavior?  /Roger

<dfdl:defineEscapeScheme name="Backslash">
    <dfdl:escapeScheme
        escapeKind="escapeCharacter"
        escapeCharacter="\"
        escapeEscapeCharacter="\"
        extraEscapedCharacters=""
        generateEscapeBlock="whenNeeded" />
</dfdl:defineEscapeScheme>


RE: I don't understand the difference between generateEscapeBlock="whenNeeded" and generateEscapeBlock="always"

Posted by "Costello, Roger L." <co...@mitre.org>.
Wow!

That is a fantastically clear explanation.

Thank you Steve!

/Roger

-----Original Message-----
From: Steve Lawrence <sl...@apache.org> 
Sent: Tuesday, November 20, 2018 11:30 AM
To: users@daffodil.apache.org; Costello, Roger L. <co...@mitre.org>
Subject: Re: I don't understand the difference between generateEscapeBlock="whenNeeded" and generateEscapeBlock="always"

The generateEscapeBlock property only applies when escapeKind="escapeBlock", and only on unparsing. That explains why you don't see any difference when escapeKind="escapeCharacter".

As an example, let's say we have the following:

  escapeKind="escapeBlock"
  escapeBlockStart="&quot;"
  escapeBlockEnd="&quot;"

And assume there's one in-scope delimiter that is:

  separator=","
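
Written out as a complete escape scheme definition, those properties might look like the sketch below. The scheme name QuoteBlock is made up for illustration, and the escapeEscapeCharacter and extraEscapedCharacters values are assumptions, since the example above does not specify them. The dfdl prefix is assumed to be bound to the DFDL namespace.

```xml
<!-- Sketch only. "QuoteBlock" is a hypothetical name. -->
<dfdl:defineEscapeScheme name="QuoteBlock">
    <dfdl:escapeScheme
        escapeKind="escapeBlock"
        escapeBlockStart="&quot;"
        escapeBlockEnd="&quot;"
        escapeEscapeCharacter=""
        extraEscapedCharacters=""
        generateEscapeBlock="whenNeeded" />
    <!-- escapeEscapeCharacter="" (none) and extraEscapedCharacters="" are
         assumed values, not taken from the example above. -->
</dfdl:defineEscapeScheme>
```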

This defines a format similar to CSV, where fields are separated by commas, and a field that contains a comma must start and end with quotes to signify that the comma is part of the data and not a delimiter. Note that a field can still be quoted even if its data doesn't contain a comma. So your data could look like this:

   foo,"bar",baz,"qaz,maz"

So 'foo' and 'baz' are unquoted, 'bar' is quoted unnecessarily since it does not contain a comma, and 'qaz,maz' is quoted because it must be: its data contains a comma.

In this case the data might parse to something like this:

   <field>foo</field>
   <field>bar</field>
   <field>baz</field>
   <field>qaz,maz</field>

Note that the escape block quotes have all been stripped off leaving only the data, and the last field contains a comma in the data.

Now, let's assume we want to unparse the XML. This is where generateEscapeBlock plays a role. Since the escape block characters are not in the infoset, we need a way to determine when we should create the escape block quote characters.

One option is generateEscapeBlock="always", which means every field will be unparsed with the escape block characters, regardless of whether they are needed. So the above would become this:

  "foo","bar","baz","qaz,maz"

Every field now has escape block quotes, even though they aren't all necessary.

The other option is generateEscapeBlock="whenNeeded". In this case, Daffodil inspects each field before unparsing and determines if it contains any in-scope delimiters. If it does, only then will it add the escape block quotes. With "whenNeeded", the data unparses to this:

  foo,bar,baz,"qaz,maz"

Note that only "qaz,maz" has quotes because it is the only field that contains an in-scope delimiter (the comma separator). Also note that "bar" does not have quotes even though the original data did. This is because the quotes are not necessary and the infoset does not record whether a field was originally quoted.
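
For context, here is a rough sketch of how the fields in this example might be declared so that the escape scheme applies to them. It assumes an escape scheme with the hypothetical name QuoteBlock has been defined with the escapeBlock properties from the example, that the schema has no target namespace, that the xs and dfdl prefixes are bound to the XML Schema and DFDL namespaces, and that all other required DFDL properties (lengthKind, occursCountKind, and so on) come from a default format. The element names record and field are also just illustrative.

```xml
<!-- Sketch only: element names and the QuoteBlock scheme name are hypothetical. -->
<xs:element name="record">
    <xs:complexType>
        <!-- The comma separator is the in-scope delimiter that "whenNeeded" looks for. -->
        <xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
            <!-- Each field references the escape scheme, so generateEscapeBlock
                 governs whether it is quoted on unparse. -->
            <xs:element name="field" type="xs:string" maxOccurs="unbounded"
                        dfdl:escapeSchemeRef="QuoteBlock" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
```

Switching generateEscapeBlock between "always" and "whenNeeded" in the referenced scheme is then what selects between the two unparsed outputs shown above.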

- Steve




