Posted to dev@daffodil.apache.org by "Adams, Joshua" <ja...@owlcyberdefense.com> on 2021/05/03 18:38:48 UTC

Escape character parsing bug?

Consider the following schema:

    <dfdl:defineEscapeScheme name="scenario3">
      <dfdl:escapeScheme escapeCharacter='/'
        escapeKind="escapeCharacter" escapeEscapeCharacter="$" extraEscapedCharacters="" generateEscapeBlock="whenNeeded" />
    </dfdl:defineEscapeScheme>

    <xs:element name="e_infix">
      <xs:complexType>
        <xs:sequence dfdl:separator="/;" dfdl:separatorPosition="infix">
          <xs:element name="x" type="xs:string" dfdl:escapeSchemeRef="tns:scenario3" />
          <xs:element name="y" type="xs:string" minOccurs="0" dfdl:escapeSchemeRef="tns:scenario3" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>

We then have the following test case:
  <parserTestCase name="scenario3_3" model="es3"
    description="Section 13 - escapeCharacter - DFDL-13-029R" root="e_infix" roundTrip="true">
    <!-- See DFDL-1556 for to make roundTrip="true" -->
    <document>foo$$/;bar</document>
    <infoset>
      <dfdlInfoset>
        <tns:e_infix>
          <x>foo$/;bar</x>
        </tns:e_infix>
      </dfdlInfoset>
    </infoset>
  </parserTestCase>

Shouldn't this parse as:
<tns:e_infix>
  <x>foo$$</x>
  <y>bar</y>
</tns:e_infix>

The spec says the following:
On parsing any in-scope terminating delimiter encountered in the data
is not interpreted as such when it is immediately preceded by the
dfdl:escapeCharacter (when not itself preceded by the
dfdl:escapeEscapeCharacter). Occurrences of the
dfdl:escapeCharacter and dfdl:escapeEscapeCharacter are removed
from the data, unless the dfdl:escapeCharacter is preceded by the
dfdl:escapeEscapeCharacter, or the dfdl:escapeEscapeCharacter
does not precede the dfdl:escapeCharacter.

It seems to me that the '/;' separator shouldn't be getting escaped in this case, but I want to double check.

Josh



Re: Escape character parsing bug?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
This sounds good to me. Less complexity, and therefore fewer tests, is a good thing.
________________________________
From: Adams, Joshua <ja...@owlcyberdefense.com>
Sent: Wednesday, May 5, 2021 5:05 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Escape character parsing bug?

So, after making the change to throw a Schema Definition Error (SDE) whenever a terminator or separator begins with the escapeCharacter or escapeEscapeCharacter, around half of our escape scenario tests fail. They were all trying to test these weird edge cases for dealing with delimiters that start with the escapeCharacter or escapeEscapeCharacter. I'm guessing that most of these tests can just be purged after a review to make sure we aren't losing coverage (other than this scenario, where we are now throwing an SDE). Just wanted to get some opinions before moving forward with this change.

Josh
________________________________
From: Adams, Joshua <ja...@owlcyberdefense.com>
Sent: Tuesday, May 4, 2021 12:44 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Escape character parsing bug?

I'll begin making the change to add an SDE for these then.  It seems that most of the escape scheme tests that weren't round tripping were cases like this.

Josh

On May 4, 2021 12:15 PM, "Beckerle, Mike" <mb...@owlcyberdefense.com> wrote:
I asked Steve Hanson of IBM, the other co-chair of the DFDL workgroup and one of the primaries on one of IBM's DFDL implementations. He said that when he tries this situation, with the escape character "/" matching the start of the separator, he gets an SDE.

It appears not to be part of the DFDL spec to call this out as an SDE, so that omission will likely become the first erratum to the DFDL v1.0 official final spec.


________________________________
From: Adams, Joshua <ja...@owlcyberdefense.com>
Sent: Monday, May 3, 2021 3:35 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Escape character parsing bug?

Thanks for running this up the chain so to speak.  I agree that an SDE would probably be best for situations like this as I wouldn't think any sort of sane data format would use a combination of separators/escape characters like this.

Josh
________________________________
From: Beckerle, Mike <mb...@owlcyberdefense.com>
Sent: Monday, May 3, 2021 3:32 PM
To: dev@daffodil.apache.org <de...@daffodil.apache.org>
Subject: Re: Escape character parsing bug?

So you have a separator the first char of which is the escape character.

Yikes. I think the DFDL spec should, ideally, make this an SDE. Feels entirely ambiguous to me.

The part of the spec you quote is quite problematic, but was updated by one word in the final DFDL Spec version.

Occurrences of the
dfdl:escapeCharacter and dfdl:escapeEscapeCharacter are removed
from the data, unless the dfdl:escapeCharacter is preceded by the
dfdl:escapeEscapeCharacter, or the dfdl:escapeEscapeCharacter
does not precede the dfdl:escapeCharacter, respectively.

So breaking that into two independent statements:

  1.  An escapeCharacter is removed unless it is preceded by the escape-escape.
  2.  An escape-escape is removed unless it does not precede the escape character.

So (1) means an escape character that is floating around, not in front of any delimiter, is removed.
(2) means an escape-escape that is floating around, not in front of any escape character, is preserved.

That still doesn't help with your specific issue. If a delimiter begins with the escapeCharacter, will that delimiter appearing in the data be interpreted as an escape character followed by the 2nd and subsequent characters of the delimiter? Or will the delimiter be recognized?

Consider dfdl:separator="/ // ///" with escapeCharacter="/" and escapeEscapeCharacter="/"

What takes priority, interpretation of escapeCharacters and escapeEscapeCharacters or recognizing delimiters?

I have posed this issue for consideration of the other DFDL workgroup members and I'll report back.




Re: Escape character parsing bug?

Posted by "Adams, Joshua" <ja...@owlcyberdefense.com>.
So, after making the change to throw a Schema Definition Error (SDE) whenever a terminator or separator begins with the escapeCharacter or escapeEscapeCharacter, around half of our escape scenario tests fail. They were all trying to test these weird edge cases for dealing with delimiters that start with the escapeCharacter or escapeEscapeCharacter. I'm guessing that most of these tests can just be purged after a review to make sure we aren't losing coverage (other than this scenario, where we are now throwing an SDE). Just wanted to get some opinions before moving forward with this change.

Josh
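
For illustration, the check described above could be sketched in Python as follows. This is not Daffodil's code: the names `SchemaDefinitionError` and `check_delimiters` are hypothetical, and the sketch assumes delimiters are plain literal strings in a whitespace-separated list (no DFDL character entities or character classes).

```python
class SchemaDefinitionError(Exception):
    """Stand-in for an SDE; hypothetical, not Daffodil's own class."""


def check_delimiters(delimiters: str, esc: str, esc_esc: str) -> None:
    """Raise if any delimiter begins with esc or esc_esc.

    `delimiters` is a whitespace-separated list, as in dfdl:separator.
    Assumption: every delimiter is a literal string.
    """
    for delim in delimiters.split():
        for name, ch in (("escapeCharacter", esc),
                         ("escapeEscapeCharacter", esc_esc)):
            # Skip unset properties: "".startswith("") is always True.
            if ch and delim.startswith(ch):
                raise SchemaDefinitionError(
                    f"delimiter {delim!r} begins with {name} {ch!r}"
                )
```

Under these assumptions, `check_delimiters("/;", "/", "$")` raises for the schema in this thread, while `check_delimiters(";", "/", "$")` returns without error.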



Re: Escape character parsing bug?

Posted by "Adams, Joshua" <ja...@owlcyberdefense.com>.
I'll begin making the change to add an SDE for these then.  It seems that most of the escape scheme tests that weren't round tripping were cases like this.

Josh




Re: Escape character parsing bug?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
I asked Steve Hanson of IBM, the other co-chair of the DFDL workgroup and one of the primaries on one of IBM's DFDL implementations. He said that when he tries this situation, with the escape character "/" matching the start of the separator, he gets an SDE.

It appears not to be part of the DFDL spec to call this out as an SDE, so that omission will likely become the first erratum to the DFDL v1.0 official final spec.





Re: Escape character parsing bug?

Posted by "Adams, Joshua" <ja...@owlcyberdefense.com>.
Thanks for running this up the chain so to speak.  I agree that an SDE would probably be best for situations like this as I wouldn't think any sort of sane data format would use a combination of separators/escape characters like this.

Josh



Re: Escape character parsing bug?

Posted by "Beckerle, Mike" <mb...@owlcyberdefense.com>.
So you have a separator the first char of which is the escape character.

Yikes. I think the DFDL spec should, ideally, make this an SDE. Feels entirely ambiguous to me.

The part of the spec you quote is quite problematic, but was updated by one word in the final DFDL Spec version.

Occurrences of the
dfdl:escapeCharacter and dfdl:escapeEscapeCharacter are removed
from the data, unless the dfdl:escapeCharacter is preceded by the
dfdl:escapeEscapeCharacter, or the dfdl:escapeEscapeCharacter
does not precede the dfdl:escapeCharacter, respectively.

So breaking that into two independent statements:

  1.  An escapeCharacter is removed unless it is preceded by the escape-escape.
  2.  An escape-escape is removed unless it does not precede the escape character.

So (1) means an escape character that is floating around, not in front of any delimiter, is removed.
(2) means an escape-escape that is floating around, not in front of any escape character, is preserved.
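
As a rough sketch of statements (1) and (2), the removal rules can be written as a single left-to-right pass in Python. The function name `remove_escapes` and the single-pass strategy are assumptions made for illustration; the sketch ignores delimiter recognition entirely and is not how Daffodil or any other implementation is known to do it.

```python
def remove_escapes(data: str, esc: str, esc_esc: str) -> str:
    """Apply rules (1) and (2) to field text in one left-to-right pass.

    Hypothetical helper: it performs no delimiter scanning at all.
    """
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if ch == esc_esc and nxt == esc:
            # Rule (2): this escape-escape precedes an escape character, so
            # it is removed. Rule (1): that escape character is kept as data.
            out.append(esc)
            i += 2
        elif ch == esc:
            # Rule (1): a bare escape character is removed; the character
            # after it (if there is one) is kept as data.
            if nxt is not None:
                out.append(nxt)
            i += 2
        else:
            # Ordinary data, including an escape-escape that does not precede
            # an escape character, which rule (2) preserves.
            out.append(ch)
            i += 1
    return "".join(out)
```

With `esc="/"` and `esc_esc="$"`, `remove_escapes("a$/b", "/", "$")` gives `a/b`, and `remove_escapes("a$b", "/", "$")` leaves `a$b` unchanged.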

That still doesn't help with your specific issue. If a delimiter begins with the escapeCharacter, will that delimiter appearing in the data be interpreted as an escape character followed by the 2nd and subsequent characters of the delimiter? Or will the delimiter be recognized?

Consider dfdl:separator="/ // ///" with escapeCharacter="/" and escapeEscapeCharacter="/"

What takes priority, interpretation of escapeCharacters and escapeEscapeCharacters or recognizing delimiters?
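
To make the priority question concrete, here is a small Python sketch of the two orderings applied to the test data from this thread. Both function names are hypothetical and neither is claimed to match what Daffodil or the IBM implementation does; the delimiter-first variant deliberately does no escape processing at all.

```python
def split_delimiter_first(data: str, sep: str) -> list:
    """Recognize the separator wherever it occurs, ignoring escapes."""
    return data.split(sep)


def split_escape_first(data: str, sep: str, esc: str, esc_esc: str) -> list:
    """Consume escape sequences first; only what is left can match sep."""
    fields, cur, i = [], [], 0
    while i < len(data):
        if data.startswith(esc_esc + esc, i):
            # Escaped escape character: literal data, cannot start a separator.
            cur.append(esc)
            i += 2
        elif data.startswith(esc, i) and i + 1 < len(data):
            # Escape character: drop it and keep the next character as data.
            cur.append(data[i + 1])
            i += 2
        elif data.startswith(sep, i):
            # Unescaped separator: close the current field.
            fields.append("".join(cur))
            cur = []
            i += len(sep)
        else:
            cur.append(data[i])
            i += 1
    fields.append("".join(cur))
    return fields
```

For `foo$$/;bar` with separator `/;`, escapeCharacter `/` and escapeEscapeCharacter `$`, the delimiter-first pass yields the two fields `foo$$` and `bar`, while the escape-first pass yields the single field `foo$/;bar`. Those are exactly the two infosets under discussion, which is why the combination is ambiguous.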

I have posed this issue for consideration of the other DFDL workgroup members and I'll report back.
