Posted to dev@daffodil.apache.org by Steve Lawrence <sl...@apache.org> on 2020/04/30 15:15:23 UTC

Incorrect delimiter scanning when mixed encodings?

Say we have a schema like this:

  <xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

    <xs:include
      schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:format ref="GeneralFormat" lengthKind="delimited"
          encoding="ISO-8859-1" />
      </xs:appinfo>
    </xs:annotation>

     <xs:element name="root" dfdl:terminator="§">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="name" type="xs:string" />
         </xs:sequence>
       </xs:complexType>
    </xs:element>

  </xs:schema>

So we have a format that is all ISO-8859-1, and a delimited string
called "name", and the root is terminated by "§" in the ISO-8859-1
encoding. If we have data that looks like this:

  text§

It will parse to this:

  <root>
    <name>text</name>
  </root>

Now say we want just the "name" element to have a different encoding, so
we change it to this:

  <xs:element name="name" type="xs:string" dfdl:encoding="US-ASCII" />

Now the terminator defined on the root element is in a different
encoding than the delimited element. Note that the terminator § isn't
even valid in this encoding.

Currently, Daffodil does not successfully parse this. It scans the data,
decoding a single character at a time looking for a delimiter.
Eventually it gets to the § data, the decoder says it's not valid in our
ASCII encoding and converts it to the Unicode replacement character.
This of course doesn't match the delimiter we're looking for, so
scanning continues. The delimiter scanner then hits the end of data and
errors because it never finds the root terminator.
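
You can see the decoder behavior described above with a quick snippet
(plain java.nio from Scala, just for illustration; this isn't Daffodil
code):

  import java.nio.ByteBuffer
  import java.nio.charset.{CodingErrorAction, StandardCharsets}

  // "text§" in ISO-8859-1; the byte A7 (§) is not valid US-ASCII.
  val data = "text§".getBytes(StandardCharsets.ISO_8859_1)
  val dec = StandardCharsets.US_ASCII.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
  // The bad byte decodes to U+FFFD, which can never match "§":
  println(dec.decode(ByteBuffer.wrap(data)).toString) // "text\uFFFD"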

Is this the correct behavior, or is our delimiter scanning fundamentally
broken?

I wonder if the correct behavior is to encode the terminator into its
bytes as soon as it comes into scope. Delimiter scanning would then only
look for these bytes and wouldn't actually decode any data; only when
bytes that match a delimiter are found would we decode all the bytes up
to that point.
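
To make that concrete, here's a minimal sketch of the idea (Scala, with
made-up names; this is not Daffodil's actual scanner):

  import java.nio.charset.{Charset, StandardCharsets}

  object ByteDelimiterScan {
    // Lower the terminator to bytes once, in the encoding in scope on
    // the component that defines it (ISO-8859-1 for root's "§" above).
    def lower(delim: String, cs: Charset): Array[Byte] = delim.getBytes(cs)

    // Search the raw bytes for the delimiter without decoding anything.
    // Only on a match are the preceding bytes decoded, and only in the
    // encoding of the delimited element itself.
    def scan(data: Array[Byte], delim: Array[Byte], fieldCs: Charset): Option[String] = {
      val at = (0 to data.length - delim.length).find { i =>
        data.slice(i, i + delim.length).sameElements(delim)
      }
      at.map(i => new String(data.take(i), fieldCs))
    }

    def main(args: Array[String]): Unit = {
      val data = "text§".getBytes(StandardCharsets.ISO_8859_1)
      val term = lower("§", StandardCharsets.ISO_8859_1) // byte A7
      // "text" decodes fine as US-ASCII; the A7 itself is never decoded:
      println(scan(data, term, StandardCharsets.US_ASCII)) // Some(text)
    }
  }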

Is this reasonable, or is this type of thing just not allowed?


Note that this is somewhat hypothetical. I don't know of any formats
that mix encodings like this, but this popped into my head while looking
at DAFFODIL-2323, which complains about the encoding property on a
sequence; that can have a similar issue if the sequence encoding differs
from the element encodings.

Re: Incorrect delimiter scanning when mixed encodings?

Posted by "Beckerle, Mike" <mb...@tresys.com>.
I have honestly never seen these mixed-encoding cases in real formats, so I have no actual use cases.

I've seen delimited binary data, but never textual mixtures, and there's a reason for this: writing software to parse such data would be very hard. The only way such data could come up would be by composing two formats from very different origins. I have seen data sets that looked like something straight out of COBOL glued adjacent to something that looked like the log from a perl-based web application. But even that didn't have this mixed encoding/delimiter kind of thing going on. It had more than one encoding, but not anywhere delimiters were involved.

One thing we have tried to avoid in the DFDL standard is making expressible the behaviors of all sorts of combinations that never have, and likely never will, actually appear in data. Features are supposed to compose in some fashion, but where those compositions introduce complexities and aren't needed for any real use case, it's better to detect the situation and issue an SDE.

This is the conservative design approach. An SDE can always be rescinded and given meaningful semantics in the future. Once you don't issue an SDE, but instead give some odd composition of features an operational meaning, you are more or less stuck with it.

We should consider whether mixed encodings, e.g., an ASCII field that has EBCDIC delimiters due to the nesting it appears in, should simply be SDEs. I'd be in favor of such a thing.

You still have to do the lowering of delimiters to bytes due to the raw-bytes entities thing, which *does* come up in real formats unfortunately.

Re: Incorrect delimiter scanning when mixed encodings?

Posted by Steve Lawrence <sl...@apache.org>.
As Brandon points out, this might cause issues with non-byte-sized
encodings. I assume mandatory text alignment (MTA) applies to these as
well, which I think deals with such issues, and I assume MTA still
applies even with raw byte entities.

Seems like the algorithm is something like the following (a rough code
sketch follows the list):

  1) Set a mark
  2) Apply delimiter MTA
  3) Check if the delimiter bits match the current bits
  4) If no match, decode a single char in the element encoding and
     record it
  5) Repeat from 1
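
Here's that loop sketched in Scala, assuming byte-aligned (8-bit)
encodings so the MTA step degenerates to a no-op; the names are
illustrative, not Daffodil's real internals:

  import java.nio.{ByteBuffer, CharBuffer}
  import java.nio.charset.{Charset, CodingErrorAction}

  object DelimScanLoop {
    def scanDelimited(data: Array[Byte], delim: Array[Byte], fieldCs: Charset): Option[String] = {
      val dec = fieldCs.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
      val in = ByteBuffer.wrap(data)
      val out = CharBuffer.allocate(1)
      val field = new StringBuilder
      while (in.hasRemaining) {
        in.mark()                                    // 1) set a mark
        // 2) apply delimiter MTA (no-op for byte-aligned encodings)
        val hit = in.remaining >= delim.length &&    // 3) compare bits
          delim.indices.forall(i => in.get(in.position + i) == delim(i))
        if (hit) return Some(field.toString)
        in.reset()                                   // 4) no match:
        out.clear()                                  //    rewind and
        dec.decode(in, out, false)                   //    decode one
        out.flip()                                   //    field char
        if (!out.hasRemaining) return None           // truncated char
        field ++= out.toString
      }                                              // 5) repeat
      None // hit end of data without finding the delimiter
    }
  }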

Re: escape characters, I think escape characters are always in the
encoding of the element to which they are applied, since escapeSchemes
do not include a dfdl:encoding property?

That requires some extra logic in the above, maybe even attempting the
decode before checking delimiters, since an escape character effectively
disables delimiter scanning, but I don't think it's anything too crazy.

I wonder if some of the DFA complexity goes away?

Re: Incorrect delimiter scanning when mixed encodings?

Posted by "Beckerle, Mike" <mb...@tresys.com>.
The encoding for the delimiter is the encoding in effect on the schema component carrying the property. Making them take on contextual encodings makes things much too complicated.

So yeah, I think in your case, if we're scanning for that "§" but we're using a decoder for ASCII, that's incorrect.

These mixed encoding cases are all corner cases anyway, so they don't have to be natural or easy. The rules simply have to be easy to interpret.

So your root element defines a terminator.

That terminator's encoding has *nothing* to do with the encoding specified for a contained element within root. It is not that contained element's terminator, it is the root's terminator.

The semantics of delimiter scanning in DFDL in fact require lowering the delimiters to byte patterns. This is required not only for mixed scenarios like this, but also for features like byte-value entities, e.g. %#rHH;, which specify a raw hex byte that can appear even in the middle of characters, where that byte makes no sense in any encoding.

<element name="foo" type="xs:int" dfdl:terminator="11%#r88;99" dfdl:encoding="utf-16BE"/>

So the terminator of the above is the bytes 00 31 00 31 88 00 39 00 39. See how that 88 is just thrown in there? It makes no sense in ANY encoding. We're even screwing up the character alignment here.
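
As an aside, that lowering is mechanical. A hypothetical sketch in Scala
(Daffodil doesn't implement %#rHH; in delimiters, so the entity is
expanded by hand here):

  import java.nio.charset.StandardCharsets

  // dfdl:terminator="11%#r88;99" with dfdl:encoding="utf-16BE":
  // text runs go through the encoder; %#r88; is a literal raw byte.
  val termBytes: Array[Byte] =
    "11".getBytes(StandardCharsets.UTF_16BE) ++  // 00 31 00 31
    Array(0x88.toByte) ++                        // raw byte 88
    "99".getBytes(StandardCharsets.UTF_16BE)     // 00 39 00 39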

That means scanning for delimiters in DFDL requires us to lower the scanning to bytes.

Of course, Daffodil doesn't implement %#rHH; byte-value (aka raw-bytes) entities except for one special case: specifying the fill byte property. And our scanning is currently character-oriented.

So, what does it mean to scan for, say, a UTF-16 character '1' as the terminator of an element that is in, say, ASCII?

It means you are searching through the bytes ignoring ASCII, decoding them as UTF-16, looking for '1' (which is the bytes 00 31). Then, having found a 00 31, the preceding bytes are decoded as ASCII.

Pretty sure Daffodil scanning isn't doing that.

Re: Incorrect delimiter scanning when mixed encodings?

Posted by "Sloane, Brandon" <bs...@tresys.com>.
Without looking at the spec, I would expect that delimiters are defined by the encoding of the element that defines the delimiter, so Daffodil is buggy in the case you describe. However, there are a couple of complications we have to consider:

1) What if instead of a terminator we had a separator, and the separator is a valid character in both encodings but has different byte representations?

 <xs:element name="root" >
       <xs:complexType>
         <xs:sequence dfdl:separator=",">
           <xs:element name="name" type="xs:string" maxOccurs="2" encoding="FOO"/>
           <xs:element name="address" type="xs:string" maxOccurs="2" encoding="BAR"/>
         </xs:sequence>
       </xs:complexType>
    </xs:element>

In this case, I would expect the separator to be interpreted based on the encoding of the individual elements, which is obviously not consistent with my expectation from your example. There is also the case of the separator occurring between the two element types, so even here my naive expectation is not self-consistent.
The correct answer here is probably to say that this example schema is wrong, and that there should be 2 sequences, each defining their own separator (sketched just below).
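
Something like the following, keeping the placeholder encodings from
above (a sketch, not a tested schema):

   <xs:element name="root">
     <xs:complexType>
       <xs:sequence>
         <xs:sequence dfdl:separator="," dfdl:encoding="FOO">
           <xs:element name="name" type="xs:string" maxOccurs="2" dfdl:encoding="FOO"/>
         </xs:sequence>
         <xs:sequence dfdl:separator="," dfdl:encoding="BAR">
           <xs:element name="address" type="xs:string" maxOccurs="2" dfdl:encoding="BAR"/>
         </xs:sequence>
       </xs:sequence>
     </xs:complexType>
   </xs:element>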

2) What if the encodings have different alignments? For instance, if the outer encoding that defines the delimiter is 8-bit and byte-aligned, with a 7-bit inner encoding, should we look forward to the next byte boundary after every 7-bit character?

3) How does this interact with escape sequences?

The solution here might be to think through some restrictions on where encoding changes are allowed to occur. I am not sure it is possible to give reasonable semantics for everything over a region that spans multiple encodings.