You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by "mike-mcgann (via GitHub)" <gi...@apache.org> on 2023/03/02 17:40:17 UTC

[GitHub] [daffodil] mike-mcgann opened a new pull request, #976: Update test to use valid UTF-8

mike-mcgann opened a new pull request, #976:
URL: https://github.com/apache/daffodil/pull/976

   The test for 'test_byte_entities_6_10' is looking for a terminator with a code point of 0xab. Update the test to use the UTF-8 representation of that code point as 0xc2ab.
   
   [DAFFODIL-2102](https://issues.apache.org/jira/browse/DAFFODIL-2102)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #976: Update test to use valid UTF-8

Posted by "stevedlawrence (via GitHub)" <gi...@apache.org>.
stevedlawrence commented on code in PR #976:
URL: https://github.com/apache/daffodil/pull/976#discussion_r1123566879


##########
daffodil-test/src/test/resources/org/apache/daffodil/section06/entities/Entities.tdml:
##########
@@ -495,7 +495,7 @@ is multiple bytes in UTF-8 encoding that is used - DFDL-6-042R"
 is multiple bytes in UTF-8 encoding that is used"
     model="Entities_01-Embedded.dfdl.xsd" root="seq_10" roundTrip="false">
     <tdml:document>
-      <tdml:documentPart type="byte">30 ab 31 32 32 7f</tdml:documentPart>
+      <tdml:documentPart type="byte">30 c2 ab 31 32 32 7f</tdml:documentPart>

Review Comment:
   I'm not sure if this is the right change. The schema has this for the first element:
   ```xml
             <xs:element name="e1" type="xs:int" dfdl:terminator="%#rab;" />
   ```
   So the terminator should be the single raw byte `0xab`. According to DAFFODIL-258, raw byte entities aren't implemented.
   
   So I don't think this issue will be resolved until we fixed DAFFODIL-258.
   
   We might even want to considering creating an SDE or subsetError if raw byte entities aren't used, since they are clearly broken and they break in subtle ways.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mike-mcgann closed pull request #976: Update test to use valid UTF-8

Posted by "mike-mcgann (via GitHub)" <gi...@apache.org>.
mike-mcgann closed pull request #976: Update test to use valid UTF-8
URL: https://github.com/apache/daffodil/pull/976


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mbeckerle commented on a diff in pull request #976: Update test to use valid UTF-8

Posted by "mbeckerle (via GitHub)" <gi...@apache.org>.
mbeckerle commented on code in PR #976:
URL: https://github.com/apache/daffodil/pull/976#discussion_r1123805199


##########
daffodil-test/src/test/resources/org/apache/daffodil/section06/entities/Entities.tdml:
##########
@@ -495,7 +495,7 @@ is multiple bytes in UTF-8 encoding that is used - DFDL-6-042R"
 is multiple bytes in UTF-8 encoding that is used"
     model="Entities_01-Embedded.dfdl.xsd" root="seq_10" roundTrip="false">
     <tdml:document>
-      <tdml:documentPart type="byte">30 ab 31 32 32 7f</tdml:documentPart>
+      <tdml:documentPart type="byte">30 c2 ab 31 32 32 7f</tdml:documentPart>

Review Comment:
   Wow, I missed this entirely.
   
   You are definitely on it. Implementing raw-bytes in delimiters requires an entirely different non-text-based scanner be used. One must lower the characters of the delimiter to bytes so that the scanner is just looking for sequences of bytes. (It all becomes raw bytes)
   
   We need a way to say this issue is V.Never i.e., we have no plan/roadmap to fix this issue and implement rawbytes. 
   
   Maybe we just make sure this is in the unimpemented features page on the Daffodil site, and close the ticket with Won't fix?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mike-mcgann commented on a diff in pull request #976: Update test to use valid UTF-8

Posted by "mike-mcgann (via GitHub)" <gi...@apache.org>.
mike-mcgann commented on code in PR #976:
URL: https://github.com/apache/daffodil/pull/976#discussion_r1123622655


##########
daffodil-test/src/test/resources/org/apache/daffodil/section06/entities/Entities.tdml:
##########
@@ -495,7 +495,7 @@ is multiple bytes in UTF-8 encoding that is used - DFDL-6-042R"
 is multiple bytes in UTF-8 encoding that is used"
     model="Entities_01-Embedded.dfdl.xsd" root="seq_10" roundTrip="false">
     <tdml:document>
-      <tdml:documentPart type="byte">30 ab 31 32 32 7f</tdml:documentPart>
+      <tdml:documentPart type="byte">30 c2 ab 31 32 32 7f</tdml:documentPart>

Review Comment:
   So if that was `dfdl:terminator="«%#rab;"`, the element would have to be scanned, byte by byte, until finding `c2 ab ab` and, once found, all bytes scanned would then be decoded to UTF-8? Or would it be better to change the representation to binary at that point?
   
   I've updated DAFFODIL-2102 to put a link to a blocking ticket to DAFFODIL-258. Since DAFFODIL-258 does not have a fix version of 3.5.0, should we remove the fix version from DAFFODIL-2102 as well? I'll close this ticket as it is part of a larger amount of work not scheduled for the upcoming release. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] mike-mcgann commented on pull request #976: Update test to use valid UTF-8

Posted by "mike-mcgann (via GitHub)" <gi...@apache.org>.
mike-mcgann commented on PR #976:
URL: https://github.com/apache/daffodil/pull/976#issuecomment-1452492332

   To be done at a later time. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #976: Update test to use valid UTF-8

Posted by "stevedlawrence (via GitHub)" <gi...@apache.org>.
stevedlawrence commented on code in PR #976:
URL: https://github.com/apache/daffodil/pull/976#discussion_r1123566879


##########
daffodil-test/src/test/resources/org/apache/daffodil/section06/entities/Entities.tdml:
##########
@@ -495,7 +495,7 @@ is multiple bytes in UTF-8 encoding that is used - DFDL-6-042R"
 is multiple bytes in UTF-8 encoding that is used"
     model="Entities_01-Embedded.dfdl.xsd" root="seq_10" roundTrip="false">
     <tdml:document>
-      <tdml:documentPart type="byte">30 ab 31 32 32 7f</tdml:documentPart>
+      <tdml:documentPart type="byte">30 c2 ab 31 32 32 7f</tdml:documentPart>

Review Comment:
   I'm not sure if this is the right change. The schema has this for the first element:
   ```xml
             <xs:element name="e1" type="xs:int" dfdl:terminator="%#rab;" />
   ```
   So the terminator should be the single raw byte `0xab` and not the UTF-8 bytes `0xc2ab`. According to DAFFODIL-258, raw byte entities aren't implemented.
   
   So I don't think this issue will be resolved until we fixed DAFFODIL-258.
   
   We might even want to considering creating an SDE or subsetError if raw byte entities are used, since they are clearly broken and they break in subtle ways.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [daffodil] stevedlawrence commented on a diff in pull request #976: Update test to use valid UTF-8

Posted by "stevedlawrence (via GitHub)" <gi...@apache.org>.
stevedlawrence commented on code in PR #976:
URL: https://github.com/apache/daffodil/pull/976#discussion_r1123639796


##########
daffodil-test/src/test/resources/org/apache/daffodil/section06/entities/Entities.tdml:
##########
@@ -495,7 +495,7 @@ is multiple bytes in UTF-8 encoding that is used - DFDL-6-042R"
 is multiple bytes in UTF-8 encoding that is used"
     model="Entities_01-Embedded.dfdl.xsd" root="seq_10" roundTrip="false">
     <tdml:document>
-      <tdml:documentPart type="byte">30 ab 31 32 32 7f</tdml:documentPart>
+      <tdml:documentPart type="byte">30 c2 ab 31 32 32 7f</tdml:documentPart>

Review Comment:
   Yeah, I think former is what would need to happen. Which is very different that what happens now. Right now Daffodil decode bytes as it seems them in the input stream and then looks for an encoded terminator. Delimiter scanning has no concept of raw bytes, so if a raw byte entities doesn't map to the same byte has a non raw byte entitity, things will get funky.
   
   And my guess is we would need a significant redeisgn to how delimiter scanning works so it doesn't decode on the fly, or scanning does some sort of lookahead prior to decoding or something. Probably not trivial.
   
   Yes, I think DAFFODIL-2102 should be removed from 3.5.0. I don't think we realized it was related to raw byte entities when we assigned it to this release. We should either keep DAFFODIL-2102 open, or we could close as a duplicate of DAFFODI-258 and add a comment mentioning the test to make sure it is fixed when DAFFODIL-258 is fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@daffodil.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org