Posted to dev@daffodil.apache.org by "Michael Beckerle (JIRA)" <ji...@apache.org> on 2018/10/10 19:40:00 UTC

[jira] [Commented] (DAFFODIL-1386) single utf-8 4-byte character becomes surrogate character pairs in scala/java string

    [ https://issues.apache.org/jira/browse/DAFFODIL-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645457#comment-16645457 ] 

Michael Beckerle commented on DAFFODIL-1386:
--------------------------------------------

Until Java and the other JVM languages have a way to treat a Unicode code point greater than 0xFFFF as a single character in a string, there's really no way for us to fix this.
That is to say, it's beyond the call of duty for Daffodil to invent a new UString class that does this, reinvent all string libraries used, deal with the performance implications of that, etc.
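
For illustration, here is a minimal Java sketch of the underlying JVM behavior. It is not Daffodil code; the class name SurrogateDemo and the choice of U+1F600 as the sample character are assumptions made for the example. It decodes one 4-byte utf-8 character and shows that the resulting string holds two chars (a high and a low surrogate) even though it is a single code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Daffodil.
public class SurrogateDemo {
    // U+1F600 encoded as utf-8 is the four bytes F0 9F 98 80.
    static final byte[] UTF8_BYTES = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};

    // Decode the four bytes into an ordinary java.lang.String.
    static String decode() {
        return new String(UTF8_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode();
        // length() counts UTF-16 code units (chars), not characters.
        System.out.println("length         = " + s.length());
        // codePointCount() counts Unicode code points.
        System.out.println("codePointCount = " + s.codePointCount(0, s.length()));
        // The two chars are the halves of a surrogate pair.
        System.out.println("high surrogate = " + Character.isHighSurrogate(s.charAt(0)));
        System.out.println("low surrogate  = " + Character.isLowSurrogate(s.charAt(1)));
        // codePointAt() reassembles the pair into one int code point.
        System.out.printf("code point     = U+%X%n", s.codePointAt(0));
    }
}
```

The code point only fits in an int, not in a char, which is why anything built on char-at-a-time access sees two items here.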

> single utf-8 4-byte character becomes surrogate character pairs in scala/java string
> ------------------------------------------------------------------------------------
>
>                 Key: DAFFODIL-1386
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1386
>             Project: Daffodil
>          Issue Type: Wish
>          Components: Back End
>            Reporter: Michael Beckerle
>            Priority: Major
>
> Recent changes in 1.2.0 to the data input layers removed the ability to treat surrogate pairs as single characters.
> See test_encodingNoError. 
> This test has a TDML representation where a single character that has a 4-byte encoding in utf-8 has to become a surrogate pair (two UTF-16 code units) in a java/scala string, but the data input stream's char iterator returns only one code unit on a call to next(). There is no accommodation in the data input stream layers for the possibility of a single character needing two code units.
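
To make the mismatch described in the issue concrete, here is a rough Java sketch of an iterator that hands back whole code points instead of chars. It is not the Daffodil data input stream; CodePointIterator is a hypothetical name, and the handling of unpaired surrogates (passing them through unchanged) is an assumption of this sketch. Note that next() has to return an int, so it cannot be a drop-in replacement for an iterator of char.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Daffodil: iterates code points, not chars.
public class CodePointIterator implements Iterator<Integer> {
    private final CharSequence chars;
    private int pos = 0;

    public CodePointIterator(CharSequence chars) {
        this.chars = chars;
    }

    @Override
    public boolean hasNext() {
        return pos < chars.length();
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        char first = chars.charAt(pos++);
        // A high surrogate followed by a low surrogate forms one code point.
        if (Character.isHighSurrogate(first) && pos < chars.length()) {
            char second = chars.charAt(pos);
            if (Character.isLowSurrogate(second)) {
                pos++; // consume the second half of the pair
                return Character.toCodePoint(first, second);
            }
        }
        // BMP character, or an unpaired surrogate passed through as-is (assumption).
        return (int) first;
    }
}
```

Iterating over "a\uD83D\uDE00b" with this class yields three values (0x61, 0x1F600, 0x62), where a char iterator over the same string yields four.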



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)