Posted to dev@daffodil.apache.org by "Dave Thompson (JIRA)" <ji...@apache.org> on 2018/10/01 12:22:00 UTC

[jira] [Closed] (DAFFODIL-1979) UTF8 decoder doesn't handle 3-byte and 4-byte correctly

     [ https://issues.apache.org/jira/browse/DAFFODIL-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Thompson closed DAFFODIL-1979.
-----------------------------------

Pulled the latest updates from the incubator-daffodil repository, which included the specified commit, d50e1fa098feec407a8aae07921d1f1e885e4ff5.

Verified daffodil builds and executes all sbt tests successfully, including the associated test case added in the specified commit.

Verified changes specified in the commit comment.

All nightly tests also executed successfully.

> UTF8 decoder doesn't handle 3-byte and 4-byte correctly
> -------------------------------------------------------
>
>                 Key: DAFFODIL-1979
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-1979
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End
>    Affects Versions: 2.2.0
>            Reporter: Michael Beckerle
>            Assignee: Dave Thompson
>            Priority: Major
>             Fix For: 2.2.0
>
>
> The UTF-8 decoder is classifying some valid characters as "overlong" and erroring out.
> The PNG schema in DFDLSchemas on GitHub has 1 test that runs into this bug on 3-byte Devanagari script characters.
> These are 6 Devanagari characters: e0 a4 b6 e0 a5 80 e0 a4 b0 e0 a5 8d e0 a4 b7 e0 a4 95
> Should be: शीर्षक
> But they are all coming out as substitution chars.
> In 3-byte UTF-8, at least one of the bits marked M below must be non-zero. Notice that one of them is in the second byte. This second byte wasn't being tested.
> 1110MMMM 10Mxxxxx 10xxxxxx
> In 4-byte UTF-8, at least one of the bits marked M below must be non-zero:
> 11110MMM 10MMxxxx 10xxxxxx 10xxxxxx
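
For illustration only, the bit patterns above can be sketched as a small overlong check in Python. This is not the Daffodil decoder code (which is not shown in this message); the function name is_overlong is hypothetical, and the check only looks at the lead byte and the second byte, assuming the sequence is otherwise well-formed.

```python
def is_overlong(lead: int, second: int) -> bool:
    """Hypothetical helper: True if a 3- or 4-byte UTF-8 sequence that
    starts with these two bytes is an overlong encoding."""
    if lead & 0xF0 == 0xE0:
        # 3-byte form 1110MMMM 10Mxxxxx: overlong only if all M bits are zero,
        # including the M bit in the second byte (mask 0x20).
        return (lead & 0x0F) == 0 and (second & 0x20) == 0
    if lead & 0xF8 == 0xF0:
        # 4-byte form 11110MMM 10MMxxxx: overlong only if all M bits are zero,
        # including the two M bits in the second byte (mask 0x30).
        return (lead & 0x07) == 0 and (second & 0x30) == 0
    return False


# The six Devanagari characters from the report, three bytes each.
sample = bytes.fromhex("e0 a4 b6 e0 a5 80 e0 a4 b0 e0 a5 8d e0 a4 b7 e0 a4 95")

for i in range(0, len(sample), 3):
    # Every lead byte is e0 (M bits all zero), so the second byte decides.
    print(hex(sample[i]), hex(sample[i + 1]), is_overlong(sample[i], sample[i + 1]))

print(sample.decode("utf-8"))
```

Every character in the sample has lead byte e0, whose four M bits are all zero, so a check that tests only the first byte would flag all six as overlong. Testing the M bit of the second byte (a4 or a5 here) is what distinguishes them from a truly overlong sequence such as e0 80 80.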



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)