Posted to dev@avro.apache.org by "Thiruvalluvan M. G. (JIRA)" <ji...@apache.org> on 2018/12/30 07:21:00 UTC
[jira] [Resolved] (AVRO-1190) C++ json parser fails to decode multibyte unicode code points

     [ https://issues.apache.org/jira/browse/AVRO-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thiruvalluvan M. G. resolved AVRO-1190.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.9.0

Merged the Pull Request

> C++ json parser fails to decode multibyte unicode code points
> -------------------------------------------------------------
>
>                 Key: AVRO-1190
>                 URL: https://issues.apache.org/jira/browse/AVRO-1190
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: c++
>    Affects Versions: 1.7.0
>            Reporter: Keh-Li Sheng
>            Priority: Major
>             Fix For: 1.9.0
>
>
> The parser in JsonIO.cc does not decode a multibyte Unicode character into any valid character encoding for a C++ std::string. The following snippet from JsonParser::tryString() is flawed for several reasons:
> 1. sv is a std::string used as a vector, so each element is a single char
> 2. a single Unicode hex quad encoded in JSON represents a 16-bit value, which does not fit in a char
> 3. a Unicode hex quad can represent a "high surrogate", meaning that it must be combined with the following quad (a low surrogate) to derive the full Unicode code point
> 4. \U is not a valid Unicode escape in JSON; only lowercase \u is (see http://www.ietf.org/rfc/rfc4627.txt)
> {code:title=JsonIO.cc}
>             case 'u':
>             case 'U':
>                 {
>                     unsigned int n = 0;
>                     char e[4];
>                     in_.readBytes(reinterpret_cast<uint8_t*>(e), 4);
>                     for (int i = 0; i < 4; i++) {
>                         n *= 16;
>                         char c = e[i];
>                         if (isdigit(c)) {
>                             n += c - '0';
>                         } else if (c >= 'a' && c <= 'f') {
>                             n += c - 'a' + 10;
>                         } else if (c >= 'A' && c <= 'F') {
>                             n += c - 'A' + 10;
>                         } else {
>                             throw unexpected(c);
>                         }
>                     }
>                     sv.push_back(n);
>                 }
> {code}
> This loop decodes the quad into a temporary unsigned int and then simply pushes that value (which may need 16 bits) onto the std::string, where it is truncated to a single char. In effect, the JSON parser does not correctly decode any escape whose value does not fit in a single char. For example, this JSON string:
> {noformat}
> "Dress up if you dare! Free cover all night! \uD83C\uDF83\uD83D\uDC7B"
> {noformat}
> results in the following decoded byte sequence for the last 4 escape sequences:
> {noformat}
> 3C 83 3D 7B 00
> {noformat}
> where you can see that it simply drops the high-order byte of each quad. In this particular example, \uD83C is a high surrogate, which requires additional handling. I am not sure what encoding users of the C++ library expect, but given that we are working with JSON and that Avro C++ uses char rather than wchar_t, I would assume users expect a UTF-8 encoded string. However, I could be wrong. Many decoders handle this string properly; I found this one helpful while implementing a fix: http://rishida.net/tools/conversion/
> For basics on UTF-8, see http://www.utf-8.com/
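
The decoding steps described in the report (parse the hex quad, combine a high surrogate with the low surrogate that follows, then emit UTF-8) could look roughly like the sketch below. It is not the code from the merged pull request. The names parseHexQuad, appendUtf8, and decodeUnicodeEscapes are hypothetical and not part of the Avro C++ API. The sketch assumes UTF-8 is the desired output encoding, and it handles only \u escapes: other JSON escapes (including an escaped backslash) are copied through unchanged, which a real parser would have to treat separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse exactly four hex digits starting at s[pos].
static uint32_t parseHexQuad(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) {
        throw std::runtime_error("truncated \\u escape");
    }
    uint32_t n = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        n *= 16;
        if (c >= '0' && c <= '9') {
            n += static_cast<uint32_t>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            n += static_cast<uint32_t>(c - 'a' + 10);
        } else if (c >= 'A' && c <= 'F') {
            n += static_cast<uint32_t>(c - 'A' + 10);
        } else {
            throw std::runtime_error("invalid hex digit in \\u escape");
        }
    }
    return n;
}

// Hypothetical helper: append code point cp to out as 1 to 4 UTF-8 bytes.
static void appendUtf8(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical entry point: replace every \uXXXX escape in `in` with its
// UTF-8 encoding. Assumption: all other bytes are copied unchanged.
std::string decodeUnicodeEscapes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '\\' || i + 1 >= in.size() || in[i + 1] != 'u') {
            out.push_back(in[i++]);
            continue;
        }
        uint32_t cp = parseHexQuad(in, i + 2);
        i += 6;  // skip the backslash, the 'u', and four hex digits
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: a \uXXXX low surrogate must follow.
            if (i + 1 >= in.size() || in[i] != '\\' || in[i + 1] != 'u') {
                throw std::runtime_error("high surrogate without low surrogate");
            }
            const uint32_t lo = parseHexQuad(in, i + 2);
            if (lo < 0xDC00 || lo > 0xDFFF) {
                throw std::runtime_error("invalid low surrogate");
            }
            i += 6;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

With this sketch, the escapes \uD83C\uDF83 from the example string combine into the code point U+1F383 and are emitted as the four UTF-8 bytes F0 9F 8E 83, instead of the truncated bytes 3C 83.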



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)