Posted to dev@avro.apache.org by Hatem Helal <ha...@gmail.com> on 2014/10/01 12:27:03 UTC

C++ JSON encoding generates invalid UTF-8 on Windows

I've stumbled across a problem using the C++ JSON encoding within a
service running on Windows.  For example, encoding a multibyte UTF-8 code
point such as:


"\xEF\xBD\x81"


Incorrectly becomes:


"\xEF\xBD\U0081"


This happens when the string is encoded in the service running in the
windows-1252 locale.  The result isn't a valid UTF-8 sequence, so we end up
with mojibake when we try to read back the JSON-encoded string.
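
To make the "not a valid UTF-8 sequence" claim concrete, here is a minimal
sketch of a structural UTF-8 check.  The function name looksLikeUtf8 is
hypothetical and not part of Avro.  It only verifies that each lead byte is
followed by the right number of continuation bytes, and it deliberately does
not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Avro): structural UTF-8 check only.
// It does NOT reject overlong encodings or UTF-16 surrogate code points.
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;                      // ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;                      // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;                      // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;                      // 11110xxx
        } else {
            return false;                   // stray continuation or bad lead
        }
        if (extra > s.size() - i - 1) {
            return false;                   // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;               // not a 10xxxxxx continuation
            }
        }
        i += extra + 1;
    }
    return true;
}

// Small self-check using the two strings from this message.
void selfCheckUtf8() {
    assert(looksLikeUtf8("\xEF\xBD\x81"));       // original three-byte sequence
    assert(!looksLikeUtf8("\xEF\xBD\\U0081"));   // third byte replaced by an escape
}
```

In the broken output the lead byte 0xEF still announces two continuation
bytes, but the second one has been replaced by an ASCII backslash, so the
sequence is cut short.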


The heart of the problem appears to be that JsonGenerator::doEncodeString
relies on calling "iscntrl" to determine whether a given byte is a control
character.  The result of that call depends on the current locale: in the
windows-1252 code page the byte "\x81" is classified as a control character,
but in the C locale it is not.  This leads to locale-dependent JSON output
and, more importantly, an encoded string that is no longer a valid UTF-8
sequence.
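
The locale dependence can be probed with a small sketch like the one below.
The helper names are hypothetical.  The locale name to try on Windows (for
example ".1252") is an assumption about the platform's naming convention, and
the helper returns -1 wherever that name is not available.

```cpp
#include <cassert>
#include <cctype>
#include <clocale>
#include <string>

// Classifies a byte with the C library's iscntrl, which consults the
// currently active global C locale.
bool isControlInCurrentLocale(unsigned char byte) {
    return std::iscntrl(byte) != 0;
}

// Hypothetical probe: temporarily switches LC_CTYPE to localeName,
// classifies the byte, then restores the previous locale.
// Returns 1 (control), 0 (not control), or -1 (locale unavailable).
int classifyUnderLocale(const char* localeName, unsigned char byte) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";  // copy before it changes
    if (std::setlocale(LC_CTYPE, localeName) == nullptr) {
        return -1;  // this system does not know the requested locale
    }
    const int result = isControlInCurrentLocale(byte) ? 1 : 0;
    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}

// Self-check limited to the "C" locale, which is available everywhere.
void selfCheckLocale() {
    assert(classifyUnderLocale("C", 0x81) == 0);  // not a control character
    assert(classifyUnderLocale("C", 0x1F) == 1);  // ASCII control character
}
```

Calling classifyUnderLocale(".1252", 0x81) on the affected Windows service
would be expected to return 1, matching the behaviour described above; on
other systems it may simply return -1.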


I've experimented with running the service in the C locale and found that
non-ASCII code points are encoded correctly.  A fix for this would be to
use the iscntrl function template provided by the <locale> header, which
takes the locale as an explicit argument, like so:


http://git.io/XetN-w


This makes the determination of whether a given byte is a control
character independent of the runtime environment.
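
For readers who cannot follow the link, here is a minimal sketch of the idea.
It is not the linked patch and not Avro's actual doEncodeString.  The function
name escapeControlBytes is hypothetical, the choice of std::locale::classic()
as the fixed locale is an assumption, and the sketch only handles control
bytes (it does not escape quotes or backslashes).

```cpp
#include <cassert>
#include <cstdio>
#include <locale>
#include <string>

// Hypothetical helper: escapes control bytes as \uXXXX and copies every
// other byte through unchanged. Classification uses the classic "C" locale
// explicitly, so the global locale of the process does not matter.
std::string escapeControlBytes(const std::string& utf8) {
    const std::locale& classic = std::locale::classic();
    std::string out;
    for (char c : utf8) {
        if (std::iscntrl(c, classic)) {  // <locale> overload, explicit locale
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04X",
                          static_cast<unsigned>(static_cast<unsigned char>(c)));
            out += buf;
        } else {
            out += c;  // includes UTF-8 lead and continuation bytes
        }
    }
    return out;
}

// Self-check: multibyte UTF-8 passes through, ASCII controls are escaped.
void selfCheckEscape() {
    assert(escapeControlBytes("\xEF\xBD\x81") == "\xEF\xBD\x81");
    assert(escapeControlBytes("a\x01" "b") == "a\\u0001b");
}
```

Because the locale is passed in explicitly, the same bytes produce the same
output regardless of what locale the surrounding service has set.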


Let me know whether this looks like a legitimate issue and whether this
fix looks appropriate.


Many Thanks,


Hatem