Posted to dev@avro.apache.org by Hatem Helal <ha...@gmail.com> on 2014/10/01 12:27:03 UTC
C++ JSON encoding generates invalid UTF-8 on windows
I've stumbled across a problem using the C++ JSON encoding within a
service running on Windows. For example, encoding a multibyte UTF-8 code
point such as:
"\xEF\xBD\x81"
incorrectly becomes:
"\xEF\xBD\U0081"
when encoded by the service running in the Windows-1252 locale. This
isn't a valid UTF-8 sequence, so we end up with mojibake when we try to
read back the JSON-encoded string.
The heart of the problem appears to be that JsonGenerator::doEncodeString
relies on calling "iscntrl" to determine whether a given byte is a control
character. In the Windows-1252 code page the byte "\x81" is classified as
a control character, but in the C locale it is not. The result is
locale-dependent JSON output and, more importantly, an encoded string
that is no longer a valid UTF-8 sequence.
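As a rough illustration of the failure mode described above, here is a
hypothetical sketch (not Avro's actual code). The escaping loop,
escapeString, and both predicates are invented for this example;
isControlLikeCp1252 only models the single byte 0x81 mentioned in this
report rather than the full Windows-1252 classification.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for a locale-dependent iscntrl() under Windows-1252.
// Assumption: models only byte 0x81 from the report, plus ASCII controls.
bool isControlLikeCp1252(unsigned char c) {
    return c < 0x20 || c == 0x7F || c == 0x81;
}

// Locale-independent view: only ASCII control characters count.
bool isControlAscii(unsigned char c) {
    return c < 0x20 || c == 0x7F;
}

// Hypothetical escaping loop: every byte the predicate flags as a control
// character is replaced by the six ASCII characters \UXXXX; all other
// bytes are copied through unchanged.
template <typename Pred>
std::string escapeString(const std::string& in, Pred isControl) {
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (isControl(c)) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\U%04X", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += in[i];
        }
    }
    return out;
}
```

With the input "\xEF\xBD\x81", the first predicate produces the bytes EF BD
followed by the ASCII text \U0081, so the three-byte sequence loses its
final continuation byte. The second predicate leaves all three bytes
untouched.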
I've experimented with running the service in the C locale and found that
non-ASCII code points are then encoded correctly. A fix for this would be
to use the iscntrl function provided by the <locale> header, which takes
an explicit locale argument, like so:
http://git.io/XetN-w
This makes the determination of whether a given code point is a control
character independent of the runtime environment.
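For readers without access to the linked patch, a minimal sketch of what a
locale-independent check could look like follows. It assumes the classic
"C" locale is the intended reference; the actual patch may differ. The
names isControlClassic and countControlBytes are hypothetical.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Classifies a byte using the classic "C" locale, regardless of the
// process's global locale. Assumption: the classic locale is the
// intended reference point for the fix.
inline bool isControlClassic(char c) {
    return std::iscntrl(c, std::locale::classic());
}

// Hypothetical helper: counts the bytes that would be escaped under
// this rule.
std::size_t countControlBytes(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (isControlClassic(s[i])) {
            ++n;
        }
    }
    return n;
}
```

Under this check the bytes of "\xEF\xBD\x81" are never treated as control
characters, while ASCII controls such as tab and newline still are. Since
JSON only requires escaping U+0000 through U+001F, an explicit range check
on the unsigned byte value would be another locale-free option.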
Let me know whether this looks like a legitimate issue and whether this
fix looks appropriate.
Many Thanks,
Hatem