Posted to dev@thrift.apache.org by David Bennett <da...@yorkage.com> on 2016/01/01 02:09:22 UTC

RE: UTF-16

>>>while UTF-8 is great, especially on Windows platforms UTF-16 is more common, because the OS uses it heavily internally. Since Win2k it also supports surrogates and supplementary characters. So there’s OS support for it. What I don’t know is, how universally is UTF-16 (or a subset of it) supported across other platforms? Can we assume a certain degree of support on all the various platforms that Thrift can run on?

>>>TL;DR: Would it make sense to add UTF-16 as another string format type?

In my opinion, no. This is based on a mistaken understanding or expectation.

Thrift currently supports a string of bytes as a type, and users who wish to exchange character string data are expected to impose some kind of meaning on top of that. 

What Thrift needs is a genuine string data type, independent of any particular transport format, and which fully supports Unicode code points. The transport mechanism could be UTF-8, UTF-16, UTF-32 or variable-length (zigzag) integers (Unicode code points currently fit in 21 bits).
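
To make the point concrete, here is a minimal Python sketch (not Thrift code) that serializes the same sequence of code points four ways. The helper names `varint`, `encode_all` and `decode_varints` are made up for illustration. Since code points are never negative, it uses a plain unsigned varint rather than a zigzag one, which is an assumption about what was meant.

```python
def varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set means 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_all(text: str) -> dict:
    """Encode one abstract Unicode string in four candidate wire formats."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16": text.encode("utf-16-be"),
        "utf-32": text.encode("utf-32-be"),
        "varint": b"".join(varint(ord(ch)) for ch in text),
    }


def decode_varints(data: bytes) -> str:
    """Inverse of the varint encoding above: bytes back to code points."""
    chars, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            chars.append(chr(value))
            value, shift = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    sample = "A\u20ac\U0001F600"  # ASCII, BMP and supplementary-plane characters
    for name, data in encode_all(sample).items():
        print(name, len(data), data.hex())
```

For the sample string the four encodings take 8, 8, 12 and 6 bytes respectively, and every one of them decodes back to the same three code points. The largest code point, U+10FFFF, needs 21 bits and therefore three varint bytes.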

User libraries would of course be free to reformat those Unicode strings into any format comfortably supported by the platform. On Windows UTF-16 is preferred, but should never be viewed as something different from the underlying Unicode string.

Regards
David M Bennett FACS

Andl - A New Database Language - andl.org

Re: UTF-16

Posted by Randy Abernethy <ra...@gmail.com>.

Hey David,

Apache Thrift has a "string" type in its IDL. In the generated code that
type is a language-native string, but on the wire it is UTF-8 by default
when using the binary, compact or JSON protocols.
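
As a rough illustration of what "UTF-8 on the wire" means for the binary protocol, the sketch below frames a string the way that protocol does: a 4-byte big-endian length followed by the UTF-8 bytes. It is hand-rolled Python, not the Thrift library or generated code, and `write_string` / `read_string` are hypothetical names. The compact protocol differs in that it writes the length as a varint.

```python
import struct


def write_string(text: str) -> bytes:
    """Native string -> UTF-8 payload prefixed with a big-endian i32 byte count."""
    payload = text.encode("utf-8")
    return struct.pack(">i", len(payload)) + payload


def read_string(buf: bytes) -> str:
    """Read the i32 length prefix, then decode that many bytes as UTF-8."""
    (length,) = struct.unpack_from(">i", buf, 0)
    return buf[4:4 + length].decode("utf-8")


if __name__ == "__main__":
    wire = write_string("h\u00e9llo")
    print(wire.hex())
    print(read_string(wire))
```

Note that the length prefix counts encoded bytes, not characters, so the receiver never has to know how the sender's language stores strings in memory.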

I think Jens is posing the question (correct me if I'm wrong, Jens): should
we also support UTF-16 string encoding on the wire with the binary, compact
and JSON protocols?

-Randy

On Thu, Dec 31, 2015 at 5:09 PM, David Bennett <da...@yorkage.com> wrote:

> >>>while UTF-8 is great, especially on Windows platforms UTF-16 is more
> common, because the OS uses it heavily internally. Since Win2k it also
> supports surrogates and supplementary characters. So there’s OS support for
> it. What I don’t know is, how universally is UTF-16 (or a subset of it)
> supported across other platforms? Can we assume a certain degree of support
> on all the various platforms that Thrift can run on?
>
> >>>TL;DR: Would it make sense to add UTF-16 as another string format type?
>
> In my opinion, no. This is based on a mistaken understanding or
> expectation.
>
> Thrift currently supports a string of bytes as a type, and users who wish
> to exchange character string data are expected to impose some kind of
> meaning on top of that.
>
> What Thrift needs is a genuine string data type, independent of any
> particular transport format, and which fully supports Unicode code points.
> The transport mechanism could be UTF-8, UTF-16, UTF-32 or variable-length
> (zigzag) integers (Unicode code points currently fit in 21 bits).
>
> User libraries would of course be free to reformat those Unicode strings
> into any format comfortably supported by the platform. On Windows UTF-16 is
> preferred, but should never be viewed as something different from the
> underlying Unicode string.
>
> Regards
> David M Bennett FACS
>
> Andl - A New Database Language - andl.org