Posted to dev@thrift.apache.org by "Chad Walters (JIRA)" <ji...@apache.org> on 2009/04/01 06:25:52 UTC

[jira] Commented: (THRIFT-395) Python library + compiler does not support unicode strings

    [ https://issues.apache.org/jira/browse/THRIFT-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694376#action_12694376 ] 

Chad Walters commented on THRIFT-395:
-------------------------------------

Jonathan, you have stumbled on an old, ugly problem in Thrift. The 'string' type was originally the only way to pass arbitrary binary data around, but this didn't actually work properly in Java because of its requirement that Strings carry an encoding. The 'binary' subtype was introduced to fix this. There was no agreement that string should enforce UTF-8 encoding, even though this meant an inability to guarantee interoperability with Java. That was probably driven in large part by pre-existing data at Facebook (and other places?) where strings were already used for binary data in C++ (at the time, Java was somewhat of a second-class citizen for Thrift -- IIRC Facebook's emphasis was on C++, Python, PHP). Somehow I imagine that the backwards compatibility issue is not going to be taken off the table.

I may not fully understand the issues with Python so forgive me if this suggestion is naive: Can we split the difference and have some kind of configuration option to "enforce UTF-8" for Python (but make it off by default)?

The policy would then be: use non-UTF-8 encodings in strings if you wish, but realize that you will not interoperate correctly with Java and C# all the time, or with Python when "enforce UTF-8" mode is on.
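
To make the suggestion concrete, here is a rough sketch of what such an opt-in switch might look like. None of this is existing Thrift API: the names write_string, read_string and enforce_utf8 are made up for illustration, and it is written against Python 3's str/bytes split rather than the unicode/str objects discussed in this issue.

```python
# Hypothetical sketch only: these helpers are NOT part of the Thrift
# Python library. They illustrate an opt-in "enforce UTF-8" mode that
# is off by default.

def write_string(value, enforce_utf8=False):
    """Return the bytes that would go on the wire for a 'string' field."""
    if isinstance(value, str):
        # Text has to be encoded somehow; UTF-8 is an assumption here,
        # chosen because it is what the Java side expects.
        return value.encode("utf-8")
    if enforce_utf8:
        # Strict mode: reject byte strings that are not valid UTF-8.
        # Raises UnicodeDecodeError on bad input.
        value.decode("utf-8")
    # Default mode: pass arbitrary bytes through untouched, which keeps
    # backwards compatibility with binary data stored in strings.
    return bytes(value)


def read_string(raw, enforce_utf8=False):
    """Turn wire bytes back into a Python value."""
    if enforce_utf8:
        # Strict mode: hand back text, failing loudly on invalid UTF-8.
        return raw.decode("utf-8")
    # Default mode: hand back the raw bytes exactly as received.
    return raw
```

With the flag off, write_string(b"\xff\xfe") passes the bytes through unchanged; with the flag on, the same call raises UnicodeDecodeError, which is the point at which a Python peer would stop interoperating with senders that put non-UTF-8 data in strings.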


> Python library + compiler does not support unicode strings
> ----------------------------------------------------------
>
>                 Key: THRIFT-395
>                 URL: https://issues.apache.org/jira/browse/THRIFT-395
>             Project: Thrift
>          Issue Type: Bug
>          Components: Compiler (Python), Library (Python)
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>            Priority: Blocker
>             Fix For: 0.1
>
>         Attachments: 0001-python-Minor-cleanup-of-protocols-don-t-use-str.patch, 0002-THRIFT-395.-python-Phase-One-of-support-for-unicode.patch, 0003-THRIFT-395.-python-Phase-Two-of-support-for-unicode.patch, 0004-python-Remove-ridiculous-semicolons-from-gen-code.patch, python-utf8-v2.patch, python-utf8.patch
>
>
> Effectively, all strings in the python bindings are treated as binary strings -- no encoding/decoding to UTF-8 is done.  So if a unicode object is passed to a (regular, non-binary) string, an exception is raised.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.