Posted to dev@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/12/20 18:57:58 UTC

[jira] [Commented] (ARROW-374) Python: clarify unicode vs. binary in API

    [ https://issues.apache.org/jira/browse/ARROW-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764951#comment-15764951 ] 

Wes McKinney commented on ARROW-374:
------------------------------------

The Binary type is already supported in the Arrow C++ API. I suggest that we convert arrays of PyBytes to {{arrow::BinaryArray}} instead of {{arrow::StringArray}}. For proper Unicode (Python 3 str or Python 2 unicode), we encode as UTF-8 on array construction.
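
A minimal pure-Python sketch of the conversion rule suggested above, for illustration only. It does not call pyarrow or the Arrow C++ API; the helper name {{infer_and_encode}} and the type labels "binary" and "string" are hypothetical. It is written for Python 3, where {{bytes}} plays the role of PyBytes and {{str}} the role of Python 2 {{unicode}}.

```python
def infer_and_encode(values):
    """Pick a logical type for a list of values and return its raw payloads.

    Hypothetical helper: bytes values map to a "binary" type and are stored
    untouched, while str values map to a "string" type and are encoded as
    UTF-8 at construction time. Mixed input is rejected.
    """
    if not values:
        raise ValueError("cannot infer a type from empty input")
    if all(isinstance(v, bytes) for v in values):
        # Raw bytes: no encoding is assumed, so arbitrary data is allowed.
        return "binary", list(values)
    if all(isinstance(v, str) for v in values):
        # Unicode text: encode once, so the stored bytes are valid UTF-8.
        return "string", [v.encode("utf-8") for v in values]
    raise TypeError("values must be all bytes or all str")


if __name__ == "__main__":
    print(infer_and_encode([b"\xff\x00", b"raw"]))
    print(infer_and_encode(["caf\u00e9", "arrow"]))
```

With this rule, a byte string that is not valid UTF-8 (such as {{b"\xff\x00"}}) is never labelled as a string, because only values that started out as Unicode text are given the UTF-8 string type.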

> Python: clarify unicode vs. binary in API
> -----------------------------------------
>
>                 Key: ARROW-374
>                 URL: https://issues.apache.org/jira/browse/ARROW-374
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.1.0
>            Reporter: Jochen Ott
>            Priority: Minor
>
> pyarrow supports Arrow's String type, which Arrow represents internally as BINARY with a UTF8 annotation.
> In Python 2, the pyarrow API accepts both {{unicode}} and binary strings ({{str}}), where the latter are assumed to be UTF-8 encoded. I find this approach problematic, because:
>  * there is an implicit assumption that a binary {{str}} contains valid UTF-8 data. This assumption can be wrong, however, and it's not clear what the consequences of passing such "invalid data" to the API are.
>  * the UTF-8 assumption is not clearly documented or otherwise visible from the API
>  * if pyarrow wants to support pure binary data in the future, a natural choice would be to use {{str}} as the Python 2 type. However, this would conflict with the current interpretation of binary {{str}} as BINARY+UTF8
> *Proposed solution*
> I propose changing the API so that it only accepts or returns unicode strings, i.e. Python 2's {{unicode}} and Python 3's {{str}}. Passing a Python 2 {{str}} should raise an exception, and likewise for Python 3's {{bytes}}.
> If at some point in the future raw BINARY is also supported, use Python 3's {{bytes}} and Python 2's {{str}}.
> As a convenience feature for API users, the API may also allow passing UTF-8 encoded binary data as Arrow's String, but that should be an explicit, opt-in choice, so that API users are aware of the (encoding) assumptions made.
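
The stricter behaviour proposed in the quoted issue could look roughly like the following pure-Python sketch. It is not pyarrow code: the function name {{to_utf8_strings}} and the {{allow_utf8_bytes}} flag are hypothetical names chosen for illustration, and the sketch targets Python 3 only.

```python
def to_utf8_strings(values, allow_utf8_bytes=False):
    """Return UTF-8 payloads for a string column, rejecting bytes by default.

    Hypothetical helper: str values are always accepted and encoded as
    UTF-8. bytes values are accepted only when the caller opts in, and are
    then validated so that invalid UTF-8 fails loudly.
    """
    encoded = []
    for value in values:
        if isinstance(value, str):
            encoded.append(value.encode("utf-8"))
        elif isinstance(value, bytes):
            if not allow_utf8_bytes:
                raise TypeError(
                    "bytes are not accepted as strings; "
                    "pass allow_utf8_bytes=True to opt in"
                )
            # Validation only: raises UnicodeDecodeError on invalid UTF-8.
            value.decode("utf-8")
            encoded.append(value)
        else:
            raise TypeError("expected str, got %s" % type(value).__name__)
    return encoded


if __name__ == "__main__":
    print(to_utf8_strings(["caf\u00e9"]))
    print(to_utf8_strings([b"caf\xc3\xa9"], allow_utf8_bytes=True))
```

Making the flag default to {{False}} keeps the encoding assumption visible at the call site, which is the point of the opt-in requirement in the proposal.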


