Posted to dev@avro.apache.org by "Kalle Niemitalo (Jira)" <ji...@apache.org> on 2022/07/08 06:53:00 UTC

[jira] [Commented] (AVRO-3572) Python encodes default value of bytes field as UTF-8

    [ https://issues.apache.org/jira/browse/AVRO-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564119#comment-17564119 ] 

Kalle Niemitalo commented on AVRO-3572:
---------------------------------------

It would be good to have a test with a record schema in which the default value of a "bytes" field contains all 256 code units from "\u0000" to "\u00ff", and the same for a "fixed" field. The test would read a record that was written using a schema that lacks those fields, so that the reader has to fill them in from the defaults, and then assert that the resulting field values are the bytes 0x00 through 0xff in order. That would show that the transformation from string to bytes handles all valid bytes correctly.
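As a rough sketch of the core assertion such a test could make, the snippet below checks the string-to-bytes mapping for all 256 code units. It deliberately does not use the Avro library: default_to_bytes is a hypothetical helper standing in for whatever conversion the reader performs, and the choice of ISO-8859-1 is an assumption about one way to satisfy the spec, not a description of the actual fix. A real test would go through schema resolution as described above.

```python
import json


def default_to_bytes(default: str) -> bytes:
    """Hypothetical helper: map code points 0-255 to byte values 0-255."""
    # ISO-8859-1 encodes U+0000..U+00FF as the single bytes 0x00..0xFF
    # and raises UnicodeEncodeError for any code point above U+00FF.
    return default.encode("iso-8859-1")


# JSON text of a default value containing every code unit as a \uXXXX escape,
# the way it could appear in a schema file.
json_default = '"' + "".join(f"\\u{i:04x}" for i in range(256)) + '"'
decoded = json.loads(json_default)

all_bytes = default_to_bytes(decoded)
assert len(decoded) == 256
assert all_bytes == bytes(range(256))

# For comparison: UTF-8 produces two bytes for each of U+0080..U+00FF,
# so the result is longer than 256 bytes and does not match.
utf8_bytes = decoded.encode("utf-8")
assert utf8_bytes != bytes(range(256))
```

The same assertion would apply to a "fixed" field of size 256.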

> Python encodes default value of bytes field as UTF-8
> ----------------------------------------------------
>
>                 Key: AVRO-3572
>                 URL: https://issues.apache.org/jira/browse/AVRO-3572
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: python
>    Affects Versions: 1.11.0
>         Environment: Python 3.9.2
>            Reporter: Kalle Niemitalo
>            Priority: Minor
>
> The Avro spec says
> bq. Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
> but in the Avro library for Python, [_read_default_value calls str.encode|https://github.com/apache/avro/blob/release-1.11.0/lang/py/avro/io.py#L958-L959] to convert the JSON string to bytes, and [str.encode in Python 3|https://docs.python.org/3/library/stdtypes.html#str.encode] uses UTF-8 by default. As a result, code points U+0080 through U+00FF are each encoded as two bytes instead of one. For example, the JSON string "\u0080" becomes two bytes b'\xc2\x80' even though it should become one byte b'\x80'.
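The difference described in the issue can be reproduced with the standard library alone. This is only an illustration of the two encodings; using "latin-1" here is an assumption about one encoding that matches the spec's mapping, not a statement of how the library was or should be changed.

```python
s = "\u0080"

# str.encode() with no argument uses UTF-8, which needs two bytes for U+0080.
utf8 = s.encode()

# ISO-8859-1 (alias "latin-1") maps code points 0-255 to byte values 0-255.
latin1 = s.encode("latin-1")

print(utf8)    # b'\xc2\x80'
print(latin1)  # b'\x80'

# Code points above U+00FF have no byte value under the spec's mapping;
# "latin-1" rejects them instead of silently producing extra bytes.
try:
    "\u0100".encode("latin-1")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```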



--
This message was sent by Atlassian Jira
(v8.20.10#820010)