Posted to dev@avro.apache.org by "Kalle Niemitalo (Jira)" <ji...@apache.org> on 2022/07/08 06:33:00 UTC

[jira] [Created] (AVRO-3572) Python encodes default value of bytes field as UTF-8

Kalle Niemitalo created AVRO-3572:
-------------------------------------

             Summary: Python encodes default value of bytes field as UTF-8
                 Key: AVRO-3572
                 URL: https://issues.apache.org/jira/browse/AVRO-3572
             Project: Apache Avro
          Issue Type: Bug
          Components: python
    Affects Versions: 1.11.0
         Environment: Python 3.9.2
            Reporter: Kalle Niemitalo


The Avro spec says

bq. Default values for bytes and fixed fields are JSON strings, where Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255.

but in the Avro library for Python, [_read_default_value calls str.encode|https://github.com/apache/avro/blob/release-1.11.0/lang/py/avro/io.py#L958-L959] to convert the JSON string to bytes, and [str.encode in Python 3|https://docs.python.org/3/library/stdtypes.html#str.encode] uses UTF-8 by default. As a result, code points U+0080 through U+00FF are encoded incorrectly: each becomes two bytes instead of one. For example, the JSON string "\u0080" becomes the two bytes b'\xc2\x80' even though it should become the single byte b'\x80'.
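
The difference can be reproduced without Avro. The following is a minimal sketch, not the library's code: both helper names are hypothetical, and using the ISO-8859-1 codec is only one way to get the one-to-one mapping the spec describes, not necessarily the fix the project would choose.

```python
import json


def default_to_bytes_utf8(default: str) -> bytes:
    """Hypothetical helper: str.encode() with no argument, which uses UTF-8."""
    return default.encode()


def default_to_bytes_latin1(default: str) -> bytes:
    """Hypothetical helper: map code points 0-255 one-to-one onto bytes 0-255.

    ISO-8859-1 (Latin-1) is assumed here because it has exactly that mapping;
    code points above 255 raise UnicodeEncodeError.
    """
    return default.encode("iso-8859-1")


# The default value as it would appear in a schema's JSON: "\u0080"
default = json.loads('"\\u0080"')

utf8_result = default_to_bytes_utf8(default)
latin1_result = default_to_bytes_latin1(default)

print(utf8_result)    # b'\xc2\x80' (two bytes)
print(latin1_result)  # b'\x80' (one byte, as the spec describes)
```

With ISO-8859-1, a default containing a code point above 255 fails loudly instead of silently producing extra bytes.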


