Posted to issues@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2022/02/08 14:46:00 UTC

[jira] [Created] (ARROW-15613) [C++][Python] Metadata from C data interface is not valid utf8

Jorge Leitão created ARROW-15613:
------------------------------------

             Summary: [C++][Python] Metadata from C data interface is not valid utf8
                 Key: ARROW-15613
                 URL: https://issues.apache.org/jira/browse/ARROW-15613
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Jorge Leitão


While trying to round-trip an extension type through schema.metadata (see ARROW-13855 for details), I got bytes that are not valid UTF-8, which imo goes against the spec's description of the field:

> A binary string describing the type’s metadata [1]

Specifically, a field

field = pyarrow.field("aa", UuidType())

contains the following:

```
key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28
```

where the 28 bytes of the last value (the one for "ARROW:extension:metadata") are:

```
[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]
```

These bytes are not valid UTF-8: the leading byte 128 (0x80) is a continuation byte and cannot start a UTF-8 sequence (see e.g. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
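
The same check can be reproduced in Python. This is a minimal sketch that assumes only the byte values listed above; the helper name `utf8_error_offset` is made up for illustration. The last part is an observation of this sketch and not something the report states: the bytes begin with 0x80 0x03, which is how a pickle protocol 3 stream starts, so the sketch disassembles them with `pickletools` (this only reads opcodes, it does not unpickle or import anything).

```python
import pickletools

# The 28 bytes reported for the "ARROW:extension:metadata" value.
raw = bytes([
    128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117,
    105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
])


def utf8_error_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data decodes cleanly. (Hypothetical helper.)"""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print("first invalid UTF-8 offset:", utf8_error_offset(raw))

# Assumption of this sketch: the payload may be a pickle stream.
# genops() walks the opcodes without executing them.
opcode_names = [op.name for op, _arg, _pos in pickletools.genops(raw)]
print("pickle opcodes:", opcode_names)
```

If the pickle reading is right, the value is opaque binary data and not text, so a consumer would have to treat it as raw bytes.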

Maybe I am reading the values incorrectly, but I would expect valid UTF-8 (like in the IPC format).
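
For reference, here is a rough sketch of how I understand the metadata buffer is laid out in the C data interface: an int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length and the value bytes, all in native byte order. The function names `encode_metadata` and `decode_metadata` are hypothetical and not part of any Arrow API, and the buffer is built by the sketch itself, not taken from a real `ArrowSchema`.

```python
import struct
from typing import List, Tuple

Pairs = List[Tuple[bytes, bytes]]


def encode_metadata(pairs: Pairs) -> bytes:
    """Serialize key/value pairs with int32 length prefixes (native endianness)."""
    out = [struct.pack("=i", len(pairs))]
    for key, value in pairs:
        out.append(struct.pack("=i", len(key)))
        out.append(key)
        out.append(struct.pack("=i", len(value)))
        out.append(value)
    return b"".join(out)


def decode_metadata(buf: bytes) -> Pairs:
    """Parse the layout written by encode_metadata, keeping values as raw bytes."""
    (n_pairs,) = struct.unpack_from("=i", buf, 0)
    offset = 4
    pairs = []
    for _ in range(n_pairs):
        (key_len,) = struct.unpack_from("=i", buf, offset)
        offset += 4
        key = buf[offset:offset + key_len]
        offset += key_len
        (value_len,) = struct.unpack_from("=i", buf, offset)
        offset += 4
        value = buf[offset:offset + value_len]
        offset += value_len
        pairs.append((key, value))
    return pairs


# The two pairs from the dump above; the second value is the 28 raw bytes.
pairs = [
    (b"ARROW:extension:name", b"arrow.py_extension_type"),
    (b"ARROW:extension:metadata", bytes([
        128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117,
        105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
    ])),
]

decoded = decode_metadata(encode_metadata(pairs))
for key, value in decoded:
    print(len(key), key, len(value))
```

Keeping the values as `bytes` in the decoder sidesteps the question of whether they are text at all; any UTF-8 decoding would be a separate step on top.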

[1] https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata


