Posted to jira@arrow.apache.org by "Steve M. Kim (Jira)" <ji...@apache.org> on 2022/09/30 16:14:00 UTC
[jira] [Commented] (ARROW-16430) [Python] Read/Write record batch custom metadata API in pyarrow

    [ https://issues.apache.org/jira/browse/ARROW-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611673#comment-17611673 ] 

Steve M. Kim commented on ARROW-16430:
--------------------------------------

ARROW-2022 introduced a related problem with {{Schema}} messages. I am not sure whether the related problem ought to be tracked in this issue or in a separate issue, or discussed further on the mailing list.

The current [documentation|https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata] for the IPC format says
{quote}We provide a {{custom_metadata}} field at three levels to provide a mechanism for developers to pass application-specific metadata in Arrow protocol messages. This includes {{Field}}, {{Schema}}, and {{Message}}.
{quote}
Consistent with the documentation, the FlatBuffer definitions have two different {{custom_metadata}} fields that appear in an encapsulated message of type Schema:
* The {{custom_metadata}} field within the {{Schema}} table
* The {{custom_metadata}} field within the parent {{Message}} table
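
To make the two locations concrete, here is a minimal sketch in plain Python. The {{SchemaTable}} and {{MessageTable}} classes and the {{read_schema_metadata_only}} helper are hypothetical stand-ins invented for illustration; they are not the generated FlatBuffer classes or any pyarrow API. The sketch only shows that a reader that consults a single level can silently ignore metadata written at the other level.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical, simplified stand-ins for the two FlatBuffer tables.
# They model only the custom_metadata fields discussed above.
@dataclass
class SchemaTable:
    field_names: List[str]
    custom_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class MessageTable:
    header: SchemaTable  # an encapsulated message of type Schema
    custom_metadata: Dict[str, str] = field(default_factory=dict)


def read_schema_metadata_only(message: MessageTable) -> Dict[str, str]:
    """Mimic a reader that looks only at the Schema-level metadata."""
    return dict(message.header.custom_metadata)


# A writer is free to populate both levels with different values.
msg = MessageTable(
    header=SchemaTable(["x"], {"owner": "schema-level"}),
    custom_metadata={"owner": "message-level"},
)

seen = read_schema_metadata_only(msg)  # what such a reader surfaces
ignored = msg.custom_metadata          # what it never looks at
```

Under these assumptions, the two dictionaries are independent, so an API that exposes only one of them cannot round-trip a message that uses both.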

 

Currently, the pyarrow implementation recognizes only the {{custom_metadata}} field in the {{Schema}} table and is unaware of the {{custom_metadata}} field in the parent {{Message}} table. The proposed change [https://github.com/apache/arrow/pull/13041] will use the {{custom_metadata}} field in the parent {{Message}} table for {{RecordBatch}} messages, but it won't address this ambiguity for {{Schema}} messages.

I think that it is useful for {{pyarrow.Schema}} and {{pyarrow.RecordBatch}} objects to carry custom metadata, independent of their IPC message serialization. I also think that perhaps {{pyarrow.Table}} ought to carry its own custom metadata, separate from the metadata of its {{Schema}}, because a {{Table}} is like a {{RecordBatch}}. In the current implementation, attempting to instantiate a {{Table}} with both a metadata-enriched {{Schema}} and a separate metadata dict raises {{ValueError: Cannot pass both schema and metadata}}. This behavior is inconsistent with the documentation, which distinguishes the metadata of a {{Schema}} from the metadata of a {{Message}}.

> [Python] Read/Write record batch custom metadata API in pyarrow
> ---------------------------------------------------------------
>
>                 Key: ARROW-16430
>                 URL: https://issues.apache.org/jira/browse/ARROW-16430
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Yue Ni
>            Assignee: Yue Ni
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/ARROW-16131, Arrow C++ APIs were added so that users can read/write record batch custom metadata for IPC files. But pyarrow still lacks corresponding APIs for doing this.


