Posted to jira@arrow.apache.org by "Steve M. Kim (Jira)" <ji...@apache.org> on 2022/09/30 16:14:00 UTC
[jira] [Commented] (ARROW-16430) [Python] Read/Write record batch custom metadata API in pyarrow
[ https://issues.apache.org/jira/browse/ARROW-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611673#comment-17611673 ]
Steve M. Kim commented on ARROW-16430:
--------------------------------------
ARROW-2022 introduced a related problem with {{Schema}} messages. I am not sure whether the related problem ought to be tracked in this issue or in a separate issue, or discussed further on the mailing list.
The current [documentation|https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata] for the IPC format says
{quote}We provide a {{custom_metadata}} field at three levels to provide a mechanism for developers to pass application-specific metadata in Arrow protocol messages. This includes {{Field}}, {{Schema}}, and {{Message}}.
{quote}
Consistent with the documentation, the FlatBuffer definitions have two different {{custom_metadata}} fields that appear in an encapsulated message of type Schema:
* The {{custom_metadata}} field within the {{Schema}} table
* The {{custom_metadata}} field within the parent {{Message}} table
Currently, the pyarrow implementation recognizes only the {{custom_metadata}} field in the {{Schema}} table and is unaware of the {{custom_metadata}} field in the parent {{Message}} table. The proposed change [https://github.com/apache/arrow/pull/13041] will use the {{custom_metadata}} field in the parent {{Message}} table for {{RecordBatch}} messages, but it won't address this ambiguity with {{Schema}} messages.
I think that it is useful for {{pyarrow.Schema}} and {{pyarrow.RecordBatch}} objects to carry custom metadata, independent of their IPC message serialization. I also think that perhaps {{pyarrow.Table}} ought to carry its own custom metadata that is separate from the metadata of its {{Schema}}, because a {{Table}} is like a {{RecordBatch}}. In the current implementation, attempting to instantiate a {{Table}} with both a metadata-enriched {{Schema}} and a separate metadata dict raises {{ValueError: Cannot pass both schema and metadata}}. This behavior is inconsistent with the documentation, which distinguishes the metadata of a {{Schema}} from the metadata of a {{Message}}.
> [Python] Read/Write record batch custom metadata API in pyarrow
> ---------------------------------------------------------------
>
> Key: ARROW-16430
> URL: https://issues.apache.org/jira/browse/ARROW-16430
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Yue Ni
> Assignee: Yue Ni
> Priority: Major
> Labels: pull-request-available
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/ARROW-16131, Arrow C++ APIs were added so that users can read/write record batch custom metadata for IPC files. But pyarrow still lacks corresponding APIs for doing this.