You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/06 12:59:47 UTC

[GitHub] [arrow] niyue opened a new pull request, #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

niyue opened a new pull request, #12812:
URL: https://github.com/apache/arrow/pull/12812

   This PR aims to address https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16131
   
   Currently, when writing an IPC file with multiple record batches using the `arrow::ipc::RecordBatchWriter`, the `custom_metadata` for each record batch is not saved and will be discarded silently. 
   
   The current writer does provide the `AppendCustomMetadata` API internally, but it is never called. I use this API to add the custom metadata during writing and retrieve the custom metadata when reading the record batch. I am not completely sure if this is intentionally ignored or accidentally missing, but from a user's perspective, the metadata can be set via public API (see the example case in ARROW-16131) and I think it is not very friendly that this piece of data is silently discarded.
   
   Server test cases are added in the `read_write_test.cc` for verifying this change. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou closed pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file
URL: https://github.com/apache/arrow/pull/12812


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1101346115

   I filed ARROW-16215


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1108754397

   Thanks a lot for your contribution @niyue !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1114085287

   Benchmark runs are scheduled for baseline = 52698f35799522925d3ad84a9324db7dc6e6ba60 and contender = 942f77e5c52412694cb78cd4eca96d559475906e. 942f77e5c52412694cb78cd4eca96d559475906e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/793c728162df42de9c54ad88cf288383...fa59f19dfcd84eff86fdbaeb1989289c/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/c0ffd651348e4860a51d7fdb1983b534...21366dee2d104e1190150be3ad546b16/)
   [Failed :arrow_down:0.0% :arrow_up:0.36%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/70391f6635d84e578eebf88b6c97fe3f...61be24632bcf485f88faeee8ff27059a/)
   [Finished :arrow_down:0.88% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c271ff28d6f0406eb1618839bfa2a5ec...8e3426df7e6b4c5a9557db73b5ed98ea/)
   Buildkite builds:
   [Finished] [`942f77e5` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/629)
   [Failed] [`942f77e5` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/617)
   [Failed] [`942f77e5` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/618)
   [Finished] [`942f77e5` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/630)
   [Finished] [`52698f35` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/628)
   [Failed] [`52698f35` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/616)
   [Finished] [`52698f35` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/617)
   [Finished] [`52698f35` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/629)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r846549698


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   @westonpace 
   >  Let's call it serialize_metadata to keep the tenses consistent
   Sure. I am not a native English speaker, and feel free to let me know if there is better naming for this parameter.
   
   @westonpace @lidavidm 
   I will give it a try but I am not completely sure what kind of API compatibility is required. Could you give me some suggestions on what the overloaded function signatures are desirable?
   
   Currently in the C++ API, we have:
   ```C++
   class ARROW_EXPORT RecordBatchWriter {
    public:
     virtual ~RecordBatchWriter();
   
     /// \brief Write a record batch to the stream
     ///
     /// \param[in] batch the record batch to write to the stream
     /// \return Status
     virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   
     /// \brief Write possibly-chunked table by creating sequence of record batches
     /// \param[in] table table to write
     /// \return Status
     Status WriteTable(const Table& table);
   
     /// \brief Write Table with a particular chunksize
     /// \param[in] table table to write
     /// \param[in] max_chunksize maximum length of table chunks. To indicate
     /// that no maximum should be enforced, pass -1.
     /// \return Status
     virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   ```
   
   Do we need to keep all existing APIs unchanged or is it allowed to introduce some compatible API changes?
   Some options I can think of:
   1) option 1, keep all existing APIs unchanged and add overloaded APIs only
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool serialize_metadata); // new overload API
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool serialize_metadata); // new overload API
   ```
   
   2) option 2, add a default parameter to existing APIs
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool serialize_metadatda=false) = 0; // this will require modification to all sub classes
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool serialize_metadata=false);
   ```
   
   I assume we expect option 1, is it correct?
   
   Additionally, there may be some higher level binding libraries (Python/Ruby/etc) using these APIs, and they may have to be changed to take advantage of these new APIs, what is our strategy for managing them for such a change? Do I need to change them all in this PR (I think I can try changing Python at least but I am not sure if I am familiar with all the binding languages to make such a change) or should I separate the changes into several PRs? Thanks.
   



##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   @westonpace 
   >  Let's call it serialize_metadata to keep the tenses consistent
   
   Sure. I am not a native English speaker, and feel free to let me know if there is better naming for this parameter.
   
   @westonpace @lidavidm 
   I will give it a try but I am not completely sure what kind of API compatibility is required. Could you give me some suggestions on what the overloaded function signatures are desirable?
   
   Currently in the C++ API, we have:
   ```C++
   class ARROW_EXPORT RecordBatchWriter {
    public:
     virtual ~RecordBatchWriter();
   
     /// \brief Write a record batch to the stream
     ///
     /// \param[in] batch the record batch to write to the stream
     /// \return Status
     virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   
     /// \brief Write possibly-chunked table by creating sequence of record batches
     /// \param[in] table table to write
     /// \return Status
     Status WriteTable(const Table& table);
   
     /// \brief Write Table with a particular chunksize
     /// \param[in] table table to write
     /// \param[in] max_chunksize maximum length of table chunks. To indicate
     /// that no maximum should be enforced, pass -1.
     /// \return Status
     virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   ```
   
   Do we need to keep all existing APIs unchanged or is it allowed to introduce some compatible API changes?
   Some options I can think of:
   1) option 1, keep all existing APIs unchanged and add overloaded APIs only
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch) = 0;
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool serialize_metadata); // new overload API
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool serialize_metadata); // new overload API
   ```
   
   2) option 2, add a default parameter to existing APIs
   ```C++
   virtual Status WriteRecordBatch(const RecordBatch& batch, bool serialize_metadatda=false) = 0; // this will require modification to all sub classes
   Status WriteTable(const Table& table);
   virtual Status WriteTable(const Table& table, int64_t max_chunksize, bool serialize_metadata=false);
   ```
   
   I assume we expect option 1, is it correct?
   
   Additionally, there may be some higher level binding libraries (Python/Ruby/etc) using these APIs, and they may have to be changed to take advantage of these new APIs, what is our strategy for managing them for such a change? Do I need to change them all in this PR (I think I can try changing Python at least but I am not sure if I am familiar with all the binding languages to make such a change) or should I separate the changes into several PRs? Thanks.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1107696693

   I rebased the PR onto the latest master branch and resolved the conflict in `read_write_test.cc`. 
   
   @pitrou  @lidavidm could you please help to review if there is anything else that needs to be revised? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1099117416

   > One possible API on the write side
   
   Thanks for the advice. Let me give it a try.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r853675452


##########
cpp/src/arrow/record_batch.h:
##########
@@ -227,6 +227,11 @@ class ARROW_EXPORT RecordBatchReader {
   /// \return Status
   virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch) = 0;
 
+  virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch,
+                          std::shared_ptr<KeyValueMetadata>* custom_metadata) {
+    return Status::NotImplemented("ReadNext with custom metadata");
+  }

Review Comment:
   I change the code to return `Result<RecordBatchWithMetadata>` now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r845076985


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   Maybe we should instead add overloads that also allow explicitly passing in metadata to include with the message. Thoughts @pitrou @westonpace?



##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   ```suggestion
       const auto& metadata = batch.schema()->metadata();
   ```
   for consistency



##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   I'm not sure about this change. If the overall schema has metadata, then won't this copy the schema metadata into every batch, duplicating it over and over?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r853673629


##########
cpp/src/arrow/ipc/reader.h:
##########
@@ -190,6 +190,15 @@ class ARROW_EXPORT RecordBatchFileReader
   /// \return the read batch
   virtual Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(int i) = 0;
 
+  /// \brief Read a particular record batch along with its custom metadatda from the file.

Review Comment:
   Fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r852244788


##########
cpp/src/arrow/record_batch.h:
##########
@@ -227,6 +227,11 @@ class ARROW_EXPORT RecordBatchReader {
   /// \return Status
   virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch) = 0;
 
+  virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch,
+                          std::shared_ptr<KeyValueMetadata>* custom_metadata) {
+    return Status::NotImplemented("ReadNext with custom metadata");
+  }

Review Comment:
   Perhaps something like:
   ```c++
   struct RecordBatchWithMetadata {
     std::shared_ptr<RecordBatch> batch;
     std::shared_ptr<KeyValueMetadata> custom_metadata;
   };
   virtual Result<RecordBatchWithMetadata> ReadNext();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1096461837

   Hmm, I don't think this is the right way to do this. The IPC message's `custom_metadata` is independent from record batch (i.e. schema) metadata, one should not influence the other.
   
   (also, you could have a `custom_metadata` for other message types such as Tensor)
   
   cc @wesm for opinions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r845759831


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   A boolean parameter seems like a fine idea to me.  Let's call it `serialize_metadata` to keep the tenses consistent.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r844529211


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();
+    if (metadata) {
+      for (int64_t i = 0; i < metadata->size(); ++i) {
+        AppendCustomMetadata(metadata->key(i), metadata->value(i));
+      }
+    }

Review Comment:
   When writing a batch, its batch specific metadata will be copied into the serializer's internal metadata variable. Previously, we have the `AppendCustomMetadata` API, but it is never called, so the serializer's internal metadata variable is always empty.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r845559633


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   I think I agree.  This would make it too easy to write an inefficient file that contains a lot of duplicate metadata.  Adding an optional metadata field to the `WriteTable` method makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r851746729


##########
cpp/src/arrow/record_batch.h:
##########
@@ -227,6 +227,11 @@ class ARROW_EXPORT RecordBatchReader {
   /// \return Status
   virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch) = 0;
 
+  virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch,
+                          std::shared_ptr<KeyValueMetadata>* custom_metadata) {
+    return Status::NotImplemented("ReadNext with custom metadata");
+  }

Review Comment:
   I add this new API with a default implementation so that existing sub class of `RecordBatchReader` can still be compiled successfully.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1101259839

   @niyue Indeed, these CI failures are unrelated. I'll let @lidavidm take a look at the Flight one; the hash-join-node-test crash is at https://issues.apache.org/jira/browse/ARROW-15361


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r854127693


##########
cpp/src/arrow/ipc/reader.cc:
##########
@@ -852,15 +863,21 @@ class RecordBatchStreamReaderImpl : public RecordBatchStreamReader {
   }
 
   Status ReadNext(std::shared_ptr<RecordBatch>* batch) override {
+    ARROW_ASSIGN_OR_RAISE(auto batch_with_metadata, ReadNext());
+    *batch = batch_with_metadata.batch;

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r854095687


##########
cpp/src/arrow/ipc/reader.cc:
##########
@@ -852,15 +863,21 @@ class RecordBatchStreamReaderImpl : public RecordBatchStreamReader {
   }
 
   Status ReadNext(std::shared_ptr<RecordBatch>* batch) override {
+    ARROW_ASSIGN_OR_RAISE(auto batch_with_metadata, ReadNext());
+    *batch = batch_with_metadata.batch;

Review Comment:
   nit, but can this be `std::move`d?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r851746529


##########
cpp/src/arrow/ipc/reader.h:
##########
@@ -190,6 +190,15 @@ class ARROW_EXPORT RecordBatchFileReader
   /// \return the read batch
   virtual Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(int i) = 0;
 
+  /// \brief Read a particular record batch along with its custom metadatda from the file.
+  /// Does not copy memory if the input source supports zero-copy.
+  ///
+  /// \param[in] i the index of the record batch to return
+  /// \return a pair of the read batch and its custom metadata
+  virtual Result<
+      std::pair<std::shared_ptr<RecordBatch>, std::shared_ptr<KeyValueMetadata>>>
+  ReadRecordBatchWithCustomMetadata(int i) = 0;

Review Comment:
   This is a new API added for the `RecordBatchFileReader` class so that accessing a record batch with its metadata  by the index is possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1100864075

   @pitrou @lidavidm 
   I submitted a new commit following the suggestion above. I tried to keep the changes compatible with existing public APIs so several new APIs are added. Could you please help to review? Let me know if there is any issue. Thanks.
   
   For the CI:
   1) I found one test case `80 - arrow-compute-hash-join-node-test` failed in `C++ / AMD64 Conda C++ (pull_request)`, but I guess this is probably not caused by my PR
   2) another test case `51 - arrow-flight-test (Failed)` failed in `continuous-integration/appveyor/pr`, I am not able to find the detailed test report in the CI job, is this an issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r852245531


##########
cpp/src/arrow/ipc/reader.h:
##########
@@ -190,6 +190,15 @@ class ARROW_EXPORT RecordBatchFileReader
   /// \return the read batch
   virtual Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(int i) = 0;
 
+  /// \brief Read a particular record batch along with its custom metadatda from the file.
+  /// Does not copy memory if the input source supports zero-copy.
+  ///
+  /// \param[in] i the index of the record batch to return
+  /// \return a pair of the read batch and its custom metadata
+  virtual Result<
+      std::pair<std::shared_ptr<RecordBatch>, std::shared_ptr<KeyValueMetadata>>>
+  ReadRecordBatchWithCustomMetadata(int i) = 0;

Review Comment:
   I think user code would be more readable with:
   ```c++
   struct RecordBatchWithMetadata {
     std::shared_ptr<RecordBatch> batch;
     std::shared_ptr<KeyValueMetadata> custom_metadata;
   };
   virtual Result<RecordBatchWithMetadata> ReadRecordBatchWithMetadata(int i) = 0;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r845757121


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   I think there are two cases.
   
   1) in one case, a user simply re-uses the overall schema that happens to have some metadata, and it is less likely this user wants to duplicate the schema metadata in all batches and discard them silently is okay (and duplicating it all record batches may be undesirable and problematic):
   ```python
       schema = pa.schema(
           [
               ("values", pa.int64()),
           ],
           metadata={"foo": "bar"},
       )
       writer = pa.RecordBatchFileWriter(
           ipc_file, schema
       )
       for i in range(num_batches):
           batch = pa.record_batch(
               [int_array],
               schema=schema # <=== re-use the overall schema having metadata
           )
           writer.write_batch(batch)
   ```
   
   2) in the other case, a user provides metadata explicitly when creating a record batch, and discarding the metadata silently may not be desirable in this case:
   ```python
      schema = pa.schema(
           [
               ("values", pa.int64()),
           ],
           metadata={"foo": "bar"},
       )
       writer = pa.RecordBatchFileWriter(
           ipc_file, schema
       )
       for i in range(num_batches):
           batch = pa.record_batch(
               [int_array],
               names=["values"],
               metadata={"batch_id": str(i)}, # <=== pass a metadata explicitly
           )
           writer.write_batch(batch)
   ```
   The underlying implementation is difficult to tell these two cases.
   
   Some current API, like `pyarrow.record_batch`, already allows users to specify metadata explicitly. If we provide an overloaded function asking users to provide metadata, it seems we make the `metadata` in these APIs useless (or only useful when the record batch is in memory). The code may look like:
   ```python
   batch = pa.record_batch(
       [int_array],
       names=["values"] # <== `metadata` passed here will never be used by `write_batch` so users won't pass this parameter any more
   )
   writer.write_batch(batch, metadata={"batch_id": str(i)})
   ```
   
   Maybe a boolean flag indicating if the metadata in record batch should be serialized?
   ```python
   batch = pa.record_batch(
       [int_array],
       names=["values"],
       metadata={"batch_id": str(i)}
   )
   writer.write_batch(batch, serializing_metadata=True)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1090242688

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1090242647

   https://issues.apache.org/jira/browse/ARROW-16131


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1097344685

   > The IPC message's custom_metadata is independent from record batch (i.e. schema) metadata
   
   This is actually my initial understanding, according to this documentation (https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata), it says `This includes Field, Schema, and Message.` and it seems to indicate schema metadata is different from message metadata. 
   
   But for the documentation here (https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata), it says `Arrow supports both schema-level and field-level custom key-value metadata allowing for systems to insert their own application defined metadata to customize behavior.` and record batch (message level) metadata is not mentioned.
   
   And I only see API/tests allowing users to modify schema's metadata (probably I am missing something), so I assumed my initial understanding is incorrect and came up with such a PR change.
   
   If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized  2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this (internally we have `AppendCustomMetadata` but it is not called and there seems no higher level API allowing users to pass this information)
   
   Please advice what the correct behavior should be so that I could give it a try. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r846623192


##########
cpp/src/arrow/ipc/writer.cc:
##########
@@ -263,6 +263,14 @@ class RecordBatchSerializer {
     out_->body_length = offset - buffer_start_offset_;
     DCHECK(bit_util::IsMultipleOf8(out_->body_length));
 
+    // copy given record batch's schema metadata to the serializer for serialization
+    auto const &metadata = batch.schema()->metadata();

Review Comment:
   For the bindings: we can tackle them separately. CI will catch us if we make any breaking changes that need to be handled here. It's only Python/R/GLib (Ruby) we have to worry about (Java/Go have a few bindings I think but not at this level). 
   
   For the APIs: I would prefer adding overloads. CC @pitrou for opinions too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r853674205


##########
cpp/src/arrow/ipc/reader.cc:
##########
@@ -875,15 +894,31 @@ class RecordBatchStreamReaderImpl : public RecordBatchStreamReader {
     if (message == nullptr) {
       // End of stream
       *batch = nullptr;
+      if (custom_metadata != nullptr) {
+        *custom_metadata = nullptr;
+      }
       return Status::OK();
     }
 
     CHECK_HAS_BODY(*message);
     ARROW_ASSIGN_OR_RAISE(auto reader, Buffer::GetReader(message->body()));
     IpcReadContext context(&dictionary_memo_, options_, swap_endian_);
-    return ReadRecordBatchInternal(*message->metadata(), schema_, field_inclusion_mask_,
-                                   context, reader.get())
-        .Value(batch);
+    auto result = ReadRecordBatchInternal(*message->metadata(), schema_,
+                                          field_inclusion_mask_, context, reader.get());
+    if (result.ok()) {
+      auto batch_and_metadata = result.ValueOrDie();

Review Comment:
   After changing the API, I switch to use the `ARROW_ASSIGN_OR_RAISE` since it seems more common in the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r852110810


##########
cpp/src/arrow/ipc/reader.h:
##########
@@ -190,6 +190,15 @@ class ARROW_EXPORT RecordBatchFileReader
   /// \return the read batch
   virtual Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(int i) = 0;
 
+  /// \brief Read a particular record batch along with its custom metadatda from the file.

Review Comment:
   ```suggestion
     /// \brief Read a particular record batch along with its custom metadada from the file.
   ```



##########
cpp/src/arrow/ipc/reader.cc:
##########
@@ -875,15 +894,31 @@ class RecordBatchStreamReaderImpl : public RecordBatchStreamReader {
     if (message == nullptr) {
       // End of stream
       *batch = nullptr;
+      if (custom_metadata != nullptr) {
+        *custom_metadata = nullptr;
+      }
       return Status::OK();
     }
 
     CHECK_HAS_BODY(*message);
     ARROW_ASSIGN_OR_RAISE(auto reader, Buffer::GetReader(message->body()));
     IpcReadContext context(&dictionary_memo_, options_, swap_endian_);
-    return ReadRecordBatchInternal(*message->metadata(), schema_, field_inclusion_mask_,
-                                   context, reader.get())
-        .Value(batch);
+    auto result = ReadRecordBatchInternal(*message->metadata(), schema_,
+                                          field_inclusion_mask_, context, reader.get());
+    if (result.ok()) {
+      auto batch_and_metadata = result.ValueOrDie();

Review Comment:
   you can `std::move(result).ValueUnsafe()`



##########
cpp/src/arrow/record_batch.h:
##########
@@ -227,6 +227,11 @@ class ARROW_EXPORT RecordBatchReader {
   /// \return Status
   virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch) = 0;
 
+  virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch,
+                          std::shared_ptr<KeyValueMetadata>* custom_metadata) {
+    return Status::NotImplemented("ReadNext with custom metadata");
+  }

Review Comment:
   Hmm, should we prefer `arrow::Result` for new APIs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r844529994


##########
cpp/src/arrow/ipc/reader.cc:
##########
@@ -676,7 +676,14 @@ Result<std::shared_ptr<RecordBatch>> ReadRecordBatchInternal(
   }
   context.compression = compression;
   context.metadata_version = internal::GetMetadataVersion(message->version());
-  return LoadRecordBatch(batch, schema, inclusion_mask, context, file);
+
+  auto schema_with_custom_meta = schema;
+  if (message->custom_metadata() != nullptr) {
+    std::shared_ptr<KeyValueMetadata> md;
+    RETURN_NOT_OK(internal::GetKeyValueMetadata(message->custom_metadata(), &md));
+    schema_with_custom_meta = schema->WithMetadata(std::move(md));
+  }
+  return LoadRecordBatch(batch, schema_with_custom_meta, inclusion_mask, context, file);

Review Comment:
   When reading a batch, if the batch message contains some custom metadata, it will be used to create a new schema with the custom metadata when creating the record batch. If the batch message doesn't contain custom metadata, we still use the original schema for creating the record batch as usual.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1090938240

   I checked out all the build failures reported , including `arrow-gcsfs-test (Timeout)` and `arrow-s3-test (Failed)`, and they don't seem to be related with my change. 
   
   I can build and run the S3 test locally, and it ran into segfault locally as the CI server did for the test case `GetFileInfoRoot`:
   ```
   TEST_F(TestS3FS, GetFileInfoRoot) { AssertFileInfo(fs_.get(), "", FileType::Directory); }
   ```
   
   For the gcs test, I failed to build the google cloud cpp lib and I am not able to run it locally so far, but it seems happened in other PR as well (for example, https://github.com/apache/arrow/pull/12816)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r844528644


##########
cpp/src/arrow/ipc/test_common.cc:
##########
@@ -275,6 +275,20 @@ Status MakeIntRecordBatch(std::shared_ptr<RecordBatch>* out) {
   return MakeIntBatchSized(10, out);
 }
 
+Status MakeIntBatchWithMetadata(std::shared_ptr<KeyValueMetadata> key_value_metadata,
+                                std::shared_ptr<RecordBatch>* out) {

Review Comment:
   I add a new test utility function to generate a random int batch with the provided metadata key value pairs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1097832538

   So, basically, per-message metadata is an Arrow *IPC* feature that's independent from record batches or schemas. That's why you don't see it mentioned in the [schema metadata documentation](https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata).
   
   > If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized 2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this
   
   Indeed the message metadata wouldn't be attached to the record batch or schema, but passed separately.
   
   One possible API on the write side:
   ```c++
   class ARROW_EXPORT RecordBatchWriter {
    public:
     // (default implementation could call `WriteRecordBatch(batch)` while ignoring metadata?)
     virtual Status WriteRecordBatch(const RecordBatch& batch, const KeyValueMetadata& custom_metadata);
   ```
   
   and on the read side:
   ```c++
   class ARROW_EXPORT RecordBatchReader {
    public:
     // (default implementation could call `ReadNext(batch)` and reset `*custom_metadata` to null)
     virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch, std::shared_ptr<KeyValueMetadata>* custom_metadata);
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on a diff in pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on code in PR #12812:
URL: https://github.com/apache/arrow/pull/12812#discussion_r853674723


##########
cpp/src/arrow/ipc/reader.h:
##########
@@ -190,6 +190,15 @@ class ARROW_EXPORT RecordBatchFileReader
   /// \return the read batch
   virtual Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(int i) = 0;
 
+  /// \brief Read a particular record batch along with its custom metadatda from the file.
+  /// Does not copy memory if the input source supports zero-copy.
+  ///
+  /// \param[in] i the index of the record batch to return
+  /// \return a pair of the read batch and its custom metadata
+  virtual Result<
+      std::pair<std::shared_ptr<RecordBatch>, std::shared_ptr<KeyValueMetadata>>>
+  ReadRecordBatchWithCustomMetadata(int i) = 0;

Review Comment:
   Sure. I change the code to return the struct now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] niyue commented on pull request #12812: ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file

Posted by GitBox <gi...@apache.org>.
niyue commented on PR #12812:
URL: https://github.com/apache/arrow/pull/12812#issuecomment-1109093847

   @pitrou @lidavidm thanks so much for the detailed review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org