Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/06 17:24:46 UTC

[GitHub] [arrow] wjones127 commented on a diff in pull request #14219: ARROW-17825: [C++] Allow the possibility to write several tables in ORCFileWriter

wjones127 commented on code in PR #14219:
URL: https://github.com/apache/arrow/pull/14219#discussion_r989228396


##########
cpp/src/arrow/adapters/orc/adapter.cc:
##########
@@ -781,7 +794,29 @@ Result<std::unique_ptr<ORCFileWriter>> ORCFileWriter::Open(
   return std::move(result);
 }
 
-Status ORCFileWriter::Write(const Table& table) { return impl_->Write(table); }
+Status ORCFileWriter::Write(const Table& table)
+{
+  if (table.num_rows() == 0) {
+    ARROW_ASSIGN_OR_RAISE(auto empty_batch, RecordBatch::MakeEmpty(table.schema()));
+    RETURN_NOT_OK(Write(*empty_batch));
+  } else {
+    TableBatchReader reader(table);

Review Comment:
   We should make sure the reader's maximum chunk size is set at least as large as the `batch_size` provided by the user.
   
   ```suggestion
       TableBatchReader reader(table);
       reader.set_chunksize(write_options_.batch_size);
   ```



##########
cpp/src/arrow/adapters/orc/adapter.cc:
##########
@@ -743,9 +753,10 @@ class ORCFileWriter::Impl {
         internal::checked_cast<liborc::StructVectorBatch*>(batch.get());
     while (num_rows > 0) {
       for (int i = 0; i < num_cols; i++) {
+        ARROW_ASSIGN_OR_RAISE(auto chunked_array, ChunkedArray::Make({ (record_batch.column(i)) }));

Review Comment:
   I think you can just use the constructor and skip the validation:
   
   ```suggestion
           auto chunked_array = std::make_shared<ChunkedArray>(record_batch.column(i));
   ```



##########
cpp/src/arrow/adapters/orc/adapter.cc:
##########
@@ -781,7 +794,29 @@ Result<std::unique_ptr<ORCFileWriter>> ORCFileWriter::Open(
   return std::move(result);
 }
 
-Status ORCFileWriter::Write(const Table& table) { return impl_->Write(table); }
+Status ORCFileWriter::Write(const Table& table)
+{
+  if (table.num_rows() == 0) {
+    ARROW_ASSIGN_OR_RAISE(auto empty_batch, RecordBatch::MakeEmpty(table.schema()));
+    RETURN_NOT_OK(Write(*empty_batch));
+  } else {
+    TableBatchReader reader(table);

Review Comment:
   (Though see my other comment on whether this approach is the right one)



##########
cpp/src/arrow/adapters/orc/adapter.h:
##########
@@ -272,9 +272,12 @@ class ARROW_EXPORT ORCFileWriter {
       io::OutputStream* output_stream,
       const WriteOptions& write_options = WriteOptions());
 
-  /// \brief Write a table
+  /// \brief Write a table. This can be called multiple times.
   ///
-  /// \param[in] table the Arrow table from which data is extracted
+  /// Tables passed in subsequent calls must match the schema of the table that was
+  /// written first.
+  ///
+  /// \param[in] table the Arrow table from which data is extracted.
   /// \return Status
   Status Write(const Table& table);

Review Comment:
   I actually think implementing `Status Write(const RecordBatch& batch)` would be more sensible here. The function `ORCFileWriter::Impl::Write()` handles the enforcement of the `batch_size` and (assuming I'm reading correctly) will handle writing batches that straddle chunked-array boundaries. If `Status Write(const Table& table)` is instead implemented by running a `TableBatchReader` and passing each individual batch to the `Write` implementation, then the writer is forced to break up batches along the input chunk boundaries. IMO, ideally we should aim to decouple the in-memory chunk boundaries from the serialized chunk boundaries whenever possible.
   
   So unless you want to do a very deep refactoring, I think we should implement `Status Write(const RecordBatch& record_batch)` by just wrapping the batch in a Table.
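   
   A minimal sketch of that wrapping approach (hedged: this assumes `impl_` keeps its existing table-based `Write()`, and uses the public zero-copy `Table::Make` overload; the exact member names are taken from the surrounding diff):
   
   ```cpp
       // Sketch: implement the record-batch overload by wrapping the batch
       // in a Table and reusing the existing table-based writer path.
       Status ORCFileWriter::Write(const RecordBatch& record_batch) {
         // Table::Make shares the batch's column arrays (each becomes a
         // single-chunk ChunkedArray), so no data is copied here.
         auto table = Table::Make(record_batch.schema(), record_batch.columns(),
                                  record_batch.num_rows());
         return impl_->Write(*table);
       }
   ```
   
   That keeps the `batch_size` enforcement in one place (`Impl::Write`) rather than duplicating it in both overloads.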



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org