You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/08 08:41:57 UTC

[GitHub] [arrow] pitrou commented on a diff in pull request #13338: ARROW-16761: [C++][Python] Track bytes written in dataset

pitrou commented on code in PR #13338:
URL: https://github.com/apache/arrow/pull/13338#discussion_r892079858


##########
cpp/src/arrow/dataset/file_base.cc:
##########
@@ -265,7 +265,20 @@ Status FileWriter::Write(RecordBatchReader* batches) {
 }
 
 Future<> FileWriter::Finish() {
-  return FinishInternal().Then([this]() { return destination_->CloseAsync(); });
+  return FinishInternal()
+      .Then([this]() {
+        ARROW_ASSIGN_OR_RAISE(bytes_written_, destination_->Tell());
+        return Status::OK();
+      })
+      .Then([this]() { return destination_->CloseAsync(); });

Review Comment:
   Nit, but we can probably merge the two `Then` callbacks together?
   



##########
docs/source/python/dataset.rst:
##########
@@ -731,6 +731,7 @@ to supply a visitor that will be called as each file is created:
 
     def file_visitor(written_file):
         print(f"path={written_file.path}")
+        print(f"size={written_file.size:,} bytes")

Review Comment:
   This can be misleading because the comma is a decimal separator in some countries such as France, but perhaps that's just me :-). I would just use the default formatting. Or you can use `_` (which is recognized in Python int literals):
   ```suggestion
           print(f"size={written_file.size:_} bytes")
   ```



##########
python/pyarrow/_dataset.pyx:
##########
@@ -2677,11 +2678,21 @@ cdef class WrittenFile(_Weakrefable):
     """
     Metadata information about files written as
     part of a dataset write operation
+
+    Parameters
+    ----------
+    path : str
+        Path to the file.
+    metadata : pyarrow.parquet.FileMetaData, optional
+        For Parquet files, the Parquet file metadata.
+    size : int
+        The size of the file in bytes.
     """
 
-    def __init__(self, path, metadata):
+    def __init__(self, path, metadata, size):

Review Comment:
   This is not any code you wrote, but FTR `WrittenFile` could probably have been a `namedtuple` subclass.



##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -320,6 +320,9 @@ class ARROW_DS_EXPORT FileWriter {
   const std::shared_ptr<FileWriteOptions>& options() const { return options_; }
   const fs::FileLocator& destination() const { return destination_locator_; }
 
+  /// \brief After Finish() is called, provides number of bytes written to file.
+  Result<int64_t> GetBytesWritten();

Review Comment:
   Nit: make this method `const`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org