Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 17:41:48 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1719: Rustify parquet writer (#1717) (#1163)

tustvold opened a new pull request, #1719:
URL: https://github.com/apache/arrow-rs/pull/1719

   # Which issue does this PR close?
   
   Closes #1717
   
   Part of #1163 
   
   # Rationale for this change
    
   See the linked tickets, but in short the current write path makes use of a number of custom IO abstractions that can be hard to use correctly. In particular, the use of `TryClone` can easily lead to races if it is used to share a file descriptor across threads.
   
   # What changes are included in this PR?
   
   This reworks the write path to use `std::io::Write` and nothing else. Unfortunately, achieving this requires a few changes:
   
   * A `TrackedWrite` that keeps track of how many bytes have been written, allowing removal of `Seek` (see the sketch after this list)
   * A callback-based approach to pass metadata from a child writer back to its parent
   * Lifetimes, lots of lifetimes...
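   
   A minimal sketch of the byte-counting wrapper idea (illustrative only; the actual `TrackedWrite` in this PR may be structured differently):
   
   ```rust
   use std::io::{self, Write};
   
   /// Wraps any `Write` and records how many bytes have been written, so the
   /// current position is always known without requiring `Seek`.
   struct TrackedWrite<W: Write> {
       inner: W,
       bytes_written: usize,
   }
   
   impl<W: Write> TrackedWrite<W> {
       fn new(inner: W) -> Self {
           Self { inner, bytes_written: 0 }
       }
   
       /// Total number of bytes successfully written so far.
       fn bytes_written(&self) -> usize {
           self.bytes_written
       }
   }
   
   impl<W: Write> Write for TrackedWrite<W> {
       fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
           let written = self.inner.write(buf)?;
           self.bytes_written += written;
           Ok(written)
       }
   
       fn flush(&mut self) -> io::Result<()> {
           self.inner.flush()
       }
   }
   ```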
   
   The lifetimes in particular become a bit obnoxious where they interact with the `RowGroupWriter` trait. In order for the trait to be object-safe, i.e. for `Box<dyn RowGroupWriter>` to be constructible, `RowGroupWriter::close` cannot take `self` by value. This results in explicit scopes, or calls to `std::mem::drop`, in order to truncate the borrow, as sketched below.
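   
   For illustration, a rough sketch of the resulting calling pattern (method names follow the diffs quoted in the review below; exact signatures may differ):
   
   ```rust
   let mut writer = SerializedFileWriter::new(file, schema, props)?;
   {
       // `next_row_group` mutably borrows `writer`, so the row group writer's
       // lifetime must end before `writer` can be used again.
       let mut row_group = writer.next_row_group()?;
       while let Some(mut column) = row_group.next_column()? {
           // ... write values to the column writer ...
           column.close()?;
       }
       row_group.close()?;
   } // explicit scope (or `std::mem::drop(row_group)`) ends the borrow here
   writer.close()?;
   ```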
   
   I'm not entirely sure what the purpose of these traits is; perhaps we could simplify things, and fix this slight lifetime annoyance, by removing them? Maybe something for a follow-up PR?
   
   # Are there any user-facing changes?
   
   Yes, this makes non-trivial changes to the parquet write API.
   




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879354264


##########
parquet/src/file/writer.rs:
##########
@@ -92,102 +111,90 @@ pub trait FileWriter {
 /// All columns should be written sequentially; the main workflow is:
 /// - Request the next column using `next_column` method - this will return `None` if no
 /// more columns are available to write.
-/// - Once done writing a column, close column writer with `close_column` method - this
-/// will finalise column chunk metadata and update row group metrics.
+/// - Once done writing a column, close column writer with `close`

Review Comment:
   This method no longer exists; closing is instead handled by `close` on the `ColumnWriter` itself.





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Rustify parquet writer (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r878743074


##########
parquet/src/file/writer.rs:
##########
@@ -92,102 +111,90 @@ pub trait FileWriter {
 /// All columns should be written sequentially; the main workflow is:
 /// - Request the next column using `next_column` method - this will return `None` if no
 /// more columns are available to write.
-/// - Once done writing a column, close column writer with `close_column` method - this
-/// will finalise column chunk metadata and update row group metrics.
+/// - Once done writing a column, close column writer with `close`
 /// - Once all columns have been written, close row group writer with `close` method -
 /// it will return row group metadata and is no-op on already closed row group.
 pub trait RowGroupWriter {
     /// Returns the next column writer, if available; otherwise returns `None`.
     /// In case of any IO error or Thrift error, or if row group writer has already been
     /// closed returns `Err`.
-    ///
-    /// To request the next column writer, the previous one must be finalised and closed
-    /// using `close_column`.
-    fn next_column(&mut self) -> Result<Option<ColumnWriter>>;
-
-    /// Closes column writer that was created using `next_column` method.
-    /// This should be called before requesting the next column writer.
-    fn close_column(&mut self, column_writer: ColumnWriter) -> Result<()>;
+    fn next_column(&mut self) -> Result<Option<SerializedColumnWriter<'_>>>;
 
     /// Closes this row group writer and returns row group metadata.
     /// After calling this method row group writer must not be used.
     ///
-    /// It is recommended to call this method before requesting another row group, but it
-    /// will be closed automatically before returning a new row group.
-    ///
     /// Can be called multiple times. In subsequent calls will result in no-op and return
     /// already created row group metadata.
     fn close(&mut self) -> Result<RowGroupMetaDataPtr>;
 }
 
+/// Callback invoked on closing a column chunk, arguments are:

Review Comment:
   We have to move to a callback model because the child writers now mutably borrow from their parents, so it would be impossible to keep the current interface. I also happen to think this is cleaner: you simply close the `ColumnWriter` when you are finished, rather than passing it back to the `RowGroupWriter`.
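   
   Roughly, the shape of the callback pattern is as follows (simplified illustration, not the exact types in this PR):
   
   ```rust
   // The parent hands the child writer a one-shot callback that receives the
   // finished metadata when the child is closed.
   type OnClose<'a, M> = Box<dyn FnOnce(M) -> Result<(), String> + 'a>;
   
   struct ChildWriter<'a, M> {
       metadata: Option<M>,
       on_close: Option<OnClose<'a, M>>,
   }
   
   impl<'a, M> ChildWriter<'a, M> {
       fn close(&mut self) -> Result<(), String> {
           if let (Some(metadata), Some(on_close)) =
               (self.metadata.take(), self.on_close.take())
           {
               // Hand the metadata back to the parent, instead of handing the
               // whole writer back as the old `close_column` API required.
               on_close(metadata)?;
           }
           Ok(())
       }
   }
   ```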





[GitHub] [arrow-rs] alamb commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879320089


##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   Ah, yes, I also find this very confusing -- I think that split was part of the original implementation from @sunchao 
   
   Would it be possible to create a `ParallelizedFileWriter` that used multiple threads and buffers internally to write files out with greater concurrency?
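   
   As a purely hypothetical sketch of that idea (nothing in this PR implements it, and the function names below are made up): each row group could be encoded into its own in-memory buffer on a worker thread, with the buffers then appended to the underlying `Write` strictly in order.
   
   ```rust
   use std::io::Write;
   use std::thread;
   
   // Stand-in for encoding one row group's worth of data into a buffer.
   fn encode_row_group(rows: Vec<u8>) -> Vec<u8> {
       rows
   }
   
   fn parallel_write(parts: Vec<Vec<u8>>, out: &mut impl Write) -> std::io::Result<()> {
       // Encode each row group concurrently into an independent buffer...
       let handles: Vec<_> = parts
           .into_iter()
           .map(|rows| thread::spawn(move || encode_row_group(rows)))
           .collect();
   
       // ...then append the buffers to the file sequentially, preserving order.
       for handle in handles {
           let buffer = handle.join().expect("worker thread panicked");
           out.write_all(&buffer)?;
       }
       Ok(())
   }
   ```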
   





[GitHub] [arrow-rs] ahmedriza commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
ahmedriza commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879561306


##########
parquet/src/arrow/arrow_writer.rs:
##########
@@ -188,7 +190,7 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
             write_leaves(row_group_writer.as_mut(), &arrays, &mut levels)?;
         }
 
-        self.writer.close_row_group(row_group_writer)?;
+        row_group_writer.close().unwrap();

Review Comment:
   Since `close()` returns a `Result` anyway, is there a particular reason to call `unwrap` instead of using `?`?









[GitHub] [arrow-rs] ahmedriza commented on pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
ahmedriza commented on PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#issuecomment-1134832399

   This is a great step forward; it makes the API cleaner and lets the compiler enforce invariants, avoiding issues like https://github.com/apache/arrow-rs/issues/1717.




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Change parquet writers to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r880519748


##########
parquet/src/arrow/arrow_writer.rs:
##########
@@ -188,7 +190,7 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
             write_leaves(row_group_writer.as_mut(), &arrays, &mut levels)?;
         }
 
-        self.writer.close_row_group(row_group_writer)?;
+        row_group_writer.close().unwrap();

Review Comment:
   Good spot - pure oversight





[GitHub] [arrow-rs] alamb commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879302492


##########
parquet/src/file/writer.rs:
##########
@@ -92,102 +111,90 @@ pub trait FileWriter {
 /// All columns should be written sequentially; the main workflow is:
 /// - Request the next column using `next_column` method - this will return `None` if no
 /// more columns are available to write.
-/// - Once done writing a column, close column writer with `close_column` method - this
-/// will finalise column chunk metadata and update row group metrics.
+/// - Once done writing a column, close column writer with `close`

Review Comment:
   What is the reason to remove the rationale ('this will finalize column chunk metadata ...')?



##########
parquet/src/util/cursor.rs:
##########
@@ -133,68 +131,6 @@ impl Seek for SliceableCursor {
     }
 }
 
-/// Use this type to write Parquet to memory rather than a file.
-#[derive(Debug, Default, Clone)]

Review Comment:
   In terms of easing the transition of this change, what would you think about marking this struct "deprecated" and keeping the code backwards compatible for a few releases (e.g. `impl Write for InMemoryWriteableCursor`)?
   
   That might make the API change easier to manage and give users some time to remove it.
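   
   Something along these lines, assuming the cursor can be reduced to an in-memory buffer (the real struct's internals may differ):
   
   ```rust
   use std::io::{self, Write};
   
   // Hypothetical transition shim: keep the old type, mark it deprecated, and
   // forward `std::io::Write` to its internal buffer.
   #[deprecated(note = "write to any std::io::Write implementation instead")]
   #[derive(Debug, Default, Clone)]
   pub struct InMemoryWriteableCursor {
       buffer: Vec<u8>,
   }
   
   #[allow(deprecated)]
   impl Write for InMemoryWriteableCursor {
       fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
           self.buffer.extend_from_slice(buf);
           Ok(buf.len())
       }
   
       fn flush(&mut self) -> io::Result<()> {
           Ok(())
       }
   }
   ```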



##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   Also, it seems to me like we could change this example to actually run (not sure why it is marked `no_run` -- it just needs to pick a tempfile as target rather than `/path/to/sample.parquet`).
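   
   For example (a hypothetical tweak, just to show the idea), the example could target the OS temp directory:
   
   ```rust
   use std::{env, fs};
   
   // Pick a target under the system temp directory so the doctest can run.
   let path = env::temp_dir().join("sample.parquet");
   let file = fs::File::create(&path).unwrap();
   // ... build the SerializedFileWriter against `file` as in the example above ...
   ```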



##########
parquet/src/file/writer.rs:
##########
@@ -270,69 +273,45 @@ impl<W: 'static + ParquetWriter> FileWriter for SerializedFileWriter<W> {
 /// A serialized implementation for Parquet [`RowGroupWriter`].
 /// Coordinates writing of a row group with column writers.
 /// See documentation on row group writer for more information.
-pub struct SerializedRowGroupWriter<W: ParquetWriter> {
+pub struct SerializedRowGroupWriter<'a, W: Write> {
     descr: SchemaDescPtr,
     props: WriterPropertiesPtr,
-    buf: W,
+    buf: &'a mut TrackedWrite<W>,
     total_rows_written: Option<u64>,
     total_bytes_written: u64,
     column_index: usize,
-    previous_writer_closed: bool,
     row_group_metadata: Option<RowGroupMetaDataPtr>,
     column_chunks: Vec<ColumnChunkMetaData>,
+    on_close: Option<OnCloseRowGroup<'a>>,
 }
 
-impl<W: 'static + ParquetWriter> SerializedRowGroupWriter<W> {
+impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
+    /// Creates a new `SerializedRowGroupWriter` with:
+    ///
+    /// - `schema_descr` - the schema to write
+    /// - `properties` - writer properties
+    /// - `buf` - the buffer to write data to
+    /// - `on_close` - an optional callback that will invoked on [`Self::close`]
     pub fn new(
         schema_descr: SchemaDescPtr,
         properties: WriterPropertiesPtr,
-        buf: &W,
+        buf: &'a mut TrackedWrite<W>,
+        on_close: Option<OnCloseRowGroup<'a>>,
     ) -> Self {
         let num_columns = schema_descr.num_columns();
         Self {
+            buf,
+            on_close,
+            total_rows_written: None,
             descr: schema_descr,
             props: properties,
-            buf: buf.try_clone().unwrap(),
-            total_rows_written: None,
-            total_bytes_written: 0,
             column_index: 0,
-            previous_writer_closed: true,
             row_group_metadata: None,
             column_chunks: Vec::with_capacity(num_columns),
+            total_bytes_written: 0,
         }
     }
 
-    /// Checks and finalises current column writer.
-    fn finalise_column_writer(&mut self, writer: ColumnWriter) -> Result<()> {

Review Comment:
   This code appears to have been inlined into `next_column` below



##########
parquet/src/util/io.rs:
##########
@@ -153,47 +153,6 @@ impl<R: ParquetReader> Length for FileSource<R> {
         self.end - self.start
     }
 }
-
-/// Struct that represents `File` output stream with position tracking.
-/// Used as a sink in file writer.

Review Comment:
   Likewise, I recommend considering leaving this structure in (and marking it as "deprecated") for a release or two, to give people a chance to update their code over time.



##########
parquet/benches/arrow_writer.rs:
##########
@@ -278,8 +276,8 @@ fn _create_nested_bench_batch(
 #[inline]
 fn write_batch(batch: &RecordBatch) -> Result<()> {
     // Write batch to an in-memory writer
-    let cursor = InMemoryWriteableCursor::default();
-    let mut writer = ArrowWriter::try_new(cursor, batch.schema(), None)?;
+    let buffer = vec![];

Review Comment:
   Here is a nice example of the new API in action: use anything that implements `std::io::Write`.



##########
parquet/src/file/writer.rs:
##########
@@ -214,7 +221,7 @@ impl<W: ParquetWriter> SerializedFileWriter<W> {
     }
 
     #[inline]
-    fn assert_closed(&self) -> Result<()> {
+    fn assert_not_closed(&self) -> Result<()> {

Review Comment:
   👍 



##########
parquet/src/file/writer.rs:
##########
@@ -196,13 +203,13 @@ impl<W: ParquetWriter> SerializedFileWriter<W> {
         };
 
         // Write file metadata
-        let start_pos = self.buf.seek(SeekFrom::Current(0))?;
+        let start_pos = self.buf.bytes_written();

Review Comment:
   ❤️ 



##########
parquet/src/file/writer.rs:
##########
@@ -270,69 +273,45 @@ impl<W: 'static + ParquetWriter> FileWriter for SerializedFileWriter<W> {
 /// A serialized implementation for Parquet [`RowGroupWriter`].
 /// Coordinates writing of a row group with column writers.
 /// See documentation on row group writer for more information.
-pub struct SerializedRowGroupWriter<W: ParquetWriter> {
+pub struct SerializedRowGroupWriter<'a, W: Write> {
     descr: SchemaDescPtr,
     props: WriterPropertiesPtr,
-    buf: W,
+    buf: &'a mut TrackedWrite<W>,
     total_rows_written: Option<u64>,
     total_bytes_written: u64,
     column_index: usize,
-    previous_writer_closed: bool,
     row_group_metadata: Option<RowGroupMetaDataPtr>,
     column_chunks: Vec<ColumnChunkMetaData>,
+    on_close: Option<OnCloseRowGroup<'a>>,
 }
 
-impl<W: 'static + ParquetWriter> SerializedRowGroupWriter<W> {
+impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
+    /// Creates a new `SerializedRowGroupWriter` with:
+    ///
+    /// - `schema_descr` - the schema to write
+    /// - `properties` - writer properties
+    /// - `buf` - the buffer to write data to
+    /// - `on_close` - an optional callback that will invoked on [`Self::close`]
     pub fn new(
         schema_descr: SchemaDescPtr,
         properties: WriterPropertiesPtr,
-        buf: &W,
+        buf: &'a mut TrackedWrite<W>,
+        on_close: Option<OnCloseRowGroup<'a>>,

Review Comment:
   why is this `Option` -- shouldn't it always be required?



##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   Which trait indirection are you referring to?



##########
parquet/src/file/writer.rs:
##########
@@ -92,102 +111,90 @@ pub trait FileWriter {
 /// All columns should be written sequentially; the main workflow is:
 /// - Request the next column using `next_column` method - this will return `None` if no
 /// more columns are available to write.
-/// - Once done writing a column, close column writer with `close_column` method - this
-/// will finalise column chunk metadata and update row group metrics.
+/// - Once done writing a column, close column writer with `close`
 /// - Once all columns have been written, close row group writer with `close` method -
 /// it will return row group metadata and is no-op on already closed row group.
 pub trait RowGroupWriter {
     /// Returns the next column writer, if available; otherwise returns `None`.
     /// In case of any IO error or Thrift error, or if row group writer has already been
     /// closed returns `Err`.
-    ///
-    /// To request the next column writer, the previous one must be finalised and closed
-    /// using `close_column`.
-    fn next_column(&mut self) -> Result<Option<ColumnWriter>>;
-
-    /// Closes column writer that was created using `next_column` method.
-    /// This should be called before requesting the next column writer.
-    fn close_column(&mut self, column_writer: ColumnWriter) -> Result<()>;
+    fn next_column(&mut self) -> Result<Option<SerializedColumnWriter<'_>>>;
 
     /// Closes this row group writer and returns row group metadata.
     /// After calling this method row group writer must not be used.
     ///
-    /// It is recommended to call this method before requesting another row group, but it
-    /// will be closed automatically before returning a new row group.
-    ///
     /// Can be called multiple times. In subsequent calls will result in no-op and return
     /// already created row group metadata.
     fn close(&mut self) -> Result<RowGroupMetaDataPtr>;
 }
 
+/// Callback invoked on closing a column chunk, arguments are:

Review Comment:
   Also, importantly, keeping a mutable reference lets the compiler prevent concurrent writes to the same writer, as reported in https://github.com/apache/arrow-rs/issues/1717



##########
parquet/src/file/writer.rs:
##########
@@ -65,15 +91,8 @@ pub trait FileWriter {
     ///
     /// There is no limit on a number of row groups in a file; however, row groups have
     /// to be written sequentially. Every time the next row group is requested, the
-    /// previous row group must be finalised and closed using `close_row_group` method.
-    fn next_row_group(&mut self) -> Result<Box<dyn RowGroupWriter>>;
-
-    /// Finalises and closes row group that was created using `next_row_group` method.
-    /// After calling this method, the next row group is available for writes.
-    fn close_row_group(
-        &mut self,
-        row_group_writer: Box<dyn RowGroupWriter>,
-    ) -> Result<()>;
+    /// previous row group must be finalised and closed using `RowGroupWriter::close` method.
+    fn next_row_group(&mut self) -> Result<Box<dyn RowGroupWriter + '_>>;

Review Comment:
   It might help here to leave a link back to the docs in `parquet/src/file/mod.rs` to bring readers to an example



##########
parquet_derive_test/src/lib.rs:
##########
@@ -131,9 +131,13 @@ mod tests {
         let mut writer =
             SerializedFileWriter::new(file, generated_schema, props).unwrap();
 
-        let mut row_group = writer.next_row_group().unwrap();
-        drs.as_slice().write_to_row_group(&mut row_group).unwrap();
-        writer.close_row_group(row_group).unwrap();
+        {
+            let mut row_group = writer.next_row_group().unwrap();
+            drs.as_slice()
+                .write_to_row_group(row_group.as_mut())
+                .unwrap();
+            row_group.close().unwrap();

Review Comment:
   I do like this pattern better (close the row group writer rather than passing it back to the parquet file writer to close)





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879319496


##########
parquet/src/file/writer.rs:
##########
@@ -270,69 +273,45 @@ impl<W: 'static + ParquetWriter> FileWriter for SerializedFileWriter<W> {
 /// A serialized implementation for Parquet [`RowGroupWriter`].
 /// Coordinates writing of a row group with column writers.
 /// See documentation on row group writer for more information.
-pub struct SerializedRowGroupWriter<W: ParquetWriter> {
+pub struct SerializedRowGroupWriter<'a, W: Write> {
     descr: SchemaDescPtr,
     props: WriterPropertiesPtr,
-    buf: W,
+    buf: &'a mut TrackedWrite<W>,
     total_rows_written: Option<u64>,
     total_bytes_written: u64,
     column_index: usize,
-    previous_writer_closed: bool,
     row_group_metadata: Option<RowGroupMetaDataPtr>,
     column_chunks: Vec<ColumnChunkMetaData>,
+    on_close: Option<OnCloseRowGroup<'a>>,
 }
 
-impl<W: 'static + ParquetWriter> SerializedRowGroupWriter<W> {
+impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
+    /// Creates a new `SerializedRowGroupWriter` with:
+    ///
+    /// - `schema_descr` - the schema to write
+    /// - `properties` - writer properties
+    /// - `buf` - the buffer to write data to
+    /// - `on_close` - an optional callback that will invoked on [`Self::close`]
     pub fn new(
         schema_descr: SchemaDescPtr,
         properties: WriterPropertiesPtr,
-        buf: &W,
+        buf: &'a mut TrackedWrite<W>,
+        on_close: Option<OnCloseRowGroup<'a>>,

Review Comment:
   These writers are technically public and I don't really know what people might be using them for; I thought this was a way to make the change less annoying: you can just pass `None` and move on...





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Rustify parquet writer (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r878742614


##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   This is an example of the somewhat unfortunate manual scoping... I personally think removing the trait indirection would make the code a lot easier to understand...





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Rustify parquet writer (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r878742794


##########
parquet/src/arrow/arrow_writer.rs:
##########
@@ -65,7 +67,7 @@ pub struct ArrowWriter<W: ParquetWriter> {
     max_row_group_size: usize,
 }
 
-impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+impl<W: Write> ArrowWriter<W> {

Review Comment:
   This seemingly small change means you can pass in `&mut std::io::Cursor<Vec<_>>`, or any other `Write` implementation, which makes this much easier to use.
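   
   For example, a hedged sketch of the new calling convention (constructor shape and module paths assumed from the diffs in this PR; they may differ):
   
   ```rust
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::arrow_writer::ArrowWriter;
   use parquet::errors::Result;
   
   // Any `std::io::Write` implementor can now be the sink - here a plain Vec<u8>.
   fn write_batch_to_memory(batch: &RecordBatch) -> Result<()> {
       let buffer: Vec<u8> = Vec::new();
       let mut writer = ArrowWriter::try_new(buffer, batch.schema(), None)?;
       writer.write(batch)?;
       writer.close()?;
       Ok(())
   }
   ```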



##########
parquet/src/arrow/arrow_writer.rs:
##########
@@ -744,15 +748,15 @@ mod tests {
         let expected_batch =
             RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
 
-        let cursor = InMemoryWriteableCursor::default();
+        let mut cursor = Cursor::new(vec![]);

Review Comment:
   :tada: 





[GitHub] [arrow-rs] codecov-commenter commented on pull request #1719: Rustify parquet writer (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#issuecomment-1133756007

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1719?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1719](https://codecov.io/gh/apache/arrow-rs/pull/1719?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2620f45) into [master](https://codecov.io/gh/apache/arrow-rs/commit/a30e787b325fd5579699197db1411df52570cc20?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a30e787) will **increase** coverage by `0.00%`.
   > The diff coverage is `91.66%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1719   +/-   ##
   =======================================
     Coverage   83.32%   83.33%           
   =======================================
     Files         195      194    -1     
     Lines       56023    55951   -72     
   =======================================
   - Hits        46681    46626   -55     
   + Misses       9342     9325   -17     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1719?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/benches/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9iZW5jaGVzL2Fycm93X3dyaXRlci5ycw==) | `0.00% <0.00%> (ø)` | |
   | [parquet/src/util/cursor.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9jdXJzb3IucnM=) | `80.85% <ø> (+3.54%)` | :arrow_up: |
   | [parquet/src/util/io.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9pby5ycw==) | `85.26% <ø> (-4.74%)` | :arrow_down: |
   | [parquet\_derive/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (+0.45%)` | :arrow_up: |
   | [parquet/src/data\_type.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZGF0YV90eXBlLnJz) | `75.84% <50.00%> (ø)` | |
   | [parquet/src/file/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS93cml0ZXIucnM=) | `93.22% <92.07%> (-0.09%)` | :arrow_down: |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `95.91% <100.00%> (ø)` | |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `97.66% <100.00%> (+<0.01%)` | :arrow_up: |
   | [parquet/src/column/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3dyaXRlci5ycw==) | `91.39% <100.00%> (ø)` | |
   | ... and [7 more](https://codecov.io/gh/apache/arrow-rs/pull/1719/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1719?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1719?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a30e787...2620f45](https://codecov.io/gh/apache/arrow-rs/pull/1719?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [arrow-rs] tustvold merged pull request #1719: Change parquet writers to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879326505


##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   It would have been highly non-trivial to do this correctly before, and with the lifetime changes in this PR it would actually be impossible, as `FileWriter::next_row_group` returns a mutable borrow of the `FileWriter`.
   
   That being said, nothing would prevent creating a `ParallelizedFileWriter`; you just wouldn't be able to be generic over parallel and non-parallel implementations. I'm not sure this is actually a problem, especially as any `ParallelizedFileWriter` is likely to have additional constraints, e.g. `Send`, async, etc...
   
   I personally think removing these traits as part of this PR would go a long way to making the change easy for users to understand; the lifetime errors are not very obvious unless you know what you are looking for. What do you think?





[GitHub] [arrow-rs] alamb commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879320446


##########
parquet/src/file/writer.rs:
##########
@@ -270,69 +273,45 @@ impl<W: 'static + ParquetWriter> FileWriter for SerializedFileWriter<W> {
 /// A serialized implementation for Parquet [`RowGroupWriter`].
 /// Coordinates writing of a row group with column writers.
 /// See documentation on row group writer for more information.
-pub struct SerializedRowGroupWriter<W: ParquetWriter> {
+pub struct SerializedRowGroupWriter<'a, W: Write> {
     descr: SchemaDescPtr,
     props: WriterPropertiesPtr,
-    buf: W,
+    buf: &'a mut TrackedWrite<W>,
     total_rows_written: Option<u64>,
     total_bytes_written: u64,
     column_index: usize,
-    previous_writer_closed: bool,
     row_group_metadata: Option<RowGroupMetaDataPtr>,
     column_chunks: Vec<ColumnChunkMetaData>,
+    on_close: Option<OnCloseRowGroup<'a>>,
 }
 
-impl<W: 'static + ParquetWriter> SerializedRowGroupWriter<W> {
+impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
+    /// Creates a new `SerializedRowGroupWriter` with:
+    ///
+    /// - `schema_descr` - the schema to write
+    /// - `properties` - writer properties
+    /// - `buf` - the buffer to write data to
+    /// - `on_close` - an optional callback that will invoked on [`Self::close`]
     pub fn new(
         schema_descr: SchemaDescPtr,
         properties: WriterPropertiesPtr,
-        buf: &W,
+        buf: &'a mut TrackedWrite<W>,
+        on_close: Option<OnCloseRowGroup<'a>>,

Review Comment:
   makes sense





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879317772


##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   > Which trait indirection are you referring to?
   
   We have `FileWriter` and `RowGroupWriter` traits which are then implemented by `SerializedFileWriter` and `SerializedRowGroupWriter`.
   
   I'm not sure why you would ever want to override the default implementation, and the need to expose these as object-safe traits prevents `FileWriter::close(self)` or `RowGroupWriter::close(self)`, which causes the lifetime annoyances.
   
   I also find the indirection very confusing in general :sweat_smile:  





[GitHub] [arrow-rs] alamb commented on pull request #1719: Change parquet writers to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#issuecomment-1137696422

   👨‍🍳 👌  -- this will be a very nice improvement in arrow 15 ❤️  -- thank you @tustvold 




[GitHub] [arrow-rs] alamb commented on pull request #1719: Change parquet writers to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#issuecomment-1135912402

   It will be a nice feature in version 15 🎉 




[GitHub] [arrow-rs] tustvold commented on pull request #1719: Change parquet writers to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#issuecomment-1135542068

   I've removed the `FileWriter` and `RowGroupWriter` traits as discussed, and deprecated `FileSink` and `InMemoryWriteableCursor`.
   
   Assuming CI doesn't find something I've missed, I think this is now ready. Unless anyone objects, I intend to leave it open for another 24 hours and then merge it.




[GitHub] [arrow-rs] alamb commented on a diff in pull request #1719: Change parquet `ArrowFileWriter` to use standard `std:io::Write` rather custom `ParquetWriter` trait (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r879344191


##########
parquet/src/file/mod.rs:
##########
@@ -48,12 +48,14 @@
 //! let props = Arc::new(WriterProperties::builder().build());
 //! let file = fs::File::create(&path).unwrap();
 //! let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
-//! let mut row_group_writer = writer.next_row_group().unwrap();
-//! while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
-//!     // ... write values to a column writer
-//!     row_group_writer.close_column(col_writer).unwrap();
+//! {

Review Comment:
   I am not opposed to removing the `FileWriter` and `RowGroupWriter` traits; I think it would help readability a lot. However, I think other reviewers should weigh in.
   
   If we are going to make such an API change, perhaps we should do it in the same release as this change, to get it all over with at once.





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1719: Rustify parquet writer (#1717) (#1163)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1719:
URL: https://github.com/apache/arrow-rs/pull/1719#discussion_r878742710


##########
parquet/src/file/writer.rs:
##########
@@ -393,39 +387,91 @@ impl<W: 'static + ParquetWriter> RowGroupWriter for SerializedRowGroupWriter<W>
                 .set_num_rows(self.total_rows_written.unwrap_or(0) as i64)
                 .build()?;
 
-            self.row_group_metadata = Some(Arc::new(row_group_metadata));
+            let metadata = Arc::new(row_group_metadata);
+            self.row_group_metadata = Some(metadata.clone());
+
+            if let Some(on_close) = self.on_close.take() {
+                on_close(metadata)?
+            }
         }
 
         let metadata = self.row_group_metadata.as_ref().unwrap().clone();
         Ok(metadata)
     }
 }
 
+/// A wrapper around a [`ColumnWriter`] that invokes a callback on [`Self::close`]

Review Comment:
   I'm not a massive fan of this, but I was somewhat wary of making changes to the `ColumnWriter` plumbing, as it is used in lots of places.


