You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/03/14 01:31:27 UTC

[GitHub] [arrow-datafusion] comphead opened a new pull request, #5584: Introduce file writer strategies for Parquet writer

comphead opened a new pull request, #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5119 .
   
   # Rationale for this change
   
   To give the user flexibility selecting persist strategies when saving the Parquet data, in similar way Spark does
   https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/SaveMode.html
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   Yes
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   The only user-facing change is the user must provide the save mode, and the parquet  file name changed.
   `part-0-0.parquet` to `part-uuid-0.parquet`
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#issuecomment-1470864241

   Thanks @metesynnada for the review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1140096025


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   I agree with @alamb , a wrapper for options might be more future proof.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1138182981


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   I saw that option, however, it would better to be have a general `***WriterOptions`, like we have `ParquetReadOptions`, `CsvReadOptions`, `AvroReadOptions` etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1137795180


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -541,6 +541,23 @@ pub(crate) fn get_output_ordering(
         }).unwrap_or_else(|| None)
 }
 
+/// Defines strategies for the saving behavior in the case when the saved folder already exists
+#[derive(Debug, Clone, Copy)]
+pub enum FileWriterSaveMode {
+    /// Overwrite mode means that when saving a parquet file, if folder already exists, existing data is expected to be overwritten.
+    /// If folder does not exist, the folder will be created. This is default value.
+    Overwrite,
+    /// Append mode means that when saving a parquet file, if folder already exists, new data is expected to be appended on top of the existing.
+    /// If folder does not exist, the folder will be created
+    Append,

Review Comment:
   I see your point, will be trying to do writer properties more sbstract, so each writer can implement its own save mode behaviuors



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1137496223


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   Spark made the save mode optional, we can also default to `ErrorIfExists` mode. Also, we can merge the `WriterProperties` and `FileWriterSaveMode` structs like in a `ParquetWriterOptions` struct since they are both configurations, this would abstract the save mode from immediate usage and contribute the simplicity.
   
   ```rust
   ParquetWriterOptions {
   	writer_properties: WriterProperties,
   	save_mode: FileWriterSaveMode
   }
   
   CsvWriterOptions {
   	dummy_properties: FutureCSVProperties, //dummy
   	save_mode: FileWriterSaveMode
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce file writer strategies for Parquet writer [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #5584: Introduce file writer strategies for Parquet writer
URL: https://github.com/apache/arrow-datafusion/pull/5584


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1140076312


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   I agree that SaveMode in particular might belong in arrow-rs. I do think there is value longer term in making a DataFusion have a wrapper around those properties for datafusion specific things (like sort order).  
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1137794244


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   That is good point, alternatively for parquet we can reuse `WriterProperties::key_value_metadata` to specify optional saveMode and default to some value in case of None



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#issuecomment-1468228396

   I will carefully review it tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce file writer strategies for Parquet writer [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#issuecomment-2043641214

   Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1138849539


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   > I saw that option, however, it would better to be have a general ***WriterOptions, like we have ParquetReadOptions, CsvReadOptions, AvroReadOptions etc.
   
   I agree that making a more future proof API here would be very helpful  -- for example, I can imagine adding a other (DataFusion) specific properties for writing parquet. 
   
   For example, I would love to have a place to put "sort expressions" to sort the output parquet file by some set of expressions. 
   
   I love @metesynnada 's suggestion for `ParquetWriterOptions`
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1138908503


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   Right. @alamb @metesynnada 
   I was thinking initially to reuse `WriterProperties` optional key value metadata, so it can has arbitrary keys and values, including savemode and whatever. But `WriterProperties` is created for Parquet only in arrow-rs. CSV writer has its write properties as top struct fields https://github.com/comphead/arrow-rs/blob/master/arrow-csv/src/writer.rs#L82
   
   Ideally is to place such struct to arrow-rs and introduce it to all writers, to make them consistent. This will allow us to avoid conflicts(if arrow-rs decides to add their own saveMode), it will mess with ours defined in separate structure.
   But defining such struct in arrow-rs might be too breaking, so yes, lets try to do it in DF and backport to arrow-rs. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1140096025


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   I agree with @alamb, a wrapper for options might be more future-proof.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#issuecomment-1470170971

   Thanks @metesynnada. looking forward


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1137496223


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   Spark made the save mode optional, we can also default to `ErrorIfExists` mode. Also, we can merge the `WriterProperties` and `FileWriterSaveMode` structs like in a `ParquetWriterOptions` struct since they are both configurations, this would abstract the save mode from immediate usage and contribute the simplicity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1137496223


##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   Spark made the save mode optional, we can also default to `ErrorIfExists` mode. Also, we can merge the `WriterProperties` and `FileWriterSaveMode` structs like in a `ParquetWriterOptions` struct since they are both configurations, this would abstract the save mode from immediate usage and contribute the simplicity.
   What I mean is we can maybe have something like:
   ```rust
   ParquetWriterOptions {
   	writer_properties: WriterProperties,
   	save_mode: FileWriterSaveMode
   }
   
   CsvWriterOptions {
   	dummy_properties: FutureCSVProperties, //dummy
   	save_mode: FileWriterSaveMode
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5584: Introduce file writer strategies for Parquet writer

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5584:
URL: https://github.com/apache/arrow-datafusion/pull/5584#discussion_r1137479350


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -707,26 +709,58 @@ impl ParquetFileReaderFactory for DefaultParquetFileReaderFactory {
     }
 }
 
-/// Executes a query and writes the results to a partitioned Parquet file.
-pub async fn plan_to_parquet(
-    task_ctx: Arc<TaskContext>,
-    plan: Arc<dyn ExecutionPlan>,
+/// create target folder using different save mode strategy, refer to [`FileWriterSaveMode`] for more details
+fn create_target_folder(
     path: impl AsRef<str>,
-    writer_properties: Option<WriterProperties>,
-) -> Result<()> {
+    save_mode: FileWriterSaveMode,
+) -> Result<Option<std::path::PathBuf>> {
     let path = path.as_ref();
-    // create directory to contain the Parquet files (one per partition)
     let fs_path = std::path::Path::new(path);
+
+    if fs_path.exists() {
+        match save_mode {
+            FileWriterSaveMode::Overwrite => {
+                std::fs::remove_dir_all(fs_path)?;
+            }
+            FileWriterSaveMode::Ignore => {
+                return Ok(None);
+            }
+            FileWriterSaveMode::Append => {
+                return Ok(Some(fs_path.to_path_buf()));
+            }
+            _ => {}
+        };
+    }
+
     if let Err(e) = fs::create_dir(fs_path) {
         return Err(DataFusionError::Execution(format!(
             "Could not create directory {path}: {e:?}"
         )));
     }
 
+    Ok(Some(fs_path.to_path_buf()))
+}
+
+/// Executes a query and writes the results to a partitioned Parquet file.
+pub async fn plan_to_parquet(
+    task_ctx: Arc<TaskContext>,
+    plan: Arc<dyn ExecutionPlan>,
+    path: impl AsRef<str>,
+    writer_properties: Option<WriterProperties>,
+    save_mode: FileWriterSaveMode,
+) -> Result<()> {
+    // create directory to contain the Parquet files (one per partition)
+    let fs_path = &create_target_folder(path, save_mode)?;

Review Comment:
   There is an unnecessary copy/clone if the folder is not overwritten. We can make this part without a clone with a different implementation.



##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -541,6 +541,23 @@ pub(crate) fn get_output_ordering(
         }).unwrap_or_else(|| None)
 }
 
+/// Defines strategies for the saving behavior in the case when the saved folder already exists
+#[derive(Debug, Clone, Copy)]
+pub enum FileWriterSaveMode {
+    /// Overwrite mode means that when saving a parquet file, if folder already exists, existing data is expected to be overwritten.
+    /// If folder does not exist, the folder will be created. This is default value.
+    Overwrite,
+    /// Append mode means that when saving a parquet file, if folder already exists, new data is expected to be appended on top of the existing.
+    /// If folder does not exist, the folder will be created
+    Append,

Review Comment:
   As I understand, parquet files are not appended, we hold the directory and write new files with a new uuid. 
   
   This approach is specific to the parquet file format and may not apply to other file formats like CSV, where appending to an existing file is the expected behavior. It's worth noting that this structure may be extended to support other file formats in the future, so it's important to update the comments accordingly.
   



##########
datafusion/core/src/dataframe.rs:
##########
@@ -930,10 +932,11 @@ impl DataFrame {
         self,
         path: &str,
         writer_properties: Option<WriterProperties>,
+        save_mode: FileWriterSaveMode,

Review Comment:
   Spark made the save mode optional, we can also default to `append` mode. Also, we can merge the `WriterProperties` and `FileWriterSaveMode` structs like in a `ParquetWriterOptions` struct since they are both configurations, this would abstract the save mode from immediate usage and contribute the simplicity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org