You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "wiedld (via GitHub)" <gi...@apache.org> on 2024/03/11 03:39:13 UTC

[PR] WIP(do-not-merge): changes to enable ParquetSink poc [arrow-datafusion]

wiedld opened a new pull request, #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548

   **POC for Discussion only: DO NOT MERGE.**
   
   
   ## Which issue does this PR close?
   
   For discussion of https://github.com/apache/arrow-datafusion/issues/9493.
   
   ## Rationale for this change
   
   We are proposed a generalized public API that provides access to parallelized parquet writes outside of the COPYTO execution context. The code shared here is **NOT** the changes we are requesting. Instead, it shows what current limitations exist when trying to use the ParquetSink for parquet writing, instead of the ArrowWriter.
   
   ## What changes are included in this PR?
   
   What ArrowWriter already provided, and we had to change in order to use ParquetSink:
   * expose the FileMetaData associated with the created parquet:
      * **ArrowWriter already provides:**
          * in the [ArrowWriter::close() return signature](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L254), the FileMetaData is provided.
      * **ParquetSink had to be changed:**
          * as ParquetSink is intended for use inside a query execution context, and writes to 1+ file sinks, it does not currently return any FileMetaData associated with any sinks.
          * We had to change this, in order for the POC to work.
   * provide the appropriate schema in the kv store:
      * **ArrowWriter already provides:**
          * the [ArrowWriter::try_new() both serializes the schema and maps to the appropriate key](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L141) within the kv_store of the WriterPropertries
      * **ParquetSink had to be changed:**
          * whereas the ParquetSink does not include this functionality.
          * As such, we had to provide this mutation of WriterProperties in our own code (by extracting the `add_encoded_arrow_schema_to_metadata()` and associated upstream functionality).
   
   
   
   ## Are these changes tested?
   
   This code will not be merged.
   
   ## Are there any user-facing changes?
   
   This code will not be merged.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] feat(9493): provide access to FileMetaData for files written with ParquetSink [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb merged PR #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] feat(9493): provide access to FileMetaData for files written with ParquetSink [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548#discussion_r1520436449


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -717,7 +734,18 @@ impl DataSink for ParquetSink {
         while let Some(result) = file_write_tasks.join_next().await {
             match result {
                 Ok(r) => {
-                    row_count += r?;
+                    let (path, file_metadata) = r?;
+                    row_count += file_metadata.num_rows;
+                    let mut written_files = self.written.lock();
+                    written_files
+                        .try_insert(path.clone(), file_metadata)
+                        .map_err(|e| {
+                            DataFusionError::Internal(format!(
+                                "duplicate entry detected for partitioned file {}: {}",
+                                &path, e
+                            ))

Review Comment:
   Can you please use the `internal_err!` macro here instead -- something like
   
   ```suggestion
                           .map_err(|e| internal_err!("duplicate entry detected for partitioned file {path}: {e}"))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] WIP(do-not-merge): Proposed public parallel parquet writer API / changes to `ParquetSink` [arrow-datafusion]

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on PR #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548#issuecomment-1988486807

   I plan to take a closer look at this later this evening. Looks good at a high level though, thanks @wiedld !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] WIP(do-not-merge): Proposed public parallel parquet writer API / changes to `ParquetSink` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548#issuecomment-1989258358

   @wiedld  -- what do you think about changing the title and description of this PR to match the intent of merging this API?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] feat(9493): provide access to FileMetaData for files written with ParquetSink [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548#issuecomment-1992069331

   Thanks @devinjdangelo  and @wiedld


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org