
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/18 17:57:59 UTC

[GitHub] [arrow-rs] wjones127 opened a new issue, #1711: Concatenate parquet files without deserializing?

wjones127 opened a new issue, #1711:
URL: https://github.com/apache/arrow-rs/issues/1711

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   This is a random idea, but it seems like it would be valuable to be able to concatenate Parquet files without deserializing to Arrow and re-serializing back to Parquet. I'm not 100% sure it is possible, but it does seem like you should, in theory, be able to just copy the row group buffers and then update the offsets within the row group metadata in the footer.
   
   You can only do this if the schemas match, of course.
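   
   A minimal sketch of the offset bookkeeping this would involve, using only the standard library. `ChunkMeta` and `append_chunk` are hypothetical names invented for illustration and are not part of the `parquet` crate. The sketch assumes each column chunk is one contiguous byte range that starts at its dictionary page (if present) or otherwise at its first data page, and it ignores the other offsets a real footer can carry (page indexes, bloom filters):
   
   ```rust
   /// Hypothetical, simplified stand-in for one column chunk's footer metadata.
   #[derive(Debug, Clone, PartialEq)]
   struct ChunkMeta {
       /// Absolute file offset of the first data page.
       data_page_offset: u64,
       /// Absolute file offset of the dictionary page, if the chunk has one.
       dictionary_page_offset: Option<u64>,
       /// Total compressed size of the chunk in bytes.
       compressed_size: u64,
   }
   
   impl ChunkMeta {
       /// First byte of the chunk: the dictionary page if present,
       /// otherwise the first data page (an assumption of this sketch).
       fn start(&self) -> u64 {
           self.dictionary_page_offset.unwrap_or(self.data_page_offset)
       }
   }
   
   /// Copy one chunk's bytes from `src` to the end of `dst` without decoding
   /// them, and return metadata whose offsets point into `dst`.
   fn append_chunk(src: &[u8], meta: &ChunkMeta, dst: &mut Vec<u8>) -> ChunkMeta {
       let old_start = meta.start();
       let begin = old_start as usize;
       let end = begin + meta.compressed_size as usize;
   
       // The chunk will begin wherever the destination currently ends.
       let new_start = dst.len() as u64;
       dst.extend_from_slice(&src[begin..end]);
   
       // Every offset moves by the same amount: (new_start - old_start).
       ChunkMeta {
           data_page_offset: meta.data_page_offset - old_start + new_start,
           dictionary_page_offset: meta
               .dictionary_page_offset
               .map(|offset| offset - old_start + new_start),
           compressed_size: meta.compressed_size,
       }
   }
   ```
   
   Calling `append_chunk` for every column chunk of every row group, in order, and then writing a footer built from the returned metadata would give the concatenated file in this simplified model. The page bytes themselves are never decompressed or decoded.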
   
   **Describe the solution you'd like**
   
   If this is indeed possible, then some interface like:
   
   ```rust
   fn merge_files(readers: Vec<SerializedFileReader>, writer: impl FileWriter) -> Result<()>;
   ```
   
   **Describe alternatives you've considered**
   
   The obvious alternative is to simply read as Arrow, concatenate, and then serialize back, but reading and writing Parquet is famously compute-intensive, so it would be nice if we could avoid that.
   
   **Additional context**
   
   Concatenating parquet files is a common operation in Delta Lake tables, which may initially write out many small files that later need to be merged for better read performance. See delta-io/delta-rs#98.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1711: Concatenate parquet files without deserializing?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1711:
URL: https://github.com/apache/arrow-rs/issues/1711#issuecomment-1572227720

   Closed by #4269 



[GitHub] [arrow-rs] tustvold closed issue #1711: Concatenate parquet files without deserializing?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold closed issue #1711: Concatenate parquet files without deserializing?
URL: https://github.com/apache/arrow-rs/issues/1711



[GitHub] [arrow-rs] tustvold commented on issue #1711: Concatenate parquet files without deserializing?

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1711:
URL: https://github.com/apache/arrow-rs/issues/1711#issuecomment-1133108762

   This sounds like a good idea to me, and could possibly feed into some sort of story for parallel writing :+1: 
   
   It is probably worth highlighting though that whilst merging parquet files without rewriting the row groups will theoretically reduce the IO required to fetch them from object storage, along with any catalog overheads, it likely won't help with the CPU-bound portion of actually decoding the bytes, nor with compression.

