Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/03/15 09:50:30 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #3871: Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter

tustvold opened a new issue, #3871:
URL: https://github.com/apache/arrow-rs/issues/3871

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Currently `ArrowWriter` buffers incoming `RecordBatch`es until it has enough rows to populate an entire row group, and then writes each column in turn to the output buffer.
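   
   For reference, a minimal usage sketch of the current behaviour (the file name, batch size, and row group size are illustrative): every batch written below is retained in memory as Arrow data until a full row group of 1M rows has accumulated, at which point the row group is encoded and flushed.
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;
   
   use arrow::array::Int64Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::file::properties::WriterProperties;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
   
       // Row groups are only flushed once this many rows have been buffered.
       let props = WriterProperties::builder()
           .set_max_row_group_size(1_000_000)
           .build();
   
       let file = File::create("example.parquet")?;
       let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;
   
       for start in (0..10_000_000i64).step_by(100_000) {
           let values = Int64Array::from_iter_values(start..start + 100_000);
           let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(values)])?;
           // Each batch is held as (uncompressed) Arrow data until the row
           // group is complete, which is what this issue proposes to avoid.
           writer.write(&batch)?;
       }
       writer.close()?;
       Ok(())
   }
   ```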
   
   **Describe the solution you'd like**
   
   The encoded parquet data is often orders of magnitude smaller than the corresponding arrow data. The read path goes to great lengths to allow incremental reading of data within a row group. It may therefore be desirable to instead encode arrow data eagerly, writing each ColumnChunk to its own temporary buffer, and then stitching these back together.
   
   This would allow writing larger row groups, whilst potentially consuming less memory in the arrow writer.
   
   This would likely involve extending, or possibly replacing, `SerializedRowGroupWriter` to allow writing to the same column multiple times.
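   
   To make the shape of the proposal concrete, here is a hypothetical sketch: the names (`ColumnChunkBuffer`, `flush_row_group`, etc.) are illustrative and are not existing parquet crate APIs. The key points are that each leaf column encodes into its own temporary buffer as data arrives, and the per-column buffers are only stitched back together, in column order, when the row group is closed.
   
   ```rust
   use std::io::Write;
   
   /// One in-memory buffer per leaf column: encoded pages accumulate here
   /// instead of the raw Arrow data being retained.
   struct ColumnChunkBuffer {
       /// Encoded (and typically compressed) pages, usually much smaller
       /// than the Arrow data they were produced from.
       encoded: Vec<u8>,
       num_rows: usize,
   }
   
   impl ColumnChunkBuffer {
       fn new() -> Self {
           Self { encoded: Vec::new(), num_rows: 0 }
       }
   
       /// Stand-in for "encode this slice of the column immediately",
       /// rather than holding the Arrow data until the row group is full.
       fn push_encoded(&mut self, pages: &[u8], rows: usize) {
           self.encoded.extend_from_slice(pages);
           self.num_rows += rows;
       }
   }
   
   /// Once enough rows have been encoded, stitch the per-column buffers
   /// back together in column order to form a single row group.
   fn flush_row_group<W: Write>(
       out: &mut W,
       columns: &mut [ColumnChunkBuffer],
   ) -> std::io::Result<()> {
       for col in columns.iter_mut() {
           out.write_all(&col.encoded)?;
           col.encoded.clear();
           col.num_rows = 0;
       }
       Ok(())
   }
   ```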
   
   **Describe alternatives you've considered**
   
   We could choose not to do this: Parquet is inherently a read-optimised format, and write performance may therefore be less of a priority for many workloads.
   
   **Additional context**
   




[GitHub] [arrow-rs] alamb commented on issue #3871: Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3871:
URL: https://github.com/apache/arrow-rs/issues/3871#issuecomment-1557840581

   This ticket will improve https://github.com/influxdata/influxdb_iox/issues/7783 -- thank you for filing it. 
    
   As part of this feature, I would like to request a user-definable, best-effort limit on how much memory the parquet writer will buffer (so a flush is triggered by either "max_row_group_size" or "buffer_limit", whichever is reached first).
   
   If for some reason that is not possible or advisable, exposing the currently buffered size would be acceptable too, so that external users can implement the buffer limiting themselves.
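   
   For the second option, a hedged sketch of what external buffer limiting could look like, assuming the writer exposed accessors along the lines of `in_progress_size()` and `flush()` (these names are assumptions for illustration, not a committed API):
   
   ```rust
   use std::io::Write;
   
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::errors::Result;
   
   /// Flush the in-progress row group early whenever the writer reports
   /// more than `buffer_limit` bytes buffered. `in_progress_size()` is the
   /// assumed accessor for "currently buffered size"; `flush()` cuts the
   /// row group.
   fn write_with_memory_limit<W: Write + Send>(
       writer: &mut ArrowWriter<W>,
       batch: &RecordBatch,
       buffer_limit: usize,
   ) -> Result<()> {
       writer.write(batch)?;
       if writer.in_progress_size() > buffer_limit {
           // The row group is cut here even if max_row_group_size
           // has not been reached.
           writer.flush()?;
       }
       Ok(())
   }
   ```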




[GitHub] [arrow-rs] tustvold commented on issue #3871: Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3871:
URL: https://github.com/apache/arrow-rs/issues/3871#issuecomment-1559304105

   I think https://github.com/apache/arrow-rs/issues/4155 is a precursor to this, as it provides the necessary APIs to encode the columns separately and then stitch them back together. I therefore intend to work on it first.




[GitHub] [arrow-rs] alamb commented on issue #3871: Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3871:
URL: https://github.com/apache/arrow-rs/issues/3871#issuecomment-1559378680

   I wonder if you might also think about https://github.com/apache/arrow-rs/issues/1718 ("encode the columns in parallel while writing parquet") while working on this.
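   
   For reference, a minimal sketch of the parallelisation idea that avoids naming any particular parquet API: each column is encoded independently on its own thread and the results are collected in column order for sequential writing. `encode_column` is a placeholder for the real per-column encoding work.
   
   ```rust
   use std::thread;
   
   use arrow::array::{Array, ArrayRef};
   
   /// Placeholder for per-column encoding; the real writer would produce
   /// encoded parquet pages here rather than a dummy byte buffer.
   fn encode_column(column: &ArrayRef) -> Vec<u8> {
       vec![0u8; column.get_array_memory_size().min(16)]
   }
   
   /// Encode each column of a batch on its own scoped thread, then collect
   /// the per-column results in column order so they can be written out
   /// sequentially. A thread pool such as rayon would work equally well.
   fn encode_columns_in_parallel(columns: &[ArrayRef]) -> Vec<Vec<u8>> {
       thread::scope(|scope| {
           let handles: Vec<_> = columns
               .iter()
               .map(|col| scope.spawn(move || encode_column(col)))
               .collect();
           handles.into_iter().map(|h| h.join().unwrap()).collect()
       })
   }
   ```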
   




[GitHub] [arrow-rs] tustvold closed issue #3871: Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #3871: Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter
URL: https://github.com/apache/arrow-rs/issues/3871

