Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/30 10:27:00 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #1764: Optimized Writing of Arrow Byte Array to Parquet

tustvold opened a new issue, #1764:
URL: https://github.com/apache/arrow-rs/issues/1764

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   A significant amount of effort has been put into making the reading of byte arrays from parquet fast:
   
   * https://github.com/apache/arrow-rs/pull/1041
   * https://github.com/apache/arrow-rs/pull/1082
   * https://github.com/apache/arrow-rs/pull/1180
   
   We should invest some effort in making the writer performance comparable.
   
   **Describe the solution you'd like**
   
   Currently, writing byte array types from arrow involves the following steps:
   
   * Any dictionaries are hydrated
   * Each value from a string array is separately allocated into a `Vec<ByteArray>`
   * These values are then written using the ColumnWriter
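
To make the cost of these steps concrete, here is a minimal, std-only sketch. It is not the parquet crate's code: `DictStrings`, `hydrate` and `write_plain` are hypothetical names, and `Vec<Vec<u8>>` stands in for the `Vec<ByteArray>` mentioned above. The only detail taken from the Parquet format is the PLAIN layout for byte arrays (a 4-byte little-endian length followed by the bytes).

```rust
/// Hypothetical stand-in for a dictionary-encoded string array:
/// `keys[i]` is an index into `dictionary`.
struct DictStrings {
    keys: Vec<usize>,
    dictionary: Vec<String>,
}

/// Steps 1 and 2 (sketch): hydrate the dictionary and give every row
/// its own heap allocation, even when many rows share one value.
fn hydrate(array: &DictStrings) -> Vec<Vec<u8>> {
    array
        .keys
        .iter()
        .map(|&k| array.dictionary[k].as_bytes().to_vec())
        .collect()
}

/// Step 3 (sketch): encode the owned values with the PLAIN layout,
/// i.e. a 4-byte little-endian length followed by the value bytes.
fn write_plain(values: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}
```

In this sketch the number of allocations grows with the number of rows rather than with the number of distinct values, which is the overhead the issue is concerned with.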
   
   It would be a significant performance win to be able to elide these first two steps. This would likely involve much the same process as was followed for the reader:
   
   * Generify ColumnWriter to allow writing from different buffers
   * Add the ability to write from an arrow ByteArray directly
   * Add the ability to write from an arrow dictionary array directly
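
One possible shape for such a generic writer is sketched below, using only the standard library. Everything here is an assumption made for illustration: the `ByteValues` trait, `OffsetBuffer`, `DictBuffer` and `write_plain` are hypothetical names and not the parquet crate's API. The idea is that the encoder only needs borrowed byte slices, so it can read from an offsets-plus-values buffer or through dictionary keys without copying each value first.

```rust
/// Hypothetical abstraction: a source of values that can be borrowed
/// as byte slices without allocating.
trait ByteValues {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> &[u8];
}

/// Arrow-style layout: one contiguous values buffer plus `len + 1`
/// offsets marking where each value starts and ends.
struct OffsetBuffer {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

impl ByteValues for OffsetBuffer {
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn value(&self, i: usize) -> &[u8] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Dictionary layout: each key indexes into a shared `OffsetBuffer`.
struct DictBuffer {
    keys: Vec<usize>,
    dictionary: OffsetBuffer,
}

impl ByteValues for DictBuffer {
    fn len(&self) -> usize {
        self.keys.len()
    }

    fn value(&self, i: usize) -> &[u8] {
        // Resolve the key lazily; nothing is hydrated up front.
        self.dictionary.value(self.keys[i])
    }
}

/// PLAIN-encode any value source: a 4-byte little-endian length
/// followed by the value bytes, read directly from the source.
fn write_plain<V: ByteValues>(values: &V, out: &mut Vec<u8>) {
    for i in 0..values.len() {
        let v = values.value(i);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
}
```

A writer that also emits dictionary-encoded pages could go further and reuse the arrow dictionary and keys where possible instead of resolving each value, but that is beyond this sketch.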
   
   **Describe alternatives you've considered**
   
   We could choose not to do this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #1764: Optimized Writing of Arrow Byte Array to Parquet

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1764: Optimized Writing of Arrow Byte Array to Parquet
URL: https://github.com/apache/arrow-rs/issues/1764
