Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/29 22:03:04 UTC

[GitHub] [arrow-rs] tustvold commented on issue #1627: Written Parquet file way bigger than input files

tustvold commented on issue #1627:
URL: https://github.com/apache/arrow-rs/issues/1627#issuecomment-1113791088

   Some ideas to try:
   
   - Disable dictionary encoding for columns that don't have repeated values
   - Use writer version 2, which has better string encoding
   - Represent the id / sequence as an integral type instead of a variable length string
   - Try without snappy, as compression may not always yield benefits
   - Maybe try writing the data using something like pyarrow to determine if this is something specific to the Rust implementation
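   
   The writer-side suggestions above can be sketched as a `WriterProperties` configuration, assuming the `parquet` crate from arrow-rs; the column name "id" is a hypothetical stand-in for the reporter's column:
   
   ```rust
   use parquet::basic::Compression;
   use parquet::file::properties::{WriterProperties, WriterVersion};
   use parquet::schema::types::ColumnPath;
   
   fn main() {
       let _props = WriterProperties::builder()
           // Writer version 2 enables the newer encodings (e.g. delta
           // encodings for byte arrays)
           .set_writer_version(WriterVersion::PARQUET_2_0)
           // Dictionary encoding only helps when values repeat, so turn it
           // off for the high-cardinality column ("id" is hypothetical)
           .set_column_dictionary_enabled(ColumnPath::from("id"), false)
           // Write uncompressed to compare sizes against the snappy output
           .set_compression(Compression::UNCOMPRESSED)
           .build();
   }
   ```
   
   The resulting properties are passed to the arrow writer when it is constructed; toggling one setting at a time makes it easier to see which change accounts for the size difference.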
   
   Without the data it is hard to say for sure what is going on, but even ignoring compression, parquet stores at least a 4 byte length prefix per string, and so in the case of lots of small strings...
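
   As a back-of-envelope illustration of that per-string overhead (the counts below are illustrative, not from the reporter's data): with PLAIN-encoded byte arrays, each value carries a 4 byte length prefix, so short strings can more than double in size before compression.
   
   ```rust
   // Uncompressed size of a PLAIN-encoded BYTE_ARRAY column: each value is
   // stored as a 4-byte little-endian length followed by the bytes themselves.
   fn plain_encoded_size(num_values: u64, avg_len: u64) -> u64 {
       num_values * (4 + avg_len)
   }
   
   fn main() {
       // 10 million 3-character ids: 30 MB of payload becomes ~70 MB on disk,
       // because the 4-byte prefix outweighs the 3 bytes of actual data.
       let bytes = plain_encoded_size(10_000_000, 3);
       assert_eq!(bytes, 70_000_000);
       println!("{} bytes (~{} MB)", bytes, bytes / 1_000_000);
   }
   ```
   
   This is one reason representing a short id or sequence as an integral type, as suggested above, can shrink the file dramatically.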


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org