Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/28 14:57:33 UTC

[GitHub] [arrow-rs] stevenliebregt opened a new issue, #1627: Written Parquet file way bigger than input files

stevenliebregt opened a new issue, #1627:
URL: https://github.com/apache/arrow-rs/issues/1627

   **Which part is this question about**
   The parquet file writer usage
   
   **Describe your question**
   Hi, I'm looking into whether parquet and arrow could fit a use case of mine, but I've run into a strange issue for which I can find no answer in the documentation. I have two input files in plain-text format, where each record spans 4 lines. I have a parser that reads them just fine, and I want to convert that format to a Parquet file. The two input files are around 600MB combined, but when I write them to a Parquet file, the resulting file is nearly 5GB, and the process also consumes around 6-7GB of memory while writing. I have turned on compression.
   
   ```rust
   use std::sync::Arc;
   
   use parquet::{
       basic::Compression,
       file::properties::WriterProperties,
       schema::parser::parse_message_type,
   };
   
   let message_type = "
       message Schema {
           REQUIRED BINARY id (UTF8);
           REQUIRED BINARY header (UTF8);
           REQUIRED BINARY sequence (UTF8);
           REQUIRED BINARY quality (UTF8);
       }
   ";
   
   let schema = Arc::new(parse_message_type(message_type).unwrap());
   let props = Arc::new(
       WriterProperties::builder()
           .set_compression(Compression::SNAPPY)
           .build(),
   );
   ```
   
   This is my Rust configuration for the writer.
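
   For context, the write loop looks roughly like this. It is only a minimal sketch with a placeholder value and output path; `records.parquet` and the single `ByteArray` are illustrative, and the exact `SerializedFileWriter` column API differs a little between parquet crate versions:
   
   ```rust
   use std::fs::File;
   
   use parquet::{
       data_type::{ByteArray, ByteArrayType},
       file::writer::SerializedFileWriter,
   };
   
   // `schema` and `props` are the values built above.
   let file = File::create("records.parquet").unwrap();
   let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
   
   // One row group shown for brevity; each column gets a single write_batch call.
   let mut row_group = writer.next_row_group().unwrap();
   while let Some(mut col_writer) = row_group.next_column().unwrap() {
       // Placeholder value; the real code writes one ByteArray per parsed record.
       let values = vec![ByteArray::from("example")];
       col_writer
           .typed::<ByteArrayType>()
           .write_batch(&values, None, None)
           .unwrap();
       col_writer.close().unwrap();
   }
   row_group.close().unwrap();
   writer.close().unwrap();
   ```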
   



[GitHub] [arrow-rs] stevenliebregt commented on issue #1627: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
stevenliebregt commented on issue #1627:
URL: https://github.com/apache/arrow-rs/issues/1627#issuecomment-1113935499

   Thanks for the answer. I'll give those ideas a try, and if I find it's a problem specific to Rust I'll create an issue.



[GitHub] [arrow-rs] tustvold commented on issue #1627: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1627:
URL: https://github.com/apache/arrow-rs/issues/1627#issuecomment-1113791088

   Some ideas to try (a rough writer-properties sketch follows below):
   
   - Disable dictionary encoding for columns that don't have repeated values
   - Use writer version 2, which has better string encodings
   - Represent the id / sequence as an integral type instead of a variable-length string
   - Try without snappy, as compression does not always yield a benefit
   - Maybe try writing the data with something like pyarrow to determine whether this is specific to the Rust implementation
   
   Without the data it is hard to say for sure what is going on, but ignoring compression, parquet has at least a 4-byte overhead per string, so with lots of small strings that overhead adds up quickly.
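
   Roughly what the first four suggestions look like as writer properties. This is only a sketch: the column name `sequence` is taken from the schema in the issue, and the builder methods are those of the parquet crate's `WriterProperties`, which may vary a little between versions:
   
   ```rust
   use parquet::{
       basic::Compression,
       file::properties::{WriterProperties, WriterVersion},
       schema::types::ColumnPath,
   };
   
   let props = WriterProperties::builder()
       // version 2 enables the newer string encodings (e.g. DELTA_BYTE_ARRAY)
       .set_writer_version(WriterVersion::PARQUET_2_0)
       // no dictionary for a column whose values are mostly unique
       .set_column_dictionary_enabled(ColumnPath::from("sequence"), false)
       // compare the output size against no compression at all
       .set_compression(Compression::UNCOMPRESSED)
       .build();
   ```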



[GitHub] [arrow-rs] Dandandan commented on issue #1627: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #1627:
URL: https://github.com/apache/arrow-rs/issues/1627#issuecomment-1113979085

   Also, try zstd, which often gives quite a bit better compression than snappy.
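
   For reference, a minimal sketch of switching the codec. At the time of this thread `Compression::ZSTD` was a plain enum variant behind the parquet crate's `zstd` feature; newer releases take an explicit `ZstdLevel` instead:
   
   ```rust
   use parquet::{basic::Compression, file::properties::WriterProperties};
   
   // Requires the parquet crate's `zstd` feature.
   let props = WriterProperties::builder()
       .set_compression(Compression::ZSTD)
       .build();
   ```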



[GitHub] [arrow-rs] stevenliebregt closed issue #1627: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
stevenliebregt closed issue #1627: Written Parquet file way bigger than input files 
URL: https://github.com/apache/arrow-rs/issues/1627


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org