Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/28 12:30:55 UTC

[GitHub] [arrow] stevenliebregt opened a new issue, #13023: Written Parquet file way bigger than input files

stevenliebregt opened a new issue, #13023:
URL: https://github.com/apache/arrow/issues/13023

   Hi, I'm checking whether Parquet and Arrow could fit a use case of mine, but I've run into a strange issue for which I can find no answer in the documentation. I have two input files in txt format, where each record spans 4 lines. I have a parser that reads them just fine, and I want to convert that format to a Parquet file. The two input files total around 600MB combined, but when I write them to a Parquet file, the resulting file is nearly 5GB. Writing also consumes around 6-7GB of memory. I have turned on compression.
   
   ```rust
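   use std::sync::Arc;

   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;
   use parquet::schema::parser::parse_message_type;

   // Four required UTF-8 string columns per record.
   let message_type = "
       message Schema {
           REQUIRED BINARY id (UTF8);
           REQUIRED BINARY header (UTF8);
           REQUIRED BINARY sequence (UTF8);
           REQUIRED BINARY quality (UTF8);
       }
   ";

   // Parse the textual schema into a shared schema descriptor.
   let schema = Arc::new(parse_message_type(message_type).unwrap());
   // Writer properties with Snappy compression enabled.
   let props = Arc::new(
       WriterProperties::builder()
           .set_compression(Compression::SNAPPY)
           .build(),
   );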
   ```
   
   This is my Rust configuration for the writer.
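
For illustration only, here is a minimal std-only sketch of the input side described above: records that span four lines, read and handed off in bounded batches. It is not the parser from the issue. It assumes each of the four lines maps to one schema column in order (id, header, sequence, quality), which the issue does not state. `Record`, `for_each_batch`, `batch_size`, and the `flush` callback are hypothetical names; `flush` stands in for whatever code would write a batch to the Parquet writer.

```rust
use std::io::{self, BufRead};

/// One parsed record. Field names mirror the schema columns; mapping one
/// input line to each field is an assumption, not something the issue states.
#[derive(Debug, PartialEq)]
struct Record {
    id: String,
    header: String,
    sequence: String,
    quality: String,
}

/// Reads records of four lines each and passes them to `flush` in batches of
/// at most `batch_size`, so only one batch is held in memory at a time.
/// Returns the total number of records read.
fn for_each_batch<R, F>(reader: R, batch_size: usize, mut flush: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnMut(&[Record]),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut lines = reader.lines();
    let mut batch = Vec::with_capacity(batch_size);
    let mut total = 0;

    while let Some(first) = lines.next() {
        let id = first?;
        // A record needs three more lines; a truncated trailing record is an error.
        let mut next_line = || -> io::Result<String> {
            lines.next().unwrap_or_else(|| {
                Err(io::Error::new(io::ErrorKind::UnexpectedEof, "incomplete record"))
            })
        };
        let header = next_line()?;
        let sequence = next_line()?;
        let quality = next_line()?;
        batch.push(Record { id, header, sequence, quality });
        total += 1;

        // Hand off a full batch and reuse the buffer.
        if batch.len() == batch_size {
            flush(batch.as_slice());
            batch.clear();
        }
    }
    // Hand off whatever is left over at end of input.
    if !batch.is_empty() {
        flush(batch.as_slice());
    }
    Ok(total)
}
```

Calling `for_each_batch` with a buffered file reader and a closure that writes each slice keeps the reading side's memory use proportional to `batch_size` rather than to the input size. Whether that has any bearing on the file size or memory use reported here is not established by this thread.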


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] stevenliebregt closed issue #13023: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
stevenliebregt closed issue #13023: Written Parquet file way bigger than input files
URL: https://github.com/apache/arrow/issues/13023




[GitHub] [arrow] stevenliebregt commented on issue #13023: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
stevenliebregt commented on issue #13023:
URL: https://github.com/apache/arrow/issues/13023#issuecomment-1112312151

   Oh, my bad. I didn't realize I was on the global arrow repo; I've moved the question to apache/arrow-rs.




[GitHub] [arrow] wjones127 commented on issue #13023: Written Parquet file way bigger than input files

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #13023:
URL: https://github.com/apache/arrow/issues/13023#issuecomment-1112264824

   Hi @stevenliebregt. For Rust questions, you'll get better responses in the [apache/arrow-rs](https://github.com/apache/arrow-rs/issues) repo, since that's where the Rust codebase and planning have moved.

