Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2024/03/07 22:32:28 UTC

[I] Better memory limiting in parquet `ArrowWriter` [arrow-rs]

alamb opened a new issue, #5484:
URL: https://github.com/apache/arrow-rs/issues/5484

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   @DDtKey suggested the following in a review of https://github.com/apache/arrow-rs/pull/5457 (https://github.com/apache/arrow-rs/pull/5457#pullrequestreview-1913224197):
   
   **Describe the solution you'd like**
   
   > I still think would be nice to have an additional config(or method) to "enforce flush on buffer size". To be able to encapsulate this logic for user's code 🤔
   
   The idea is to add an additional option to force the writer to flush when its buffered data hits a certain limit. 
   
   **Describe alternatives you've considered**
   
   The challenge is how to enforce buffer limiting without slowing down encoding. One idea would be to check memory usage after encoding each RecordBatch. This would be imprecise (the writer could exceed the limit), as noted by @tustvold, but the overage would be bounded by the size of one RecordBatch (which the user can control).
   
   This might look like adding something like the following to the [ArrowWriter](https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html):
   
   ```rust
   let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None)
     .unwrap()
     // proposed option (does not exist yet): flush when buffered parquet data exceeds 10MB
     .with_target_buffer_size(10 * 1024 * 1024);
   ```
   
   Since not all the parquet writers buffer their data like this, I think it doesn't make sense to put the buffer size on the `WriterProperties` struct. 
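   
   To make the bound concrete, here is a minimal std-only sketch of the "check once per RecordBatch" policy described above. It is not the parquet crate's API: `Batch`, `BufferingWriter`, and `simulate` are hypothetical stand-ins, and a batch is modeled only by its encoded size.
   
   ```rust
   /// Hypothetical stand-in for a RecordBatch: only its encoded size matters here.
   struct Batch {
       encoded_bytes: usize,
   }
   
   /// Hypothetical writer that buffers encoded data in memory until flushed.
   struct BufferingWriter {
       /// Flush once the buffered bytes reach this limit.
       target_buffer_size: usize,
       /// Bytes buffered since the last flush.
       buffered_bytes: usize,
       /// Largest buffered size observed so far.
       peak_buffered_bytes: usize,
       /// Number of flushes that actually wrote data.
       flush_count: usize,
   }
   
   impl BufferingWriter {
       fn new(target_buffer_size: usize) -> Self {
           Self {
               target_buffer_size,
               buffered_bytes: 0,
               peak_buffered_bytes: 0,
               flush_count: 0,
           }
       }
   
       /// Buffer one whole batch, then check the limit a single time.
       /// The check is per batch, not per value, so encoding is not slowed down.
       fn write(&mut self, batch: &Batch) {
           self.buffered_bytes += batch.encoded_bytes;
           self.peak_buffered_bytes = self.peak_buffered_bytes.max(self.buffered_bytes);
           if self.buffered_bytes >= self.target_buffer_size {
               self.flush();
           }
       }
   
       /// Drop the buffered data (a real writer would write it out here).
       fn flush(&mut self) {
           if self.buffered_bytes > 0 {
               self.buffered_bytes = 0;
               self.flush_count += 1;
           }
       }
   }
   
   /// Write `batches` equally sized batches and return (peak buffered bytes, flush count).
   fn simulate(limit: usize, batch_bytes: usize, batches: usize) -> (usize, usize) {
       let mut writer = BufferingWriter::new(limit);
       for _ in 0..batches {
           writer.write(&Batch { encoded_bytes: batch_bytes });
       }
       // Final flush, as closing the writer would do; a no-op if nothing is buffered.
       writer.flush();
       (writer.peak_buffered_bytes, writer.flush_count)
   }
   ```
   
   Because the buffer is always below the limit before a batch is written, the peak stays below the limit plus the size of one batch. For example, `simulate(10 * 1024 * 1024, 3 * 1024 * 1024, 8)` peaks at 12MB against a 10MB limit.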
   
   
   
   **Additional context**
   @tustvold  documented the current behavior better in https://github.com/apache/arrow-rs/pull/5457 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Better memory limiting in parquet `ArrowWriter` [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5484:
URL: https://github.com/apache/arrow-rs/issues/5484#issuecomment-1984912881

   I think we should probably just remove the `buffer_size` argument. I'll file a PR to do this.




Re: [I] Better memory limiting in parquet `ArrowWriter` [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5484:
URL: https://github.com/apache/arrow-rs/issues/5484#issuecomment-1998864209

   `label_issue.py` automatically added labels {'parquet'} from #5457




Re: [I] Better memory limiting in parquet `ArrowWriter` [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #5484: Better memory limiting in parquet `ArrowWriter` 
URL: https://github.com/apache/arrow-rs/issues/5484

