Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/20 01:04:47 UTC

[GitHub] [arrow] VHellendoorn commented on issue #12653: Conversion from one dataset to another that will not fit in memory?

VHellendoorn commented on issue #12653:
URL: https://github.com/apache/arrow/issues/12653#issuecomment-1159856484

   I am noticing the same issue with pyarrow 8.0.0. Memory usage steadily climbs past 10GB while reading batches from a 15GB Parquet file, even with a batch size of 1. Row sizes vary a fair bit in this dataset, but not by enough to account for that much RAM.
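   For concreteness, this is roughly the read pattern involved (a minimal sketch; the file name and the per-batch work are placeholders):
   
   ```python
   import pyarrow.dataset as ds
   
   dataset = ds.dataset("large_file.parquet", format="parquet")
   
   # Default scanner settings: use_threads=True, with batch and
   # fragment readahead enabled.
   for batch in dataset.scanner(batch_size=1).to_batches():
       pass  # memory grows far beyond the size of any single batch
   ```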
   
   For what it's worth, I've found that passing `use_threads=False` as an argument to `scanner` keeps the memory footprint much smaller: it stays below ~3GB in this case, though it still fluctuates a fair bit. That behavior makes sense once you notice that `use_threads=False` implicitly disables both batch and fragment readahead [here](https://github.com/apache/arrow/blob/78fb2edd30b602bd54702896fa78d36ec6fefc8c/cpp/src/arrow/dataset/scanner.h#L90). The performance penalty isn't particularly large, especially with bigger batch sizes, so this may be a reasonable stopgap for anyone who needs to keep memory usage low.
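   Concretely, the workaround looks something like this (again a sketch; the path and batch size are placeholders to adapt to your own data):
   
   ```python
   import pyarrow.dataset as ds
   
   dataset = ds.dataset("large_file.parquet", format="parquet")
   
   # use_threads=False implicitly disables batch and fragment readahead,
   # trading some throughput for a bounded memory footprint.
   scanner = dataset.scanner(batch_size=64 * 1024, use_threads=False)
   
   for batch in scanner.to_batches():
       pass  # per-batch processing goes here
   ```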

