Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/24 04:55:45 UTC

[GitHub] [arrow-datafusion] houqp opened a new issue #1657: ARROW2: Optimize parquet read memory usage

houqp opened a new issue #1657:
URL: https://github.com/apache/arrow-datafusion/issues/1657


   **Describe the bug**
   
   First reported by @ic4y at https://github.com/apache/arrow-datafusion/pull/1556#issuecomment-1012809108.
   
   This is also causing the TPC-H q7 benchmark to fail with an OOM error, see https://github.com/apache/arrow-datafusion/issues/1652#issuecomment-1019622028.
   
   **To Reproduce**
   
   Compare peak memory usage between https://github.com/apache/arrow-datafusion/commit/2008b1dc06d5030f572634c7f8f2ba48562fa636 and https://github.com/apache/arrow-datafusion/commit/c0c9c7231f9c5685fda5fc9294fdc1711384b6fb when processing a Parquet table.
   
   **Expected behavior**
   
   Memory usage should be on par with arrow-rs. Alternatively, arrow2 should offer an option that lets users trade off memory usage against array segmentation.
   
   **Additional context**
   
   Related upstream issue: https://github.com/jorgecarleitao/arrow2/issues/768
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.