Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/01 06:43:01 UTC

[GitHub] [arrow-datafusion] voltcode opened a new issue #464: Question: Can DataFusion handle larger than RAM datasets?

voltcode opened a new issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464


   I browsed the README and slides but failed to grok: can DataFusion handle larger-than-RAM datasets? In other words, if I register multiple Parquet files whose total size exceeds RAM, will they all get loaded into memory, or will DataFusion carefully manage memory buffers to avoid an out-of-memory error?
   
   As an extension of this question, I'd like to ask for pointers on how one can tune DataFusion's resource usage if necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #464: Question: Can DataFusion handle larger than RAM datasets?

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-864033922


   Possibly related: https://github.com/apache/arrow-datafusion/issues/587



[GitHub] [arrow-datafusion] alamb closed issue #464: Question: Can DataFusion handle larger than RAM datasets?

Posted by GitBox <gi...@apache.org>.

alamb closed issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464


   



[GitHub] [arrow-datafusion] alamb edited a comment on issue #464: Question: Can DataFusion handle larger than RAM datasets?

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-864033922


   Possibly related: https://github.com/apache/arrow-datafusion/issues/587 (feature to keep memory limit)



[GitHub] [arrow-datafusion] alamb commented on issue #464: Question: Can DataFusion handle larger than RAM datasets?

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-899594233


   I think this question is answered, so I am closing this ticket.



[GitHub] [arrow-datafusion] alamb commented on issue #464: Question: Can DataFusion handle larger than RAM datasets?

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-854998756


   @voltcode -- DataFusion is, at its core, an in-memory processing system.
   
   That being said, depending on what the plan is doing, simply reading from a large number of parquet files does not necessarily mean they will be decompressed all at once into memory.
   
   DataFusion has several features that keep memory usage down:
   1. It reads only the columns required for the query ("projection pushdown").
   2. It attempts to prune row groups (based on metadata) and skips them entirely if possible.
   3. It has a "streaming" model of computation, so it reads the parquet files into memory in small batches.
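
The three features above can be illustrated with a small toy model. This is only a conceptual sketch in plain Python, not DataFusion code: the row-group dictionaries, `make_row_group`, and `scan` are hypothetical names invented for the illustration, and the "file" is just a list held in memory.

```python
from typing import Dict, Iterator, List


def make_row_group(ids: List[int], names: List[str]) -> dict:
    """Build a toy row group: column data plus min/max statistics for `id`."""
    return {
        "stats": {"id": (min(ids), max(ids))},
        "columns": {"id": ids, "name": names},
    }


def scan(row_groups: List[dict], projection: List[str], min_id: int,
         batch_size: int = 2) -> Iterator[Dict[str, list]]:
    """Yield small batches for a query shaped like `WHERE id >= min_id`.

    Only whole row groups are skipped here; rows inside a surviving
    group are not filtered by this function.
    """
    for group in row_groups:
        _lo, hi = group["stats"]["id"]
        if hi < min_id:
            # Row-group pruning: the statistics prove no row can match.
            continue
        num_rows = len(group["columns"]["id"])
        for start in range(0, num_rows, batch_size):
            # Projection: only the requested columns are materialized.
            # Streaming: one small batch is handed out at a time.
            yield {
                name: group["columns"][name][start:start + batch_size]
                for name in projection
            }


toy_file = [
    make_row_group([1, 2, 3], ["a", "b", "c"]),
    make_row_group([10, 11, 12], ["x", "y", "z"]),
]
result = list(scan(toy_file, ["name"], min_id=5))
```

In this sketch the first row group is skipped entirely because its maximum `id` is below the threshold, the `id` column is never copied into the output, and the consumer sees at most `batch_size` rows at a time.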
   
   Certain operations in DataFusion are likely to consume large amounts of memory, notably "Sort" and "Join" (as well as grouping where there are large numbers of distinct groups).
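
A rough way to see why these operators differ from a plain scan is to compare how much state each one must hold. The sketch below is plain Python and does not reflect how DataFusion implements anything; `make_batches`, `streaming_filter`, `blocking_sort`, and `group_count` are hypothetical helpers written for the illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def make_batches(values: list, batch_size: int) -> Iterator[list]:
    """Split an input into small batches, mimicking a streaming source."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def streaming_filter(batches: Iterable[List[int]],
                     threshold: int) -> Iterator[List[int]]:
    """Holds one batch at a time: memory is bounded by the batch size."""
    for batch in batches:
        yield [v for v in batch if v >= threshold]


def blocking_sort(batches: Iterable[List[int]]) -> List[int]:
    """Must see every row before emitting the first one: memory grows
    with the total input size."""
    buffered: List[int] = []
    for batch in batches:
        buffered.extend(batch)
    buffered.sort()
    return buffered


def group_count(batches: Iterable[list]) -> Counter:
    """Keeps one counter per distinct key: memory grows with the number
    of distinct groups, not with the number of rows."""
    counts: Counter = Counter()
    for batch in batches:
        counts.update(batch)
    return counts


filtered = list(streaming_filter(make_batches([5, 1, 4, 2, 3], 2), 3))
ordered = blocking_sort(make_batches([5, 1, 4, 2, 3], 2))
groups = group_count(make_batches(["a", "b", "a"], 2))
```

The filter can discard each batch as soon as it has been processed, while the sort has to buffer the whole input and the grouping keeps an entry for every distinct key it has seen. A hash join behaves like the grouping case in that one side is typically held in a lookup structure while the other side streams past it.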



[GitHub] [arrow-datafusion] alamb commented on issue #464: Question: Can DataFusion handle larger than RAM datasets?

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-854999085


   I am not sure there is any documentation written about tuning the resource usage of DataFusion -- perhaps @andygrove would know if such documentation exists.

