
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/27 16:19:39 UTC

[GitHub] [arrow] andygrove commented on pull request #8283: ARROW-9707: [Rust] [DataFusion] DataFusion Scheduler Prototype [WIP]

andygrove commented on pull request #8283:
URL: https://github.com/apache/arrow/pull/8283#issuecomment-699655553


   @jorgecarleitao Async/await helps a lot, but we also need our own scheduler to orchestrate how a query is executed. I am going to write up something more detailed with my reasoning on this soon, but here is one example. The TPC-H query I am testing runs against a data set that has 240 Parquet files. If we just try to run everything at once with async/await and let tokio do the scheduling, we end up with 240 files open at once, with reads happening against all of them, which is inefficient. It is better to process the files in batches, with a smaller number open concurrently (better use of page caches, fewer open file handles, etc.).
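
   To make the batching idea concrete, here is a minimal sketch using only the Rust standard library. It is not the scheduler from this PR and does not use DataFusion or tokio APIs. `scan_file`, `scan_in_batches`, and the `max_concurrent` parameter are hypothetical names made up for illustration, and `scan_file` is a stand-in that does no real Parquet I/O. The sketch also reports the peak number of files in flight, so the bound can be checked.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for scanning one Parquet file.
/// A real implementation would open and read the file; here we
/// just return the path length as a fake "row count".
fn scan_file(path: &str) -> usize {
    path.len()
}

/// Scans `paths` in batches of at most `max_concurrent` files.
/// Returns (sum of fake row counts, peak number of files in flight).
fn scan_in_batches(paths: &[String], max_concurrent: usize) -> (usize, usize) {
    assert!(max_concurrent > 0, "max_concurrent must be at least 1");

    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut total = 0;

    // Each chunk is one batch; the next batch starts only after
    // every file in the current batch has finished.
    for batch in paths.chunks(max_concurrent) {
        let handles: Vec<_> = batch
            .iter()
            .cloned()
            .map(|path| {
                let in_flight = Arc::clone(&in_flight);
                let peak = Arc::clone(&peak);
                thread::spawn(move || {
                    // Record how many files are being processed right now.
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    let rows = scan_file(&path);
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                    rows
                })
            })
            .collect();

        for handle in handles {
            total += handle.join().expect("scan thread panicked");
        }
    }

    (total, peak.load(Ordering::SeqCst))
}
```

   With 240 paths and `max_concurrent` set to 8, this runs 30 batches and never has more than 8 files in flight. A real scheduler would probably start the next file as soon as any slot frees up rather than waiting for a whole batch to finish, but the bound on open files is the same.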

