Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/28 18:12:13 UTC

[GitHub] [arrow] alamb commented on pull request #8283: ARROW-9707: [Rust] [DataFusion] DataFusion Scheduler Prototype [WIP]

alamb commented on pull request #8283:
URL: https://github.com/apache/arrow/pull/8283#issuecomment-700197576


   > When I run the TPC-H query I am testing against a data set that has 240 Parquet files. If we just try and run everything at once with async/await and have tokio do the scheduling, we will end up with 240 files open at once with reads happening against all of them, which is inefficient.
   
   One way to avoid this kind of resource-usage explosion is for the Parquet reader itself to limit the number of outstanding `Task`s it submits, for example by using a bounded tokio channel as a pool of permits.
   
   It seems to me the challenge is not really "scheduling" per se, but more "resource allocation".
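
   To illustrate the bounded-concurrency idea, here is a minimal sketch using only the standard library (a fixed worker pool pulling file names from a channel, rather than tokio, which the real reader would use). The file names and the `MAX_OPEN` limit are hypothetical stand-ins, not DataFusion APIs:

   ```rust
   use std::sync::mpsc;
   use std::sync::{Arc, Mutex};
   use std::thread;

   // Hypothetical cap: at most this many files are read concurrently,
   // no matter how many files the scan produces.
   const MAX_OPEN: usize = 8;

   fn main() {
       // Stand-in for the 240 Parquet files mentioned above.
       let files: Vec<String> = (0..240).map(|i| format!("part-{i}.parquet")).collect();

       // A shared queue of work; only MAX_OPEN workers drain it,
       // which bounds the number of files open at once.
       let (tx, rx) = mpsc::channel::<String>();
       let rx = Arc::new(Mutex::new(rx));

       let mut handles = Vec::new();
       for _ in 0..MAX_OPEN {
           let rx = Arc::clone(&rx);
           handles.push(thread::spawn(move || {
               let mut processed = 0usize;
               loop {
                   // Take the next file, or exit once the queue is closed and drained.
                   let file = match rx.lock().unwrap().recv() {
                       Ok(f) => f,
                       Err(_) => break,
                   };
                   // Placeholder for the actual Parquet read.
                   let _ = file;
                   processed += 1;
               }
               processed
           }));
       }

       for f in files {
           tx.send(f).unwrap();
       }
       drop(tx); // close the queue so workers terminate

       let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
       println!("{total}");
   }
   ```

   In an async setting the same shape falls out of something like `tokio::sync::Semaphore` or a bounded `tokio::sync::mpsc` channel: the scheduler still sees all the work, but only a fixed number of reads are in flight at any moment.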


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org