Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/28 00:33:43 UTC

[GitHub] [arrow] alippai edited a comment on pull request #8283: ARROW-9707: [Rust] [DataFusion] DataFusion Scheduler Prototype [WIP]

alippai edited a comment on pull request #8283:
URL: https://github.com/apache/arrow/pull/8283#issuecomment-699695827


   @andygrove I think you now understand all the issues I raised previously. The scheduler proposal and the recent comments regarding concurrency are superb; I think you are on track. Thanks for listening to my newbie concerns.
   
   My only note: per https://github.com/apache/arrow/pull/8283#issuecomment-699655553, you likely want to read a largish partition in "one go". AFAIR HDFS creates Parquet chunks of roughly 128 MB. Reading ~100 MB Parquet files, or large columns holding tens of MBs of data, in a single request will likely increase throughput. On local disks, read sizes above a few MB won't make any difference, but on S3, HDFS, GPFS, or NFS it can be beneficial, since each request pays a round-trip latency cost.
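   To make the "one go" point concrete, here is a toy local-file sketch (not DataFusion code; the file name, the 8 MB "chunk" size, and the 64 KiB small-read size are made-up illustration values). It contrasts one large read with many small reads; on remote storage each of those small reads would be a separate round trip:

   ```rust
   use std::fs::File;
   use std::io::{Read, Write};

   fn main() -> std::io::Result<()> {
       // Hypothetical stand-in for a column chunk of a Parquet partition.
       let path = std::env::temp_dir().join("partition_example.bin");
       let data = vec![42u8; 8 * 1024 * 1024]; // ~8 MB of payload
       File::create(&path)?.write_all(&data)?;

       // One large read: the whole chunk arrives in a single request.
       let mut buf = Vec::with_capacity(data.len());
       File::open(&path)?.read_to_end(&mut buf)?;
       assert_eq!(buf.len(), data.len());

       // Many small reads: locally this is fine, but on S3/HDFS/NFS each
       // 64 KiB request would pay a network round trip, multiplying
       // latency for the same total bytes.
       let mut f = File::open(&path)?;
       let mut small = [0u8; 64 * 1024];
       let mut requests = 0usize;
       loop {
           let n = f.read(&mut small)?;
           if n == 0 {
               break;
           }
           requests += n.div_ceil(small.len()); // counts one request per filled buffer
       }
       println!("small reads needed: {}", requests);
       std::fs::remove_file(&path)?;
       Ok(())
   }
   ```

   With 8 MB of data and 64 KiB reads, that is on the order of 128 requests where one would do, which is where the throughput difference on high-latency storage comes from.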
   
   I couldn't find how the TPC-H Parquet files you test with are structured; could you give me some pointers?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org