Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/15 17:12:26 UTC

[GitHub] [arrow] wesm commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

wesm commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-692852532


   In terms of benchmarking, one issue that strikes me is that it may be faster (especially on machines with many cores -- e.g. 16- or 20-core servers) to read a 2-file dataset (or more generally an n-file dataset where n is less than the number of cores) by reading the files one at a time rather than through the datasets API. How many files do you need before that performance gap goes away? This is something that would be good to quantify in a collection of benchmarks.
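   The benchmark suggested above could be sketched as a small harness that times the two strategies for a given file count n. This is only an illustrative sketch: the reader callables are placeholders that a real benchmark would wire to, e.g., `pyarrow.parquet.read_table` for per-file reads and a `pyarrow.dataset` scan for the datasets API (those names are assumptions here, not part of this PR).

```python
import time

def best_time(fn, repeat=3):
    """Best wall-clock time of fn() over `repeat` runs (reduces noise)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def compare_strategies(paths, read_single_file, read_as_dataset):
    """Time reading `paths` one file at a time vs. as one dataset scan.

    `read_single_file(path)` and `read_as_dataset(paths)` are placeholder
    callables -- in a real benchmark they would wrap pyarrow.parquet.read_table
    and a datasets-API scan, respectively.
    """
    one_at_a_time = best_time(lambda: [read_single_file(p) for p in paths])
    dataset_scan = best_time(lambda: read_as_dataset(paths))
    return {
        "files": len(paths),
        "one_at_a_time_s": one_at_a_time,
        "dataset_scan_s": dataset_scan,
    }
```

   Sweeping `paths` from 1 up to (and past) the machine's core count would show where the crossover happens, i.e. at what n the datasets API stops losing to sequential per-file reads.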


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org