Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/04/29 14:57:45 UTC

[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

wesm commented on pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#issuecomment-621266083


   Yes, we should discuss on the mailing list. 
   
   For the record, IO-related tasks should almost certainly not be using the default global thread pool, which is intended for CPU-intensive tasks. Absent a path forward on sane nested parallelism, we're going to continue to see either highly suboptimal performance or scenarios where we can't use parallelism because of the risk of deadlocks. 
   
   In the meantime, I think we need to create an explicit scheduler API (probably higher level / more abstracted than the current ThreadPool API) so that an application can make sense of the IO tasks that are being issued when reading multiple files in parallel. Presumably this would also extend to the Datasets API. 
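
As a rough illustration of the separation described above, here is a minimal sketch in standard C++. It is not code from this PR and does not use Arrow's ThreadPool API: `SimplePool`, `FakeRead`, `Decode`, and `ReadColumns` are hypothetical names invented for the example. It assumes that blocking reads go to a dedicated IO pool, that decoding goes to a separate CPU pool, and that futures are only waited on from the calling thread, so no pool worker ever blocks on work queued behind it in its own pool.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool for illustration; not Arrow's ThreadPool.
class SimplePool {
 public:
  explicit SimplePool(std::size_t n) {
    assert(n > 0);
    for (std::size_t i = 0; i < n; ++i) {
      workers_.emplace_back([this] { Run(); });
    }
  }

  ~SimplePool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  // Queue a callable and return a future for its result.
  template <typename F>
  auto Submit(F f) -> std::future<decltype(f())> {
    using R = decltype(f());
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
    std::future<R> fut = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return fut;
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // shutting down and nothing left to run
        job = std::move(queue_.front());
        queue_.pop();
      }
      job();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
  std::vector<std::thread> workers_;
  bool done_ = false;
};

// Stand-in for a blocking read of one column chunk; returns fake data.
std::vector<int> FakeRead(int column) { return std::vector<int>(4, column); }

// Stand-in for CPU-bound decoding of the bytes that were read.
int Decode(const std::vector<int>& bytes) {
  return std::accumulate(bytes.begin(), bytes.end(), 0);
}

// Issue all reads on the IO pool, then decode on the CPU pool.
int ReadColumns(SimplePool& io_pool, SimplePool& cpu_pool, int num_columns) {
  std::vector<std::future<std::vector<int>>> reads;
  for (int c = 0; c < num_columns; ++c) {
    reads.push_back(io_pool.Submit([c] { return FakeRead(c); }));
  }
  std::vector<std::future<int>> decoded;
  for (auto& r : reads) {
    std::vector<int> bytes = r.get();  // waited on by the caller, not a worker
    decoded.push_back(cpu_pool.Submit([bytes] { return Decode(bytes); }));
  }
  int total = 0;
  for (auto& d : decoded) total += d.get();
  return total;
}
```

A caller would construct the two pools once, for example `SimplePool io_pool(8); SimplePool cpu_pool(4);`, and pass both to `ReadColumns`. The deadlock risk mentioned above arises when a task running inside a pool submits more work to that same pool and then blocks on the result. Keeping the waits on the caller's thread avoids this in the sketch, but it does not address the general nested-parallelism problem.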


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org