Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/12/03 02:46:00 UTC

[jira] [Created] (ARROW-14974) [C++] Dataset scanning, in async mode, is running parquet reads on the CPU thread pool

Weston Pace created ARROW-14974:
-----------------------------------

             Summary: [C++] Dataset scanning, in async mode, is running parquet reads on the CPU thread pool
                 Key: ARROW-14974
                 URL: https://issues.apache.org/jira/browse/ARROW-14974
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


This is something I noticed while doing some profiling a while back.  When running a scan of a large parquet dataset, many of the read tasks (e.g. I/O reads) were running on the CPU thread pool rather than on the I/O thread pool.  Since those threads then sit blocked waiting on I/O, this could lead to the CPU thread pool being underutilized.

It might not have a large effect on the parquet read itself (if the reads are slow we are probably I/O bound anyway, so one might not notice), but it can cause issues in a more complex query where reading is interleaved with CPU work (like filtering and joining).
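
To illustrate the concern, here is a toy sketch (not Arrow code) of what happens when blocking reads share a small pool with CPU work.  Everything in it is an assumption made for illustration: the two ThreadPoolExecutor instances merely stand in for a CPU pool and an I/O pool, time.sleep stands in for a blocking read, and the names blocking_read, cpu_task and cpu_task_latency are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_read(seconds):
    # Hypothetical stand-in for a blocking I/O read: the thread just waits.
    time.sleep(seconds)
    return "batch"


def cpu_task(x):
    # Hypothetical stand-in for CPU work such as filtering or joining.
    return x * x


def cpu_task_latency(reads_on_cpu_pool, n_reads=4, read_seconds=0.2):
    """Return how long one CPU task waits when reads are queued ahead of it."""
    cpu_pool = ThreadPoolExecutor(max_workers=2)  # toy "CPU thread pool"
    io_pool = ThreadPoolExecutor(max_workers=8)   # toy "I/O thread pool"
    read_pool = cpu_pool if reads_on_cpu_pool else io_pool
    try:
        start = time.monotonic()
        # Queue the reads first, then one piece of CPU work.
        reads = [read_pool.submit(blocking_read, read_seconds)
                 for _ in range(n_reads)]
        cpu_pool.submit(cpu_task, 7).result()
        latency = time.monotonic() - start
        for r in reads:
            r.result()
        return latency
    finally:
        cpu_pool.shutdown()
        io_pool.shutdown()


if __name__ == "__main__":
    print("reads on CPU pool:", round(cpu_task_latency(True), 3), "s")
    print("reads on I/O pool:", round(cpu_task_latency(False), 3), "s")
```

In the shared case the CPU task has to wait behind the sleeping reads even though the cores are idle, which is the kind of interference one would expect when reading is interleaved with filtering or joining.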



--
This message was sent by Atlassian Jira
(v8.20.1#820001)