Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/03/25 18:32:00 UTC

[jira] [Created] (ARROW-12090) [C++] Expose CSV block level readahead as a read option

Weston Pace created ARROW-12090:
-----------------------------------

             Summary: [C++] Expose CSV block level readahead as a read option
                 Key: ARROW-12090
                 URL: https://issues.apache.org/jira/browse/ARROW-12090
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace
            Assignee: Weston Pace


All of the CSV readers today base their I/O readahead on the parallelism of the executor (or 2 for the serial reader).  This is a reasonable default if the I/O is homogeneous but better values could presumably be used for some situations.

For example, if most files are buffered in RAM (and the reader is CPU-bound for these files) but some files are not, then you would want the readahead to be large enough to read the unbuffered files while the CPU-bound work is being done (assuming you are even lucky enough for things to be scheduled that way).
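
A rough sketch of how such an option might resolve to an effective readahead is below. It is not the Arrow implementation: `HypotheticalReadOptions`, its `block_readahead` field, and both helper functions are made-up names used only for illustration. The fallback values follow the defaults described above (executor parallelism for the threaded readers, 2 for the serial reader).

```cpp
#include <cstdint>
#include <optional>

// Hypothetical option struct. `block_readahead` is NOT an existing Arrow
// field; it stands in for the read option proposed in this issue.
struct HypotheticalReadOptions {
  bool use_threads = true;
  // Number of blocks to read ahead. Unset means "keep today's default".
  std::optional<int32_t> block_readahead;
};

// Default described in the issue: the executor's parallelism for the
// threaded readers, or 2 for the serial reader.
inline int32_t DefaultReadahead(bool use_threads, int32_t executor_parallelism) {
  return use_threads ? executor_parallelism : 2;
}

// Resolve the effective readahead: an explicit positive value wins,
// anything else falls back to the default.
inline int32_t EffectiveReadahead(const HypotheticalReadOptions& options,
                                  int32_t executor_parallelism) {
  if (options.block_readahead.has_value() && *options.block_readahead > 0) {
    return *options.block_readahead;
  }
  return DefaultReadahead(options.use_threads, executor_parallelism);
}
```

In the mixed scenario above, a caller would presumably set the hypothetical `block_readahead` well above the executor's parallelism so that reads of unbuffered files can overlap with the CPU-bound parsing of buffered ones.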

This isn't likely to be of much benefit in most situations, though, and it does add yet another option, so I'm not really motivated to do this work until such a situation arises.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)