Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/03/25 18:24:00 UTC

[jira] [Commented] (ARROW-11889) [C++] Add parallelism to streaming CSV reader

    [ https://issues.apache.org/jira/browse/ARROW-11889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308885#comment-17308885 ] 

Antoine Pitrou commented on ARROW-11889:
----------------------------------------

I'll add that this probably means making {{ColumnDecoder}} async (perhaps turning it into a generator).

It would be nice if the solution could also tackle ARROW-11853 at the same time, since both issues will require significant reworking of the {{ColumnDecoder}} internals anyway.

> [C++] Add parallelism to streaming CSV reader
> ---------------------------------------------
>
>                 Key: ARROW-11889
>                 URL: https://issues.apache.org/jira/browse/ARROW-11889
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Currently the streaming CSV reader does not allow for much parallelism. It cannot read more than one segment at a time (which would be useful on S3), and it does not fan out across columns for parsing and converting.
> Both of these options would likely speed up CSV reading in some scenarios, although the benefit may be mostly mitigated when there are many more files than cores (since per-file parallelism will occupy all the cores anyway).
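
To illustrate what "column fan-out" could look like, here is a minimal sketch that uses only the C++ standard library. It is not Arrow's implementation and does not use any Arrow API: ParsedBlock, ConvertColumn and ConvertBlockParallel are hypothetical names made up for this example, every column is assumed to hold integers, and std::async stands in for whatever thread pool or executor a real reader would use.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one parsed CSV block: rows of unconverted cells.
using ParsedBlock = std::vector<std::vector<std::string>>;

// Convert a single column of the block to integers.
// Assumes every row has at least col + 1 cells and every cell is a valid integer.
std::vector<long> ConvertColumn(const ParsedBlock& block, std::size_t col) {
  std::vector<long> out;
  out.reserve(block.size());
  for (const auto& row : block) {
    assert(col < row.size());
    out.push_back(std::stol(row[col]));
  }
  return out;
}

// Fan out: start one conversion task per column, then gather the
// results in column order. The block is only read, so sharing it
// between tasks by reference is safe.
std::vector<std::vector<long>> ConvertBlockParallel(const ParsedBlock& block,
                                                    std::size_t num_cols) {
  std::vector<std::future<std::vector<long>>> tasks;
  tasks.reserve(num_cols);
  for (std::size_t col = 0; col < num_cols; ++col) {
    tasks.push_back(std::async(std::launch::async, [&block, col] {
      return ConvertColumn(block, col);
    }));
  }

  std::vector<std::vector<long>> columns;
  columns.reserve(num_cols);
  for (auto& task : tasks) {
    columns.push_back(task.get());  // blocks until that column is done
  }
  return columns;
}
```

Calling ConvertBlockParallel on a block with two columns runs two conversions concurrently and returns them in column order. A streaming reader would additionally need to keep blocks in order and overlap I/O with decoding, which this sketch does not attempt.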


