Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/06/16 02:00:00 UTC
[jira] [Commented] (ARROW-11889) [C++] Add parallelism to streaming CSV reader

    [ https://issues.apache.org/jira/browse/ARROW-11889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364009#comment-17364009 ] 

Weston Pace commented on ARROW-11889:
-------------------------------------

It'll probably be at least another day (probably more; I'll target the end of the week) before this is PR-ready, but here are some notes:
 * The approach I'm taking is to create one functor for parsing (CSVBlock -> ParsedBlock) and another for decoding (ParsedBlock -> Array), and then hook the whole thing up as an iterator/generator.
 * Since it was already in place, I'll be keeping the per-column parallelism, but I'm also adding parallel readahead (for conversion/decoding) on the batches themselves, so I don't expect per-column parallelism to be strictly necessary for performance.
 * There doesn't seem to be much of a use case for eagerly blocking (i.e. ThreadedBlockReader).  It seems pretty unlikely we will need a multi-threaded parser.  So for the moment I expect I can reuse SerialBlockReader and just ensure it is not pulled async-reentrantly.
 * The column builders & decoders are getting even more similar.  I suspect I could combine the two into a single set of types with a boolean "try_reconvert" flag or something similar.  For example, the decoders already had an array of chunks, although I can't see any reason they needed more than a single chunk.
 * To address this and ARROW-11853, each ParsedBlock will create its own ThreadedTaskGroup.  The future for that parsed block will be completed when all columns have been decoded and any "recode" tasks that were launched by that parsed block have finished.  Finish will be called on each future, so a failure (or a cancellation) should get caught pretty quickly.  The stored futures might still hang around for coordination, but they won't be waited on, so we shouldn't deadlock there.
 * The table readers and the streaming readers are also becoming more and more similar.  They may end up being combined too, with the table readers setting "try_reconvert" to true and having some kind of emplace_into_table step at the end (although this might be a bit tricky with ordering & reconversion).
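
To make the shape of the pipeline in the notes above concrete, here is a rough, standard-library-only sketch.  It is not Arrow code: CsvBlock, ParsedBlock, DecodedBatch, ParseBlock, DecodeBlock and ReadAll are hypothetical stand-ins, std::async stands in for Arrow's thread pool and futures, and every column is assumed to hold integers.  It only illustrates the three ideas: serial parsing, per-column fan-out inside one block, and a bounded number of blocks decoding ahead of the consumer.

```cpp
#include <cstddef>
#include <deque>
#include <future>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the types named in the notes (not Arrow classes).
struct CsvBlock { std::string bytes; };                                 // raw CSV text
struct ParsedBlock { std::vector<std::vector<std::string>> columns; };  // cells per column
struct DecodedBatch { std::vector<std::vector<long>> columns; };        // one array per column

// Parsing functor (CsvBlock -> ParsedBlock): split rows on '\n', cells on ','.
ParsedBlock ParseBlock(const CsvBlock& block) {
  ParsedBlock parsed;
  std::istringstream rows(block.bytes);
  std::string row;
  while (std::getline(rows, row)) {
    if (row.empty()) continue;
    std::istringstream cells(row);
    std::string cell;
    std::size_t col = 0;
    while (std::getline(cells, cell, ',')) {
      if (parsed.columns.size() <= col) parsed.columns.emplace_back();
      parsed.columns[col++].push_back(cell);
    }
  }
  return parsed;
}

// Decode a single column; assumes every cell is an integer.
std::vector<long> DecodeColumn(const std::vector<std::string>& cells) {
  std::vector<long> out;
  out.reserve(cells.size());
  for (const auto& cell : cells) out.push_back(std::stol(cell));
  return out;
}

// Decoding functor (ParsedBlock -> DecodedBatch) with per-column parallelism.
// The batch is complete only once every column task of this block has finished.
DecodedBatch DecodeBlock(const ParsedBlock& parsed) {
  std::vector<std::future<std::vector<long>>> tasks;
  for (const auto& column : parsed.columns) {
    tasks.push_back(std::async(std::launch::async,
                               [&column] { return DecodeColumn(column); }));
  }
  DecodedBatch batch;
  for (auto& task : tasks) batch.columns.push_back(task.get());
  return batch;
}

// Parse blocks serially (the parser is never re-entered) while letting up to
// `readahead` already-parsed blocks decode in the background.  Batches are
// returned in the same order as the input blocks.
std::vector<DecodedBatch> ReadAll(const std::vector<CsvBlock>& blocks,
                                  std::size_t readahead) {
  std::deque<std::future<DecodedBatch>> in_flight;
  std::vector<DecodedBatch> out;
  for (const auto& block : blocks) {
    ParsedBlock parsed = ParseBlock(block);
    in_flight.push_back(std::async(
        std::launch::async,
        [p = std::move(parsed)] { return DecodeBlock(p); }));
    if (in_flight.size() > readahead) {
      out.push_back(in_flight.front().get());  // rethrows a decode failure
      in_flight.pop_front();
    }
  }
  while (!in_flight.empty()) {
    out.push_back(in_flight.front().get());
    in_flight.pop_front();
  }
  return out;
}
```

For example, ReadAll(blocks, 2) keeps at most two blocks decoding while the next one is parsed.  A failed conversion surfaces when that block's future is consumed, which is roughly the "caught pretty quickly" behaviour described above, though without the cancellation or reconversion handling the real reader needs.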

> [C++] Add parallelism to streaming CSV reader
> ---------------------------------------------
>
>                 Key: ARROW-11889
>                 URL: https://issues.apache.org/jira/browse/ARROW-11889
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Currently the streaming CSV reader does not allow for much parallelism.  It doesn't allow for reading more than one segment at once (useful in S3) and it doesn't allow for column fan-out for parsing & converting.
> It seems both of these options would speed up CSV reading in some scenarios although it's possible this is mostly mitigated in cases where there are many more files than cores (as per-file parallelism will occupy all the cores anyways).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)