Posted to jira@arrow.apache.org by "Sergej Fries (Jira)" <ji...@apache.org> on 2020/10/14 18:12:00 UTC
[jira] [Closed] (ARROW-10303) [Rust] Parallel type transformation in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergej Fries closed ARROW-10303.
--------------------------------
Resolution: Feedback Received
> [Rust] Parallel type transformation in CSV reader
> -------------------------------------------------
>
> Key: ARROW-10303
> URL: https://issues.apache.org/jira/browse/ARROW-10303
> Project: Apache Arrow
> Issue Type: Wish
> Components: Rust
> Reporter: Sergej Fries
> Priority: Minor
> Labels: CSVReader
> Attachments: tracing.png
>
>
> Currently, when a CSV file is read, a single thread is responsible both for reading the file and for transforming the returned string values into the correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 seconds. Only ~10% of that time is spent reading the file, while ~68% is spent transforming the string values into the correct data types.
> My proposal is to parallelize the part responsible for the data type transformation.
> This seems quite simple to achieve: after the CSV reader reads a batch, all projected columns are transformed one by one using an iterator over a vector, followed by a map function. I believe that with the rayon crate, the only changes needed are replacing "iter()" with "par_iter()" and changing
>     impl<R: Read> Reader<R>
> into:
>     impl<R: Read + std::marker::Sync> Reader<R>
>
> But maybe I am overlooking something crucial (I am quite new to Rust and Arrow). Any advice from someone experienced is therefore very welcome!
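
The per-column parallelism proposed above can be illustrated with a minimal sketch. It is not the Arrow CSV reader's code: the function names parse_float_column and parse_columns_parallel are hypothetical, the columns are assumed to be already split into string slices, and std::thread::scope (Rust 1.63+) stands in for rayon's par_iter() so that the example needs no external crate. One thread per column is a simplification; rayon would schedule the same work on a thread pool.

```rust
use std::thread;

/// Hypothetical helper: parse one column of raw CSV string values into f64.
/// Returns an error message naming the first value that fails to parse.
fn parse_float_column(raw: &[&str]) -> Result<Vec<f64>, String> {
    raw.iter()
        .map(|s| {
            s.trim()
                .parse::<f64>()
                .map_err(|e| format!("cannot parse {:?}: {}", s, e))
        })
        .collect()
}

/// Hypothetical helper: parse every column on its own scoped thread and
/// return the parsed columns in their original order.
fn parse_columns_parallel(columns: &[Vec<&str>]) -> Result<Vec<Vec<f64>>, String> {
    thread::scope(|scope| {
        // Spawn one worker per column; each worker only borrows its own column.
        let handles: Vec<_> = columns
            .iter()
            .map(|col| scope.spawn(move || parse_float_column(col)))
            .collect();

        // Join in spawn order so the output keeps the column order.
        // A panicked worker is reported as an error instead of propagating.
        handles
            .into_iter()
            .map(|h| {
                h.join()
                    .map_err(|_| "worker thread panicked".to_string())
                    .and_then(|parsed| parsed)
            })
            .collect::<Result<Vec<_>, String>>()
    })
}
```

In this sketch the workers borrow only the already-read string data, never the reader itself, so no extra bound on the reader type is involved here. Whether the Sync bound mentioned above is needed in the actual reader depends on what the parallel closure captures there.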
--
This message was sent by Atlassian Jira
(v8.3.4#803005)