Posted to jira@arrow.apache.org by "Sergej Fries (Jira)" <ji...@apache.org> on 2020/10/14 18:12:00 UTC

[jira] [Closed] (ARROW-10303) [Rust] Parallel type transformation in CSV reader

     [ https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergej Fries closed ARROW-10303.
--------------------------------
    Resolution: Feedback Received

> [Rust] Parallel type transformation in CSV reader
> -------------------------------------------------
>
>                 Key: ARROW-10303
>                 URL: https://issues.apache.org/jira/browse/ARROW-10303
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Rust
>            Reporter: Sergej Fries
>            Priority: Minor
>              Labels: CSVReader
>         Attachments: tracing.png
>
>
> Currently, when a CSV file is read, a single thread is responsible both for reading the file and for transforming the returned string values into the correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 seconds. Of this time, only ~10% is spent reading the file, while ~68% is spent transforming the string values into the correct data types.
> My proposal is to parallelize the part responsible for the data type transformation.
> This seems quite simple to achieve, since after the CSV reader reads a batch, all projected columns are transformed one by one using an iterator over a vector followed by a map function. I believe that with the rayon crate, the only changes would be replacing "iter()" with "par_iter()" and
> changing
> impl<R: Read> Reader<R>
> into:
> impl<R: Read + std::marker::Sync> Reader<R>
>  
> But maybe I am overlooking something crucial (being quite new to Rust and Arrow). Any advice from someone experienced is therefore very welcome!
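
The per-column parallelism proposed in the ticket can be illustrated with a minimal, standard-library-only sketch. It is not the Arrow CSV reader's code: `parse_column` and `parse_columns_parallel` are hypothetical names, a batch is assumed to be a slice of rows of string cells, and every projected column is assumed to be a float column. `std::thread::scope` is used here as a stand-in for rayon's `par_iter()` so that the example needs no external crate. The worker threads only hold shared references to the batch, so the borrowed data has to be `Sync`, which is presumably why the ticket adds a `Sync` bound on `R`.

```rust
use std::thread;

/// Parse one column of a batch into f64 values.
/// Missing or unparsable cells become None (an assumption made for this sketch).
fn parse_column(rows: &[Vec<&str>], col: usize) -> Vec<Option<f64>> {
    rows.iter()
        .map(|row| row.get(col).and_then(|cell| cell.trim().parse::<f64>().ok()))
        .collect()
}

/// Convert every projected column on its own scoped thread.
/// The result keeps the order of `projection`.
fn parse_columns_parallel(rows: &[Vec<&str>], projection: &[usize]) -> Vec<Vec<Option<f64>>> {
    thread::scope(|s| {
        // Spawn one worker per projected column; each borrows the shared batch.
        let handles: Vec<_> = projection
            .iter()
            .map(|&col| s.spawn(move || parse_column(rows, col)))
            .collect();

        // Join in spawn order so the output columns line up with `projection`.
        handles
            .into_iter()
            .map(|h| h.join().expect("column worker panicked"))
            .collect::<Vec<_>>()
    })
}

fn main() {
    // A tiny hypothetical batch: two rows, three columns.
    let rows = vec![vec!["1.5", "2.0", "x"], vec!["3.25", "", "4"]];
    let columns = parse_columns_parallel(&rows, &[0, 2]);
    println!("{:?}", columns);
}
```

Spawning one thread per column is only reasonable for a handful of columns; a work-stealing pool such as rayon's would avoid creating threads for every batch, which is one reason the ticket suggests it.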



--
This message was sent by Atlassian Jira
(v8.3.4#803005)