Posted to issues@arrow.apache.org by "Bogdan Klichuk (Jira)" <ji...@apache.org> on 2019/09/07 23:54:00 UTC

[jira] [Updated] (ARROW-6481) Bad performance of read_csv() with column_types

     [ https://issues.apache.org/jira/browse/ARROW-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bogdan Klichuk updated ARROW-6481:
----------------------------------
    Description: 
Case: a dataset with 20k columns. The number of rows can be 0.

{{pyarrow.csv.read_csv()}} works fine if no convert_options are provided.

It takes around 700ms.

Now I call {{read_csv()}} with a column type mapping that marks 2000 of these columns as string:

{{pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))}}

(K1..K19999 are the column names in the attached dataset.)
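
For reference, here is a minimal self-contained reproduction sketch. The generated file is a stand-in for the attached 20k_cols.csv, assuming a header-only file with K-numbered columns:

{code:python}
# Reproduction sketch: a wide, header-only CSV read with and without
# explicit column_types. The generated file stands in for 20k_cols.csv.
import time

import pyarrow as pa
import pyarrow.csv

NUM_COLS = 20000

# Header-only CSV with 20k columns (names assumed to be K0..K19999).
with open('20k_cols.csv', 'w') as f:
    f.write(','.join('K%d' % i for i in range(NUM_COLS)) + '\n')

# Fast path: no convert_options, all types are inferred (~700ms reported).
start = time.time()
pa.csv.read_csv('20k_cols.csv')
print('no column_types:   %.2fs' % (time.time() - start))

# Slow path: explicit string type for 2000 of the columns.
opts = pa.csv.ConvertOptions(
    column_types={'K%d' % i: pa.string() for i in range(2000)})
start = time.time()
pa.csv.read_csv('20k_cols.csv', convert_options=opts)
print('with column_types: %.2fs' % (time.time() - start))
{code}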

My overall goal is to read everything as string and avoid any type inference (see the sketch at the end of this description).

This takes several minutes and consumes around 4GB of memory.

This doesn't look sane at all.
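
For context, a sketch of the read-everything-as-string approach I am after: grab the header row with the stdlib {{csv}} module and map every column name to {{pyarrow.string()}}. This still goes through the same slow {{column_types}} path described above:

{code:python}
# Workaround sketch: force every column to string by enumerating all
# column names explicitly. Assumes the file has a header row.
import csv

import pyarrow as pa
import pyarrow.csv

# Read just the header row to learn the column names.
with open('20k_cols.csv') as f:
    header = next(csv.reader(f))

# Map every column to string so nothing is inferred.
opts = pa.csv.ConvertOptions(
    column_types={name: pa.string() for name in header})
table = pa.csv.read_csv('20k_cols.csv', convert_options=opts)
print(table.schema)
{code}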

> Bad performance of read_csv() with column_types
> -----------------------------------------------
>
>                 Key: ARROW-6481
>                 URL: https://issues.apache.org/jira/browse/ARROW-6481
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>         Environment: ubuntu xenial
>            Reporter: Bogdan Klichuk
>            Priority: Major
>         Attachments: 20k_cols.csv
>