Posted to issues@arrow.apache.org by "Bogdan Klichuk (Jira)" <ji...@apache.org> on 2019/09/07 23:53:00 UTC

[jira] [Created] (ARROW-6481) Bad performance of read_csv() with column_types

Bogdan Klichuk created ARROW-6481:
-------------------------------------

             Summary: Bad performance of read_csv() with column_types
                 Key: ARROW-6481
                 URL: https://issues.apache.org/jira/browse/ARROW-6481
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
         Environment: ubuntu xenial
            Reporter: Bogdan Klichuk
         Attachments: 20k_cols.csv

Case: a dataset with 20k columns. The number of rows can be 0.

`pyarrow.csv.read_csv()` works reasonably well if no convert_options are provided.

That call takes about 700 ms.

Now I call `read_csv()` with a column_types mapping that marks 2000 of these columns as string.

`pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))`

(K1..K19999 are the column names in the attached dataset.)
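
For convenience, here is the same comparison as a small script (a sketch against pyarrow 0.14.1, as in the environment above; the file name is the attached dataset):

    import pyarrow as pa
    from pyarrow import csv

    # Fast path: no explicit column types, everything is inferred (~700 ms here).
    table = csv.read_csv('20k_cols.csv')

    # Slow path: the same file, but with 2000 columns forced to string via
    # column_types; this is the call that takes several minutes and ~4 GB.
    convert_options = csv.ConvertOptions(
        column_types={'K%d' % i: pa.string() for i in range(2000)})
    table = csv.read_csv('20k_cols.csv', convert_options=convert_options)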

My overall goal is to read every column as string and avoid any type inference.

This takes several minutes and consumes around 4 GB of memory.

This doesn't look sane at all.
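
For reference, this is the kind of all-string read I am ultimately after. A rough sketch only (reading the header separately is just an illustration); it hits the same slow path because it still goes through column_types:

    import csv as stdlib_csv
    import pyarrow as pa
    from pyarrow import csv

    # Read just the header row to get all column names.
    with open('20k_cols.csv', newline='') as f:
        header = next(stdlib_csv.reader(f))

    # Map every column to string so that no type inference happens.
    convert_options = csv.ConvertOptions(
        column_types={name: pa.string() for name in header})
    table = csv.read_csv('20k_cols.csv', convert_options=convert_options)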


