Posted to issues@arrow.apache.org by "Bogdan Klichuk (Jira)" <ji...@apache.org> on 2019/09/07 23:53:00 UTC
[jira] [Created] (ARROW-6481) Bad performance of read_csv() with column_types
Bogdan Klichuk created ARROW-6481:
-------------------------------------
Summary: Bad performance of read_csv() with column_types
Key: ARROW-6481
URL: https://issues.apache.org/jira/browse/ARROW-6481
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.14.1
Environment: ubuntu xenial
Reporter: Bogdan Klichuk
Attachments: 20k_cols.csv
Case: a dataset with 20k columns. The row count can be 0.
`pyarrow.csv.read_csv()` works fine if no convert_options are provided; it takes about 700 ms.
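For reference, the fast baseline is just:

    import pyarrow.csv

    # Reads the attached 20k-column file with default type inference (~700 ms).
    table = pyarrow.csv.read_csv('20k_cols.csv')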
Now I call `read_csv()` with a column_types mapping that marks 2000 of these columns as string:

    pyarrow.csv.read_csv(
        '20k_cols.csv',
        convert_options=pyarrow.csv.ConvertOptions(
            column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))

(K1..K19999 are the column names in the attached dataset.)
My overall task is to read everything as string and avoid any type inference (see the sketch at the end of this report).
This takes several minutes and consumes around 4 GB of memory. That doesn't look sane at all.
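For what it's worth, the end state I need looks like the sketch below. It is only one way to do it: it assumes the header row lists every column name, and the csv-module pre-pass is just a convenient way to collect them before typing every column as string.

    import csv
    import pyarrow
    import pyarrow.csv

    # Pre-pass: read only the header row to collect the column names,
    # so that every column can be typed explicitly and no inference runs.
    with open('20k_cols.csv', newline='') as f:
        column_names = next(csv.reader(f))

    # Type every column as string.
    table = pyarrow.csv.read_csv(
        '20k_cols.csv',
        convert_options=pyarrow.csv.ConvertOptions(
            column_types={name: pyarrow.string() for name in column_names}))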
--
This message was sent by Atlassian Jira
(v8.3.2#803003)