Posted to dev@arrow.apache.org by "Eric Kisslinger (Jira)" <ji...@apache.org> on 2019/11/04 22:08:00 UTC
[jira] [Created] (ARROW-7059) Reading parquet file with many columns is still slow for 0.15.1
Eric Kisslinger created ARROW-7059:
--------------------------------------
Summary: Reading parquet file with many columns is still slow for 0.15.1
Key: ARROW-7059
URL: https://issues.apache.org/jira/browse/ARROW-7059
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment: Linux OS with RHEL 7.7 distribution
blkcqas037:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Reporter: Eric Kisslinger
Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test as in https://issues.apache.org/jira/browse/ARROW-6876, except that I set {{use_threads=False}} to make an apples-to-apples comparison with respect to the number of CPUs.
{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{print(pa.__version__)}}
use_threads=False
{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}
*In 0.14.1 with use_threads=False:*
{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}
*In 0.15.1 with use_threads=False:*
{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)