Posted to issues@arrow.apache.org by "Eric Kisslinger (Jira)" <ji...@apache.org> on 2019/11/05 17:18:00 UTC
[jira] [Commented] (ARROW-7059) [Python] Reading parquet file with
many columns is much slower in 0.15.x versus 0.14.x
[ https://issues.apache.org/jira/browse/ARROW-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967688#comment-16967688 ]
Eric Kisslinger commented on ARROW-7059:
----------------------------------------
I found this workaround, which might help in troubleshooting the issue. Using ParquetFile.reader.read_column() to read each column individually is much faster than ParquetFile.reader.read_all() (158 ms versus 10.3 s wall time in the output below).
{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}
{{width = 10000}}
{{table = pa.table({'c' + str(i): np.random.randn(10) for i in range(width)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{pf = pq.ParquetFile("test_wide.parquet")}}
{{print('--- Using ParquetFile.reader.read_all() ---')}}
{{%time res = pf.reader.read_all(use_threads=False)}}
{{def work_around():}}
{{    arrays = [pf.reader.read_column(i) for i in range(width)]}}
{{    return pa.Table.from_arrays(arrays, names=pf.schema.names)}}
{{print('--- Using ParquetFile.reader.read_column() ---')}}
{{%time res = work_around()}}
{{assert table.equals(res)}}
*Output:*
{{--- Using ParquetFile.reader.read_all() ---}}
{{CPU times: user 10.2 s, sys: 27.2 ms, total: 10.3 s}}
{{Wall time: 10.3 s}}
{{--- Using ParquetFile.reader.read_column() ---}}
{{CPU times: user 149 ms, sys: 9.02 ms, total: 158 ms}}
{{Wall time: 158 ms}}
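For anyone reproducing this outside IPython, where the %time magic is not available, here is a minimal sketch of a stand-in timer. The helper name `timed` is my own (hypothetical) and is not part of pyarrow; it relies only on `time.perf_counter` and `time.process_time` from the standard library, and the demo workload is a placeholder rather than a Parquet read.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func once, print its timings, and return (result, wall_s, cpu_s).

    `timed` is a hypothetical helper for illustration, not a pyarrow API.
    """
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    print(f"--- {label} ---")
    print(f"CPU time: {cpu_s:.3f} s, wall time: {wall_s:.3f} s")
    return result, wall_s, cpu_s


# Placeholder workload so the sketch runs without pyarrow installed.
total, wall_s, cpu_s = timed("sum of squares", lambda n: sum(i * i for i in range(n)), 100_000)
```

In the script above, the two %time lines could then presumably be replaced with `res, _, _ = timed('read_all', pf.reader.read_all, use_threads=False)` and `res, _, _ = timed('read_column', work_around)`. Note that `time.process_time` covers all threads of the process, so CPU time can exceed wall time when threads are in use.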
> [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x
> --------------------------------------------------------------------------------------
>
> Key: ARROW-7059
> URL: https://issues.apache.org/jira/browse/ARROW-7059
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: Linux OS with RHEL 7.7 distribution
> blkcqas037:~$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 32
> On-line CPU(s) list: 0-31
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 79
> Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> Reporter: Eric Kisslinger
> Priority: Major
> Labels: performance
>
> Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test as in https://issues.apache.org/jira/browse/ARROW-6876, except that I set {{use_threads=False}} to make an apples-to-apples comparison with respect to the number of CPUs.
> {{import numpy as np}}
> {{import pyarrow as pa}}
> {{import pyarrow.parquet as pq}}
> {{table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})}}
> {{pq.write_table(table, "test_wide.parquet")}}
> {{res = pq.read_table("test_wide.parquet")}}
> {{print(pa.__version__)}}
> use_threads=False
> {{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}
> *In 0.14.1 with use_threads=False:*
> {{0.14.1}}
> {{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
> {{Wall time: 525 ms}}
> *In 0.15.1 with use_threads=False:*
> {{0.15.1}}
> {{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
> {{Wall time: 9.93 s}}