Posted to jira@arrow.apache.org by "Håkon Magne Holmen (Jira)" <ji...@apache.org> on 2022/10/02 22:05:00 UTC
[jira] [Created] (ARROW-17913) feather.read_table 150x slower when reading columns in newer versions
Håkon Magne Holmen created ARROW-17913:
------------------------------------------
Summary: feather.read_table 150x slower when reading columns in newer versions
Key: ARROW-17913
URL: https://issues.apache.org/jira/browse/ARROW-17913
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 9.0.0, 8.0.0, 7.0.0
Environment: Python 3.9, Ubuntu 20.04
Reporter: Håkon Magne Holmen
h3. Description
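Reading with the {{columns}} argument of {{feather.read_table}} is drastically slower on Arrow 7.0.0 through 9.0.0 than on 6.0.0: about 150x in the benchmarks below (33.8 ms vs. 224 µs). Reading without {{columns}} is essentially unchanged.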
Profiling the code below shows that the bottleneck is somewhere in the {{read_names}} method of {{pyarrow._feather.FeatherReader}}.
h5. Example
Setup code:
{code}
import pandas as pd
from pyarrow import feather
rows, cols = (1_000_000, 10)
data = {f'c{c}': range(rows) for c in range(cols)}
df = pd.DataFrame(data=data)
feather.write_feather(df, 'test.feather', compression="uncompressed")
{code}
Benchmarks Arrow 9.0.0:
{code}
%timeit feather.read_table('test.feather', memory_map=True)
# 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
# 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}
Benchmarks Arrow 6.0.0:
{code}
%timeit feather.read_table('test.feather', memory_map=True)
# 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
# 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
{code}
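For reference, the 150x figure in the summary follows directly from the mean timings above. The short calculation below only restates those numbers; the dictionary layout and variable names are illustrative.

```python
# Mean timings copied from the %timeit output in this report, in seconds.
timings = {
    "9.0.0": {"all": 178e-6, "columns": 33.8e-3},
    "6.0.0": {"all": 173e-6, "columns": 224e-6},
}

# Slowdown of the columns= read between the two versions.
regression = timings["9.0.0"]["columns"] / timings["6.0.0"]["columns"]

# Cost of passing columns= relative to a plain read, per version.
overhead = {v: t["columns"] / t["all"] for v, t in timings.items()}

print(f"columns= read, 9.0.0 vs 6.0.0: {regression:.1f}x slower")
for version, factor in overhead.items():
    print(f"{version}: columns= read is {factor:.1f}x the plain read")
```

This prints a regression of about 150.9x, and shows that passing {{columns}} costs about 1.3x a plain read on 6.0.0 but about 189.9x on 9.0.0.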
--
This message was sent by Atlassian Jira
(v8.20.10#820010)