Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/11/12 15:59:00 UTC
[jira] [Created] (ARROW-10569) [C++][Python] Poor Table filtering performance
Wes McKinney created ARROW-10569:
------------------------------------
Summary: [C++][Python] Poor Table filtering performance
Key: ARROW-10569
URL: https://issues.apache.org/jira/browse/ARROW-10569
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Wes McKinney
Fix For: 3.0.0
From the user mailing list:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np
num_rows = 10_000_000
data = np.random.randn(num_rows)
df = pd.DataFrame({'data{}'.format(i): data
for i in range(100)})
df['key'] = np.random.randint(0, 100, size=num_rows)
rb = pa.record_batch(df)
t = pa.table(df)
# Filtering a record batch performs about the same as filtering the pandas DataFrame:
In [22]: %timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Filtering a table, by contrast, is more than 10x slower than filtering
# the record batch (no idea what's going on here):
In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
[https://lists.apache.org/thread.html/r4d4ffa7935efb2902600b9024859211e53aa6552d43ba0ad83517af5%40%3Cuser.arrow.apache.org%3E]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)