Posted to jira@arrow.apache.org by "Tom Scheffers (Jira)" <ji...@apache.org> on 2021/03/02 21:36:00 UTC
[jira] [Created] (ARROW-11844) [Python] Initial table.take(...) call takes much longer
Tom Scheffers created ARROW-11844:
-------------------------------------
Summary: [Python] Initial table.take(...) call takes much longer
Key: ARROW-11844
URL: https://issues.apache.org/jira/browse/ARROW-11844
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Environment: MacOS, python 3.8, pyarrow=3.0.0
Reporter: Tom Scheffers
The first call to table.take(...) takes significantly longer than subsequent calls with the same arguments. See the code example below.
{code:python}
import time
import numpy as np
import pyarrow as pa
# Create table
size = int(1e6)
ids = np.random.choice(np.arange(size), size=size, replace=False)
t = pa.Table.from_arrays(
    [ids, np.random.randint(0, 100000, size=(size))],
    names=['id', 'salary']
)
# Time five consecutive take() calls with identical indices
for i in range(5):
    start = time.time()
    tf = t.take(list(range(int(1e5))))
    print("Iteration {} took {:4f} seconds".format(i, time.time() - start))
{code}
This prints:
*Iteration 0 took 0.031361 seconds*
Iteration 1 took 0.011474 seconds
Iteration 2 took 0.012330 seconds
Iteration 3 took 0.012391 seconds
Iteration 4 took 0.017687 seconds
Although the slowdown in this example is less severe than what I have seen in other work, it still seems significant. Any clue what is causing this behavior?
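To put a number on "significant", here is a minimal sketch of a timing helper that records each call separately and compares the first call against the median of the rest. The names time_calls, first_call_ratio and reported are hypothetical and are not part of pyarrow; the durations in reported are copied from the output above.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def first_call_ratio(durations):
    """Return the first duration divided by the median of the later ones."""
    if len(durations) < 2:
        raise ValueError("need at least two durations")
    return durations[0] / statistics.median(durations[1:])


# Durations copied from the output printed above (hypothetical variable name).
reported = [0.031361, 0.011474, 0.012330, 0.012391, 0.017687]

# For these numbers the first call is about 2.5 times the median later call.
print("first call / median of rest: {:.2f}".format(first_call_ratio(reported)))
```

With the table from the example, this could be applied as time_calls(lambda: t.take(indices)), where indices = pa.array(np.arange(int(1e5))) is built once beforehand so that converting the Python list is kept out of the timed region. Whether that changes the first-call gap is not something established here.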