Posted to jira@arrow.apache.org by "Tom Scheffers (Jira)" <ji...@apache.org> on 2021/03/02 21:36:00 UTC

[jira] [Created] (ARROW-11844) [Python] Initial table.take(...) call takes much longer

Tom Scheffers created ARROW-11844:
-------------------------------------

             Summary: [Python] Initial table.take(...) call takes much longer
                 Key: ARROW-11844
                 URL: https://issues.apache.org/jira/browse/ARROW-11844
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 3.0.0
         Environment: MacOS, python 3.8, pyarrow=3.0.0
            Reporter: Tom Scheffers


The first call to table.take(...) takes significantly longer than subsequent calls with the same arguments. See the code example below.
{code:python}
import time
import numpy as np
import pyarrow as pa

# Create table
size = int(1e6)
ids = np.random.choice(np.arange(size), size=size, replace=False)
t = pa.Table.from_arrays(
    [ids, np.random.randint(0, 100000, size=(size))],
    names=['id', 'salary']
)

# Time five identical take() calls; the index list is rebuilt on every iteration
for i in range(5):
    start = time.time()
    tf = t.take(list(range(int(1e5))))
    print("Iteration {} took {:.6f} seconds".format(i, time.time() - start))
{code}
This prints:
*Iteration 0 took 0.031361 seconds*
Iteration 1 took 0.011474 seconds
Iteration 2 took 0.012330 seconds
Iteration 3 took 0.012391 seconds
Iteration 4 took 0.017687 seconds

Although the slowdown in this example is less severe than what I have seen in other workloads, it is still significant: the first call is roughly two to three times slower than the later ones. Any clue what is causing this behavior?
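
To put a number on the gap, here is a minimal sketch of a timing helper. It is an illustration, not part of pyarrow: the names time_calls and first_call_ratio are made up for this example, and it uses time.perf_counter rather than time.time. It only measures the gap and says nothing about the cause. The last lines apply it to the timings printed above.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def first_call_ratio(durations):
    """Return the first duration divided by the median of the remaining ones."""
    if len(durations) < 2:
        raise ValueError("need at least two timings")
    return durations[0] / statistics.median(durations[1:])


# Timings copied from the output printed above (seconds).
reported = [0.031361, 0.011474, 0.012330, 0.012391, 0.017687]
print("first call / median of later calls: {:.2f}".format(first_call_ratio(reported)))
```

With the table from the example, something like time_calls(lambda: t.take(indices)) could be used, where indices = pa.array(np.arange(int(1e5))) is built once outside the timed call. That keeps the Python list conversion out of the measurement, which may help show whether the extra time is spent in take itself or in preparing the indices.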



--
This message was sent by Atlassian Jira
(v8.3.4#803005)