Posted to jira@arrow.apache.org by "Paul Balanca (Jira)" <ji...@apache.org> on 2020/12/22 15:59:00 UTC

[jira] [Created] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

Paul Balanca created ARROW-11006:
------------------------------------

             Summary: [Python] Array to_numpy slow compared to Numpy.view
                 Key: ARROW-11006
                 URL: https://issues.apache.org/jira/browse/ARROW-11006
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Paul Balanca
            Assignee: Paul Balanca


The method `to_numpy` is quite slow compared to NumPy slicing and viewing performance. For instance:
{code:python}
N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The previous benchmark is clearly an extreme case, but the idea is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting a view should be as small as possible (there are scenarios where you can't easily cache this view, so you end up calling `to_numpy` a fair number of times).
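As a point of comparison, for a dense primitive array without nulls, a much cheaper zero-copy view can be sketched by wrapping the Arrow data buffer directly (a minimal sketch, assuming the standard primitive-array layout where `buffers()[0]` is the validity bitmap and `buffers()[1]` holds the values):

{code:python}
import numpy as np
import pyarrow as pa

np_arr = np.arange(1000000, dtype=np.int64)
pa_arr = pa.array(np_arr)

# For a primitive array, buffers() returns [validity_bitmap, data_buffer].
data_buf = pa_arr.buffers()[1]
# Wrap the raw buffer as a NumPy array (zero-copy, via the buffer
# protocol), then account for any slice offset on the Arrow array.
view = np.frombuffer(data_buf, dtype=np.int64)[pa_arr.offset:pa_arr.offset + len(pa_arr)]
{code}

This skips the generic conversion machinery entirely, which suggests most of the `to_numpy` cost is overhead rather than anything fundamental.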

I would guess that a large part of this overhead comes from PyArrow implementing a very generic Pandas conversion path, and using that path even for very simple dense NumPy-like arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)