Posted to jira@arrow.apache.org by "Paul Balanca (Jira)" <ji...@apache.org> on 2020/12/22 16:02:00 UTC
[jira] [Updated] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

     [ https://issues.apache.org/jira/browse/ARROW-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Balanca updated ARROW-11006:
---------------------------------
    Description: 
The `to_numpy` method is quite slow compared to NumPy slicing and viewing. For instance:
{code:java}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the point is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting the data should be as low as possible (there are scenarios where you cannot easily cache this view, so you end up calling `to_numpy` many times).
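
Since `%timeit` is an IPython magic, here is a rough sketch of how the same comparison could be run from a plain Python script with the standard `timeit` module. The helper names `per_call_microseconds` and `compare_view_and_to_numpy` are made up for illustration. The second function imports NumPy and PyArrow lazily so that the timing helper stays usable on its own. For reference, dividing the figures above by N gives roughly 0.25 µs per `view()` call and 1.2 µs per `to_numpy` call, list-building overhead included.

```python
import timeit


def per_call_microseconds(fn, number=100_000, repeat=5):
    """Best-of-`repeat` average duration of a single call to `fn`, in microseconds."""
    best_total = min(timeit.repeat(fn, number=number, repeat=repeat))
    return best_total / number * 1e6


def compare_view_and_to_numpy(n=1_000_000):
    """Time `ndarray.view()` against `Array.to_numpy()` on the same data."""
    # Imported here so the timing helper above works without either library.
    import numpy as np
    import pyarrow as pa

    np_arr = np.arange(n)
    pa_arr = pa.array(np_arr)
    return {
        "np_arr.view()": per_call_microseconds(np_arr.view),
        "pa_arr.to_numpy()": per_call_microseconds(
            lambda: pa_arr.to_numpy(zero_copy_only=True)
        ),
    }
```

Calling `compare_view_and_to_numpy()` returns a dict of per-call timings. The absolute numbers will depend on the machine and on the installed NumPy and PyArrow versions.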

I believe that a big part of this overhead is due to PyArrow implementing a very generic pandas conversion and using it even for very simple NumPy-like dense arrays.

Many projects rely on PyArrow <=> NumPy interaction, and I think most of them would prefer not to pay any additional cost for pandas compatibility. In this particular case, it could be valuable to implement a direct NumPy conversion method for some Array subclasses (starting with the simple `NumericArray`).


  was:
The `to_numpy` method is quite slow compared to NumPy slicing and viewing. For instance:
{code:java}
N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the point is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting the data should be as low as possible (there are scenarios where you cannot easily cache this view, so you end up calling `to_numpy` many times).

I believe that part of this overhead is probably due to PyArrow implementing a very generic pandas conversion and using it even for very simple NumPy-like dense arrays.


> [Python] Array to_numpy slow compared to Numpy.view
> ---------------------------------------------------
>
>                 Key: ARROW-11006
>                 URL: https://issues.apache.org/jira/browse/ARROW-11006
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Paul Balanca
>            Assignee: Paul Balanca
>            Priority: Minor
>


