Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/12/22 19:44:00 UTC

[jira] [Commented] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

    [ https://issues.apache.org/jira/browse/ARROW-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253716#comment-17253716 ] 

Wes McKinney commented on ARROW-11006:
--------------------------------------

Here we are looking at microperformance of 250 nanoseconds versus 1200 nanoseconds. I do not believe this code path has been optimized for nanosecond-level microperformance. If this is important to you, you are free to investigate and propose improvements (and add some benchmarks to our ASV suite so we can keep an eye on this). 
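
For reference, a minimal sketch of what such an ASV benchmark could look like (the class name and its placement within the pyarrow benchmark suite are assumptions here, not the actual layout):

{code:python}
import numpy as np
import pyarrow as pa


# Hypothetical ASV benchmark class; name and module placement are illustrative only.
class ToNumpyConversion:
    """Micro-benchmark comparing a plain NumPy view with Array.to_numpy."""

    def setup(self):
        # One million int64 values, converted once so only the call under test is timed.
        self.np_arr = np.arange(1_000_000)
        self.pa_arr = pa.array(self.np_arr)

    def time_numpy_view(self):
        self.np_arr.view()

    def time_to_numpy_zero_copy(self):
        self.pa_arr.to_numpy(zero_copy_only=True)
{code}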

> [Python] Array to_numpy slow compared to Numpy.view
> ---------------------------------------------------
>
>                 Key: ARROW-11006
>                 URL: https://issues.apache.org/jira/browse/ARROW-11006
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Paul Balanca
>            Assignee: Paul Balanca
>            Priority: Minor
>
> The method `to_numpy` is quite slow compared to Numpy slice and view performance. For instance:
> {code:python}
> import numpy as np
> import pyarrow as pa
>
> N = 1000000
> np_arr = np.arange(N)
> pa_arr = pa.array(np_arr)
> %timeit l = [np_arr.view() for _ in range(N)]
> 251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
> 1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> {code}
> The previous benchmark is clearly an extreme case, but the idea is that for any operation not available in PyArrow, falling back on Numpy is a good option, and the cost of extraction should be as minimal as possible (there are scenarios where you can't easily cache this view, so you end up calling `to_numpy` a fair amount of times).
> I would believe that a big part of this overhead is due to PyArrow implementing a very generic Pandas conversion, and using this one even for very simple Numpy-like dense arrays.
> There are a lot of projects using PyArrow <=> Numpy interaction where I think most users would be interested in not paying any additional Pandas-compatibility cost. In this particular case, it could be valuable to implement a direct Numpy conversion method for some Array subclasses (starting with the simple `NumericArray`), as sketched below.
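> As a rough illustration of what such a direct path could avoid, here is a minimal zero-copy sketch for a primitive array without nulls, built only on the public buffer APIs (the helper name is hypothetical and error handling is omitted):
> {code:python}
> import numpy as np
> import pyarrow as pa
>
>
> def primitive_to_numpy(arr: pa.Array) -> np.ndarray:
>     """Zero-copy view of a primitive Arrow array's data buffer.
>
>     Hypothetical helper, not a pyarrow API; assumes a fixed-width
>     numeric type and no nulls.
>     """
>     if arr.null_count != 0:
>         raise ValueError("nulls require a copy or a masked representation")
>     # buffers() for a primitive array is [validity_bitmap, data]
>     data_buf = arr.buffers()[1]
>     dtype = np.dtype(arr.type.to_pandas_dtype())
>     view = np.frombuffer(data_buf, dtype=dtype)
>     # Honor a possible non-zero offset (e.g. from slicing).
>     return view[arr.offset : arr.offset + len(arr)]
> {code}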



--
This message was sent by Atlassian Jira
(v8.3.4#803005)