Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/06/07 15:50:00 UTC
[jira] [Comment Edited] (ARROW-12976) [Python] Arrow-to-Python conversion is slow

    [ https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358676#comment-17358676 ] 

Joris Van den Bossche edited comment on ARROW-12976 at 6/7/21, 3:49 PM:
------------------------------------------------------------------------

Actually, the APIs are already available to avoid the Scalar creation, for example for Int64Array:

{code}
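    def to_pylist(self):
        cdef:
            # View the underlying C++ array pointer (self.ap) as an Int64Array
            CInt64Array* int_arr = (<CInt64Array*> self.ap)
            int64_t val

        res = []
        for i in range(len(self)):
            # Read each value directly from the array, without creating a Scalar
            if int_arr.IsValid(i):
                val = int_arr.Value(i)
                res.append(val)
            else:
                # Null slots become None
                res.append(None)
        return res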
{code}

This gives me 13 µs for the example case, which is now almost the same as NumPy's {{tolist}}.
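
To illustrate why skipping the per-element wrapper matters, here is a rough pure-Python model. It is not pyarrow code: {{ModelScalar}}, {{to_pylist_via_wrappers}} and {{to_pylist_direct}} are hypothetical names made up for this sketch, and the timings it prints say nothing about the real Cython numbers. It only shows that allocating one temporary object per element adds overhead compared with reading values straight from storage.

```python
import timeit


class ModelScalar:
    """Hypothetical stand-in for a per-element wrapper (NOT pyarrow's Scalar)."""

    __slots__ = ("value", "is_valid")

    def __init__(self, value, is_valid):
        self.value = value
        self.is_valid = is_valid

    def as_py(self):
        return self.value if self.is_valid else None


def to_pylist_via_wrappers(values, valid):
    # One temporary wrapper object is allocated for every element.
    return [ModelScalar(v, ok).as_py() for v, ok in zip(values, valid)]


def to_pylist_direct(values, valid):
    # Values are read straight from the underlying storage.
    return [v if ok else None for v, ok in zip(values, valid)]


if __name__ == "__main__":
    values = list(range(1000))
    valid = [i % 10 != 0 for i in range(1000)]  # every 10th slot is null

    # Both paths must produce the same list.
    assert to_pylist_via_wrappers(values, valid) == to_pylist_direct(values, valid)

    for fn in (to_pylist_via_wrappers, to_pylist_direct):
        best = min(timeit.repeat(lambda: fn(values, valid), number=200, repeat=5))
        print(f"{fn.__name__}: {best / 200 * 1e6:.1f} µs per call")
```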

This might well be worth including. Are there ways in Cython to avoid having to duplicate this in each of Int8/Int16/Int... Array? The problem is that the return type of {{NumericArray::Value(i)}} is type-dependent.
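
One way to picture a type-generic version, sketched in pure Python rather than Cython: the element type is passed in as a parameter (here a {{struct}} format code) instead of being baked into the function. Everything in this sketch is an assumption made for illustration, including the name {{to_pylist_generic}} and the raw-bytes buffer layout; it is not how pyarrow is implemented. The only Arrow-specific detail relied on is that validity bitmaps use least-significant-bit-first ordering. In Cython itself, fused types or a small C++ template helper might play the same role, but that is untested here.

```python
import struct
from typing import Optional


def to_pylist_generic(values: bytes, validity: Optional[bytes],
                      length: int, fmt: str) -> list:
    """Convert a fixed-width values buffer plus validity bitmap to a list.

    fmt is a struct format code, e.g. "b" (int8), "h" (int16),
    "i" (int32), "q" (int64) or "d" (float64).
    """
    # Decode all slots in one call; "<" selects little-endian, standard sizes.
    decoded = struct.unpack_from(f"<{length}{fmt}", values)
    if validity is None:
        # No bitmap means every slot is valid.
        return list(decoded)

    res = []
    for i, val in enumerate(decoded):
        # Bit i of the bitmap (least-significant bit first) is 1 for valid slots.
        if (validity[i >> 3] >> (i & 7)) & 1:
            res.append(val)
        else:
            res.append(None)
    return res


if __name__ == "__main__":
    values = struct.pack("<4h", 10, 20, 30, 40)   # four int16 values
    validity = bytes([0b00001011])                # slots 0, 1, 3 valid; slot 2 null
    print(to_pylist_generic(values, validity, 4, "h"))  # [10, 20, None, 40]
```

The same function handles any fixed-width numeric type by changing only {{fmt}}, which is the kind of single code path the question above is asking about.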


> [Python] Arrow-to-Python conversion is slow
> -------------------------------------------
>
>                 Key: ARROW-12976
>                 URL: https://issues.apache.org/jira/browse/ARROW-12976
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> It seems that we are 20x slower than Numpy for converting the exact same data to a Python list.
> With integers:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}


