Posted to issues@arrow.apache.org by "Robert Nishihara (JIRA)" <ji...@apache.org> on 2017/11/24 20:41:00 UTC

[jira] [Commented] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

    [ https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265525#comment-16265525 ] 

Robert Nishihara commented on ARROW-1854:
-----------------------------------------

Your numbers are much better than what I'm seeing. The poor performance appears to come from our handling of lists: since pyarrow serializes an object-dtype numpy array by first converting it to a list and then serializing that list, the array case can't be faster than the list case.

{code}
import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
130 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
127 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}
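
For what it's worth, here is a rough sketch of the "offload to pickle" idea done on the caller side (this is not how pyarrow handles object arrays internally, just a hypothetical wrapper: pre-pickle the array and serialize the resulting bytes, which sidesteps the ndarray -> list conversion):

{code}
import pickle
import numpy as np
import pyarrow as pa

arr = np.array(['foo', 'bar', None] * 100000, dtype=object)

# Sketch only: pickle the object-dtype ndarray up front and let pyarrow
# serialize the resulting bytes instead of the array itself.
payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
buf = pa.serialize(payload).to_buffer()

# Round-trip: deserialize the bytes, then unpickle to recover the ndarray.
restored = pickle.loads(pa.deserialize(buf))
assert restored.dtype == object
{code}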

> [Python] Improve performance of serializing object dtype ndarrays
> -----------------------------------------------------------------
>
>                 Key: ARROW-1854
>                 URL: https://issues.apache.org/jira/browse/ARROW-1854
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>             Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these statements to have roughly the same performance (offloading the ndarray serialization to pickle):
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on ARROW-1783, but it can likely be resolved independently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)