Posted to jira@arrow.apache.org by "Clark Zinzow (Jira)" <ji...@apache.org> on 2022/09/28 17:03:00 UTC

[jira] [Comment Edited] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

    [ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610662#comment-17610662 ] 

Clark Zinzow edited comment on ARROW-10739 at 9/28/22 5:02 PM:
---------------------------------------------------------------

[~jorisvandenbossche] I did a quick implementation of (2), where the Arrow IPC format is used under the hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the {{RecordBatch}} wrapper adds ~230 extra bytes to the pickled payload (per {{Array}} chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (serialized payloads an order of magnitude larger). We could sidestep this issue by having {{Table}}, {{RecordBatch}}, and {{ChunkedArray}} port their {{__reduce__}} to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there will still be a baseline ~230 byte bloat for {{ChunkedArray}} and {{Array}} that we might find untenable.

I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger impact on the Arrow code, and we'll continue to have two serialization paths to maintain, but it shouldn't result in any serialized payload bloat.



> [Python] Pickling a sliced array serializes all the buffers
> -----------------------------------------------------------
>
>                 Key: ARROW-10739
>                 URL: https://issues.apache.org/jira/browse/ARROW-10739
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Maarten Breddels
>            Assignee: Alessandro Molina
>            Priority: Critical
>             Fix For: 10.0.0
>
>
> If a large array is sliced and then pickled, it seems the full buffer is serialized. This leads to excessive memory usage and data transfer when using multiprocessing or dask.
> {code:java}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 700004
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> # NumPy, for instance:
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know arrow, but kind of unexpected as a user.
> Is there a workaround for this? For instance copy an arrow array to get rid of the offset, and trim the buffers?


