You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Brian Hulette via dev <de...@beam.apache.org> on 2022/09/19 23:15:07 UTC

Re: Out of band pickling in Python (pickle5)

I got to thinking about this again and ran some benchmarks. The result is
documented in the GitHub issue [1].

tl;dr: we can't realize a huge benefit since we don't actually have an
out-of-band path for exchanging the buffers. However, pickle 5 can yield
improved in-band performance as well, and I think we can take advantage of
this with some relatively simple adjustments to PickleCoder and
OutputStream.

[1] https://github.com/apache/beam/issues/20900#issuecomment-1251658001
[2] https://peps.python.org/pep-0574/#improved-in-band-performance

On Thu, May 27, 2021 at 5:15 PM Stephan Hoyer <sh...@google.com> wrote:

> I'm unlikely to have bandwidth to take this one on, but I do think it
> would be quite valuable!
>
> On Thu, May 27, 2021 at 4:42 PM Brian Hulette <bh...@google.com> wrote:
>
>> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
>> you have any interest in taking it on?
>>
>> On Tue, May 25, 2021 at 3:09 PM Brian Hulette <bh...@google.com>
>> wrote:
>>
>>> Hm this would definitely be of interest for the DataFrame API, which is
>>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>>> that pandas supports out-of-band pickling since DataFrames are mostly just
>>> collections of numpy arrays.
>>>
>>> Brian
>>>
>>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>>
>>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <sh...@google.com> wrote:
>>>
>>>> Beam's PickleCoder would need to be updated to pass the
>>>> "buffer_callback" argument into pickle.dumps() and the "buffers" argument
>>>> into pickle.loads(). I expect this would be relatively straightforward.
>>>>
>>>> Then it should "just work", assuming that data is stored in objects
>>>> (like NumPy arrays or wrappers of NumPy arrays) that implement the
>>>> out-of-band Pickle protocol.
>>>>
>>>>
>>>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>>
>>>>> I'm not aware of anyone looking at it.
>>>>>
>>>>> Will out-of-band pickling "just work" in Beam for types that implement
>>>>> the correct interface in Python 3.8?
>>>>>
>>>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <ev...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> FWIW I recently ran into the exact case you described (high
>>>>>> serialization cost). The solution was to implement some not-so-intuitive
>>>>>> alternative transforms in my case, but I would have very much appreciated
>>>>>> faster serialization performance.
>>>>>>
>>>>>> Thanks,
>>>>>> Evan
>>>>>>
>>>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <sh...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>>>>>> i.e., Pickle protocol version 5?
>>>>>>> https://www.python.org/dev/peps/pep-0574/
>>>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>>>>
>>>>>>> For Beam pipelines passing around NumPy arrays (or collections of
>>>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization costs
>>>>>>> can be significant. Beam seems to currently incur at least one one (maybe
>>>>>>> two) unnecessary memory copies.
>>>>>>>
>>>>>>> Pickle protocol version 5 exists for solving exactly this problem.
>>>>>>> You can serialize collections of arbitrary Python objects in a fully
>>>>>>> streaming fashion using memory buffers. This is a Python 3.8 feature, but
>>>>>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>>>>>>> been supported by NumPy since version 1.16, released in January 2019.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephan
>>>>>>>
>>>>>>