You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Stephan Hoyer <sh...@google.com> on 2021/06/08 21:22:52 UTC
Re: Consider Cloudpickle instead of dill for Python pickling

To give a quick update here: I did look into what it could look like to
swap out dill -> cloudpickle internally in Beam.

In principle, making the switch would be easy: literally just swap out
"dill" in favor of "cloudpickle". Both use the same loads/dumps API.

The hard part is figuring out how to make this a configuration option.
pickler.loads() and pickler.dumps() are used all over the Beam Python
codebase, including very far away from the Pipeline object on which
configurable options live, e.g., on pvalues, transforms, coders, runners
etc. It's not obvious to me what the right way to pass on this
configuration state is, and if it needs to be done explicitly, it seems
like it's going to be a lot of work.

On Mon, May 3, 2021 at 4:11 PM Ahmet Altay <al...@google.com> wrote:

> I agree with Robert on this one. With the exception of DillCoder, it might
> be reasonable to conditionally support both. (On a related note, I only see
> one use of DillCoder, do we really need that coder?)
>
> On Mon, May 3, 2021 at 5:01 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>> My 2 contradictory cents from even further back in the peanut gallery:
>>
>>  - Pickle/dill/cloudpickle/etc are most suitable for transmission, not
>> storage, so changing is a lesser breaking change. But there still might be
>> streaming pipelines that are using it can cannot be updated.
>>
>
> I believe the problem arises from transmission between a client and a
> worker when they are not using compatible libraries. Much history here (
> https://github.com/uqfoundation/dill/issues/341) but I do not think dill
> is necessarily at fault here. I think we will need to ensure using
> compatible libraries in both environments.
>
>
>>  - A serialization library with an unstable/breaking serialization format
>> is not production-ready. If open version ranges are unsafe, that is an
>> indication that it is not ready for use.
>>
>
> It was a bug and it happened twice according to the maintainer. (See the
> issue above). I am not sure it will be much better with a different library.
>
>
>>  - We should use whatever the rest of the world uses. That is more
>> important than either of the above two points.
>>
>
> Both are similarly popular. Looks like dill is a bit (50%) more popular (
> https://libraries.io/pypi/dill vs https://libraries.io/pypi/cloudpickle)
> and has a more recent release.
>
>
>>
>> Kenn
>>
>> On Sat, May 1, 2021 at 5:15 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Just my 2 cents  comment from the users perspective.
>>>
>>> In Airflow, the narrow limits of `dill` caused some problems with
>>> dependencies. We had to add some exceptions in our process for that:
>>> https://github.com/apache/airflow/blob/master/Dockerfile#L246
>>> https://github.com/apache/airflow/blob/master/Dockerfile.ci#L271  - so
>>> the problem is largely solved for now, but if dill would be used by any
>>> different library it could be a problem. I imagine cloudpickle is more
>>> frequently used than dill, so it might become a problem if those
>>> dependencies are narrowly defined.
>>>
>>> Currently cloudpickle for Airflow is already pulled in by
>>> Dask's  "distributed"  library (but they have just > limits there):
>>>
>>> distributed==2.19.0
>>>   - click [required: >=6.6, installed: 7.1.2]
>>>   - cloudpickle [required: >=1.3.0, installed: 1.4.1]
>>>   - dask [required: >=2.9.0, installed: 2021.4.1]
>>>     - cloudpickle [required: >=1.1.1, installed: 1.4.1]
>>>     - fsspec [required: >=0.6.0, installed: 2021.4.0]
>>>
>>> However, I have a better idea - why don't you simply vendor-in either
>>> `dill` or `cloudpickle` (I am not sure which one is best) ?
>>>
>>> Since you are not planning to upgrade it often (that's the whole point
>>> of narrow versioning), you can have the best of both worlds - stable
>>> version used in both client/server AND you would not be limiting others.
>>>
>>> J.
>>>
>>>
>>> On Fri, Apr 30, 2021 at 9:42 PM Stephan Hoyer <sh...@google.com> wrote:
>>>
>>>> Glad to hear this is something you've open to and in fact have already
>>>> considered :)
>>>>
>>>> I may give implementing this a try, though I'm not familiar with how
>>>> configuration options are managed in Beam, so that may be easier for a core
>>>> developer to deal with.
>>>>
>>>> On Fri, Apr 30, 2021 at 10:58 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> As I've mentioned before, I would be in favor of moving to cloudpicke,
>>>>> first as an option, and if that works out well as the default. In
>>>>> particular, pickling functions from the main session in a hermetic (if
>>>>> sometimes slightly bulkier way) way as opposed to the main session
>>>>> pickling gymnastics is far preferable (especially for interactive).
>>>>>
>>>>> Versioning is an issue in general, and a tradeoff between the
>>>>> overheads of re-building the worker every time (either custom
>>>>> containers or at runtime) vs. risking different versions, and we could
>>>>> possibly do better more generally on both fronts (as well as making
>>>>> this tradeoff clear). Fair point that Cloudpickle is less likely to
>>>>> just work with pinning. On the other hand, Cloudpickle looks fairly
>>>>> mature/stable at this point, so hopefully it wouldn't be too hard to
>>>>> keep our containers closet to head. If there is an error, we could
>>>>> consider catching it and raising a more explicit message about the
>>>>> version things were pickled vs. unpickled with.
>>>>>
>>>>> I would welcome as a first step a PR that conditionally allows the use
>>>>> of CloudPickle in the place of Dill (with the exception of DillCoder,
>>>>> there should of course probably be a separate CloudPickleCoder).
>>>>>
>>>>> On Fri, Apr 30, 2021 at 10:17 AM Valentyn Tymofieiev
>>>>> <va...@google.com> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette <bh...@google.com>
>>>>> wrote:
>>>>> >>
>>>>> >> > I think with cloudpickle we will not be able have a tight range.
>>>>> >>
>>>>> >> If cloudpickle is backwards compatible, we should be able to just
>>>>> keep an upper bound in setup.py [1] synced up with a pinned version in
>>>>> base_image_requirements.txt [2], right?
>>>>> >
>>>>> >
>>>>> > With an upper bound only, dependency resolver could still downgrade
>>>>> pickler on the runner' side, ideally we should be detecting that.
>>>>> >
>>>>> > Also if we ever depend on a newer functionality, we would add a
>>>>> lower bound as well, which (for that particular Beam release), makes it a
>>>>> tight bound, so potentially a friction point.
>>>>> >
>>>>> >>
>>>>> >>
>>>>> >> > We could solve this problem by passing the version of pickler
>>>>> used at job submission
>>>>> >>
>>>>> >> A bit of a digression, but it may be worth considering something
>>>>> more general here, for a couple of reasons:
>>>>> >> - I've had a similar concern for the Beam DataFrame API. Our goal
>>>>> is for it to match the behavior of the pandas version used at construction
>>>>> time, but we could get into some surprising edge cases if the version of
>>>>> pandas used to compute partial results in the SDK harness is different.
>>>>> >> - Occasionally we have Dataflow customers report
>>>>> NameErrors/AttributeErrors that can be attributed to a dependency mismatch.
>>>>> It would be nice to proactively warn about this.
>>>>> >>
>>>>> >>
>>>>> >> That being said I imagine it would be hard to do something truly
>>>>> general since every dependency will have different compatibility guarantees.
>>>>> >>
>>>>> > I think it should be considered a best practice to have matching
>>>>> dependencies on job submission and execution side. We can:
>>>>> > 1)  consider sending a manifest of all locally installed
>>>>> dependencies to the runner and verify on the runner's side that critical
>>>>> dependencies are compatible.
>>>>> > 2) help make it easier to ensure the dependencies match:
>>>>> >   - leverage container prebuilding workflow to construct Runner's
>>>>> container on the SDK side, with the knowledge of locally-installed
>>>>> dependency versions.
>>>>> >   - document how to launch pipeline from the SDK container,
>>>>> especially for pipelines using a custom container. This would guarantee
>>>>> exact match of dependencies. This can also prevent Python minor version
>>>>> mismatch. Some runners can make it easier with features like Dataflow Flex
>>>>> Templates.
>>>>> >
>>>>> >
>>>>> >>
>>>>> >> [1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
>>>>> >> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>>>> >>
>>>>> >> On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>> >>>
>>>>> >>> Hi Stephan,
>>>>> >>>
>>>>> >>> Thanks for reaching out. We first considered switching to
>>>>> cloudpickle when adding Python 3 support[1], and there is a tracking
>>>>> issue[2]. We were able to fix or work around missing Py3 in dill, features
>>>>> although some are still not working for us [3].
>>>>> >>> I agree that Beam can and should support cloudpickle as a pickler.
>>>>> Practically, we can make cloudpickle the default pickler starting from a
>>>>> particular python version, for example we are planning to add Python 3.9
>>>>> support and we can try to make cloudpickle the default pickler for this
>>>>> version to avoid breaking users while ironing out rough edges.
>>>>> >>>
>>>>> >>> My main concern is client-server version range compatibility of
>>>>> the pickler. When SDK creates the job representation, it serializes the
>>>>> objects using the pickler used on the user's machine. When SDK deserializes
>>>>> the objects on the Runner side, it uses the pickler installed on the
>>>>> runner, for example it can be a dill version installed the docker container
>>>>> provided by Beam or Dataflow. We have been burned in the past by having an
>>>>> open version bound for the pickler in Beam's requirements: client side
>>>>> would pick the newest version, but runner container would have a somewhat
>>>>> older version, either because the container did not have the new version,
>>>>> or because some pipeline dependency wanted to downgrade dill. Older version
>>>>> of pickler did not correctly deserialize new pickles. I suspect cloudpickle
>>>>> may have the same problem. A solution was to have a very tight version
>>>>> range for the pickler in SDK's requirements [4]. Given that dill is not a
>>>>> popular dependency, the tight range did not create much friction for Beam
>>>>> users. I think with cloudpickle we will not be able have a tight range.  We
>>>>> could solve this problem by passing the version of pickler used at job
>>>>> submission, and have a check on the runner to make sure that the client
>>>>> version is not newer than the runner's version. Additionally, we should
>>>>> make sure cloudpickle is backwards compatible (newer version can
>>>>> deserialize objects created by older version).
>>>>> >>>
>>>>> >>> [1]
>>>>> https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
>>>>> >>> [2] https://issues.apache.org/jira/browse/BEAM-8123
>>>>> >>> [3]
>>>>> https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
>>>>> >>> [4]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
>>>>> >>>
>>>>> >>> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <sh...@google.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> cloudpickle [1] and dill [2] are two Python packages that
>>>>> implement extensions of Python's pickle protocol for arbitrary objects.
>>>>> Beam currently uses dill, but I'm wondering if we could consider
>>>>> additionally or alternatively use cloudpickle instead.
>>>>> >>>>
>>>>> >>>> Overall, cloudpickle seems to be a more popular choice for
>>>>> extended pickle support in distributing computing in Python, e.g., it's
>>>>> used by Spark, Dask and joblib.
>>>>> >>>>
>>>>> >>>> One of the major differences between cloudpickle and dill is how
>>>>> they handle pickling global variables (such as Python modules) that are
>>>>> referred to by a function:
>>>>> >>>> - Dill doesn't serialize globals. If you want to save globals,
>>>>> you need to call dill.dump_session(). This is what the "save_main_session"
>>>>> flag does in Beam.
>>>>> >>>> - Cloudpickle takes a different approach. It introspects which
>>>>> global variables are used by a function, and creates a closure around the
>>>>> serialized function that only contains these variables.
>>>>> >>>>
>>>>> >>>> The cloudpickle approach results in larger serialized functions,
>>>>> but it's also much more robust, because the required globals are included
>>>>> by default. In contrast, with dill, one either needs to save all globals or
>>>>> none. This is repeated pain-point for Beam Python users [3]:
>>>>> >>>> - Saving all globals can be overly aggressive, particularly in
>>>>> notebooks where users may have incidentally created large objects.
>>>>> >>>> - Alternatively, users can avoid using global variables entirely,
>>>>> but this makes defining ad-hoc pipelines very awkward. Mapped over
>>>>> functions need to be imported from other modules, or need to have their
>>>>> imports defined inside the function itself.
>>>>> >>>>
>>>>> >>>> I'd love to see an option to use cloudpickle in Beam instead of
>>>>> dill, and to consider switching over entirely. Cloudpickle would allow Beam
>>>>> users to write readable code in the way they expect, without needing to
>>>>> worry about the confusing and potentially problematic "save_main_session"
>>>>> flag.
>>>>> >>>>
>>>>> >>>> Any thoughts?
>>>>> >>>>
>>>>> >>>> Cheers,
>>>>> >>>> Stephan
>>>>> >>>>
>>>>> >>>> [1] https://github.com/cloudpipe/cloudpickle
>>>>> >>>> [2] https://github.com/uqfoundation/dill
>>>>> >>>> [3]
>>>>> https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
>>>>> >>>>
>>>>>
>>>>
>>>
>>> --
>>> +48 660 796 129 <+48%20660%20796%20129>
>>>
>>