Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2017/02/13 23:01:39 UTC

[PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

Hi PySpark Developers,

Cloudpickle is a core part of PySpark; it was originally copied from (and
improved upon) picloud. Since then other projects have found cloudpickle
useful, and a fork of cloudpickle <https://github.com/cloudpipe/cloudpickle>
was created and is now maintained as its own library
<https://pypi.python.org/pypi/cloudpickle> (with better test coverage and,
as I understand it, the resulting bug fixes). We've had a few PRs
backporting fixes from the cloudpickle project into Spark's local copy of
cloudpickle - how would people feel about moving to taking an explicit
(pinned) dependency on cloudpickle?
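
For context, here is a minimal sketch (not the actual PySpark code path)
of the kind of closure serialization cloudpickle handles and the standard
pickle module does not:

    import pickle
    import cloudpickle  # the standalone PyPI package (or Spark's vendored copy)

    factor = 3
    f = lambda x: x * factor  # a closure; the standard pickler refuses to serialize it

    payload = cloudpickle.dumps(f)  # serializes the code object plus its closure
    g = pickle.loads(payload)       # standard unpickler works, as long as cloudpickle is importable
    print(g(4))                     # -> 12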

We could add cloudpickle to the setup.py and a requirements.txt file for
users who prefer not to do a system installation of PySpark.
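
To make that concrete, a rough sketch of what the pinned entries might look
like - the version numbers below are placeholders for illustration, not a
proposal:

    # setup.py (illustrative sketch; most arguments elided)
    from setuptools import setup

    setup(
        name="pyspark",
        # ...
        install_requires=[
            "cloudpickle==0.2.2",  # placeholder pin; whichever release we vet would go here
            "py4j==0.10.4",        # placeholder pin matching the JVM-side py4j jar
        ],
    )

    # requirements.txt would mirror the same pins for pip installs into a
    # virtualenv:
    #   cloudpickle==0.2.2
    #   py4j==0.10.4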

Py4J may be an even simpler case: we currently have a zip of py4j in our
repo, but could instead require a pinned version. While we do depend on a
lot of py4j internal APIs, version pinning should be sufficient to ensure
functionality (and simplify the update process).
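
For context, a tiny sketch of the py4j surface PySpark builds on (this
assumes a JVM-side GatewayServer is already listening on py4j's default
port; it is not PySpark's actual gateway code):

    from py4j.java_gateway import JavaGateway, GatewayParameters

    # Connect to an already-running GatewayServer (default port 25333).
    gateway = JavaGateway(gateway_parameters=GatewayParameters(auto_convert=True))

    # Dynamic proxies into the JVM -- the kind of surface whose behavior we
    # want a pinned py4j version to keep stable between Spark releases.
    jvm = gateway.jvm
    print(jvm.java.lang.System.getProperty("java.version"))

    gateway.shutdown()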

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau

Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
I don't have any strong views, so just to highlight possible issues:

  * Based on different issues I've seen, there is a substantial number
    of users who depend on system-wide Python installations. As far as
    I am aware, neither Py4j nor cloudpickle is present in the standard
    system repositories of Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond its
    end of life, we have to be sure that any external dependency has
    the same policy.
  * Py4j is missing from the default Anaconda channel. Not a big issue,
    just a small annoyance.
  * External dependencies with pinned versions add some overhead to
    development across versions (effectively we may need a separate env
    for each major Spark release). I've seen small inconsistencies in
    PySpark behavior with different Py4j versions, so this is not
    completely hypothetical (see the sketch after this list).
  * Adding possible version conflicts. It is probably not a big risk,
    but something to consider (for example in the combination
    Blaze + Dask + PySpark).
  * Adding another party the user has to trust.
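
To make the pinning overhead point concrete, a rough sketch of the kind of
import-time guard this could require (the version attributes and pins below
are assumptions for illustration):

    import warnings

    # Placeholder pins, purely for illustration.
    EXPECTED_PY4J = "0.10.4"
    EXPECTED_CLOUDPICKLE = "0.2.2"

    def check_pinned_versions():
        """Warn when installed packages drift from the versions a given
        Spark release was tested with."""
        from py4j.version import __version__ as py4j_version        # version attribute assumed present
        from cloudpickle import __version__ as cloudpickle_version  # version attribute assumed present

        if py4j_version != EXPECTED_PY4J:
            warnings.warn("py4j %s installed, but this release was tested against %s"
                          % (py4j_version, EXPECTED_PY4J))
        if cloudpickle_version != EXPECTED_CLOUDPICKLE:
            warnings.warn("cloudpickle %s installed, but this release was tested against %s"
                          % (cloudpickle_version, EXPECTED_CLOUDPICKLE))

    check_pinned_versions()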


On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in 2016,
> and updating it is a bit involved (from a review point of view,
> verifying the zip file contents is somewhat tedious).
>
> cloudpickle is a bit harder to gauge, since we can have changes to
> cloudpickle that aren't correctly tagged as backports from the fork
> (and these can take a while to review since we don't always catch
> them right away as being backports).
>
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, more standardized packaging of dependencies,
> and simpler dependency updates, which reduce the friction to gaining
> benefits from other related projects' work - Python serialization
> really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <rxin@databricks.com
> <ma...@databricks.com>> wrote:
>
>     With any dependency update (or refactoring of existing code), I
>     always ask this question: what's the benefit? In this case it
>     looks like the benefit is to reduce efforts in backports. Do you
>     know how often we needed to do those?
>
>
>     On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau
>     <holden@pigscanfly.ca <ma...@pigscanfly.ca>> wrote:
>
>         Hi PySpark Developers,
>
>         Cloudpickle is a core part of PySpark; it was originally
>         copied from (and improved upon) picloud. Since then other
>         projects have found cloudpickle useful, and a fork of
>         cloudpickle <https://github.com/cloudpipe/cloudpickle> was
>         created and is now maintained as its own library
>         <https://pypi.python.org/pypi/cloudpickle> (with better test
>         coverage and, as I understand it, the resulting bug fixes).
>         We've had a few PRs backporting fixes from the cloudpickle
>         project into Spark's local copy of cloudpickle - how would
>         people feel about moving to taking an explicit (pinned)
>         dependency on cloudpickle?
>
>         We could add cloudpickle to the setup.py and a
>         requirements.txt file for users who prefer not to do a system
>         installation of PySpark.
>
>         Py4J may be an even simpler case: we currently have a zip of
>         py4j in our repo, but could instead require a pinned version.
>         While we do depend on a lot of py4j internal APIs, version
>         pinning should be sufficient to ensure functionality (and
>         simplify the update process).
>
>         Cheers,
>
>         Holden :)
>
>         -- 
>         Twitter: https://twitter.com/holdenkarau
>         <https://twitter.com/holdenkarau>
>
>
>
>
>
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

-- 
Maciej Szymkiewicz


Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

Posted by Holden Karau <ho...@pigscanfly.ca>.
It's a good question. Py4J seems to have been updated 5 times in 2016, and
updating it is a bit involved (from a review point of view, verifying the
zip file contents is somewhat tedious).

cloudpickle is a bit harder to gauge, since we can have changes to
cloudpickle that aren't correctly tagged as backports from the fork (and
these can take a while to review since we don't always catch them right
away as being backports).

Another difficulty with looking at backports is that since our review
process for PySpark has historically been on the slow side, changes
benefiting systems like dask or IPython parallel were not backported to
Spark unless they caused serious errors.

I think the key benefits are better test coverage of the forked version of
cloudpickle, more standardized packaging of dependencies, and simpler
dependency updates, which reduce the friction to gaining benefits from
other related projects' work - Python serialization really isn't our
secret sauce.

If I'm missing any substantial benefits or costs I'd love to know :)

On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <rx...@databricks.com> wrote:

> With any dependency update (or refactoring of existing code), I always ask
> this question: what's the benefit? In this case it looks like the benefit
> is to reduce efforts in backports. Do you know how often we needed to do
> those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <ho...@pigscanfly.ca>
> wrote:
>
>> Hi PySpark Developers,
>>
>> Cloudpickle is a core part of PySpark; it was originally copied from (and
>> improved upon) picloud. Since then other projects have found cloudpickle
>> useful, and a fork of cloudpickle
>> <https://github.com/cloudpipe/cloudpickle> was created and is now
>> maintained as its own library <https://pypi.python.org/pypi/cloudpickle>
>> (with better test coverage and, as I understand it, the resulting bug
>> fixes). We've had a few PRs backporting fixes from the cloudpickle project
>> into Spark's local copy of cloudpickle - how would people feel about
>> moving to taking an explicit (pinned) dependency on cloudpickle?
>>
>> We could add cloudpickle to the setup.py and a requirements.txt file for
>> users who prefer not to do a system installation of PySpark.
>>
>> Py4J may be an even simpler case: we currently have a zip of py4j in our
>> repo, but could instead require a pinned version. While we do depend on a
>> lot of py4j internal APIs, version pinning should be sufficient to ensure
>> functionality (and simplify the update process).
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: [PYTHON][DISCUSS] Moving to cloudpickle and/or Py4J as dependencies?

Posted by Reynold Xin <rx...@databricks.com>.
With any dependency update (or refactoring of existing code), I always ask
this question: what's the benefit? In this case it looks like the benefit
is to reduce efforts in backports. Do you know how often we needed to do
those?


On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark; it was originally copied from (and
> improved upon) picloud. Since then other projects have found cloudpickle
> useful, and a fork of cloudpickle
> <https://github.com/cloudpipe/cloudpickle> was created and is now
> maintained as its own library <https://pypi.python.org/pypi/cloudpickle>
> (with better test coverage and, as I understand it, the resulting bug
> fixes). We've had a few PRs backporting fixes from the cloudpickle project
> into Spark's local copy of cloudpickle - how would people feel about
> moving to taking an explicit (pinned) dependency on cloudpickle?
>
> We could add cloudpickle to the setup.py and a requirements.txt file for
> users who prefer not to do a system installation of PySpark.
>
> Py4J may be an even simpler case: we currently have a zip of py4j in our
> repo, but could instead require a pinned version. While we do depend on a
> lot of py4j internal APIs, version pinning should be sufficient to ensure
> functionality (and simplify the update process).
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>