Posted to reviews@spark.apache.org by superbobry <gi...@git.apache.org> on 2018/09/27 16:21:38 UTC

[GitHub] spark pull request #21157: [SPARK-22674][PYTHON] Removed the namedtuple pick...

GitHub user superbobry reopened a pull request:

    https://github.com/apache/spark/pull/21157

    [SPARK-22674][PYTHON] Removed the namedtuple pickling patch

    ## What changes were proposed in this pull request?
    
    This is a breaking change.
    
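    Prior to this commit, PySpark patched ``collections.namedtuple`` to make
    namedtuple instances serializable even if the namedtuple class was
    defined outside of module scope (that is, not in ``globals()``), e.g.
    
        from collections import namedtuple

        def do_something():
            # Foo is local to the function, so pickle cannot look it up by name.
            Foo = namedtuple("Foo", ["foo"])
            return sc.parallelize(range(1)).map(lambda _: Foo(42)).collect()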
    
    The patch changed the pickled representation of a namedtuple instance
    to include the structure of the namedtuple class, and recreated the class
    on each unpickling. This behaviour causes hard-to-diagnose failures both
    in user code that uses namedtuples and in third-party libraries that
    rely on them. See [1] and [2] for details.
    
    [1]: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
    [2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html
    
    ## How was this patch tested?
    
    PySpark test suite.
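
The failure mode described in the pull request above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not PySpark's actual patch: ``Point``, ``norm1``, and ``_restore`` are hypothetical names, and the ``__reduce__`` override merely imitates the idea of pickling the class structure and rebuilding the class on load.

```python
import pickle
from collections import namedtuple


def _restore(name, fields, values):
    # Hypothetical helper: build a brand-new namedtuple class on every call,
    # then create an instance of it.
    cls = namedtuple(name, fields)
    return cls(*values)


class Point(namedtuple("Point", ["x", "y"])):
    """A namedtuple subclass that adds a method."""

    def norm1(self):
        return abs(self.x) + abs(self.y)

    def __reduce__(self):
        # Assumption: stand-in for the patched behaviour. The pickle carries
        # the class name, field names, and values, not a reference to Point.
        return _restore, (type(self).__name__, self._fields, tuple(self))


original = Point(3, -4)
restored = pickle.loads(pickle.dumps(original))

print(restored)                     # field names and values survive
print(isinstance(restored, Point))  # class identity is not preserved
print(hasattr(restored, "norm1"))   # methods defined on the subclass are lost
```

The restored value still compares equal to the original as a tuple, but it is an instance of a freshly created class, so ``isinstance`` checks against ``Point`` fail and ``norm1`` is no longer available.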


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/criteo-forks/spark no-hijack-namedtuple

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21157
    
----
commit eadc0c8af853a57ee80f5e80fb708451931eedc0
Author: Sergei Lebedev <s....@...>
Date:   2018-04-25T14:45:25Z

    [SPARK-22674][PYTHON] Removed the namedtuple pickling patch
    
    This is a breaking change.
    
    Prior to this commit, PySpark patched ``collections.namedtuple`` to make
    namedtuple instances serializable even if the namedtuple class was
    defined outside of module scope (that is, not in ``globals()``), e.g.
    
        def do_something():
            Foo = namedtuple("Foo", ["foo"])
            sc.parallelize(range(1)).map(lambda _: Foo(42))
    
    The patch changed the pickled representation of a namedtuple instance
    to include the structure of the namedtuple class, and recreated the class
    on each unpickling. This behaviour causes hard-to-diagnose failures both
    in user code that uses namedtuples and in third-party libraries that
    rely on them. See [1] and [2] for details.
    
    [1]: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
    [2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html

commit 67c4f6707aab670b9cde5a3afa34fda3abbbf46d
Author: Sergei Lebedev <s....@...>
Date:   2018-04-26T09:45:55Z

    Fixed test_namedtuple_in_rdd

commit c67ce29a3279073812070e6ff4bb2e2624961b36
Author: Sergei Lebedev <s....@...>
Date:   2018-04-26T10:43:47Z

    Fixed test_infer_nested_schema

----

