You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by superbobry <gi...@git.apache.org> on 2018/11/11 22:11:36 UTC
[GitHub] spark pull request #23008: [SPARK-22674][PYTHON] Removed the namedtuple pick...
GitHub user superbobry opened a pull request:
https://github.com/apache/spark/pull/23008
[SPARK-22674][PYTHON] Removed the namedtuple pickling patch
## What changes were proposed in this pull request?
Prior to this PR PySpark patched ``collections.namedtuple`` to make
namedtuple instances serializable even if the namedtuple class has been
defined outside of ``globals()``, e.g.
def do_something():
Foo = namedtuple("Foo", ["foo"])
sc.parallelize(range(1)).map(lambda _: Foo(42))
The patch changed the pickled representation of the namedtuple instance
to include the structure of namedtuple class, and recreate the class on
each unpickling. This behaviour causes hard to diagnose failures both
in the user code with namedtuples, as well as third-party libraries
relying on them. See [1] and [2] for details.
[1]: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
[2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html
The PR changes the default serializer to `CloudPickleSerializer` which natively supports pickling namedtuples and does not require the aforementioned patch. To the best of my knowledge, this is **not** a breaking change.
## How was this patch tested?
PySpark test suite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/superbobry/spark no-hijack-namedtuple
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23008.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23008
----
commit 36ff69717c411e72f60c3006eb6a491c3d88862d
Author: Sergei Lebedev <su...@...>
Date: 2018-11-11T21:05:07Z
Removed namedtuple hack and made cloudpickle the default serializer
This is a followup of the discussion in #21157. See the PR and the
linked JIRA ticket for context and motivation.
commit 9a818797603f5804b32202d28474493c80966f58
Author: Sergei Lebedev <su...@...>
Date: 2018-11-11T22:11:02Z
Changed SerializationTestCase to use cloudpickle
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
BTW, let.s test them in end-to-end. For instance, `spark.range(10000).rdd.map(lambda blabla).count()`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
Nope, it should be manually done.. should be great to have it FWIW.
I am not yet sure how we're going to measure the performance. I think you can show the performance diff for namedtuple for now - that's going to at the very least show some numbers to compare.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
ok to test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/23008
If the perf diff is big, let's don't change but document that we can use `CloudPickleSerializer()` to avoid breaking change.
If the perf diff is rather trivial, let's check if we can keep this change. I will help to check the perf in this case as well.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23008
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/23008
Interestingly, `cloudpickle` adds overhead even if the namedtuple is importable:
```bash
$ cat a.py
from collections import namedtuple
A = namedtuple("A", ["foo", "bar"])
$ python -c "from a import A; import cloudpickle; print(len(cloudpickle.dumps(A(42, 24))))"
30
$ python -c "from a import A; import pickle; print(len(pickle.dumps(A(42, 24))))"
20
```
If the namedtuple is not importable, the size of the result explodes because `cloudpickle` includes a full class definition along with all the docstrings with *every* pickled object:
```python
>>> from collections import namedtuple
>>> A = namedtuple("A", ["foo", "bar"])
>>> import cloudpickle
>>> len(cloudpickle.dumps(A(42, 24)))
3836
>>> import pickle
>>> len(pickle.dumps(A(42, 24)))
27
```
Note that the order of magnitude is incomparable to what PySpark does currently:
```python
>>> import pyspark
>>> A = namedtuple("A", ["foo", "bar"])
>>> len(pickle.dumps(A(42, 24)))
79
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23008
**[Test build #98710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)** for PR 23008 at commit [`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23008
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23008
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23008
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98710/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23008
**[Test build #98710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)** for PR 23008 at commit [`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23008
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/23008
Is there a benchmark suite for PySpark?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org