Posted to dev@spark.apache.org by Sergei Lebedev <se...@gmail.com> on 2018/08/14 14:25:13 UTC

[DISCUSS][SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples

Hi all,

Some time ago we discovered that PySpark patches
collections.namedtuple to allow unpickling of namedtuples defined in the
REPL on the executors. Side effects of the patch include:

* hard-to-debug failures -- we originally came across this while
investigating a TensorFlowOnSpark failure, see [1];
* serialization overhead -- each namedtuple instance carries a full
namedtuple definition.
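
To make both side effects concrete, here is a minimal sketch that does not
use PySpark at all. It assumes the patch behaves roughly like the
hypothetical restore() helper and PointByValue class below, i.e. every
instance is pickled together with its class name and field list, and the
class is rebuilt on unpickling. These names are illustrative only and are
not the actual PySpark code.

```python
import collections
import pickle

Point = collections.namedtuple("Point", ["x", "y"])


def restore(name, fields, values):
    # Hypothetical helper: rebuild the class from its definition,
    # then build the instance from the rebuilt class.
    cls = collections.namedtuple(name, fields)
    return cls(*values)


class PointByValue(Point):
    # Hypothetical stand-in for a patched namedtuple: each instance
    # pickles the class name and field names alongside its values.
    __slots__ = ()

    def __reduce__(self):
        return (restore, ("Point", self._fields, tuple(self)))


# Regular pickling stores only a reference (module + class name) and the values.
by_reference = pickle.dumps(Point(1, 2))
# The "by value" variant stores the definition with every instance.
by_value = pickle.dumps(PointByValue(1, 2))

restored = pickle.loads(by_value)

print(len(by_reference), len(by_value))
print(type(restored) is Point, isinstance(restored, Point))
```

The restored object still compares equal to (1, 2), but its class is a
freshly created one, so isinstance checks against the original Point no
longer hold. That kind of mismatch is what makes such failures hard to
track down, and the larger payload is paid once per instance.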

I think it is best to completely remove the patch since the benefits it
brings are insignificant compared to the issues. However, there is a middle
ground which to me looks non-intrusive enough to be releasable in the 2.X
branch. The proposed PR [2] does not break any of the currently working
usages of namedtuple while reducing the damage done by the patch when the
namedtuple or its subclass is importable.
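
As a rough illustration of what "importable" could mean here, the sketch
below checks whether a class can be found again through its module and
qualified name. The is_importable() helper is hypothetical and is not taken
from the PR; it also assumes that anything defined in __main__ (the REPL or
a driver script) is not available on the executors.

```python
import collections
import importlib
import urllib.parse


def is_importable(cls):
    """Return True if `cls` can be found again via its module and qualified name."""
    if cls.__module__ == "__main__":
        # Assumption: REPL/script definitions do not exist on the executors.
        return False
    try:
        obj = importlib.import_module(cls.__module__)
        for part in cls.__qualname__.split("."):
            obj = getattr(obj, part)
    except (ImportError, AttributeError):
        return False
    return obj is cls


def make_local():
    # A namedtuple created inside a function is never bound to a
    # module-level name, so it cannot be looked up by reference.
    return collections.namedtuple("Local", ["a", "b"])


# SplitResult is a namedtuple subclass that lives in the standard library.
print(is_importable(urllib.parse.SplitResult))
print(is_importable(make_local()))
```

A class that passes such a check can be pickled by reference as usual, so
the patch would only need to apply to the remaining cases.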

Do you think it might be possible to merge the PR in either 2.4.X or the
following 2.X release?

Cheers,
Sergei

[1]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html
[2]: https://github.com/apache/spark/pull/21180