You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/09/23 11:23:00 UTC

[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses

    [ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200747#comment-17200747 ] 

Apache Spark commented on SPARK-22674:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29851

> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
>                 Key: SPARK-22674
>                 URL: https://issues.apache.org/jira/browse/SPARK-22674
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0, 2.3.0
>            Reporter: Jonas Amrich
>            Priority: Major
>
> Pyspark monkey patches the namedtuple class to make it serializable, however this breaks serialization of its subclasses. With current implementation, any subclass will be serialized (and deserialized) as it's parent namedtuple. Consider this code, which will fail with {{AttributeError: 'Point' object has no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing pyspark breaks serialization of namedtuple subclasses even in code which is not related to spark / distributed execution. I don't see any clean solution to this; a possible workaround may be to limit serialization hack only to direct namedtuple subclasses like in https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org