Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/11/23 03:23:00 UTC

[jira] [Updated] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

     [ https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-32079:
---------------------------------
    Affects Version/s: 3.3.0
                           (was: 3.0.0)

> PySpark <> Beam pickling issues for collections.namedtuple
> ----------------------------------------------------------
>
>                 Key: SPARK-32079
>                 URL: https://issues.apache.org/jira/browse/SPARK-32079
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Gerard Casas Saez
>            Priority: Major
>
> PySpark's monkeypatching of collections.namedtuple makes it difficult or impossible to unpickle namedtuple instances from outside a PySpark environment.
>  
> Once PySpark has been imported into the environment, any namedtuple you pickle can only be unpickled in an environment where the same [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385] has been applied.
> This directly conflicts with running Beam pipelines from a non-Spark environment (namely Flink or Dataflow), making the pipeline unusable if it pickles a namedtuple anywhere.
>  
> {code:python}
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
>     "ColumnInfo",
>     [
>         "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
>         "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
>     ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
> {code:python}
> import pyspark
> import collections
> import dill
> ColumnInfo = collections.namedtuple(
>     "ColumnInfo",
>     [
>         "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
>         "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
>     ])
> dill.dumps(ColumnInfo('test', int))
> {code}
> {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}
> The second pickled object can only be unpickled in an environment where PySpark is installed.
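For context, the hijack linked above works roughly like the following sketch. This is a simplified reconstruction, not the exact code in pyspark/serializers.py; the names _restore and _hack_namedtuple mirror that module:

```python
import collections

# Simplified reconstruction of PySpark's namedtuple hijack
# (pyspark/serializers.py). A sketch, not the exact implementation.
_old_namedtuple = collections.namedtuple

def _restore(name, fields, value):
    # Rebuilds the namedtuple class at unpickle time. Pickles produced
    # under the hijack reference this function by module path, which is
    # why they only load where this module is importable.
    cls = _old_namedtuple(name, fields)
    return cls(*value)

def _hack_namedtuple(cls):
    # Swap __reduce__ so instances pickle as a call to _restore
    # instead of a reference to the dynamically created class.
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls

def namedtuple(*args, **kwargs):
    # Patched collections.namedtuple: every class it creates gets the
    # pickling behavior above.
    return _hack_namedtuple(_old_namedtuple(*args, **kwargs))

collections.namedtuple = namedtuple
```

After this patch runs, every pickle of a namedtuple instance embeds a reference to _restore's module, which reproduces the second payload shown above.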
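Until this is fixed, one possible workaround on the non-Spark side is to register a stub pyspark.serializers module exposing a compatible _restore before unpickling. This is a hedged sketch, not an official PySpark or Beam API; install_pyspark_stub is a hypothetical helper, and _restore mirrors the function in pyspark/serializers.py:

```python
import collections
import sys
import types

def _restore(name, fields, value):
    # Mirrors pyspark.serializers._restore: rebuild the namedtuple
    # class and instantiate it from the pickled field values.
    cls = collections.namedtuple(name, fields)
    return cls(*value)

def install_pyspark_stub():
    # Hypothetical helper (not part of PySpark or Beam): register fake
    # pyspark / pyspark.serializers modules so pickles produced under
    # the PySpark hijack can load without PySpark installed.
    if "pyspark.serializers" in sys.modules:
        return  # real PySpark (or a stub) is already loaded
    pyspark = types.ModuleType("pyspark")
    serializers = types.ModuleType("pyspark.serializers")
    serializers._restore = _restore
    pyspark.serializers = serializers
    sys.modules["pyspark"] = pyspark
    sys.modules["pyspark.serializers"] = serializers

install_pyspark_stub()
```

With the stub installed, dill.loads on the second payload above should resolve pyspark.serializers._restore to the stub and reconstruct the ColumnInfo instance on a Flink or Dataflow worker.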



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org