Posted to issues@spark.apache.org by "holdenk (JIRA)" <ji...@apache.org> on 2016/10/08 00:26:21 UTC

[jira] [Closed] (SPARK-7613) Serialization fails in pyspark for lambdas referencing class data members

     [ https://issues.apache.org/jira/browse/SPARK-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

holdenk closed SPARK-7613.
--------------------------
    Resolution: Won't Fix

I believe this is expected behaviour, and the current best practice is simply to copy any required attributes into local variables before referencing them in the lambda, so the closure does not capture self.
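
A minimal sketch of that workaround, applied to the LambdaFail class from the report below (the class name LambdaFixed is hypothetical, and sc is an existing SparkContext as in the report):

{code}
class LambdaFixed():
    def __init__(self, exp):
        self.exp = exp
        # Copy the attribute into a plain local so the lambda closes over
        # the value itself rather than over self.
        local_exp = exp
        self.f_function = (lambda x: x**local_exp)

rdd = sc.parallelize(range(0, 10))
print 'LambdaFixed:', rdd.map(LambdaFixed(2).f_function).collect()
{code}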

> Serialization fails in pyspark for lambdas referencing class data members
> -------------------------------------------------------------------------
>
>                 Key: SPARK-7613
>                 URL: https://issues.apache.org/jira/browse/SPARK-7613
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0, 1.3.1
>         Environment: Python 2.7.6, Java 8
>            Reporter: Nate Crosswhite
>
> The following code snippet works in pyspark 1.1.0, but fails in 1.2 and later with the indicated error. The failure appears to be caused by cloudpickle attempting to serialize the second lambda function twice.
> {code}
> ## Begin PySpark code
> class LambdaFine():
>     def __init__(self, exp):
>         self.exp = exp
>         # The lambda closes over the local variable exp, which pickles cleanly
>         self.f_function = (lambda x: x**exp)
>
> class LambdaFail():
>     def __init__(self, exp):
>         self.exp = exp
>         # The lambda closes over self, so serializing it must serialize the instance
>         self.f_function = (lambda x: x**self.exp)
>
> rdd = sc.parallelize(range(0, 10))
> print 'LambdaFine:', rdd.map(LambdaFine(2).f_function).collect()  # works
> print 'LambdaFail:', rdd.map(LambdaFail(2).f_function).collect()  # fails in Spark 1.2+
> ## End PySpark code
> {code}
> ### Output:
> {code}
> LambdaFine: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
> LambdaFail:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/rdd.py", line 745, in collect
>     port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/rdd.py", line 2345, in _jrdd
>     pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/rdd.py", line 2265, in _prepare_for_python_RDD
>     pickled_command = ser.dumps((command, sys.version_info[:2]))
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/serializers.py", line 427, in dumps
>     return cloudpickle.dumps(obj, 2)
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", line 622, in dumps
>     cp.dump(obj)
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", line 107, in dump
>     return Pickler.dump(self, obj)
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
>     self.save(obj)
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 548, in save_tuple
>     save(element)
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
>     save(element)
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", line 199, in save_function
>     self.save_function_tuple(obj)
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
>     save((code, closure, base_globals))
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 548, in save_tuple
>     save(element)
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
>     self._batch_appends(iter(obj))
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 636, in _batch_appends
>     save(tmp[0])
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", line 193, in save_function
>     self.save_function_tuple(obj)
>   File "<home>/spark-1.4.0-SNAPSHOT-bin-4abf285f/python/pyspark/cloudpickle.py", line 238, in save_function_tuple
>     self.memoize(func)
>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 244, in memoize
>     assert id(obj) not in self.memo
> AssertionError
> {code}
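
As an aside on why the self-capturing lambda is problematic even when it does serialize: a lambda that references self.exp closes over self, so pickling the function drags the entire instance along with it. A minimal sketch demonstrating the size difference, using a cloudpickle version in which this case no longer raises the AssertionError (the Holder class and its field names are hypothetical):

{code}
import cloudpickle

class Holder(object):
    def __init__(self, exp):
        self.exp = exp
        self.payload = list(range(100000))  # unrelated instance data
        local_exp = exp
        self.f_local = lambda x: x ** local_exp  # closes over a plain int
        self.f_self = lambda x: x ** self.exp    # closes over self

h = Holder(2)
# The self-capturing lambda serializes the whole instance, payload included,
# so its pickled form is dramatically larger than the local-capturing one.
print len(cloudpickle.dumps(h.f_local))
print len(cloudpickle.dumps(h.f_self))
{code}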



