Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2014/12/10 21:17:13 UTC
[jira] [Commented] (SPARK-2951) SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6
[ https://issues.apache.org/jira/browse/SPARK-2951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241654#comment-14241654 ]
Apache Spark commented on SPARK-2951:
-------------------------------------
User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3668
> SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6
> ------------------------------------------------------------------------------
>
> Key: SPARK-2951
> URL: https://issues.apache.org/jira/browse/SPARK-2951
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.1.0
> Reporter: Josh Rosen
> Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> With Python 2.6, calling SerDeUtils.pythonToPairRDD() on an RDD of pickled Python array.arrays will fail with this exception:
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.ArrayList
> net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:33)
> net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:898)
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:880)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> I think this is due to a difference in how array.array is pickled in Python 2.6 vs. Python 2.7. To see this, run the following script:
> {code}
> from pickletools import dis, optimize
> from pickle import dumps, loads, HIGHEST_PROTOCOL
> from array import array
> arr = array('d', [1, 2, 3])
> #protocol = HIGHEST_PROTOCOL
> protocol = 0
> pickled = dumps(arr, protocol=protocol)
> pickled = optimize(pickled)
> unpickled = loads(pickled)
> print arr
> print unpickled
> print dis(pickled)
> {code}
> In Python 2.7, this outputs
> {code}
> array('d', [1.0, 2.0, 3.0])
> array('d', [1.0, 2.0, 3.0])
> 0: c GLOBAL 'array array'
> 13: ( MARK
> 14: S STRING 'd'
> 19: ( MARK
> 20: l LIST (MARK at 19)
> 21: F FLOAT 1.0
> 26: a APPEND
> 27: F FLOAT 2.0
> 32: a APPEND
> 33: F FLOAT 3.0
> 38: a APPEND
> 39: t TUPLE (MARK at 13)
> 40: R REDUCE
> 41: . STOP
> highest protocol among opcodes = 0
> None
> {code}
> whereas Python 2.6 outputs
> {code}
> array('d', [1.0, 2.0, 3.0])
> array('d', [1.0, 2.0, 3.0])
> 0: c GLOBAL 'array array'
> 13: ( MARK
> 14: S STRING 'd'
> 19: S STRING '\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
> 110: t TUPLE (MARK at 13)
> 111: R REDUCE
> 112: . STOP
> highest protocol among opcodes = 0
> None
> {code}
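The 2.6 disassembly above shows the crux of the bug: the second REDUCE argument is a raw string of the array's memory buffer (the machine-format representation), not the list of floats that 2.7 emits. As a quick sanity check, the 24 bytes in that STRING opcode are just three little-endian IEEE-754 doubles. A minimal sketch decoding them (Python 3 shown for convenience; the byte layout is the same):

```python
import struct
from array import array

# Raw payload from the Python 2.6 pickle above: three little-endian
# IEEE-754 doubles packed back-to-back (the array's memory buffer).
raw = (b'\x00\x00\x00\x00\x00\x00\xf0?'
       b'\x00\x00\x00\x00\x00\x00\x00@'
       b'\x00\x00\x00\x00\x00\x00\x08@')

# Python 2.6 reduces array('d', [...]) to (typecode, machine_string),
# so an unpickler that expects (typecode, list) sees a string instead,
# which is exactly the String-vs-ArrayList ClassCastException above.
values = struct.unpack('<3d', raw)
print(values)  # (1.0, 2.0, 3.0)

# array.frombytes performs the reconstruction that the 2.6 format implies.
a = array('d')
a.frombytes(raw)
print(a)  # array('d', [1.0, 2.0, 3.0])
```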
> I think the Java-side depickling library doesn't expect this pickled format (a raw machine-format string where 2.7 produces a list of values), which causes this failure.
> I noticed this when running PySpark's unit tests on 2.6 because the TestOutputFormat.test_newhadoop test failed.
> I think that this issue affects all of the methods that might need to depickle arrays in Java, including all of the Hadoop output format methods.
> How should we try to fix this? Require that users upgrade to 2.7 if they want to use code that requires this? Open a bug with the depickling library maintainers? Try to hack in our own pickling routines for arrays if we detect that we're using 2.6?
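For the third option (overriding the pickling of arrays on 2.6), one possible approach is to register a custom reducer that emits the list-based layout 2.7 already produces, so the Java-side unpickler never sees the machine-format string. This is only a sketch, not the fix from the linked pull request; the Python 3 `copyreg` spelling is used here so it runs as-is (on 2.6 the module is `copy_reg`):

```python
import copyreg
import pickle
from array import array

def _pickle_array(a):
    # Reduce to (typecode, list-of-values) -- the layout Python 2.7
    # produces -- instead of (typecode, machine_string).
    return array, (a.typecode, list(a))

# The dispatch table is consulted before the type's own __reduce_ex__,
# so this overrides the default array pickling.
copyreg.pickle(array, _pickle_array)

arr = array('d', [1.0, 2.0, 3.0])
data = pickle.dumps(arr, protocol=0)
restored = pickle.loads(data)
print(restored)  # array('d', [1.0, 2.0, 3.0])
```

In Spark itself such a hook would only need to be installed when `sys.version_info < (2, 7)`, leaving 2.7's native behavior untouched.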
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org