Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2014/08/10 03:31:12 UTC

[jira] [Created] (SPARK-2951) SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6

Josh Rosen created SPARK-2951:
---------------------------------

             Summary: SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6
                 Key: SPARK-2951
                 URL: https://issues.apache.org/jira/browse/SPARK-2951
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.1.0
            Reporter: Josh Rosen


With Python 2.6, calling SerDeUtils.pythonToPairRDD() on an RDD of pickled Python array.arrays will fail with this exception:

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to java.util.ArrayList
        net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:33)
        net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
        net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
        net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
        org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
        org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:898)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:880)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
{code}
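
For reference, here is a rough sketch of one way to hit this code path from PySpark (untested as written; the output path is a placeholder and sc is assumed to be an existing SparkContext). Any of the save-to-Hadoop methods that push pickled pairs through SerDeUtil should trigger the same unpickling step:

{code}
# Untested repro sketch: the values are array.array objects, so saving the RDD
# through a Hadoop output method makes the JVM side unpickle them (which is
# where the ClassCastException above comes from on Python 2.6).
from array import array

pairs = sc.parallelize([(1, array('d', [1.0, 2.0, 3.0])),
                        (2, array('d', [4.0, 5.0, 6.0]))])
pairs.saveAsSequenceFile("/tmp/spark-2951-repro")  # placeholder output path
{code}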

I think this is due to a difference in how array.array is pickled in Python 2.6 vs. Python 2.7.  To see this, run the following script:

{code}
# Compare how array.array pickles: dump it, strip redundant opcodes, then
# disassemble the result so the opcode streams from 2.6 and 2.7 can be diffed.
from pickletools import dis, optimize
from pickle import dumps, loads, HIGHEST_PROTOCOL
from array import array

arr = array('d', [1, 2, 3])

# Protocol 0 keeps the disassembly readable; swap in HIGHEST_PROTOCOL to compare.
#protocol = HIGHEST_PROTOCOL
protocol = 0

pickled = dumps(arr, protocol=protocol)
pickled = optimize(pickled)
unpickled = loads(pickled)

# Sanity-check the round trip, then disassemble the pickle stream.
print arr
print unpickled

print dis(pickled)
{code}

In Python 2.7, this outputs

{code}
array('d', [1.0, 2.0, 3.0])
array('d', [1.0, 2.0, 3.0])
    0: c    GLOBAL     'array array'
   13: (    MARK
   14: S        STRING     'd'
   19: (        MARK
   20: l            LIST       (MARK at 19)
   21: F        FLOAT      1.0
   26: a        APPEND
   27: F        FLOAT      2.0
   32: a        APPEND
   33: F        FLOAT      3.0
   38: a        APPEND
   39: t        TUPLE      (MARK at 13)
   40: R    REDUCE
   41: .    STOP
highest protocol among opcodes = 0
None
{code}

whereas 2.6 outputs

{code}
array('d', [1.0, 2.0, 3.0])
array('d', [1.0, 2.0, 3.0])
    0: c    GLOBAL     'array array'
   13: (    MARK
   14: S        STRING     'd'
   19: S        STRING     '\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
  110: t        TUPLE      (MARK at 13)
  111: R    REDUCE
  112: .    STOP
highest protocol among opcodes = 0
None
{code}

I think the Java-side depickling library (Pyrolite's net.razorvine.pickle, going by the stack trace) doesn't expect this pickled format, which causes this failure.
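
To make that concrete, the constructor arguments that end up behind the REDUCE opcode (and that ArrayConstructor receives on the Java side) can be inspected directly; running this under each interpreter shows the two shapes:

{code}
# Illustration only: print the reduce value that the pickler records for an array.
from array import array

arr = array('d', [1.0, 2.0, 3.0])
print arr.__reduce__()
# On Python 2.7 the second element of the argument tuple is a list of floats;
# on Python 2.6 it is the machine-encoded byte string shown in the STRING opcode above.
{code}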

I noticed this while running PySpark's unit tests on Python 2.6, where the TestOutputFormat.test_newhadoop test failed.

I think that this issue affects all of the methods that might need to depickle arrays in Java, including all of the Hadoop output format methods.

How should we try to fix this?  Require that users upgrade to 2.7 if they want to use code that requires this?  Open a bug with the depickling library maintainers?  Try to hack in our own pickling routines for arrays if we detect that we're using 2.6?
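
If we end up going with the last option, one rough, untested sketch (where exactly the registration would live inside PySpark is an open question) would be to register a copy_reg reducer on 2.6 so that array.array pickles the way it does on 2.7, i.e. as (typecode, list of values) rather than (typecode, machine-encoded string):

{code}
# Untested sketch of the "hack in our own pickling routine" option.
import sys
import copy_reg
from array import array

def _reduce_array(a):
    # Mirror Python 2.7's behaviour: reconstruct via array(typecode, [values]),
    # which pickles the contents as a list of numbers instead of raw bytes.
    return array, (a.typecode, list(a))

if sys.version_info[:2] == (2, 6):
    # copy_reg's dispatch table is consulted before array's own __reduce__,
    # so this overrides the 2.6 pickling format for both pickle and cPickle.
    copy_reg.pickle(array, _reduce_array)
{code}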


