Posted to user@spark.apache.org by Anoop Shiralige <an...@gmail.com> on 2016/02/11 15:38:46 UTC
PySpark : couldn't pickle object of type class T
Hi All,
I am working with Spark 1.6.0, specifically the pySpark shell. I have a
JavaRDD[org.apache.avro.GenericRecord] which I have converted to a Python RDD
in the following way:
javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc)
javaPython = sc._jvm.SerDe.javaToPython(javaRDD)
from pyspark.rdd import RDD
pythonRDD = RDD(javaPython, sc)
pythonRDD.first()
However, every time I call collect() or first() on
pythonRDD, I get the following error:
16/02/11 06:19:19 ERROR python.PythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
    length = read_int(stream)
  File "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: net.razorvine.pickle.PickleException: couldn't pickle object of type class org.apache.avro.generic.GenericData$Record
        at net.razorvine.pickle.Pickler.save(Pickler.java:142)
        at net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:493)
        at net.razorvine.pickle.Pickler.dispatch(Pickler.java:205)
        at net.razorvine.pickle.Pickler.save(Pickler.java:137)
        at net.razorvine.pickle.Pickler.dump(Pickler.java:107)
        at net.razorvine.pickle.Pickler.dumps(Pickler.java:92)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
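[Editor's note, not part of the original message: the "Caused by" line shows the JVM-side Pyrolite pickler refusing GenericData$Record, a type it does not know how to serialize. One hedged workaround is to map each record to JSON on the JVM side before calling javaToPython (GenericRecord.toString() emits JSON), so only plain strings cross the boundary. A minimal pure-Python sketch of the Python side, using a hypothetical record:]

```python
import json

# Stand-in for the JSON text that GenericRecord.toString() would emit
# on the JVM side (hypothetical record, not from the thread's data).
record_json = '{"name": "alice", "age": 30}'

# With an RDD of JSON strings, each element is a plain str, which the
# pickler handles fine; parse it back into a dict on the Python side.
record = json.loads(record_json)
assert record["name"] == "alice"

# In the shell this would look like: pythonRDD.map(json.loads).first()
```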
Thanks for your time,
AnoopShiralige
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-couldn-t-pickle-object-of-type-class-T-tp26204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: PySpark : couldn't pickle object of type class T
Posted by Jeff Zhang <zj...@gmail.com>.
Hi Anoop,
I don't see the exception you mentioned in the link. I can use spark-avro
to read the sample file users.avro in Spark successfully. Do you have the
details of the union issue?
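[Editor's note: the union restriction Anoop hits elsewhere in the thread ("Unions may only consist of concrete type and null") means Spark's Avro schema converter accepts a union only when it has at most one non-null branch. A small pure-Python sketch of that rule, my own illustration rather than Spark's actual code:]

```python
def union_is_supported(branches):
    """Return True if an Avro union has at most one non-null branch,
    mirroring the "concrete type and null" restriction."""
    non_null = [b for b in branches if b != "null"]
    return len(non_null) <= 1

# ["null", "string"] is fine; a union of several concrete types is not.
assert union_is_supported(["null", "string"])
assert not union_is_supported(["string", "long", "record"])
```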
On Sat, Feb 27, 2016 at 10:05 AM, Anoop Shiralige <anoop.shiralige@gmail.com> wrote:
> [...]
--
Best Regards
Jeff Zhang
Re: PySpark : couldn't pickle object of type class T
Posted by Anoop Shiralige <an...@gmail.com>.
Hi Jeff,
Thank you for looking into the post.
I had explored the spark-avro option earlier. Since we have unions of multiple
complex data types in our Avro schema, we couldn't use it.
A couple of things I tried:
- https://stackoverflow.com/questions/31261376/how-to-read-pyspark-avro-file-and-extract-the-values :
"Spark Exception : Unions may only consist of concrete type and null"
- Use of DataFrame/Dataset: serialization problem.
For now, I got it working by modifying AvroConversionUtils to address the
union of multiple data types.
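[Editor's note: the thread does not show the actual AvroConversionUtils change, so purely as a hypothetical illustration: handling a multi-type union usually means trying each branch's converter in order until one accepts the value. A pure-Python sketch of that pattern:]

```python
def convert_union(value, converters):
    """Try each union branch's converter in order and return the first
    successful result. `converters` is a list of callables that raise
    on values they cannot handle (hypothetical shape, for illustration)."""
    for convert in converters:
        try:
            return convert(value)
        except (TypeError, ValueError):
            continue
    raise ValueError("no union branch matched %r" % (value,))

# A union of ["int", "double"], modelled with Python's int and float:
branches = [int, float]
assert convert_union("7", branches) == 7      # first branch accepts
assert convert_union("3.5", branches) == 3.5  # falls through to float
```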
Thanks,
AnoopShiralige
On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang <zj...@gmail.com> wrote:
> [...]
Re: PySpark : couldn't pickle object of type class T
Posted by Jeff Zhang <zj...@gmail.com>.
Avro records are not supported by the pickler; you would need to create a
custom pickler for them, but I don't think it is worth doing that. Actually,
you can use the spark-avro package to load Avro data and then convert it to
an RDD if necessary.
https://github.com/databricks/spark-avro
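[Editor's note: a hedged sketch of this suggestion in PySpark terms, assuming spark-avro 2.0.x for Spark 1.6 and a placeholder path; wrapped in a function so nothing executes without a live SQLContext:]

```python
def load_avro_as_rdd(sql_context, path):
    """Load an Avro file via the spark-avro package and return an RDD of
    Rows. Assumes the shell was launched with something like:
      --packages com.databricks:spark-avro_2.10:2.0.1
    (Spark 1.6 / Scala 2.10; version is an assumption, check the README)."""
    df = sql_context.read.format("com.databricks.spark.avro").load(path)
    # DataFrame.rdd yields an RDD of pyspark.sql.Row, which pickles fine.
    return df.rdd

# Usage inside a pySpark shell (not run here):
#   load_avro_as_rdd(sqlContext, "users.avro").first()
```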
On Thu, Feb 11, 2016 at 10:38 PM, Anoop Shiralige <anoop.shiralige@gmail.com> wrote:
> [...]
--
Best Regards
Jeff Zhang