Posted to user@spark.apache.org by Diana Carroll <dc...@cloudera.com> on 2014/04/14 21:24:41 UTC
using Kryo with pyspark?
I'm looking at the Tuning Guide's suggestion to use Kryo instead of the default
serialization. My questions:
Does PySpark use Java serialization by default, as Scala Spark does? If
so, then...
Can I use Kryo with PySpark instead? The instructions say I should
register my classes with the Kryo serializer, but that's in Java/Scala.
If I simply set the spark.serializer property for my SparkContext, will it
at least use Kryo for Spark's own classes, even if I can't register any of
my own?
Thanks,
Diana
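[For reference, a minimal sketch of what setting spark.serializer from PySpark looks like. The serializer class name is Spark's standard KryoSerializer; note that this only affects JVM-side serialization, since Python objects are pickled before they reach the JVM.]

```python
# Sketch: enabling Kryo from PySpark via SparkConf.
# This changes how the JVM serializes data (e.g. shuffle traffic,
# cached byte[] blocks); Python-side objects are still pickled.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-test")
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)
```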
Re: using Kryo with pyspark?
Posted by Matei Zaharia <ma...@gmail.com>.
Kryo won’t make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try — you would just set spark.serializer and not try to register any classes. What might make more impact is storing data as MEMORY_ONLY_SER and turning on spark.rdd.compress, which will compress them. In Java this can add some CPU overhead but Python runs quite a bit slower so it might not matter, and it might speed stuff up by reducing GC or letting you cache more data.
Matei