Posted to user@spark.apache.org by Diana Carroll <dc...@cloudera.com> on 2014/04/14 21:24:41 UTC

using Kryo with pyspark?

I'm looking at the Tuning Guide suggestion to use Kryo instead of the default
serialization.  My questions:

Does pyspark use Java serialization by default, as Scala Spark does?  If
so, then...
can I use Kryo with pyspark instead?  The instructions say I should
register my classes with the Kryo serializer, but that's in Java/Scala.
If I simply set the spark.serializer property for my SparkContext, will it
at least use Kryo for Spark's own classes, even if I can't register any of
my own classes?

Thanks,
Diana
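
[Editor's note: a minimal sketch of the configuration Diana is asking about,
assuming a Spark 1.x-era PySpark setup. The property name spark.serializer and
the Kryo serializer class come from the Spark configuration docs; the app name
is made up. No class registration is attempted, since user classes here live
in Python rather than on the JVM.]

    # Sketch only: enable Kryo on the JVM side of a PySpark application
    # by setting spark.serializer before the SparkContext is created.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("kryo-test")  # hypothetical app name
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer"))

    sc = SparkContext(conf=conf)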

Re: using Kryo with pyspark?

Posted by Matei Zaharia <ma...@gmail.com>.
Kryo won’t make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try — you would just set spark.serializer and not try to register any classes. What might make more impact is storing data as MEMORY_ONLY_SER and turning on spark.rdd.compress, which will compress them. In Java this can add some CPU overhead but Python runs quite a bit slower so it might not matter, and it might speed stuff up by reducing GC or letting you cache more data.

Matei
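
[Editor's note: a sketch of the caching suggestion above, assuming a Spark
1.x-era PySpark API where StorageLevel.MEMORY_ONLY_SER is available; the
example RDD is made up.]

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .setAppName("cache-compressed")   # hypothetical app name
            .set("spark.rdd.compress", "true"))  # compress serialized blocks

    sc = SparkContext(conf=conf)

    # Cache partitions as serialized bytes rather than deserialized objects;
    # with spark.rdd.compress=true the cached blocks are also compressed,
    # trading some CPU for a smaller memory footprint.
    rdd = sc.parallelize(range(1000000))
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    print(rdd.count())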
