Posted to user@spark.apache.org by Vipul Pandey <vi...@gmail.com> on 2014/01/31 02:00:50 UTC

Kryo serialization slow and runs OOM

Hola! 

I have about half a TB of (LZO-compressed protobuf) data that I'm trying to load onto my cluster. I have 20 nodes and I assign 100G for executor memory:
	-Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.executor.memory=100g
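For reference, in Spark 0.9+ the same settings can be expressed programmatically through SparkConf rather than system properties. A rough equivalent of the flags above (the app name is a placeholder):

	import org.apache.spark.{SparkConf, SparkContext}

	val conf = new SparkConf()
	  .setAppName("protobuf-load")  // placeholder app name
	  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
	  .set("spark.executor.memory", "100g")
	val sc = new SparkContext(conf)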

Now, when I load my dataset, transform it with some one-to-one transformations, and try to cache the eventual RDD, it runs really slowly and then runs out of memory. When I remove the Kryo serializer and fall back to the default Java serialization, it works just fine and is able to load and cache all ~700GB of resultant data.
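The caching step itself is just the standard persist call. A minimal sketch (rdd stands in for my transformed RDD); note that the configured serializer only affects cached blocks when a serialized storage level is used:

	import org.apache.spark.storage.StorageLevel

	// Plain .cache() is MEMORY_ONLY (deserialized objects); the serializer
	// only applies to serialized levels such as MEMORY_ONLY_SER.
	val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
	cached.count()  // forces materialization so the blocks actually get cached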
(Btw, I am not registering my classes with Kryo yet, but I don't think it should be worse than Java serialization - should it?)
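If it matters, registration would look roughly like this (a sketch only, since I haven't wired it up yet; MyProto is a stand-in for my actual protobuf classes):

	import com.esotericsoftware.kryo.Kryo
	import org.apache.spark.serializer.KryoRegistrator

	// Hypothetical registrator; MyProto is a placeholder for my protobuf classes.
	class MyRegistrator extends KryoRegistrator {
	  override def registerClasses(kryo: Kryo): Unit = {
	    kryo.register(classOf[MyProto])
	  }
	}

	// enabled alongside the flags above with:
	//   -Dspark.kryo.registrator=MyRegistrator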

Here's a summary of all the experiments I ran:

[experiment summary table/image not preserved in the plain-text archive]
Any explanation for this behavior? 
Also, I saw that even in cases where caching was successful, the Size In Memory would climb to a certain level, then drop, and then climb back up. Why does that happen?

Regards,
Vipul