Posted to user@spark.apache.org by Vipul Pandey <vi...@gmail.com> on 2014/01/31 02:00:50 UTC
Kryo serialization slow and runs OOM
Hola!
I have about half a TB of (LZO-compressed protobuf) data that I am trying to load onto my cluster. I have 20 nodes and I assign 100 GB of executor memory:
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.executor.memory=100g
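For reference, the same settings can also be expressed as a configuration fragment instead of -D system properties. This is only a sketch of the equivalent properties (property names as used in Spark at the time; check your version's docs):

```
# spark-defaults.conf-style equivalent of the -D flags above
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.executor.memory   100g
```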
Now, when I load my dataset, apply some one-to-one transformations, and try to cache the resulting RDD, it runs very slowly and then runs out of memory. When I remove the Kryo serializer and fall back to the default Java serialization, it works just fine and is able to load and cache the 700 GB of resultant data.
(Btw, I am not registering my classes with Kryo yet, but I don't think that should be worse than Java serialization - should it?)
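For context on the registration point above: without registration, Kryo writes the fully qualified class name with every object, which can inflate serialized size and slow things down. Registering the classes is done through a KryoRegistrator. This is only a sketch under assumed names - MyRegistrator and MyProto are hypothetical stand-ins for your own registrator and protobuf classes, and it needs Spark and Kryo on the classpath:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator; MyProto stands in for your protobuf message classes.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Registered classes are serialized with a small integer ID
    // instead of their full class name.
    kryo.register(classOf[MyProto])
  }
}
```

It would then be enabled alongside the serializer flag, e.g. -Dspark.kryo.registrator=MyRegistrator.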
Here's a summary of all the experiments I ran:
Any explanation for this behavior?
Also, I saw that even in the cases when caching was successful, the "Size in Memory" would climb to a certain level, then fall, and then climb back up. Why does that happen?
Regards,
Vipul