Posted to user@spark.apache.org by agg <ag...@gmail.com> on 2014/02/18 06:18:50 UTC
Nodes failing when using MEMORY_AND_DISK_SER
Hi,
I am trying to run k-means (not the MLlib version) on 8 machines (8 cores, 60GB
RAM each) and am having some issues; hopefully someone will have some advice.
Basically, the input data (250GB) won't fit in memory, even using Kryo
serialization. When I run the job using MEMORY_ONLY, the program works,
but is slow (understandably). However, when I try to run it using
MEMORY_AND_DISK_SER to spill RDDs to disk, I get OutOfMemory exceptions (for
heap space) and worker nodes begin to die. I am running the job with the
following settings:
System.setProperty("spark.executor.memory", "55g")
System.setProperty("spark.storage.memoryFraction", ".2")
System.setProperty("spark.default.parallelism", "5000")
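For context, here is a minimal sketch of how these properties combine with an explicit storage level, assuming the Spark 0.9-era API; the master URL, input path, and the line-parsing step are hypothetical placeholders, not part of my actual job:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Sketch only: master URL, HDFS path, and parsing are placeholders.
System.setProperty("spark.executor.memory", "55g")
System.setProperty("spark.storage.memoryFraction", ".2")  // cache limited to ~20% of heap
System.setProperty("spark.default.parallelism", "5000")
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext("spark://master:7077", "kmeans")

val points = sc.textFile("hdfs:///data/points")
  .map(line => line.split(' ').map(_.toDouble))
  // Serialized in memory; partitions that don't fit spill to disk.
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
```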
What is the best configuration for Spark for a scenario like this? Does
anyone have any thoughts?
Thanks!
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Nodes-failing-when-using-MEMORY-AND-DISK-SER-tp1664.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.