Posted to user@spark.apache.org by Bo Lu <bl...@etinternational.com> on 2013/10/31 18:21:00 UTC
Worker lost while processing large input
Hi spark users,
I just started learning to run standalone Spark applications on a standalone cluster, and I am very impressed by how easy it is to program with Spark.
But when I run it on a large input (about 30G) on my cluster, I get errors like "Removing BlockManager" and "worker lost".
The application is one iteration of a KMeans algorithm with an initial K = 16.
The cluster has 15 nodes with 4G RAM and 4 cores each (one node acts as both master and slave).
I am running Spark 0.8.0, built against Hadoop 1.1.1 for HDFS access.
In the spark-env.sh (on all the nodes and in the same directory):
export SPARK_WORKER_MEMORY=8g
export HADOOP_CONF_DIR="/share/hadoop-1.1.1/conf"
export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops"
In the driver program:
System.setProperty("spark.default.parallelism", "160");
System.setProperty("spark.storage.memoryFraction", "0.1");
System.setProperty("spark.executor.memory", "8g");
System.setProperty("spark.worker.timeout", "6000");
System.setProperty("spark.akka.frameSize", "10000");
System.setProperty("spark.akka.timeout", "6000");
In the program I use groupByKey(), which groups all the input by cluster id (the key). It turns out that one key has 7.8G of data; that is why I set spark.executor.memory to 8g. If I lower spark.executor.memory, I get an OOM, and I need to write all the data back to HDFS after clustering.
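To illustrate what that step does, here is a minimal sketch of groupByKey-style grouping using plain Java collections instead of Spark (the class, method, and variable names are all made up for this example):

```java
import java.util.*;
import java.util.stream.*;

// A minimal sketch, in plain Java collections rather than Spark, of what
// groupByKey does to the data: every value for a given key is pulled into
// one collection. Class and variable names are hypothetical.
public class GroupByKeySketch {

    // Mimics RDD.groupByKey(): (key, value) pairs -> key mapped to all its values
    static Map<Integer, List<Double>> groupByKey(List<Map.Entry<Integer, Double>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // (clusterId, point) pairs, as produced by a KMeans assignment step;
        // key 1 is the "hot" cluster that receives most of the points
        List<Map.Entry<Integer, Double>> assigned = List.of(
                Map.entry(0, 1.0), Map.entry(1, 2.0), Map.entry(0, 3.0),
                Map.entry(1, 4.0), Map.entry(1, 6.0));

        Map<Integer, List<Double>> grouped = groupByKey(assigned);
        // All values for the hot key now sit together, which is why a single
        // 7.8G key forces the whole group into one executor's memory
        System.out.println(grouped.get(1)); // prints [2.0, 4.0, 6.0]
    }
}
```

This is just to show why the largest key determines the executor memory requirement: grouping collects every value for a key into one place.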
I looked at the Environment tab of the application UI and confirmed that the system properties are all set, but one weird thing is that I see:
"13/10/31 11:48:57 WARN master.Master: Removing worker-20131031105954-pen13.xmen.eti-34747 because we got no heartbeat in 60 seconds"
Isn't this value supposed to be 6000, since I have set both System.setProperty("spark.worker.timeout", "6000") and System.setProperty("spark.akka.timeout", "6000")?
I also looked at the worker nodes and found that there is a lot of swapping and GC activity going on; maybe that is why the workers get lost?
If anyone can give me a hint on how to configure the system for such an application and cluster, that would be great.
Thanks.
Bo