Posted to user@spark.apache.org by Bo Lu <bl...@etinternational.com> on 2013/10/31 18:21:00 UTC

Worker lost during processing large input

Hi Spark users,

I just started learning to run standalone Spark applications on a standalone cluster, and I am very impressed by how easy it is to program with Spark.
But when I run it on a large input (about 30G) on my cluster, I get errors like "Removing BlockManager" and "worker lost".

The application runs a single iteration of the K-means algorithm, with initial K = 16.

The cluster I have is: 15 nodes with 4G RAM and 4 cores each (one of the nodes acts as both master and slave).

I am running Spark 0.8.0, built against Hadoop 1.1.1 for accessing HDFS.

In spark-env.sh (the same file, in the same directory, on all the nodes):

export SPARK_WORKER_MEMORY=8g
export HADOOP_CONF_DIR="/share/hadoop-1.1.1/conf"
export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops"

In the driver program:

System.setProperty("spark.default.parallelism", "160");
System.setProperty("spark.storage.memoryFraction", "0.1");
System.setProperty("spark.executor.memory", "8g");
System.setProperty("spark.worker.timeout", "6000");
System.setProperty("spark.akka.frameSize", "10000");
System.setProperty("spark.akka.timeout", "6000");
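
(All of these are set at the top of main(), before the SparkContext is constructed; as far as I understand, Spark 0.8 only reads these system properties at context-creation time, so the ordering matters. The sketch further below shows the placement.)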

In the program, I use groupByKey(), which groups all the input points by cluster id (the key). It turns out that one key alone has 7.8G of data, which is why I set spark.executor.memory to 8g; if I lower it, I get an OOM. But I need to write all the data back to HDFS after clustering, so I cannot simply drop it.
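
To make this concrete, here is a simplified sketch of what the program does (the input format, paths, class name, master URL, jar name, and initial centers are placeholders, not my real code):

import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class KMeansOneIteration {
    static final int K = 16;    // initial number of clusters
    static final int DIM = 3;   // placeholder point dimension

    // Index of the center nearest to p (squared Euclidean distance).
    static int closest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centers[i][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        // All the System.setProperty(...) calls listed above go here,
        // before the context is created.
        JavaSparkContext sc = new JavaSparkContext(
            "spark://master:7077",            // placeholder master URL
            "KMeansOneIteration",
            System.getenv("SPARK_HOME"),
            "kmeans.jar");                    // placeholder application jar

        final double[][] centers = new double[K][DIM];  // placeholder initial centers

        // Parse the ~30G input: one space-separated point per line.
        JavaRDD<double[]> points = sc.textFile("hdfs://master:9000/input")
            .map(new Function<String, double[]>() {
                public double[] call(String line) {
                    String[] tok = line.split(" ");
                    double[] p = new double[tok.length];
                    for (int i = 0; i < tok.length; i++)
                        p[i] = Double.parseDouble(tok[i]);
                    return p;
                }
            });

        // Key each point by its nearest cluster id, then pull all points of a
        // cluster together -- this groupByKey() is where one key ends up
        // holding ~7.8G of values.
        JavaPairRDD<Integer, List<double[]>> clusters = points
            .map(new PairFunction<double[], Integer, double[]>() {
                public Tuple2<Integer, double[]> call(double[] p) {
                    return new Tuple2<Integer, double[]>(closest(p, centers), p);
                }
            })
            .groupByKey();

        // Write the clustered points back to HDFS.
        clusters.saveAsTextFile("hdfs://master:9000/output");  // placeholder path
    }
}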

I looked at the Environment tab of the application UI and confirmed that the system properties are all set, but one weird thing is that I still get:
"13/10/31 11:48:57 WARN master.Master: Removing worker-20131031105954-pen13.xmen.eti-34747 because we got no heartbeat in 60 seconds"
Shouldn't this value be 6000, since I have set both System.setProperty("spark.worker.timeout", "6000"); and System.setProperty("spark.akka.timeout", "6000");?

I also looked at the worker nodes and found that there is a lot of swapping going on, as well as heavy GC; maybe that is why the workers get lost?

If anyone can give me a hint on how to configure the system for such an application and cluster, that would be great.

Thanks.

Bo