Posted to user@spark.apache.org by AlexG <sw...@gmail.com> on 2015/08/13 01:28:26 UTC

What is the cause of, and how to recover from, unresponsive nodes w/ the spark-ec2 script

I'm using the spark-ec2 script to launch a 30-node r3.8xlarge cluster.
Occasionally several nodes become unresponsive: first HDFS complains that
it can't find some blocks; then, when I go to restart Hadoop, the messages
indicate that the connection to some nodes timed out; and when I check,
I can't ssh into those nodes at all.
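
For reference, the checks I'm running look roughly like this; the paths
assume the default spark-ec2 layout, with ephemeral HDFS under
/root/ephemeral-hdfs on the master, so adjust for your setup:

    # check HDFS block health from the master node
    /root/ephemeral-hdfs/bin/hadoop fsck /

    # restart HDFS; this is where I see the connection timeouts
    /root/ephemeral-hdfs/bin/stop-dfs.sh
    /root/ephemeral-hdfs/bin/start-dfs.sh

    # probe a slave directly; on the bad nodes this never connects
    ssh -o ConnectTimeout=10 root@<slave-hostname> uptime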

Is this a problem others have experienced? What is causing this random
failure, or where can I look for relevant logs? And how can I recover
from it other than destroying the cluster and starting anew? That is
what I've been doing so far, but it is time-consuming and tedious, and
it requires pulling my large dataset down from S3 to HDFS all over again.
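
(For what it's worth, the re-download step after rebuilding is just a
distcp from S3, roughly along these lines, with the bucket and paths as
placeholders:

    # copy the dataset from S3 back into the cluster's HDFS
    /root/ephemeral-hdfs/bin/hadoop distcp \
        s3n://my-bucket/my-dataset hdfs:///my-dataset

so it's not hard, just slow for a dataset of this size.)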
