Posted to user@spark.apache.org by Sameer Tilak <ss...@live.com> on 2014/07/08 08:51:19 UTC

Spark: All masters are unresponsive!

Hi All,
I am having a few issues with stability and scheduling. When I use the Spark shell to submit my application, I get the following error message and the Spark shell crashes. I have a small 4-node cluster for a PoC. I tried both manual and script-based cluster setup, and I tried using the FQDN as well when specifying the master node, but no luck.
14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[6] at map at JaccardScore.scala:83)
14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO Executor: Running task ID 1
14/07/07 23:44:35 INFO Executor: Running task ID 2
14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...
14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
	at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
	at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
	at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
	at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
	at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
	at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
	at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
	at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)

Re: Spark: All masters are unresponsive!

Posted by Andrew Or <an...@databricks.com>.
Hi Sameer,

>>> 14/07/09 10:00:00 INFO CoarseGrainedExecutorBackend: Connecting to
driver: akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler

Looks like your executors cannot reach your driver because they are
contacting "localhost" when the driver exists on a different node. (I am
assuming you ran bin/spark-submit on the Master node). It is very strange
to me that the driver URL ended up pointing to "localhost", after you
indicated that conf.get("spark.driver.host") == "pzxnvm2018.x.y.name.org".
In standalone mode, executors get the driver URL from the SparkContext
itself, which sets it based on "spark.driver.host". However, since your
"spark.driver.host" is not "localhost", it's almost as if the executors
that you're launching are not launched from the same Master / driver that
you configured.

I would try explicitly setting "spark.driver.host" to 172.16.48.41 in your
SparkConf. If you do this, then you don't need to set SPARK_LOCAL_IP. As a
sanity check, you can also try using only IP addresses explicitly and
commenting out all relevant entries in your /etc/hosts.
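
Roughly, something like this in your app (an untested sketch, reusing the
master IP from your logs; adjust for your setup):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://172.16.48.41:7077")
      .setAppName("ApproxStrMatch")
      // Advertise an address the executors can actually reach,
      // instead of whatever resolves to "localhost" on the driver node.
      .set("spark.driver.host", "172.16.48.41")
    val sc = new SparkContext(conf)
    // Sanity check: this should print 172.16.48.41, not "localhost".
    println(sc.getConf.get("spark.driver.host"))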

Let me know if you find anything.
Andrew


2014-07-09 16:01 GMT-07:00 Sameer Tilak <ss...@live.com>:

> Hi All,
> I would really appreciate help with this issue. We are evaluating Spark
> and would love to get this set up to do some benchmarking.
>
> ------------------------------
> From: sstilak@live.com
> To: user@spark.apache.org
> Subject: RE: Spark: All masters are unresponsive!
> Date: Wed, 9 Jul 2014 10:11:24 -0700
>
>
> Dear Andrew and Aaron,
> Please find all the details here. I really appreciate your help.
>
> Master node configuration:
>
> /etc/hosts on the Master node
> 127.0.0.1 localhost.localdomain localhost
>
> Another entry
>
> 172.16.48.41 pzxnvm2018.x.y.name.org pzxnvm2018
>
>
> spark-env.sh on the Master node
>
> export SPARK_MASTER_IP=172.16.48.41
> export SPARK_WORKER_MEMORY=2G
> export SPARK_WORKER_CORES=1
> export SPARK_LOCAL_IP=172.16.48.41
> # export SPARK_MASTER_OPTS+=" -Dspark.akka.frameSize=10000"
> # export SPARK_WORKER_OPTS+=" -Dspark.akka.frameSize=10000"
>
>
> No slaves file on the master node (I had it before, but now I am manually
> starting my worker node)
>
> I start the master node using the ./start-master.sh script in the sbin directory
>
> I got the following message:
>
> ./start-master.sh
> starting org.apache.spark.deploy.master.Master, logging to
> /apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-userid-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out
> bash-4.1$ tail
> /apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-userid-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out
> 14/07/09 09:55:53 INFO SecurityManager: Using Spark's default log4j
> profile: org/apache/spark/log4j-defaults.properties
> 14/07/09 09:55:53 INFO SecurityManager: Changing view acls to: userid
> 14/07/09 09:55:53 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(userid)
> 14/07/09 09:55:54 INFO Slf4jLogger: Slf4jLogger started
> 14/07/09 09:55:54 INFO Remoting: Starting remoting
> 14/07/09 09:55:54 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkMaster@172.16.48.41:7077]
> 14/07/09 09:55:55 INFO Master: Starting Spark master at spark://
> 172.16.48.41:7077
> 14/07/09 09:55:55 INFO MasterWebUI: Started MasterWebUI at
> http://pzxnvm2018.dcld.pldc.kp.org:8080
> 14/07/09 09:55:55 INFO Master: I have been elected leader! New state: ALIVE
> 14/07/09 09:56:37 INFO Master: Registering worker
> pzxnvm2022.dcld.pldc.kp.org:53029 with 2 cores, 4.0 GB RAM
>
> Worker node configuration:
>
> /etc/hosts on the Worker node
>
> 127.0.0.1 localhost.localdomain localhost
>
> Another entry
>
> 172.16.48.44 pzxnvm2022.x.y.name.org  pzxnvm2022
>
> No spark-env.sh or slaves file on the worker node. I manually start my
> worker node as:
>
> ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://
> 172.16.48.41:7077 -m 4G -c 2
>
>
> Inside my application I have the following code:
>
>  val conf = new SparkConf().setMaster("spark://172.16.48.41:7077").setAppName("ApproxStrMatch")
>  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>  conf.set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")
>
>
>   I use the following command to start spark-shell and my app:
>
>  ./spark-shell --jars
> /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar
>
> When I use sc.getConf.get("spark.driver.host")
> res0: String = pzxnvm2018.x.y.name.org
>
> Master node shell output:
> 14/07/09 10:00:18 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140709095956-0000/6 on hostPort pzxnvm2022.x.y.name.org:53029 with
> 2 cores, 512.0 MB RAM
> 14/07/09 10:00:18 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/6 is now RUNNING
> 14/07/09 10:00:21 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/6 is now FAILED (Command exited with code 1)
> 14/07/09 10:00:21 INFO SparkDeploySchedulerBackend: Executor
> app-20140709095956-0000/6 removed: Command exited with code 1
> 14/07/09 10:00:21 INFO AppClient$ClientActor: Executor added:
> app-20140709095956-0000/7 on
> worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (
> pzxnvm2022.x.y.name.org:53029) with 2 cores
> 14/07/09 10:00:21 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140709095956-0000/7 on hostPort pzxnvm2022.x.y.name.org:53029 with
> 2 cores, 512.0 MB RAM
> 14/07/09 10:00:21 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/7 is now RUNNING
> 14/07/09 10:00:24 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/7 is now FAILED (Command exited with code 1)
> 14/07/09 10:00:24 INFO SparkDeploySchedulerBackend: Executor
> app-20140709095956-0000/7 removed: Command exited with code 1
> 14/07/09 10:00:24 INFO AppClient$ClientActor: Executor added:
> app-20140709095956-0000/8 on
> worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (
> pzxnvm2022.x.y.name.org:53029) with 2 cores
> 14/07/09 10:00:24 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140709095956-0000/8 on hostPort pzxnvm2022.x.y.name.org:53029 with
> 2 cores, 512.0 MB RAM
> 14/07/09 10:00:24 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/8 is now RUNNING
> 14/07/09 10:00:28 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/8 is now FAILED (Command exited with code 1)
> 14/07/09 10:00:28 INFO SparkDeploySchedulerBackend: Executor
> app-20140709095956-0000/8 removed: Command exited with code 1
> 14/07/09 10:00:28 INFO AppClient$ClientActor: Executor added:
> app-20140709095956-0000/9 on
> worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (
> pzxnvm2022.x.y.name.org:53029) with 2 cores
> 14/07/09 10:00:28 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140709095956-0000/9 on hostPort pzxnvm2022.x.y.name.org:53029 with
> 2 cores, 512.0 MB RAM
> 14/07/09 10:00:28 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/9 is now RUNNING
> 14/07/09 10:00:31 INFO AppClient$ClientActor: Executor updated:
> app-20140709095956-0000/9 is now FAILED (Command exited with code 1)
> 14/07/09 10:00:31 INFO SparkDeploySchedulerBackend: Executor
> app-20140709095956-0000/9 removed: Command exited with code 1
> 14/07/09 10:00:31 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: Master removed our application: FAILED
> 14/07/09 10:00:31 ERROR TaskSchedulerImpl: Exiting due to error from
> cluster scheduler: Master removed our application: FAILED
>
>
> Worker node error log:
>
> 14/07/09 10:00:14 INFO Worker: Executor app-20140709095956-0000/4 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/07/09 10:00:14 INFO Worker: Asked to launch executor
> app-20140709095956-0000/5 for ApproxStrMatch
> 14/07/09 10:00:14 INFO ExecutorRunner: Launch command:
> "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp"
> "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf"
> "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "5" "
> pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://
> sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker"
> "app-20140709095956-0000"
> 14/07/09 10:00:18 INFO Worker: Executor app-20140709095956-0000/5 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/07/09 10:00:18 INFO Worker: Asked to launch executor
> app-20140709095956-0000/6 for ApproxStrMatch
> 14/07/09 10:00:18 INFO ExecutorRunner: Launch command:
> "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp"
> "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf"
> "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "6" "
> pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://
> sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker"
> "app-20140709095956-0000"
> 14/07/09 10:00:21 INFO Worker: Executor app-20140709095956-0000/6 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/07/09 10:00:21 INFO Worker: Asked to launch executor
> app-20140709095956-0000/7 for ApproxStrMatch
> 14/07/09 10:00:21 INFO ExecutorRunner: Launch command:
> "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp"
> "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf"
> "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "7" "
> pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://
> sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker"
> "app-20140709095956-0000"
> 14/07/09 10:00:24 INFO Worker: Executor app-20140709095956-0000/7 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/07/09 10:00:24 INFO Worker: Asked to launch executor
> app-20140709095956-0000/8 for ApproxStrMatch
> 14/07/09 10:00:24 INFO ExecutorRunner: Launch command:
> "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp"
> "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf"
> "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "8" "
> pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://
> sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker"
> "app-20140709095956-0000"
> 14/07/09 10:00:28 INFO Worker: Executor app-20140709095956-0000/8 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 14/07/09 10:00:28 INFO Worker: Asked to launch executor
> app-20140709095956-0000/9 for ApproxStrMatch
>
> Here is what I see on the other node:
>
> Spark Executor Command:
> "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp"
> "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf"
> "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "0" "
> pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://
> sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker"
> "app-20140709095956-0000"
> ========================================
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.conf.Configuration).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> 14/07/09 09:59:58 INFO SparkHadoopUtil: Using Spark's default log4j
> profile: org/apache/spark/log4j-defaults.properties
> 14/07/09 09:59:58 INFO SecurityManager: Changing view acls to: userid
> 14/07/09 09:59:58 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(userid)
> 14/07/09 09:59:59 INFO Slf4jLogger: Slf4jLogger started
> 14/07/09 09:59:59 INFO Remoting: Starting remoting
> 14/07/09 10:00:00 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733]
> 14/07/09 10:00:00 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733]
> 14/07/09 10:00:00 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler
> 14/07/09 10:00:00 INFO WorkerWatcher: Connecting to worker akka.tcp://
> sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker
> 14/07/09 10:00:00 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> [akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733] ->
> [akka.tcp://spark@localhost:38145] disassociated! Shutting down.
>
>
> ------------------------------
> Date: Tue, 8 Jul 2014 15:32:09 -0700
> Subject: Re: Spark: All masters are unresponsive!
> From: andrew@databricks.com
> To: user@spark.apache.org
>
> It seems that your driver (which I'm assuming you launched on the master
> node) can now connect to the Master, but your executors cannot. Did you
> make sure that all nodes have the same conf/spark-defaults.conf,
> conf/spark-env.sh, and conf/slaves? It would be good if you can post the
> stderr of the executor logs here. They are located on the worker node under
> $SPARK_HOME/work.
>
> (As of Spark-1.0, we recommend that you use the spark-submit arguments,
> i.e.
>
> bin/spark-submit --master spark://pzxnvm2018.x.y.name.org:7077
> --executor-memory 4g --executor-cores 3 --class <your main class> <your
> application jar> <application arguments ...>)
>
>
> 2014-07-08 10:12 GMT-07:00 Sameer Tilak <ss...@live.com>:
>
> Hi Akhil et al.,
> I made the following changes:
>
> In spark-env.sh I added the following three entries (standalone mode)
>
> export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
> export SPARK_WORKER_MEMORY=4G
> export SPARK_WORKER_CORES=3
>
> I then use the start-master and start-slaves commands to start the services.
> Another thing that I have noticed is that the number of cores I specified is
> not used: 2022 shows up with only 1 core, and 2023 and 2024 show up with 4
> cores.
>
> In the Web UI:
> URL: spark://pzxnvm2018.x.y.name.org:7077
>
> I run the spark shell command from pzxnvm2018.
>
> /etc/hosts on my master node has the following entry:
> master-ip  pzxnvm2018.x.y.name.org pzxnvm2018
>
> /etc/hosts on a worker node has the following entry:
> worker-ip        pzxnvm2023.x.y.name.org pzxnvm2023
>
>
> However, in my master node's log file I still see this:
>
> ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@pzxnvm2018.x.y.name.org:7077] ->
> [akka.tcp://spark@localhost:43569]: Error [Association failed with
> [akka.tcp://spark@localhost:43569]]
>
> My spark-shell has the following output:
>
>
> scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to
> Spark cluster with app ID app-20140708100139-0000
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/0 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/0 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/1 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/1 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/2 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/2 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/0 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/1 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/2 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/0 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/0 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/3 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/3 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/3 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/1 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/1 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/4 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/4 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/4 is now RUNNING
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/2 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/2 removed: Command exited with code 1
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/5 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/5 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/5 is now RUNNING
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/3 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/3 removed: Command exited with code 1
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/6 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/6 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/6 is now RUNNING
> 14/07/08 10:01:45 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/4 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:45 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/4 removed: Command exited with code 1
> 14/07/08 10:01:45 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/7 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
>
>
> ------------------------------
> Date: Tue, 8 Jul 2014 12:29:21 +0530
> Subject: Re: Spark: All masters are unresponsive!
> From: akhil@sigmoidanalytics.com
> To: user@spark.apache.org
>
>
> Are you sure this is your master URL: spark://pzxnvm2018:7077?
>
> You can look it up in the WebUI (usually http://pzxnvm2018:8080), top left
> corner. Also make sure you are able to telnet pzxnvm2018 7077 from the
> machines where you are running the spark shell.
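>
> A minimal equivalent check from the Scala shell (a rough sketch, assuming the
> same host and port as above):
>
>     // Equivalent of "telnet pzxnvm2018 7077": throws an exception if the
>     // master port is unreachable from this machine.
>     import java.net.{InetSocketAddress, Socket}
>     val sock = new Socket()
>     sock.connect(new InetSocketAddress("pzxnvm2018", 7077), 5000) // 5s timeout
>     sock.close()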
>
> Thanks
> Best Regards
>
>
> On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak <ss...@live.com> wrote:
>
> Hi All,
>
> I am having a few issues with stability and scheduling. When I use the Spark
> shell to submit my application, I get the following error message and the
> Spark shell crashes. I have a small 4-node cluster for a PoC. I tried both
> manual and script-based cluster setup, and I tried using the FQDN as well
> when specifying the master node, but no luck.
>
> 14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 1 (MappedRDD[6] at map at JaccardScore.scala:83)
> 14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes
> in 0 ms
>  14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO Executor: Running task ID 1
> 14/07/07 23:44:35 INFO Executor: Running task ID 2
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
> 14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from
> cluster scheduler: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
> at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
>  at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>  at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
> at
> org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
>  at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
> at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>  at
> org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
> at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>  at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> 14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
>  at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
>  at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>  at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
> at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>  at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>  at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>  at java.lang.Thread.run(Thread.java:722)
>
>
>
>

RE: Spark: All masters are unresponsive!

Posted by Sameer Tilak <ss...@live.com>.
Hi All,I would really appreciate help with this issue. We are evaluating Spark and would love to get this set up to do some benchmarking. 
From: sstilak@live.com
To: user@spark.apache.org
Subject: RE: Spark: All masters are unresponsive!
Date: Wed, 9 Jul 2014 10:11:24 -0700




Dear Andrew and Aaron,Please find all the details here. I really appreciate your help.
Master node configuration:
/etc/host on the Master node 127.0.0.1	localhost.localdomain	localhost
Another entry 
172.16.48.41	pzxnvm2018.x.y.name.org pzxnvm2018

spark-env.sh on the Master node 
export SPARK_MASTER_IP=172.16.48.41export SPARK_WORKER_MEMORY=2Gexport SPARK_WORKER_CORES=1export SPARK_LOCAL_IP=172.16.48.41# export SPARK_MASTER_OPTS+=" -Dspark.akka.frameSize=10000"# export SPARK_WORKER_OPTS+=" -Dspark.akka.frameSize=10000"

No slaves file on the master node (I had it before, but now I am manually starting my worker node)
I start master node using ./start-master command in sbin directory
I got the following message:
./start-master.shstarting org.apache.spark.deploy.master.Master, logging to /apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-userid-org.apache.spark.deploy.master.Master-1-pzxnvm2018.outbash-4.1$ tail /apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-userid-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out14/07/09 09:55:53 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties14/07/09 09:55:53 INFO SecurityManager: Changing view acls to: userid14/07/09 09:55:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userid)14/07/09 09:55:54 INFO Slf4jLogger: Slf4jLogger started14/07/09 09:55:54 INFO Remoting: Starting remoting14/07/09 09:55:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@172.16.48.41:7077]14/07/09 09:55:55 INFO Master: Starting Spark master at spark://172.16.48.41:707714/07/09 09:55:55 INFO MasterWebUI: Started MasterWebUI at http://pzxnvm2018.dcld.pldc.kp.org:808014/07/09 09:55:55 INFO Master: I have been elected leader! New state: ALIVE14/07/09 09:56:37 INFO Master: Registering worker pzxnvm2022.dcld.pldc.kp.org:53029 with 2 cores, 4.0 GB RAM
Worker node configuration::
/etc/host on the Worker node 
127.0.0.1	localhost.localdomain	localhost
Another entry 
172.16.48.44	pzxnvm2022.x.y.name.org  pzxnvm2022
No spark-env.sh or slaves file on the worker node. I manually start my worker node as:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.16.48.41:7077 -m 4G -c 2

Inside my application I have the following code:  val conf = new SparkConf().setMaster("spark://172.16.48.41:7077").setAppName("ApproxStrMatch") conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")      I use the following command to start spark-shell and my app:   ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar
When I use sc.getConf.get("spark.driver.host")res0: String = pzxnvm2018.x.y.name.org
Master node shell o/p:14/07/09 10:00:18 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/6 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM14/07/09 10:00:18 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/6 is now RUNNING14/07/09 10:00:21 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/6 is now FAILED (Command exited with code 1)14/07/09 10:00:21 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/6 removed: Command exited with code 114/07/09 10:00:21 INFO AppClient$ClientActor: Executor added: app-20140709095956-0000/7 on worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (pzxnvm2022.x.y.name.org:53029) with 2 cores14/07/09 10:00:21 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/7 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM14/07/09 10:00:21 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/7 is now RUNNING14/07/09 10:00:24 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/7 is now FAILED (Command exited with code 1)14/07/09 10:00:24 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/7 removed: Command exited with code 114/07/09 10:00:24 INFO AppClient$ClientActor: Executor added: app-20140709095956-0000/8 on worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (pzxnvm2022.x.y.name.org:53029) with 2 cores14/07/09 10:00:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/8 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM14/07/09 10:00:24 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/8 is now RUNNING14/07/09 10:00:28 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/8 is now FAILED (Command exited with code 1)14/07/09 10:00:28 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/8 removed: Command exited with code 114/07/09 10:00:28 INFO AppClient$ClientActor: Executor added: app-20140709095956-0000/9 on worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (pzxnvm2022.x.y.name.org:53029) with 2 cores14/07/09 10:00:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/9 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM14/07/09 10:00:28 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/9 is now RUNNING14/07/09 10:00:31 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/9 is now FAILED (Command exited with code 1)14/07/09 10:00:31 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/9 removed: Command exited with code 114/07/09 10:00:31 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED14/07/09 10:00:31 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED

Worker node error log from the 
14/07/09 10:00:14 INFO Worker: Executor app-20140709095956-0000/4 finished with state FAILED message Command exited with code 1 exitStatus 114/07/09 10:00:14 INFO Worker: Asked to launch executor app-20140709095956-0000/5 for ApproxStrMatch14/07/09 10:00:14 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "5" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"14/07/09 10:00:18 INFO Worker: Executor app-20140709095956-0000/5 finished with state FAILED message Command exited with code 1 exitStatus 114/07/09 10:00:18 INFO Worker: Asked to launch executor app-20140709095956-0000/6 for ApproxStrMatch14/07/09 10:00:18 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "6" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"14/07/09 10:00:21 INFO Worker: Executor app-20140709095956-0000/6 finished with state FAILED message Command exited with code 1 exitStatus 114/07/09 10:00:21 INFO Worker: Asked to launch executor app-20140709095956-0000/7 for ApproxStrMatch14/07/09 10:00:21 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "7" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"14/07/09 10:00:24 INFO Worker: Executor app-20140709095956-0000/7 finished with state FAILED message Command exited with code 1 exitStatus 114/07/09 10:00:24 INFO Worker: Asked to launch executor app-20140709095956-0000/8 for ApproxStrMatch14/07/09 10:00:24 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "8" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"14/07/09 10:00:28 INFO Worker: Executor app-20140709095956-0000/8 finished with state FAILED message Command exited with code 1 exitStatus 114/07/09 10:00:28 INFO Worker: Asked to launch executor app-20140709095956-0000/9 for ApproxStrMatch
Here is what I see  on the order node:
Spark Executor Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "0" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"========================================
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).log4j:WARN Please initialize the log4j system properly.log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.14/07/09 09:59:58 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties14/07/09 09:59:58 INFO SecurityManager: Changing view acls to: userid14/07/09 09:59:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userid)14/07/09 09:59:59 INFO Slf4jLogger: Slf4jLogger started14/07/09 09:59:59 INFO Remoting: Starting remoting14/07/09 10:00:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733]14/07/09 10:00:00 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733]14/07/09 10:00:00 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler14/07/09 10:00:00 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker14/07/09 10:00:00 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733] -> [akka.tcp://spark@localhost:38145] disassociated! Shutting down.

Date: Tue, 8 Jul 2014 15:32:09 -0700
Subject: Re: Spark: All masters are unresponsive!
From: andrew@databricks.com
To: user@spark.apache.org

It seems that your driver (which I'm assuming you launched on the master node) can now connect to the Master, but your executors cannot. Did you make sure that all nodes have the same conf/spark-defaults.conf, conf/spark-env.sh, and conf/slaves? It would be good if you can post the stderr of the executor logs here. They are located on the worker node under $SPARK_HOME/work.

(As of Spark-1.0, we recommend that you use the spark-submit arguments, i.e.
bin/spark-submit --master spark://pzxnvm2018.x.y.name.org:7077 --executor-memory 4g --executor-cores 3 --class <your main class> <your application jar> <application arguments ...>)


2014-07-08 10:12 GMT-07:00 Sameer Tilak <ss...@live.com>:




Hi Akhil et al.,I made the following changes:
In spark-env.sh I added the following three entries (standalone mode)
export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
export SPARK_WORKER_MEMORY=4Gexport SPARK_WORKER_CORES=3
I then use start-master and start-slaves commands to start the services. Another sthing that I have noticed is that the number of cores that I specified is npot  used: 2022 shows up with only 1 core and 2023 and 2024 show up with 4 cores. 

In the Web UI:URL: spark://pzxnvm2018.x.y.name.org:7077
I run the spark shell command from pzxnvm2018. 

/etc/hosts on my master node has following entry:master-ip 	pzxnvm2018.x.y.name.org pzxnvm2018

/etc/hosts on my a worker node has following entry:
worker-ip	       pzxnvm2023.x.y.name.org pzxnvm2023


However, on my master node log file I still see this:
ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@pzxnvm2018.x.y.name.org:7077] -> [akka.tcp://spark@localhost:43569]: Error [Association failed with [akka.tcp://spark@localhost:43569]]

My spark-shell has the following o/p

scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140708100139-0000
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/0 on worker-20140708095558-pzxnvm2024.x.y.name.orgg-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/0 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/1 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/1 on hostPort pzxnvm2023.x.y.name.org:38294 with 4 cores, 512.0 MB RAM
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/2 on worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/2 on hostPort pzxnvm2022.x.y.name.org:41826 with 1 cores, 512.0 MB RAM
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/0 is now RUNNING14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/1 is now RUNNING
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/2 is now RUNNING14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/0 is now FAILED (Command exited with code 1)
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/0 removed: Command exited with code 114/07/08 10:01:42 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/3 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/3 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/3 is now RUNNING14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/1 is now FAILED (Command exited with code 1)
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/1 removed: Command exited with code 114/07/08 10:01:42 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/4 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/4 on hostPort pzxnvm2023.x.y.name.org:38294 with 4 cores, 512.0 MB RAM
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/4 is now RUNNING14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/2 is now FAILED (Command exited with code 1)
14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/2 removed: Command exited with code 114/07/08 10:01:43 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/5 on worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/5 on hostPort pzxnvm2022.x.y.name.org:41826 with 1 cores, 512.0 MB RAM
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/5 is now RUNNING14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/3 is now FAILED (Command exited with code 1)
14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/3 removed: Command exited with code 114/07/08 10:01:44 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/6 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/6 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/6 is now RUNNING14/07/08 10:01:45 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/4 is now FAILED (Command exited with code 1)
14/07/08 10:01:45 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/4 removed: Command exited with code 114/07/08 10:01:45 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/7 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores


Date: Tue, 8 Jul 2014 12:29:21 +0530
Subject: Re: Spark: All masters are unresponsive!
From: akhil@sigmoidanalytics.com

To: user@spark.apache.org

Are you sure this is your master URL spark://pzxnvm2018:7077 ?


You can look it up in the WebUI (mostly http://pzxnvm2018:8080) top left corner. Also make sure you are able to telnet pzxnvm2018 7077 from the machines where you are running the spark shell. 

ThanksBest Regards


On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak <ss...@live.com> wrote:





Hi All,
I am having a few issues with stability and scheduling. When I use spark shell to submit my application. I get the following error message and spark shell crashes. I have a small 4-node cluster for PoC. I tried both manual and scripts-based cluster set up. I tried using FQDN as well for specifying the master node, but no luck.  


14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[6] at map at JaccardScore.scala:83)14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks

14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes in 0 ms

14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes in 0 ms
14/07/07 23:44:35 INFO Executor: Running task ID 1
14/07/07 23:44:35 INFO Executor: Running task ID 214/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally

14/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+9723938914/07/07 23:44:35 INFO HadoopRDD: Input split: hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390

14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master spark://pzxnvm2018:7077...14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.

14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()

java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
	at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
	at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
	at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
	at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
	at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
	at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
	at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
	at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)

RE: Spark: All masters are unresponsive!

Posted by Sameer Tilak <ss...@live.com>.
Dear Andrew and Aaron,
Please find all the details here. I really appreciate your help.
Master node configuration:
/etc/hosts on the master node:
127.0.0.1	localhost.localdomain	localhost
Another entry:
172.16.48.41	pzxnvm2018.x.y.name.org pzxnvm2018

spark-env.sh on the Master node 
export SPARK_MASTER_IP=172.16.48.41
export SPARK_WORKER_MEMORY=2G
export SPARK_WORKER_CORES=1
export SPARK_LOCAL_IP=172.16.48.41
# export SPARK_MASTER_OPTS+=" -Dspark.akka.frameSize=10000"
# export SPARK_WORKER_OPTS+=" -Dspark.akka.frameSize=10000"

No slaves file on the master node (I had it before, but now I am manually starting my worker node)
I start the master node using the ./start-master.sh script in the sbin directory.
I got the following message:
./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-userid-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out
bash-4.1$ tail /apps/software/spark-1.0.0-bin-hadoop1/sbin/../logs/spark-userid-org.apache.spark.deploy.master.Master-1-pzxnvm2018.out
14/07/09 09:55:53 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/09 09:55:53 INFO SecurityManager: Changing view acls to: userid
14/07/09 09:55:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userid)
14/07/09 09:55:54 INFO Slf4jLogger: Slf4jLogger started
14/07/09 09:55:54 INFO Remoting: Starting remoting
14/07/09 09:55:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@172.16.48.41:7077]
14/07/09 09:55:55 INFO Master: Starting Spark master at spark://172.16.48.41:7077
14/07/09 09:55:55 INFO MasterWebUI: Started MasterWebUI at http://pzxnvm2018.dcld.pldc.kp.org:8080
14/07/09 09:55:55 INFO Master: I have been elected leader! New state: ALIVE
14/07/09 09:56:37 INFO Master: Registering worker pzxnvm2022.dcld.pldc.kp.org:53029 with 2 cores, 4.0 GB RAM
Worker node configuration:
/etc/hosts on the worker node:
127.0.0.1	localhost.localdomain	localhost
Another entry:
172.16.48.44	pzxnvm2022.x.y.name.org  pzxnvm2022
No spark-env.sh or slaves file on the worker node. I manually start my worker node as:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.16.48.41:7077 -m 4G -c 2

Inside my application I have the following code:
val conf = new SparkConf().setMaster("spark://172.16.48.41:7077").setAppName("ApproxStrMatch")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")
I use the following command to start spark-shell and my app:
./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar
When I use sc.getConf.get("spark.driver.host") I get:
res0: String = pzxnvm2018.x.y.name.org
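As an aside, the worker-side logs below show the executors trying to call back to a driver advertised as akka.tcp://spark@localhost:38145. A minimal sketch of a SparkConf that pins the driver to an address the workers can actually reach would look like the following; the 172.16.48.41 value is only an assumption for illustration, not something confirmed in this thread:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: advertise the driver on a routable address
// instead of whatever "localhost" resolves to on the driver machine.
// The IP below is an assumption.
val conf = new SparkConf()
  .setMaster("spark://172.16.48.41:7077")
  .setAppName("ApproxStrMatch")
  .set("spark.driver.host", "172.16.48.41")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")
val sc = new SparkContext(conf)

Exporting SPARK_LOCAL_IP to a routable address on the machine that runs spark-shell should have a similar effect, though that is an assumption.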
Master node shell o/p:
14/07/09 10:00:18 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/6 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM
14/07/09 10:00:18 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/6 is now RUNNING
14/07/09 10:00:21 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/6 is now FAILED (Command exited with code 1)
14/07/09 10:00:21 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/6 removed: Command exited with code 1
14/07/09 10:00:21 INFO AppClient$ClientActor: Executor added: app-20140709095956-0000/7 on worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (pzxnvm2022.x.y.name.org:53029) with 2 cores
14/07/09 10:00:21 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/7 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM
14/07/09 10:00:21 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/7 is now RUNNING
14/07/09 10:00:24 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/7 is now FAILED (Command exited with code 1)
14/07/09 10:00:24 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/7 removed: Command exited with code 1
14/07/09 10:00:24 INFO AppClient$ClientActor: Executor added: app-20140709095956-0000/8 on worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (pzxnvm2022.x.y.name.org:53029) with 2 cores
14/07/09 10:00:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/8 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM
14/07/09 10:00:24 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/8 is now RUNNING
14/07/09 10:00:28 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/8 is now FAILED (Command exited with code 1)
14/07/09 10:00:28 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/8 removed: Command exited with code 1
14/07/09 10:00:28 INFO AppClient$ClientActor: Executor added: app-20140709095956-0000/9 on worker-20140709095636-pzxnvm2022.x.y.name.org-53029 (pzxnvm2022.x.y.name.org:53029) with 2 cores
14/07/09 10:00:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709095956-0000/9 on hostPort pzxnvm2022.x.y.name.org:53029 with 2 cores, 512.0 MB RAM
14/07/09 10:00:28 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/9 is now RUNNING
14/07/09 10:00:31 INFO AppClient$ClientActor: Executor updated: app-20140709095956-0000/9 is now FAILED (Command exited with code 1)
14/07/09 10:00:31 INFO SparkDeploySchedulerBackend: Executor app-20140709095956-0000/9 removed: Command exited with code 1
14/07/09 10:00:31 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/07/09 10:00:31 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED

Worker node error log:
14/07/09 10:00:14 INFO Worker: Executor app-20140709095956-0000/4 finished with state FAILED message Command exited with code 1 exitStatus 1
14/07/09 10:00:14 INFO Worker: Asked to launch executor app-20140709095956-0000/5 for ApproxStrMatch
14/07/09 10:00:14 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "5" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"
14/07/09 10:00:18 INFO Worker: Executor app-20140709095956-0000/5 finished with state FAILED message Command exited with code 1 exitStatus 1
14/07/09 10:00:18 INFO Worker: Asked to launch executor app-20140709095956-0000/6 for ApproxStrMatch
14/07/09 10:00:18 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "6" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"
14/07/09 10:00:21 INFO Worker: Executor app-20140709095956-0000/6 finished with state FAILED message Command exited with code 1 exitStatus 1
14/07/09 10:00:21 INFO Worker: Asked to launch executor app-20140709095956-0000/7 for ApproxStrMatch
14/07/09 10:00:21 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "7" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"
14/07/09 10:00:24 INFO Worker: Executor app-20140709095956-0000/7 finished with state FAILED message Command exited with code 1 exitStatus 1
14/07/09 10:00:24 INFO Worker: Asked to launch executor app-20140709095956-0000/8 for ApproxStrMatch
14/07/09 10:00:24 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "8" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"
14/07/09 10:00:28 INFO Worker: Executor app-20140709095956-0000/8 finished with state FAILED message Command exited with code 1 exitStatus 1
14/07/09 10:00:28 INFO Worker: Asked to launch executor app-20140709095956-0000/9 for ApproxStrMatch
Here is what I see on the other node:
Spark Executor Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64/jre/bin/java" "-cp" "::/apps/software/spark-1.0.0-bin-hadoop1/conf:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/apps/hadoop/hadoop-conf" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler" "0" "pzxnvm2022.dcld.pldc.kp.org" "2" "akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker" "app-20140709095956-0000"
========================================
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/07/09 09:59:58 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/09 09:59:58 INFO SecurityManager: Changing view acls to: userid
14/07/09 09:59:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userid)
14/07/09 09:59:59 INFO Slf4jLogger: Slf4jLogger started
14/07/09 09:59:59 INFO Remoting: Starting remoting
14/07/09 10:00:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733]
14/07/09 10:00:00 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733]
14/07/09 10:00:00 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@localhost:38145/user/CoarseGrainedScheduler
14/07/09 10:00:00 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@pzxnvm2022.x.y.name.org:53029/user/Worker
14/07/09 10:00:00 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@pzxnvm2022.x.y.name.org:48733] -> [akka.tcp://spark@localhost:38145] disassociated! Shutting down.

Date: Tue, 8 Jul 2014 15:32:09 -0700
Subject: Re: Spark: All masters are unresponsive!
From: andrew@databricks.com
To: user@spark.apache.org

It seems that your driver (which I'm assuming you launched on the master node) can now connect to the Master, but your executors cannot. Did you make sure that all nodes have the same conf/spark-defaults.conf, conf/spark-env.sh, and conf/slaves? It would be good if you can post the stderr of the executor logs here. They are located on the worker node under $SPARK_HOME/work.

(As of Spark-1.0, we recommend that you use the spark-submit arguments, i.e.
bin/spark-submit --master spark://pzxnvm2018.x.y.name.org:7077 --executor-memory 4g --executor-cores 3 --class <your main class> <your application jar> <application arguments ...>)



Re: Spark: All masters are unresponsive!

Posted by Andrew Or <an...@databricks.com>.
It seems that your driver (which I'm assuming you launched on the master
node) can now connect to the Master, but your executors cannot. Did you
make sure that all nodes have the same conf/spark-defaults.conf,
conf/spark-env.sh, and conf/slaves? It would be good if you can post the
stderr of the executor logs here. They are located on the worker node under
$SPARK_HOME/work.

(As of Spark-1.0, we recommend that you use the spark-submit arguments, i.e.

bin/spark-submit --master spark://pzxnvm2018.x.y.name.org:7077
--executor-memory 4g --executor-cores 3 --class <your main class> <your
application jar> <application arguments ...>)
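For reference, a sketch of what the application-side configuration might look like if the master URL is left to spark-submit rather than hard-coded with setMaster, so the --master passed on the command line can take effect; the app name and Kryo settings are simply the ones mentioned elsewhere in the thread, and whether this matches the actual ApproxStrMatch code is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: no setMaster() here, so whatever --master is passed to
// spark-submit or spark-shell decides where the application runs.
val conf = new SparkConf()
  .setAppName("ApproxStrMatch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")
val sc = new SparkContext(conf)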


2014-07-08 10:12 GMT-07:00 Sameer Tilak <ss...@live.com>:

> Hi Akhil et al.,
> I made the following changes:
>
> In spark-env.sh I added the following three entries (standalone mode)
>
> export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
> export SPARK_WORKER_MEMORY=4G
> export SPARK_WORKER_CORES=3
>
> I then use start-master and start-slaves commands to start the services.
> Another sthing that I have noticed is that the number of cores that I
> specified is npot  used: 2022 shows up with only 1 core and 2023 and 2024
> show up with 4 cores.
>
> In the Web UI:
> URL: spark://pzxnvm2018.x.y.name.org:7077
>
> I run the spark shell command from pzxnvm2018.
>
> /etc/hosts on my master node has following entry:
> master-ip  pzxnvm2018.x.y.name.org pzxnvm2018
>
> /etc/hosts on my a worker node has following entry:
> worker-ip        pzxnvm2023.x.y.name.org pzxnvm2023
>
>
> However, on my master node log file I still see this:
>
> ERROR EndpointWriter: AssociationError [akka.tcp://
> sparkMaster@pzxnvm2018.x.y.name.org:7077] -> [akka.tcp://spark@localhost:43569]:
> Error [Association failed with [akka.tcp://spark@localhost:43569]]
>
> My spark-shell has the following o/p
>
>
> scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to
> Spark cluster with app ID app-20140708100139-0000
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/0 on
> worker-20140708095558-pzxnvm2024.x.y.name.orgg-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/0 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/1 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/1 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/2 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/2 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/0 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/1 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/2 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/0 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/0 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/3 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/3 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/3 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/1 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/1 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/4 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/4 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/4 is now RUNNING
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/2 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/2 removed: Command exited with code 1
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/5 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/5 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/5 is now RUNNING
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/3 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/3 removed: Command exited with code 1
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/6 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/6 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/6 is now RUNNING
> 14/07/08 10:01:45 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/4 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:45 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/4 removed: Command exited with code 1
> 14/07/08 10:01:45 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/7 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
>
>
> ------------------------------
> Date: Tue, 8 Jul 2014 12:29:21 +0530
> Subject: Re: Spark: All masters are unresponsive!
> From: akhil@sigmoidanalytics.com
> To: user@spark.apache.org
>
>
> Are you sure this is your master URL spark://pzxnvm2018:7077 ?
>
> You can look it up in the WebUI (mostly http://pzxnvm2018:8080) top left
> corner. Also make sure you are able to telnet pzxnvm2018 7077 from the
> machines where you are running the spark shell.
>
> Thanks
> Best Regards
>
>
> On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak <ss...@live.com> wrote:
>
> Hi All,
>
> I am having a few issues with stability and scheduling. When I use spark
> shell to submit my application. I get the following error message and spark
> shell crashes. I have a small 4-node cluster for PoC. I tried both manual
> and scripts-based cluster set up. I tried using FQDN as well for specifying
> the master node, but no luck.
>
> 14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 1 (MappedRDD[6] at map at JaccardScore.scala:83)
> 14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO Executor: Running task ID 1
> 14/07/07 23:44:35 INFO Executor: Running task ID 2
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
> 14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from
> cluster scheduler: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
> at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
>  at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>  at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
> at
> org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
>  at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
> at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>  at
> org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
> at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>  at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> 14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
>  at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
>  at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>  at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
> at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>  at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>  at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>  at java.lang.Thread.run(Thread.java:722)
>
>
>

RE: Spark: All masters are unresponsive!

Posted by Sameer Tilak <ss...@live.com>.
Hi Akhil et al.,
I made the following changes:
In spark-env.sh I added the following three entries (standalone mode):
export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
export SPARK_WORKER_MEMORY=4G
export SPARK_WORKER_CORES=3
I then use the start-master and start-slaves commands to start the services. Another thing that I have noticed is that the number of cores I specified is not used: 2022 shows up with only 1 core, while 2023 and 2024 show up with 4 cores.
In the Web UI:
URL: spark://pzxnvm2018.x.y.name.org:7077
I run the spark shell command from pzxnvm2018. 
/etc/hosts on my master node has the following entry:
master-ip	pzxnvm2018.x.y.name.org pzxnvm2018
/etc/hosts on a worker node has the following entry:
worker-ip	pzxnvm2023.x.y.name.org pzxnvm2023

However, on my master node log file I still see this:
ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@pzxnvm2018.x.y.name.org:7077] -> [akka.tcp://spark@localhost:43569]: Error [Association failed with [akka.tcp://spark@localhost:43569]]
My spark-shell has the following o/p

scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140708100139-0000
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/0 on worker-20140708095558-pzxnvm2024.x.y.name.orgg-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/0 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/1 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/1 on hostPort pzxnvm2023.x.y.name.org:38294 with 4 cores, 512.0 MB RAM
14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/2 on worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/2 on hostPort pzxnvm2022.x.y.name.org:41826 with 1 cores, 512.0 MB RAM
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/0 is now RUNNING
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/1 is now RUNNING
14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/2 is now RUNNING
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/0 is now FAILED (Command exited with code 1)
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/0 removed: Command exited with code 1
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/3 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/3 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/3 is now RUNNING
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/1 is now FAILED (Command exited with code 1)
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/1 removed: Command exited with code 1
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/4 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/4 on hostPort pzxnvm2023.x.y.name.org:38294 with 4 cores, 512.0 MB RAM
14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/4 is now RUNNING
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/2 is now FAILED (Command exited with code 1)
14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/2 removed: Command exited with code 1
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/5 on worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/5 on hostPort pzxnvm2022.x.y.name.org:41826 with 1 cores, 512.0 MB RAM
14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/5 is now RUNNING
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/3 is now FAILED (Command exited with code 1)
14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/3 removed: Command exited with code 1
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/6 on worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708100139-0000/6 on hostPort pzxnvm2024.x.y.name.org:50218 with 4 cores, 512.0 MB RAM
14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/6 is now RUNNING
14/07/08 10:01:45 INFO AppClient$ClientActor: Executor updated: app-20140708100139-0000/4 is now FAILED (Command exited with code 1)
14/07/08 10:01:45 INFO SparkDeploySchedulerBackend: Executor app-20140708100139-0000/4 removed: Command exited with code 1
14/07/08 10:01:45 INFO AppClient$ClientActor: Executor added: app-20140708100139-0000/7 on worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores

Date: Tue, 8 Jul 2014 12:29:21 +0530
Subject: Re: Spark: All masters are unresponsive!
From: akhil@sigmoidanalytics.com
To: user@spark.apache.org

Are you sure this is your master URL spark://pzxnvm2018:7077 ?

You can look it up in the WebUI (mostly http://pzxnvm2018:8080) top left corner. Also make sure you are able to telnet pzxnvm2018 7077 from the machines where you are running the spark shell. 
Thanks
Best Regards



Re: Spark: All masters are unresponsive!

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Are you sure this is your master URL spark://pzxnvm2018:7077 ?

You can look it up in the WebUI (mostly http://pzxnvm2018:8080) top left
corner. Also make sure you are able to telnet pzxnvm2018 7077 from the
machines where you are running the spark shell.
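A rough Scala equivalent of that telnet check, runnable from a plain Scala or spark-shell session on the machines in question (host name and port taken from the thread), might be:

import java.net.{InetSocketAddress, Socket}

// Try to open a TCP connection to the master's service port with a timeout
// and report whether it is reachable from this machine.
val addr = new InetSocketAddress("pzxnvm2018", 7077)
val sock = new Socket()
try {
  sock.connect(addr, 5000) // 5 second timeout
  println("Reached " + addr)
} catch {
  case e: Exception => println("Cannot reach " + addr + ": " + e.getMessage)
} finally {
  sock.close()
}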

Thanks
Best Regards


On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak <ss...@live.com> wrote:

> Hi All,
>
> I am having a few issues with stability and scheduling. When I use spark
> shell to submit my application. I get the following error message and spark
> shell crashes. I have a small 4-node cluster for PoC. I tried both manual
> and scripts-based cluster set up. I tried using FQDN as well for specifying
> the master node, but no luck.
>
> 14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 1 (MappedRDD[6] at map at JaccardScore.scala:83)
> 14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO Executor: Running task ID 1
> 14/07/07 23:44:35 INFO Executor: Running task ID 2
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
> 14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from
> cluster scheduler: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
> at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
> at
> org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
> at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
> at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> 14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
> at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
> at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
>