Posted to user@spark.apache.org by ranjanp <pi...@hotmail.com> on 2014/07/18 00:57:26 UTC

Error with spark-submit (formatting corrected)

Hi, 
I am new to Spark and am trying it out on a stand-alone, 3-node (1 master, 2
workers) cluster.

From the Web UI at the master, I see that the workers are registered. But
when I try running the SparkPi example from the master node, I get the
following message and then an exception. 

14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master
spark://10.1.3.7:7077... 
14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory 

I searched a bit for the above warning and found that others have
encountered this problem before, but did not see a clear resolution except
for this link:
http://apache-spark-user-list.1001560.n3.nabble.com/TaskSchedulerImpl-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-that-woy-tt8247.html#a8444

Based on the suggestion there, I tried supplying the --executor-memory option
to spark-submit, but that did not help.

Any suggestions? Here are the details of my setup:
- 3 nodes (each with 4 CPU cores and 7 GB memory) 
- 1 node configured as Master, and the other two configured as workers 
- Firewall is disabled on all nodes, and network communication between the
nodes is not a problem 
- Edited the conf/spark-env.sh on all nodes to set the following: 
  SPARK_WORKER_CORES=3 
  SPARK_WORKER_MEMORY=5G 
- The Web UI as well as the logs on the master show that the workers were able
to register correctly. The Web UI also correctly shows the aggregate available
memory and CPU cores on the workers:

URL: spark://vmsparkwin1:7077
Workers: 2
Cores: 6 Total, 0 Used
Memory: 10.0 GB Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

I tried running the SparkPi example first using run-example (which failed)
and later directly using spark-submit as shown below:

$ export MASTER=spark://vmsparkwin1:7077

$ echo $MASTER
spark://vmsparkwin1:7077

azureuser@vmsparkwin1 /cygdrive/c/opt/spark-1.0.0
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
spark://10.1.3.7:7077 --executor-memory 1G --total-executor-cores 2
./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10


The following is the full screen output:

14/07/17 01:20:13 INFO SecurityManager: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/07/17 01:20:13 INFO SecurityManager: Changing view acls to: azureuser
14/07/17 01:20:13 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(azureuser)
14/07/17 01:20:14 INFO Slf4jLogger: Slf4jLogger started
14/07/17 01:20:14 INFO Remoting: Starting remoting
14/07/17 01:20:14 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
14/07/17 01:20:14 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
14/07/17 01:20:14 INFO SparkEnv: Registering MapOutputTracker
14/07/17 01:20:14 INFO SparkEnv: Registering BlockManagerMaster
14/07/17 01:20:14 INFO DiskBlockManager: Created local directory at
C:\cygwin\tmp\spark-local-20140717012014-b606
14/07/17 01:20:14 INFO MemoryStore: MemoryStore started with capacity 294.9
MB.
14/07/17 01:20:14 INFO ConnectionManager: Bound socket to port 49842 with id
= ConnectionManagerId(vmsparkwin1.cssparkwin.b1.internal.cloudapp.net,49842)
14/07/17 01:20:14 INFO BlockManagerMaster: Trying to register BlockManager
14/07/17 01:20:14 INFO BlockManagerInfo: Registering block manager
vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49842 with 294.9 MB RAM
14/07/17 01:20:14 INFO BlockManagerMaster: Registered BlockManager
14/07/17 01:20:14 INFO HttpServer: Starting HTTP Server
14/07/17 01:20:14 INFO HttpBroadcast: Broadcast server started at
http://10.1.3.7:49843
14/07/17 01:20:14 INFO HttpFileServer: HTTP File server directory is
C:\cygwin\tmp\spark-6a076e92-53bb-4c7a-9e27-ce53a818146d
14/07/17 01:20:14 INFO HttpServer: Starting HTTP Server
14/07/17 01:20:15 INFO SparkUI: Started SparkUI at
http://vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:4040
14/07/17 01:20:15 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/17 01:20:16 INFO SparkContext: Added JAR
file:/C:/opt/spark-1.0.0/./lib/spark-examples-1.0.0-hadoop2.2.0.jar at
http://10.1.3.7:49844/jars/spark-examples-1.0.0-hadoop2.2.0.jar with
timestamp 1405560016316
14/07/17 01:20:16 INFO AppClient$ClientActor: Connecting to master
spark://10.1.3.7:7077...
14/07/17 01:20:16 INFO SparkContext: Starting job: reduce at
SparkPi.scala:35
14/07/17 01:20:16 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35)
with 10 output partitions (allowLocal=false)
14/07/17 01:20:16 INFO DAGScheduler: Final stage: Stage 0(reduce at
SparkPi.scala:35)
14/07/17 01:20:16 INFO DAGScheduler: Parents of final stage: List()
14/07/17 01:20:16 INFO DAGScheduler: Missing parents: List()
14/07/17 01:20:16 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map
at SparkPi.scala:31), which has no missing parents
14/07/17 01:20:16 INFO DAGScheduler: Submitting 10 missing tasks from Stage
0 (MappedRDD[1] at map at SparkPi.scala:31)
14/07/17 01:20:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
14/07/17 01:20:31 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master
spark://10.1.3.7:7077...
14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/07/17 01:20:56 INFO AppClient$ClientActor: Connecting to master
spark://10.1.3.7:7077...
14/07/17 01:21:01 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/07/17 01:21:16 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: All masters are unresponsive! Giving up.
14/07/17 01:21:16 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
14/07/17 01:21:16 INFO TaskSchedulerImpl: Cancelling stage 0
14/07/17 01:21:16 INFO DAGScheduler: Failed to run reduce at
SparkPi.scala:35
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: All masters are unresponsive! Giving up.
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
        at scala.Option.foreach(Option.scala:236)
        at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
        at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





Re: Error with spark-submit (formatting corrected)

Posted by ranjanp <pi...@hotmail.com>.
Thanks for your help; problem resolved.

As pointed out by Andrew and Meethu, I needed to use
spark://vmsparkwin1:7077 rather than the equivalent spark://10.1.3.7:7077 in
the spark-submit command.
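
For reference, the working invocation is just my earlier command with the
--master value changed (sketched below; all other options are unchanged):

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://vmsparkwin1:7077 --executor-memory 1G \
    --total-executor-cores 2 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10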

It appears that the argument to the --master option of spark-submit must
exactly match (not just be equivalent to) what is displayed on the master web
UI. Perhaps this should be called out in the docs.

Thanks, again.

-pr




Re: Error with spark-submit (formatting corrected)

Posted by MEETHU MATHEW <me...@yahoo.co.in>.
Hi,
Instead of spark://10.1.3.7:7077, use spark://vmsparkwin1:7077. Try this:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
  spark://vmsparkwin1:7077 --executor-memory 1G --total-executor-cores 2
  ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10

 
Thanks & Regards, 
Meethu M



Re: Error with spark-submit (formatting corrected)

Posted by Jay Vyas <ja...@gmail.com>.
I think I know what is happening to you. I've looked into this a bit just this week, so it's fresh in my brain :) Hope this helps.


When no workers are known to the master, IIRC, you get this message.

I think this is how it works:

1) You start your master.
2) You start a slave and give it the master URL as an argument (see the example below).
3) The slave then binds to a random port.
4) The slave then does a handshake with the master, which you can see in the slave logs (it says something like "successfully connected to master at ...").
   Actually, I think the master also logs that it is now aware of a slave running on ip:port.

So in your case, I suspect, none of the slaves have connected to the master, so the job sits idle.

This is similar to the YARN scenario of submitting a job to a resource manager with no NodeManagers running.
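
As a rough illustration (assuming the stock standalone scripts shipped with
Spark 1.0, nothing specific to your cluster), step 2 above looks roughly like
this on each slave node, and the master URL you pass there is the one the
driver later has to match:

  # on each worker/slave node; use the master URL exactly as shown on the master web UI
  $ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://vmsparkwin1:7077

After that, both the slave and master logs should show the registration
handshake described above.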





Re: Error with spark-submit (formatting corrected)

Posted by Andrew Or <an...@databricks.com>.
Hi ranjanp,

If you go to the master UI (masterIP:8080), what does the first line say?
Verify that it is the same as what you expect. Another thing is that
--master in spark-submit overrides whatever you set MASTER to, so the
environment variable won't actually take effect. Another obvious thing to
check is whether the node from which you launch spark-submit can access the
internal address of the master (and port 7077). One quick way to verify
that is to attempt a telnet into it.
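
For example, from the node where you run spark-submit (assuming telnet is
available there; host and port are taken from your master URL):

$ telnet vmsparkwin1 7077

If that connects, the master is reachable from that node; if the connection is
refused or just hangs, you have a connectivity or addressing problem to fix first.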

Let me know if you find anything.
Andrew

