Posted to user@spark.apache.org by vuakko <ni...@gmail.com> on 2014/01/14 15:09:41 UTC

Akka error kills workers in standalone mode

Spark fails to run practically any job submitted to it in standalone mode. Local
mode works, and spark-shell works even in standalone mode, but submitting any
other job manually fails, with the worker logging the following error:

2014-01-14 15:47:05,073 [sparkWorker-akka.actor.default-dispatcher-5] INFO 
org.apache.spark.deploy.worker.Worker - Connecting to master
spark://niko-VirtualBox:7077...
2014-01-14 15:47:05,715 [sparkWorker-akka.actor.default-dispatcher-2] INFO 
org.apache.spark.deploy.worker.Worker - Successfully registered with master
spark://niko-VirtualBox:7077
2014-01-14 15:47:23,408 [sparkWorker-akka.actor.default-dispatcher-14] INFO 
org.apache.spark.deploy.worker.Worker - Asked to launch executor
app-20140114154723-0000/0 for Spark test
2014-01-14 15:47:23,431 [sparkWorker-akka.actor.default-dispatcher-14] ERROR
akka.actor.OneForOneStrategy - 
java.lang.NullPointerException
        at java.io.File.<init>(File.java:251)
        at
org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:213)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2014-01-14 15:47:23,514 [sparkWorker-akka.actor.default-dispatcher-14] INFO 
org.apache.spark.deploy.worker.Worker - Starting Spark worker
niko-VirtualBox.local:33576 with 1 cores, 6.8 GB RAM
2014-01-14 15:47:23,514 [sparkWorker-akka.actor.default-dispatcher-14] INFO 
org.apache.spark.deploy.worker.Worker - Spark home:
/home/niko/local/incubator-spark
2014-01-14 15:47:23,517 [sparkWorker-akka.actor.default-dispatcher-14] INFO 
org.apache.spark.deploy.worker.ui.WorkerWebUI - Started Worker web UI at
http://niko-VirtualBox.local:8081
2014-01-14 15:47:23,517 [sparkWorker-akka.actor.default-dispatcher-14] INFO 
org.apache.spark.deploy.worker.Worker - Connecting to master
spark://niko-VirtualBox:7077...
2014-01-14 15:47:23,528 [sparkWorker-akka.actor.default-dispatcher-3] INFO 
org.apache.spark.deploy.worker.Worker - Successfully registered with master
spark://niko-VirtualBox:7077


The master spits out the following logs at the same time:


2014-01-14 15:47:05,683 [sparkMaster-akka.actor.default-dispatcher-4] INFO 
org.apache.spark.deploy.master.Master - Registering worker
niko-VirtualBox:33576 with 1 cores, 6.8 GB RAM
2014-01-14 15:47:23,090 [sparkMaster-akka.actor.default-dispatcher-15] INFO 
org.apache.spark.deploy.master.Master - Registering app Spark test
2014-01-14 15:47:23,102 [sparkMaster-akka.actor.default-dispatcher-15] INFO 
org.apache.spark.deploy.master.Master - Registered app Spark test with ID
app-20140114154723-0000
2014-01-14 15:47:23,216 [sparkMaster-akka.actor.default-dispatcher-15] INFO 
org.apache.spark.deploy.master.Master - Launching executor
app-20140114154723-0000/0 on worker
worker-20140114154704-niko-VirtualBox.local-33576
2014-01-14 15:47:23,523 [sparkMaster-akka.actor.default-dispatcher-15] INFO 
org.apache.spark.deploy.master.Master - Registering worker
niko-VirtualBox:33576 with 1 cores, 6.8 GB RAM
2014-01-14 15:47:23,525 [sparkMaster-akka.actor.default-dispatcher-15] INFO 
org.apache.spark.deploy.master.Master - Attempted to re-register worker at
same address: akka.tcp://sparkWorker@niko-VirtualBox.local:33576
2014-01-14 15:47:23,535 [sparkMaster-akka.actor.default-dispatcher-14] WARN 
org.apache.spark.deploy.master.Master - Got heartbeat from unregistered
worker worker-20140114154723-niko-VirtualBox.local-33576
...

Soon after this the master decides that the worker is dead, disassociates it
and marks it DEAD in the web UI. The worker process however is still alive
and still thinks that it's connected to master (as shown by the log).

I'm launching the job with the following command (the last argument is the
master URL; replacing it with local makes things run OK):
java -cp
./target/classes:/etc/hadoop/conf:$SPARK_HOME/conf:$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar
SparkTest spark://niko-VirtualBox:7077

Relevant versions are:
Spark: current git HEAD fa75e5e1c50da7d1e6c6f41c2d6d591c1e8a025f
Hadoop: 2.0.0-mr1-cdh4.5.0
Scala: 2.10.3
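A minimal Java sketch (illustrative only, not Spark's actual code) of the failure mode in the worker stack trace above: java.io.File's constructor throws a NullPointerException when handed a null pathname, which is what happens if an unset environment variable such as SPARK_HOME is passed straight into new File(...). The class and variable names here are made up for the demo.

```java
public class NullSparkHome {
    // Returns true if constructing a File from the given path throws NPE,
    // mirroring the java.io.File.<init> frame in the stack trace.
    static boolean throwsNpeOnNullPath(String path) {
        try {
            new java.io.File(path);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // System.getenv returns null when the variable is unset,
        // so an unset SPARK_HOME yields a null path.
        String sparkHome = System.getenv("SPARK_HOME_UNSET_FOR_DEMO");
        System.out.println("NPE on null path: " + throwsNpeOnNullPath(sparkHome));
    }
}
```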





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Akka-error-kills-workers-in-standalone-mode-tp537.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Akka error kills workers in standalone mode

Posted by vuakko <ni...@gmail.com>.
Good to know, great stuff you're putting together!
This is getting a bit off-topic, but it's not really worth another thread:
The best improvement by far, in my view, would be to split the Spark
Configuration wiki page into parts explaining which environment
variables/Java properties are relevant to and used by
a) the Master, b) the Worker, c) the Executor, d) the Driver.

Right now some of them take a bit of guesswork.





Re: Akka error kills workers in standalone mode

Posted by Patrick Wendell <pw...@gmail.com>.
We included a patch in 0.9.0-rc1 (currently being voted on) that uses a
default SPARK_HOME if the user doesn't specify one. Hitting an NPE here is
indeed bad behavior. Thanks for reporting this.

- Patrick
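A sketch of the defensive pattern Patrick describes (not Spark's actual patch; the names and default path are assumptions for illustration): fall back to a default when the environment doesn't supply SPARK_HOME, rather than passing null into new File(...).

```java
import java.io.File;

public class SparkHomeFallback {
    // Resolve a home directory, preferring the environment-supplied value
    // and falling back to a default when it is null or empty.
    static File resolveSparkHome(String fromEnv, String defaultHome) {
        String home = (fromEnv != null && !fromEnv.isEmpty()) ? fromEnv : defaultHome;
        return new File(home);
    }

    public static void main(String[] args) {
        // Unset env var: the fallback is used instead of triggering an NPE.
        System.out.println(resolveSparkHome(null, "/opt/spark").getPath());
        // Env var present: it wins over the default.
        System.out.println(
            resolveSparkHome("/home/niko/local/incubator-spark", "/opt/spark").getPath());
    }
}
```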

On Thu, Jan 16, 2014 at 4:09 AM, vuakko <ni...@gmail.com> wrote:

Re: Akka error kills workers in standalone mode

Posted by vuakko <ni...@gmail.com>.
Yes, thanks for the help. The environment variable SPARK_HOME was messed up;
that was the cause.
Incidentally, another thread

http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-td553.html

just had the same problem and I concur with the writer there: having a bad
application shouldn't crash the workers.





Re: Akka error kills workers in standalone mode

Posted by Archit Thakur <ar...@gmail.com>.
You are getting a NullPointerException, which is why it fails. The fact that
it runs in local mode may mean you are overlooking that many of the classes
won't be initialized on the worker/executor node, even though they are
initialized in your driver JVM.
To check: does your code work when you set the master to local[n] instead of
local?





On Tue, Jan 14, 2014 at 7:39 PM, vuakko <ni...@gmail.com> wrote: