Posted to user@spark.apache.org by Rob Povey <ro...@maana.io> on 2014/03/05 19:27:58 UTC

Spark Worker crashing and Master not seeing recovered worker

I installed Spark 0.9.0 from the CDH parcel yesterday in standalone mode on
top of a 6-node cluster running CDH 4.6 on CentOS.

What I'm seeing is that when jobs fail, the worker process will often crash.
The worker seems to restart on the node, but the Master never uses the
restarted worker afterwards, and it doesn't show up in the web interface.

Has anyone seen anything like this? Is there an obvious workaround/fix other
than manually restarting the workers?

In the Master log I see the following repeated many times ("filer" being the
"lost" node). What it looks like to me is that when the worker actor is
restarted by Akka, it gets a new ID and for whatever reason never
re-registers with the master.

Any ideas? 

14/03/04 20:04:44 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
14/03/04 20:04:54 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
14/03/04 20:04:59 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
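
Those warnings suggest the master's heartbeat path only refreshes state for
worker IDs it already has in its registration map, and anything else just
gets logged, so a restarted worker whose ID the master no longer knows stays
invisible. A rough, hypothetical sketch of that pattern (simplified; not the
actual Master source, and names like idToWorker and WorkerInfo are just
illustrative):

import scala.collection.mutable

// Hypothetical, simplified model of the heartbeat path: only worker IDs
// present in the registration map get their timestamp refreshed; unknown
// IDs just produce the warning seen in the Master log above.
object MasterHeartbeatSketch {
  case class WorkerInfo(id: String, var lastHeartbeat: Long)

  private val idToWorker = mutable.HashMap.empty[String, WorkerInfo]

  def register(id: String): Unit =
    idToWorker(id) = WorkerInfo(id, System.currentTimeMillis())

  def onHeartbeat(id: String): Unit =
    idToWorker.get(id) match {
      case Some(info) => info.lastHeartbeat = System.currentTimeMillis()
      case None       => println(s"WARN Got heartbeat from unregistered worker $id")
    }

  def main(args: Array[String]): Unit = {
    register("worker-20140304183709-filer.maana.io-7078")
    idToWorker.clear()   // simulate the master dropping the worker after the crash
    onHeartbeat("worker-20140304183709-filer.maana.io-7078")  // prints the warning
  }
}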


On filer itself I can see the worker shut down with the following exception,
and I can see that it has since been restarted and is running.

14/03/04 18:37:09 INFO worker.Worker: Executor app-20140304183705-0036/0 finished with state KILLED
14/03/04 18:37:09 INFO worker.CommandUtils: Redirection to /var/run/spark/work/app-20140304183705-0036/0/stderr closed: Bad file descriptor
14/03/04 18:37:09 ERROR actor.OneForOneStrategy: key not found: app-20140304183705-0036/0
java.util.NoSuchElementException: key not found: app-20140304183705-0036/0
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/03/04 18:37:09 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@filer.maana.io:7078] -> [akka.tcp://sparkExecutor@filer.maana.io:58331]: Error [Association failed with [akka.tcp://sparkExecutor@filer.maana.io:58331]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@filer.maana.io:58331]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: filer.maana.io/192.168.1.33:58331
]
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{*,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/logPage,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/04 18:37:09 INFO worker.Worker: Starting Spark worker filer.maana.io:7078 with 4 cores, 30.3 GB RAM
14/03/04 18:37:09 INFO worker.Worker: Spark home: /opt/cloudera/parcels/SPARK/lib/spark
14/03/04 18:37:09 INFO server.Server: jetty-7.6.8.v20121106
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/logPage,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{*,null}
14/03/04 18:37:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18081
14/03/04 18:37:09 INFO ui.WorkerWebUI: Started Worker web UI at http://filer.maana.io:18081
14/03/04 18:37:09 INFO worker.Worker: Connecting to master spark://Master.maana.io:7077...
14/03/04 18:37:09 INFO worker.Worker: Successfully registered with master spark://Master.maana.io:7077
14/03/05 08:53:35 INFO actor.LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.1.33%3A37859-60#-1831633323] was not delivered. [24] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child process.
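
The stack trace above looks like an unguarded mutable.HashMap lookup in the
worker's message handler: HashMap.apply throws NoSuchElementException when
the key (here the already-finished executor app-20140304183705-0036/0) is
missing, and the supervisor's OneForOneStrategy then restarts the actor,
which matches the "Starting Spark worker" lines that follow. A small
self-contained sketch of the failing versus a defensive lookup (hypothetical
names, not the actual Worker.scala code):

import scala.collection.mutable

// Contrast between a lookup that throws (and can take down an actor via its
// supervisor strategy) and a guarded lookup that just logs and carries on.
object ExecutorLookupSketch {
  final case class ExecutorRunner(fullId: String)

  val executors = mutable.HashMap.empty[String, ExecutorRunner]

  // Mirrors the failing pattern: apply() on a missing key throws
  // java.util.NoSuchElementException("key not found: ...").
  def handleStateChangedUnsafe(fullId: String): Unit = {
    val runner = executors(fullId)
    executors -= fullId
    println(s"cleaned up ${runner.fullId}")
  }

  // Defensive variant: a missing key is only worth a warning.
  def handleStateChangedSafe(fullId: String): Unit =
    executors.get(fullId) match {
      case Some(runner) =>
        executors -= fullId
        println(s"cleaned up ${runner.fullId}")
      case None =>
        println(s"WARN Unknown executor $fullId, ignoring state change")
    }

  def main(args: Array[String]): Unit = {
    handleStateChangedSafe("app-20140304183705-0036/0")    // warns, keeps running
    handleStateChangedUnsafe("app-20140304183705-0036/0")  // throws
  }
}

Whether the proper fix is to guard that lookup in the worker or to have the
master accept a (re)registration when it sees a heartbeat from an unknown
ID, I can't tell from these logs alone.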

Re: Spark Worker crashing and Master not seeing recovered worker

Posted by Malte <ma...@gmail.com>.
This is still happening to me on Mesos. Any workarounds?






Re: Spark Worker crashing and Master not seeing recovered worker

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Rob,

I have seen this too. I have 16 nodes in my Spark cluster, and for some
reason (after app failures) one of the workers will go offline. I will ssh
to the machine in question and find that the Java process is running, but
for some reason the master does not notice it. I have not had the time to
investigate (my setup is manual, 0.9 in standalone mode).

Ognen


-- 
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski