
Posted to user@storm.apache.org by Josh Walton <jw...@gmail.com> on 2014/05/14 17:25:49 UTC

Recovering From Zookeeper Failure

Recently, we have had a couple of power failures for the servers running
our zookeeper cluster. When zookeeper dies, the nimbus and supervisor
processes eventually die as well. After the zookeeper failure, the only way
I have gotten the supervisor processes to start back up is to delete the
supervisor and worker directories as specified in the storm.yaml file. Is
there a better/cleaner way to restart them?
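One less drastic first step than deleting the directories might be to look for zero-length files in them. This is only a guess at the cause: the EOFException in the trace further down is raised while the serialization stream header is being read, which is what happens when the input is empty or truncated. The sketch below lists empty regular files under a directory. The class name, method name, and the idea of passing the supervisor local directory as an argument are all illustrative assumptions, not part of Storm, and nothing here confirms that removing only those files is enough.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Storm.
class EmptyFileScan {

    // Returns every zero-length regular file under root, sorted by path.
    static List<Path> findEmptyFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(EmptyFileScan::isEmpty)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A file whose size cannot be read is skipped rather than reported as empty.
    private static boolean isEmpty(Path p) {
        try {
            return Files.size(p) == 0;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // args[0]: the directory to inspect, e.g. the local dir set in storm.yaml.
        for (Path p : findEmptyFiles(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

The scan only reports files; it deletes nothing, so it can be run with the supervisor stopped before deciding what to remove.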

I have also noticed that when I start nimbus and the UI process back up,
and navigate to the storm status page, the topologies we had started are
still shown as active (even though they are not).

This is the exception in the supervisor logs when I try to start them up
after the zookeeper failure:

2014-05-14 09:16:03 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
    at backtype.storm.utils.Utils.deserialize(Utils.java:69) ~[storm-core-0.9.0-rc3.jar:na]
    at backtype.storm.utils.LocalState.snapshot(LocalState.java:28) ~[storm-core-0.9.0-rc3.jar:na]
    at backtype.storm.utils.LocalState.get(LocalState.java:39) ~[storm-core-0.9.0-rc3.jar:na]
    at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:187) ~[storm-core-0.9.0-rc3.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
    at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
    at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
    at backtype.storm.event$event_manager$fn__3070.invoke(event.clj:24) ~[storm-core-0.9.0-rc3.jar:na]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Thread.java:722) [na:1.7.0_21]
Caused by: java.io.EOFException: null
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) ~[na:1.7.0_21]
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) ~[na:1.7.0_21]
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) ~[na:1.7.0_21]
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) ~[na:1.7.0_21]
    at backtype.storm.utils.Utils.deserialize(Utils.java:64) ~[storm-core-0.9.0-rc3.jar:na]
    ... 11 common frames omitted
2014-05-14 09:16:03 b.s.util [INFO] Halting process: ("Error when processing an event")
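The innermost frames above show the failure inside the ObjectInputStream constructor, in readStreamHeader, before any object is read. The standalone sketch below reproduces that same exception with plain JDK classes by deserializing a zero-length byte array. It is not Storm's code: the class and method names are made up for illustration, and it only shows that empty input is one way to get this trace, not that it is what happened on this cluster.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Storm.
class EmptyStateDemo {

    // Serializes an object to bytes with standard Java serialization.
    static byte[] serialize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns true only if deserializing the bytes fails with EOFException.
    static boolean failsWithEof(byte[] bytes) {
        // The constructor reads the stream header, so empty input fails here.
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            in.readObject();
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException | ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsWithEof(new byte[0]));        // empty input
        System.out.println(failsWithEof(serialize("state"))); // valid input
    }
}
```

Running main prints true for the empty array and false for the valid serialized value.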

Re: Recovering From Zookeeper Failure

Posted by Ryan Chan <ry...@gmail.com>.

Hi Josh

We have had the same issue for a long time, and the only solution we have
found is to restart the whole Storm cluster.
(Actually, I asked the same question on 12 May but got no response.)

In the meantime, we are evaluating a switch to Apache Spark for streaming;
you might also want to have a look.




