Posted to user@storm.apache.org by "Evered, Randolf" <ra...@lmco.com> on 2014/04/16 00:49:33 UTC

Storm 0.9.0.1 - supervisor failing to start repeatedly (EOFException)

We are using Storm 0.9.0.1 with Netty and Trident topologies on a single machine (nimbus, supervisor, and drpc all run on the same host). The supervisor keeps dying and is restarted after 7-8 seconds by supervisord (the process manager that restarts the Storm and ZooKeeper processes). This is the error we see over and over in supervisor.log:

2014-04-15 21:13:13 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
        at backtype.storm.utils.Utils.deserialize(Utils.java:69) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.utils.LocalState.snapshot(LocalState.java:28) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.utils.LocalState.get(LocalState.java:39) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:187) ~[storm-core-0.9.0.1.jar:na]
        at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
        at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
        at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
        at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
        at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
        at backtype.storm.event$event_manager$fn__3072.invoke(event.clj:24) ~[storm-core-0.9.0.1.jar:na]
        at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Unknown Source) [na:1.7.0_45]
Caused by: java.io.EOFException: null
        at java.io.ObjectInputStream$PeekInputStream.readFully(Unknown Source) ~[na:1.7.0_45]
        at java.io.ObjectInputStream$BlockDataInputStream.readShort(Unknown Source) ~[na:1.7.0_45]
        at java.io.ObjectInputStream.readStreamHeader(Unknown Source) ~[na:1.7.0_45]
        at java.io.ObjectInputStream.<init>(Unknown Source) ~[na:1.7.0_45]
        at backtype.storm.utils.Utils.deserialize(Utils.java:64) ~[storm-core-0.9.0.1.jar:na]
        ... 11 common frames omitted
2014-04-15 21:13:13 b.s.util [INFO] Halting process: ("Error when processing an event")
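Reading the "Caused by" section: the EOFException is thrown inside the ObjectInputStream constructor (readStreamHeader), before any object is read. That suggests the bytes LocalState.snapshot hands to Utils.deserialize are empty or shorter than the 4-byte Java serialization header (2-byte magic plus 2-byte version). The sketch below is a minimal reproduction under that assumption. It does not use Storm; its deserialize method only mirrors the failing call and is not a copy of Storm's Utils.deserialize, and the map key is a hypothetical placeholder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class EmptyStateRepro {

    // Mirrors the failing frame in the trace: the ObjectInputStream constructor
    // reads the stream header before readObject() is ever called.
    public static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    // True if deserializing these bytes fails with EOFException specifically.
    public static boolean failsWithEof(byte[] bytes) {
        try {
            deserialize(bytes);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException | ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> state = new HashMap<>();
        state.put("example-key", 6700); // hypothetical content, not Storm's real state
        byte[] good = serialize(state);
        byte[] truncated = {good[0], good[1]}; // magic number only, no version

        System.out.println("empty     -> EOF: " + failsWithEof(new byte[0])); // true
        System.out.println("truncated -> EOF: " + failsWithEof(truncated));   // true
        System.out.println("complete  -> EOF: " + failsWithEof(good));        // false
    }
}
```

If this reading is right, the exception says nothing about the topology itself; it only says the file the supervisor tried to load had no complete header, which is consistent with clearing the data directories making the problem go away.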

Any ideas why supervisor might be dying?

Following the recommendation in the post "Supervisor throwing error on start up" (https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8), we stopped the Storm processes and cleared the Storm and ZooKeeper data directories; once we had loaded the topologies again, everything worked. However, we would like to know how to prevent this from happening in a production environment.
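One general technique for never leaving a zero-length or half-written state file behind when a process is killed mid-write is to write to a temporary file in the same directory, flush it to disk, and then rename it over the target in one step. The sketch below illustrates that technique in plain Java. writeAtomically is a hypothetical helper, not a Storm API, and nothing here establishes how Storm 0.9.0.1 actually writes its local state. Also note that whether Files.move with ATOMIC_MOVE replaces an existing target is implementation-specific; on POSIX systems it maps to rename(), which does.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicStateWrite {

    // Hypothetical helper (not part of Storm): write-to-temp, then atomic rename.
    public static void writeAtomically(Path target, byte[] data) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Temp file in the same directory so the rename stays on one file system.
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                ch.force(true); // push the new contents to the storage device first
            }
            // Swap the complete new file into place in a single rename, so on
            // POSIX a reader sees either the old file or the new one, never a partial one.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("state-demo");
        Path target = dir.resolve("localstate"); // hypothetical file name
        writeAtomically(target, "version-1".getBytes(StandardCharsets.UTF_8));
        writeAtomically(target, "version-2".getBytes(StandardCharsets.UTF_8));
        // prints version-2 (on POSIX systems)
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

The design choice is that the only operation that touches the real file name is the rename, so a crash before it leaves the previous complete file in place plus a stray .tmp file, rather than an empty file that later fails to deserialize.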

We are also seeing many "Connection refused" errors in the Nimbus and worker logs, which I would expect if the supervisor cannot start.

Thank you,
Randy