Posted to dev@storm.apache.org by "Damien Raude-Morvan (JIRA)" <ji...@apache.org> on 2014/05/05 12:33:14 UTC

[jira] [Created] (STORM-307) After host crash, supervisor is unable to restart itself

Damien Raude-Morvan created STORM-307:
-----------------------------------------

             Summary: After host crash, supervisor is unable to restart itself
                 Key: STORM-307
                 URL: https://issues.apache.org/jira/browse/STORM-307
             Project: Apache Storm (Incubating)
          Issue Type: Bug
    Affects Versions: 0.9.1-incubating
         Environment: Debian Linux Wheezy
Zookeeper 3.3.3
Java 1.7.0_25
            Reporter: Damien Raude-Morvan


Hi,

I've observed [multiple times|#links] that deserialization of the supervisor state can fail after a host crash or reboot. The supervisor is then unable to come up without manual intervention. AFAICT, the serialized supervisor state is invalid and cannot be read at the next start.

Error observed in the supervisor log:
{noformat}
2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=127.0.0.1:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@18d055e0
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to server /127.0.0.1:2181
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to localhost/127.0.0.1:2181, initiating session
2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated timeout = 20000
2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id 71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: java.io.EOFException
        at backtype.storm.utils.Utils.deserialize(Utils.java:86) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.4.0.jar:na]
        at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
        at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
        at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
        at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
        at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) ~[na:na]
        at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Caused by: java.io.EOFException: null
        at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) ~[na:1.7.0_25]
        at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) ~[na:1.7.0_25]
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) ~[na:1.7.0_25]
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) ~[na:1.7.0_25]
        at backtype.storm.utils.Utils.deserialize(Utils.java:81) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        ... 11 common frames omitted
2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing an event")
{noformat}
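
The trace ends in ObjectInputStream.readStreamHeader, which is where Java serialization reads the stream header. The following is a minimal Java sketch, assuming (the log does not confirm this) that the crash left the state file empty or cut short; it shows that opening an ObjectInputStream over empty input fails with the same EOFException. The class and method names are hypothetical and not part of Storm.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical demo class; not part of Storm.
public class EmptyStateRepro {

    // Returns true if opening an ObjectInputStream over the bytes fails with
    // EOFException, i.e. the data ends before the stream header is complete.
    public static boolean failsWithEof(byte[] data) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // same exception type as in the supervisor log
        }
    }

    // Serializes an object with standard Java serialization.
    public static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Empty input stands in for the assumed zero-length state file.
        System.out.println("empty: " + failsWithEof(new byte[0]));            // prints "empty: true"
        System.out.println("valid: " + failsWithEof(serialize("state")));     // prints "valid: false"
    }
}
```

If this assumption holds, the failure happens before any supervisor data is decoded, which would explain why the daemon halts on every start until the file is removed.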

Current workaround: fully stopping the supervisor daemon and deleting Storm's entire data/supervisor directory helped. After a restart, the supervisor is now running smoothly.
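
Before deleting the whole directory, it may help to check which files look damaged. The following Java sketch lists files too short to hold the 4-byte Java serialization stream header, on the assumption that the corrupted state file is such a file, as the EOFException in readStreamHeader suggests. Not every file under the directory is necessarily a serialized object, so any hit is only a candidate. The class name is hypothetical and the directory must be supplied by the caller.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Storm.
public class ShortFileScan {

    // A Java serialization stream starts with a 4-byte header (magic + version).
    static final int STREAM_HEADER_BYTES = 4;

    // Recursively collects regular files under dir that are shorter than the header.
    public static List<File> findTruncated(File dir) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // not a directory, or not readable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                result.addAll(findTruncated(child));
            } else if (child.isFile() && child.length() < STREAM_HEADER_BYTES) {
                result.add(child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: ShortFileScan <directory>");
            return;
        }
        // The directory location depends on the local Storm configuration.
        for (File f : findTruncated(new File(args[0]))) {
            System.out.println(f.getPath() + " (" + f.length() + " bytes)");
        }
    }
}
```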

{anchor:links} Here are some references to very similar issues:
* http://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3C23100d14e7ac4cef947f7236ef8963e1@BY2PR08MB144.namprd08.prod.outlook.com%3E
* https://groups.google.com/forum/#!topic/storm-user/SL9FK9XeoI8
* https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8

Regards,



