Posted to dev@storm.apache.org by "Sean Zhong (JIRA)" <ji...@apache.org> on 2014/11/04 14:09:34 UTC

[jira] [Updated] (STORM-307) After host crash, supervisor is unable to restart itself

     [ https://issues.apache.org/jira/browse/STORM-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Zhong updated STORM-307:
-----------------------------
    Assignee: Jiahong Li

> After host crash, supervisor is unable to restart itself
> --------------------------------------------------------
>
>                 Key: STORM-307
>                 URL: https://issues.apache.org/jira/browse/STORM-307
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.1-incubating
>         Environment: Debian Linux Wheezy
> Zookeeper 3.3.3
> Java 1.7.0_25
>            Reporter: Damien Raude-Morvan
>            Assignee: Jiahong Li
>         Attachments: supeof.tar.bz2
>
>
> Hi,
> I've observed [multiple times|#links] that supervisor state de-serialisation can fail after a host crash or reboot. The supervisor is then unable to come up without manual intervention. AFAICT, the serialized supervisor state is invalid and can't be read at the next start.
> Observed error in the supervisor log:
> {noformat}
> 2014-04-29 19:38:35 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
> 2014-04-29 19:38:35 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=127.0.0.1:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@18d055e0
> 2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Opening socket connection to server /127.0.0.1:2181
> 2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Socket connection established to localhost/127.0.0.1:2181, initiating session
> 2014-04-29 19:38:35 o.a.z.ClientCnxn [INFO] Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x145a7cc1c7e48b1, negotiated timeout = 20000
> 2014-04-29 19:38:35 b.s.d.supervisor [INFO] Starting supervisor with id 71b01216-9d00-4fb6-8538-6673058ab5ef at host storm
> 2014-04-29 19:38:36 b.s.event [ERROR] Error when processing event
> java.lang.RuntimeException: java.io.EOFException
>         at backtype.storm.utils.Utils.deserialize(Utils.java:86) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at clojure.lang.AFn.applyToHelper(AFn.java:161) ~[clojure-1.4.0.jar:na]
>         at clojure.lang.AFn.applyTo(AFn.java:151) ~[clojure-1.4.0.jar:na]
>         at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
>         at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
>         at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
>         at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) ~[na:na]
>         at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
>         at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> Caused by: java.io.EOFException: null
>         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323) ~[na:1.7.0_25]
>         at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792) ~[na:1.7.0_25]
>         at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799) ~[na:1.7.0_25]
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299) ~[na:1.7.0_25]
>         at backtype.storm.utils.Utils.deserialize(Utils.java:81) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         ... 11 common frames omitted
> 2014-04-29 19:38:36 b.s.util [INFO] Halting process: ("Error when processing an event")
> {noformat}
> Current workaround: fully stopping the supervisor daemon and deleting Storm's entire data/supervisor directory helped; after a restart, the supervisor is now running smoothly.
> {anchor:links} Here are some references to very similar issues:
> * http://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3C23100d14e7ac4cef947f7236ef8963e1@BY2PR08MB144.namprd08.prod.outlook.com%3E
> * https://groups.google.com/forum/#!topic/storm-user/SL9FK9XeoI8
> * https://groups.google.com/forum/#!topic/storm-user/2gapTYTRrX8
> Regards,
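
Note on the trace above: the "Caused by" frames show the EOFException being thrown from ObjectInputStream.readStreamHeader, inside the ObjectInputStream constructor. Java raises this when the input ends before the serialization stream header has been read, for example when the input is empty. A plausible reading, not confirmed in this report, is that the crash left a state file empty or truncated. The sketch below reproduces the same exception shape with an empty byte array. EmptyStateDemo and its helper methods are hypothetical names for illustration and are not Storm's actual Utils or LocalState code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Hypothetical demo class; it only mimics the exception wrapping seen in the trace.
class EmptyStateDemo {

    // Serialize an object to bytes with standard Java serialization.
    static byte[] serialize(Object obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(obj);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Deserialize bytes, wrapping checked exceptions in a RuntimeException
    // (assumption: this mirrors the "RuntimeException: java.io.EOFException" in the log).
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    // True if deserializing the bytes fails with an EOFException as the cause.
    static boolean failsWithEof(byte[] bytes) {
        try {
            deserialize(bytes);
            return false;
        } catch (RuntimeException e) {
            return e.getCause() instanceof EOFException;
        }
    }

    public static void main(String[] args) {
        HashMap<String, Integer> state = new HashMap<>();
        state.put("port", 6700); // arbitrary sample value

        byte[] good = serialize(state);
        System.out.println("round trip ok: " + state.equals(deserialize(good)));

        // Zero bytes: the constructor cannot read the stream header.
        System.out.println("empty input fails with EOFException: "
                + failsWithEof(new byte[0]));
    }
}
```

If this reading is right, it would also explain why deleting the data/supervisor directory helps: the unreadable serialized state is removed, so the next start no longer tries to read it.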



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)