Posted to dev@storm.apache.org by "Rick Kellogg (JIRA)" <ji...@apache.org> on 2015/10/09 02:31:27 UTC

[jira] [Updated] (STORM-131) Intermittent Zookeeper errors when shutting down local Topology

     [ https://issues.apache.org/jira/browse/STORM-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-131:
-------------------------------
    Component/s: storm-core

> Intermittent Zookeeper errors when shutting down local Topology
> ---------------------------------------------------------------
>
>                 Key: STORM-131
>                 URL: https://issues.apache.org/jira/browse/STORM-131
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/259
> Our project (Storm version 0.7.3) has a large number of Storm integration tests that use a local topology, with only one topology operational at any moment in time. The tests are organized into groups, and each group works within the boundaries of its own topology. When a group finishes executing it shuts down its local cluster, and the next group then launches its own cluster.
> With remarkable regularity we see failures related to what looks like an incorrect Zookeeper shutdown, which leads to a JVM exit (a disaster, as no test information is recorded at the end). Here is what we see in the main error log (log level: WARN and higher):
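> For context, the lifecycle described above looks roughly like the following in a Java/JUnit harness. This is only a minimal sketch of the pattern, not the actual test code (which is not shown in this report); the class name, topology name, and the buildGroupTopology helper are hypothetical.
> {code}
> import backtype.storm.Config;
> import backtype.storm.LocalCluster;
> import backtype.storm.generated.StormTopology;
> import org.junit.AfterClass;
> import org.junit.BeforeClass;
>
> public class GroupOneIT {
>     private static LocalCluster cluster;
>
>     @BeforeClass
>     public static void launchCluster() {
>         // In-process Nimbus, supervisor and Zookeeper, one cluster per test group
>         cluster = new LocalCluster();
>         StormTopology topology = buildGroupTopology(); // hypothetical helper wiring the group's spouts/bolts
>         cluster.submitTopology("TLTopology", new Config(), topology);
>     }
>
>     @AfterClass
>     public static void tearDownCluster() {
>         cluster.killTopology("TLTopology");
>         cluster.shutdown(); // the intermittent failure described below happens during this shutdown
>     }
>
>     private static StormTopology buildGroupTopology() {
>         // placeholder; the real topologies under test are not part of this report
>         throw new UnsupportedOperationException("group-specific topology goes here");
>     }
> }
> {code}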
> {code}
> 2012-07-07 00:22:58,420 WARN [ConnectionStateManager-0|]@jenkins com.netflix.curator.framework.state.ConnectionStateManager
> => There are no ConnectionStateListeners registered.
> 2012-07-07 00:22:58,534 WARN [Thread-23-EventThread|]@jenkins backtype.storm.cluster
> => Received event :disconnected::none: with disconnected Zookeeper.
> 2012-07-07 00:23:00,013 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:01,527 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:03,510 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:04,687 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:05,961 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:07,588 WARN [Thread-23-SendThread(localhost:2000)|]@jenkins org.apache.zookeeper.ClientCnxn
> => Session 0x1385ece8f1b0017 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.$$YJP$$checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.checkConnect(SocketChannelImpl.java)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
> 2012-07-07 00:23:07,691 ERROR [Thread-23-EventThread|]@jenkins com.netflix.curator.framework.imps.CuratorFrameworkImpl
> => Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
> at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2012-07-07 00:23:07,697 WARN [ConnectionStateManager-0|]@jenkins com.netflix.curator.framework.state.ConnectionStateManager
> => There are no ConnectionStateListeners registered.
> 2012-07-07 00:23:07,699 ERROR [Thread-23-EventThread|]@jenkins backtype.storm.zookeeper
> => Unrecoverable Zookeeper error Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
> at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> {code}
> And here is what we see in the Storm-dedicated log file (log level: DEBUG):
> {code}
> 2012-07-07 00:22:58,306 INFO [main|]@jenkins backtype.storm.daemon.task
> => Shut down task TLTopology-1-1341620393:31
> 2012-07-07 00:22:58,306 INFO [main|]@jenkins backtype.storm.messaging.loader
> => Shutting down receiving-thread: [TLTopology-1-1341620393, 5]
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.messaging.loader
> => Waiting for receiving-thread:[TLTopology-1-1341620393, 5] to die
> 2012-07-07 00:22:58,307 INFO [Thread-319|]@jenkins backtype.storm.messaging.loader
> => Receiving-thread:[TLTopology-1-1341620393, 5] received shutdown notice
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.messaging.loader
> => Shutdown receiving-thread: [TLTopology-1-1341620393, 5]
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Terminating zmq context
> 2012-07-07 00:22:58,307 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Waiting for threads to die
> 2012-07-07 00:22:58,307 INFO [Thread-318|]@jenkins backtype.storm.util
> => Async loop interrupted!
> 2012-07-07 00:22:58,309 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Disconnecting from storm cluster state context
> 2012-07-07 00:22:58,311 INFO [main|]@jenkins backtype.storm.daemon.worker
> => Shut down worker TLTopology-1-1341620393 96e12303-4c22-4821-9f3b-3bce2230bf08 5
> 2012-07-07 00:22:58,311 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc/workers/16966f32-d0d4-4ee1-a0fe-1d85fc4a478e/heartbeats
> 2012-07-07 00:22:58,313 DEBUG [main|]@jenkins backtype.storm.util
> => Removing path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc/workers/16966f32-d0d4-4ee1-a0fe-1d85fc4a478e/pids
> 2012-07-07 00:22:58,313 DEBUG [main|]@jenkins backtype.storm.util
> => Removing path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc/workers/16966f32-d0d4-4ee1-a0fe-1d85fc4a478e
> 2012-07-07 00:22:58,313 INFO [main|]@jenkins backtype.storm.daemon.supervisor
> => Shut down 96e12303-4c22-4821-9f3b-3bce2230bf08:16966f32-d0d4-4ee1-a0fe-1d85fc4a478e
> 2012-07-07 00:22:58,314 INFO [main|]@jenkins backtype.storm.daemon.supervisor
> => Shutting down supervisor 96e12303-4c22-4821-9f3b-3bce2230bf08
> 2012-07-07 00:22:58,314 INFO [Thread-25|]@jenkins backtype.storm.event
> => Event manager interrupted
> 2012-07-07 00:22:58,315 INFO [Thread-26|]@jenkins backtype.storm.event
> => Event manager interrupted
> 2012-07-07 00:22:58,318 INFO [main|]@jenkins backtype.storm.testing
> => Shutting down in process zookeeper
> 2012-07-07 00:22:58,321 INFO [main|]@jenkins backtype.storm.testing
> => Done shutting down in process zookeeper
> 2012-07-07 00:22:58,321 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/0202cf11-6ad7-4dda-94d6-622a63c9f6b6
> 2012-07-07 00:22:58,321 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/0202cf11-6ad7-4dda-94d6-622a63c9f6b6
> 2012-07-07 00:22:58,322 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/ee47e3e3-752f-40a8-b6a9-a197a9dda3de
> 2012-07-07 00:22:58,323 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/ee47e3e3-752f-40a8-b6a9-a197a9dda3de
> 2012-07-07 00:22:58,323 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/ece72b84-357e-4183-aeb5-e0d2dc5d6eca
> 2012-07-07 00:22:58,323 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/ece72b84-357e-4183-aeb5-e0d2dc5d6eca
> 2012-07-07 00:22:58,326 INFO [main|]@jenkins backtype.storm.testing
> => Deleting temporary path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc
> 2012-07-07 00:22:58,326 DEBUG [main|]@jenkins backtype.storm.util
> => Rmr path /tmp/f308eb0e-2e72-4221-9620-43e15a9c1bdc
> 2012-07-07 00:22:58,534 WARN [Thread-23-EventThread|]@jenkins backtype.storm.cluster
> => Received event :disconnected::none: with disconnected Zookeeper.
> 2012-07-07 00:23:07,699 ERROR [Thread-23-EventThread|]@jenkins backtype.storm.zookeeper
> => Unrecoverable Zookeeper error Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
> at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2012-07-07 00:23:07,702 INFO [Thread-23-EventThread|]@jenkins backtype.storm.util
> => Halting process: ("Unrecoverable Zookeeper error")
> {code}
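> The final "Halting process" line is why the JVM exits with no test results: the log suggests the process is halted outright rather than shut down normally. A minimal, Storm-free Java illustration of the difference, assuming the error handler behaves like Runtime.halt (an assumption inferred from the abrupt exit, not taken from Storm's source):
> {code}
> public class HaltVsExit {
>     public static void main(String[] args) {
>         // Shutdown hooks, where test runners and reporters typically flush their results,
>         // run on System.exit() or normal termination...
>         Runtime.getRuntime().addShutdownHook(new Thread(
>                 () -> System.out.println("shutdown hook ran: results could be flushed here")));
>
>         // ...but Runtime.halt() terminates the JVM immediately and skips the hooks,
>         // which would explain why no test information is recorded at the end.
>         Runtime.getRuntime().halt(1);
>     }
> }
> {code}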
> It seems like a threading issue to me personally. I wonder if there is some form of workaround (a possible test-side mitigation is sketched below). I also understand that, since this is a "local" topology issue, it might not receive due attention... However, a local topology is fundamentally what new users start with when they begin to play with Storm, and I think it is important to make that experience a positive one.
> Nathan, thank you very much for everything that you're doing.
> -Kyrill
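> A minimal sketch of one possible test-side mitigation, assuming a Java harness like the one sketched near the top of this issue. The 35-second wait is a guess based on the default topology.message.timeout.secs of 30 that appears in the worker configuration dumped later in this thread; this only narrows the window during which workers can be relaunched, it does not fix the underlying race.
> {code}
> import backtype.storm.LocalCluster;
>
> public class GroupTeardown {
>     /** Kill the group's topology, wait, and only then shut the local cluster down. */
>     public static void tearDownGroup(LocalCluster cluster, String topologyName)
>             throws InterruptedException {
>         cluster.killTopology(topologyName);
>         // Give Nimbus time to process the kill and clear the assignment, so that by the
>         // time the cluster is torn down the supervisor has nothing left to relaunch.
>         Thread.sleep(35000);
>         cluster.shutdown();
>     }
> }
> {code}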
> ----------
> dkincaid: Looking through the shutdown code for local clusters, I noticed a comment in the code about a possible race condition. I'm wondering if we could be running into this on our Jenkins server (which we know runs pretty slowly). Is a worker getting restarted before the supervisor can be shut down?
> Here is the function with the comment (a stripped-down illustration of the ordering problem follows it):
> {code}
> (defn kill-local-storm-cluster [cluster-map]
>   (.shutdown (:nimbus cluster-map))
>   (.close (:state cluster-map))
>   (.disconnect (:storm-cluster-state cluster-map))
>   (doseq [s @(:supervisors cluster-map)]
>     (.shutdown-all-workers s)
>     ;; race condition here? will it launch the workers again?
>     (supervisor/kill-supervisor s))
>   (psim/kill-all-processes)
>   (log-message "Shutting down in process zookeeper")
>   (zk/shutdown-inprocess-zookeeper (:zookeeper cluster-map))
>   (log-message "Done shutting down in process zookeeper")
>   (doseq [t @(:tmp-dirs cluster-map)]
>     (log-message "Deleting temporary path " t)
>     (rmr t)
>     ))
> {code}
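> The ordering problem the comment hints at is easier to see in miniature. A hedged, Storm-free analogy in plain Java (none of this is Storm code): a periodic "sync" task keeps relaunching workers until it is stopped, so tearing the workers down before stopping the task leaves a window in which a fresh worker can appear.
> {code}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class SupervisorRaceSketch {
>     public static void main(String[] args) throws Exception {
>         AtomicInteger liveWorkers = new AtomicInteger(0);
>         ScheduledExecutorService syncLoop = Executors.newSingleThreadScheduledExecutor();
>
>         // Periodic "sync" task: as long as an assignment exists, make sure a worker is running.
>         // This stands in for the supervisor's periodic synchronization thread.
>         syncLoop.scheduleAtFixedRate(() -> {
>             if (liveWorkers.get() == 0) {
>                 liveWorkers.incrementAndGet(); // "launch" a worker
>                 System.out.println("sync loop (re)launched a worker");
>             }
>         }, 0, 100, TimeUnit.MILLISECONDS);
>
>         Thread.sleep(300);
>
>         // Racy shutdown ordering: kill the workers first, stop the sync loop afterwards.
>         // The sync loop may fire in between and bring a worker back.
>         liveWorkers.set(0);
>         Thread.sleep(150); // the window in which the relaunch can happen
>         syncLoop.shutdownNow();
>         System.out.println("workers still alive after shutdown: " + liveWorkers.get());
>
>         // The race-free ordering is the reverse: stop the sync loop (and wait for it to
>         // terminate) before tearing the workers down.
>     }
> }
> {code}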
> --------
> kyrill007: Fantastic catch, Dave!!! This is exactly what is happening: the supervisor begins launching new workers while the other ones are still being shut down. Here is the proof from the logs:
> The shutdown process is initiated at 04:37:05,136:
> {code}
> 2012-07-11 04:37:05,136 INFO [main|]@jenkins backtype.storm.daemon.nimbus
>   => Shutting down master
> 2012-07-11 04:37:05,145 INFO [main|]@jenkins backtype.storm.daemon.nimbus
>   => Shut down master
> 2012-07-11 04:37:05,151 INFO [main|]@jenkins backtype.storm.daemon.supervisor
>   => Shutting down 5c48d4fc-769f-41ef-abd6-f92df60fa543:12eba15d-fb17-4a3c-8e25-1c0266eed04d
> 2012-07-11 04:37:05,152 INFO [main|]@jenkins backtype.storm.process-simulator
>   => Killing process ea132b37-dc6a-447c-b1de-ac6727c82cef
> 2012-07-11 04:37:05,152 INFO [main|]@jenkins backtype.storm.daemon.worker
>   => Shutting down worker TLTopology-1-1341981237 5c48d4fc-769f-41ef-abd6-f92df60fa543 1
> 2012-07-11 04:37:05,152 INFO [main|]@jenkins backtype.storm.daemon.task
>   => Shutting down task TLTopology-1-1341981237:64
> 2012-07-11 04:37:05,153 INFO [Thread-129|]@jenkins backtype.storm.util
>   => Async loop interrupted!
> 2012-07-11 04:37:05,180 INFO [main|]@jenkins backtype.storm.daemon.task
>   => Shut down task TLTopology-1-1341981237:64
> 2012-07-11 04:37:05,180 INFO [main|]@jenkins backtype.storm.daemon.task
>   => Shutting down task TLTopology-1-1341981237:34
> {code}
> It continues for a while (we have a lot of workers). Then at 04:37:05,665 we start seeing this:
> {code}
> 2012-07-11 04:37:05,665 DEBUG [Thread-19|]@jenkins backtype.storm.daemon.supervisor
>   => Assigned tasks: {2 #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "TLTopology-1-1341981237", :task-ids (96 66 36 6 102 72 42 12 108 78 48 18 114 84 54 24 120 90 60 30 126)}, 1 #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "TLTopology-1-1341981237", :task-ids (64 34 4 100 70 40 10 106 76 46 16 112 82 52 22 118 88 58 28 124 94)}, 3 #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "TLTopology-1-1341981237", :task-ids (32 2 98 68 38 8 104 74 44 14 110 80 50 20 116 86 56 26 122 92 62)}}
> 2012-07-11 04:37:05,665 DEBUG [Thread-19|]@jenkins backtype.storm.daemon.supervisor
>   => Allocated: {"a724dc19-84ec-46dc-9768-afb73df94237" [:valid #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1341981425, :storm-id "TLTopology-1-1341981237", :task-ids #{96 66 36 6 102 72 42 12 108 78 48 18 114 84 54 24 120 90 60 30 126}, :port 2}]}
> 2012-07-11 04:37:05,665 DEBUG [Thread-19|]@jenkins backtype.storm.util
>   => Making dirs at /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a7f81ea0-a5f6-47de-9a89-47998b1e1639/pids
> 2012-07-11 04:37:05,666 DEBUG [Thread-19|]@jenkins backtype.storm.util
>   => Making dirs at /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/1b5c4c87-4e05-4cab-a580-ae1dabb3fd2e/pids
> 2012-07-11 04:37:05,666 INFO [main|]@jenkins backtype.storm.daemon.worker
>   => Shut down worker TLTopology-1-1341981237 5c48d4fc-769f-41ef-abd6-f92df60fa543 2
> 2012-07-11 04:37:05,667 DEBUG [main|]@jenkins backtype.storm.util
>   => Rmr path /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a724dc19-84ec-46dc-9768-afb73df94237/heartbeats
> 2012-07-11 04:37:05,669 DEBUG [main|]@jenkins backtype.storm.util
>   => Removing path /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a724dc19-84ec-46dc-9768-afb73df94237/pids
> 2012-07-11 04:37:05,669 DEBUG [main|]@jenkins backtype.storm.util
>   => Removing path /tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c/workers/a724dc19-84ec-46dc-9768-afb73df94237
> 2012-07-11 04:37:05,669 INFO [main|]@jenkins backtype.storm.daemon.supervisor
>   => Shut down 5c48d4fc-769f-41ef-abd6-f92df60fa543:a724dc19-84ec-46dc-9768-afb73df94237
> 2012-07-11 04:37:05,669 INFO [main|]@jenkins backtype.storm.daemon.supervisor
>   => Shutting down supervisor 5c48d4fc-769f-41ef-abd6-f92df60fa543
> 2012-07-11 04:37:05,670 INFO [Thread-18|]@jenkins backtype.storm.event
>   => Event manager interrupted
> 2012-07-11 04:37:05,670 INFO [Thread-19|]@jenkins backtype.storm.daemon.supervisor
>   => Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "TLTopology-1-1341981237", :task-ids (64 34 4 100 70 40 10 106 76 46 16 112 82 52 22 118 88 58 28 124 94)} for this supervisor 5c48d4fc-769f-41ef-abd6-f92df60fa543 on port 1 with id a7f81ea0-a5f6-47de-9a89-47998b1e1639
> 2012-07-11 04:37:05,672 INFO [Thread-19|]@jenkins backtype.storm.daemon.worker
>   => Launching worker for TLTopology-1-1341981237 on 5c48d4fc-769f-41ef-abd6-f92df60fa543:1 with id a7f81ea0-a5f6-47de-9a89-47998b1e1639 and conf {"dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.fall.back.on.java.serialization" true, "zmq.linger.millis" 0, "topology.skip.missing.kryo.registrations" true, "ui.childopts" "-Xmx768m", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "nimbus.monitor.freq.secs" 10, "java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib", "storm.local.dir" "/tmp/4884ffb5-c6c7-43a9-ac72-e0a5426eea3c", "supervisor.worker.start.timeout.secs" 120, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "nimbus.host" "localhost", "storm.zookeeper.port" 2000, "transactional.zookeeper.port" nil, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "supervisor.enable" true, "storm.zookeeper.servers" ["localhost"], "transactional.zookeeper.root" "/transactional", "topology.worker.childopts" nil, "worker.childopts" "-Xmx768m", "supervisor.heartbeat.frequency.secs" 5, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "task.heartbeat.frequency.secs" 3, "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "supervisor.slots.ports" (1 2 3), "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.message.timeout.secs" 30, "task.refresh.poll.secs" 10, "topology.workers" 1, "supervisor.childopts" "-Xmx1024m", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "nimbus.task.timeout.secs" 30, "drpc.invocations.port" 3773, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8080, "nimbus.childopts" "-Xmx1024m", "topology.ackers" 1, "storm.cluster.mode" "local", "topology.optimize" true, "topology.max.task.parallelism" nil}
> 2012-07-11 04:37:05,675 INFO [Thread-19|]@jenkins backtype.storm.event
>   => Event manager interrupted
> 2012-07-11 04:37:05,677 INFO [Thread-19-EventThread|]@jenkins backtype.storm.zookeeper
>   => Zookeeper state update: :connected:none
> {code}
> Eventually this results in the following:
> {code}
> 2012-07-11 04:37:06,175 INFO [Thread-19-EventThread|]@jenkins backtype.storm.zookeeper
>   => Zookeeper state update: :disconnected:none
> 2012-07-11 04:37:06,175 WARN [Thread-22-EventThread|]@jenkins backtype.storm.cluster
>   => Received event :disconnected::none: with disconnected Zookeeper.
> 2012-07-11 04:37:15,923 ERROR [Thread-22-EventThread|]@jenkins backtype.storm.zookeeper
>   => Unrecoverable Zookeeper error Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>     at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
>     at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:613)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
> 2012-07-11 04:37:15,926 INFO [Thread-22-EventThread|]@jenkins backtype.storm.util
>   => Halting process: ("Unrecoverable Zookeeper error")
> {code}
> Dear Nathan,
> If this race condition could somehow be fixed (presumably it is not that hard, since we know what the problem is), it would be so much appreciated!!!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)