Posted to dev@storm.apache.org by "Itai Frenkel (JIRA)" <ji...@apache.org> on 2014/10/10 09:11:33 UTC

[jira] [Commented] (STORM-526) Nimbus triggered complete removal of all topologies due to maintenance in 2 out of 3 zookeeper servers

    [ https://issues.apache.org/jira/browse/STORM-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166494#comment-14166494 ] 

Itai Frenkel commented on STORM-526:
------------------------------------

If anyone else runs into the same problem: we overcame it by fixing our ZooKeeper issues. We disabled Exhibitor's S3 backups, since they perform CPU-intensive compression, and moved to ephemeral SSD drives. We also disabled Exhibitor's automatic ZooKeeper restarts and now only trigger a PagerDuty alert when one of the ZooKeeper servers is down.

> Nimbus triggered complete removal of all topologies due to maintenance in 2 out of 3 zookeeper servers
> ------------------------------------------------------------------------------------------------------
>
>                 Key: STORM-526
>                 URL: https://issues.apache.org/jira/browse/STORM-526
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: AWS EC2 ubuntu
>            Reporter: Itai Frenkel
>
> We use a cluster of 3 ZooKeeper servers; all 3 IP addresses are in the storm.yml file. We restarted one ZooKeeper server and, once it was ready, restarted the second. The whole time, the third server was "green" (as monitored by Netflix Exhibitor).
> At the same time, Nimbus "decided" to remove all topologies (the log entry is "Corrupt topology my-topology-xxx has state on zookeeper but doesn't have a local dir on Nimbus. Cleaning up...").
> I looked at the relevant code, and I am not entirely sure the log message correctly describes what the code does.
> Could anyone please read nimbus.clj#cleanup-corrupt-topologies and explain under what conditions Nimbus acts this way?
> https://github.com/apache/storm/blob/v0.9.2-incubating/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L854
> Log file:
> 2014-10-01 10:47:19 b.s.d.nimbus [INFO] Corrupt topology my-topology-1-2-1412151059 has state on zookeeper but doesn't have a local dir on Nimbus. Cleaning up...
> 2014-10-01 10:47:19 b.s.d.nimbus [INFO] Corrupt topology my-topology-0-1-1412151059 has state on zookeeper but doesn't have a local dir on Nimbus. Cleaning up...
> 2014-10-01 10:47:19 b.s.d.nimbus [INFO] Corrupt topology my-topology-3-4-1412151062 has state on zookeeper but doesn't have a local dir on Nimbus. Cleaning up...
> 2014-10-01 10:47:19 b.s.d.nimbus [INFO] Corrupt topology my-topology-2-3-1412151060 has state on zookeeper but doesn't have a local dir on Nimbus. Cleaning up...
> 2014-10-01 10:47:19 b.s.d.nimbus [INFO] Starting Nimbus server...
> 2014-10-01 10:47:20 b.s.d.nimbus [INFO] Cleaning up my-topology-1-2-1412151059
> 2014-10-01 10:47:20 b.s.d.nimbus [INFO] Cleaning up my-topology-0-1-1412151059
> 2014-10-01 10:47:20 b.s.d.nimbus [INFO] Cleaning up my-topology-3-4-1412151062
> 2014-10-01 10:47:20 b.s.d.nimbus [INFO] Cleaning up my-topology-2-3-1412151060
> 2014-10-01 10:52:16 b.s.d.nimbus [INFO] Shutting down master
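
For readers following the thread, here is a minimal sketch of the check that the log message above appears to describe: a topology counts as "corrupt" when its ID has state in ZooKeeper but no matching local directory on Nimbus. This is written in Python for illustration only and is based on the wording of the log message, not on a verified reading of nimbus.clj; the function name find_corrupt_topologies and both input lists are hypothetical.

```python
def find_corrupt_topologies(zk_topology_ids, local_topology_ids):
    """Return, sorted, the IDs that have ZooKeeper state but no local dir."""
    # Set difference: everything known to ZooKeeper minus everything on disk.
    return sorted(set(zk_topology_ids) - set(local_topology_ids))


# Topology IDs taken from the log excerpt above.
zk_ids = [
    "my-topology-1-2-1412151059",
    "my-topology-0-1-1412151059",
    "my-topology-3-4-1412151062",
    "my-topology-2-3-1412151060",
]

# Assumption for illustration: Nimbus finds no local topology directories.
local_ids = []

for topology_id in find_corrupt_topologies(zk_ids, local_ids):
    print(f"Corrupt topology {topology_id} has state on zookeeper "
          "but doesn't have a local dir on Nimbus. Cleaning up...")
```

Under this reading, if the list of local directories comes back empty for any reason, every topology with state in ZooKeeper is flagged at once, which would be consistent with all four topologies being cleaned up in the same second.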



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)