Posted to user@storm.apache.org by Mauro Giusti <ma...@microsoft.com> on 2017/08/15 16:33:52 UTC

How many nimbus / zookeepers for a zero downtime topology in production?

So we want to keep our topologies always running -
We have a production cluster hosted on Kubernetes (K8s) with 1 nimbus, 1 zookeeper, 1 UI and 3 supervisor containers.

We are wondering whether we need 2 nimbuses and/or 2 zookeepers to make sure the topologies are always up when we do maintenance.

We observed that restarting the nimbus does not stop the topologies from running (though the UI was not accessible).
When we restarted the zookeeper, however, the configuration was lost, so we had to re-deploy the topologies.

Any pointer to configuration for this case is appreciated -

Thanks -
Mauro Giusti

Re: How many nimbus / zookeepers for a zero downtime topology in production?

Posted by Stig Rohde Døssing <sr...@apache.org>.
You should have at least 3 Zookeepers. A Zookeeper ensemble needs a
majority (quorum) of its nodes up, so an ensemble of n nodes can tolerate
losing floor((n-1)/2) of them; with 3 nodes the cluster keeps working as
long as any 2 nodes are up. Add more nodes as necessary for your use case
(e.g. 5 to be able to handle 2 nodes failing at the same time). Storm
assumes that the Zookeeper cluster is always available, so you should
provision it to handle failures.
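
For reference, a minimal storm.yaml fragment pointing the Storm daemons at
a 3-node Zookeeper ensemble could look like the sketch below (the
zk1/zk2/zk3 host names are placeholders for your own service names, e.g.
Kubernetes service DNS names):

    # List every node of the Zookeeper ensemble, not just one,
    # so Storm can keep working while individual nodes are down.
    storm.zookeeper.servers:
      - "zk1.example.internal"
      - "zk2.example.internal"
      - "zk3.example.internal"
    storm.zookeeper.port: 2181

On Kubernetes you would typically also give each Zookeeper pod a persistent
volume, so its data directory (and with it the topology state Storm keeps
in Zookeeper) survives pod restarts.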

You might also want to add sufficient supervisors so there are enough
worker slots for the topologies you run if one or more supervisors go
offline. You can see in the UI how many slots you are using. By default
there are 4 slots for each supervisor, but you can add more in the
storm.yaml configuration if you need to.
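
As a concrete sketch, the slot count is controlled by supervisor.slots.ports
in storm.yaml; each listed port is one worker slot, so adding ports adds
slots. The values below are the default four ports plus two extra, purely as
an example:

    # One worker slot per port; 6700-6703 is the default set of four.
    supervisor.slots.ports:
      - 6700
      - 6701
      - 6702
      - 6703
      # Extra slots, assuming the supervisor host/container has the
      # memory and CPU headroom to run the additional workers.
      - 6704
      - 6705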

Regarding multiple Nimbus hosts, I think that's mainly useful if you need
the UI or the ability to submit/kill/rebalance topologies while you're
doing maintenance, but it can also matter if a supervisor dies while
Nimbus is down, since the dead workers won't be reassigned until Nimbus
becomes available again. Please read
https://storm.apache.org/releases/2.0.0-SNAPSHOT/nimbus-ha-design.html for
how to set this up. I haven't used HA Nimbus, but going by the description
in that document, you should be able to handle n Nimbus failures as long as
you have n + 1 Nimbus nodes with the topology code. You'll need to decide
what to set nimbus.min.replication.count to as well. It configures how many
Nimbus nodes need to have the topology code before a submit is considered
complete.
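
As an illustrative fragment (not copied from the design doc), a two-Nimbus
setup is configured by listing both hosts in nimbus.seeds, and the
replication setting mentioned above then decides when a submit is complete.
The host names are placeholders, and you should verify the exact key names
against defaults.yaml for your Storm version:

    # All Nimbus hosts; leader election picks the active one.
    nimbus.seeds: ["nimbus1.example.internal", "nimbus2.example.internal"]
    # Require the topology code on 2 Nimbus nodes before a submit is
    # considered complete (key name as discussed in this thread).
    nimbus.min.replication.count: 2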
