Posted to user@zookeeper.apache.org by Martin Kou <bi...@gmail.com> on 2012/04/19 20:29:44 UTC

Zookeeper server recovery behaviors

Hi folks,

I've got a few questions about how Zookeeper servers behave in fail-recover
scenarios:

Assume I have a 5-server ZooKeeper cluster, and one of the servers goes dark
and comes back, say, an hour later.

   1. Is it correct to assume that clients can't connect to the recovering
   server while it's still synchronizing with the leader, and that any new
   client connections would therefore automatically fall back to the other 4
   servers during synchronization?
   2. The documentation says a newly restarted server has initLimit ticks
   (initLimit * tickTime) to connect to and synchronize with the leader. Is
   it correct to assume the time needed for synchronization is bounded by the
   amount of data managed by ZooKeeper? Say, in the worst case, someone set a
   very large snapCount for the cluster and there were a lot of transactions
   but not many znodes - so each ZooKeeper server holds little data but a
   very long transaction log. Would that bound still hold? (An illustrative
   zoo.cfg fragment with these settings follows the list.)
   3. I noticed from the documentation that a ZooKeeper server falling more
   than syncLimit ticks (syncLimit * tickTime) behind the leader is dropped
   from the quorum. I guess that's for detecting network partitions, right?
   If the partitioned server does report back to the leader later, how would
   it behave? (e.g. would it refuse new client connections while it's
   synchronizing?)
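
For reference, here is a minimal zoo.cfg fragment showing how those settings
relate to each other; the values are illustrative only, not recommendations:

# zoo.cfg fragment - illustrative values only
# length of a single tick, in milliseconds
tickTime=2000
# a recovering follower has initLimit ticks (10 * 2000 ms = 20 s here)
# to connect to and sync with the leader
initLimit=10
# a follower more than syncLimit ticks (5 * 2000 ms = 10 s here) behind
# the leader is dropped from the quorum
syncLimit=5
# transactions written to the log before a new snapshot is started; a large
# value means a long transaction log to replay relative to the snapshot
snapCount=100000
dataDir=/var/lib/zookeeper
clientPort=2181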

Thanks.

Best Regards,
Martin Kou

Re: Zookeeper server recovery behaviors

Posted by Ryan LeCompte <le...@gmail.com>.
Wow, thank you for that awesome explanation, Ted. Makes me feel really happy
that I decided to build redis_failover on top of ZooKeeper! :)

https://github.com/ryanlecompte/redis_failover

Best,
Ryan

Re: Zookeeper server recovery behaviors

Posted by Ted Dunning <te...@gmail.com>.
The client can't think it has succeeded with a deletion if it is connected
to the minority side of a partitioned cluster.  To think that, the commit
would have to be ack'ed by a majority, which by definition can't happen
either because the master is in the minority and can't get a majority or
because the master is no longer reachable from the server the client is
connected to.  If the master is in the minority, then when the commit
fails, the minority will start a leader election which will fail due to
inability to commit.  At some point, the majority side will tire of waiting
to hear from the master and will also start an election which will succeed.

All clients connected to the minority side will be told to reconnect and
will either fail if they can't talk to a node on the master side or will
succeed in connecting with a node in the new quorum.

When and if the partition heals, the nodes in the minority will
resynchronize and then start handling connections and requests.  Pending
requests from them will be discarded because the epoch number will have
been incremented in the new leader election.
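
To make the client-side view of this concrete, here is a minimal sketch using
the plain Java client, assuming an illustrative five-server connect string and
session timeout; it just logs the connection-state transitions described above:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectionStateLogger implements Watcher {

    public void process(WatchedEvent event) {
        // Session-level events carry no znode path; they describe the connection.
        switch (event.getState()) {
            case Disconnected:
                // The server we were attached to became unreachable (e.g. it is
                // on the minority side of a partition). The client library keeps
                // the session and retries the other servers in the connect string.
                System.out.println("Disconnected - retrying other servers");
                break;
            case SyncConnected:
                System.out.println("(Re)connected to a server in the quorum");
                break;
            case Expired:
                // The session timed out before a quorum server was reached:
                // ephemerals and watches are gone, a new handle must be created.
                System.out.println("Session expired - need a new ZooKeeper handle");
                break;
            default:
                break;
        }
    }

    public static void main(String[] args) throws Exception {
        // Connect string and 15 s session timeout are illustrative values.
        ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181",
                15000,
                new ConnectionStateLogger());
        Thread.sleep(Long.MAX_VALUE); // keep the process alive to observe events
    }
}

The Disconnected -> SyncConnected hop is handled by the client library itself;
only an Expired session forces the application to build a new handle and
re-create its ephemerals and watches.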

Re: Zookeeper server recovery behaviors

Posted by Ryan LeCompte <le...@gmail.com>.
Great questions. I'd also like to add:

- What happens when there is a network partition and one client successfully
deletes a znode for which other clients have set up watches? Are those
clients guaranteed to receive the NodeDeleted watch event if the client on
the other side of the partition believes its delete succeeded? (A small
sketch of setting such a watch follows.)
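
For concreteness, a watch of the kind described above might be registered like
this with the plain Java client; the connect string and znode path are made-up
examples, not from any real setup:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DeletionWatchSketch {

    public static void main(String[] args) throws Exception {
        final CountDownLatch deleted = new CountDownLatch(1);

        // Connect string, session timeout and znode path are illustrative.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        // session-level events ignored in this sketch
                    }
                });

        // exists() registers a one-shot watch that fires on creation,
        // deletion or data change of the znode.
        zk.exists("/demo/some-znode", new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    deleted.countDown();
                }
            }
        });

        // Blocks until the NodeDeleted event is delivered. If the session
        // expires first, the watch is lost and would have to be re-registered
        // after reconnecting with a new handle.
        deleted.await();
        System.out.println("znode was deleted");
        zk.close();
    }
}

Watches are one-shot and tied to the session, so if the watching client's
session expires during the partition, the watch (and the NodeDeleted event) is
lost and has to be re-registered against the then-current state.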

Thanks,
Ryan
