Posted to dev@zookeeper.apache.org by Michael Borokhovich <mi...@gmail.com> on 2018/12/05 23:20:08 UTC

Leader election

Hello,

We have a service that runs on 3 hosts for high availability. However, at
any given time, exactly one instance must be active. So, we are thinking of
using leader election with ZooKeeper.
To this end, on each service host we also start a ZK server, so we have a
3-node ZK cluster and each service instance is a client to its dedicated
ZK server.
Then, we implement a leader election on top of Zookeeper using a basic
recipe:
https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
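
For concreteness, here is a minimal sketch of that recipe against the raw
ZooKeeper Java API (the paths, timeouts, and error handling are simplified
and illustrative only):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;

public class ElectionSketch implements Watcher {
    private final ZooKeeper zk;
    private final String myNode; // e.g. "n_0000000007"

    public ElectionSketch(String connect) throws Exception {
        zk = new ZooKeeper(connect, 15000, this);
        try {
            zk.create("/election", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignore) { }
        // Ephemeral sequential znode: it vanishes if our session dies.
        String path = zk.create("/election/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring("/election/".length());
        check();
    }

    // Leader iff our znode has the lowest sequence number; otherwise watch
    // only the next-lower node to avoid a herd effect.
    private void check() throws Exception {
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        int idx = children.indexOf(myNode);
        if (idx == 0) {
            System.out.println("I am the leader");
        } else {
            String prev = "/election/" + children.get(idx - 1);
            if (zk.exists(prev, this) == null) check(); // already gone
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { check(); } catch (Exception e) { throw new RuntimeException(e); }
        }
    }
}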

I have the following questions and doubts regarding the approach:

1. It seems like we can run into inconsistency issues when network
partition occurs. Zookeeper documentation says that the inconsistency
period may last “tens of seconds”. Am I understanding correctly that during
this time we may have 0 or 2 leaders?
2. Is it possible to reduce this inconsistency time (let's say to 3
seconds) by tweaking tickTime and syncLimit parameters?
3. Is there a way to guarantee exactly one leader all the time? Should we
implement a more complex leader election algorithm than the one suggested
in the recipe (using ephemeral_sequential nodes)?

Thanks,
Michael.

Re: Leader election

Posted by Michael Borokhovich <mi...@gmail.com>.
Yes, I agree, our system should be able to tolerate two leaders for a short
and bounded period of time.
Thank you for the help!

On Thu, Dec 6, 2018 at 11:09 AM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> > it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself
>
> Yes - there are many ways in which you can end up with 2 leaders. However,
> if properly tuned and configured, it will be for a few seconds at most.
> During a GC pause no work is being done anyway. But, this stuff is very
> tricky. Requiring an atomically unique leader is actually a design smell
> and you should reconsider your architecture.
>
> > Maybe we can use a more
> > lightweight Hazelcast for example?
>
> There is no distributed system that can guarantee a single leader. Instead
> you need to adjust your design and algorithms to deal with this (using
> optimistic locking, etc.).
>
> -Jordan
>
> > On Dec 6, 2018, at 1:52 PM, Michael Borokhovich <mi...@gmail.com>
> wrote:
> >
> > Thanks Jordan,
> >
> > Yes, I will try Curator.
> > Also, beyond the problem described in the Tech Note, it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself. E.g., if a "leader" client is connected to the partitioned ZK
> node,
> > it may be notified not at the same time as the other clients connected to
> > the other ZK nodes. So, another client may take leadership while the
> > current leader still unaware of the change. Is it true?
> >
> > Another follow up question. If Zookeeper can guarantee a single leader,
> is
> > it worth using it just for leader election? Maybe we can use a more
> > lightweight Hazelcast for example?
> >
> > Michael.
> >
> >
> > On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <
> jordan@jordanzimmerman.com>
> > wrote:
> >
> >> It is not possible to achieve the level of consistency you're after in
> an
> >> eventually consistent system such as ZooKeeper. There will always be an
> >> edge case where two ZooKeeper clients will believe they are leaders
> (though
> >> for a short period of time). In terms of how it affects Apache Curator,
> we
> >> have this Tech Note on the subject:
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
> >> description is true for any ZooKeeper client, not just Curator
> clients). If
> >> you do still intend to use a ZooKeeper lock/leader I suggest you try
> Apache
> >> Curator as writing these "recipes" is not trivial and have many gotchas
> >> that aren't obvious.
> >>
> >> -Jordan
> >>
> >> http://curator.apache.org <http://curator.apache.org/>
> >>
> >>
> >>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich <mi...@gmail.com>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> We have a service that runs on 3 hosts for high availability. However,
> at
> >>> any given time, exactly one instance must be active. So, we are
> thinking
> >> to
> >>> use Leader election using Zookeeper.
> >>> To this goal, on each service host we also start a ZK server, so we
> have
> >> a
> >>> 3-nodes ZK cluster and each service instance is a client to its
> dedicated
> >>> ZK server.
> >>> Then, we implement a leader election on top of Zookeeper using a basic
> >>> recipe:
> >>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection
> .
> >>>
> >>> I have the following questions and doubts regarding the approach:
> >>>
> >>> 1. It seems like we can run into inconsistency issues when network
> >>> partition occurs. Zookeeper documentation says that the inconsistency
> >>> period may last “tens of seconds”. Am I understanding correctly that
> >> during
> >>> this time we may have 0 or 2 leaders?
> >>> 2. Is it possible to reduce this inconsistency time (let's say to 3
> >>> seconds) by tweaking tickTime and syncLimit parameters?
> >>> 3. Is there a way to guarantee exactly one leader all the time? Should
> we
> >>> implement a more complex leader election algorithm than the one
> suggested
> >>> in the recipe (using ephemeral_sequential nodes)?
> >>>
> >>> Thanks,
> >>> Michael.
> >>
> >>
>
>

Re: Leader election

Posted by Michael Han <ha...@apache.org>.
Tweaking timeouts is tempting, since the solution will work most of the time
yet still fail in certain cases (which others have pointed out). If the goal
is absolute correctness then we should avoid relying on timeouts: a timeout
does not guarantee correctness, it only makes the problem harder to manifest.
Fencing is the right solution here - the zxid and also the znode cversion can
be used as a fencing token if you use ZooKeeper. Fencing guarantees that at
any single point in time only one active leader's actions take effect (it
does not prevent multiple parties from *thinking* they are the leader at the
same time); see the sketch below. An alternative solution, depending on your
use case, is to make your workload idempotent instead of requiring a single
active leader at any time, so that multiple active leaders don't do any
damage.
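
A minimal sketch of the fencing idea (the FencedStore and the paths are
hypothetical; only the ZooKeeper calls are real):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class FencingSketch {
    // Hypothetical downstream resource that remembers the highest token seen.
    static class FencedStore {
        private long highestToken = -1;
        synchronized boolean write(long token, byte[] data) {
            if (token < highestToken) return false; // stale leader: reject
            highestToken = token;
            // ... apply the write ...
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
        FencedStore store = new FencedStore();
        // The czxid of the znode created when winning the election grows
        // monotonically across elections, so it works as a fencing token.
        Stat stat = zk.exists("/election/leader-lock", false);
        if (stat != null) {
            long token = stat.getCzxid();
            boolean ok = store.write(token, "work".getBytes());
            System.out.println(ok ? "write accepted" : "fenced off as stale");
        }
        zk.close();
    }
}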

On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> > Old service leader will detect network partition max 15 seconds after it
> > happened.
>
> If the old service leader is in a very long GC it will not detect the
> partition. In the face of VM pauses, etc. it's not possible to avoid 2
> leaders for a short period of time.
>
> -JZ

Re: Leader election

Posted by Michael Borokhovich <mi...@gmail.com>.
Thanks, Maciej. That sounds good. We will try playing with the parameters
and have at least a known upper limit on the inconsistency interval.

On Fri, Dec 7, 2018 at 2:11 AM Maciej Smoleński <je...@gmail.com> wrote:

> On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich <mi...@gmail.com>
> wrote:
>
> > We are planning to run Zookeeper nodes embedded with the client nodes.
> > I.e., each client runs also a ZK node. So, network partition will
> > disconnect a ZK node and not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> up-to-date
> > within a certain time bound. (*On the order of tens of seconds.*) Either
> > system changes will be seen by a client within this bound, or the client
> > will detect a service outage."
> >
>
> This is related to the fact that ZooKeeper server handles reads from its
> local state - without communicating with other ZooKeeper servers.
> This design ensures scalability for read dominated workloads.
> In this approach client might receive data which is not up to date (it
> might not contain updates from other ZooKeeper servers (quorum)).
> Parameter 'syncLimit' describes how often ZooKeeper server
> synchronizes/updates its local state to global state.
> Client read operation will retrieve data from state not older then
> described by 'syncLimit'.
>
> However ZooKeeper client can always force to retrieve data which is up to
> date.
> It needs to issue 'sync' command to ZooKeeper server before issueing
> 'read'.
> With 'sync' ZooKeeper server with synchronize its local state with global
> state.
> Later 'read' will be handled from updated state.
> Client should be careful here - so that it communicates with the same
> ZooKeeper server for both 'sync' and 'read'.
>
>
> > What are these "*tens of seconds*"? Can we reduce this time by
> configuring
> > "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> > guarantee on this time bound?
> >
>
> As describe above - you might use 'sync'+'read' to avoid this problem.
>
>
> >
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> > jordan@jordanzimmerman.com>
> > wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds after
> > it
> > > > happened.
> > >
> > > If the old service leader is in a very long GC it will not detect the
> > > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > > leaders for a short period of time.
> > >
> > > -JZ
> >
>

Re: Leader election

Posted by Michael Han <ha...@apache.org>.
>> Can we reduce this time by configuring "syncLimit" and "tickTime" to
>> let's say 5 seconds? Can we have a strong guarantee on this time bound?

It's not possible to guarantee the time bound, because of the FLP
impossibility result (reliable failure detection is not possible in an
asynchronous environment). Though it's certainly possible to tune the
parameters to some reasonable value that fits your environment (which then
becomes the SLA of your service).

>> As describe above - you might use 'sync'+'read' to avoid this problem.

I am afraid sync + read would not be 100% correct in all cases here. The
state of the world (e.g. leaders) could change between the sync and the read
operation. What we need here is a linearizable read, which means we need to
have read operations also go through the quorum consensus; that might be a
nice feature for ZooKeeper to have (for reference, etcd implements
linearizable reads). Also, note that ZooKeeper's sync has bugs (sync should
be a quorum operation itself, but it's not implemented that way).

On Fri, Dec 7, 2018 at 2:11 AM Maciej Smoleński <je...@gmail.com> wrote:

> On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich <mi...@gmail.com>
> wrote:
>
> > We are planning to run Zookeeper nodes embedded with the client nodes.
> > I.e., each client runs also a ZK node. So, network partition will
> > disconnect a ZK node and not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> up-to-date
> > within a certain time bound. (*On the order of tens of seconds.*) Either
> > system changes will be seen by a client within this bound, or the client
> > will detect a service outage."
> >
>
> This is related to the fact that ZooKeeper server handles reads from its
> local state - without communicating with other ZooKeeper servers.
> This design ensures scalability for read dominated workloads.
> In this approach client might receive data which is not up to date (it
> might not contain updates from other ZooKeeper servers (quorum)).
> Parameter 'syncLimit' describes how often ZooKeeper server
> synchronizes/updates its local state to global state.
> Client read operation will retrieve data from state not older then
> described by 'syncLimit'.
>
> However ZooKeeper client can always force to retrieve data which is up to
> date.
> It needs to issue 'sync' command to ZooKeeper server before issueing
> 'read'.
> With 'sync' ZooKeeper server with synchronize its local state with global
> state.
> Later 'read' will be handled from updated state.
> Client should be careful here - so that it communicates with the same
> ZooKeeper server for both 'sync' and 'read'.
>
>
> > What are these "*tens of seconds*"? Can we reduce this time by
> configuring
> > "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> > guarantee on this time bound?
> >
>
> As describe above - you might use 'sync'+'read' to avoid this problem.
>
>
> >
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> > jordan@jordanzimmerman.com>
> > wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds after
> > it
> > > > happened.
> > >
> > > If the old service leader is in a very long GC it will not detect the
> > > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > > leaders for a short period of time.
> > >
> > > -JZ
> >
>

Re: Leader election

Posted by Maciej Smoleński <je...@gmail.com>.
On Fri, Dec 7, 2018 at 3:03 AM Michael Borokhovich <mi...@gmail.com>
wrote:

> We are planning to run Zookeeper nodes embedded with the client nodes.
> I.e., each client runs also a ZK node. So, network partition will
> disconnect a ZK node and not only the client.
> My concern is about the following statement from the ZK documentation:
>
> "Timeliness: The clients view of the system is guaranteed to be up-to-date
> within a certain time bound. (*On the order of tens of seconds.*) Either
> system changes will be seen by a client within this bound, or the client
> will detect a service outage."
>

This is related to the fact that a ZooKeeper server handles reads from its
local state - without communicating with the other ZooKeeper servers.
This design ensures scalability for read-dominated workloads.
In this approach a client might receive data which is not up to date (it
might not yet contain updates committed by the ZooKeeper quorum).
The 'syncLimit' parameter bounds how long a ZooKeeper server may remain out
of sync with the leader's (global) state.
A client read operation will therefore retrieve data from a state no staler
than 'syncLimit' allows.

However, a ZooKeeper client can always force a read to be up to date.
It needs to issue a 'sync' command to its ZooKeeper server before issuing
the 'read'.
On 'sync', the ZooKeeper server synchronizes its local state with the global
state, so the later 'read' is handled from the updated state.
The client should be careful here, so that it communicates with the same
ZooKeeper server for both the 'sync' and the 'read'.
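
A sketch of the 'sync' + 'read' pattern (connect string and path are
illustrative; as noted elsewhere in this thread, the state may still change
between the sync and the read):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class SyncReadSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, e -> { });
        String path = "/election/leader-lock"; // illustrative path

        // sync() is asynchronous-only: block until this server has caught
        // up with the quorum before reading.
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();

        // This read now reflects at least everything committed before the
        // sync was acknowledged.
        byte[] data = zk.getData(path, false, null);
        System.out.println("leader data: " + new String(data));
        zk.close();
    }
}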


> What are these "*tens of seconds*"? Can we reduce this time by configuring
> "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> guarantee on this time bound?
>

As described above - you might use 'sync'+'read' to avoid this problem.


>
>
> On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> jordan@jordanzimmerman.com>
> wrote:
>
> > > Old service leader will detect network partition max 15 seconds after
> it
> > > happened.
> >
> > If the old service leader is in a very long GC it will not detect the
> > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > leaders for a short period of time.
> >
> > -JZ
>

Re: Leader election

Posted by Michael Borokhovich <mi...@gmail.com>.
Thanks a lot for sharing the design, Ted. It is very helpful. Will check
what is applicable to our case and let you know in case of questions.

On Mon, Dec 10, 2018 at 23:37 Ted Dunning <te...@gmail.com> wrote:

> One very useful way to deal with this is the method used in MapR FS. The
> idea is that ZK should only be used rarely and short periods of two leaders
> must be tolerated, but other data has to be written with absolute
> consistency.
>
> The method that we chose was to associate an epoch number with every write,
> require all writes to to all replicas and require that all replicas only
> acknowledge writes with their idea of the current epoch for an object.
>
> What happens in the even of partition is that we have a few possible cases,
> but in any case where data replicas are split by a partition, writes will
> fail triggering a new leader election. Only replicas on the side of the new
> ZK quorum (which may be the old quorum) have a chance of succeeding here.
> If the replicas are split away from the ZK quorum, writes may not be
> possible until the partition heals. If a new leader is elected, it will
> increment the epoch and form a replication chain out of the replicas it can
> find telling them about the new epoch. Writes can then proceed. During
> partition healing, any pending writes from the old epoch will be ignored by
> the current replicas. None of the writes to the new epoch will be directed
> to the old replicas after partition healing, but such writes should be
> ignored as well.
>
> In a side process, replicas that have come back after a partition may be
> updated with writes from the new replicas. If the partition lasts long
> enough, a new replica should be formed from the members of the current
> epoch. If a new replica is formed and an old one is resurrected, then the
> old one should probably be deprecated, although data balancing
> considerations may come into play.
>
> In the actual implementation of MapR FS, there is a lot of sophistication
> that does into the details, of course, and there is actually one more level
> of delegation that happens, but this outline is good enough for a lot of
> systems.
>
> The virtues of this system are multiple:
>
> 1) partition is detected exactly as soon as it affects a write. Detecting a
> partition sooner than that doesn't serve a lot of purpose, especially since
> the time to recover from a failed write is comparable to the duration of a
> fair number of partitions.
>
> 2) having an old master continue under false pretenses does no harm since
> it cannot write to a more recent replica chain. This is more important than
> it might seem since there can be situations where clocks don't necessarily
> advance at the expected rate so what seems like a short time can actually
> be much longer (Rip van Winkle failures).
>
> 3) forcing writes to all live replicas while allowing reorganization is
> actually very fast and as long as we can retain one live replica we can
> continue writing. This is in contrast to quorum systems where dropping
> below the quorum stops writes. This is important because the replicas of
> different objects can be arranged so that the portion of the cluster with a
> ZK quorum might not have a majority of replicas for some objects.
>
> 4) electing a new master of a replica chain can be done quite quickly so
> the duration of any degradation can be quite short (because you can set
> write timeouts fairly short because an unnecessary election takes less time
> than a long timeout)
>
> Anyway, you probably already have a design in mind. If this helps anyway,
> that's great.
>
> On Mon, Dec 10, 2018 at 10:32 PM Michael Borokhovich <michaelbor@gmail.com
> >
> wrote:
>
> > Makes sense. Thanks, Ted. We will design our system to cope with the
> short
> > periods where we might have two leaders.
> >
> > On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning <te...@gmail.com>
> wrote:
> >
> > > ZK is able to guarantee that there is only one leader for the purposes
> of
> > > updating ZK data. That is because all commits have to originate with
> the
> > > current quorum leader and then be acknowledged by a quorum of the
> current
> > > cluster. IF the leader can't get enough acks, then it has de facto lost
> > > leadership.
> > >
> > > The problem comes when there is another system that depends on ZK data.
> > > Such data might record which node is the leader for some other
> purposes.
> > > That leader will only assume that they have become leader if they
> succeed
> > > in writing to ZK, but if there is a partition, then the old leader may
> > not
> > > be notified that another leader has established themselves until some
> > time
> > > after it has happened. Of course, if the erstwhile leader tried to
> > validate
> > > its position with a write to ZK, that write would fail, but you can't
> > spend
> > > 100% of your time doing that.
> > >
> > > it all comes down to the fact that a ZK client determines that it is
> > > connected to a ZK cluster member by pinging and that cluster member
> sees
> > > heartbeats from the leader. The fact is, though, that you can't tune
> > these
> > > pings to be faster than some level because you start to see lots of
> false
> > > positives for loss of connection. Remember that it isn't just loss of
> > > connection here that is the point. Any kind of delay would have the
> same
> > > effect. Getting these ping intervals below one second makes for a very
> > > twitchy system.
> > >
> >
>

Re: Leader election

Posted by Ted Dunning <te...@gmail.com>.
One very useful way to deal with this is the method used in MapR FS. The
idea is that ZK should only be used rarely and short periods of two leaders
must be tolerated, but other data has to be written with absolute
consistency.

The method that we chose was to associate an epoch number with every write,
require all writes to go to all replicas, and require that all replicas only
acknowledge writes that match their idea of the current epoch for an object.
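
A toy sketch of that epoch check on the replica side (all names here are
hypothetical; the real MapR FS implementation is far more involved):

// Toy replica that accepts a write only if it carries the replica's
// current idea of the epoch for the object.
public class EpochReplicaSketch {
    static final class Replica {
        private long currentEpoch;
        Replica(long epoch) { this.currentEpoch = epoch; }

        // Called by a newly elected leader after it increments the epoch.
        synchronized void adoptEpoch(long epoch) {
            if (epoch > currentEpoch) currentEpoch = epoch;
        }

        // A write stamped with an old epoch (a deposed leader) is refused.
        synchronized boolean acknowledgeWrite(long writeEpoch, byte[] data) {
            if (writeEpoch != currentEpoch) return false;
            // ... apply the write locally ...
            return true;
        }
    }

    public static void main(String[] args) {
        Replica r = new Replica(7);
        System.out.println(r.acknowledgeWrite(7, new byte[0])); // true
        r.adoptEpoch(8);                                        // new leader
        System.out.println(r.acknowledgeWrite(7, new byte[0])); // false
    }
}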

What happens in the event of a partition is that we have a few possible
cases, but in any case where data replicas are split by a partition, writes
will fail, triggering a new leader election. Only replicas on the side of the
new ZK quorum (which may be the old quorum) have a chance of succeeding here.
If the replicas are split away from the ZK quorum, writes may not be
possible until the partition heals. If a new leader is elected, it will
increment the epoch and form a replication chain out of the replicas it can
find, telling them about the new epoch. Writes can then proceed. During
partition healing, any pending writes from the old epoch will be ignored by
the current replicas. None of the writes in the new epoch will be directed
to the old replicas after partition healing, but such writes should be
ignored as well.

In a side process, replicas that have come back after a partition may be
updated with writes from the new replicas. If the partition lasts long
enough, a new replica should be formed from the members of the current
epoch. If a new replica is formed and an old one is resurrected, then the
old one should probably be deprecated, although data balancing
considerations may come into play.

In the actual implementation of MapR FS, there is a lot of sophistication
that goes into the details, of course, and there is actually one more level
of delegation that happens, but this outline is good enough for a lot of
systems.

The virtues of this system are multiple:

1) partition is detected exactly as soon as it affects a write. Detecting a
partition sooner than that doesn't serve a lot of purpose, especially since
the time to recover from a failed write is comparable to the duration of a
fair number of partitions.

2) having an old master continue under false pretenses does no harm since
it cannot write to a more recent replica chain. This is more important than
it might seem since there can be situations where clocks don't necessarily
advance at the expected rate so what seems like a short time can actually
be much longer (Rip van Winkle failures).

3) forcing writes to all live replicas while allowing reorganization is
actually very fast and as long as we can retain one live replica we can
continue writing. This is in contrast to quorum systems where dropping
below the quorum stops writes. This is important because the replicas of
different objects can be arranged so that the portion of the cluster with a
ZK quorum might not have a majority of replicas for some objects.

4) electing a new master of a replica chain can be done quite quickly so
the duration of any degradation can be quite short (because you can set
write timeouts fairly short because an unnecessary election takes less time
than a long timeout)

Anyway, you probably already have a design in mind. If this helps in any
way, that's great.

On Mon, Dec 10, 2018 at 10:32 PM Michael Borokhovich <mi...@gmail.com>
wrote:

> Makes sense. Thanks, Ted. We will design our system to cope with the short
> periods where we might have two leaders.
>
> On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning <te...@gmail.com> wrote:
>
> > ZK is able to guarantee that there is only one leader for the purposes of
> > updating ZK data. That is because all commits have to originate with the
> > current quorum leader and then be acknowledged by a quorum of the current
> > cluster. IF the leader can't get enough acks, then it has de facto lost
> > leadership.
> >
> > The problem comes when there is another system that depends on ZK data.
> > Such data might record which node is the leader for some other purposes.
> > That leader will only assume that they have become leader if they succeed
> > in writing to ZK, but if there is a partition, then the old leader may
> not
> > be notified that another leader has established themselves until some
> time
> > after it has happened. Of course, if the erstwhile leader tried to
> validate
> > its position with a write to ZK, that write would fail, but you can't
> spend
> > 100% of your time doing that.
> >
> > it all comes down to the fact that a ZK client determines that it is
> > connected to a ZK cluster member by pinging and that cluster member sees
> > heartbeats from the leader. The fact is, though, that you can't tune
> these
> > pings to be faster than some level because you start to see lots of false
> > positives for loss of connection. Remember that it isn't just loss of
> > connection here that is the point. Any kind of delay would have the same
> > effect. Getting these ping intervals below one second makes for a very
> > twitchy system.
> >
>

Re: Leader election

Posted by Michael Borokhovich <mi...@gmail.com>.
Makes sense. Thanks, Ted. We will design our system to cope with the short
periods where we might have two leaders.

On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning <te...@gmail.com> wrote:

> ZK is able to guarantee that there is only one leader for the purposes of
> updating ZK data. That is because all commits have to originate with the
> current quorum leader and then be acknowledged by a quorum of the current
> cluster. IF the leader can't get enough acks, then it has de facto lost
> leadership.
>
> The problem comes when there is another system that depends on ZK data.
> Such data might record which node is the leader for some other purposes.
> That leader will only assume that they have become leader if they succeed
> in writing to ZK, but if there is a partition, then the old leader may not
> be notified that another leader has established themselves until some time
> after it has happened. Of course, if the erstwhile leader tried to validate
> its position with a write to ZK, that write would fail, but you can't spend
> 100% of your time doing that.
>
> it all comes down to the fact that a ZK client determines that it is
> connected to a ZK cluster member by pinging and that cluster member sees
> heartbeats from the leader. The fact is, though, that you can't tune these
> pings to be faster than some level because you start to see lots of false
> positives for loss of connection. Remember that it isn't just loss of
> connection here that is the point. Any kind of delay would have the same
> effect. Getting these ping intervals below one second makes for a very
> twitchy system.
>
>
>
> On Fri, Dec 7, 2018 at 11:03 AM Michael Borokhovich <mi...@gmail.com>
> wrote:
>
> > We are planning to run Zookeeper nodes embedded with the client nodes.
> > I.e., each client runs also a ZK node. So, network partition will
> > disconnect a ZK node and not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> up-to-date
> > within a certain time bound. (*On the order of tens of seconds.*) Either
> > system changes will be seen by a client within this bound, or the client
> > will detect a service outage."
> >
> > What are these "*tens of seconds*"? Can we reduce this time by
> configuring
> > "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> > guarantee on this time bound?
> >
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> > jordan@jordanzimmerman.com>
> > wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds after
> > it
> > > > happened.
> > >
> > > If the old service leader is in a very long GC it will not detect the
> > > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > > leaders for a short period of time.
> > >
> > > -JZ
> >
>

Re: Leader election

Posted by Ted Dunning <te...@gmail.com>.
ZK is able to guarantee that there is only one leader for the purposes of
updating ZK data. That is because all commits have to originate with the
current quorum leader and then be acknowledged by a quorum of the current
cluster. If the leader can't get enough acks, then it has de facto lost
leadership.

The problem comes when there is another system that depends on ZK data.
Such data might record which node is the leader for some other purpose.
That leader will only assume that it has become leader if it succeeds
in writing to ZK, but if there is a partition, then the old leader may not
be notified that another leader has established itself until some time
after it has happened. Of course, if the erstwhile leader tried to validate
its position with a write to ZK, that write would fail, but you can't spend
100% of your time doing that.
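
One hedched-but-concrete sketch of such a validation write, using a
versioned setData as the check (the path and the idea of keeping the leader
record in a znode are assumptions for illustration, not something ZK
prescribes):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ValidateLeadershipSketch {
    // Conditional heartbeat write: succeeds only if nobody has replaced or
    // recreated the leader znode since we last touched it. Returns the new
    // version to pass into the next validation, or null if we are deposed.
    static Integer validate(ZooKeeper zk, String path, int lastKnownVersion) {
        try {
            Stat stat = zk.setData(path, "heartbeat".getBytes(), lastKnownVersion);
            return stat.getVersion();
        } catch (KeeperException.BadVersionException | KeeperException.NoNodeException e) {
            return null; // another leader took over: step down
        } catch (KeeperException | InterruptedException e) {
            return null; // can't reach the quorum: assume the worst
        }
    }
}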

It all comes down to the fact that a ZK client determines that it is
connected to a ZK cluster member by pinging and that cluster member sees
heartbeats from the leader. The fact is, though, that you can't tune these
pings to be faster than some level because you start to see lots of false
positives for loss of connection. Remember that it isn't just loss of
connection here that is the point. Any kind of delay would have the same
effect. Getting these ping intervals below one second makes for a very
twitchy system.



On Fri, Dec 7, 2018 at 11:03 AM Michael Borokhovich <mi...@gmail.com>
wrote:

> We are planning to run Zookeeper nodes embedded with the client nodes.
> I.e., each client runs also a ZK node. So, network partition will
> disconnect a ZK node and not only the client.
> My concern is about the following statement from the ZK documentation:
>
> "Timeliness: The clients view of the system is guaranteed to be up-to-date
> within a certain time bound. (*On the order of tens of seconds.*) Either
> system changes will be seen by a client within this bound, or the client
> will detect a service outage."
>
> What are these "*tens of seconds*"? Can we reduce this time by configuring
> "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> guarantee on this time bound?
>
>
> On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> jordan@jordanzimmerman.com>
> wrote:
>
> > > Old service leader will detect network partition max 15 seconds after
> it
> > > happened.
> >
> > If the old service leader is in a very long GC it will not detect the
> > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > leaders for a short period of time.
> >
> > -JZ
>

Re: Leader election

Posted by Michael Borokhovich <mi...@gmail.com>.
We are planning to run the ZooKeeper nodes embedded with the client nodes.
I.e., each client also runs a ZK node. So, a network partition will
disconnect a ZK node and not only the client.
My concern is about the following statement from the ZK documentation:

"Timeliness: The clients view of the system is guaranteed to be up-to-date
within a certain time bound. (*On the order of tens of seconds.*) Either
system changes will be seen by a client within this bound, or the client
will detect a service outage."

What are these "*tens of seconds*"? Can we reduce this time by configuring
"syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
guarantee on this time bound?
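
(For reference, the knob a client controls directly is the session timeout
it requests, which the server clamps to [2 * tickTime, 20 * tickTime] by
default. A sketch, connect string illustrative:

import org.apache.zookeeper.ZooKeeper;

public class SessionTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Request a 5 s session; the server clamps the value to the range
        // [minSessionTimeout, maxSessionTimeout], which defaults to
        // [2 * tickTime, 20 * tickTime].
        ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181",
                5000, event -> { });
        // The negotiated value is what actually bounds failure detection.
        System.out.println("negotiated: " + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}
)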


On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> > Old service leader will detect network partition max 15 seconds after it
> > happened.
>
> If the old service leader is in a very long GC it will not detect the
> partition. In the face of VM pauses, etc. it's not possible to avoid 2
> leaders for a short period of time.
>
> -JZ

Re: Leader election

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
> Old service leader will detect network partition max 15 seconds after it
> happened.

If the old service leader is in a very long GC it will not detect the partition. In the face of VM pauses, etc., it's not possible to avoid 2 leaders for a short period of time.

-JZ

Re: Leader election

Posted by Maciej Smoleński <je...@gmail.com>.
Hello,

Ensuring reliability requires using consensus directly in your service, or
changing the service to use a distributed log/journal (e.g. BookKeeper).

However, the following idea is simple and in many situations good enough.
If you configure the session timeout to 15 seconds, then a partitioned
ZooKeeper client will be disconnected after at most 15 seconds.
The old service leader will thus detect a network partition at most 15
seconds after it happened.
The new service leader should stay idle for the initial 15+ seconds (let's
say 30 seconds), as sketched below.
In this way you avoid the situation with 2 concurrently working leaders.
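
A sketch of that start-up delay (the handler name is hypothetical; the 15 s
session timeout is assumed as above):

// Hypothetical hook invoked when this instance wins the election.
// With a 15 s session timeout, idling for ~2x that before doing real work
// gives a deposed leader time to notice that its session is gone.
public class DelayedTakeoverSketch {
    static final long SESSION_TIMEOUT_MS = 15_000;

    static void onElectedLeader() throws InterruptedException {
        Thread.sleep(2 * SESSION_TIMEOUT_MS); // stay idle for 30 s
        startActiveWork();
    }

    static void startActiveWork() {
        System.out.println("taking over as active instance");
    }

    public static void main(String[] args) throws InterruptedException {
        onElectedLeader();
    }
}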

The described solution has time dependencies and in some situations leads to
an incorrect state, e.g.: high load on a machine might cause the ZooKeeper
client to detect the disconnection only after 60 seconds (instead of the
expected 15 seconds). In such a situation there will be 2 concurrent
leaders.

Maciej




On Thu, Dec 6, 2018 at 8:09 PM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> > it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself
>
> Yes - there are many ways in which you can end up with 2 leaders. However,
> if properly tuned and configured, it will be for a few seconds at most.
> During a GC pause no work is being done anyway. But, this stuff is very
> tricky. Requiring an atomically unique leader is actually a design smell
> and you should reconsider your architecture.
>
> > Maybe we can use a more
> > lightweight Hazelcast for example?
>
> There is no distributed system that can guarantee a single leader. Instead
> you need to adjust your design and algorithms to deal with this (using
> optimistic locking, etc.).
>
> -Jordan
>
> > On Dec 6, 2018, at 1:52 PM, Michael Borokhovich <mi...@gmail.com>
> wrote:
> >
> > Thanks Jordan,
> >
> > Yes, I will try Curator.
> > Also, beyond the problem described in the Tech Note, it seems like the
> > inconsistency may be caused by the partition of the Zookeeper cluster
> > itself. E.g., if a "leader" client is connected to the partitioned ZK
> node,
> > it may be notified not at the same time as the other clients connected to
> > the other ZK nodes. So, another client may take leadership while the
> > current leader still unaware of the change. Is it true?
> >
> > Another follow up question. If Zookeeper can guarantee a single leader,
> is
> > it worth using it just for leader election? Maybe we can use a more
> > lightweight Hazelcast for example?
> >
> > Michael.
> >
> >
> > On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <
> jordan@jordanzimmerman.com>
> > wrote:
> >
> >> It is not possible to achieve the level of consistency you're after in
> an
> >> eventually consistent system such as ZooKeeper. There will always be an
> >> edge case where two ZooKeeper clients will believe they are leaders
> (though
> >> for a short period of time). In terms of how it affects Apache Curator,
> we
> >> have this Tech Note on the subject:
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
> >> description is true for any ZooKeeper client, not just Curator
> clients). If
> >> you do still intend to use a ZooKeeper lock/leader I suggest you try
> Apache
> >> Curator as writing these "recipes" is not trivial and have many gotchas
> >> that aren't obvious.
> >>
> >> -Jordan
> >>
> >> http://curator.apache.org <http://curator.apache.org/>
> >>
> >>
> >>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich <mi...@gmail.com>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> We have a service that runs on 3 hosts for high availability. However,
> at
> >>> any given time, exactly one instance must be active. So, we are
> thinking
> >> to
> >>> use Leader election using Zookeeper.
> >>> To this goal, on each service host we also start a ZK server, so we
> have
> >> a
> >>> 3-nodes ZK cluster and each service instance is a client to its
> dedicated
> >>> ZK server.
> >>> Then, we implement a leader election on top of Zookeeper using a basic
> >>> recipe:
> >>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection
> .
> >>>
> >>> I have the following questions and doubts regarding the approach:
> >>>
> >>> 1. It seems like we can run into inconsistency issues when network
> >>> partition occurs. Zookeeper documentation says that the inconsistency
> >>> period may last “tens of seconds”. Am I understanding correctly that
> >> during
> >>> this time we may have 0 or 2 leaders?
> >>> 2. Is it possible to reduce this inconsistency time (let's say to 3
> >>> seconds) by tweaking tickTime and syncLimit parameters?
> >>> 3. Is there a way to guarantee exactly one leader all the time? Should
> we
> >>> implement a more complex leader election algorithm than the one
> suggested
> >>> in the recipe (using ephemeral_sequential nodes)?
> >>>
> >>> Thanks,
> >>> Michael.
> >>
> >>
>
>

Re: Leader election

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
> it seems like the
> inconsistency may be caused by the partition of the Zookeeper cluster
> itself

Yes - there are many ways in which you can end up with 2 leaders. However, if properly tuned and configured, the overlap will last a few seconds at most. During a GC pause no work is being done anyway. But this stuff is very tricky. Requiring an atomically unique leader is actually a design smell and you should reconsider your architecture.

> Maybe we can use a more
> lightweight Hazelcast for example?

There is no distributed system that can guarantee a single leader. Instead you need to adjust your design and algorithms to deal with this (using optimistic locking, etc.).
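
For example, optimistic locking with ZooKeeper's versioned writes looks
roughly like this (the path and the update logic are illustrative):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OptimisticLockSketch {
    // Read-modify-write that succeeds only if nobody wrote in between; a
    // stale leader's update loses the race instead of corrupting state.
    static void update(ZooKeeper zk, String path) throws Exception {
        while (true) {
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);
            byte[] next = transform(current); // illustrative update logic
            try {
                zk.setData(path, next, stat.getVersion());
                return; // our write won
            } catch (KeeperException.BadVersionException e) {
                // Someone else (possibly another "leader") got there first:
                // loop to re-read and retry.
            }
        }
    }

    static byte[] transform(byte[] in) { return in; }
}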

-Jordan

> On Dec 6, 2018, at 1:52 PM, Michael Borokhovich <mi...@gmail.com> wrote:
> 
> Thanks Jordan,
> 
> Yes, I will try Curator.
> Also, beyond the problem described in the Tech Note, it seems like the
> inconsistency may be caused by the partition of the Zookeeper cluster
> itself. E.g., if a "leader" client is connected to the partitioned ZK node,
> it may be notified not at the same time as the other clients connected to
> the other ZK nodes. So, another client may take leadership while the
> current leader still unaware of the change. Is it true?
> 
> Another follow up question. If Zookeeper can guarantee a single leader, is
> it worth using it just for leader election? Maybe we can use a more
> lightweight Hazelcast for example?
> 
> Michael.
> 
> 
> On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <jo...@jordanzimmerman.com>
> wrote:
> 
>> It is not possible to achieve the level of consistency you're after in an
>> eventually consistent system such as ZooKeeper. There will always be an
>> edge case where two ZooKeeper clients will believe they are leaders (though
>> for a short period of time). In terms of how it affects Apache Curator, we
>> have this Tech Note on the subject:
>> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
>> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
>> description is true for any ZooKeeper client, not just Curator clients). If
>> you do still intend to use a ZooKeeper lock/leader I suggest you try Apache
>> Curator as writing these "recipes" is not trivial and have many gotchas
>> that aren't obvious.
>> 
>> -Jordan
>> 
>> http://curator.apache.org <http://curator.apache.org/>
>> 
>> 
>>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich <mi...@gmail.com>
>> wrote:
>>> 
>>> Hello,
>>> 
>>> We have a service that runs on 3 hosts for high availability. However, at
>>> any given time, exactly one instance must be active. So, we are thinking
>> to
>>> use Leader election using Zookeeper.
>>> To this goal, on each service host we also start a ZK server, so we have
>> a
>>> 3-nodes ZK cluster and each service instance is a client to its dedicated
>>> ZK server.
>>> Then, we implement a leader election on top of Zookeeper using a basic
>>> recipe:
>>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
>>> 
>>> I have the following questions and doubts regarding the approach:
>>> 
>>> 1. It seems like we can run into inconsistency issues when network
>>> partition occurs. Zookeeper documentation says that the inconsistency
>>> period may last “tens of seconds”. Am I understanding correctly that
>> during
>>> this time we may have 0 or 2 leaders?
>>> 2. Is it possible to reduce this inconsistency time (let's say to 3
>>> seconds) by tweaking tickTime and syncLimit parameters?
>>> 3. Is there a way to guarantee exactly one leader all the time? Should we
>>> implement a more complex leader election algorithm than the one suggested
>>> in the recipe (using ephemeral_sequential nodes)?
>>> 
>>> Thanks,
>>> Michael.
>> 
>> 


Re: Leader election

Posted by Michael Borokhovich <mi...@gmail.com>.
Thanks Jordan,

Yes, I will try Curator.
Also, beyond the problem described in the Tech Note, it seems like the
inconsistency may be caused by a partition of the ZooKeeper cluster
itself. E.g., if a "leader" client is connected to the partitioned ZK node,
it may not be notified at the same time as the other clients connected to
the other ZK nodes. So, another client may take leadership while the
current leader is still unaware of the change. Is that true?

Another follow-up question: if ZooKeeper can guarantee a single leader, is
it worth using it just for leader election? Maybe we can use something more
lightweight, Hazelcast for example?

Michael.


On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <jo...@jordanzimmerman.com>
wrote:

> It is not possible to achieve the level of consistency you're after in an
> eventually consistent system such as ZooKeeper. There will always be an
> edge case where two ZooKeeper clients will believe they are leaders (though
> for a short period of time). In terms of how it affects Apache Curator, we
> have this Tech Note on the subject:
> https://cwiki.apache.org/confluence/display/CURATOR/TN10 <
> https://cwiki.apache.org/confluence/display/CURATOR/TN10> (the
> description is true for any ZooKeeper client, not just Curator clients). If
> you do still intend to use a ZooKeeper lock/leader I suggest you try Apache
> Curator as writing these "recipes" is not trivial and have many gotchas
> that aren't obvious.
>
> -Jordan
>
> http://curator.apache.org <http://curator.apache.org/>
>
>
> > On Dec 5, 2018, at 6:20 PM, Michael Borokhovich <mi...@gmail.com>
> wrote:
> >
> > Hello,
> >
> > We have a service that runs on 3 hosts for high availability. However, at
> > any given time, exactly one instance must be active. So, we are thinking
> to
> > use Leader election using Zookeeper.
> > To this goal, on each service host we also start a ZK server, so we have
> a
> > 3-nodes ZK cluster and each service instance is a client to its dedicated
> > ZK server.
> > Then, we implement a leader election on top of Zookeeper using a basic
> > recipe:
> > https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> >
> > I have the following questions and doubts regarding the approach:
> >
> > 1. It seems like we can run into inconsistency issues when network
> > partition occurs. Zookeeper documentation says that the inconsistency
> > period may last “tens of seconds”. Am I understanding correctly that
> during
> > this time we may have 0 or 2 leaders?
> > 2. Is it possible to reduce this inconsistency time (let's say to 3
> > seconds) by tweaking tickTime and syncLimit parameters?
> > 3. Is there a way to guarantee exactly one leader all the time? Should we
> > implement a more complex leader election algorithm than the one suggested
> > in the recipe (using ephemeral_sequential nodes)?
> >
> > Thanks,
> > Michael.
>
>

Re: Leader election

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
It is not possible to achieve the level of consistency you're after in an
eventually consistent system such as ZooKeeper. There will always be an edge
case where two ZooKeeper clients believe they are leaders (though only for a
short period of time). In terms of how it affects Apache Curator, we have
this Tech Note on the subject:
https://cwiki.apache.org/confluence/display/CURATOR/TN10 (the description is
true for any ZooKeeper client, not just Curator clients). If you do still
intend to use a ZooKeeper lock/leader, I suggest you try Apache Curator, as
writing these "recipes" is not trivial and they have many gotchas that
aren't obvious.

-Jordan

http://curator.apache.org
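
A minimal Curator leader-election sketch (connect string illustrative):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "host1:2181,host2:2181,host3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/election",
                new LeaderSelectorListenerAdapter() {
                    @Override
                    public void takeLeadership(CuratorFramework c) throws Exception {
                        // Leadership is held only while this method runs;
                        // returning (or throwing) relinquishes it.
                        System.out.println("acting as leader");
                        Thread.sleep(Long.MAX_VALUE);
                    }
                });
        selector.autoRequeue(); // re-enter the election after losing leadership
        selector.start();
    }
}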


> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich <mi...@gmail.com> wrote:
> 
> Hello,
> 
> We have a service that runs on 3 hosts for high availability. However, at
> any given time, exactly one instance must be active. So, we are thinking to
> use Leader election using Zookeeper.
> To this goal, on each service host we also start a ZK server, so we have a
> 3-nodes ZK cluster and each service instance is a client to its dedicated
> ZK server.
> Then, we implement a leader election on top of Zookeeper using a basic
> recipe:
> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
> 
> I have the following questions and doubts regarding the approach:
> 
> 1. It seems like we can run into inconsistency issues when network
> partition occurs. Zookeeper documentation says that the inconsistency
> period may last “tens of seconds”. Am I understanding correctly that during
> this time we may have 0 or 2 leaders?
> 2. Is it possible to reduce this inconsistency time (let's say to 3
> seconds) by tweaking tickTime and syncLimit parameters?
> 3. Is there a way to guarantee exactly one leader all the time? Should we
> implement a more complex leader election algorithm than the one suggested
> in the recipe (using ephemeral_sequential nodes)?
> 
> Thanks,
> Michael.