Posted to user@zookeeper.apache.org by José Armando García Sancio <js...@confluent.io.INVALID> on 2022/07/28 14:35:45 UTC

Apache ZooKeeper Consistency with Majority Failure

Hi all,

I am currently working on a consensus protocol for Apache Kafka, and I
am trying to understand Apache ZooKeeper's behavior when a majority of
the quorum loses its transaction and snapshot directory ("version-2").
I ran the following experiment with apache-zookeeper-3.8.0:

A) On disk I created the following 3 configurations:
$ cat zk1/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/jsancio/work/apache-zookeeper-3.8.0-bin/zk1/data
clientPort=2181
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

$ cat zk1/data/myid
1

$ cat zk2/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/jsancio/work/apache-zookeeper-3.8.0-bin/zk2/data
clientPort=2182
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

$ cat zk2/data/myid
2

$ cat zk3/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/jsancio/work/apache-zookeeper-3.8.0-bin/zk3/data
clientPort=2183
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

$ cat zk3/data/myid
3
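
For reference, an instance is started by pointing the stock zkServer.sh
at its per-instance conf directory, roughly like this (a sketch;
ZOOCFGDIR is the environment variable zkServer.sh uses to locate
zoo.cfg, though your launcher may differ):

$ # start instance 1; instances 2 and 3 are analogous
$ ZOOCFGDIR=$PWD/zk1/conf bin/zkServer.sh start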

B) I started a majority of the nodes (1 and 2). The ensemble was
established and I was able to create a znode using the CLI.
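
For concreteness, the creation looked roughly like this (a sketch; the
znode name /test and its data are arbitrary examples):

$ bin/zkCli.sh -server localhost:2181 create /test "hello"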

C) I shut down all of the running nodes (1 and 2; I never started node
3). To simulate a disk failure, I deleted the contents of node 2's
transaction and snapshot directory (version-2).
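
Concretely, something like the following, run while all nodes were
stopped (a sketch; note that myid lives one directory up, so node 2
keeps its server identity):

$ rm -rf zk2/data/version-2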

D) I started a majority of the nodes (2 and 3). The ensemble was
established and I was able to connect with the CLI.
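
For what it's worth, zkServer.sh status is an easy way to confirm the
ensemble formed (a sketch; it reports whether each instance is
currently a leader or a follower):

$ ZOOCFGDIR=$PWD/zk2/conf bin/zkServer.sh status
$ ZOOCFGDIR=$PWD/zk3/conf bin/zkServer.sh status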

E) I finally started node 1, which still had the committed transactions
and snapshots on disk. The znode created in step B was not present.
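
This can be checked from the CLI (a sketch; on a freshly re-synced
database I would expect only the built-in /zookeeper subtree at the
root):

$ bin/zkCli.sh -server localhost:2181 ls /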

Based on your understanding of Apache ZooKeeper, is this the expected
behavior? Or did I configure and bootstrap the cluster incorrectly?
For example, another possible outcome is that the ensemble would not
be established in step D.

Any feedback helping me understand this is greatly appreciated!
-- 
-José

Re: Apache ZooKeeper Consistency with Majority Failure

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/28/22 13:33, Shawn Heisey wrote:

> Node 1 is most likely informed that its database is now out of date 
> (or it decides that for itself) so it syncs the whole DB from the 
> current leader, which will not know about the znode created in step B.
>
> Not in any way a ZK expert.  But that seems like the most logical way 
> for it to work.
>
> I'm just guessing that there is some timestamp which declares the last 
> time a database was running with quorum and that comparing those 
> timestamps is how ZK decides that a node's database is out of date.  I 
> am curious as to whether I have deduced things incorrectly.

Further extrapolation ... I would guess that if at any point the entire
cluster goes down or runs without quorum, any node that later starts up
and joins the cluster will have its DB overwritten so that it is
identical to whatever nodes ARE running.  A question for the experts on
the list ... is that correct?

Thanks,
Shawn


Re: Apache ZooKeeper Consistency with Majority Failure

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/28/22 08:35, José Armando García Sancio wrote:
> B) I started a majority of the nodes (1 and 2). The ensemble was
> established and I was able to create a znode using the CLI.
>
> C) I shut down all of the running nodes (1 and 2; I never started node
> 3). To simulate a disk failure, I deleted the contents of node 2's
> transaction and snapshot directory (version-2).

Note that at this point only node 1 knows about the znode you created in 
step B.

> D) I started a majority of the nodes (2 and 3). The ensemble was
> established and I was able to connect with the CLI.
>
> E) I finally started node 1, which still had the committed transactions
> and snapshots on disk. The znode created in step B was not present.

Node 1 is most likely informed that its database is now out of date (or 
it decides that for itself) so it syncs the whole DB from the current 
leader, which will not know about the znode created in step B.

Not in any way a ZK expert.  But that seems like the most logical way 
for it to work.

I'm just guessing that there is some timestamp which declares the last 
time a database was running with quorum and that comparing those 
timestamps is how ZK decides that a node's database is out of date.  I 
am curious as to whether I have deduced things incorrectly.
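
For anyone who wants to poke at this, the srvr four-letter-word command
reports the last zxid each server has seen (an epoch plus a transaction
counter), which I suspect is what gets compared rather than a literal
wall-clock timestamp. A sketch, assuming nc is installed (srvr is in
the default 4lw whitelist), against the three client ports above:

$ echo srvr | nc localhost 2181
$ echo srvr | nc localhost 2182
$ echo srvr | nc localhost 2183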

Thanks,
Shawn