Posted to issues@zookeeper.apache.org by "Jon Marius Venstad (Jira)" <ji...@apache.org> on 2022/09/23 13:46:00 UTC

[jira] [Commented] (ZOOKEEPER-4541) Ephemeral znode owned by closed session visible in 1 of 3 servers

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608752#comment-17608752 ] 

Jon Marius Venstad commented on ZOOKEEPER-4541:
-----------------------------------------------

After some thorough investigation, we have a good idea of what has happened. There are two faults in the ZK codebase, both of which we attempt to fix in [https://github.com/apache/zookeeper/pull/1925]. The details of our findings are as follows:

 

First, some background: we do rolling restarts of embedded ZK clusters of size 3, roughly 50 times a day, across many data centres. 

A few times a week, we find a ZK server with a different view of the world from that of the other servers in the cluster.

As far as we could ascertain, this always happens as a result of the leader restarting; that is, the enclosing JVM shuts down cleanly (we call {{QuorumPeer.shutdown()}}) and starts up again a few minutes later.

Looking at the data directories of the servers after a recent episode, we found two interesting things, of which the second is the critical one:

Transaction logs had segments of transactions that were repeated, both two and three times. This was found on all servers, but it seems to be harmless, despite the ERROR logging it produces when the data directory is loaded on ZK server start. It causes a digest mismatch, which isn't acted upon, and it becomes invisible once a new snapshot is taken. It did mask the real error, though: we were looking for digest mismatches, and the digest had already failed prior to the point of the second finding:

_The transaction log_ on the server that did _not_ agree with the others (in particular, it typically insists there are ephemeral nodes present, but for sessions which died a long time ago!) was _missing entries at the end of one of the log files!_ The transaction log in question was the active one at the end of the epoch of the leader which restarted.

We've also noticed this error is more prevalent on machines with slower disks, where we sometimes see complaints from ZK about slow fsyncing. 

 

Putting this all together, we finally discovered that the following happens:

 

ZK3 is leader, ZK1 and ZK2 are followers. X < Z are transaction IDs. 

ZK1 is busy fsyncing transactions <= X.

ZK3 proposes Z, and ZK1 and ZK2 receive the proposal.

ZK1 enqueues the proposal, but does not ACK it. ZK2 ACKs the proposal. 

ZK3 observes quorum for Z, and sends a COMMIT to ZK1 and ZK2. 

ZK1 applies Z to its data tree, but it is still not processed by the SyncRequestProcessor, which is busy fsyncing. 

ZK3 shuts down, closing the socket to ZK1 and ZK2.

ZK1 stops following the dead leader, and shuts down its {{Follower}}, to prepare for a new leader election and epoch.

ZK1 joins a new leader election, using Z as its {{lastSeenZxid}}, because that was COMMITTed, and applied to its data tree.

ZK1 completes fsync up to X, and then proceeds to ACK those transactions, but the socket is gone, and the sync thread dies from an unexpected NPE. The remaining transaction Z is not written to disk. 

ZK1 and ZK2 agree that Z is the newest transaction, ZK2 becomes the new leader, and ZK3 eventually rejoins as a follower. 

ZK1's data tree is correct; new transactions come in and are written to the transaction log of the new epoch.

ZK1 restarts. When it comes back up, Z is missing from its transaction log, and its data is inconsistent with that of the other servers. It seems we never hit SNAP syncs unless we intervene, so it stays this way indefinitely.
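
To make the ordering concrete, here is a minimal, self-contained toy model of the race (hypothetical names, not ZooKeeper code): a "sync thread" persists queued transactions slowly and then ACKs over a socket, while the COMMIT is applied to the data tree regardless, and the socket disappears underneath the sync thread.

{code:java}
// Toy model only: illustrates the race described above, not ZooKeeper internals.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class LostTxnRace {

    static final List<Long> dataTree = new CopyOnWriteArrayList<>(); // updated on COMMIT
    static final List<Long> txnLog = new CopyOnWriteArrayList<>();   // written by the "sync thread"
    static final LinkedBlockingQueue<Long> pendingFsync = new LinkedBlockingQueue<>();
    static volatile Object socketToLeader = new Object();            // becomes null when the leader goes away

    public static void main(String[] args) throws Exception {
        Thread syncThread = new Thread(() -> {
            try {
                while (true) {
                    long zxid = pendingFsync.take();
                    Thread.sleep(200);            // simulate a slow fsync
                    txnLog.add(zxid);             // transaction is now "on disk"
                    socketToLeader.toString();    // "ACK" to the leader: NPEs once the socket is gone
                }
            } catch (InterruptedException | NullPointerException e) {
                // The sync thread dies here; whatever is still queued (Z) is never written.
            }
        });
        syncThread.start();

        long x = 1, z = 2;
        pendingFsync.put(x);      // proposal X, queued for fsync
        pendingFsync.put(z);      // proposal Z, queued for fsync
        dataTree.add(x);
        dataTree.add(z);          // COMMIT for Z: applied to the data tree before it is persisted

        socketToLeader = null;    // the leader shuts down and the socket goes away
        syncThread.join();

        System.out.println("data tree: " + dataTree + ", txn log: " + txnLog);
        // Prints a transaction log that is missing Z, which is exactly the state we found.
    }
}
{code}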

 

So, expecting the socket of the learner to possibly be {{null}} in {{SendAckRequestProcessor}} is enough to fix this problem in most cases, because that seems to be the common reason for the sync thread to die; had it lived on, it would eventually have synced Z as well. Nevertheless, there would be a thread leak, and there could be other failure scenarios that I'm not aware of, both now and certainly later.
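
The guard itself is conceptually simple. A minimal sketch of the idea follows; the names {{learnerSocket()}}, {{sendAckToLeader()}} and {{LOG}} are illustrative stand-ins rather than the real fields, so see the PR for the actual change:

{code:java}
// Sketch only: protect the ACK path against the learner's socket being gone after shutdown,
// so a NullPointerException cannot kill the sync thread.
public void processRequest(Request request) {
    if (learnerSocket() == null || learnerSocket().isClosed()) {
        // The connection to the leader is already torn down; drop the ACK instead of throwing.
        LOG.debug("Not sending ACK for {}; connection to leader is closed", request);
        return;
    }
    try {
        sendAckToLeader(request);   // hypothetical helper for the actual packet write
    } catch (IOException e) {
        LOG.warn("Exception while sending ACK to leader", e);
    }
}
{code}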

The attempted fix therefore also refactors the shutdown for child classes of {{ZooKeeperServer}}, such that they all override the {{shutdown(boolean)}} method, which is ultimately the one that is called. I also moved the shutdown (and sync) of the {{SyncRequestProcessor}} to before the shutdown of the parent {{ZooKeeperServer}}, because of the {{fastForwardFromEdits()}} that's called in the latter in an attempt to be up to date when the leader election starts; obviously, this makes more sense if the pending transactions are actually written first.

I believe the missing synchronisation with the persistence thread is also what caused the duplicate series of transactions in the transaction logs: in-flight transactions (not yet committed) could be written to the log by the {{SyncRequestProcessor}} in parallel with a leader election using an older Zxid than the tail of the log, and the resulting DIFF from a more up-to-date leader would then _also_ contain those same entries that were just written. This wouldn't even be noticeable, because the data tree of the new follower wouldn't have the duplicate entries in it, and wouldn't complain when they were applied; the transaction log itself does not detect this either. It would only be detectable upon database recreation.
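
Roughly, the intended shutdown ordering looks like the sketch below. It is simplified; {{syncProcessor}} here is a stand-in for the actual field in the learner's server class, and the real refactoring is in the PR:

{code:java}
// Sketch of the intended shutdown ordering for a learner's ZooKeeperServer subclass.
@Override
public synchronized void shutdown(boolean fullyShutDown) {
    // 1. Flush and stop the SyncRequestProcessor first, so every transaction that was
    //    already COMMITted and applied to the data tree is also on disk.
    if (syncProcessor != null) {
        syncProcessor.shutdown();
    }
    // 2. Only then let the parent shut down; its fastForwardFromEdits() now reads a
    //    transaction log that contains everything the data tree has seen, so the zxid
    //    taken into leader election matches what is actually persisted.
    super.shutdown(fullyShutDown);
}
{code}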

 

 

Anyway, long story. I hope you strongly consider this as a candidate to make the next release. We'll be running a patched ZK in the meantime. But I'd be surprised if we're the only ones seeing this. All it takes is a few servers, some automation, and some slow disks :) 

 

> Ephemeral znode owned by closed session visible in 1 of 3 servers
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4541
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4541
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>            Reporter: Håkon Hallingstad
>            Priority: Major
>
> We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the following znode on server 1:
> {code:java}
> stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> cZxid = 0xa240000381f
> ctime = Tue May 10 17:17:27 UTC 2022
> mZxid = 0xa240000381f
> mtime = Tue May 10 17:17:27 UTC 2022
> pZxid = 0xa240000381f
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x20006d5d6a10000
> dataLength = 14
> numChildren = 0
> {code}
> This znode was absent on servers 2 and 3. A delete of the node failed everywhere.
> {code:java}
> delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}
> This makes sense, as a mutable operation goes to the leader, which was server 3, where the node is absent. Restarting server 1 did not fix the issue. Stopping server 1, removing the ZooKeeper database and version-2 directory, and starting the server fixed the issue.
> Session 0x20006d5d6a10000 was created by a ZooKeeper client and initially connected to server 2. The client later closed the session at around the same time as the then-leader server 2 was stopped:
> {code:java}
> 2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
> 2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
> {code}
> That both were closed/shut down at the same time is something we do on our side. Perhaps there is a race in the handling of the closing of sessions and the shutdown of the leader?
> We did a {{dump}} of servers 1-3; server 1 had two sessions that did not exist on servers 2 and 3, and there was one ephemeral node reported on server 1 that was not reported on servers 2 and 3. The ephemeral owner matched one of the two extraneous sessions. The dump from server 1 with the extra entries:
> {code:java}
> Global Sessions(8):
> ...
> 0x20006d5d6a10000 120000ms
> 0x2004360ecca0000 120000ms
> ...
> Sessions with Ephemerals (4):
> 0x20006d5d6a10000:
> /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}


