Posted to issues@zookeeper.apache.org by "Matteo Merli (Jira)" <ji...@apache.org> on 2020/08/28 16:32:00 UTC

[jira] [Created] (ZOOKEEPER-3923) Leader election issues with Istio

Matteo Merli created ZOOKEEPER-3923:
---------------------------------------

             Summary: Leader election issues with Istio
                 Key: ZOOKEEPER-3923
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3923
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.5.8, 3.6.1
            Reporter: Matteo Merli


We have seen repeated occasions in which restarting the ZK leader node can lead to ~2 min of quorum unavailability. This is easily reproducible on a cluster when leader election happens through a TCP proxying layer (such as Istio) in a Kubernetes environment.

This happens in more than ~80% of cases on a 3-node cluster.

After investigation, this is the sequence of events:

# zk-1 is the leader and gets restarted
# zk-2 is voted as the new leader
# zk-0 tries to connect to zk-2:2888 to sync up. The TCP connection is established, though it is closed by the remote side immediately after
# zk-2 starts listening on port 2888 but never receives any connection
# zk-0, after the read error on the connection, goes back into LOOKING state, ready for a new leader election
# zk-2 is still waiting for followers to sync up, and waits until the timeout expires (eg: 30sec), after which it goes back into LOOKING state.

This sequence might get repeated several times until one leader election round finally gets through.

h3. Logs excerpt:

ZK-2 becomes leader: 
{noformat}
2020-08-27 16:40:07.216000+00:00 [INFO ] [che.zookeeper.server.quorum.Leader]  LEADING - LEADER ELECTION TOOK - 214 MS
2020-08-27 16:40:07.218000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]  Peer state changed: leading - discovery
2020-08-27 16:40:07.218000+00:00 [INFO ] [.server.persistence.FileTxnSnapLog]  Snapshotting: 0xf00000650 to /streamlio/zookeeper/data/version-2/snapshot.f00000650
2020-08-27 16:40:07.249000+00:00 [INFO ] [e.zookeeper.server.ZooKeeperServer]  Snapshot taken in 30 ms
{noformat}
 

ZK-0 is following but immediately goes back into LOOKING state:
{noformat}
2020-08-27 16:40:07.207000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]  FOLLOWING - LEADER ELECTION TOOK - 211 MS
2020-08-27 16:40:07.208000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]  Peer state changed: following - discovery
2020-08-27 16:40:07.208000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]  Successfully connected to leader, using address: zookeeper-zk35-2.zookeeper-zk35.pulsar-developers.svc.cluster.local/100.101.166.47:2888
2020-08-27 16:40:07.214000+00:00 [WARN ] [he.zookeeper.server.quorum.Learner]  Exception when following the leader
	java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
	at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:271) ~[?:?]
	at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[?:?]
	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
	at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:182)
	at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:451)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:89)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1458)

2020-08-27 16:40:07.215000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]  Disconnected from leader (with address: zookeeper-zk35-2.zookeeper-zk35.pulsar-developers.svc.cluster.local/100.101.166.47:2888). Was connected for 6ms. Sync state: false
2020-08-27 16:40:07.215000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]  shutdown Follower
2020-08-27 16:40:07.215000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]  Peer state changed: looking
2020-08-27 16:40:07.215000+00:00 [WARN ] [zookeeper.server.quorum.QuorumPeer]  PeerState set to LOOKING
2020-08-27 16:40:07.215000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]  LOOKING
2020-08-27 16:40:07.215000+00:00 [INFO ] [r.server.quorum.FastLeaderElection]  New election. My id = 1, proposed zxid=0xf00000650
{noformat}


After timeout, ZK-2 goes back into looking and a new leader election takes place:
{noformat}
2020-08-27 16:40:27.251000+00:00 [WARN ] [zookeeper.server.quorum.QuorumPeer]  Unexpected exception
	java.lang.InterruptedException: Timeout while waiting for epoch from quorum
	at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1428)
	at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:599)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1471)

2020-08-27 16:40:27.251000+00:00 [INFO ] [che.zookeeper.server.quorum.Leader]  Shutting down
2020-08-27 16:40:27.251000+00:00 [INFO ] [che.zookeeper.server.quorum.Leader]  Shutdown called. For the reason Forcing shutdown
2020-08-27 16:40:27.251000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]  Peer state changed: looking
2020-08-27 16:40:27.252000+00:00 [WARN ] [zookeeper.server.quorum.QuorumPeer]  PeerState set to LOOKING
2020-08-27 16:40:27.252000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]  LOOKING
{noformat}

 

The main issue that triggers this is an intrinsic race condition between the leader starting to listen on port 2888 and the followers trying to connect to it. In a normal deployment scenario, the followers get a connection-refused error, on which they retry up to 5 times.

Instead, with a TCP proxy in between, the follower is able to establish the TCP connection (to the proxy), but this connection is terminated immediately as soon as we try to read or write on the socket, because the proxy's own connection to the leader is the one that fails.
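
To make the difference concrete, here is a minimal sketch using plain Java sockets (hostnames are illustrative, not the actual Learner code): a direct connection to a port nobody has bound yet fails fast at connect() time, while the same attempt through a TCP proxy typically succeeds at connect() time and only surfaces the failure on the first read.

{noformat}
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ProxyConnectSketch {

    // Hypothetical addresses: "zk-2" reached directly, "zk-2-proxy" fronted
    // by a TCP proxy (e.g. an Istio sidecar). Neither name is from the issue.
    public static void main(String[] args) throws Exception {
        // Case 1: direct connection while the leader has not bound 2888 yet.
        // connect() fails immediately with "Connection refused", which the
        // follower's existing 5-attempt retry loop handles.
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress("zk-2", 2888), 5000);
        } catch (IOException e) {
            System.out.println("direct: connect failed -> " + e.getMessage());
        }

        // Case 2: connection through a TCP proxy. The proxy accepts the TCP
        // connection, so connect() succeeds, but the proxy's upstream connect
        // to the leader is refused and the first read gets "Connection reset".
        // That error path is not retried today, so the follower drops back
        // into LOOKING.
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress("zk-2-proxy", 2888), 5000);
            InputStream in = s.getInputStream();
            int b = in.read();   // throws SocketException: Connection reset
            System.out.println("proxied: unexpectedly read " + b);
        } catch (IOException e) {
            System.out.println("proxied: read failed -> " + e.getMessage());
        }
    }
}
{noformat}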

One solution here would be to apply the same retry logic, within the boundaries of the timeout, also when we get a read error and not only on a connection error.
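
A rough sketch of that idea, using a hypothetical helper rather than the actual Learner/Follower API: treat a read/write failure on a freshly established connection the same as a connection failure, and keep retrying until the overall sync-up deadline (derived from the init limit) expires.

{noformat}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative only: retryUntilDeadline() is a hypothetical helper, not the
// existing ZooKeeper code. It retries both connect failures and failures of
// the initial exchange on the socket, bounded by an overall deadline.
public class RetryOnReadErrorSketch {

    interface SyncAttempt {
        void run(Socket leaderSocket) throws IOException;  // e.g. register with the leader
    }

    static Socket retryUntilDeadline(InetSocketAddress leaderAddr,
                                     long deadlineMillis,
                                     SyncAttempt attempt)
            throws IOException, InterruptedException {
        IOException lastError = null;
        while (System.currentTimeMillis() < deadlineMillis) {
            Socket sock = new Socket();
            try {
                sock.connect(leaderAddr, 5000);
                attempt.run(sock);      // a "Connection reset" here is retried too
                return sock;
            } catch (IOException e) {
                lastError = e;
                sock.close();
                Thread.sleep(1000);     // back off a little before the next attempt
            }
        }
        throw lastError != null
                ? lastError
                : new IOException("gave up waiting for the leader");
    }
}
{noformat}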


