Posted to users@kafka.apache.org by Nicolas Carlot <ni...@chronopost.fr.INVALID> on 2020/07/31 10:28:07 UTC

Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR

Hello community,

We've been having trouble with our cluster since yesterday. We suddenly saw
massive ISR loss on 2 of our 3 nodes. Most partition leaders were moved to
node 1, which still holds the ISR for all partitions.

On nodes 2 and 3 we see many log entries stating:
Cached zkVersion [xxx] not equal to that in zookeeper, skip updating ISR

I believe restarting those nodes should fix the issue, but we're very
concerned about why this happened.
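
For context, the zkVersion in that message is the version of the partition
state znode under /brokers/topics/<topic>/partitions/<partition>/state. A
rough sketch of what we intend to run to compare the live version in
ZooKeeper with what the brokers log (Python with the kazoo client; the
connect string, topic and partition below are placeholders):

from kazoo.client import KazooClient

# Placeholders -- point this at the real ensemble / topic / partition.
zk = KazooClient(hosts="satezookeeperi1.chronopost.fr:62181")
zk.start()

path = "/brokers/topics/some-topic/partitions/0/state"
data, stat = zk.get(path)
# stat.version is the zkVersion the broker compares its cached value against.
print(path, "zkVersion =", stat.version)
print("state =", data.decode())

zk.stop()

As far as I understand, if ZooKeeper shows a newer version than the one in
the broker's message, that broker is working from a stale cache and will keep
skipping ISR updates until it receives fresh state from the controller.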

When the issue arose, we got the following error in controller.log on Kafka
nodes 2 and 3:
[2020-07-30 21:21:07,808] ERROR [ControllerEventThread controllerId=2] Error processing event RegisterBrokerAndReelect (kafka.controller.ControllerEventManager$ControllerEventThread)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
        at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1518)
        at kafka.zk.KafkaZkClient.registerBrokerInZk(KafkaZkClient.scala:84)
        at kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1519)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:86)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:86)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
        at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:85)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
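
As far as I understand, the NodeExists above means the broker tried to
re-register its ephemeral /brokers/ids/<id> znode after the session expired,
while the znode created by the old session had not yet been removed. A quick
sketch to check which session currently owns that znode (Python/kazoo again;
broker id 2 below is just an example):

from kazoo.client import KazooClient

zk = KazooClient(hosts="satezookeeperi1.chronopost.fr:62181")
zk.start()

data, stat = zk.get("/brokers/ids/2")  # example broker id
# ephemeralOwner is the ZooKeeper session id that created this znode;
# it can be compared with the session ids printed in server.log (0x...).
print("ephemeralOwner = %#x" % stat.ephemeralOwner)
print("registration   =", data.decode())

zk.stop()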

And the following trace in server.log on nodes 2 and 3:
[2020-07-30 21:21:06,381] INFO Socket connection established to satezookeeperi2.chronopost.fr/10.37.92.239:62181, initiating session (org.apache.zookeeper.ClientCnxn)
[2020-07-30 21:21:06,382] WARN Unable to reconnect to ZooKeeper service, session 0x100001144cf000c has expired (org.apache.zookeeper.ClientCnxn)
[2020-07-30 21:21:06,382] INFO Unable to reconnect to ZooKeeper service, session 0x100001144cf000c has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2020-07-30 21:21:06,382] INFO [ZooKeeperClient] Session expired. (kafka.zookeeper.ZooKeeperClient)
[2020-07-30 21:21:06,382] INFO EventThread shut down for session: 0x100001144cf000c (org.apache.zookeeper.ClientCnxn)
[2020-07-30 21:21:06,383] INFO [ZooKeeperClient] Initializing a new session to satezookeeperi1.chronopost.fr:62181,satezookeeperi2.chronopost.fr:62181,satezookeeperi3.chronopost.fr:62181. (kafka.zookeeper.ZooKeeperClient)
[2020-07-30 21:21:06,383] INFO Initiating client connection, connectString=satezookeeperi1.chronopost.fr:62181,satezookeeperi2.chronopost.fr:62181,satezookeeperi3.chronopost.fr:62181 sessionTimeout=6000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@758871cf (org.apache.zookeeper.ZooKeeper)
[2020-07-30 21:21:06,383] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
        at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:238)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:226)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:226)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
        at kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:226)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:220)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:220)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:220)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
        at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:219)
        at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1503)
        at kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1479)
        at kafka.zk.KafkaZkClient.conditionalUpdatePath(KafkaZkClient.scala:604)
        at kafka.utils.ReplicationUtils$.updateLeaderAndIsr(ReplicationUtils.scala:33)
        at kafka.cluster.Partition.kafka$cluster$Partition$$updateIsr(Partition.scala:730)
        at kafka.cluster.Partition$$anonfun$4.apply$mcZ$sp(Partition.scala:536)
        at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:527)
        at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:527)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
        at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:259)
        at kafka.cluster.Partition.maybeShrinkIsr(Partition.scala:526)
        at kafka.server.ReplicaManager$$anonfun$kafka$server$ReplicaManager$$maybeShrinkIsr$2.apply(ReplicaManager.scala:1333)
        at kafka.server.ReplicaManager$$anonfun$kafka$server$ReplicaManager$$maybeShrinkIsr$2.apply(ReplicaManager.scala:1333)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$maybeShrinkIsr(ReplicaManager.scala:1333)
        at kafka.server.ReplicaManager$$anonfun$2.apply$mcV$sp(ReplicaManager.scala:316)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:114)
        at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
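
The sessionTimeout=6000 visible above corresponds to the broker's
zookeeper.session.timeout.ms, so any pause longer than 6 seconds (long GC,
network blip) is enough to expire the session and trigger exactly this
isr-expiration failure. To see which broker ZooKeeper currently records as
controller, and at which epoch, a small sketch like this should do (same
kazoo assumption as above; connect string is a placeholder):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="satezookeeperi1.chronopost.fr:62181")
zk.start()

controller, _ = zk.get("/controller")       # JSON including the active controller's brokerid
epoch, _ = zk.get("/controller_epoch")      # bumped on every controller change
print("controller =", json.loads(controller)["brokerid"])
print("epoch      =", epoch.decode())

zk.stop()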

The ZooKeeper logs on nodes 2 and 3 show:
[2020-07-30 23:31:42,653] INFO Received connection request /10.37.120.129:36280 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-07-30 23:31:42,654] WARN Connection broken for id 1, my id = 3, error =  (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketException: Socket closed
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
[2020-07-30 23:31:42,654] WARN Interrupting SendWorker (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-07-30 23:31:42,654] WARN Interrupted while waiting for message on queue (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
[2020-07-30 23:31:42,655] WARN Send worker leaving thread (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-07-30 23:31:42,657] INFO Notification: 1 (message format version), 3 (n.leader), 0x820e0000000c (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 (n.sid), 0x820e (n.peerEpoch) LEADING (my state) (org.apache.zookeeper.server.quorum.FastLeaderElection)
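
Given the quorum connection drops above, I also want to confirm what each
ZooKeeper server currently thinks its role is. A minimal sketch that polls
the 'stat' four-letter command on the client port (62181, taken from our
connect string; this assumes the four-letter commands are enabled on the
servers):

import socket

def four_letter(host, port, cmd=b"stat"):
    # Send a ZooKeeper four-letter command and return the raw reply.
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(cmd)
        s.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            part = s.recv(4096)
            if not part:
                break
            chunks.append(part)
    return b"".join(chunks).decode()

for host in ("satezookeeperi1.chronopost.fr",
             "satezookeeperi2.chronopost.fr",
             "satezookeeperi3.chronopost.fr"):
    print("=== %s ===" % host)
    print(four_letter(host, 62181))

The "Mode:" line in that output should show one leader and two followers if
the ensemble is healthy.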




-- 
Nicolas Carlot
Lead dev
nicolas.carlot@chronopost.fr

