Posted to jira@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2021/08/30 23:46:00 UTC
[jira] [Created] (KAFKA-13254) Deadlock when expanding ISR
Jason Gustafson created KAFKA-13254:
---------------------------------------
Summary: Deadlock when expanding ISR
Key: KAFKA-13254
URL: https://issues.apache.org/jira/browse/KAFKA-13254
Project: Kafka
Issue Type: Improvement
Reporter: Jason Gustafson
Assignee: Jason Gustafson
I found this while debugging downgrade system test failures. The patch for https://issues.apache.org/jira/browse/KAFKA-13091 introduced a deadlock. Here are the jstack details:
{code}
"data-plane-kafka-request-handler-4":
waiting for ownable synchronizer 0x00000000fcc00020, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
which is held by "data-plane-kafka-request-handler-5"
"data-plane-kafka-request-handler-5":
waiting for ownable synchronizer 0x00000000c9161b20, (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
which is held by "data-plane-kafka-request-handler-4"
"data-plane-kafka-request-handler-4":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000fcc00020> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at kafka.server.DelayedOperation.safeTryComplete(DelayedOperation.scala:121)
at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:362)
at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:264)
at kafka.cluster.DelayedOperations.checkAndCompleteAll(Partition.scala:59)
at kafka.cluster.Partition.tryCompleteDelayedRequests(Partition.scala:907)
at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1421)
at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1340)
at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1340)
at kafka.cluster.Partition$$Lambda$1496/2055478409.apply(Unknown Source)
at kafka.server.ZkIsrManager.submit(ZkIsrManager.scala:74)
at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1345)
at kafka.cluster.Partition.expandIsr(Partition.scala:1312)
at kafka.cluster.Partition.$anonfun$maybeExpandIsr$2(Partition.scala:755)
at kafka.cluster.Partition.maybeExpandIsr(Partition.scala:754)
at kafka.cluster.Partition.updateFollowerFetchState(Partition.scala:672)
at kafka.server.ReplicaManager.$anonfun$updateFollowerFetchState$1(ReplicaManager.scala:1806)
at kafka.server.ReplicaManager$$Lambda$1075/1996432270.apply(Unknown Source)
at scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:99)
at scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:86)
at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:42)
at kafka.server.ReplicaManager.updateFollowerFetchState(ReplicaManager.scala:1790)
at kafka.server.ReplicaManager.readFromLog$1(ReplicaManager.scala:1025)
at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:1029)
at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:970)
at kafka.server.KafkaApis.handle(KafkaApis.scala:173)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
at java.lang.Thread.run(Thread.java:748)
"data-plane-kafka-request-handler-5":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c9161b20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at kafka.cluster.Partition.fetchOffsetSnapshot(Partition.scala:1183)
at kafka.server.DelayedFetch.$anonfun$tryComplete$1(DelayedFetch.scala:96)
at kafka.server.DelayedFetch.$anonfun$tryComplete$1$adapted(DelayedFetch.scala:89)
at kafka.server.DelayedFetch$$Lambda$1115/1987378797.apply(Unknown Source)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
at scala.collection.AbstractIterable.foreach(Iterable.scala:919)
at kafka.server.DelayedFetch.tryComplete(DelayedFetch.scala:89)
at kafka.server.DelayedOperation.safeTryComplete(DelayedOperation.scala:121)
at kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:362)
at kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:264)
at kafka.server.ReplicaManager.$anonfun$appendRecords$6(ReplicaManager.scala:622)
at kafka.server.ReplicaManager$$Lambda$1150/40125541.apply(Unknown Source)
at scala.collection.mutable.HashMap$Node.foreach(HashMap.scala:627)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:489)
at kafka.server.ReplicaManager.$anonfun$appendRecords$5(ReplicaManager.scala:611)
at kafka.server.ReplicaManager$$Lambda$1134/1761219075.apply$mcV$sp(Unknown Source)
at kafka.server.ActionQueue.tryCompleteActions(ActionQueue.scala:49)
at kafka.server.ReplicaManager.tryCompleteActions(ReplicaManager.scala:569)
at kafka.server.KafkaApis.handle(KafkaApis.scala:245)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
at java.lang.Thread.run(Thread.java:748)
{code}
In short, one thread holds the LeaderAndIsr write lock (taken in maybeExpandIsr) and is trying to acquire the lock of a delayed operation in order to complete it, while the other thread holds that delayed operation's lock and is trying to acquire the LeaderAndIsr read lock. Each thread is blocked waiting for the lock the other holds, so neither can make progress.
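The cycle can be reproduced outside Kafka with a minimal sketch. The lock names below (isrLock, opLock) are hypothetical stand-ins for Partition's read/write lock and a DelayedOperation's lock; tryLock() is used in place of the real code's blocking lock() calls so the sketch demonstrates the cycle without actually hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class IsrDeadlockSketch {
    // Hypothetical stand-ins for the two locks in the jstack output:
    // the Partition's ReentrantReadWriteLock and a DelayedOperation's lock.
    static final ReentrantReadWriteLock isrLock = new ReentrantReadWriteLock();
    static final ReentrantLock opLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothFirstLocksHeld = new CountDownLatch(2);
        CountDownLatch bothAttemptsDone = new CountDownLatch(2);
        AtomicBoolean handler5GotReadLock = new AtomicBoolean();

        // "handler-5": holds the delayed-operation lock (safeTryComplete),
        // then needs the read lock (tryComplete -> fetchOffsetSnapshot).
        Thread handler5 = new Thread(() -> {
            opLock.lock();
            try {
                bothFirstLocksHeld.countDown();
                bothFirstLocksHeld.await();
                // Real code blocks here; tryLock() just reports the conflict.
                handler5GotReadLock.set(isrLock.readLock().tryLock());
                if (handler5GotReadLock.get()) isrLock.readLock().unlock();
                bothAttemptsDone.countDown();
                bothAttemptsDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                opLock.unlock();
            }
        });
        handler5.start();

        // "handler-4": holds the write lock (maybeExpandIsr), then needs the
        // delayed-operation lock to complete the purgatory entry.
        boolean handler4GotOpLock;
        isrLock.writeLock().lock();
        try {
            bothFirstLocksHeld.countDown();
            bothFirstLocksHeld.await();
            handler4GotOpLock = opLock.tryLock();
            if (handler4GotOpLock) opLock.unlock();
            bothAttemptsDone.countDown();
            bothAttemptsDone.await();
        } finally {
            isrLock.writeLock().unlock();
        }
        handler5.join();

        // Each thread holds the lock the other needs: with blocking lock()
        // calls, as in the stack traces above, this is a deadlock.
        System.out.println("handler-4 got delayed-op lock: " + handler4GotOpLock);
        System.out.println("handler-5 got read lock: " + handler5GotReadLock.get());
    }
}
```

Both acquisition attempts fail because the opposite thread still holds the first lock it took, which is exactly the circular wait jstack reports. The usual fix is to establish a single acquisition order, or to drop the outer lock before calling into the purgatory.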
Note that this does not affect 3.0 or any other released version.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)