Posted to jira@kafka.apache.org by "N S N MURTHY (JIRA)" <ji...@apache.org> on 2019/05/02 15:55:00 UTC

[jira] [Commented] (KAFKA-7697) Possible deadlock in kafka.cluster.Partition

    [ https://issues.apache.org/jira/browse/KAFKA-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831714#comment-16831714 ] 

N S N MURTHY commented on KAFKA-7697:
-------------------------------------

[~rsivaram]  We also encountered the same issue with 2.1.0 in our production stack.

Below is a sample stack trace. If we want to upgrade or downgrade Kafka in our setup, which version should we move to?

kafka-request-handler-3 tid=53 [WAITING] [DAEMON]
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock() ReentrantReadWriteLock.java:727
kafka.utils.CoreUtils$.inLock(Lock, Function0) CoreUtils.scala:249
kafka.utils.CoreUtils$.inReadLock(ReadWriteLock, Function0) CoreUtils.scala:257
kafka.cluster.Partition.fetchOffsetSnapshot(Optional, boolean) Partition.scala:832
kafka.server.DelayedFetch.$anonfun$tryComplete$1(DelayedFetch, IntRef, Object, Tuple2) DelayedFetch.scala:87
kafka.server.DelayedFetch.$anonfun$tryComplete$1$adapted(DelayedFetch, IntRef, Object, Tuple2) DelayedFetch.scala:79
kafka.server.DelayedFetch$$Lambda$969.apply(Object)
scala.collection.mutable.ResizableArray.foreach(Function1) ResizableArray.scala:58
scala.collection.mutable.ResizableArray.foreach$(ResizableArray, Function1) ResizableArray.scala:51
scala.collection.mutable.ArrayBuffer.foreach(Function1) ArrayBuffer.scala:47
kafka.server.DelayedFetch.tryComplete() DelayedFetch.scala:79
kafka.server.DelayedOperation.maybeTryComplete() DelayedOperation.scala:121
kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched() DelayedOperation.scala:371
kafka.server.DelayedOperationPurgatory.checkAndComplete(Object) DelayedOperation.scala:277
kafka.server.ReplicaManager.tryCompleteDelayedFetch(DelayedOperationKey) ReplicaManager.scala:307
kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition, MemoryRecords, boolean, int) Partition.scala:743
kafka.cluster.Partition$$Lambda$856.apply()
kafka.utils.CoreUtils$.inLock(Lock, Function0) CoreUtils.scala:251
kafka.utils.CoreUtils$.inReadLock(ReadWriteLock, Function0) CoreUtils.scala:257
kafka.cluster.Partition.appendRecordsToLeader(MemoryRecords, boolean, int) Partition.scala:729
kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(ReplicaManager, boolean, boolean, short, Tuple2) ReplicaManager.scala:735
kafka.server.ReplicaManager$$Lambda$844.apply(Object)
scala.collection.TraversableLike.$anonfun$map$1(Function1, Builder, Object) TraversableLike.scala:233
scala.collection.TraversableLike$$Lambda$10.apply(Object)
scala.collection.mutable.HashMap.$anonfun$foreach$1(Function1, DefaultEntry) HashMap.scala:145
scala.collection.mutable.HashMap$$Lambda$22.apply(Object)
scala.collection.mutable.HashTable.foreachEntry(Function1) HashTable.scala:235
scala.collection.mutable.HashTable.foreachEntry$(HashTable, Function1) HashTable.scala:228
scala.collection.mutable.HashMap.foreachEntry(Function1) HashMap.scala:40
scala.collection.mutable.HashMap.foreach(Function1) HashMap.scala:145
scala.collection.TraversableLike.map(Function1, CanBuildFrom) TraversableLike.scala:233
scala.collection.TraversableLike.map$(TraversableLike, Function1, CanBuildFrom) TraversableLike.scala:226
scala.collection.AbstractTraversable.map(Function1, CanBuildFrom) Traversable.scala:104
kafka.server.ReplicaManager.appendToLocalLog(boolean, boolean, Map, short) ReplicaManager.scala:723
kafka.server.ReplicaManager.appendRecords(long, short, boolean, boolean, Map, Function1, Option, Function1) ReplicaManager.scala:470
kafka.server.KafkaApis.handleProduceRequest(RequestChannel$Request) KafkaApis.scala:482
kafka.server.KafkaApis.handle(RequestChannel$Request) KafkaApis.scala:106
kafka.server.KafkaRequestHandler.run() KafkaRequestHandler.scala:69
java.lang.Thread.run() Thread.java:748
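
To make the trace easier to read: the two kafka.utils.CoreUtils inReadLock frames are two separate read-lock acquisitions on Partition.leaderIsrUpdateLock. Roughly, those helpers do the following (a paraphrased sketch, not the verbatim CoreUtils.scala source):

    import java.util.concurrent.locks.{Lock, ReadWriteLock}

    object LockHelpers {
      // Run `fun` while holding `lock`, releasing it on every exit path.
      def inLock[T](lock: Lock)(fun: => T): T = {
        lock.lock()
        try fun
        finally lock.unlock()
      }

      // Same, but for the read side of a ReadWriteLock
      // (Partition.leaderIsrUpdateLock in the trace above).
      def inReadLock[T](lock: ReadWriteLock)(fun: => T): T =
        inLock(lock.readLock())(fun)
    }

In the trace, the outer inReadLock is taken by Partition.appendRecordsToLeader on the partition being produced to, and the inner one is attempted by Partition.fetchOffsetSnapshot on whichever partition the delayed fetch is watching, so a single request-handler thread can hold one partition's read lock while waiting for another's.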

-Murthy

> Possible deadlock in kafka.cluster.Partition
> --------------------------------------------
>
>                 Key: KAFKA-7697
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7697
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Gian Merlino
>            Assignee: Rajini Sivaram
>            Priority: Blocker
>             Fix For: 2.2.0, 2.1.1
>
>         Attachments: kafka.log, threaddump.txt
>
>
> After upgrading a fairly busy broker from 0.10.2.0 to 2.1.0, it locked up within a few minutes (by "locked up" I mean that all request handler threads were busy, and other brokers reported that they couldn't communicate with it). I restarted it a few times and it did the same thing each time. After downgrading to 0.10.2.0, the broker was stable. I attached a thread dump from the last attempt on 2.1.0 that shows lots of kafka-request-handler- threads trying to acquire the leaderIsrUpdateLock lock in kafka.cluster.Partition.
> It jumps out that there are two threads that already have some read lock (can't tell which one) and are trying to acquire a second one (on two different read locks: 0x0000000708184b88 and 0x000000070821f188): kafka-request-handler-1 and kafka-request-handler-4. Both are handling a produce request, and in the process of doing so, are calling Partition.fetchOffsetSnapshot while trying to complete a DelayedFetch. At the same time, both of those locks have writers from other threads waiting on them (kafka-request-handler-2 and kafka-scheduler-6). Neither of those locks appears to have a writer that holds it (if only because no threads in the dump are deep enough in inWriteLock to indicate that).
> ReentrantReadWriteLock in nonfair mode prioritizes waiting writers over readers. Is it possible that kafka-request-handler-1 and kafka-request-handler-4 are each trying to read-lock the partition that is currently locked by the other one, and they're both parked waiting for kafka-request-handler-2 and kafka-scheduler-6 to get write locks, which they never will, because the former two threads own read locks and aren't giving them up?
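
To illustrate the ReentrantReadWriteLock behaviour this hypothesis depends on, here is a small standalone sketch (not Kafka code; the object name and latch choreography are made up for the demo). Partition.leaderIsrUpdateLock is created with the default, non-fair constructor, and in that mode a new, non-reentrant read acquisition still waits once a writer is parked at the head of the queue, even though only read locks are currently held:

    import java.util.concurrent.{CountDownLatch, TimeUnit}
    import java.util.concurrent.locks.ReentrantReadWriteLock

    object QueuedWriterBlocksNewReaders extends App {
      val lock = new ReentrantReadWriteLock() // default constructor: non-fair mode
      val readerHolding = new CountDownLatch(1)

      // Thread 1: takes the read lock and holds it for a while
      // (stands in for the handler inside appendRecordsToLeader).
      new Thread(() => {
        lock.readLock().lock()
        readerHolding.countDown()
        Thread.sleep(3000)
        lock.readLock().unlock()
      }).start()

      // Thread 2: queues for the write lock (stands in for a waiting
      // writer such as kafka-scheduler-6); it parks behind the held read lock.
      readerHolding.await()
      new Thread(() => {
        lock.writeLock().lock()   // acquired only after thread 1 releases
        lock.writeLock().unlock()
      }).start()
      Thread.sleep(200) // give the writer time to reach the head of the queue

      // Main thread: a *new* reader arriving behind the queued writer is
      // also made to wait; the timed tryLock gives up and returns false.
      println(lock.readLock().tryLock(1, TimeUnit.SECONDS)) // expected: false
    }

With two partitions, that is enough for the cycle described above: each handler holds a read lock on its own partition, a writer is queued on each, and each handler's second read acquisition parks behind the other partition's queued writer.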


