You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2021/12/22 21:50:00 UTC
[jira] [Assigned] (KAFKA-13544) Deadlock during shutting down kafka broker because of connectivity problem with zookeeper
[ https://issues.apache.org/jira/browse/KAFKA-13544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guozhang Wang reassigned KAFKA-13544:
-------------------------------------
Assignee: joecqupt
> Deadlock during shutting down kafka broker because of connectivity problem with zookeeper
> ------------------------------------------------------------------------------------------
>
> Key: KAFKA-13544
> URL: https://issues.apache.org/jira/browse/KAFKA-13544
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.8.1
> Reporter: Andrei Lakhmanets
> Assignee: joecqupt
> Priority: Major
> Fix For: 3.1.0, 3.0.1, 2.8.2
>
> Attachments: kafka_broker_logs.log, kafka_broker_stackdump.txt
>
>
> Hi team,
> *Kafka version:* 2.8.1
> *Configuration:* 3 kafka brokers in different availability zones and 3 zookeeper brokers in different availability zones.
> I faced with deadlock in kafka. I've attached stack dump of the kafka state to this ticket. The locked threads are "feature-zk-node-event-process-thread" and "kafka-shutdown-hook".
> *Context:*
> My kafka cluster had connectivity problems with zookeeper and in the logs I saw the next exception:
> The stacktrace:
> {code:java}
> [2021-12-06 18:31:14,629] WARN Unable to reconnect to ZooKeeper service, session 0x10000039563000f has expired (org.apache.zookeeper.ClientCnxn)
> [2021-12-06 18:31:14,629] INFO Unable to reconnect to ZooKeeper service, session 0x10000039563000f has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
> [2021-12-06 18:31:14,629] INFO EventThread shut down for session: 0x10000039563000f (org.apache.zookeeper.ClientCnxn)
> [2021-12-06 18:31:14,631] INFO [ZooKeeperClient Kafka server] Session expired. (kafka.zookeeper.ZooKeeperClient)
> [2021-12-06 18:31:14,632] ERROR [feature-zk-node-event-process-thread]: Failed to process feature ZK node change event. The broker will eventually exit. (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)
> kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
> at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:279)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:261)
> at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:261)
> at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1797)
> at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1767)
> at kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1762)
> at kafka.zk.KafkaZkClient.getDataAndStat(KafkaZkClient.scala:771)
> at kafka.zk.KafkaZkClient.getDataAndVersion(KafkaZkClient.scala:755)
> at kafka.server.FinalizedFeatureChangeListener$FeatureCacheUpdater.updateLatestOrThrow(FinalizedFeatureChangeListener.scala:74)
> at kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread.doWork(FinalizedFeatureChangeListener.scala:147)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96) {code}
> The exception is thrown in feature-zk-node-event-process-thread thread and it is catched in method FinalizedFeatureChangeListener.ChangeNotificationProcessorThread.doWork and then doWork method throws FatalExitError(1).
> The FatalExitError catched in ShutdownableThread.run method and call Exit.exit(e.statusCode()) which calls System.exit under the hood.
> The stackdump of "feature-zk-node-event-process-thread" thread:
> {code:java}
> "feature-zk-node-event-process-thread" #23 prio=5 os_prio=0 cpu=163.19ms elapsed=1563046.32s tid=0x00007fd0dcdec800 nid=0x2088 in Object.wait() [0x00007fd07e2c1000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(java.base@11.0.11/Native Method)
> - waiting on <no object reference available>
> at java.lang.Thread.join(java.base@11.0.11/Thread.java:1300)
> - waiting to re-lock in wait() <0x0000000088b9d3c8> (a org.apache.kafka.common.utils.KafkaThread)
> at java.lang.Thread.join(java.base@11.0.11/Thread.java:1375)
> at java.lang.ApplicationShutdownHooks.runHooks(java.base@11.0.11/ApplicationShutdownHooks.java:107)
> at java.lang.ApplicationShutdownHooks$1.run(java.base@11.0.11/ApplicationShutdownHooks.java:46)
> at java.lang.Shutdown.runHooks(java.base@11.0.11/Shutdown.java:130)
> at java.lang.Shutdown.exit(java.base@11.0.11/Shutdown.java:174)
> - locked <0x00000000806872f8> (a java.lang.Class for java.lang.Shutdown)
> at java.lang.Runtime.exit(java.base@11.0.11/Runtime.java:116)
> at java.lang.System.exit(java.base@11.0.11/System.java:1752)
> at org.apache.kafka.common.utils.Exit$2.execute(Exit.java:43)
> at org.apache.kafka.common.utils.Exit.exit(Exit.java:66)
> at kafka.utils.Exit$.exit(Exit.scala:28)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:102) {code}
> System.exit method before shutting down virtual machine calls shutdown hooks and spawns the thread "kafka-shutdown-hook".
> The thread "kafka-shutdown-hook" calls kafka.server.KafkaServer.shutdown which during work calls
> {code:java}
> if (featureChangeListener != null)
> CoreUtils.swallow(featureChangeListener.close(), this) {code}
> Where featureChangeListener.close is
> {code:java}
> def close(): Unit = {
> zkClient.unregisterStateChangeHandler(ZkStateChangeHandler.name)
> zkClient.unregisterZNodeChangeHandler(FeatureZNodeChangeHandler.path)
> queue.clear()
> thread.shutdown()
> thread.join()
> } {code}
> It enteres in the deadlock on the line "thread.join()" because it waits that the thread "feature-zk-node-event-process-thread" should die but this thread spawned the thead "kafka-shutdown-hook" and can't die until the thread "kafka-shutdown-hook" is finished.
> The stack dump of the thread "kafka-shutdown-hook":
> {code:java}
> "kafka-shutdown-hook" #18 prio=5 os_prio=0 cpu=134.10ms elapsed=50906.28s tid=0x00007fd0ac15c800 nid=0x6d3bb in Object.wait() [0x00007fd02dcba000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(java.base@11.0.11/Native Method)
> - waiting on <no object reference available>
> at java.lang.Thread.join(java.base@11.0.11/Thread.java:1300)
> - waiting to re-lock in wait() <0x0000000088973e98> (a kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)
> at java.lang.Thread.join(java.base@11.0.11/Thread.java:1375)
> at kafka.server.FinalizedFeatureChangeListener.close(FinalizedFeatureChangeListener.scala:245)
> at kafka.server.KafkaServer.$anonfun$shutdown$23(KafkaServer.scala:717)
> at kafka.server.KafkaServer$$Lambda$1681/0x000000084082b040.apply$mcV$sp(Unknown Source)
> at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:68)
> at kafka.server.KafkaServer.shutdown(KafkaServer.scala:717)
> at kafka.Kafka$.$anonfun$main$3(Kafka.scala:100)
> at kafka.Kafka$$$Lambda$205/0x0000000840233840.apply$mcV$sp(Unknown Source)
> at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
> at kafka.Kafka$$$Lambda$206/0x0000000840233c40.run(Unknown Source)
> at java.lang.Thread.run(java.base@11.0.11/Thread.java:829){code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)