Posted to jira@kafka.apache.org by "Denis Razuvaev (Jira)" <ji...@apache.org> on 2023/04/11 12:00:00 UTC

[jira] [Created] (KAFKA-14890) Kafka initiates shutdown due to connectivity problem with Zookeeper and FatalExitError from ChangeNotificationProcessorThread

Denis Razuvaev created KAFKA-14890:
--------------------------------------

             Summary: Kafka initiates shutdown due to connectivity problem with Zookeeper and FatalExitError from ChangeNotificationProcessorThread
                 Key: KAFKA-14890
                 URL: https://issues.apache.org/jira/browse/KAFKA-14890
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.3.2
            Reporter: Denis Razuvaev


Hello, 

We have faced a deadlock in Kafka several times; a similar issue is https://issues.apache.org/jira/browse/KAFKA-13544 

The question: is it expected behavior that Kafka decides to shut down because of connectivity problems with Zookeeper? It seems to be related to the inability to read data from the */feature* Zk node and to the _ZooKeeperClientExpiredException_ thrown from the _ZooKeeperClient_ class. This exception is caught only in the catch block of the _doWork()_ method in _ChangeNotificationProcessorThread_, where it leads to a _FatalExitError_. 
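
The escalation path described above can be sketched roughly as follows. This is a simplified Java model, not Kafka's actual Scala code: _SessionExpiredException_, _FatalExit_, _readFeatureNode_ and the status code 1 are hypothetical stand-ins chosen for illustration, and the only behavior taken from this report is that an exception raised while reading the */feature* node is caught in _doWork()_ and turned into a fatal exit.

```java
// Simplified model of the failure path. All names are hypothetical stand-ins,
// not Kafka's real classes.
class FeatureListenerSketch {

    // Stand-in for ZooKeeperClientExpiredException.
    static class SessionExpiredException extends RuntimeException {
        SessionExpiredException(String message) {
            super(message);
        }
    }

    // Stand-in for FatalExitError; the status code is an arbitrary assumption.
    static class FatalExit extends Error {
        final int statusCode;

        FatalExit(int statusCode, Throwable cause) {
            super("fatal exit, status " + statusCode, cause);
            this.statusCode = statusCode;
        }
    }

    // Stand-in for reading the /feature node; fails if the session is expired.
    static void readFeatureNode(boolean sessionExpired) {
        if (sessionExpired) {
            throw new SessionExpiredException("session expired while reading /feature");
        }
    }

    // Models doWork(): any exception from the update is escalated to a fatal exit.
    static void doWork(Runnable update) {
        try {
            update.run();
        } catch (Exception e) {
            throw new FatalExit(1, e);
        }
    }

    public static void main(String[] args) {
        doWork(() -> readFeatureNode(false)); // healthy session: returns normally
        try {
            doWork(() -> readFeatureNode(true)); // expired session: escalates
        } catch (FatalExit e) {
            System.out.println("escalated: " + e.getCause().getMessage());
        }
    }
}
```

In this model a transient condition (an expired session) is not distinguished from any other failure, which is the behavior the question above is about.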

This shutdown problem is also reproduced in newer versions of Kafka (which already contain the fix for the deadlock from KAFKA-13544). 
It is hard to write a synthetic test that reproduces the problem, but it can be reproduced locally in debug mode with the following steps: 
1) Start Zookeeper, then start Kafka in debug mode. 
2) Emulate a connectivity problem between Kafka and Zookeeper; for example, the connection can be closed via the Netcrusher library. 
3) Put a breakpoint in the _updateLatestOrThrow()_ method of the _FeatureCacheUpdater_ class, before the line _zkClient.getDataAndVersion(featureZkNodePath)_ is executed. 
4) Restore the connection between Kafka and Zookeeper after the session has expired. Kafka execution should stop at the breakpoint.
5) Resume execution until Kafka reaches the line _zooKeeperClient.handleRequests(remainingRequests)_ in the _retryRequestsUntilConnected_ method of the _KafkaZkClient_ class. 
6) Emulate a connectivity problem between Kafka and Zookeeper again and wait until the session expires. 
7) Restore the connection between Kafka and Zookeeper. 
8) Kafka begins the shutdown process because of: 
_ERROR [feature-zk-node-event-process-thread]: Failed to process feature ZK node change event. The broker will eventually exit. (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)_ 
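
The timing in the steps above can be illustrated with a toy model. This is only a sketch under assumptions: _ToyZkClient_, its session counter and its _read_ method are hypothetical and do not reflect how _KafkaZkClient_ or _ZooKeeperClient_ are implemented. It shows only the general shape of the race, where a read that starts under one session and is handled after that session has expired fails even though the connection is healthy again by then.

```java
// Toy model of a read that straddles a session expiration.
// All names are hypothetical; this is not Kafka's or ZooKeeper's API.
class SessionRaceSketch {

    static class SessionExpiredException extends RuntimeException {
        SessionExpiredException(String message) {
            super(message);
        }
    }

    static class ToyZkClient {
        private long sessionId = 1;
        private boolean connected = true;

        // Models steps 2 and 6: the connection drops and the session expires.
        void expireSession() {
            connected = false;
        }

        // Models steps 4 and 7: the connection returns with a new session.
        void restoreConnection() {
            sessionId++;
            connected = true;
        }

        boolean isConnected() {
            return connected;
        }

        // The pause models the breakpoint window between starting the read
        // and the request actually being handled (steps 3 to 5).
        String read(String path, Runnable pauseBeforeSend) {
            long startedUnder = sessionId;
            pauseBeforeSend.run();
            if (!connected || sessionId != startedUnder) {
                throw new SessionExpiredException("session expired during read of " + path);
            }
            return "data@" + path;
        }
    }

    public static void main(String[] args) {
        ToyZkClient client = new ToyZkClient();
        System.out.println(client.read("/feature", () -> { }));
        try {
            client.read("/feature", () -> {
                client.expireSession();
                client.restoreConnection();
            });
        } catch (SessionExpiredException e) {
            System.out.println("connected=" + client.isConnected() + ", but: " + e.getMessage());
        }
    }
}
```

In the second call the client is connected again when the exception surfaces, which mirrors the observation that the shutdown starts after the connection has been restored.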

In a real environment, such problems can be caused by network issues that make Kafka repeatedly disconnect from and reconnect to Zookeeper within a short period of time. 

I started a mail thread at [https://lists.apache.org/thread/gbk4scwd8g7mg2tfsokzj5tjgrjrb9dw] about this problem, but have received no answers.

To me this looks like a defect that should be fixed, because Kafka initiates shutdown after the connection between Kafka and Zookeeper has been restored. 

Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)