Posted to jira@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/02/22 01:36:00 UTC

[jira] [Commented] (KAFKA-7974) KafkaAdminClient loses worker thread/enters zombie state when initial DNS lookup fails

    [ https://issues.apache.org/jira/browse/KAFKA-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774661#comment-16774661 ] 

ASF GitHub Bot commented on KAFKA-7974:
---------------------------------------

nickbp commented on pull request #6305: Fix for KAFKA-7974: Avoid calling disconnect() when not yet connecting
URL: https://github.com/apache/kafka/pull/6305
 
 
   When listing topics via `KafkaAdminClient` against a server whose hostname cannot be resolved, the worker thread can be killed as follows, leaving a zombie `KafkaAdminClient`:
   
   ```
   ERROR [kafka-admin-client-thread | adminclient-1] 2019-02-18 01:00:45,597 KafkaThread.java:51 - Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1':
   java.lang.IllegalStateException: No entry found for connection 0
       at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
       at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
       at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:921)
       at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
       at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.sendEligibleCalls(KafkaAdminClient.java:898)
       at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1113)
       at java.lang.Thread.run(Thread.java:748)
   ```
   
   It looks like the cause is a bug in state handling between `NetworkClient` and `ClusterConnectionStates`:
   - `NetworkClient.ready()` invokes `this.initiateConnect()` as seen in the above stacktrace
   - `NetworkClient.initiateConnect()` invokes `ClusterConnectionStates.connecting()`, which internally invokes `ClientUtils.resolve()` to resolve the host when creating an entry for the connection.
   - If this host lookup fails, an `UnknownHostException` can be thrown back to `NetworkClient.initiateConnect()` and the connection entry is not created in `ClusterConnectionStates`. This exception doesn't currently get logged, so this is a guess on my part.
   - `NetworkClient.initiateConnect()` catches the exception and attempts to call `ClusterConnectionStates.disconnected()`, which throws an `IllegalStateException` because no entry had yet been created due to the lookup failure.
   - This `IllegalStateException` ends up killing the worker thread and `KafkaAdminClient` gets stuck, never returning from `listTopics()`.
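   
   The sequence above can be modeled without Kafka at all. The following is a minimal sketch, not Kafka's actual code: `ConnectStateSketch`, `ConnectionStates`, `resolve()`, `hasEntry()` and both `initiateConnect*` methods are hypothetical stand-ins, the resolver is hard-wired to fail, and the guarded variant is only one possible way to avoid the second exception, not necessarily the change made in this PR.
   
   ```java
   import java.net.UnknownHostException;
   import java.util.HashMap;
   import java.util.Map;

   // Simplified model of the interaction described above; NOT Kafka's real classes.
   class ConnectStateSketch {

       /** Hypothetical resolver that always fails, like an unresolvable broker host. */
       static void resolve(String host) throws UnknownHostException {
           throw new UnknownHostException(host);
       }

       /** Hypothetical per-node connection state table. */
       static class ConnectionStates {
           private final Map<String, String> states = new HashMap<>();

           void connecting(String id, String host) throws UnknownHostException {
               resolve(host);                 // throws before the entry is created
               states.put(id, "CONNECTING");
           }

           void disconnected(String id) {
               if (!states.containsKey(id))
                   throw new IllegalStateException("No entry found for connection " + id);
               states.put(id, "DISCONNECTED");
           }

           boolean hasEntry(String id) {
               return states.containsKey(id);
           }
       }

       /** Models the reported behaviour: the catch block assumes an entry exists. */
       static void initiateConnectUnguarded(ConnectionStates s, String id, String host) {
           try {
               s.connecting(id, host);
           } catch (UnknownHostException e) {
               s.disconnected(id);            // IllegalStateException escapes to the caller
           }
       }

       /** One possible guard (an assumption): only touch state that was actually created. */
       static void initiateConnectGuarded(ConnectionStates s, String id, String host) {
           try {
               s.connecting(id, host);
           } catch (UnknownHostException e) {
               if (s.hasEntry(id))
                   s.disconnected(id);
           }
       }

       public static void main(String[] args) {
           ConnectionStates states = new ConnectionStates();
           try {
               initiateConnectUnguarded(states, "0", "unresolvable.invalid");
           } catch (IllegalStateException e) {
               System.out.println("unguarded: " + e.getMessage());
           }
           initiateConnectGuarded(states, "0", "unresolvable.invalid");
           System.out.println("guarded: returned normally");
       }
   }
   ```
   
   In this model the unguarded path fails with the same message as the stacktrace above, because the lookup failure happens before any entry exists for the node.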
   
   This PR includes a unit test which reproduces the original issue (matching stacktrace) and verifies the fix.
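   
   To illustrate why the dead worker thread leaves callers stuck, here is a small Kafka-free sketch. All names (`ZombieWorkerSketch`, `sendEligibleCalls()`, `callWithDyingWorker()`, `timesOut()`) are hypothetical, and a plain `CompletableFuture` stands in for the future returned to the caller. The worker dies before completing the future, so only a bounded wait lets the caller notice.
   
   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;

   // Simplified model of a worker thread dying before it completes a pending call.
   class ZombieWorkerSketch {

       /** Hypothetical stand-in for the step that throws inside the worker loop. */
       static void sendEligibleCalls() {
           throw new IllegalStateException("No entry found for connection 0");
       }

       /** Runs a worker that dies from an unchecked exception; returns the pending future. */
       static CompletableFuture<String> callWithDyingWorker() throws InterruptedException {
           CompletableFuture<String> result = new CompletableFuture<>();
           Thread worker = new Thread(() -> {
               sendEligibleCalls();           // exception escapes run() and ends the thread
               result.complete("topics");     // never reached, so the future stays pending
           }, "sketch-worker");
           // Swallow the uncaught exception so the sketch does not print a stack trace.
           worker.setUncaughtExceptionHandler((t, e) -> { });
           worker.start();
           worker.join();
           return result;
       }

       /** Returns true if the future is still pending after the given wait. */
       static boolean timesOut(CompletableFuture<String> f, long millis) throws Exception {
           try {
               f.get(millis, TimeUnit.MILLISECONDS);
               return false;
           } catch (TimeoutException e) {
               return true;
           }
       }

       public static void main(String[] args) throws Exception {
           CompletableFuture<String> pending = callWithDyingWorker();
           System.out.println("timed out: " + timesOut(pending, 200));
       }
   }
   ```
   
   An unbounded `get()` on such a future would block indefinitely. On the Kafka side, `KafkaFuture` also offers `get(long, TimeUnit)`, which callers can use as a defensive bound regardless of this bug.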
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
 


> KafkaAdminClient loses worker thread/enters zombie state when initial DNS lookup fails
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7974
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7974
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Nicholas Parker
>            Priority: Major
>
> Version: kafka-clients-2.1.0
> I have some code that creates a KafkaAdminClient instance and then invokes listTopics(). I was seeing the following stacktrace in the logs, after which the KafkaAdminClient instance became unresponsive:
> {code:java}
> ERROR [kafka-admin-client-thread | adminclient-1] 2019-02-18 01:00:45,597 KafkaThread.java:51 - Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1':
> java.lang.IllegalStateException: No entry found for connection 0
>     at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
>     at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
>     at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:921)
>     at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
>     at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.sendEligibleCalls(KafkaAdminClient.java:898)
>     at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1113)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> From looking at the code I was able to trace down a possible cause:
>  * NetworkClient.ready() invokes this.initiateConnect() as seen in the above stacktrace
>  * NetworkClient.initiateConnect() invokes ClusterConnectionStates.connecting(), which internally invokes ClientUtils.resolve() to resolve the host when creating an entry for the connection.
>  * If this host lookup fails, an UnknownHostException can be thrown back to NetworkClient.initiateConnect() and the connection entry is not created in ClusterConnectionStates. This exception doesn't get logged, so this is a guess on my part.
>  * NetworkClient.initiateConnect() catches the exception and attempts to call ClusterConnectionStates.disconnected(), which throws an IllegalStateException because no entry had yet been created due to the lookup failure.
>  * This IllegalStateException ends up killing the worker thread and KafkaAdminClient gets stuck, never returning from listTopics().


