Posted to jira@kafka.apache.org by "David Mao (Jira)" <ji...@apache.org> on 2021/12/10 15:38:00 UTC

[jira] [Updated] (KAFKA-13388) Kafka Producer nodes stuck in CHECKING_API_VERSIONS

     [ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mao updated KAFKA-13388:
------------------------------
    Priority: Critical  (was: Minor)

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---------------------------------------------------
>
>                 Key: KAFKA-13388
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13388
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: David Hoffman
>            Priority: Critical
>         Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for xxx-17:120002 ms has passed since batch creation
> {code}
> I would have assumed that a request timeout or connection timeout would also be logged, but I could not find any other associated errors.
> I added some instrumentation to my app and have traced this down to broker connections hanging in CHECKING_API_VERSIONS state. -It appears there is no effective timeout for Kafka Producer broker connections in CHECKING_API_VERSIONS state.-
> In the code, I see that after the NetworkClient connects to a broker node, it sends an ApiVersions request, and when it receives the response it marks the node as ready. -I am seeing that sometimes no reply is received for the ApiVersions request, and the connection just hangs in CHECKING_API_VERSIONS state until it is disposed of, presumably after the idle connection timeout.-
> Update: I am not actually sure what causes the connection to get stuck in CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should still be in play for this, but it is not.-
> -There is a connectingNodes set that is consulted when checking timeouts, and the node is removed from it when ClusterConnectionStates.checkingApiVersions(String id) is called to transition the node into CHECKING_API_VERSIONS.-
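
The (since struck-out) hypothesis in the description is about bookkeeping: the setup-timeout check only walks the connectingNodes set, and a node leaves that set when it moves into CHECKING_API_VERSIONS, so a node whose ApiVersions response never arrives would be invisible to that check. Under that assumption, the first visible failure could be the batch expiry at the producer's delivery.timeout.ms (default 120000 ms, consistent with the 120002 ms in the log above). The following is a minimal Java sketch of that hypothesis only. It is a simplified, hypothetical model, not Kafka's NetworkClient or ClusterConnectionStates code: the class ConnectionStateModel, the methods connecting, ready, timedOutNodes and state, and the setupTimeoutMs parameter are invented for illustration; only the state names, connectingNodes and checkingApiVersions come from the report. The reporter later withdrew this explanation, so treat it as an illustration, not a diagnosis.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the bookkeeping described in this issue.
// It is NOT Kafka's ClusterConnectionStates; names mirror the description only.
public class ConnectionStateModel {

    public enum State { CONNECTING, CHECKING_API_VERSIONS, READY }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Long> setupStartMs = new HashMap<>();
    // Only nodes in this set are consulted by the setup-timeout check.
    private final Set<String> connectingNodes = new HashSet<>();
    private final long setupTimeoutMs;

    public ConnectionStateModel(long setupTimeoutMs) {
        this.setupTimeoutMs = setupTimeoutMs;
    }

    // A TCP connection attempt to the node has started.
    public void connecting(String id, long nowMs) {
        states.put(id, State.CONNECTING);
        setupStartMs.put(id, nowMs);
        connectingNodes.add(id);
    }

    // TCP connect finished; an ApiVersions request is now outstanding.
    // As hypothesized in the report, the node leaves connectingNodes here.
    public void checkingApiVersions(String id) {
        states.put(id, State.CHECKING_API_VERSIONS);
        connectingNodes.remove(id);
    }

    // ApiVersions response received; the node becomes usable.
    public void ready(String id) {
        states.put(id, State.READY);
    }

    // Returns nodes whose setup exceeded the timeout; only connectingNodes are checked.
    public Set<String> timedOutNodes(long nowMs) {
        Set<String> timedOut = new HashSet<>();
        for (String id : connectingNodes) {
            if (nowMs - setupStartMs.get(id) > setupTimeoutMs) {
                timedOut.add(id);
            }
        }
        return timedOut;
    }

    public State state(String id) {
        return states.get(id);
    }

    public static void main(String[] args) {
        ConnectionStateModel model = new ConnectionStateModel(10_000);
        model.connecting("broker-1", 0);
        model.connecting("broker-2", 0);
        // broker-1 connects, but its ApiVersions response never arrives.
        model.checkingApiVersions("broker-1");
        // broker-2 never finishes the TCP connect.
        long now = 60_000;
        System.out.println("timed out: " + model.timedOutNodes(now));
        System.out.println("broker-1 state: " + model.state("broker-1"));
    }
}
```

Running main reports broker-2 as timed out, while broker-1 stays in CHECKING_API_VERSIONS with nothing in this model left to expire it, which matches the symptom the reporter originally suspected.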



--
This message was sent by Atlassian Jira
(v8.20.1#820001)