Posted to jira@kafka.apache.org by "David Hoffman (Jira)" <ji...@apache.org> on 2021/10/21 18:48:00 UTC

[jira] [Comment Edited] (KAFKA-13388) Kafka Producer has no timeout for nodes stuck in CHECKING_API_VERSIONS

    [ https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432667#comment-17432667 ] 

David Hoffman edited comment on KAFKA-13388 at 10/21/21, 6:47 PM:
------------------------------------------------------------------

I have a screenshot of some logging I added that shows the connection stuck in CHECKING_API_VERSIONS state. The logging grabbed all the nodeIds from ClusterConnectionStates and determined whether the Producer would treat them as 'not ready' by checking the connection state, the selector, and whether there were in-flight requests. Any 'not ready' nodes were logged along with some other info. It ran on a schedule every 30 seconds.
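A minimal sketch of that kind of periodic check (the reflection field path used here, sender -> client -> connectionStates -> nodeState -> state, is an assumption about the client internals and may vary by version; it only logs the connection state rather than everything described above):
{code:java}
import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;

public final class ConnectionStateLogger {

    private static Object readField(Object target, String name) throws Exception {
        Field f = target.getClass().getDeclaredField(name);
        f.setAccessible(true);
        return f.get(target);
    }

    // Every 30 seconds, walk the producer's internal per-node connection states
    // and log any node that is not READY.
    public static void schedule(KafkaProducer<?, ?> producer) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Object sender = readField(producer, "sender");              // Sender thread
                Object client = readField(sender, "client");                // NetworkClient
                Object connStates = readField(client, "connectionStates");  // ClusterConnectionStates
                Map<?, ?> nodeState = (Map<?, ?>) readField(connStates, "nodeState");
                for (Map.Entry<?, ?> e : nodeState.entrySet()) {
                    Object state = readField(e.getValue(), "state");        // ConnectionState enum
                    if (!"READY".equals(String.valueOf(state))) {
                        System.out.printf("node %s not ready, state=%s%n", e.getKey(), state);
                    }
                }
            } catch (Exception ex) {
                System.out.println("connection-state probe failed: " + ex);
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}
{code}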

For this occurrence, I looked up the node id, and it was the leader of the partition that the batches expired for. That's all I have right now. It's relatively rare, happening only 1 or 2 times a day across 32 application instances connecting to 64 Kafka brokers.

I am trying to narrow it down further by adding more info and waiting for it to happen again. One question I am wondering about is whether there is still an outstanding in-flight request when the connection is in this state, or whether that somehow got dropped in the shuffle somewhere.
 !image-2021-10-21-13-42-06-528.png|width=664,height=308!



> Kafka Producer has no timeout for nodes stuck in CHECKING_API_VERSIONS
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-13388
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13388
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: David Hoffman
>            Priority: Major
>         Attachments: image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for xxx-17:120002 ms has passed since batch creation
> {code}
> I would have assumed a request timeout or connection timeout should also have been logged, but I could not find any other associated errors.
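> For context, the 120002 ms in that message lines up with the producer's delivery.timeout.ms (120000 ms by default), and a request or connection setup timeout would normally fire first. A minimal sketch of the settings involved, with illustrative values only:
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.common.serialization.StringSerializer;
> 
> public class ProducerTimeoutsExample {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "broker1:9092");
>         // Batches expire with "... ms has passed since batch creation" once this elapses.
>         props.put("delivery.timeout.ms", "120000");
>         // An in-flight request that gets no response should fail after this...
>         props.put("request.timeout.ms", "30000");
>         // ...and a connection that never finishes setup should hit these (KIP-601):
>         props.put("socket.connection.setup.timeout.ms", "10000");
>         props.put("socket.connection.setup.timeout.max.ms", "30000");
>         KafkaProducer<String, String> producer =
>                 new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
>         producer.close();
>     }
> }
> {code}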
> I added some instrumentation to my app and have traced this down to broker connections hanging in the CHECKING_API_VERSIONS state. It appears there is no effective timeout for Kafka Producer broker connections in the CHECKING_API_VERSIONS state.
> In the code I see that after the NetworkClient connects to a broker node it makes a request to check API versions, and when it receives the response it marks the node as ready. I am seeing that sometimes a reply is never received for the check-API-versions request, and the connection just hangs in the CHECKING_API_VERSIONS state until it is disposed of, I assume by the idle connection timeout.
> I am guessing the connection setup timeout should still be in play here, but it is not.
> There is a connectingNodes set that is consulted when checking connection setup timeouts, and the node is removed from it when ClusterConnectionStates.checkingApiVersions(String id) is called to transition the node into CHECKING_API_VERSIONS, so nodes in that state are no longer covered by the connection setup timeout check.
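> To make the gap concrete, a simplified illustrative sketch (not the actual Kafka code) of that bookkeeping: the setup timeout scan only considers nodes still in connectingNodes, so a node that has moved to CHECKING_API_VERSIONS is never checked again and waits indefinitely for the ApiVersions response.
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
> 
> // Simplified sketch of the state bookkeeping described above -- not the real
> // ClusterConnectionStates, just an illustration of where the timeout stops applying.
> class ConnectionStatesSketch {
>     enum State { CONNECTING, CHECKING_API_VERSIONS, READY }
> 
>     private final Map<String, State> states = new HashMap<>();
>     private final Map<String, Long> connectStartMs = new HashMap<>();
>     private final Set<String> connectingNodes = new HashSet<>();
>     private final long connectionSetupTimeoutMs = 10_000L;
> 
>     void connecting(String id, long nowMs) {
>         states.put(id, State.CONNECTING);
>         connectStartMs.put(id, nowMs);
>         connectingNodes.add(id);            // eligible for setup-timeout checks
>     }
> 
>     void checkingApiVersions(String id) {
>         states.put(id, State.CHECKING_API_VERSIONS);
>         connectingNodes.remove(id);         // no longer considered by the timeout scan
>     }
> 
>     void ready(String id) {
>         states.put(id, State.READY);
>         connectingNodes.remove(id);
>     }
> 
>     // Only nodes still in connectingNodes can time out here; a node stuck in
>     // CHECKING_API_VERSIONS is never returned, so it hangs until something
>     // like the idle-connection reaper eventually closes it.
>     List<String> nodesWithConnectionSetupTimeout(long nowMs) {
>         List<String> expired = new ArrayList<>();
>         for (String id : connectingNodes) {
>             if (nowMs - connectStartMs.get(id) > connectionSetupTimeoutMs) {
>                 expired.add(id);
>             }
>         }
>         return expired;
>     }
> }
> {code}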



--
This message was sent by Atlassian Jira
(v8.3.4#803005)