Posted to jira@kafka.apache.org by "Randall Hauch (Jira)" <ji...@apache.org> on 2021/06/18 17:10:00 UTC
[jira] [Commented] (KAFKA-12252) Distributed herder tick thread loops rapidly when worker loses leadership

    [ https://issues.apache.org/jira/browse/KAFKA-12252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365610#comment-17365610 ] 

Randall Hauch commented on KAFKA-12252:
---------------------------------------

Backported to 2.6 for inclusion in any subsequent 2.6.3 patch release.

> Distributed herder tick thread loops rapidly when worker loses leadership
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-12252
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12252
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>             Fix For: 3.0.0, 2.6.3, 2.7.2, 2.8.1
>
>
> When a new session key is read from the config topic, if the worker is the leader, it [schedules a new key rotation|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L1579-L1581]. The time between key rotations is configurable but defaults to an hour.
> The herder then continues its tick loop, which usually ends with a long poll for rebalance activity. However, when a key rotation is scheduled, it will [limit the time spent polling|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L384-L388] at the end of the tick loop in order to be able to perform the rotation.
> Once woken up, the worker checks to see if a key rotation is necessary and, if so, [sets the expected key rotation time to Long.MAX_VALUE|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L344], then [writes a new session key to the config topic|https://github.com/apache/kafka/blob/bf4afae8f53471ab6403cbbfcd2c4e427bdd4568/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L345-L348]. The problem is, [the worker only ever decides a key rotation is necessary if it is still the leader|https://github.com/apache/kafka/blob/5cf9cfcaba67cffa2435b07ade58365449c60bd9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L456-L469]. If the worker is no longer the leader at the time of the key rotation (likely due to falling out of the cluster after losing contact with the group coordinator), its key expiration time won’t be reset, and the long poll for rebalance activity at the end of the tick loop will be given a timeout of 0 ms and result in the tick loop being immediately restarted. Even if the worker reads a new session key from the config topic, it’ll continue looping like this since its scheduled key rotation won’t be updated. At this point, the only thing that would help the worker get back into a healthy state would be if it were made the leader of the cluster again.
> One possible fix is to add a check in the tick thread so that the time spent on rebalance polling is only limited if the worker is currently the leader.
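
The timeout arithmetic described above, and the effect of the proposed leader check, can be sketched in isolation. This is a minimal illustration and not the actual DistributedHerder code: the class name, method names, and parameters below are hypothetical, and the only detail taken from the description is the use of Long.MAX_VALUE to mean that no rotation is scheduled.

```java
public class TickTimeoutSketch {
    // Sentinel meaning "no key rotation scheduled", as in the issue description.
    static final long NO_ROTATION = Long.MAX_VALUE;

    /**
     * Hypothetical model of the reported behavior: a scheduled rotation
     * always caps the rebalance poll, whether or not the worker is leader.
     */
    static long pollTimeoutBeforeFix(long nowMs, long keyExpirationMs, long defaultPollMs) {
        if (keyExpirationMs == NO_ROTATION) {
            return defaultPollMs;
        }
        // Once the rotation time has passed, the difference is negative
        // and the timeout is clamped to 0 ms.
        return Math.max(0L, Math.min(defaultPollMs, keyExpirationMs - nowMs));
    }

    /**
     * Hypothetical model of the proposed fix: only a leader lets the
     * scheduled rotation shorten the poll.
     */
    static long pollTimeoutWithFix(boolean isLeader, long nowMs, long keyExpirationMs, long defaultPollMs) {
        if (!isLeader) {
            return defaultPollMs;
        }
        return pollTimeoutBeforeFix(nowMs, keyExpirationMs, defaultPollMs);
    }

    public static void main(String[] args) {
        long now = 10_000L;
        long staleExpiration = 5_000L; // rotation time already passed and never reset
        long defaultPoll = 3_000L;

        // Former leader, without the fix: 0 ms timeout, so the tick loop spins.
        System.out.println(pollTimeoutBeforeFix(now, staleExpiration, defaultPoll));
        // Former leader, with the fix: the normal poll timeout is used.
        System.out.println(pollTimeoutWithFix(false, now, staleExpiration, defaultPoll));
        // Current leader: still wakes immediately so it can perform the rotation.
        System.out.println(pollTimeoutWithFix(true, now, staleExpiration, defaultPoll));
    }
}
```

Run as written, this prints 0, 3000, and 0. The leader case still returns 0 ms, which is harmless under the behavior described above because a leader resets its expected rotation time to Long.MAX_VALUE when it performs the rotation.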


