Posted to jira@kafka.apache.org by "Barnabas Maidics (Jira)" <ji...@apache.org> on 2021/05/17 13:02:00 UTC

[jira] [Created] (KAFKA-12798) Fixing MM2 rebalance timeout issue when source cluster is not available

Barnabas Maidics created KAFKA-12798:
----------------------------------------

             Summary: Fixing MM2 rebalance timeout issue when source cluster is not available
                 Key: KAFKA-12798
                 URL: https://issues.apache.org/jira/browse/KAFKA-12798
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker, replication
            Reporter: Barnabas Maidics


If the network configuration of a source cluster that takes part in a replication flow changes (for instance, the port number changes because TLS is enabled or disabled), MirrorMaker2 won't update its internal configuration, even after a reconfiguration followed by a restart.

What happens in MirrorMaker2 after a cluster "identity" (i.e. connectivity config) changes:
 # MM2 driver (MirrorMaker class) starts up with the new config.
 # DistributedHerder joins a dedicated consumer group that decides which driver instance has control over the assignments and the configuration topic.
 # The driver caches the consumer group assignment, which indicates that it is the leader of the group.
 # The driver reads the configuration topic (which does not yet contain the new config), and starts the MM2 connectors.
 # Since the old config is invalid, the connectors cannot connect to the cluster anymore - MirrorSourceConnector tries to query the cluster through the admin client, but the queries time out after 2 minutes in total (the connector has 2 tasks affecting the source cluster, each with a 1-minute timeout).
 ## In the meantime, the background heartbeat thread checks the state of the herder's consumer membership. There is a default rebalance timeout of 1 minute. Since the herder thread is blocked by the connector query timeouts, it cannot call poll on the consumer. The heartbeat thread invalidates the consumer membership and triggers the creation of a new consumer.
 # The herder thread finishes the connector startup, and after realizing that the configuration has changed, tries to update the config topic.
 ## The config topic can only be updated by the leader herder.
 ## The driver checks the group assignment to see if it is the leader.
 ## In the local cache, the old assignment is present, in which the leader is the previous consumer with its own ID.
 ## The current consumer ID of the driver does not match the cached leader ID.
 # The driver refuses to update the config topic.
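
The leader check in the last steps can be modelled with a small sketch. It is not the actual Kafka Connect code: the class, field, and method names below are hypothetical, and it only assumes what the steps above describe, namely that the cached assignment keeps the old leader ID while the driver's consumer gets a new member ID.

```java
import java.util.Objects;

// Hypothetical model of the stale-leader check; not the real herder classes.
public class StaleLeaderSketch {
    // Leader ID taken from the assignment cached when the driver first joined.
    private final String cachedLeaderId;
    // Member ID of the consumer the driver currently owns.
    private String currentMemberId;

    public StaleLeaderSketch(String initialMemberId) {
        this.cachedLeaderId = initialMemberId; // the driver was elected leader
        this.currentMemberId = initialMemberId;
    }

    // Models the heartbeat thread replacing the consumer after the
    // rebalance timeout; the cached assignment is not refreshed.
    public void replaceConsumer(String newMemberId) {
        this.currentMemberId = newMemberId;
    }

    // The config topic may only be written when this returns true.
    public boolean isLeader() {
        return Objects.equals(cachedLeaderId, currentMemberId);
    }

    public static void main(String[] args) {
        StaleLeaderSketch driver = new StaleLeaderSketch("member-1");
        System.out.println("leader before timeout: " + driver.isLeader());
        driver.replaceConsumer("member-2");
        System.out.println("leader after timeout: " + driver.isLeader());
    }
}
```

In this model the driver is still the only member of the group, yet it fails its own leader check because the comparison uses the outdated cached ID.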

[~durban], thanks for digging deeper into this issue

*The proposed fix for this:*
The rebalance issue can be fixed by decreasing the time that we wait at MM2 startup for tasks that affect the source cluster. With a shorter timeout (15 seconds by default instead of 1 minute), the tasks affecting the source cluster won't block for too long when the Kafka config is stale, so the herder will be able to update the config topic. The timeout is now configurable.
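
The idea of bounding the startup wait can be illustrated with plain `java.util.concurrent` primitives. This is only a sketch under assumptions, not the patch itself: `callWithTimeout` and the class name are hypothetical, and the slow task stands in for an admin client query against an unreachable source cluster.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: wait for a startup task, but never longer than the timeout.
public class BoundedStartupSketch {

    static <T> Optional<T> callWithTimeout(ExecutorService pool, Callable<T> task, Duration timeout) {
        Future<T> future = pool.submit(task);
        try {
            return Optional.ofNullable(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // give up and interrupt the worker
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for a query against an unreachable cluster.
        Optional<String> result = callWithTimeout(pool, () -> {
            Thread.sleep(5_000);
            return "topics";
        }, Duration.ofMillis(200));
        System.out.println("result present: " + result.isPresent());
        pool.shutdownNow();
    }
}
```

The caller gets control back once the timeout elapses, so a thread that must keep polling the consumer is not held up for the full duration of a failing query.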

The number of threads in the scheduler also had to be increased so that other tasks won't be blocked.
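
Why the thread count matters can be shown with a generic `ScheduledExecutorService`. This is a sketch and not MM2's own scheduler class: `quickTaskRuns` is a hypothetical helper that checks whether a short task gets to run while a long-running task occupies the pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demonstration of a blocked task starving a single-threaded scheduler.
public class SchedulerThreadsSketch {

    // Returns true if a quick task completes within waitMillis while a
    // slow task is still running on a scheduler with the given thread count.
    static boolean quickTaskRuns(int threads, long waitMillis) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(threads);
        CountDownLatch slowStarted = new CountDownLatch(1);
        CountDownLatch quickDone = new CountDownLatch(1);
        try {
            scheduler.execute(() -> {
                slowStarted.countDown();
                try {
                    Thread.sleep(10_000); // stand-in for a blocked cluster query
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            slowStarted.await(); // make sure the slow task holds a thread
            scheduler.execute(quickDone::countDown);
            return quickDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow(); // interrupts the slow task
        }
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + quickTaskRuns(1, 300));
        System.out.println("2 threads: " + quickTaskRuns(2, 300));
    }
}
```

With one thread the quick task stays queued behind the blocked one; with two threads it is picked up by the second worker.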

*Testing done:* 
 #  configured replication between source->target
 #  checked that the replication was working
 #  changed the source Kafka cluster broker port
 #  restarted Kafka/MirrorMaker2, produced new messages in the replicated topic
 #  after the restart, MM2 kept trying to use the old Kafka configs and, even after a long time, could not replicate. After applying the fix, the issue was solved and replication worked.

Also tested with the same scenario, but instead of changing the port, SSL was turned on for the source Kafka cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)