Posted to jira@kafka.apache.org by "Chris Egerton (Jira)" <ji...@apache.org> on 2022/07/22 20:25:00 UTC
[jira] [Comment Edited] (KAFKA-14093) Flaky ExactlyOnceSourceIntegrationTest.testFencedLeaderRecovery

    [ https://issues.apache.org/jira/browse/KAFKA-14093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570169#comment-17570169 ] 

Chris Egerton edited comment on KAFKA-14093 at 7/22/22 8:24 PM:
----------------------------------------------------------------

Ah, I think the root cause of this is rather simple; the Javadoc for the test calls out that it uses a one-node cluster, and the [check we do|https://github.com/apache/kafka/blob/679e9e0cee67e7d3d2ece204a421ea7da31d73e9/connect/runtime/src/test/java/org/apache/kafka/connect/integration/ExactlyOnceSourceIntegrationTest.java#L493-L494] to make sure that the leader is up and running (and has had a chance to instantiate its transactional producer for the config topic) relies on that fact. However, I forgot to actually reduce the cluster size to one in the test, so right now it's running with three workers.

As a result, it's possible that the readiness check we're doing hits a follower worker and passes before the leader has had a chance to finish startup.
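The race described above can be shown with a toy model. This is only a sketch: `Worker`, `probePasses`, `leaderStarted`, and `threeWorkerCluster` are made-up names for illustration, not classes or methods from Kafka Connect or its test framework, and the real check goes through the workers' REST API rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadinessProbeSketch {
    /** Hypothetical stand-in for a Connect worker; not a real Kafka class. */
    static final class Worker {
        final boolean leader;
        final boolean started;

        Worker(boolean leader, boolean started) {
            this.leader = leader;
            this.started = started;
        }
    }

    /** Probes exactly one worker, the way a request to an arbitrary node would. */
    static boolean probePasses(List<Worker> cluster, int probedIndex) {
        return cluster.get(probedIndex).started;
    }

    /** What the test actually needs to know: has the leader finished startup? */
    static boolean leaderStarted(List<Worker> cluster) {
        for (Worker w : cluster) {
            if (w.leader) {
                return w.started;
            }
        }
        return false;
    }

    /** Three workers: the leader is still starting, both followers are up. */
    static List<Worker> threeWorkerCluster() {
        List<Worker> cluster = new ArrayList<>();
        cluster.add(new Worker(true, false));
        cluster.add(new Worker(false, true));
        cluster.add(new Worker(false, true));
        return cluster;
    }
}
```

With three workers, a probe that lands on index 1 or 2 passes even though `leaderStarted` is still false. With a single worker, the only node a probe can reach is the leader, so the two answers cannot disagree.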

I've published a quick fix that reduces the cluster to a single worker, and also adds client IDs for producers created for testing so that they can be distinguished from any producers created by the Connect cluster.
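For illustration, tagging a test producer with a client ID might look roughly like the following. This is a sketch and not the code from the fix: the helper name `producerProps` and the example values are assumptions, and only the config keys `bootstrap.servers` and `client.id` are standard Kafka producer settings.

```java
import java.util.HashMap;
import java.util.Map;

final class TestProducerPropsSketch {
    /**
     * Hypothetical helper: builds producer properties that carry an explicit
     * client ID, so a producer created by the test can be told apart from
     * producers created by the Connect cluster in logs and metrics.
     */
    static Map<String, Object> producerProps(String bootstrapServers, String clientId) {
        Map<String, Object> props = new HashMap<>();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("client.id", clientId);
        return props;
    }
}
```

A map like this could then be passed to a `KafkaProducer` constructor along with the usual serializer settings.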

I've also filed KAFKA-14098 as a potential follow-up item to add meaningful client IDs to the internal clients set up by Connect workers.


> Flaky ExactlyOnceSourceIntegrationTest.testFencedLeaderRecovery
> ---------------------------------------------------------------
>
>                 Key: KAFKA-14093
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14093
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.3.0
>            Reporter: Mickael Maison
>            Priority: Major
>         Attachments: org.apache.kafka.connect.integration.ExactlyOnceSourceIntegrationTest.testFencedLeaderRecovery.test.stdout
>
>
> org.apache.kafka.connect.integration.ExactlyOnceSourceIntegrationTest > testFencedLeaderRecovery FAILED
>     java.lang.AssertionError: expected org.apache.kafka.connect.runtime.rest.errors.ConnectRestException to be thrown, but nothing was thrown



--
This message was sent by Atlassian Jira
(v8.20.10#820010)