Posted to jira@kafka.apache.org by "Randall Hauch (Jira)" <ji...@apache.org> on 2020/07/20 14:04:00 UTC

[jira] [Updated] (KAFKA-10286) Connect system tests should wait for workers to join group

     [ https://issues.apache.org/jira/browse/KAFKA-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Hauch updated KAFKA-10286:
----------------------------------
    Fix Version/s: 2.7.0
                   2.5.1
                   2.4.2
                   2.6.0
                   2.3.2

> Connect system tests should wait for workers to join group
> ----------------------------------------------------------
>
>                 Key: KAFKA-10286
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10286
>             Project: Kafka
>          Issue Type: Test
>          Components: KafkaConnect
>    Affects Versions: 2.6.0
>            Reporter: Greg Harris
>            Assignee: Greg Harris
>            Priority: Minor
>             Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1, 2.7.0
>
>
> There are a few flaky test failures for {{connect_distributed_test}} in which one of the workers does not join the group quickly, and the test fails in the following manner:
>  # The test starts each of the connect workers, and waits for their REST APIs to become available
>  # All workers start up, complete plugin scanning, and start their REST API
>  # At least one worker kicks off an asynchronous job to join the group, which hangs for an as-yet-unknown reason (30s timeout)
>  # The test continues without all of the members joined
>  # The test makes a call to the REST API that it expects to succeed, and gets an error
>  # The test fails without the worker ever joining the group
> Instead of allowing the test to fail in this manner, we could wait for each worker to join the group with the existing 60s startup timeout. This change would go into effect for all system tests using the {{ConnectDistributedService}}, currently just {{connect_distributed_test}} and {{connect_rest_test}}. 
> Alternatively, we could retry the operation that failed, or ensure that we use a known-good worker to continue the test, but these would require more involved code changes. The existing wait-for-startup logic is the most natural place to fix this issue.
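
A minimal sketch of the proposed wait, for illustration only. It is not the actual {{ConnectDistributedService}} code: the names {{wait_until}}, {{start_workers_and_wait_for_group}}, {{rest_is_up}} and {{has_joined_group}} are hypothetical, and how a worker's group membership is detected (log message, REST call, etc.) is left to the caller-supplied predicate. The only detail taken from the description above is the 60s startup timeout.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.5, err_msg=""):
    """Poll `condition` until it returns True or `timeout_sec` elapses.

    Hypothetical stand-in for whatever polling helper the test framework
    provides; raises TimeoutError if the condition never becomes true.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def start_workers_and_wait_for_group(workers, rest_is_up, has_joined_group,
                                     startup_timeout_sec=60, backoff_sec=0.5):
    """Treat a worker as started only once its REST API is up AND it has
    joined the group, instead of checking the REST API alone.

    `rest_is_up` and `has_joined_group` are caller-supplied predicates
    (assumptions of this sketch) that take a worker and return a bool.
    Both checks share the single startup timeout for each worker.
    """
    for worker in workers:
        wait_until(
            lambda w=worker: rest_is_up(w) and has_joined_group(w),
            timeout_sec=startup_timeout_sec,
            backoff_sec=backoff_sec,
            err_msg="Worker %s did not join the group within %ss"
                    % (worker, startup_timeout_sec),
        )
```

With this shape, a worker whose join attempt hangs causes a clear startup timeout rather than an unrelated REST error later in the test.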



--
This message was sent by Atlassian Jira
(v8.3.4#803005)