Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/07/18 08:18:53 UTC

[GitHub] [kafka] gharris1727 opened a new pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

gharris1727 opened a new pull request #9040:
URL: https://github.com/apache/kafka/pull/9040


   Signed-off-by: Greg Harris <gr...@confluent.io>
   
   Currently, the system tests `connect_distributed_test` and `connect_rest_test` only wait for the REST API to come up.
   The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
   There are some situations in which this sync takes an unusually long time, and the test continues without all workers up.
   This leads to flaky test failures: unless the test waits explicitly, worker joins are not given enough time to time out and retry.
   
   This changes the ConnectDistributedTest to wait for the `Joined group` message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.
   
   This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.
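
The wait described above can be illustrated with a minimal, self-contained sketch. It is not the code from this pull request: the function name `wait_for_log_message`, its parameters, and the idea of polling a local file are assumptions made for illustration, and only the `Joined group` marker comes from the description.

```python
import time


def wait_for_log_message(path, needle="Joined group", timeout_sec=60.0, poll_sec=0.5):
    """Return True once `needle` appears in the file at `path`, False on timeout.

    Hypothetical helper: the real system tests run against remote nodes,
    so this local-file version only illustrates the polling idea.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        try:
            with open(path, "r", errors="replace") as log:
                # Re-scan the whole file on each poll; simple, not efficient.
                if any(needle in line for line in log):
                    return True
        except FileNotFoundError:
            pass  # the worker may not have created its log file yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_sec)
```

Matching on the substring rather than the full line is what would let a check like this tolerate the slightly different forms of the message across versions.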
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] kkonstantine commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
kkonstantine commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-662605383


   Given this situation during startup, it makes sense. Thanks for the additional details @gharris1727 





[GitHub] [kafka] rhauch commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
rhauch commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-660996930


   @gharris1727 Thanks for the proposed fix. I've started a build for this branch: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4047/





[GitHub] [kafka] rhauch commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
rhauch commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-661046734


   All 69 Connect system tests passed: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-20--001.1595250901--gharris1727--wait-for-join--2d2265f3b/report.html





[GitHub] [kafka] kkonstantine commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
kkonstantine commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-661228560


   Thanks for the fix @gharris1727. It's good to stabilize these tests; they've been relatively flaky for some time.
   
   With respect to the actual fix, I have to note that we've been moving away from assertions that depend on log messages for some time. Yet, if that's the only way to stabilize these tests, maybe we can make an exception here.
   
   Having said that, I'm curious why a stable group with all the members joined makes this test successful. I'd imagine that this would not be a requirement (a cluster with 1 worker is as good as a cluster with 3 workers, modulo capacity requirements, which these tests should not assert on). There might be value in retaining our expectations while the worker group is being formed, because assuming things work only with a stable full group might make us miss bugs during phases of failure and recovery.
   
   If your investigation revealed some more specific reasons around why these tests were flaky while the group was forming, it'd be good to know. 





[GitHub] [kafka] gharris1727 commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-660540672


   @aakashnshah can you please take a look at this?





[GitHub] [kafka] gharris1727 commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-660722898


   @rhauch cc for review, thanks!





[GitHub] [kafka] rhauch edited a comment on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
rhauch edited a comment on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-660996930


   @gharris1727 Thanks for the proposed fix. I've started a system test build for this branch: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/4047/





[GitHub] [kafka] gharris1727 commented on pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on pull request #9040:
URL: https://github.com/apache/kafka/pull/9040#issuecomment-661296268


   @kkonstantine In my investigation for this fix, I noticed that the long join was happening in the first worker to join the group, as it was creating the `connect-offsets` topic. The broker logs indicated that the topic was created in a timely manner, and that it was visible to the other workers that joined afterwards. The first worker remained waiting for the topic creation result after the other two workers had been started, causing the test to fail.
   
   I could only pick out one suspicious thing about the create topic operation on the broker, as I am not very familiar with broker logs. For the successful create topic operations, these unblocked messages appeared:
   ```
   [2020-07-05 09:47:32,423] DEBUG Request key TopicKey(connect-status) unblocked 1 topic operations (kafka.server.DelayedOperationPurgatory)
   [2020-07-05 09:47:32,423] DEBUG [Admin Manager on Broker 1]: Request key connect-status unblocked 1 topic requests. (kafka.server.AdminManager)
   ```
   These did not appear for the excessively long create topic operation. Reading the log message literally, it's possible that the operation either never enters the DelayedOperationPurgatory or is never released from it, and thus the operation times out on the client side without the worker finding out that the request was fulfilled.
   
   I think this is benign in our case, and a retry will be able to recover the test by discovering the topic has already been created.
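
To make the recovery path concrete, here is a small sketch of a retry that treats an already-existing topic as success. It is only an illustration of the reasoning above, not how the Connect worker is implemented: `TopicCreateTimeout`, `ensure_topic`, and the `create_topic` / `topic_exists` callables are all hypothetical names.

```python
class TopicCreateTimeout(Exception):
    """Hypothetical error: the client gave up waiting for the create result."""


def ensure_topic(create_topic, topic_exists, name, attempts=3):
    """Return True once the topic is known to exist, False if every attempt fails.

    `create_topic(name)` and `topic_exists(name)` are caller-supplied
    callables standing in for an admin client (assumed for this sketch).
    """
    for _ in range(attempts):
        # A previous attempt may have succeeded on the broker even though
        # the client timed out, so check before creating again.
        if topic_exists(name):
            return True
        try:
            create_topic(name)
            return True
        except TopicCreateTimeout:
            continue  # retry; the next pass re-checks existence first
    return topic_exists(name)
```

With a broker that creates the topic but never delivers the response, the first attempt times out and the second returns immediately from the existence check, which matches the benign outcome described above.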
   
   [Logs for that run with the long join](http://confluent-kafka-2-6-system-test-results.s3-us-west-2.amazonaws.com/2020-07-05--001.1593942687--confluentinc--2.6--926929cad/ConnectDistributedTest/test_pause_state_persistent/connect_protocol%3Dcompatible/689.tgz)





[GitHub] [kafka] rhauch merged pull request #9040: KAFKA-10286: Connect system tests should wait for workers to join group

Posted by GitBox <gi...@apache.org>.
rhauch merged pull request #9040:
URL: https://github.com/apache/kafka/pull/9040


   

