You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2023/01/06 10:21:16 UTC

[GitHub] [kafka] ashwinpankaj opened a new pull request, #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

ashwinpankaj opened a new pull request, #13084:
URL: https://github.com/apache/kafka/pull/13084

   https://issues.apache.org/jira/browse/KAFKA-14598
   
   ConnectRestApiTest sometimes fails with the message
   ```
   ConnectRestError(404, '<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>\n<title>Error 404 Not Found</title>\n</head>\n<body><h2>HTTP ERROR 404 Not Found</h2>\n<table>\n<tr><th>URI:</th><td>/connector-plugins/</td></tr>\n<tr><th>STATUS:</th><td>404</td></tr>\n<tr><th>MESSAGE:</th><td>Not Found</td></tr>\n<tr><th>SERVLET:</th><td>-</td></tr>\n</table>\n\n</body>\n</html>\n', 'http://172.31.1.75:8083/connector-plugins/')
   ```
   This happens because ConnectDistributedService.start() by default waits till the the line`Joined group at generation ..` is visible in the logs.
   
   In most cases this is sufficient. But in the cases where the test fails, we see that this message appears even before Connect RestServer has finished initialization.
   ```
   [2022-12-15 15:40:29,064] INFO [Worker clientId=connect-1, groupId=connect-cluster] Joined group at generation 2 with protocol version 1 and got assignment: Assignment
   Unknown macro: {error=0, leader='connect-1-07d9da63-9acb-4633-aee4-1ab79f4ab1ae', leaderUrl='http}
   with rebalance delay: 0 (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
   
   [2022-12-15 15:40:29,560] INFO 172.31.5.66 - - [15/Dec/2022:15:40:29 +0000] "GET /connector-plugins/ HTTP/1.1" 404 375 "-" "python-requests/2.24.0" 71 (org.apache.kafka.connect.runtime.rest.RestServer)
   [2022-12-15 15:40:29,579] INFO REST resources initialized; server is started and ready to handle requests (org.apache.kafka.connect.runtime.rest.RestServer)
   ```
   
   * Ran the fixed test. In connect logs we can see that `GET /connectors` is called multiple times till it returns 200 OK. This shows that the test waits till Connect REST service is up.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] gharris1727 commented on pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on PR #13084:
URL: https://github.com/apache/kafka/pull/13084#issuecomment-1375950088

   Ah, I recall working on this before. I was the one that added that STARTUP_MODE_JOIN override you linked: https://github.com/apache/kafka/pull/9040
   
   In the description, I mentioned how JOIN was a superset of LISTEN, and I think that's still the case. The jetty server starts: https://github.com/apache/kafka/blob/e38526e375389868664c8977c7a2125e5da2388c/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/rest/RestServer.java#L207 before the herder does: https://github.com/apache/kafka/blob/e38526e375389868664c8977c7a2125e5da2388c/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Connect.java#L53-L54
   
   However, as you've noticed, the registration of the resources occurs _after_ the server begins listening, and _after_ the herder joins the group. So neither LISTEN or JOIN is sufficient to ensure that the resources are registered. But changing from JOIN to LISTEN is going to have the opposite effect that you're intending, as the LISTEN condition is true even earlier than JOIN is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] gharris1727 commented on a diff in pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on code in PR #13084:
URL: https://github.com/apache/kafka/pull/13084#discussion_r1072856520


##########
tests/kafkatest/tests/connect/connect_rest_test.py:
##########
@@ -90,7 +90,8 @@ def test_rest_api(self, connect_protocol, metadata_quorum):
         self.cc.set_configs(lambda node: self.render("connect-distributed.properties", node=node))
         self.cc.set_external_configs(lambda node: self.render("connect-file-external.properties", node=node))
 
-        self.cc.start()
+        self.logger.info("Waiting till Connect REST server is listening")
+        self.cc.start(mode=ConnectServiceBase.STARTUP_MODE_LISTEN)

Review Comment:
   I think this is unnecessary for the stabilization fix, and actually weakens the test. Because this is actually creating the connectors in distributed mode, I think it would be smart to wait for the cluster to actually join the cluster.
   
   So we can revert these two lines and leave it as it was.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] gharris1727 commented on pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on PR #13084:
URL: https://github.com/apache/kafka/pull/13084#issuecomment-1374130131

   Is this a change in behavior? It appears that the default mode is LISTEN: https://github.com/apache/kafka/blob/95910af3a9125c3c67fe5daebf1e01d7ec6f20c7/tests/kafkatest/services/connect.py#L78 and it is only overwritten if the incoming mode is truey: https://github.com/apache/kafka/blob/95910af3a9125c3c67fe5daebf1e01d7ec6f20c7/tests/kafkatest/services/connect.py#L124-L125
   
   It seems like the problem is not that the startup_mode is wrong, but that the STARTUP_MODE_LISTEN is not waiting for all resources to be registered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] gharris1727 commented on pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by GitBox <gi...@apache.org>.
gharris1727 commented on PR #13084:
URL: https://github.com/apache/kafka/pull/13084#issuecomment-1386092163

   Thanks @ashwinpankaj for following up, I think that this is good after one last nit comment.
   
   > To test this theory, in my latest revision I have set retry_on_exc to True in start_and_wait_to_start_listening().
   I also added a 40 second sleep in [RestServer.initializeResources()](https://github.com/ashwinpankaj/kafka/blob/568c443a3ec31fd682133620cb38fdefcfe0b82f/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/rest/RestServer.java#L294).
   ConnectRestApiTests passed inspite of the delay.
   
   I was going to suggest something like this, thanks for verifying that the fix actually stabilizes the test!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] github-actions[bot] commented on pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #13084:
URL: https://github.com/apache/kafka/pull/13084#issuecomment-1585816432

   This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has  merge conflicts, please update it with the latest from trunk (or appropriate release branch) <p> If this PR is no longer valid or desired, please feel free to close it. If no activity occurrs in the next 30 days, it will be automatically closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] KAFKA-14598: Fix flaky ConnectRestApiTest [kafka]

Posted by "ashwinpankaj (via GitHub)" <gi...@apache.org>.
ashwinpankaj closed pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest
URL: https://github.com/apache/kafka/pull/13084


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] ashwinpankaj commented on pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by GitBox <gi...@apache.org>.
ashwinpankaj commented on PR #13084:
URL: https://github.com/apache/kafka/pull/13084#issuecomment-1377104369

   Thanks @gharris1727 for clarifying the reason why JOIN was used as the mode initially.
   
   I checked the code for [start_and_wait_to_start_listening](https://github.com/apache/kafka/blob/def8d724c82f84964f353f8cf45ef8d52247b2cc/tests/kafkatest/services/connect.py#L140).
   Here the test should ideally wait till it is able to [succesfully list connectors](https://github.com/apache/kafka/blob/def8d724c82f84964f353f8cf45ef8d52247b2cc/tests/kafkatest/services/connect.py#L113).
   Once the test code is able to list connectors, we can assume that the rest server is up.
   
   I think if we set the [retry_on_exc flag to true in wait_until](https://github.com/confluentinc/ducktape/blob/e54c75669b572c51630f1a44f028571871098967/ducktape/utils/util.py#L52), we will achieve the desired effect of trying to list connectors, backing off if that attempt fails and retrying again till a timeout of 60s expires.
   
   To test this theory, in my latest revision I have set retry_on_exc to True in start_and_wait_to_start_listening().
   I also added a 40 second sleep in [RestServer.initializeResources()](https://github.com/ashwinpankaj/kafka/blob/568c443a3ec31fd682133620cb38fdefcfe0b82f/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/rest/RestServer.java#L294).
   ConnectRestApiTests passed inspite of the delay.
   
   Hope this gives us the confidence that the fix works.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] ashwinpankaj commented on pull request #13084: KAFKA-14598: Fix flaky ConnectRestApiTest

Posted by GitBox <gi...@apache.org>.
ashwinpankaj commented on PR #13084:
URL: https://github.com/apache/kafka/pull/13084#issuecomment-1375070970

   > Is this a change in behavior? It appears that the default mode is LISTEN:
   > 
   > https://github.com/apache/kafka/blob/95910af3a9125c3c67fe5daebf1e01d7ec6f20c7/tests/kafkatest/services/connect.py#L78
   > 
   > and it is only overwritten if the incoming mode is truey:
   > https://github.com/apache/kafka/blob/95910af3a9125c3c67fe5daebf1e01d7ec6f20c7/tests/kafkatest/services/connect.py#L124-L125
   > 
   > It seems like the problem is not that the startup_mode is wrong, but that the STARTUP_MODE_LISTEN is not waiting for all resources to be registered.
   
   Thanks for taking a look @gharris1727 !
   The code which you linked is for ConnectServiceBase.
   ConnectRestApiTest uses [ConnectDistributedService, which overrides startup_mode to STARTUP_MODE_JOIN](https://github.com/apache/kafka/blob/95910af3a9125c3c67fe5daebf1e01d7ec6f20c7/tests/kafkatest/services/connect.py#L376) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org