You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by "zpinto (via GitHub)" <gi...@apache.org> on 2023/04/15 00:40:12 UTC

[GitHub] [helix] zpinto opened a new pull request, #2451: Fix flaky integration test for helix-rest for TestClusterAccessor#testClusterFreeze

zpinto opened a new pull request, #2451:
URL: https://github.com/apache/helix/pull/2451

   ### Issues
   
   - [x] Fixes #2404 
   - [x] Fixes #2229 
   - [x] Fixes #1979 
   - [x] Fixes #1832
   
   (#200 - Link your issue number here: You can write "Fixes #XXX". Please use the proper keyword so that the issue gets closed automatically. See https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue
   Any of the following keywords can be used: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved)
   
   ### Description
   
   #### Fixing Flaky Unit Test TestClusterAccessor#testClusterFreeze
   
   #### What is this test doing?

   
   This integration test is making a POST request to helix-rest to place cluster TestCluster_0 into CLUSTER_FREEZE status. It then attempts to verify that the following GET request for clusterStatus is in CLUSTER_FREEZE mode.
   
   #### Why is this test likely flaky?
   
   Following the POST request, we check that the `pauseSignal` which we added exists under the CONTROLLER znode, this will succeed because this znode is written to before the POST responds.
   
   The next test is to see if the `clusterMode` is in `CLUSTER_FREEZE`. We also check if the `clusterStatus` is either `IN_PROGRESS` or `COMPLETED`(meaning all state transitions are either canceled or completed and participants are now frozen/paused). The reason the GET request can sometimes return `clusterMode == NORMAL` is because the verification we are doing before we make the GET request is not a definitive signal that the `PAUSE` signal has caused the `clusterMode` to change to `CLUSTER_FREEZE`. This is because a `clusterPause` signal/event has to be enqueued to the `_managementModeEventQueue` and processed before we make the GET for `clusterStatus`. Until it has been processed, the `clusterStatus` will be NORMAL.
   
   Previously we were running this verification to check if the clusterStatus znode existed.
   
   ```
   TestHelper.verify(() -> dataAccessor.getBaseDataAccessor()
           .exists(dataAccessor.keyBuilder().clusterStatus().getPath(), AccessOption.PERSISTENT),
       TestHelper.WAIT_DURATION);
   ```
   
   This will not necessarily indicate that the pause event has been processed, as there could be an earlier event passed through the `_managementEventPipeline` that causes the znode to be created with `clusterMode == NORMAL`. This would lead us to make the GET request too early and the test to fail here:
   
   ```
   Assert.assertEquals(responseMap.get("mode"), ClusterManagementMode.Type.CLUSTER_FREEZE.name());
   ```
   
   Instead, we will verify that the `clusterStatus` znode has `CLUSTER_FREEZE` mode before we make the GET request. This will ensure that the pause event was already processed. 
   
   ### Tests
   
   - [x] Fix TestClusterAccessor#testClusterFreeze
   
   (List the names of added unit/integration tests)
   
   - The following is the result of the "mvn test" command on the appropriate module:
   
   TBD
   
   (If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)
   
   ### Changes that Break Backward Compatibility (Optional)
   
   NA
   
   
   
   ### Documentation (Optional)
   
   NA
   
   ### Commits
   
   - My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Code Quality
   
   - My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] junkaixue merged pull request #2451: Fix flaky integration test for helix-rest for TestClusterAccessor#testClusterFreeze

Posted by "junkaixue (via GitHub)" <gi...@apache.org>.
junkaixue merged PR #2451:
URL: https://github.com/apache/helix/pull/2451


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] junkaixue commented on pull request #2451: Fix flaky integration test for helix-rest for TestClusterAccessor#testClusterFreeze

Posted by "junkaixue (via GitHub)" <gi...@apache.org>.
junkaixue commented on PR #2451:
URL: https://github.com/apache/helix/pull/2451#issuecomment-1511747621

   Good job!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] zpinto commented on pull request #2451: Fix flaky integration test for helix-rest for TestClusterAccessor#testClusterFreeze

Posted by "zpinto (via GitHub)" <gi...@apache.org>.
zpinto commented on PR #2451:
URL: https://github.com/apache/helix/pull/2451#issuecomment-1511686142

   This PR is ready to merge, approved by @xyuanlu and @qqu0127
   
   Commit message:
   Fixing Flaky Integration Test TestClusterAccessor#testClusterFreeze
   
   What is this test doing?
   
   This integration test is making a POST request to helix-rest to place cluster TestCluster_0 into CLUSTER_FREEZE status. It then attempts to verify that the following GET request for clusterStatus is in CLUSTER_FREEZE mode.
   
   Why is this test likely flaky?
   
   Following the POST request, we check that the pauseSignal which we added exists under the CONTROLLER znode, this will succeed because this znode is written to before the POST responds.
   
   The next test is to see if the clusterMode is in CLUSTER_FREEZE. We also check if the clusterStatus is either IN_PROGRESS or COMPLETED(meaning all state transitions are either canceled or completed and participants are now frozen/paused). The reason the GET request can sometimes return clusterMode == NORMAL is because the verification we are doing before we make the GET request is not a definitive signal that the PAUSE signal has caused the clusterMode to change to CLUSTER_FREEZE. This is because a clusterPause signal/event has to be enqueued to the _managementModeEventQueue and processed before we make the GET for clusterStatus. Until it has been processed, the clusterStatus will be NORMAL.
   
   Previously we were running this verification to check if the clusterStatus znode existed.
   
   This will not necessarily indicate that the pause event has been processed, as there could be an earlier event passed through the _managementEventPipeline that causes the znode to be created with clusterMode == NORMAL. This would lead us to make the GET request too early.
   
   Instead, we will verify that the clusterStatus znode has CLUSTER_FREEZE mode before we make the GET request. This will ensure that the pause event was already processed.
   
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #2404
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #2229
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #1979
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #1832
   
   Testing Done:
   
   Run the integration test 10 times ./scripts/runSingleTest.sh TestClusterAccessor#testClusterFreezeMode 10 helix-rest


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] xyuanlu commented on pull request #2451: Fix flaky integration test for helix-rest for TestClusterAccessor#testClusterFreeze

Posted by "xyuanlu (via GitHub)" <gi...@apache.org>.
xyuanlu commented on PR #2451:
URL: https://github.com/apache/helix/pull/2451#issuecomment-1509435665

   Thanks for working on this! Really good analysis! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org