You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by "zpinto (via GitHub)" <gi...@apache.org> on 2023/04/17 16:14:48 UTC

[GitHub] [helix] zpinto commented on pull request #2451: Fix flaky integration test for helix-rest for TestClusterAccessor#testClusterFreeze

zpinto commented on PR #2451:
URL: https://github.com/apache/helix/pull/2451#issuecomment-1511686142

   This PR is ready to merge, approved by @xyuanlu and @qqu0127
   
   Commit message:
   Fixing Flaky Integration Test TestClusterAccessor#testClusterFreeze
   
   What is this test doing?
   
   This integration test is making a POST request to helix-rest to place cluster TestCluster_0 into CLUSTER_FREEZE status. It then attempts to verify that the following GET request for clusterStatus is in CLUSTER_FREEZE mode.
   
   Why is this test likely flaky?
   
   Following the POST request, we check that the pauseSignal which we added exists under the CONTROLLER znode, this will succeed because this znode is written to before the POST responds.
   
   The next test is to see if the clusterMode is in CLUSTER_FREEZE. We also check if the clusterStatus is either IN_PROGRESS or COMPLETED(meaning all state transitions are either canceled or completed and participants are now frozen/paused). The reason the GET request can sometimes return clusterMode == NORMAL is because the verification we are doing before we make the GET request is not a definitive signal that the PAUSE signal has caused the clusterMode to change to CLUSTER_FREEZE. This is because a clusterPause signal/event has to be enqueued to the _managementModeEventQueue and processed before we make the GET for clusterStatus. Until it has been processed, the clusterStatus will be NORMAL.
   
   Previously we were running this verification to check if the clusterStatus znode existed.
   
   This will not necessarily indicate that the pause event has been processed, as there could be an earlier event passed through the _managementEventPipeline that causes the znode to be created with clusterMode == NORMAL. This would lead us to make the GET request too early.
   
   Instead, we will verify that the clusterStatus znode has CLUSTER_FREEZE mode before we make the GET request. This will ensure that the pause event was already processed.
   
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #2404
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #2229
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #1979
   - fixes [Failed CI Test] testClusterFreezeMode(org.apache.helix.rest.server.TestClusterAccessor) #1832
   
   Testing Done:
   
   Run the integration test 10 times ./scripts/runSingleTest.sh TestClusterAccessor#testClusterFreezeMode 10 helix-rest


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org