You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2021/02/04 07:41:26 UTC

[GitHub] [helix] kaisun2000 opened a new issue #1630: Fix flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

kaisun2000 opened a new issue #1630:
URL: https://github.com/apache/helix/issues/1630


   Log> [ERROR] testLackEnoughInstances(org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack)  Time elapsed: 0.545 s  <<< FAILURE!
   2021-02-04T02:19:29.8130317Z org.apache.helix.HelixException: Failed to drop instance: localhost_12920. Retry times: 3
   2021-02-04T02:19:29.8134877Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack.testLackEnoughInstances(TestCrushAutoRebalanceNonRack.java:281)
   2021-02-04T02:19:29.8142900Z Caused by: org.apache.helix.zookeeper.exception.ZkClientException: Failed to delete /CLUSTER_TestCrushAutoRebalanceNonRack/INSTANCES/localhost_12920/MESSAGES
   2021-02-04T02:19:29.8148267Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack.testLackEnoughInstances(TestCrushAutoRebalanceNonRack.java:281)
   2021-02-04T02:19:29.8154189Z Caused by: org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /CLUSTER_TestCrushAutoRebalanceNonRack/INSTANCES/localhost_12920/MESSAGES
   2021-02-04T02:19:29.8160075Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack.testLackEnoughInstances(TestCrushAutoRebalanceNonRack.java:281)
   2021-02-04T02:19:29.8165107Z Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /CLUSTER_TestCrushAutoRebalanceNonRack/INSTANCES/localhost_12920/MESSAGES
   2021-02-04T02:19:29.8170141Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack.testLackEnoughInstances(TestCrushAutoRebalanceNonRack.java:281)
   2021-02-04T02:19:29.8173864Z 
   2021-02-04T02:19:29.8176834Z [ERROR] testLackEnoughInstances(org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack)  Time elapsed: 0.015 s  <<< FAILURE!
   2021-02-04T02:19:29.8181238Z org.apache.helix.HelixException: Cluster CLUSTER_TestCrushAutoRebalanceNonRack, instance: localhost_12920, instance config does not exist
   2021-02-04T02:19:29.8187135Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack.testLackEnoughInstances(TestCrushAutoRebalanceNonRack.java:273)
   2021-02-04T02:19:29.8282933Z 
   2021-02-04T02:19:29.8289351Z [ERROR] testLackEnoughInstances(org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled)  Time elapsed: 0.552 s  <<< FAILURE!
   2021-02-04T02:19:29.8306843Z org.apache.helix.HelixException: Failed to drop instance: localhost_12920. Retry times: 3
   2021-02-04T02:19:29.8312848Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled.testLackEnoughInstances(TestCrushAutoRebalanceTopoplogyAwareDisabled.java:86)
   2021-02-04T02:19:29.8332444Z Caused by: org.apache.helix.zookeeper.exception.ZkClientException: Failed to delete /CLUSTER_TestCrushAutoRebalanceTopoplogyAwareDisabled/INSTANCES/localhost_12920/MESSAGES
   2021-02-04T02:19:29.8339283Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled.testLackEnoughInstances(TestCrushAutoRebalanceTopoplogyAwareDisabled.java:86)
   2021-02-04T02:19:29.8346898Z Caused by: org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /CLUSTER_TestCrushAutoRebalanceTopoplogyAwareDisabled/INSTANCES/localhost_12920/MESSAGES
   2021-02-04T02:19:29.8354598Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled.testLackEnoughInstances(TestCrushAutoRebalanceTopoplogyAwareDisabled.java:86)
   2021-02-04T02:19:29.8425842Z Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /CLUSTER_TestCrushAutoRebalanceTopoplogyAwareDisabled/INSTANCES/localhost_12920/MESSAGES
   2021-02-04T02:19:29.8436793Z 	at org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled.testLackEnoughInstances(TestCrushAutoRebalanceTopoplogyAwareDisabled.java:86)
   2021-02-04T02:19:29.8441831Z 
   2021-02-04T02:19:30.2150311Z [ERROR] Failures: 
   2021-02-04T02:19:30.2158585Z [ERROR] org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack.testLackEnoughInstances(org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceNonRack)
   2021-02-04T02:19:30.2196371Z [ERROR]   Run 1: TestCrushAutoRebalanceNonRack.testLackEnoughInstances:281 » Helix Failed to dr...
   2021-02-04T02:19:30.2199157Z [ERROR]   Run 2: TestCrushAutoRebalanceNonRack.testLackEnoughInstances:273 » Helix Cluster CLUS...
   2021-02-04T02:19:30.2207049Z [ERROR] org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled.testLackEnoughInstances(org.apache.helix.integration.rebalancer.CrushRebalancers.TestCrushAutoRebalanceTopoplogyAwareDisabled)
   2021-02-04T02:19:30.2216641Z [ERROR]   Run 2: TestCrushAutoRebalanceTopoplogyAwareDisabled.testLackEnoughInstances:86->TestCrushAutoRebalanceNonRack.testLackEnoughInstances:281 » Helix
   2021-02-04T02:19:30.2219427Z [ERROR] Tests run: 1257, Failures: 2, Errors: 0, Skipped: 0
   2021-02-04T02:19:30.2296392Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M3:test (default-test) on project helix-core: There are test failures.
   
   code>
   ```
     public void testLackEnoughInstances(String rebalanceStrategyName, String rebalanceStrategyClass)
         throws Exception {
       System.out.println("TestLackEnoughInstances " + rebalanceStrategyName);
       enablePersistBestPossibleAssignment(_gZkClient, CLUSTER_NAME, true);
   
       // shutdown participants, keep only two left
       HelixDataAccessor helixDataAccessor =
           new ZKHelixDataAccessor(CLUSTER_NAME, InstanceType.PARTICIPANT, _baseAccessor);
       for (int i = 2; i < _participants.size(); i++) {
         MockParticipantManager p = _participants.get(i);
         p.syncStop();
         _gSetupTool.getClusterManagementTool().enableInstance(CLUSTER_NAME, p.getInstanceName(),
             false);
         Assert.assertTrue(TestHelper.verify(() -> {
           _gSetupTool.getClusterManagementTool()
               .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
           return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
               && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
         }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
         _gSetupTool.dropInstanceFromCluster(CLUSTER_NAME, p.getInstanceName());
       }
   
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 edited a comment on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 edited a comment on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700


   It seems that Huizhi fixed similar issue April 2020 by adding 
   ```
         Assert.assertTrue(TestHelper.verify(() -> {
           _gSetupTool.getClusterManagementTool()
               .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
           return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
               && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
         }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
   ```
   
   This addressed the previous other issues. There is still another race condition though. The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance. The fundamental issue is that HelixAdmin tool has not synchronization with Helix controller when changing messages of a participant.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 commented on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773618244


   You are right, I should say "it does not work".  It fixes other cases. Let me change the description there. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 edited a comment on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 edited a comment on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700


   It seems that Huizhi fixed similar issue April 2020 by adding 
   ```
         Assert.assertTrue(TestHelper.verify(() -> {
           _gSetupTool.getClusterManagementTool()
               .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
           return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
               && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
         }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
   ```
   
   This addressed the previous similar issue. There is still existing race condition though. The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance. The fundamental issue is that HelixAdmin tool has not synchronization with Helix controller when changing messages of a participant.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 commented on issue #1630: Fix flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700


   It seems that Huizhi tried to fix issue April 2020 by adding 
   ```
         Assert.assertTrue(TestHelper.verify(() -> {
           _gSetupTool.getClusterManagementTool()
               .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
           return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
               && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
         }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
   ```
   
   However, this may not work. The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance. 
   
   A better approach seems to be stop controller before disable instance and re-enable controller afterward.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] jiajunwang commented on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

jiajunwang commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773733776


   It seems to be a real production code issue instead of a test problem.
   If we try to drop an instance immediately after it is disabled, we may see some messages from the controller side too, right?
   
   I believe a better fix is to add a retry logic in the deleteRecursively to tolerate this situation in general. However, this is still not the proper fix. The best way is to lock the folder before delete, then remove the folder with confidence. The only downside is that this would be potentially complicated.
   
   @kaisun2000 I think it is not a good idea to change the test logic here. Since it is a good one which helps us to uncover a potential issue. Please feel free to create another issue for the proper logic fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] alirezazamani closed issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

alirezazamani closed issue #1630:
URL: https://github.com/apache/helix/issues/1630


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 edited a comment on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 edited a comment on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700


   It seems that Huizhi fixed similar issue April 2020 by adding 
   ```
         Assert.assertTrue(TestHelper.verify(() -> {
           _gSetupTool.getClusterManagementTool()
               .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
           return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
               && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
         }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
   ```
   
   This addressed the previous similar issue. There is still another race condition though. The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance. The fundamental issue is that HelixAdmin tool has not synchronization with Helix controller when changing messages of a participant.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 edited a comment on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 edited a comment on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700


   There is one race condition here before `_gSetupTool.dropInstanceFromCluster(CLUSTER_NAME, p.getInstanceName());`, The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance. The fundamental issue is that HelixAdmin tool has not synchronization with Helix controller when changing messages of a participant.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] jiajunwang commented on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

jiajunwang commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773733776


   It seems to be a real production code issue instead of a test problem.
   If we try to drop an instance immediately after it is disabled, we may see some messages from the controller side too, right?
   
   I believe a better fix is to add a retry logic in the deleteRecursively to tolerate this situation in general. However, this is still not the proper fix. The best way is to lock the folder before delete, then remove the folder with confidence. The only downside is that this would be potentially complicated.
   
   @kaisun2000 I think it is not a good idea to change the test logic here. Since it is a good one which helps us to uncover a potential issue. Please feel free to create another issue for the proper logic fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 edited a comment on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 edited a comment on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] pkuwm commented on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

pkuwm commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773561069


   > It seems that Huizhi tried to fix issue April 2020 by adding
   > 
   > ```
   >       Assert.assertTrue(TestHelper.verify(() -> {
   >         _gSetupTool.getClusterManagementTool()
   >             .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
   >         return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
   >             && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
   >       }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
   > ```
   > 
   > However, this may not work. The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance.
   > 
   > A better approach seems to be stop controller before disable instance and re-enable controller afterward.
   
   @kaisun2000 Can you help understand what I fixed may not work?
   
   I believe what I fixed is different from the failure cause you are talking about. What I fixed is to make sure the correct exception is caught and retry is done. It worked and at that time the test was stable with no issue. Please be aware that it fails now because of a different issue. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 commented on issue #1630: Fix flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773106700






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 commented on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-774320216


   #1640 is created to trace this race condition from production code point of view.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] pkuwm commented on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

pkuwm commented on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773561069


   > It seems that Huizhi tried to fix issue April 2020 by adding
   > 
   > ```
   >       Assert.assertTrue(TestHelper.verify(() -> {
   >         _gSetupTool.getClusterManagementTool()
   >             .enableInstance(CLUSTER_NAME, p.getInstanceName(), false);
   >         return !InstanceValidationUtil.isEnabled(helixDataAccessor, p.getInstanceName())
   >             && !InstanceValidationUtil.isAlive(helixDataAccessor, p.getInstanceName());
   >       }, TestHelper.WAIT_DURATION), "Instance should be disabled and offline");
   > ```
   > 
   > However, this may not work. The reason is that from test code perspective at the time this instance is disabled and liveinstance gone, it may not reflect the view of controller. Controller may very well see it the instance live not disabled a little bit early and send messages to the instance.
   > 
   > A better approach seems to be stop controller before disable instance and re-enable controller afterward.
   
   @kaisun2000 Can you help understand what I fixed may not work?
   
   I believe what I fixed is different from the failure cause you are talking about. What I fixed is to make sure the correct exception is caught and retry is done. It worked and at that time the test was stable with no issue. Please be aware that it fails now because of a different issue. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] kaisun2000 edited a comment on issue #1630: Flaky test TestCrushAutoRebalanceNonRack. testLackEnoughInstances

Posted by GitBox <gi...@apache.org>.

kaisun2000 edited a comment on issue #1630:
URL: https://github.com/apache/helix/issues/1630#issuecomment-773618244


   You are right, I should not say "it does not work".  It fixes other cases. Let me change the description there. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org