Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2021/02/17 08:37:13 UTC

[GitHub] [helix] pkuwm opened a new pull request #1650: Improve auto enter maintenance mode

pkuwm opened a new pull request #1650:
URL: https://github.com/apache/helix/pull/1650

### Issues

- [ ] My PR addresses the following Helix issues and references them in the PR description:

Resolves #1648

### Description

- [ ] Here are some details about my PR, including screenshots of any UI changes:

Assume the threshold for entering maintenance (M) mode is 5. If 20 nodes go down at the same time, the running pipeline creates the maintenance znode. However, instead of switching to the maintenance rebalancer immediately, this pipeline continues with the normal rebalancer, which bootstraps new partitions on the online instances. Only the pipelines that run after it use the maintenance rebalancer.

This PR improves the auto-enter maintenance mode logic by also enabling maintenance mode in the data cache, so that the maintenance rebalancer computes the best possible mapping immediately, in the first pipeline.
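
The difference can be illustrated with a minimal, self-contained sketch. It is not Helix code: `PipelineSketch`, `Cache`, `FakeZk`, and `runPipeline` are hypothetical stand-ins, and only the threshold check mirrors the description above.

```java
// Hypothetical sketch, not Helix code: all class and method names are illustrative.
final class PipelineSketch {
  /** Stand-in for the controller's data cache, which holds the maintenance flag. */
  static final class Cache {
    private boolean maintenanceModeEnabled;

    boolean isMaintenanceModeEnabled() {
      return maintenanceModeEnabled;
    }

    void enableMaintenanceMode() {
      maintenanceModeEnabled = true;
    }
  }

  /** Stand-in for ZooKeeper; records whether the maintenance znode was created. */
  static final class FakeZk {
    boolean maintenanceZnodeExists;
  }

  /**
   * Runs one simplified pipeline and returns the rebalancer it would use.
   * updateCache == false models the old behavior; true models the behavior in this PR.
   */
  static String runPipeline(Cache cache, FakeZk zk, int offlineCount,
      int maxOfflineInstancesAllowed, boolean updateCache) {
    if (maxOfflineInstancesAllowed >= 0 && offlineCount > maxOfflineInstancesAllowed) {
      zk.maintenanceZnodeExists = true; // the znode is created in both variants
      if (updateCache) {
        cache.enableMaintenanceMode(); // the flag becomes visible to this same pipeline
      }
    }
    return cache.isMaintenanceModeEnabled() ? "MAINTENANCE" : "NORMAL";
  }

  public static void main(String[] args) {
    // Threshold 5 and 20 nodes down, as in the example above.
    System.out.println(runPipeline(new Cache(), new FakeZk(), 20, 5, false));
    System.out.println(runPipeline(new Cache(), new FakeZk(), 20, 5, true));
  }
}
```

In this sketch the first call selects the normal rebalancer even though the znode has been created, while the second call selects the maintenance rebalancer within the same pipeline run.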

### Tests

- [ ] The following tests are written for this issue:

testAutoEnterMaintenanceMode

- The following is the result of the "mvn test" command on the appropriate module:

Running

### Documentation (Optional)

- In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

### Commits

- My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"

### Code Quality

- My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org

[GitHub] [helix] pkuwm commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

pkuwm commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r578839971



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java
##########
@@ -239,11 +239,14 @@ private void validateMaxPartitionsPerInstance(ClusterEvent event,
           // in this instance
           partitionCount++;
           if (partitionCount > maxPartitionPerInstance) {
+            // Enable maintenance rebalancer for this pipeline

Review comment:
       I realized that M mode is handled in the best possible stage, so I am keeping it as is.





[GitHub] [helix] jiajunwang commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

jiajunwang commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r577838791



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java
##########
@@ -193,15 +193,19 @@ public Object call() {
   }
 
   // Check whether the offline/disabled instance count in the cluster reaches the set limit,
-  // if yes, pause the rebalancer, and throw exception to terminate rebalance cycle.
+  // if yes, auto enable maintenance mode, and use the maintenance rebalancer for this pipeline.
   private boolean validateOfflineInstancesLimit(final ResourceControllerDataProvider cache,
       final HelixManager manager) {
     int maxOfflineInstancesAllowed = cache.getClusterConfig().getMaxOfflineInstancesAllowed();
     if (maxOfflineInstancesAllowed >= 0) {
       int offlineCount = cache.getAllInstances().size() - cache.getEnabledLiveInstances().size();
       if (offlineCount > maxOfflineInstancesAllowed) {
+        // Enable maintenance mode in cache so the maintenance rebalancer is used for this pipeline
+        cache.enableMaintenanceMode();

Review comment:
       IMO, we should ensure that the ZK update has been done before enabling it in the cache. This helps to prevent multiple potential consistency issues.






[GitHub] [helix] pkuwm merged pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

pkuwm merged pull request #1650:
URL: https://github.com/apache/helix/pull/1650


   



[GitHub] [helix] alirezazamani commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

alirezazamani commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r578838841



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java
##########
@@ -239,11 +239,14 @@ private void validateMaxPartitionsPerInstance(ClusterEvent event,
           // in this instance
           partitionCount++;
           if (partitionCount > maxPartitionPerInstance) {
+            // Enable maintenance rebalancer for this pipeline

Review comment:
       Wouldn't it be better to enable maintenance mode in the BestPossible stage? Is there a reason we enable it in this stage?





[GitHub] [helix] dasahcc commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

dasahcc commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r578832424



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java
##########
@@ -239,11 +239,14 @@ private void validateMaxPartitionsPerInstance(ClusterEvent event,
           // in this instance
           partitionCount++;
           if (partitionCount > maxPartitionPerInstance) {
+            // Enable maintenance rebalancer for this pipeline
+            cache.enableMaintenanceMode();

Review comment:
       This may not be very useful, because this stage runs after BestPossible and the flag is read in the cache read stage. Since we are not changing where the maintenance mode ZNode is added, let's try to minimize the code change.





[GitHub] [helix] pkuwm commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

pkuwm commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r578839734



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/IntermediateStateCalcStage.java
##########
@@ -239,11 +239,14 @@ private void validateMaxPartitionsPerInstance(ClusterEvent event,
           // in this instance
           partitionCount++;
           if (partitionCount > maxPartitionPerInstance) {
+            // Enable maintenance rebalancer for this pipeline
+            cache.enableMaintenanceMode();

Review comment:
       Yes, you're right. The maintenance rebalancer is used in the best possible stage. I'll keep it as is.





[GitHub] [helix] pkuwm commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

pkuwm commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r577940270



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java
##########
@@ -193,15 +193,19 @@ public Object call() {
   }
 
   // Check whether the offline/disabled instance count in the cluster reaches the set limit,
-  // if yes, pause the rebalancer, and throw exception to terminate rebalance cycle.
+  // if yes, auto enable maintenance mode, and use the maintenance rebalancer for this pipeline.
   private boolean validateOfflineInstancesLimit(final ResourceControllerDataProvider cache,
       final HelixManager manager) {
     int maxOfflineInstancesAllowed = cache.getClusterConfig().getMaxOfflineInstancesAllowed();
     if (maxOfflineInstancesAllowed >= 0) {
       int offlineCount = cache.getAllInstances().size() - cache.getEnabledLiveInstances().size();
       if (offlineCount > maxOfflineInstancesAllowed) {
+        // Enable maintenance mode in cache so the maintenance rebalancer is used for this pipeline
+        cache.enableMaintenanceMode();

Review comment:
       I thought about that too. It doesn't seem to make much difference: if the ZK update fails, a `HelixException` is thrown and the pipeline is terminated, so enabling it either before or after the ZK update is fine. I put it before because the logic looks a bit clearer to me that way: M mode is enabled right after the offline count exceeds the threshold, and it does not depend on whether the helix manager is set.
   
   Synced with @jiajunwang offline. He also agreed either one will work. Thanks, @jiajunwang, for the review!





[GitHub] [helix] jiajunwang commented on a change in pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

jiajunwang commented on a change in pull request #1650:
URL: https://github.com/apache/helix/pull/1650#discussion_r580567897



##########
File path: helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java
##########
@@ -193,15 +193,19 @@ public Object call() {
   }
 
   // Check whether the offline/disabled instance count in the cluster reaches the set limit,
-  // if yes, pause the rebalancer, and throw exception to terminate rebalance cycle.
+  // if yes, auto enable maintenance mode, and use the maintenance rebalancer for this pipeline.
   private boolean validateOfflineInstancesLimit(final ResourceControllerDataProvider cache,
       final HelixManager manager) {
     int maxOfflineInstancesAllowed = cache.getClusterConfig().getMaxOfflineInstancesAllowed();
     if (maxOfflineInstancesAllowed >= 0) {
       int offlineCount = cache.getAllInstances().size() - cache.getEnabledLiveInstances().size();
       if (offlineCount > maxOfflineInstancesAllowed) {
+        // Enable maintenance mode in cache so the maintenance rebalancer is used for this pipeline
+        cache.enableMaintenanceMode();

Review comment:
       I think I didn't present my idea clearly enough. Since either way works for now, please put the cache update after the ZK update logic.
   The reason is that we may want to change our logic so that the same cache object is used continuously, to optimize performance. If that is done, the current ordering becomes a potential bug.
   Please change it if there are no other concerns.
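
   The ordering concern can be sketched as follows. This is a hypothetical illustration, not Helix code: `OrderingSketch`, `Cache`, `ZkUpdate`, and both helper methods are made-up names, and the ZK failure is simulated with an `IllegalStateException` rather than a `HelixException`.

```java
// Hypothetical sketch, not Helix code: names are illustrative only.
final class OrderingSketch {
  /** Stand-in for a cache object that might be reused by later pipelines. */
  static final class Cache {
    private boolean maintenanceModeEnabled;

    boolean isMaintenanceModeEnabled() {
      return maintenanceModeEnabled;
    }

    void enableMaintenanceMode() {
      maintenanceModeEnabled = true;
    }
  }

  /** Stand-in for the ZK write that creates the maintenance znode; it may throw. */
  interface ZkUpdate {
    void createMaintenanceZnode();
  }

  /** Cache first, then ZK: a failed ZK write leaves the cache flag set. */
  static void cacheThenZk(Cache cache, ZkUpdate zk) {
    cache.enableMaintenanceMode();
    zk.createMaintenanceZnode();
  }

  /** ZK first, then cache: the flag is set only after the ZK write succeeded. */
  static void zkThenCache(Cache cache, ZkUpdate zk) {
    zk.createMaintenanceZnode();
    cache.enableMaintenanceMode();
  }

  public static void main(String[] args) {
    ZkUpdate failing = () -> {
      throw new IllegalStateException("simulated ZK failure");
    };

    Cache cacheFirst = new Cache();
    try {
      cacheThenZk(cacheFirst, failing);
    } catch (IllegalStateException e) {
      // The pipeline would be terminated here.
    }

    Cache zkFirst = new Cache();
    try {
      zkThenCache(zkFirst, failing);
    } catch (IllegalStateException e) {
      // The pipeline would be terminated here.
    }

    System.out.println("cache first, flag set: " + cacheFirst.isMaintenanceModeEnabled());
    System.out.println("ZK first, flag set: " + zkFirst.isMaintenanceModeEnabled());
  }
}
```

   If the cache object is discarded after a failed pipeline, both orderings behave the same. If the same object were kept for later pipelines, only the ZK-first ordering leaves the cache consistent with ZK after a failed write.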

##########
File path: helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java
##########
@@ -947,6 +947,10 @@ public boolean isMaintenanceModeEnabled() {
     return _isMaintenanceModeEnabled;
   }
 
+  public void enableMaintenanceMode() {

Review comment:
       The logic is fine, but let's be more careful here. This method is not supposed to be used anywhere except where we update the ZK M mode signal. Let's add a comment saying so here in the code. Also, please add a TODO noting that we need to separate the read-only cache from the updatable cache in the near future.
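
   One possible shape for the requested comment and TODO is sketched below. `DataProviderSketch` is a hypothetical stand-in for the real class; the field and method names come from the diff above, while the method body and the comment wording are assumptions.

```java
// Hypothetical sketch: a stand-in class, not the real BaseControllerDataProvider.
class DataProviderSketch {
  private boolean _isMaintenanceModeEnabled;

  public boolean isMaintenanceModeEnabled() {
    return _isMaintenanceModeEnabled;
  }

  /**
   * Marks maintenance mode as enabled in this cache.
   *
   * Intended to be called only where the controller updates the maintenance mode
   * signal in ZK; it should not be used anywhere else.
   */
  // TODO: separate the read-only cache from the updatable cache.
  public void enableMaintenanceMode() {
    _isMaintenanceModeEnabled = true; // assumed body; the diff shows only the signature
  }
}
```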





[GitHub] [helix] pkuwm commented on pull request #1650: Improve auto enter maintenance mode

Posted by GitBox <gi...@apache.org>.

pkuwm commented on pull request #1650:
URL: https://github.com/apache/helix/pull/1650#issuecomment-783853573


   Thanks, @jiajunwang @alirezazamani @dasahcc, for the review and approvals.
   
   This PR is ready to be merged; approved by @jiajunwang @alirezazamani.
   
   Assume the threshold for entering maintenance (M) mode is 5. If 20 nodes go down at the same time, the running pipeline creates the maintenance znode. However, instead of switching to the maintenance rebalancer immediately, this pipeline continues with the normal rebalancer, which moves new partitions onto the online instances.
   
   This commit improves the auto-enter maintenance mode logic by also enabling maintenance mode in the data cache, so that the maintenance rebalancer computes the best possible mapping immediately, in the first pipeline.
   
   
   

