You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/05/28 23:10:33 UTC

[GitHub] [helix] jiajunwang opened a new pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

jiajunwang opened a new pull request #1037:
URL: https://github.com/apache/helix/pull/1037


   ### Issues
   
   - [X] My PR addresses the following Helix issues and references them in the PR description:
   
   #1028 
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI changes:
   
   Although the rebalancer will fix the additional master eventually, the default operations are arbitrary and it may cause an older master to survive. This may cause serious application logic issues since many applications require the master to have the latest data.
   With this state resolver, the rebalancer will change the default behavior to reset all the master replicas so as to ensure the remaining one is the youngest one. Then the double-masters situation is gracefully resolved.
   
   ### Tests
   
   - [X] The following tests are written for this issue:
   
   TestAbnormalStatesResolver.testExcessiveTopStateResolver()
   
   - [X] The following is the result of the "mvn test" command on the appropriate module:
   
   N/A, the newly added logic is only used in the new test case. The other tests won't touch the new logic.
   Will run the whole test before merging the branch to master.
   
   ### Commits
   
   - [X] My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation (Optional)
   
   - [ ] In case of new functionality, my PR adds documentation in the following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Code Quality
   
   - [X] My diff has been formatted using helix-style.xml 
   (helix-style-intellij.xml if IntelliJ IDE is used)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435058130



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       @jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references. 
   ```
     public void setCurrentState(String resourceName, Partition partition, String instanceName,
         String state) {
       (...... current code ......)
       // Counter number of top state replicas for a single top state model. 
       if (state.equals(stateModelDef.getTopState())) {
         Map<String, Integer> counterMap =
             _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
                 .computeIfAbsent(partition, k -> new HashMap<>());
         counterMap.put(state, counterMap.getOrDefault(state, 0));
       }
     }
   ```
   
   Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432634958



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.

Review comment:
       I think what you want to say here is:
   
   "Note that Helix controller will eventually correct the fact that there are two top state replicas, but it may not happen immediately. This resolver exists to ensure that such cases are resolved right away."




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432787275



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.
+    if (preferenceList != null && !preferenceList.isEmpty()) {
+      String expectedTopStateHost = preferenceList.get(0);
+      if (recoverMap.get(expectedTopStateHost).equals(stateModelDef.getTopState())) {
+        recoverMap.put(expectedTopStateHost, recoveryState);
+      }
+    }
+
+    // 2. To minimize the impact of the resolution, we want to reserve one top state replica even
+    // during the recovery process.
+    boolean hasReservedTheTopState = false;
+    for (String instance : recoverMap.keySet()) {
+      if (recoverMap.get(instance).equals(stateModelDef.getTopState())) {
+        if (hasReservedTheTopState) {
+          recoverMap.put(instance, recoveryState);
+        } else {
+          hasReservedTheTopState = true;
+        }
+      }
+    }
+    // Here's what we expect to happen next:
+    // 1. The partition assignment is changed to the proposed recovery state. Or it may be halfway
+    // there.
+    // 2. If the new current state is still invalid, then continue fixing it with the same logic.

Review comment:
       There is no other resource considered here. Let me add "then the resolver continues fixing it with the same logic."




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432638975



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.

Review comment:
       What is the potential data issue? It would be good to explain this somewhere.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432635740



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,

Review comment:
       `checkCurrentStates` I think is a better name for this method. or isCurrentStateValid
   
   or `checkForCurrentStateViolation()`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432639889



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.
+    if (preferenceList != null && !preferenceList.isEmpty()) {
+      String expectedTopStateHost = preferenceList.get(0);
+      if (recoverMap.get(expectedTopStateHost).equals(stateModelDef.getTopState())) {
+        recoverMap.put(expectedTopStateHost, recoveryState);
+      }
+    }
+
+    // 2. To minimize the impact of the resolution, we want to reserve one top state replica even
+    // during the recovery process.
+    boolean hasReservedTheTopState = false;

Review comment:
       no need for "the" in the variable name




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432639578



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.

Review comment:
       Not exactly. Let me modify the description a little bit. But it is not about speed. It is still about correctness.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435026270



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean checkCurrentStates(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {

Review comment:
       I am fine with current top state one. But just thinking about the non-top state scenario. Because one resolver most likely works only for top state resolving. Anyway, we can think about it later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432633442



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestAbnormalStatesResolver.java
##########
@@ -64,4 +80,101 @@ public void testConfigureResolver() {
     clusterConfig.setAbnormalStateResolverMap(Collections.emptyMap());
     configAccessor.setClusterConfig(CLUSTER_NAME, clusterConfig);
   }
+
+  @Test(dependsOnMethods = "testConfigureResolver")
+  public void testExcessiveTopStateResolver() {
+    BestPossibleExternalViewVerifier verifier =
+        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient).build();
+    Assert.assertTrue(verifier.verify(5000));

Review comment:
       Use a constant instead of 5000? TestHelper has a constant value built-in.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432664581



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       Could you please clarify why comparing diff will bring the complexity from O(n) to O(1)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432665210



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,

Review comment:
       Not really, this is not the corrected mapping, this is the next step of fixing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432731962



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.
+    if (preferenceList != null && !preferenceList.isEmpty()) {
+      String expectedTopStateHost = preferenceList.get(0);
+      if (recoverMap.get(expectedTopStateHost).equals(stateModelDef.getTopState())) {
+        recoverMap.put(expectedTopStateHost, recoveryState);
+      }
+    }
+
+    // 2. To minimize the impact of the resolution, we want to reserve one top state replica even

Review comment:
       I would suggest to put more comments here. I can understand the reason that leave a top state for the host not for first in preference list can help reduce the resetting work. But may be other people may confused with your statement in description that resetting all top state.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432638528



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);

Review comment:
       recoverMap -> correctedStateMap?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r434789533



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean checkCurrentStates(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {

Review comment:
       Just a question. Now we only support single top state. Are we gonna support user defined number of top states? Say not all replicas are top state but 2 masters or something.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432635740



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,

Review comment:
       `checkCurrentStates` I think is a better name for this method. or isCurrentStateValid




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r434978817



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean checkCurrentStates(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {

Review comment:
       For 1 is clear to me. But the question becomes if we need more than 1 resolver, how we gonna handle it. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432788056



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.
+    if (preferenceList != null && !preferenceList.isEmpty()) {
+      String expectedTopStateHost = preferenceList.get(0);
+      if (recoverMap.get(expectedTopStateHost).equals(stateModelDef.getTopState())) {
+        recoverMap.put(expectedTopStateHost, recoveryState);
+      }
+    }
+
+    // 2. To minimize the impact of the resolution, we want to reserve one top state replica even

Review comment:
       That part has been covered in the below section "Here's what we expect to happen next:". There, the 3rd point explains how the rebalancer reset all top state eventually.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432637723



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       I wonder if this would be too much of an overhead for each pipeline run? 
   
   Do you think it would be better to try to come up with a way to cache currentState mappings and compare diffs (going from O(n) -> O(1) check by storing results across pipelines).
   
   For heavy users, this O(n) computation might become a significant bottleneck if done every pipeline.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435477542



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       I see. In that case, we should add this to the cache instead of CurrentStateOutput. The cache is "protected" by the selective update, so it will help to reduce some calculations.
   
   That is a valid idea. But that requires more changes. For this specific usage, changing the fundamental cache class seems to be not worthy.
   Moreover, if the resolver is not enabled, then we don't do the calculation at all.
   
   Let me add a TODO there, if we have more usage of this count, then we shall do it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435058130



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       @jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references. 
   ```
     public void setCurrentState(String resourceName, Partition partition, String instanceName,
         String state) {
       (...... current code ......)
       // Counter number of top state replicas for a single top state model. 
       if (state.equals(stateModelDef.getTopState())) {
         Map<String, Integer> counterMap =
             _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
                 .getOrDerfault(partition, new HashMap<>());
         counterMap.put(state, counterMap.getOrDefault(state, 0));
       }
     }
   ```
   
   Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432634034



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate

Review comment:
       gracefully
   
   Also, it would be great if you could explain what the issue is instead of calling a "double-topstates issue" as this might not make much sense for readers without context.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435058130



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       @jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references. 
   ```
     public void setCurrentState(String resourceName, Partition partition, String instanceName,
         String state) {
       (...... current code ......)
       // Counter number of top state replicas for a single top state model. 
       if (state.equals(stateModelDef.getTopState())) {
         Map<String, Integer> counterMap =
             _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
                 .computeIfAbsent(partition, k -> new HashMap<>());
         counterMap.put(state, counterMap.getOrDefault(state, 0) + 1);
       }
     }
   ```
   
   Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432642049



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.
+    if (preferenceList != null && !preferenceList.isEmpty()) {
+      String expectedTopStateHost = preferenceList.get(0);
+      if (recoverMap.get(expectedTopStateHost).equals(stateModelDef.getTopState())) {
+        recoverMap.put(expectedTopStateHost, recoveryState);
+      }
+    }
+
+    // 2. To minimize the impact of the resolution, we want to reserve one top state replica even
+    // during the recovery process.
+    boolean hasReservedTheTopState = false;
+    for (String instance : recoverMap.keySet()) {
+      if (recoverMap.get(instance).equals(stateModelDef.getTopState())) {
+        if (hasReservedTheTopState) {
+          recoverMap.put(instance, recoveryState);
+        } else {
+          hasReservedTheTopState = true;
+        }
+      }
+    }
+    // Here's what we expect to happen next:
+    // 1. The partition assignment is changed to the proposed recovery state. Or it may be halfway
+    // there.
+    // 2. If the new current state is still invalid, then continue fixing it with the same logic.

Review comment:
       You mean the rebalancer will continue to run this for all N resources until everything has been resolved?
   
   Or will it check for each resource, say, resource N, I check isValid() and correctCurrentStates() until isValid() returns true?
   
   What I am not too unsure about is what you said about "new current state" being "still invalid". Is it possible that the result is invalid after correction?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435468539



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top

Review comment:
       Good catch




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432636847



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,

Review comment:
       `computeCorrectedAssignment` is clearer?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] dasahcc commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
dasahcc commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r434789533



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean checkCurrentStates(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {

Review comment:
       Just a question. Now we only support single top state. Are we gonna support user defined number of top states? Say not all replicas are top state but 2 masters or something.
   
   Why I am asking this is because  maybe sometimes there could be non-top state requires to be resolved as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435058130



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       @jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references. 
   ```
     public void setCurrentState(String resourceName, Partition partition, String instanceName,
         String state) {
       (...... current code ......)
       // Counter number of top state replicas for a single top state model. 
       if (state.equals(stateModelDef.getTopState())) {
         Map<String, Integer> counterMap =
             _topStateCounter.getOrDefault(resourceName, new HashMap<>())
                 .getOrDefault(partition, new HashMap<>());
         counterMap.put(state, counterMap.getOrDefault(state, 0));
       }
     }
   ```
   
   Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435058130



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       @jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references. 
   ```
     public void setCurrentState(String resourceName, Partition partition, String instanceName,
         String state) {
       (...... current code ......)
       // Counter number of top state replicas for a single top state model. 
       if (state.equals(stateModelDef.getTopState())) {
         Map<String, Integer> counterMap =
             _topStateCounter.computeIfAbsent(resourceName, new HashMap<>())
                 .computeIfAbsent(partition, new HashMap<>());
         counterMap.put(state, counterMap.getOrDefault(state, 0));
       }
     }
   ```
   
   Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r434792770



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean checkCurrentStates(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {

Review comment:
       Not for this resolver.
   1. If we have multiple top states requirements, then we need to have different resolvers.
   2. As for non-top state, you mean we allow the resource to have no top state at all although the state model definition requires more than zero? Not sure how it can be done given the current rebalancer logic, but it is not considered as an abnormal for this resolver.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang merged pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang merged pull request #1037:
URL: https://github.com/apache/helix/pull/1037


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] pkuwm commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
pkuwm commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r435040792



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top

Review comment:
       More than **1** top state?
   "a single top state model"?

##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       @jiajunwang In CurrentStateOutput, could we add a top state counter map so we could cache the top state counter, like below? Then we could avoid that stream filter computation? Tradeoff is we need a bit more memory for the cache. But most of them are just references. 
   ```
     public void setCurrentState(String resourceName, Partition partition, String instanceName,
         String state) {
       (...... current code ......)
       // Counter number of top state replicas for a single top state model. 
       if (state.equals(stateModelDef.getTopState())) {
         Map<String, Integer> counterMap =
             _topStateCounter.computeIfAbsent(resourceName, k -> new HashMap<>())
                 .computeIfAbsent(partition, k -> new HashMap<>());
         counterMap.put(state, counterMap.getOrDefault(state, 0));
       }
     }
   ```
   
   Not sure if we need to optimize this. Maybe you could test it. It seems for this part, the time complexity is down from O(n) to O(1), but I am not sure what the actual time saving is, considering the whole pipeline. If the whole pipeline complexity is O(N^2), with this optimization, it is O(N), that may help. If the whole pipeline is O(2 * N), with this optimization, still O(N).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432639929



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,

Review comment:
       I agree, let me change the method name.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432635740



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,

Review comment:
       `checkCurrentStates` I think is a better name for this method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] narendly commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
narendly commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432637723



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;

Review comment:
       I wonder if this would be too much of an overhead for each pipeline run? 
   
   Do you think it would be better to try to come up with a way to cache currentState mappings and compare diffs (going from O(n) -> O(1) check by storing results across pipelines).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432787755



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);
+    String recoveryState = stateModelDef
+        .getNextStateForTransition(stateModelDef.getTopState(), stateModelDef.getInitialState());
+
+    // 1. We have to reset the expected top state replica host if it is hosting the top state
+    // replica. Otherwise, the potential data issue will never be fixed there.
+    if (preferenceList != null && !preferenceList.isEmpty()) {
+      String expectedTopStateHost = preferenceList.get(0);
+      if (recoverMap.get(expectedTopStateHost).equals(stateModelDef.getTopState())) {
+        recoverMap.put(expectedTopStateHost, recoveryState);
+      }
+    }
+
+    // 2. To minimize the impact of the resolution, we want to reserve one top state replica even
+    // during the recovery process.
+    boolean hasReservedTheTopState = false;
+    for (String instance : recoverMap.keySet()) {
+      if (recoverMap.get(instance).equals(stateModelDef.getTopState())) {
+        if (hasReservedTheTopState) {
+          recoverMap.put(instance, recoveryState);
+        } else {
+          hasReservedTheTopState = true;
+        }
+      }
+    }
+    // Here's what we expect to happen next:
+    // 1. The partition assignment is changed to the proposed recovery state. Or it may be halfway
+    // there.
+    // 2. If the new current state is still invalid, then continue fixing it with the same logic.

Review comment:
       > What I am not too unsure about is what you said about "new current state" being "still invalid". Is it possible that the result is invalid after correction?
   
   I think that's why you have the question why the method is named computeRecoveryAssignment instead of computeCorrectedAssignment. The result is push the rebalancer to move one step toward the fix. But it is not necessarily fixing the states directly. Like in this case, it cannot be done within one step without impact the availability. This is the first point.
   
   Secondly, even for some resolvers, they fix the things in one step, there is no guarantee that the participant will finish the state transition immediately.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on pull request #1037:
URL: https://github.com/apache/helix/pull/1037#issuecomment-639050638


   This PR is ready to be merged, approved by @dasahcc 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432762234



##########
File path: helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestAbnormalStatesResolver.java
##########
@@ -64,4 +80,101 @@ public void testConfigureResolver() {
     clusterConfig.setAbnormalStateResolverMap(Collections.emptyMap());
     configAccessor.setClusterConfig(CLUSTER_NAME, clusterConfig);
   }
+
+  @Test(dependsOnMethods = "testConfigureResolver")
+  public void testExcessiveTopStateResolver() {
+    BestPossibleExternalViewVerifier verifier =
+        new BestPossibleExternalViewVerifier.Builder(CLUSTER_NAME).setZkClient(_gZkClient).build();
+    Assert.assertTrue(verifier.verify(5000));

Review comment:
       Sure, good point.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r434985817



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,120 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that gracefully fixes the abnormality of excessive top states for
+ * single-topstate state model. For example, two replcias of a MasterSlave partition are assigned
+ * with the Master state at the same time. This could be caused by a network partitioning or the
+ * other unexpected issues.
+ *
+ * The resolver checks for the abnormality and computes recovery assignment which triggers the
+ * rebalancer to eventually reset all the top state replias for once. After the resets, only one
+ * replica will be assigned the top state.
+ *
+ * Note that without using this resolver, the regular Helix rebalance pipeline also removes the
+ * excessive top state replicas. However, the default logic does not force resetting ALL the top
+ * state replicas. Since the multiple top states situation may break application data, the default
+ * resolution won't be enough to fix the potential problem.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean checkCurrentStates(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {

Review comment:
       This is controlled by the cluster config, if you still remember my previous PR https://github.com/apache/helix/pull/1028
   
   The assumption is that in a certain cluster, one state model has only one resolver configured.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org


[GitHub] [helix] jiajunwang commented on a change in pull request #1037: Add ExcessiveTopStateResolver to gracefully fix the double-masters situation.

Posted by GitBox <gi...@apache.org>.
jiajunwang commented on a change in pull request #1037:
URL: https://github.com/apache/helix/pull/1037#discussion_r432665390



##########
File path: helix-core/src/main/java/org/apache/helix/controller/rebalancer/constraint/ExcessiveTopStateResolver.java
##########
@@ -0,0 +1,112 @@
+package org.apache.helix.controller.rebalancer.constraint;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.helix.api.rebalancer.constraint.AbnormalStateResolver;
+import org.apache.helix.controller.stages.CurrentStateOutput;
+import org.apache.helix.model.Partition;
+import org.apache.helix.model.StateModelDefinition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * The abnormal state resolver that graceful fixes double-topstates issue for the single topstate
+ * state model.
+ * Note the regular Helix rebalance pipeline will also remove the excessive top state replica.
+ * However, the default rebalancer logic cannot guarantee a clean resolution. For example, if the
+ * double-topstates situation has already impact the data of the top state replicas, then the
+ * controller should reset both of them, then bring back one top state replica on the right
+ * allocation. For the application which has such a requirement, they should use this resolver or
+ * a more advanced resolver which check the application data to ensure the resolution is complete.
+ */
+public class ExcessiveTopStateResolver implements AbnormalStateResolver {
+  private static final Logger LOG = LoggerFactory.getLogger(ExcessiveTopStateResolver.class);
+
+  /**
+   * The current states are not valid if there are more than 2 top state replicas for a single top
+   * state state model.
+   */
+  @Override
+  public boolean isCurrentStatesValid(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef) {
+    if (!stateModelDef.isSingleTopStateModel()) {
+      return true;
+    }
+    if (currentStateOutput.getCurrentStateMap(resourceName, partition).values().stream()
+        .filter(state -> state.equals(stateModelDef.getTopState())).count() > 1) {
+      return false;
+    }
+    return true;
+  }
+
+  @Override
+  public Map<String, String> computeRecoveryAssignment(final CurrentStateOutput currentStateOutput,
+      final String resourceName, final Partition partition, StateModelDefinition stateModelDef,
+      List<String> preferenceList) {
+    Map<String, String> currentStateMap =
+        currentStateOutput.getCurrentStateMap(resourceName, partition);
+    if (isCurrentStatesValid(currentStateOutput, resourceName, partition, stateModelDef)) {
+      // This method should not be triggered when the mapping is valid.
+      // Log the warning for debug purposes.
+      LOG.warn("The input current state map {} is valid, return the original current state.",
+          currentStateMap);
+      return currentStateMap;
+    }
+
+    Map<String, String> recoverMap = new HashMap<>(currentStateMap);

Review comment:
       same reason here, correctedStateMap is not accurate.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org