You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/05/19 10:43:18 UTC

[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3423: HDDS-6280. Support Container Balancer HA

lokeshj1703 commented on code in PR #3423:
URL: https://github.com/apache/ozone/pull/3423#discussion_r876723181


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -227,7 +279,7 @@ private void balance() {
    */
   private boolean initializeIteration() {
     if (scmContext.isInSafeMode()) {
-      LOG.error("Container Balancer cannot operate while SCM is in Safe Mode.");
+      LOG.warn("Container Balancer cannot operate while SCM is in Safe Mode.");

Review Comment:
   I think we should keep this as error?



##########
hadoop-hdds/interface-client/src/main/proto/hdds.proto:
##########
@@ -431,3 +431,22 @@ message ReplicationManagerReportProto {
     repeated KeyIntValue stat = 2;
     repeated KeyContainerIDList statSample = 3;
 }
+
+message ContainerBalancerConfigurationProto {
+    optional string utilizationThreshold = 5;
+    optional int32 datanodesInvolvedMaxPercentagePerIteration = 6;
+    optional int64 sizeMovedMaxPerIteration = 7;
+    optional int64 sizeEnteringTargetMax = 8;
+    optional int64 sizeLeavingSourceMax = 9;
+    optional int32 iterations = 10;
+    optional string excludeContainers = 11;
+    optional int64 moveTimeout = 12;
+    optional int64 balancingIterationInterval = 13;
+    optional string includeDatanodes = 14;
+    optional string excludeDatanodes = 15;
+    optional bool moveNetworkTopologyEnable = 16;
+    optional bool triggerDuBeforeMoveEnable = 17;
+
+    required bool shouldRun = 18;
+    optional int32 nextIterationIndex = 19;
+}

Review Comment:
   Is it possible to convert ContainerBalancerConfiguration into a string format or some other format. I think we can convert it into xml and store the xml itself. We can avoid creating these fields using that approach.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -179,19 +188,53 @@ private void balance() {
         }
       }
 
-      // stop balancing if iteration is not initialized
+      // initialize this iteration. stop balancing on initialization failure
       if (!initializeIteration()) {
-        stopBalancer();
+        // just return if the reason for initialization failure is that
+        // balancer has been stopped in another thread
+        if (!isBalancerRunning()) {
+          return;
+        }
+        // otherwise, try to stop balancer
+        try {
+          stopBalancer();
+        } catch (IOException | IllegalContainerBalancerStateException e) {
+          LOG.warn("Tried and failed to stop Container Balancer when it " +
+              "could not initialize an iteration", e);
+        }
         return;
       }
 
-      //if no new move option is generated, it means the cluster can
-      //not be balanced any more , so just stop
       IterationResult iR = doIteration();
       metrics.incrementNumIterations(1);
       LOG.info("Result of this iteration of Container Balancer: {}", iR);
+
+      // persist next iteration index
+      if (iR == IterationResult.ITERATION_COMPLETED) {
+        lock.lock();
+        try {
+          saveConfiguration(
+              config.toProtobufBuilder()
+                  .setShouldRun(true)
+                  .setNextIterationIndex(i + 1)
+                  .build());
+        } catch (IOException e) {
+          LOG.warn("Could not persist next iteration index value for " +
+              "ContainerBalancer after completing an iteration", e);
+        } finally {
+          lock.unlock();
+        }
+      }
+
+      // if no new move option is generated, it means the cluster cannot be
+      // balanced anymore; so just stop balancer
       if (iR == IterationResult.CAN_NOT_BALANCE_ANY_MORE) {
-        stopBalancer();
+        try {
+          stopBalancer();
+        } catch (IOException | IllegalContainerBalancerStateException e) {
+          LOG.warn("Tried and failed to stop Container Balancer when result " +
+              "of the latest iteration was " + iR, e);
+        }

Review Comment:
   I think it is better to add try catch inside stopBalancer itself. stopBalancer can include stopReason as argument.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -179,19 +188,53 @@ private void balance() {
         }
       }
 
-      // stop balancing if iteration is not initialized
+      // initialize this iteration. stop balancing on initialization failure
       if (!initializeIteration()) {
-        stopBalancer();
+        // just return if the reason for initialization failure is that
+        // balancer has been stopped in another thread
+        if (!isBalancerRunning()) {
+          return;
+        }
+        // otherwise, try to stop balancer
+        try {
+          stopBalancer();
+        } catch (IOException | IllegalContainerBalancerStateException e) {
+          LOG.warn("Tried and failed to stop Container Balancer when it " +
+              "could not initialize an iteration", e);
+        }
         return;
       }
 
-      //if no new move option is generated, it means the cluster can
-      //not be balanced any more , so just stop
       IterationResult iR = doIteration();
       metrics.incrementNumIterations(1);
       LOG.info("Result of this iteration of Container Balancer: {}", iR);
+
+      // persist next iteration index
+      if (iR == IterationResult.ITERATION_COMPLETED) {
+        lock.lock();
+        try {
+          saveConfiguration(
+              config.toProtobufBuilder()
+                  .setShouldRun(true)
+                  .setNextIterationIndex(i + 1)
+                  .build());
+        } catch (IOException e) {
+          LOG.warn("Could not persist next iteration index value for " +
+              "ContainerBalancer after completing an iteration", e);
+        } finally {
+          lock.unlock();
+        }

Review Comment:
   We can include lock inside an internal function for saveConfiguration



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -830,12 +927,47 @@ public String getServiceName() {
   }
 
   /**
-   * Starts ContainerBalancer as an SCMService.
+   * Starts ContainerBalancer as an SCMService. Validates state, reads and
+   * validates persisted configuration, and then starts the balancing
+   * thread.
+   * @throws IllegalContainerBalancerStateException if balancer should not
+   * run according to persisted configuration
+   * @throws InvalidContainerBalancerConfigurationException if failed to
+   * retrieve persisted configuration, or the configuration is null
    */
   @Override
-  public void start() {
-    if (shouldRun()) {
+  public void start() throws IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException {
+    lock.lock();
+    try {
+      // should be leader-ready, out of safe mode, and not running already
+      validateState(false);
+      ContainerBalancerConfigurationProto proto;
+      try {
+        proto = readConfiguration(ContainerBalancerConfigurationProto.class);
+      } catch (IOException e) {
+        throw new InvalidContainerBalancerConfigurationException("Could not " +
+            "retrieve persisted configuration while starting " +
+            "Container Balancer as an SCMService. Will not start now.", e);
+      }
+      if (proto == null) {
+        throw new InvalidContainerBalancerConfigurationException("Persisted " +
+            "configuration for ContainerBalancer is null during start. Will " +
+            "not start now.");
+      }

Review Comment:
   Is this an error case? If the config is null could it also mean balancer has not been started yet?



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerConfiguration.java:
##########
@@ -131,6 +134,8 @@ public final class ContainerBalancerConfiguration {
           "data node is very high")
   private boolean triggerDuEnable = false;
 
+  private int nextIterationIndex = 0;
+

Review Comment:
   not required.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -801,24 +852,70 @@ private void resetState() {
   /**
    * Receives a notification for raft or safe mode related status changes.
    * Stops ContainerBalancer if it's running and the current SCM becomes a
-   * follower or enters safe mode.
+   * follower or enters safe mode. Starts ContainerBalancer if the current
+   * SCM becomes leader, is out of safe mode and balancer should run.
    */
   @Override
   public void notifyStatusChanged() {
-    if (!scmContext.isLeader() || scmContext.isInSafeMode()) {
-      if (isBalancerRunning()) {
-        stopBalancingThread();
+    lock.lock();
+    try {
+      if (!scmContext.isLeader() || scmContext.isInSafeMode()) {
+        if (isBalancerRunning()) {
+          stop();
+        }
+      } else {
+        if (shouldRun()) {
+          try {
+            start();
+          } catch (IllegalContainerBalancerStateException |
+              InvalidContainerBalancerConfigurationException e) {
+            LOG.warn("Could not start ContainerBalancer on raft/safe-mode " +
+                "status change.", e);
+          }
+        }
       }
+    } finally {
+      lock.unlock();
     }
   }
 
   /**
-   * Checks if ContainerBalancer should run.
-   * @return false
+   * Checks if ContainerBalancer should start (after a leader change, restart
+   * etc.) by reading persisted state.
+   * @return true if the persisted state is true, otherwise false
    */
   @Override
   public boolean shouldRun() {
-    return false;
+    try {
+      ContainerBalancerConfigurationProto proto =
+          readConfiguration(ContainerBalancerConfigurationProto.class);
+      if (proto == null) {
+        LOG.warn("Could not find persisted configuration for {} when checking" +
+            " if ContainerBalancer should run. ContainerBalancer should not " +
+            "run now.", this.getServiceName());
+        return false;
+      }
+      return proto.getShouldRun();
+    } catch (IOException e) {
+      LOG.warn("Could not read persisted configuration for checking if " +
+          "ContainerBalancer should start. ContainerBalancer should not start" +
+          " now.", e);
+      return false;
+    }

Review Comment:
   We are validating the configuration twice, once in start method and once in this function. I think we should validate only once and store the configuration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org