You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/03/12 09:08:14 UTC

[GitHub] [ozone] adoroszlai commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

adoroszlai commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r592958568



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -604,7 +605,7 @@ public DatanodeLayoutStorage getLayoutStorage() {
   }
 
   @VisibleForTesting

Review comment:
       ```suggestion
   ```
   
   It's no longer visible.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -303,4 +306,26 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
     LOG.error(msg, e);
     throw new UpgradeException(msg, e, resultCode);
   }
+
+  @Override
+  public void configureTestInjectionFunction(
+      UpgradeTestInjectionPoints pointIndex,
+      Callable<Boolean> injectedTestFunction) {
+    injectTestFunction = injectedTestFunction;
+    testInjectionPoint = pointIndex;
+  }
+
+  @Override
+  public Boolean injectTestFunctionAtThisPoint(
+      UpgradeTestInjectionPoints pointIndex) throws Exception {
+    if ((testInjectionPoint != null) &&
+        (pointIndex.getValue() == testInjectionPoint.getValue())) {
+      if (injectTestFunction != null) {

Review comment:
       Please merge these `if` statements.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -616,6 +617,8 @@ public boolean canFinalizeDataNode() {
       case OPEN:
       case CLOSING:
       case UNHEALTHY:
+        LOG.warn("FinalizeUpgrade : Waiting for container to close, current " +
+            "state is: {}", ctr.getContainerState());

Review comment:
       Please extract `ctr.getContainerState()` to a variable to ensure we get the same state in `switch` and the log.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
##########
@@ -1322,4 +1322,8 @@ public StatusAndMessages queryUpgradeFinalizationProgress(
   ) throws IOException {
     return upgradeFinalizer.reportStatus(upgradeClientID, takeover);
   }
+
+  public UpgradeFinalizer<StorageContainerManager> getUpgradeFinalizer() {
+    return  upgradeFinalizer;

Review comment:
       ```suggestion
       return upgradeFinalizer;
   ```

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -43,6 +44,28 @@
 @InterfaceStability.Evolving
 public interface UpgradeFinalizer<T> {
 
+  enum UpgradeTestInjectionPoints {
+    BeforePreFinalizeUpgrade(1),
+    AfterPreFinalizeUpgrade(2),
+    BeforeCompleteFinalization(3),
+    AfterCompleteFinalization(4),
+    AfterPostFinalizeUpgrade(5);

Review comment:
       These names trigger Sonar warning due to not matching pattern for constants.  Should be `BEFORE_PRE_FINALIZE_UPGRADE`, etc.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
##########
@@ -31,12 +35,16 @@
 import org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine;
 import org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer;
 import org.apache.hadoop.ozone.upgrade.LayoutFeature;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * UpgradeFinalizer for the DataNode.
  */
 public class DataNodeUpgradeFinalizer extends
     BasicUpgradeFinalizer<DatanodeStateMachine, HDDSLayoutVersionManager> {
+  static final Logger LOG =

Review comment:
       ```suggestion
     private static final Logger LOG =
   ```

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());

Review comment:
       I don't think calling`fail()` on other thread really triggers test failure.  The following test passes:
   
   ```
     @Test
     public void testThread() throws InterruptedException {
       Thread thread = new Thread(() -> Assert.fail("test"));
       thread.start();
       thread.join();
     }
   ```
   
   To trigger failure, we need some shared state (eg. `AtomicBoolean`) that's set by the other thread to signal failure.  The main test thread should check it and fail if state is set.

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -74,10 +95,10 @@
 public class TestHDDSUpgrade {
 
   /**
-    * Set a timeout for each test.
-    */
+   * Set a timeout for each test.
+   */
   @Rule
-  public Timeout timeout = new Timeout(300000);
+  public Timeout timeout = new Timeout(11000000);

Review comment:
       There is no point in increasing test method timeout to 183 minutes.  Surefire forks are killed after 20 minutes:
   
   ```
   Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There was a timeout or other error in the fork
   ```

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -645,4 +648,7 @@ public StatusAndMessages queryUpgradeStatus()
     return upgradeFinalizer.reportStatus(datanodeDetails.getUuidString(),
         false);
   }
+  public UpgradeFinalizer<DatanodeStateMachine> getUpgradeFinalizer() {
+    return  upgradeFinalizer;

Review comment:
       ```suggestion
       return upgradeFinalizer;
   ```

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -303,4 +306,26 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
     LOG.error(msg, e);
     throw new UpgradeException(msg, e, resultCode);
   }
+
+  @Override
+  public void configureTestInjectionFunction(
+      UpgradeTestInjectionPoints pointIndex,
+      Callable<Boolean> injectedTestFunction) {
+    injectTestFunction = injectedTestFunction;
+    testInjectionPoint = pointIndex;
+  }
+
+  @Override
+  public Boolean injectTestFunctionAtThisPoint(
+      UpgradeTestInjectionPoints pointIndex) throws Exception {
+    if ((testInjectionPoint != null) &&
+        (pointIndex.getValue() == testInjectionPoint.getValue())) {
+      if (injectTestFunction != null) {
+        if (injectTestFunction.call()) {
+          throw new UpgradeTestInjectionAbort();

Review comment:
       I think `UpgradeTestInjectionAbort` should be an unchecked exception, and any exception thrown from `injectTestFunction.call()` should be wrapped in `UpgradeTestInjectionAbort`, too.  This way we could avoid the need to handle test-related exceptions in production code:
   
   https://github.com/apache/ozone/blob/f67b44d2696be2c15f0941ebce6d93833b9ed8d4/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java#L64-L69

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -167,4 +190,22 @@ StatusAndMessages finalize(String upgradeClientID, T service)
   StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
       throws IOException;
 
+  /**
+   * Interface to inject arbitrary failures for stress testing.
+   * @param InjectTestFunction function that will be called

Review comment:
       ```suggestion
      * @param injectTestFunction function that will be called
   ```

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
##########
@@ -68,7 +77,7 @@ public StatusAndMessages finalize(String upgradeClientID,
     }
 
     @Override
-    public Void call() throws IOException {
+    public Void call() throws IOException, InterruptedException {

Review comment:
       I don't see any blocking calls.  Removing `InterruptedException` does not indicate unhandled exception in this method.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -167,4 +190,22 @@ StatusAndMessages finalize(String upgradeClientID, T service)
   StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
       throws IOException;
 
+  /**
+   * Interface to inject arbitrary failures for stress testing.
+   * @param InjectTestFunction function that will be called
+   *        code execution reached injectTestFunctionAtThisPoint() location.
+   * @param pointIndex code execution point for a given thread.
+   */
+  void configureTestInjectionFunction(UpgradeTestInjectionPoints pointIndex,
+                                      Callable<Boolean> injectTestFunction);
+
+  /**
+   * Interface to inject error at a given point in an upgrade thread.
+   * @param pointIndex TestFunction Injection point in an upgrade thread.
+   * @return "true" if the calling thread should not continue with further
+   *          upgrade processing, "false" otherwise.
+   */
+  Boolean injectTestFunctionAtThisPoint(UpgradeTestInjectionPoints pointIndex)
+      throws Exception;

Review comment:
       Can be `void`, as its return value is always ignored.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
##########
@@ -95,14 +110,22 @@ public Void call() throws IOException {
               datanodeStateMachine.getLayoutStorage());
           versionManager.finalized(f);
         }
+        injectTestFunctionAtThisPoint(BeforeCompleteFinalization);
         versionManager.completeFinalization();
+        injectTestFunctionAtThisPoint(AfterCompleteFinalization);
         datanodeStateMachine.postFinalizeUpgrade();
+        injectTestFunctionAtThisPoint(AfterPostFinalizeUpgrade);
         emitFinishedMsg();
         return null;
+      } catch (Exception e) {
+        e.printStackTrace();

Review comment:
       Log?

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());
+          }
+        }
+      });
+    } catch (Exception e) {
+      LOG.info("DataNode Restart Failed!");
+      Assert.fail(e.getMessage());
+    }
+    return t;
+  }
+
+  /*
+   * Helper function to inject coordinated failures and restarts across
+   * all the DataNode as well as SCM. This can help create targeted test cases
+   * to inject such comprehensive failures in SCM-Upgrade-Context as well as
+   * DataNode-Upgrade-Context.
+   *
+   * Injects Failure in  : SCM as well as ALL the DataNodes.
+   * Executing-Thread-Context : Either the SCM-Upgrade-Finalizer or the
+   *                            DataNode-Upgrade-Finalizer.
+   */
+  private Thread injectSCMAndDataNodeFailureTogetherAtTheSameTime()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // This needs to happen in a separate thread context otherwise
+    // DataNode restart will hang.
+    return new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          // Since we are modifying cluster in an independent thread context,
+          // we synchronize access to it to avoid concurrent modification
+          // exception.
+          synchronized (cluster) {
+            // Work on a Copy of current set of DataNodes to avoid
+            // running into tricky situations.
+            List<HddsDatanodeService> currentDataNodes =
+                new ArrayList<>(cluster.getHddsDatanodes());
+            for (HddsDatanodeService ds: currentDataNodes) {
+              DatanodeDetails dn = ds.getDatanodeDetails();
+              cluster.restartHddsDatanode(dn, false);
+            }
+            cluster.restartStorageContainerManager(false);
+            cluster.waitForClusterToBeReady();
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+  }
+
+  /*
+   * We have various test cases to target single-node or multi-node failures
+   * below.
+   **/
+
+  /*
+   * One node(SCM) failure case:
+   * Thread-Context : SCM-Upgrade
+   *
+   * Test SCM failure During SCM Upgrade before execution point
+   * "PreFinalizeUpgrade". All meaningful Upgrade execution points
+   * are defined in UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+  @Test
+  public void testScmFailuresBeforeScmPreFinalizeUpgrade()
+      throws Exception {
+    scm.getUpgradeFinalizer().configureTestInjectionFunction(
+        BeforePreFinalizeUpgrade,
+        () -> {
+          return injectSCMFailureDuringSCMUpgrade();
+        });

Review comment:
       ```suggestion
           this::injectSCMFailureDuringSCMUpgrade);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org