You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/03/07 21:14:53 UTC

[GitHub] [ozone] prashantpogde opened a new pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

prashantpogde opened a new pull request #1998:
URL: https://github.com/apache/ozone/pull/1998


   ## What changes were proposed in this pull request?
   
   The goals of this PR is to write comprehensive framework that will
   
   - drives SCM - finalization
   - Inject failures in both DataNodes as well as SCM at every state change in both SCM and DataNodes.
   - Validate that SCM and DataNodes eventually finalize and upgrade is successful.
   
   HDDS upgrade model can be thought of as a State Machine model {states, transitions}, where
   states are specific stages in upgrade finalization either on the SCM node or on the individual DataNodes
   transitions are events that trigger state change
   
   Different HDDS-Upgrade stages, for Both DataNodes as well SCM are defined as
   
   - BeforePreFinalizeUpgrade
   - AfterPreFinalizeUpgrade
   - BeforeCompleteFinalization
   - AfterCompleteFinalization
   - AfterPostFinalizeUpgrade
   
   This validation framework will trigger all possible combination of failures while the nodes are in different possible states. The different combinations will include :
   
   -  One Node failures - Fail SCM  in the middle of SCM upgrade while the SCM is at a specific state.
         -Try this for all possible SCM-upgrade states 
   - One Node failures - Fail DataNode in the middle of SCM upgrade while the SCM is at a specific state. 
         -  Try this for all possible SCM-upgrade states 
   -  One Node failures - Fail SCM in the middle of DataNode upgrade while the DataNode is at a specific state.
         - Try this for all possible DataNode-upgrade states 
   - One Node failures - Fail DataNode in the middle of DataNode upgrade while the same DataNode is at a specific state. 
        - Try this for all possible DataNode-upgrade states
   - Two Node Failures - Fail SCM as well as a DataNode in the middle of SCM upgrade while the SCM is at a specific state.
       - Try this for all possible SCM-upgrade states
   - Two Node Failures - Fail SCM as well as a DataNode in the middle of the DataNode upgrade while the same DataNode is at a specific state.
       - Try this for all possible DataNode-upgrade states
   - Two Node Failures - Fail SCM at a specific upgrade state in SCM thread context. Fail DataNode at a specific upgrade state in DataNode upgrade thread context.
       - Try this for all permutations of SCM-upgrade-states and Data-Node-Upgrade-states
   - Multi-node failure - Fail All the DataNodes at specific SCM-upgrade state
      - Try this for all possible SCM-upgrade states
   - Multi-node failure - Fail All the DataNodes at specific DataNode-upgrade state
     - Try this for all possible DataNode-upgrade states
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-4914
   
   ## How was this patch tested?
   
   Running newly introduced Integration Tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-817087770


   @prashantpogde Can we resolve the merge conflicts? That will trigger CI which will actually run the added tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607424721



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -187,4 +186,15 @@ StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
    */
   void runPrefinalizeStateActions(Storage storage, T service)
       throws IOException;
+
+  /**
+   * Sets the Finalization Executor driver.
+   * @param executor FinalizationExecutor.
+   */
+  void setFinalizationExecutor(UpgradeFinalizationExecutor executor);

Review comment:
       Currently default constructor for DN SCM gets the default constructor only. The set method is only for the TestHDDSUpgrade.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604582442



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
##########
@@ -31,12 +35,16 @@
 import org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine;
 import org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer;
 import org.apache.hadoop.ozone.upgrade.LayoutFeature;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * UpgradeFinalizer for the DataNode.
  */
 public class DataNodeUpgradeFinalizer extends
     BasicUpgradeFinalizer<DatanodeStateMachine, HDDSLayoutVersionManager> {
+  static final Logger LOG =

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604585519



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -74,10 +95,10 @@
 public class TestHDDSUpgrade {
 
   /**
-    * Set a timeout for each test.
-    */
+   * Set a timeout for each test.
+   */
   @Rule
-  public Timeout timeout = new Timeout(300000);
+  public Timeout timeout = new Timeout(11000000);

Review comment:
       yup, need to find a way to get these tests executes. Some of these tests are trying various permutations of failure combination in SCM and datanode and the tests  can take 3-4 hours. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r606364359



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -436,4 +456,15 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
   protected void updateLayoutVersionInDB(V vm, T comp) throws IOException {
     throw new UnsupportedOperationException();
   }
+
+  protected void postFinalizeUpgrade() throws IOException {
+  }
+
+  protected void finalizeVersionManager(Storage storageConfig)
+      throws UpgradeException {
+  }
+
+  protected boolean preFinalizeUpgrade() throws IOException {

Review comment:
       It seems like there is some overlap of logic between **preFinalizeUpgrade()** in various components, and org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer#preFinalize. The preFinalize() step can be called either by the finalizer or the finalizationExecutor (based on whether we need that injection point), but in that implementation, we should be able to check the need to finalize as well as do pre-steps. If needed, we can split these 2 steps into needsFinalization() and preFinalize() steps, and call it from one place, the finalizationExecutor.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizationExecutor.java
##########
@@ -0,0 +1,80 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.upgrade;
+
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_IN_PROGRESS;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_REQUIRED;
+
+import org.apache.hadoop.ozone.common.Storage;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * UpgradeFinalizationExecutor for driving the main part of finalization.
+ * Unit/Integration tests can override this to provide error injected version
+ * of this class.
+ */
+
+@SuppressWarnings("checkstyle:VisibilityModifier")
+public class UpgradeFinalizationExecutor {
+  static final Logger LOG =
+      LoggerFactory.getLogger(UpgradeFinalizationExecutor.class);
+
+  public UpgradeFinalizationExecutor() {
+  }
+
+  public Void execute(Storage storageConfig,
+                      BasicUpgradeFinalizer basicUpgradeFinalizer)
+      throws Exception {
+    try {
+      basicUpgradeFinalizer.emitStartingMsg();
+      basicUpgradeFinalizer.getVersionManager()
+          .setUpgradeState(FINALIZATION_IN_PROGRESS);
+
+      /*
+       * Before we can call finalize the feature, we need to make sure that
+       * all existing pipelines are closed and pipeline Manger would freeze
+       * all new pipeline creation.
+       */
+      if(!basicUpgradeFinalizer.preFinalizeUpgrade()) {
+        return null;
+      }
+
+      basicUpgradeFinalizer.finalizeVersionManager(storageConfig);
+
+      basicUpgradeFinalizer.getVersionManager().completeFinalization();

Review comment:
       Is it possible to absorb this into the last step (finalizeVersionManager)? It seems out of place in the pre => finalize => post flow.  

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/InjectedUpgradeFinalizationExecutor.java
##########
@@ -0,0 +1,137 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.upgrade;
+
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.AFTER_COMPLETE_FINALIZATION;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.AFTER_POST_FINALIZE_UPGRADE;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.AFTER_PRE_FINALIZE_UPGRADE;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.BEFORE_COMPLETE_FINALIZATION;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.BEFORE_PRE_FINALIZE_UPGRADE;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_IN_PROGRESS;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_REQUIRED;
+
+import java.util.concurrent.Callable;
+
+import org.apache.hadoop.ozone.common.Storage;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Failure injected extension of UpgradeFinalizationExecutor that can be used by
+ * Unit/Integration Tests.
+ */
+@SuppressWarnings("checkstyle:VisibilityModifier")
+public class InjectedUpgradeFinalizationExecutor extends

Review comment:
       Can we move this to the corresponding test package?

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -436,4 +456,15 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
   protected void updateLayoutVersionInDB(V vm, T comp) throws IOException {
     throw new UnsupportedOperationException();
   }
+
+  protected void postFinalizeUpgrade() throws IOException {
+  }
+
+  protected void finalizeVersionManager(Storage storageConfig)

Review comment:
       nit. Can the method name be just finalizeUpgrade (to go with pre and post)? 

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -57,18 +57,39 @@
   protected V versionManager;
   protected String clientID;
   protected T component;
+  protected UpgradeFinalizationExecutor finalizationExecutor;
 
   private Queue<String> msgs = new ConcurrentLinkedQueue<>();
   protected boolean isDone = false;
 
   public BasicUpgradeFinalizer(V versionManager) {
     this.versionManager = versionManager;
+    this.finalizationExecutor =
+        new UpgradeFinalizationExecutor();
+  }
+
+  @Override
+  public void setFinalizationExecutor(UpgradeFinalizationExecutor executor) {
+    finalizationExecutor = executor;
+  }
+
+  @Override
+  public UpgradeFinalizationExecutor getFinalizationExecutor() {
+    return finalizationExecutor;
   }
 
   public boolean isFinalizationDone() {
     return isDone;
   }
 
+  public void markFinalizationDone() {
+    isDone = true;

Review comment:
       Can we not have **state** here? Why cannot we refer to the version manager to figure out that finalization is done or not?
   
   The way I see it, the version manager maintains state. The upgrade finalizer provides component specific upgrade finalization implementations (without state), and the executor uses both of them to orchestrate the upgrade.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizationExecutor.java
##########
@@ -0,0 +1,80 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.upgrade;
+
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_IN_PROGRESS;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_REQUIRED;
+
+import org.apache.hadoop.ozone.common.Storage;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * UpgradeFinalizationExecutor for driving the main part of finalization.
+ * Unit/Integration tests can override this to provide error injected version
+ * of this class.
+ */
+
+@SuppressWarnings("checkstyle:VisibilityModifier")
+public class UpgradeFinalizationExecutor {

Review comment:
       nit. Maybe we can call it DefaultUpgradeFinalizationExecutor ? 

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -436,4 +456,15 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
   protected void updateLayoutVersionInDB(V vm, T comp) throws IOException {
     throw new UnsupportedOperationException();
   }
+
+  protected void postFinalizeUpgrade() throws IOException {
+  }
+
+  protected void finalizeVersionManager(Storage storageConfig)

Review comment:
       Given that we are refactoring the abstractions here, I think we can also make org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer#finalize abstract and mandate subclasses to implement it.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -187,4 +186,15 @@ StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
    */
   void runPrefinalizeStateActions(Storage storage, T service)
       throws IOException;
+
+  /**
+   * Sets the Finalization Executor driver.
+   * @param executor FinalizationExecutor.
+   */
+  void setFinalizationExecutor(UpgradeFinalizationExecutor executor);

Review comment:
       Can we create overriden constructors to the finalizers (DN, SCM) that take in a custom executor? We can add one with a new executor parameter. If not passed in, the default executor will be used. In that case, we can remove the setter in the interface.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604582071



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -645,4 +648,7 @@ public StatusAndMessages queryUpgradeStatus()
     return upgradeFinalizer.reportStatus(datanodeDetails.getUuidString(),
         false);
   }
+  public UpgradeFinalizer<DatanodeStateMachine> getUpgradeFinalizer() {
+    return  upgradeFinalizer;

Review comment:
       yup




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604583361



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
##########
@@ -95,14 +110,22 @@ public Void call() throws IOException {
               datanodeStateMachine.getLayoutStorage());
           versionManager.finalized(f);
         }
+        injectTestFunctionAtThisPoint(BeforeCompleteFinalization);
         versionManager.completeFinalization();
+        injectTestFunctionAtThisPoint(AfterCompleteFinalization);
         datanodeStateMachine.postFinalizeUpgrade();
+        injectTestFunctionAtThisPoint(AfterPostFinalizeUpgrade);
         emitFinishedMsg();
         return null;
+      } catch (Exception e) {
+        e.printStackTrace();

Review comment:
       done. {this piece is restructured and moved to UpgradeFinalizationExecutor.java in new version}




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-812090359


   @adoroszlai  @avijayanhwx  @swagle  Addressing all CI failures. The long running failure injection test is disabled by default. We need to find a way to run them with less frequency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx merged pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx merged pull request #1998:
URL: https://github.com/apache/ozone/pull/1998


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604584937



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -604,7 +605,7 @@ public DatanodeLayoutStorage getLayoutStorage() {
   }
 
   @VisibleForTesting

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r612032643



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");

Review comment:
       I see, removed this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604587035



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());
+          }
+        }
+      });
+    } catch (Exception e) {
+      LOG.info("DataNode Restart Failed!");
+      Assert.fail(e.getMessage());
+    }
+    return t;
+  }
+
+  /*
+   * Helper function to inject coordinated failures and restarts across
+   * all the DataNode as well as SCM. This can help create targeted test cases
+   * to inject such comprehensive failures in SCM-Upgrade-Context as well as
+   * DataNode-Upgrade-Context.
+   *
+   * Injects Failure in  : SCM as well as ALL the DataNodes.
+   * Executing-Thread-Context : Either the SCM-Upgrade-Finalizer or the
+   *                            DataNode-Upgrade-Finalizer.
+   */
+  private Thread injectSCMAndDataNodeFailureTogetherAtTheSameTime()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // This needs to happen in a separate thread context otherwise
+    // DataNode restart will hang.
+    return new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          // Since we are modifying cluster in an independent thread context,
+          // we synchronize access to it to avoid concurrent modification
+          // exception.
+          synchronized (cluster) {
+            // Work on a Copy of current set of DataNodes to avoid
+            // running into tricky situations.
+            List<HddsDatanodeService> currentDataNodes =
+                new ArrayList<>(cluster.getHddsDatanodes());
+            for (HddsDatanodeService ds: currentDataNodes) {
+              DatanodeDetails dn = ds.getDatanodeDetails();
+              cluster.restartHddsDatanode(dn, false);
+            }
+            cluster.restartStorageContainerManager(false);
+            cluster.waitForClusterToBeReady();
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+  }
+
+  /*
+   * We have various test cases to target single-node or multi-node failures
+   * below.
+   **/
+
+  /*
+   * One node(SCM) failure case:
+   * Thread-Context : SCM-Upgrade
+   *
+   * Test SCM failure During SCM Upgrade before execution point
+   * "PreFinalizeUpgrade". All meaningful Upgrade execution points
+   * are defined in UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+  @Test
+  public void testScmFailuresBeforeScmPreFinalizeUpgrade()
+      throws Exception {
+    scm.getUpgradeFinalizer().configureTestInjectionFunction(
+        BeforePreFinalizeUpgrade,
+        () -> {
+          return injectSCMFailureDuringSCMUpgrade();
+        });

Review comment:
       done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604580743



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -167,4 +190,22 @@ StatusAndMessages finalize(String upgradeClientID, T service)
   StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
       throws IOException;
 
+  /**
+   * Interface to inject arbitrary failures for stress testing.
+   * @param InjectTestFunction function that will be called

Review comment:
       Test related code has been moved out of here. This whole bit is restructured now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604580905



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -167,4 +190,22 @@ StatusAndMessages finalize(String upgradeClientID, T service)
   StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
       throws IOException;
 
+  /**
+   * Interface to inject arbitrary failures for stress testing.
+   * @param InjectTestFunction function that will be called
+   *        code execution reached injectTestFunctionAtThisPoint() location.
+   * @param pointIndex code execution point for a given thread.
+   */
+  void configureTestInjectionFunction(UpgradeTestInjectionPoints pointIndex,
+                                      Callable<Boolean> injectTestFunction);
+
+  /**
+   * Interface to inject error at a given point in an upgrade thread.
+   * @param pointIndex TestFunction Injection point in an upgrade thread.
+   * @return "true" if the calling thread should not continue with further
+   *          upgrade processing, "false" otherwise.
+   */
+  Boolean injectTestFunctionAtThisPoint(UpgradeTestInjectionPoints pointIndex)
+      throws Exception;

Review comment:
       yup




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r612032871



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604579459



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -303,4 +306,26 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
     LOG.error(msg, e);
     throw new UpgradeException(msg, e, resultCode);
   }
+
+  @Override
+  public void configureTestInjectionFunction(
+      UpgradeTestInjectionPoints pointIndex,
+      Callable<Boolean> injectedTestFunction) {
+    injectTestFunction = injectedTestFunction;
+    testInjectionPoint = pointIndex;
+  }
+
+  @Override
+  public Boolean injectTestFunctionAtThisPoint(
+      UpgradeTestInjectionPoints pointIndex) throws Exception {
+    if ((testInjectionPoint != null) &&
+        (pointIndex.getValue() == testInjectionPoint.getValue())) {
+      if (injectTestFunction != null) {

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607425386



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -436,4 +456,15 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
   protected void updateLayoutVersionInDB(V vm, T comp) throws IOException {
     throw new UnsupportedOperationException();
   }
+
+  protected void postFinalizeUpgrade() throws IOException {
+  }
+
+  protected void finalizeVersionManager(Storage storageConfig)
+      throws UpgradeException {
+  }
+
+  protected boolean preFinalizeUpgrade() throws IOException {

Review comment:
       I thought of moving it there and left it as it is to let the Specific DNFinalizer or SCMFinalizer decide as to what to do with pre/post/finalization functions.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -436,4 +456,15 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
   protected void updateLayoutVersionInDB(V vm, T comp) throws IOException {
     throw new UnsupportedOperationException();
   }
+
+  protected void postFinalizeUpgrade() throws IOException {
+  }
+
+  protected void finalizeVersionManager(Storage storageConfig)

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607424432



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -92,6 +95,7 @@
 /**
  * Test SCM and DataNode Upgrade sequence.
  */
+@Ignore

Review comment:
       done

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/InjectedUpgradeFinalizationExecutor.java
##########
@@ -0,0 +1,137 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.upgrade;
+
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.AFTER_COMPLETE_FINALIZATION;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.AFTER_POST_FINALIZE_UPGRADE;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.AFTER_PRE_FINALIZE_UPGRADE;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.BEFORE_COMPLETE_FINALIZATION;
+import static org.apache.hadoop.ozone.upgrade.InjectedUpgradeFinalizationExecutor.UpgradeTestInjectionPoints.BEFORE_PRE_FINALIZE_UPGRADE;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_IN_PROGRESS;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_REQUIRED;
+
+import java.util.concurrent.Callable;
+
+import org.apache.hadoop.ozone.common.Storage;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Failure injected extension of UpgradeFinalizationExecutor that can be used by
+ * Unit/Integration Tests.
+ */
+@SuppressWarnings("checkstyle:VisibilityModifier")
+public class InjectedUpgradeFinalizationExecutor extends

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604586800



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());
+          }
+        }
+      });
+    } catch (Exception e) {
+      LOG.info("DataNode Restart Failed!");
+      Assert.fail(e.getMessage());
+    }
+    return t;
+  }
+
+  /*
+   * Helper function to inject coordinated failures and restarts across
+   * all the DataNode as well as SCM. This can help create targeted test cases
+   * to inject such comprehensive failures in SCM-Upgrade-Context as well as
+   * DataNode-Upgrade-Context.
+   *
+   * Injects Failure in  : SCM as well as ALL the DataNodes.
+   * Executing-Thread-Context : Either the SCM-Upgrade-Finalizer or the
+   *                            DataNode-Upgrade-Finalizer.
+   */
+  private Thread injectSCMAndDataNodeFailureTogetherAtTheSameTime()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // This needs to happen in a separate thread context otherwise
+    // DataNode restart will hang.
+    return new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          // Since we are modifying cluster in an independent thread context,
+          // we synchronize access to it to avoid concurrent modification
+          // exception.
+          synchronized (cluster) {
+            // Work on a Copy of current set of DataNodes to avoid
+            // running into tricky situations.
+            List<HddsDatanodeService> currentDataNodes =
+                new ArrayList<>(cluster.getHddsDatanodes());
+            for (HddsDatanodeService ds: currentDataNodes) {
+              DatanodeDetails dn = ds.getDatanodeDetails();
+              cluster.restartHddsDatanode(dn, false);
+            }
+            cluster.restartStorageContainerManager(false);
+            cluster.waitForClusterToBeReady();
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+  }
+
+  /*
+   * We have various test cases to target single-node or multi-node failures
+   * below.
+   **/
+
+  /*
+   * One node(SCM) failure case:
+   * Thread-Context : SCM-Upgrade
+   *
+   * Test SCM failure During SCM Upgrade before execution point
+   * "PreFinalizeUpgrade". All meaningful Upgrade execution points
+   * are defined in UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+  @Test
+  public void testScmFailuresBeforeScmPreFinalizeUpgrade()
+      throws Exception {
+    scm.getUpgradeFinalizer().configureTestInjectionFunction(
+        BeforePreFinalizeUpgrade,
+        () -> {
+          return injectSCMFailureDuringSCMUpgrade();
+        });

Review comment:
       yup




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607423940



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");

Review comment:
       Thats because, this finalization will fail and we need another thread to simulate another attempt at failed finalization.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604580342



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -43,6 +44,28 @@
 @InterfaceStability.Evolving
 public interface UpgradeFinalizer<T> {
 
+  enum UpgradeTestInjectionPoints {
+    BeforePreFinalizeUpgrade(1),
+    AfterPreFinalizeUpgrade(2),
+    BeforeCompleteFinalization(3),
+    AfterCompleteFinalization(4),
+    AfterPostFinalizeUpgrade(5);

Review comment:
       done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604583975



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
##########
@@ -68,7 +77,7 @@ public StatusAndMessages finalize(String upgradeClientID,
     }
 
     @Override
-    public Void call() throws IOException {
+    public Void call() throws IOException, InterruptedException {

Review comment:
       this is restructured now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604581914



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -616,6 +617,8 @@ public boolean canFinalizeDataNode() {
       case OPEN:
       case CLOSING:
       case UNHEALTHY:
+        LOG.warn("FinalizeUpgrade : Waiting for container to close, current " +
+            "state is: {}", ctr.getContainerState());

Review comment:
       yup.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r612160008



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");

Review comment:
       Spoke too soon. This can not be removed. 
   
   - Test cases  Where it is not a must, it validates that the multiple invocations of finalizations are handled gracefully.
   - It is a must for TestSCMFailureAfterCompleteFinalizaion. This excercises SCM failure just before postFinalization. This case case not be be caught in "*Helper"Function and is race-prone. The only guaranteed way to address it is in this thread, at this execution point.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607425053



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -57,18 +57,39 @@
   protected V versionManager;
   protected String clientID;
   protected T component;
+  protected UpgradeFinalizationExecutor finalizationExecutor;
 
   private Queue<String> msgs = new ConcurrentLinkedQueue<>();
   protected boolean isDone = false;
 
   public BasicUpgradeFinalizer(V versionManager) {
     this.versionManager = versionManager;
+    this.finalizationExecutor =
+        new UpgradeFinalizationExecutor();
+  }
+
+  @Override
+  public void setFinalizationExecutor(UpgradeFinalizationExecutor executor) {
+    finalizationExecutor = executor;
+  }
+
+  @Override
+  public UpgradeFinalizationExecutor getFinalizationExecutor() {
+    return finalizationExecutor;
   }
 
   public boolean isFinalizationDone() {
     return isDone;
   }
 
+  public void markFinalizationDone() {
+    isDone = true;

Review comment:
       Currently OMUpgrade is the path that makes use of it and for some OM upgrade unit tests. If we refactor OM upgrade, we can get rid of this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r592958568



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -604,7 +605,7 @@ public DatanodeLayoutStorage getLayoutStorage() {
   }
 
   @VisibleForTesting

Review comment:
       ```suggestion
   ```
   
   It's no longer visible.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -303,4 +306,26 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
     LOG.error(msg, e);
     throw new UpgradeException(msg, e, resultCode);
   }
+
+  @Override
+  public void configureTestInjectionFunction(
+      UpgradeTestInjectionPoints pointIndex,
+      Callable<Boolean> injectedTestFunction) {
+    injectTestFunction = injectedTestFunction;
+    testInjectionPoint = pointIndex;
+  }
+
+  @Override
+  public Boolean injectTestFunctionAtThisPoint(
+      UpgradeTestInjectionPoints pointIndex) throws Exception {
+    if ((testInjectionPoint != null) &&
+        (pointIndex.getValue() == testInjectionPoint.getValue())) {
+      if (injectTestFunction != null) {

Review comment:
       Please merge these `if` statements.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -616,6 +617,8 @@ public boolean canFinalizeDataNode() {
       case OPEN:
       case CLOSING:
       case UNHEALTHY:
+        LOG.warn("FinalizeUpgrade : Waiting for container to close, current " +
+            "state is: {}", ctr.getContainerState());

Review comment:
       Please extract `ctr.getContainerState()` to a variable to ensure we get the same state in `switch` and the log.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
##########
@@ -1322,4 +1322,8 @@ public StatusAndMessages queryUpgradeFinalizationProgress(
   ) throws IOException {
     return upgradeFinalizer.reportStatus(upgradeClientID, takeover);
   }
+
+  public UpgradeFinalizer<StorageContainerManager> getUpgradeFinalizer() {
+    return  upgradeFinalizer;

Review comment:
       ```suggestion
       return upgradeFinalizer;
   ```

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -43,6 +44,28 @@
 @InterfaceStability.Evolving
 public interface UpgradeFinalizer<T> {
 
+  enum UpgradeTestInjectionPoints {
+    BeforePreFinalizeUpgrade(1),
+    AfterPreFinalizeUpgrade(2),
+    BeforeCompleteFinalization(3),
+    AfterCompleteFinalization(4),
+    AfterPostFinalizeUpgrade(5);

Review comment:
       These names trigger Sonar warning due to not matching pattern for constants.  Should be `BEFORE_PRE_FINALIZE_UPGRADE`, etc.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
##########
@@ -31,12 +35,16 @@
 import org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine;
 import org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer;
 import org.apache.hadoop.ozone.upgrade.LayoutFeature;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * UpgradeFinalizer for the DataNode.
  */
 public class DataNodeUpgradeFinalizer extends
     BasicUpgradeFinalizer<DatanodeStateMachine, HDDSLayoutVersionManager> {
+  static final Logger LOG =

Review comment:
       ```suggestion
     private static final Logger LOG =
   ```

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());

Review comment:
       I don't think calling`fail()` on other thread really triggers test failure.  The following test passes:
   
   ```
     @Test
     public void testThread() throws InterruptedException {
       Thread thread = new Thread(() -> Assert.fail("test"));
       thread.start();
       thread.join();
     }
   ```
   
   To trigger failure, we need some shared state (eg. `AtomicBoolean`) that's set by the other thread to signal failure.  The main test thread should check it and fail if state is set.

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -74,10 +95,10 @@
 public class TestHDDSUpgrade {
 
   /**
-    * Set a timeout for each test.
-    */
+   * Set a timeout for each test.
+   */
   @Rule
-  public Timeout timeout = new Timeout(300000);
+  public Timeout timeout = new Timeout(11000000);

Review comment:
       There is no point in increasing test method timeout to 183 minutes.  Surefire forks are killed after 20 minutes:
   
   ```
   Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on project hadoop-ozone-integration-test: There was a timeout or other error in the fork
   ```

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java
##########
@@ -645,4 +648,7 @@ public StatusAndMessages queryUpgradeStatus()
     return upgradeFinalizer.reportStatus(datanodeDetails.getUuidString(),
         false);
   }
+  public UpgradeFinalizer<DatanodeStateMachine> getUpgradeFinalizer() {
+    return  upgradeFinalizer;

Review comment:
       ```suggestion
       return upgradeFinalizer;
   ```

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -303,4 +306,26 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
     LOG.error(msg, e);
     throw new UpgradeException(msg, e, resultCode);
   }
+
+  @Override
+  public void configureTestInjectionFunction(
+      UpgradeTestInjectionPoints pointIndex,
+      Callable<Boolean> injectedTestFunction) {
+    injectTestFunction = injectedTestFunction;
+    testInjectionPoint = pointIndex;
+  }
+
+  @Override
+  public Boolean injectTestFunctionAtThisPoint(
+      UpgradeTestInjectionPoints pointIndex) throws Exception {
+    if ((testInjectionPoint != null) &&
+        (pointIndex.getValue() == testInjectionPoint.getValue())) {
+      if (injectTestFunction != null) {
+        if (injectTestFunction.call()) {
+          throw new UpgradeTestInjectionAbort();

Review comment:
       I think `UpgradeTestInjectionAbort` should be an unchecked exception, and any exception thrown from `injectTestFunction.call()` should be wrapped in `UpgradeTestInjectionAbort`, too.  This way we could avoid the need to handle test-related exceptions in production code:
   
   https://github.com/apache/ozone/blob/f67b44d2696be2c15f0941ebce6d93833b9ed8d4/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java#L64-L69

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -167,4 +190,22 @@ StatusAndMessages finalize(String upgradeClientID, T service)
   StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
       throws IOException;
 
+  /**
+   * Interface to inject arbitrary failures for stress testing.
+   * @param InjectTestFunction function that will be called

Review comment:
       ```suggestion
      * @param injectTestFunction function that will be called
   ```

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/upgrade/SCMUpgradeFinalizer.java
##########
@@ -68,7 +77,7 @@ public StatusAndMessages finalize(String upgradeClientID,
     }
 
     @Override
-    public Void call() throws IOException {
+    public Void call() throws IOException, InterruptedException {

Review comment:
       I don't see any blocking calls.  Removing `InterruptedException` does not indicate unhandled exception in this method.

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizer.java
##########
@@ -167,4 +190,22 @@ StatusAndMessages finalize(String upgradeClientID, T service)
   StatusAndMessages reportStatus(String upgradeClientId, boolean takeover)
       throws IOException;
 
+  /**
+   * Interface to inject arbitrary failures for stress testing.
+   * @param InjectTestFunction function that will be called
+   *        code execution reached injectTestFunctionAtThisPoint() location.
+   * @param pointIndex code execution point for a given thread.
+   */
+  void configureTestInjectionFunction(UpgradeTestInjectionPoints pointIndex,
+                                      Callable<Boolean> injectTestFunction);
+
+  /**
+   * Interface to inject error at a given point in an upgrade thread.
+   * @param pointIndex TestFunction Injection point in an upgrade thread.
+   * @return "true" if the calling thread should not continue with further
+   *          upgrade processing, "false" otherwise.
+   */
+  Boolean injectTestFunctionAtThisPoint(UpgradeTestInjectionPoints pointIndex)
+      throws Exception;

Review comment:
       Can be `void`, as its return value is always ignored.

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java
##########
@@ -95,14 +110,22 @@ public Void call() throws IOException {
               datanodeStateMachine.getLayoutStorage());
           versionManager.finalized(f);
         }
+        injectTestFunctionAtThisPoint(BeforeCompleteFinalization);
         versionManager.completeFinalization();
+        injectTestFunctionAtThisPoint(AfterCompleteFinalization);
         datanodeStateMachine.postFinalizeUpgrade();
+        injectTestFunctionAtThisPoint(AfterPostFinalizeUpgrade);
         emitFinishedMsg();
         return null;
+      } catch (Exception e) {
+        e.printStackTrace();

Review comment:
       Log?

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +400,579 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);
+      }
+      cluster.waitForClusterToBeReady();
+    } catch (Exception e) {
+      LOG.info("DataNode Restarts Failed!");
+      Assert.fail(e.getMessage());
+    }
+    loadSCMState();
+    // returning false from injection function, continues currently ongoing
+    // SCM-Upgrade-Finalization.
+    return false;
+  }
+
+  /*
+   * Helper function to inject a DataNode failure and restart for a specific
+   * DataNode. This injection function can target a specific DataNode and
+   * thus facilitates getting called in the upgrade-finalization thread context
+   * of that specific DataNode.
+   *
+   * Injects Failure in  : Given DataNodes
+   * Executing-Thread-Context : the same DataNode that we are failing here.
+   */
+  private Thread injectDataNodeFailureDuringDataNodeUpgrade(
+      DatanodeDetails dn) {
+    Thread t = null;
+    try {
+      // Schedule the DataNode restart on a separate thread context
+      // otherwise DataNode restart will hang. Also any cluster modification
+      // needs to be guarded since it could get modified in multiple independent
+      // threads.
+      t = new Thread(new Runnable() {
+        @Override
+        public void run() {
+          try {
+            synchronized (cluster) {
+              cluster.restartHddsDatanode(dn, true);
+            }
+          } catch (Exception e) {
+            e.printStackTrace();
+            Assert.fail(e.getMessage());
+          }
+        }
+      });
+    } catch (Exception e) {
+      LOG.info("DataNode Restart Failed!");
+      Assert.fail(e.getMessage());
+    }
+    return t;
+  }
+
+  /*
+   * Helper function to inject coordinated failures and restarts across
+   * all the DataNode as well as SCM. This can help create targeted test cases
+   * to inject such comprehensive failures in SCM-Upgrade-Context as well as
+   * DataNode-Upgrade-Context.
+   *
+   * Injects Failure in  : SCM as well as ALL the DataNodes.
+   * Executing-Thread-Context : Either the SCM-Upgrade-Finalizer or the
+   *                            DataNode-Upgrade-Finalizer.
+   */
+  private Thread injectSCMAndDataNodeFailureTogetherAtTheSameTime()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // This needs to happen in a separate thread context otherwise
+    // DataNode restart will hang.
+    return new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          // Since we are modifying cluster in an independent thread context,
+          // we synchronize access to it to avoid concurrent modification
+          // exception.
+          synchronized (cluster) {
+            // Work on a Copy of current set of DataNodes to avoid
+            // running into tricky situations.
+            List<HddsDatanodeService> currentDataNodes =
+                new ArrayList<>(cluster.getHddsDatanodes());
+            for (HddsDatanodeService ds: currentDataNodes) {
+              DatanodeDetails dn = ds.getDatanodeDetails();
+              cluster.restartHddsDatanode(dn, false);
+            }
+            cluster.restartStorageContainerManager(false);
+            cluster.waitForClusterToBeReady();
+          }
+        } catch (Exception e) {
+          e.printStackTrace();
+          Assert.fail(e.getMessage());
+        }
+      }
+    });
+  }
+
+  /*
+   * We have various test cases to target single-node or multi-node failures
+   * below.
+   **/
+
+  /*
+   * One node(SCM) failure case:
+   * Thread-Context : SCM-Upgrade
+   *
+   * Test SCM failure During SCM Upgrade before execution point
+   * "PreFinalizeUpgrade". All meaningful Upgrade execution points
+   * are defined in UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+  @Test
+  public void testScmFailuresBeforeScmPreFinalizeUpgrade()
+      throws Exception {
+    scm.getUpgradeFinalizer().configureTestInjectionFunction(
+        BeforePreFinalizeUpgrade,
+        () -> {
+          return injectSCMFailureDuringSCMUpgrade();
+        });

Review comment:
       ```suggestion
           this::injectSCMFailureDuringSCMUpgrade);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607424351



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          testPassed.set(false);
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);

Review comment:
       Not these will be new test cases. We can keep on adding new test cases like these.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r607425122



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizationExecutor.java
##########
@@ -0,0 +1,80 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.upgrade;
+
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_IN_PROGRESS;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_REQUIRED;
+
+import org.apache.hadoop.ozone.common.Storage;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * UpgradeFinalizationExecutor for driving the main part of finalization.
+ * Unit/Integration tests can override this to provide error injected version
+ * of this class.
+ */
+
+@SuppressWarnings("checkstyle:VisibilityModifier")
+public class UpgradeFinalizationExecutor {
+  static final Logger LOG =
+      LoggerFactory.getLogger(UpgradeFinalizationExecutor.class);
+
+  public UpgradeFinalizationExecutor() {
+  }
+
+  public Void execute(Storage storageConfig,
+                      BasicUpgradeFinalizer basicUpgradeFinalizer)
+      throws Exception {
+    try {
+      basicUpgradeFinalizer.emitStartingMsg();
+      basicUpgradeFinalizer.getVersionManager()
+          .setUpgradeState(FINALIZATION_IN_PROGRESS);
+
+      /*
+       * Before we can call finalize the feature, we need to make sure that
+       * all existing pipelines are closed and pipeline Manger would freeze
+       * all new pipeline creation.
+       */
+      if(!basicUpgradeFinalizer.preFinalizeUpgrade()) {
+        return null;
+      }
+
+      basicUpgradeFinalizer.finalizeVersionManager(storageConfig);
+
+      basicUpgradeFinalizer.getVersionManager().completeFinalization();

Review comment:
       done

##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/UpgradeFinalizationExecutor.java
##########
@@ -0,0 +1,80 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.ozone.upgrade;
+
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_IN_PROGRESS;
+import static org.apache.hadoop.ozone.upgrade.UpgradeFinalizer.Status.FINALIZATION_REQUIRED;
+
+import org.apache.hadoop.ozone.common.Storage;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * UpgradeFinalizationExecutor for driving the main part of finalization.
+ * Unit/Integration tests can override this to provide error injected version
+ * of this class.
+ */
+
+@SuppressWarnings("checkstyle:VisibilityModifier")
+public class UpgradeFinalizationExecutor {

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-812341848


   The one failure in CI is unrelated with the changes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-816994839


   > > These fault injection tests are adding significant time to complete CI runs
   > 
   > We can skip it in regular integration tests by adding an `<exclude>` for `TestHDDSUpgrade` in `pom.xml`:
   > 
   > https://github.com/apache/ozone/blob/2ce05944a502c17b38760807c3bddb5398bf7882/pom.xml#L2175-L2191
   > 
   > If the pre-existing test case(s) or any other future test cases need to be run as regular integration tests, then the new injection test cases should be separated into a separate class.
   > 
   > > is there any way to 1) run the conditionally on layout version upgrade
   > 
   > We can introduce a new workflow for failure injection tests. It can be scheduled with lower frequency.
   > 
   > Is "layout version upgrade" indicated by changes to specific source files, which we could use as trigger? Also, are you sure that upgrade functionality is not affected by other cocde changes?
   > 
   > > 1. speed up the tests by running as separate processes?
   > 
   > We can override surefire fork parameters for this separate workflow.
   
   For now we have disabled long running tests. Therefore we do not need to make change in pom.xml for now. But we do need a way to run these test with less frequency e.g. every 50th commit or something like that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604579709



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -303,4 +306,26 @@ private void logAndThrow(Exception e, String msg, ResultCodes resultCode)
     LOG.error(msg, e);
     throw new UpgradeException(msg, e, resultCode);
   }
+
+  @Override
+  public void configureTestInjectionFunction(
+      UpgradeTestInjectionPoints pointIndex,
+      Callable<Boolean> injectedTestFunction) {
+    injectTestFunction = injectedTestFunction;
+    testInjectionPoint = pointIndex;
+  }
+
+  @Override
+  public Boolean injectTestFunctionAtThisPoint(
+      UpgradeTestInjectionPoints pointIndex) throws Exception {
+    if ((testInjectionPoint != null) &&
+        (pointIndex.getValue() == testInjectionPoint.getValue())) {
+      if (injectTestFunction != null) {
+        if (injectTestFunction.call()) {
+          throw new UpgradeTestInjectionAbort();

Review comment:
       This is restructured to take the test code out from production. code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-818324365


   > @prashantpogde Can we resolve the merge conflicts? That will trigger CI which will actually run the added tests.
   
   Done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r606386264



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -92,6 +95,7 @@
 /**
  * Test SCM and DataNode Upgrade sequence.
  */
+@Ignore

Review comment:
       Do we need to Ignore the entire test class? Why not just ignore long running tests? 

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");
+        } catch (IOException e) {
+          e.printStackTrace();
+          testPassed.set(false);
+        }
+      }
+    });
+    t.start();
+    return true;
+  }
+
+  /*
+   * Helper function to inject DataNode failures and DataNode restarts at a
+   * given execution point during SCM-Upgrade. Please note that it fails all
+   * the DataNodes in the cluster and is part of test cases that simulate
+   * multi-node failure at specific code-execution points during SCM Upgrade.
+   * Please note that this helper function should be called in the thread
+   * context of an SCM-Upgrade only. The return value has a significance that
+   * it does not abort the currently ongoing SCM upgrade. because this
+   * failure injection does not fail the SCM node and only impacts datanodes,
+   *  we do not need to schedule another scm-finalize-upgrade here.
+   *
+   * Injects Failure in  : All the DataNodes
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectDataNodeFailureDuringSCMUpgrade() {
+    try {
+      // Work on a Copy of current set of DataNodes to avoid
+      // running into tricky situations.
+      List<HddsDatanodeService> currentDataNodes =
+          new ArrayList<>(cluster.getHddsDatanodes());
+      for (HddsDatanodeService ds: currentDataNodes) {
+        DatanodeDetails dn = ds.getDatanodeDetails();
+        cluster.restartHddsDatanode(dn, false);

Review comment:
       Apart from immediate restart of the DN, can we also look at a case where the DN was brought down (or already down) during finalization, and can join late after finalization is completed?
   
   Do we have a test case where a new node (DN MLV = DN SLV > SCM MLV) is registering during finalization or when SCM is in pre-Finalize? 

##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");

Review comment:
       Why are we attempting a finalize from within a failure function? There is also a finalization attempt in the actual test through TestHDDSUpgrade#testFinalizationWithFailuerInjectionHelper.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] prashantpogde commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
prashantpogde commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r604583812



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
##########
@@ -1322,4 +1322,8 @@ public StatusAndMessages queryUpgradeFinalizationProgress(
   ) throws IOException {
     return upgradeFinalizer.reportStatus(upgradeClientID, takeover);
   }
+
+  public UpgradeFinalizer<StorageContainerManager> getUpgradeFinalizer() {
+    return  upgradeFinalizer;

Review comment:
       yup




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-819576725


   Thanks for working on this @prashantpogde. I am merging this, with a follow up item of HDDS-5108. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] swagle commented on pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
swagle commented on pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#issuecomment-797126901


   @adoroszlai These fault injection tests are adding significant time to complete CI runs, is there any way to 1) run the conditionally on layout version upgrade or 2) speed up the tests by running as separate processes?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r611004955



##########
File path: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestHDDSUpgrade.java
##########
@@ -305,4 +405,655 @@ public void testFinalizationFromInitialVersionToLatestVersion()
     // Verify that new pipeline can be created with upgraded datanodes.
     testPostUpgradePipelineCreation();
   }
+
+  /*
+   * All the subsequent tests here are failure cases. Some of the tests below
+   * could simultaneously fail one or more nodes at specific execution points
+   * and in different thread contexts.
+   * Upgrade path key execution points are defined in
+   * UpgradeFinalizer:UpgradeTestInjectionPoints.
+   */
+
+  /*
+   * Helper function to inject SCM failure and a SCM restart at a given
+   * execution point during SCM-Upgrade.
+   *
+   * Injects Failure in  : SCM
+   * Executing-Thread-Context : SCM-Upgrade
+   */
+  private Boolean injectSCMFailureDuringSCMUpgrade()
+      throws InterruptedException, TimeoutException, AuthenticationException,
+      IOException {
+    // For some tests this could get called in a different thread context.
+    // We need to guard concurrent updates to the cluster.
+    synchronized(cluster) {
+      cluster.restartStorageContainerManager(true);
+      loadSCMState();
+    }
+    // The ongoing current SCM Upgrade is getting aborted at this point. We
+    // need to schedule a new SCM Upgrade on a different thread context.
+    Thread t = new Thread(new Runnable() {
+      @Override
+      public void run() {
+        try {
+          loadSCMState();
+          scm.finalizeUpgrade("xyz");

Review comment:
       @prashantpogde Maybe I am missing something here, but the method 'testFinalizationWithFailuerInjectionHelper' is also invoked, and within that we do a _scm.finalizeUpgrade("xyz")_. Why do we need to finalize twice?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] avijayanhwx commented on a change in pull request #1998: HDDS-4914. Failure injection and validating HDDS upgrade.

Posted by GitBox <gi...@apache.org>.
avijayanhwx commented on a change in pull request #1998:
URL: https://github.com/apache/ozone/pull/1998#discussion_r613314774



##########
File path: hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java
##########
@@ -57,18 +57,39 @@
   protected V versionManager;
   protected String clientID;
   protected T component;
+  protected UpgradeFinalizationExecutor finalizationExecutor;
 
   private Queue<String> msgs = new ConcurrentLinkedQueue<>();
   protected boolean isDone = false;
 
   public BasicUpgradeFinalizer(V versionManager) {
     this.versionManager = versionManager;
+    this.finalizationExecutor =
+        new UpgradeFinalizationExecutor();
+  }
+
+  @Override
+  public void setFinalizationExecutor(UpgradeFinalizationExecutor executor) {
+    finalizationExecutor = executor;
+  }
+
+  @Override
+  public UpgradeFinalizationExecutor getFinalizationExecutor() {
+    return finalizationExecutor;
   }
 
   public boolean isFinalizationDone() {
     return isDone;
   }
 
+  public void markFinalizationDone() {
+    isDone = true;

Review comment:
       Created HDDS-5108 to track this work item.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org