You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/06/21 03:45:32 UTC

[GitHub] [ozone] errose28 opened a new pull request, #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

errose28 opened a new pull request, #3534:
URL: https://github.com/apache/ozone/pull/3534

   ## What changes were proposed in this pull request?
   
   Resume HDDS finalization after a restart or crash on any or all SCMs. Hand off finalization to the new SCM leader on leader change.
   
   ## What is the link to the Apache JIRA
   
   HDDS-6761
   
   ## How was this patch tested?
   
   Integration tests added. The new tests finish in about 10 minutes. Since they are somewhat slow and testing corner cases, I am open to annotating them with `@Slow` to skip them on normal CI runs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3534:
URL: https://github.com/apache/ozone/pull/3534#issuecomment-1182557852

   Thank you @errose28 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on a diff in pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
errose28 commented on code in PR #3534:
URL: https://github.com/apache/ozone/pull/3534#discussion_r918412092


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -85,14 +87,20 @@ public StatusAndMessages finalize(String upgradeClientID, T service)
     StatusAndMessages response = initFinalize(upgradeClientID, service);
     if (finalizationLock.tryLock()) {
       try {
-        // Even if the status indicates finalization completed, the component
-        // may not have finished all its specific steps if finalization was
-        // interrupted, so we should re-invoke them here.
+        // If we were able to enter the lock and finalization status is "in
+        // progress", we should resume finalization because the last attempt
+        // was interrupted. If an attempt was currently ongoing, the lock
+        // would have been held.
         if (response.status() == FINALIZATION_REQUIRED ||
-            !componentFinishedFinalizationSteps(service)) {
+            response.status() == FINALIZATION_IN_PROGRESS) {
           finalizationExecutor.execute(service, this);
           response = STARTING_MSG;
         }
+        // Else, the initial response we got from initFinalize was still
+        // correct.
+      } catch (NotLeaderException e) {
+        LOG.info("Leader change encountered during finalization. This " +
+            "component will continue to finalize as a follower.", e);

Review Comment:
   This component was the leader and driving finalization. If an NLE is thrown, then this component is no longer the leader. It will stop sending Ratis requests to drive finalization and instead finalize based on Ratis transactions it receives from the leader. I will amend the log message to try to clarify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
errose28 commented on PR #3534:
URL: https://github.com/apache/ozone/pull/3534#issuecomment-1181337740

   Thanks for the first round review @kerneltime. All of your comments should be incorporated into the latest version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
errose28 commented on PR #3534:
URL: https://github.com/apache/ozone/pull/3534#issuecomment-1165872080

   Looks like there is an intermittent failure in `TestScmHaFinalization#testSnapshotFinalization`. However, it is the leader, not the SCM that finalized from a snapshot, that hit the error:
   
   ```
   2022-06-24 00:37:59,413 [Listener at 127.0.0.1/43961] INFO  upgrade.TestScmHAFinalization (TestScmHAFinalization.java:testSnapshotFinalization(251)) - Inactive SCM node ID: scmNode-3
   ...
   2022-06-24 00:29:26,723 [Listener at 0.0.0.0/35361] INFO  upgrade.TestHddsUpgradeUtils (TestHddsUpgradeUtils.java:lambda$waitForFinalization$0(76)) - Waiting for upgrade finalization to complete from client. Current status is FINALIZATION_DONE.
   2022-06-24 00:29:26,725 [Listener at 0.0.0.0/35361] INFO  upgrade.TestScmHAFinalization (TestScmHAFinalization.java:lambda$waitForScmToFinalize$2(298)) - Waiting for SCM scmNode-2 (leader? false) to finalize. Current finalization checkpoint is FINALIZATION_COMPLETE
   2022-06-24 00:29:26,725 [Listener at 0.0.0.0/35361] INFO  upgrade.TestScmHAFinalization (TestScmHAFinalization.java:lambda$waitForScmToFinalize$2(298)) - Waiting for SCM scmNode-3 (leader? false) to finalize. Current finalization checkpoint is FINALIZATION_COMPLETE
   2022-06-24 00:29:26,725 [Listener at 0.0.0.0/35361] INFO  upgrade.TestScmHAFinalization (TestScmHAFinalization.java:lambda$waitForScmToFinalize$2(298)) - Waiting for SCM scmNode-1 (leader? true) to finalize. Current finalization checkpoint is FINALIZATION_COMPLETE
   2022-06-24 00:29:26,725 [Listener at 0.0.0.0/35361] INFO  upgrade.TestHddsUpgradeUtils (TestHddsUpgradeUtils.java:testPostUpgradeConditionsSCM(103)) - Testing post upgrade conditions on SCM with node ID: scmNode-2
   2022-06-24 00:29:26,728 [Listener at 0.0.0.0/35361] INFO  upgrade.TestHddsUpgradeUtils (TestHddsUpgradeUtils.java:testPostUpgradeConditionsSCM(103)) - Testing post upgrade conditions on SCM with node ID: scmNode-3
   2022-06-24 00:29:26,729 [Listener at 0.0.0.0/35361] INFO  upgrade.TestHddsUpgradeUtils (TestHddsUpgradeUtils.java:testPostUpgradeConditionsSCM(103)) - Testing post upgrade conditions on SCM with node ID: scmNode-1
   2022-06-24 00:29:26,739 [Listener at 0.0.0.0/35361] INFO  ozone.MiniOzoneClusterImpl (MiniOzoneClusterImpl.java:shutdown(448)) - Shutting down the Mini Ozone Cluster
   ```
   
   The error is
   ```
   org.apache.hadoop.hdds.upgrade.TestScmHAFinalization.testSnapshotFinalization  Time elapsed: 73.328 s  <<< FAILURE!
   java.lang.AssertionError
   	at org.junit.Assert.fail(Assert.java:87)
   	at org.junit.Assert.assertTrue(Assert.java:42)
   	at org.junit.Assert.assertTrue(Assert.java:53)
   	at org.apache.hadoop.hdds.upgrade.TestHddsUpgradeUtils.testPostUpgradeConditionsSCM(TestHddsUpgradeUtils.java:112)
   	at org.apache.hadoop.hdds.upgrade.TestHddsUpgradeUtils.testPostUpgradeConditionsSCM(TestHddsUpgradeUtils.java:105)
   	at org.apache.hadoop.hdds.upgrade.TestScmHAFinalization.testSnapshotFinalization(TestScmHAFinalization.java:266)
   ```
   
   Which indicates the leader scmNode-1 did not cross the `FINALIZATION_COMPLETE` checkpoint in this check, even though the logs show that it already had earlier. I will need investigate this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime merged pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
kerneltime merged PR #3534:
URL: https://github.com/apache/ozone/pull/3534


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
errose28 commented on PR #3534:
URL: https://github.com/apache/ozone/pull/3534#issuecomment-1171310964

   I was looking at some incorrect logs in the previous message, sorry for confusion. The updated integration tests passed 49/50 [times on my fork](https://github.com/errose28/hadoop-ozone/runs/7126176923?check_suite_focus=true), and the one failure happened within the mini ozone cluster waiting for a leader election while finalization was paused by the test, not waiting for finalization itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on a diff in pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
kerneltime commented on code in PR #3534:
URL: https://github.com/apache/ozone/pull/3534#discussion_r918577167


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -82,29 +84,36 @@ public StatusAndMessages finalize(String upgradeClientID, T service)
     // sets the finalization status to FINALIZATION_IN_PROGRESS.
     // Therefore, a lock is used to make sure only one finalization thread is
     // running at a time.
-    StatusAndMessages response = initFinalize(upgradeClientID, service);
-    if (finalizationLock.tryLock()) {
+    if (isFinalized(versionManager.getUpgradeState())) {
+      return FINALIZED_MSG;
+    } else if (finalizationLock.tryLock()) {

Review Comment:
   ```suggestion
       } 
       if (finalizationLock.tryLock()) {
   ```
   ```suggestion
       } else if (finalizationLock.tryLock()) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on a diff in pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
kerneltime commented on code in PR #3534:
URL: https://github.com/apache/ozone/pull/3534#discussion_r906418391


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -46,6 +47,7 @@
 import org.apache.hadoop.ozone.upgrade.UpgradeException.ResultCodes;
 
 import com.google.common.annotations.VisibleForTesting;
+import org.apache.ratis.protocol.exceptions.NotLeaderException;
 
 /**
  * UpgradeFinalizer implementation for the Storage Container Manager service.

Review Comment:
   Comment needs updating.



##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -85,14 +87,20 @@ public StatusAndMessages finalize(String upgradeClientID, T service)
     StatusAndMessages response = initFinalize(upgradeClientID, service);
     if (finalizationLock.tryLock()) {
       try {
-        // Even if the status indicates finalization completed, the component
-        // may not have finished all its specific steps if finalization was
-        // interrupted, so we should re-invoke them here.
+        // If we were able to enter the lock and finalization status is "in
+        // progress", we should resume finalization because the last attempt
+        // was interrupted. If an attempt was currently ongoing, the lock
+        // would have been held.
         if (response.status() == FINALIZATION_REQUIRED ||
-            !componentFinishedFinalizationSteps(service)) {
+            response.status() == FINALIZATION_IN_PROGRESS) {
           finalizationExecutor.execute(service, this);
           response = STARTING_MSG;
         }
+        // Else, the initial response we got from initFinalize was still
+        // correct.
+      } catch (NotLeaderException e) {
+        LOG.info("Leader change encountered during finalization. This " +
+            "component will continue to finalize as a follower.", e);

Review Comment:
   What does this mean? Will the follower continue to finalize as a follower when the leader executes it, or will the follower continue to finalize on its own?



##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/upgrade/DataNodeUpgradeFinalizer.java:
##########
@@ -54,6 +55,8 @@ public void preFinalizeUpgrade(DatanodeStateMachine dsm)
       String msg = "Pre Finalization checks failed on the DataNode.";
       logAndEmit(msg);
       throw new UpgradeException(msg, PREFINALIZE_VALIDATION_FAILED);
+    } else {

Review Comment:
   Nit: No need for `else`



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAInvocationHandler.java:
##########
@@ -60,12 +60,19 @@ public SCMHAInvocationHandler(final RequestType requestType,
   @Override
   public Object invoke(final Object proxy, final Method method,
                        final Object[] args) throws Throwable {
+    // Javadoc for InvocationHandler#invoke specifies that args will be null
+    // if the method takes no arguments. Convert this to an empty array for
+    // easier handling.
+    Object[] convertedArgs = args;
+    if (args == null) {
+      convertedArgs = new Object[]{};
+    }

Review Comment:
   Minor nit:
   
   ```suggestion
       Object[] convertedArgs = (args == null) ? new Object[]{} : args;
   ```



##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -46,6 +47,7 @@
 import org.apache.hadoop.ozone.upgrade.UpgradeException.ResultCodes;
 
 import com.google.common.annotations.VisibleForTesting;
+import org.apache.ratis.protocol.exceptions.NotLeaderException;
 
 /**
  * UpgradeFinalizer implementation for the Storage Container Manager service.

Review Comment:
   Nit: Comment needs updating.



##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -85,14 +87,20 @@ public StatusAndMessages finalize(String upgradeClientID, T service)
     StatusAndMessages response = initFinalize(upgradeClientID, service);

Review Comment:
   Should this be inside the lock? 2 threads from 2 different clients trying to finalize as well as the status for the finalization should be evaluated after the lock was won.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime commented on a diff in pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
kerneltime commented on code in PR #3534:
URL: https://github.com/apache/ozone/pull/3534#discussion_r918577167


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -82,29 +84,36 @@ public StatusAndMessages finalize(String upgradeClientID, T service)
     // sets the finalization status to FINALIZATION_IN_PROGRESS.
     // Therefore, a lock is used to make sure only one finalization thread is
     // running at a time.
-    StatusAndMessages response = initFinalize(upgradeClientID, service);
-    if (finalizationLock.tryLock()) {
+    if (isFinalized(versionManager.getUpgradeState())) {
+      return FINALIZED_MSG;
+    } else if (finalizationLock.tryLock()) {

Review Comment:
   ```suggestion
       } 
       if (finalizationLock.tryLock()) {
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on a diff in pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
errose28 commented on code in PR #3534:
URL: https://github.com/apache/ozone/pull/3534#discussion_r918418536


##########
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/upgrade/BasicUpgradeFinalizer.java:
##########
@@ -85,14 +87,20 @@ public StatusAndMessages finalize(String upgradeClientID, T service)
     StatusAndMessages response = initFinalize(upgradeClientID, service);

Review Comment:
   Yes init should be in the lock for the case you described. I will refactor this block.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on pull request #3534: HDDS-6761. [SCM HA finalization] Handle restarts, crashes, and leader changes.

Posted by GitBox <gi...@apache.org>.
errose28 commented on PR #3534:
URL: https://github.com/apache/ozone/pull/3534#issuecomment-1182558213

   Thanks for reviewing this large patch @kerneltime 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org