You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2020/06/25 21:53:29 UTC

[GitHub] [hadoop-ozone] hanishakoneru opened a new pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

hanishakoneru opened a new pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129


   ## What changes were proposed in this pull request?
   
   Follower OM issues a pause on its services before installing new checkpoint from Leader OM (Install Snapshot). If this installation fails for some reason, the OM stays in paused state. It should be unpaused and the old state should be reloaded.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-3741
   
   ## How was this patch tested?
   
   Tested manually. Unit tests will be added later on.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] hanishakoneru commented on pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
hanishakoneru commented on pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#issuecomment-651991144


   Thanks @arp7 for the review. Addressed your review comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r447284712



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3040,45 +3024,61 @@ public TermIndex installSnapshot(String leaderId) {
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
           "installing the new checkpoint.", e);
+
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+
       return null;
     }
 
-    //TODO: un-pause SM if any failures and retry?
+    File dbBackup;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long currentTerm = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
+    boolean loadSuccess = false;
 
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
+    try {
+      // Check if current applied log index is smaller than the downloaded
+      // checkpoint transaction index. If yes, proceed by stopping the ratis
+      // server so that the OM state can be re-initialized. If no then do not
+      // proceed with installSnapshot.
+      boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+          omTransactionInfo, lastAppliedIndex, leaderId, newDBLocation);
+      if (!canProceed) {
+        return null;
+      }
 
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,

Review comment:
       Also the marker file should be created before starting the move operations, and deleted on success.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r447272528



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -2999,32 +3000,15 @@ public TermIndex installSnapshot(String leaderId) {
     }
 
     DBCheckpoint omDBcheckpoint = getDBCheckpointFromLeader(leaderId);
-    Path newDBlocation = omDBcheckpoint.getCheckpointLocation();
+    Path newDBLocation = omDBcheckpoint.getCheckpointLocation();
 
     LOG.info("Downloaded checkpoint from Leader {}, in to the location {}",
-        leaderId, newDBlocation);
+        leaderId, newDBLocation);
 
-    // Check if current ratis log index is smaller than the downloaded
-    // checkpoint transaction index. If yes, proceed by stopping the ratis
-    // server so that the OM state can be re-initialized. If no, then do not
-    // proceed with installSnapshot.
+    OMTransactionInfo omTransactionInfo = getTransactionInfoFromDB(
+        newDBLocation);
 
-    OMTransactionInfo omTransactionInfo = null;
-
-    Path dbDir = newDBlocation.getParent();
-    if (dbDir == null) {
-      LOG.error("Incorrect DB location path {} received from checkpoint.",
-          newDBlocation);
-      return null;
-    }
-
-    try {
-      omTransactionInfo =
-          OzoneManagerRatisUtils.getTransactionInfoFromDownloadedSnapshot(
-              configuration, dbDir);
-    } catch (Exception ex) {
-      LOG.error("Failed during opening downloaded snapshot from " +
-          "{} to obtain transaction index", newDBlocation, ex);
+    if (omTransactionInfo == null) {

Review comment:
       Let's log an error here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r447280689



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3040,45 +3024,61 @@ public TermIndex installSnapshot(String leaderId) {
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
           "installing the new checkpoint.", e);
+
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+
       return null;
     }
 
-    //TODO: un-pause SM if any failures and retry?
+    File dbBackup;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long currentTerm = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
+    boolean loadSuccess = false;
 
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
+    try {
+      // Check if current applied log index is smaller than the downloaded
+      // checkpoint transaction index. If yes, proceed by stopping the ratis
+      // server so that the OM state can be re-initialized. If no then do not
+      // proceed with installSnapshot.
+      boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+          omTransactionInfo, lastAppliedIndex, leaderId, newDBLocation);
+      if (!canProceed) {
+        return null;
+      }
 
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
+            newDBLocation);
+      } catch (Exception e) {
+        LOG.error("OM DB checkpoint replacement with new downloaded " +
+            "checkpoint failed.", e);
+        return null;
+      }
 
-    // If downloaded DB has transaction info less than current one, return.
-    if (!canProceed) {
-      return null;
+      loadSuccess = true;
+    } finally {
+      if (!loadSuccess) {

Review comment:
       Remove this unpause, we already do a reload and unpause below.

##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3040,45 +3024,61 @@ public TermIndex installSnapshot(String leaderId) {
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
           "installing the new checkpoint.", e);
+
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+
       return null;
     }
 
-    //TODO: un-pause SM if any failures and retry?
+    File dbBackup;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long currentTerm = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
+    boolean loadSuccess = false;
 
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
+    try {
+      // Check if current applied log index is smaller than the downloaded
+      // checkpoint transaction index. If yes, proceed by stopping the ratis
+      // server so that the OM state can be re-initialized. If no then do not
+      // proceed with installSnapshot.
+      boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+          omTransactionInfo, lastAppliedIndex, leaderId, newDBLocation);
+      if (!canProceed) {
+        return null;
+      }
 
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
+            newDBLocation);
+      } catch (Exception e) {
+        LOG.error("OM DB checkpoint replacement with new downloaded " +
+            "checkpoint failed.", e);
+        return null;
+      }
 
-    // If downloaded DB has transaction info less than current one, return.
-    if (!canProceed) {
-      return null;
+      loadSuccess = true;

Review comment:
       Don't need `loadSuccess` any more.

##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3040,45 +3024,61 @@ public TermIndex installSnapshot(String leaderId) {
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
           "installing the new checkpoint.", e);
+
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+
       return null;
     }
 
-    //TODO: un-pause SM if any failures and retry?
+    File dbBackup;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long currentTerm = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
+    boolean loadSuccess = false;
 
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
+    try {
+      // Check if current applied log index is smaller than the downloaded
+      // checkpoint transaction index. If yes, proceed by stopping the ratis
+      // server so that the OM state can be re-initialized. If no then do not
+      // proceed with installSnapshot.
+      boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+          omTransactionInfo, lastAppliedIndex, leaderId, newDBLocation);
+      if (!canProceed) {
+        return null;
+      }
 
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
+            newDBLocation);
+      } catch (Exception e) {
+        LOG.error("OM DB checkpoint replacement with new downloaded " +
+            "checkpoint failed.", e);
+        return null;
+      }
 
-    // If downloaded DB has transaction info less than current one, return.
-    if (!canProceed) {
-      return null;
+      loadSuccess = true;
+    } finally {
+      if (!loadSuccess) {
+        omRatisServer.getOmStateMachine().unpause(lastAppliedIndex,
+            currentTerm);
+      }
     }
 
     long leaderIndex = omTransactionInfo.getTransactionIndex();

Review comment:
       Move these inside the try block, just after you invoke `replaceOMDBWithCheckpoint`. Else these variables should be left to values that reflect the old checkpoint.

##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3040,45 +3024,61 @@ public TermIndex installSnapshot(String leaderId) {
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
           "installing the new checkpoint.", e);
+
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+
       return null;
     }
 
-    //TODO: un-pause SM if any failures and retry?
+    File dbBackup;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long currentTerm = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
+    boolean loadSuccess = false;
 
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
+    try {
+      // Check if current applied log index is smaller than the downloaded
+      // checkpoint transaction index. If yes, proceed by stopping the ratis
+      // server so that the OM state can be re-initialized. If no then do not
+      // proceed with installSnapshot.
+      boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+          omTransactionInfo, lastAppliedIndex, leaderId, newDBLocation);
+      if (!canProceed) {
+        return null;
+      }
 
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,

Review comment:
       `replaceOMDBWithCheckpoint` contract should be that it either leaves the current state or the old state. If it leaves inconsistent state then it should also leave some persistent marker on disk so OM doesn't try to restart with a bad checkpoint.

##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3040,45 +3024,61 @@ public TermIndex installSnapshot(String leaderId) {
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
           "installing the new checkpoint.", e);
+
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+
       return null;
     }
 
-    //TODO: un-pause SM if any failures and retry?
+    File dbBackup;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long currentTerm = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
+    boolean loadSuccess = false;
 
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
+    try {
+      // Check if current applied log index is smaller than the downloaded
+      // checkpoint transaction index. If yes, proceed by stopping the ratis
+      // server so that the OM state can be re-initialized. If no then do not
+      // proceed with installSnapshot.
+      boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+          omTransactionInfo, lastAppliedIndex, leaderId, newDBLocation);
+      if (!canProceed) {
+        return null;
+      }
 
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
+            newDBLocation);
+      } catch (Exception e) {
+        LOG.error("OM DB checkpoint replacement with new downloaded " +
+            "checkpoint failed.", e);
+        return null;
+      }
 
-    // If downloaded DB has transaction info less than current one, return.
-    if (!canProceed) {
-      return null;
+      loadSuccess = true;
+    } finally {
+      if (!loadSuccess) {
+        omRatisServer.getOmStateMachine().unpause(lastAppliedIndex,
+            currentTerm);
+      }
     }
 
     long leaderIndex = omTransactionInfo.getTransactionIndex();
     long leaderTerm = omTransactionInfo.getCurrentTerm();
 
-
-    File dbBackup;
-    try {
-      dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
-          newDBlocation);
-    } catch (Exception e) {
-      LOG.error("OM DB checkpoint replacement with new downloaded checkpoint " +
-          "failed.", e);
-      return null;
-    }
-
     // Reload the OM DB store with the new checkpoint.
     // Restart (unpause) the state machine and update its last applied index
     // to the installed checkpoint's snapshot index.
     try {
       reloadOMState(leaderIndex, leaderTerm);

Review comment:
       Before calling reloadState check for existence of temp_marker file. Also on OM restart.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r453138655



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java
##########
@@ -249,6 +252,20 @@ public void start(OzoneConfiguration configuration) throws IOException {
     if (store == null) {
       File metaDir = OMStorage.getOmDbDir(configuration);
 
+      // Check if there is a DB Inconsistent Marker in the metaDir. This
+      // marker indicates that the DB is in an inconsistent state and hence
+      // the OM process should be terminated.
+      File markerFile = new File(metaDir, DB_TRANSIENT_MARKER);
+      if (Files.exists(markerFile.toPath())) {
+        LOG.error("File {} marks that OM DB is in an inconsistent state.");
+        // Note - The marker file should be deleted only after fixing the DB.
+        // In an HA setup, this can be done by replacing this DB with a
+        // checkpoint from another OM.
+        String errorMsg = "Cannot load OM DB as it is in an inconsistent " +
+            "state.";
+        ExitUtils.terminate(1, errorMsg, LOG);

Review comment:
       Should we use `ExitManager` here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] hanishakoneru merged pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
hanishakoneru merged pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r455950095



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3041,58 +3047,74 @@ public TermIndex installSnapshot(String leaderId) {
       omRatisServer.getOmStateMachine().pause();
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
-          "installing the new checkpoint.", e);
-      return null;
-    }
-
-    //TODO: un-pause SM if any failures and retry?
-
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
-
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
-
-    // If downloaded DB has transaction info less than current one, return.
-    if (!canProceed) {
-      return null;
+          "installing the new checkpoint.");
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+      throw e;
     }
 
-    long leaderIndex = omTransactionInfo.getTransactionIndex();
-    long leaderTerm = omTransactionInfo.getCurrentTerm();
+    File dbBackup = null;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long term = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
 
+    // Check if current applied log index is smaller than the downloaded
+    // checkpoint transaction index. If yes, proceed by stopping the ratis
+    // server so that the OM state can be re-initialized. If no then do not
+    // proceed with installSnapshot.
+    boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+        checkpointTrxnInfo, lastAppliedIndex, leaderId, checkpointLocation);
 
-    File dbBackup;
-    try {
-      dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
-          newDBlocation);
-    } catch (Exception e) {
-      LOG.error("OM DB checkpoint replacement with new downloaded checkpoint " +
-          "failed.", e);
-      return null;
+    if (canProceed) {
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
+            checkpointLocation);
+        term = checkpointTrxnInfo.getTerm();
+        lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex();
+        LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, index: {}",
+            leaderId, term, lastAppliedIndex);
+      } catch (Exception e) {
+        LOG.error("Failed to install Snapshot from {} as OM failed to replace" +
+            " DB with downloaded checkpoint. Reloading old OM state.", e);
+      }
+    } else {
+      LOG.warn("Cannot proceed with InstallSnapshot as OM is at TermIndex {} " +
+          "and checkpoint has lower TermIndex {}. Reloading old state of OM.",
+          termIndex, checkpointTrxnInfo.getTermIndex());
     }
 
     // Reload the OM DB store with the new checkpoint.
     // Restart (unpause) the state machine and update its last applied index
     // to the installed checkpoint's snapshot index.
     try {
-      reloadOMState(leaderIndex, leaderTerm);
-      omRatisServer.getOmStateMachine().unpause(leaderIndex, leaderTerm);
-    } catch (IOException e) {
-      LOG.error("Failed to reload OM state with new DB checkpoint.", e);
-      return null;
+      reloadOMState(lastAppliedIndex, term);
+      omRatisServer.getOmStateMachine().unpause(lastAppliedIndex, term);
+      LOG.info("Reloaded OM state with Term: {} and Index: {}", term,
+          lastAppliedIndex);
+    } catch (IOException ex) {
+      String errorMsg = "Failed to reload OM state and instantiate services.";
+      exitManager.exitSystem(1, errorMsg, ex, LOG);
     }
 
     // Delete the backup DB
     try {
-      FileUtils.deleteFully(dbBackup);
+      if (dbBackup != null) {
+        FileUtils.deleteFully(dbBackup);
+      }

Review comment:
       Let's do this in a followup jira. Ok to commit this one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r455442847



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3041,58 +3047,74 @@ public TermIndex installSnapshot(String leaderId) {
       omRatisServer.getOmStateMachine().pause();
     } catch (Exception e) {
       LOG.error("Failed to stop/ pause the services. Cannot proceed with " +
-          "installing the new checkpoint.", e);
-      return null;
-    }
-
-    //TODO: un-pause SM if any failures and retry?
-
-    long lastAppliedIndex = omRatisServer.getLastAppliedTermIndex().getIndex();
-
-    boolean canProceed =
-        OzoneManagerRatisUtils.verifyTransactionInfo(omTransactionInfo,
-        lastAppliedIndex, leaderId, newDBlocation);
-
-    // If downloaded DB has transaction info less than current one, return.
-    if (!canProceed) {
-      return null;
+          "installing the new checkpoint.");
+      // During stopServices, if KeyManager was stopped successfully and
+      // OMMetadataManager stop failed, we should restart the KeyManager.
+      keyManager.start(configuration);
+      throw e;
     }
 
-    long leaderIndex = omTransactionInfo.getTransactionIndex();
-    long leaderTerm = omTransactionInfo.getCurrentTerm();
+    File dbBackup = null;
+    TermIndex termIndex = omRatisServer.getLastAppliedTermIndex();
+    long term = termIndex.getTerm();
+    long lastAppliedIndex = termIndex.getIndex();
 
+    // Check if current applied log index is smaller than the downloaded
+    // checkpoint transaction index. If yes, proceed by stopping the ratis
+    // server so that the OM state can be re-initialized. If no then do not
+    // proceed with installSnapshot.
+    boolean canProceed = OzoneManagerRatisUtils.verifyTransactionInfo(
+        checkpointTrxnInfo, lastAppliedIndex, leaderId, checkpointLocation);
 
-    File dbBackup;
-    try {
-      dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
-          newDBlocation);
-    } catch (Exception e) {
-      LOG.error("OM DB checkpoint replacement with new downloaded checkpoint " +
-          "failed.", e);
-      return null;
+    if (canProceed) {
+      try {
+        dbBackup = replaceOMDBWithCheckpoint(lastAppliedIndex, oldDBLocation,
+            checkpointLocation);
+        term = checkpointTrxnInfo.getTerm();
+        lastAppliedIndex = checkpointTrxnInfo.getTransactionIndex();
+        LOG.info("Replaced DB with checkpoint from OM: {}, term: {}, index: {}",
+            leaderId, term, lastAppliedIndex);
+      } catch (Exception e) {
+        LOG.error("Failed to install Snapshot from {} as OM failed to replace" +
+            " DB with downloaded checkpoint. Reloading old OM state.", e);
+      }
+    } else {
+      LOG.warn("Cannot proceed with InstallSnapshot as OM is at TermIndex {} " +
+          "and checkpoint has lower TermIndex {}. Reloading old state of OM.",
+          termIndex, checkpointTrxnInfo.getTermIndex());
     }
 
     // Reload the OM DB store with the new checkpoint.
     // Restart (unpause) the state machine and update its last applied index
     // to the installed checkpoint's snapshot index.
     try {
-      reloadOMState(leaderIndex, leaderTerm);
-      omRatisServer.getOmStateMachine().unpause(leaderIndex, leaderTerm);
-    } catch (IOException e) {
-      LOG.error("Failed to reload OM state with new DB checkpoint.", e);
-      return null;
+      reloadOMState(lastAppliedIndex, term);
+      omRatisServer.getOmStateMachine().unpause(lastAppliedIndex, term);
+      LOG.info("Reloaded OM state with Term: {} and Index: {}", term,
+          lastAppliedIndex);
+    } catch (IOException ex) {
+      String errorMsg = "Failed to reload OM state and instantiate services.";
+      exitManager.exitSystem(1, errorMsg, ex, LOG);
     }
 
     // Delete the backup DB
     try {
-      FileUtils.deleteFully(dbBackup);
+      if (dbBackup != null) {
+        FileUtils.deleteFully(dbBackup);
+      }

Review comment:
       On successful restart we should clear any old backups.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r455433582



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java
##########
@@ -249,6 +252,20 @@ public void start(OzoneConfiguration configuration) throws IOException {
     if (store == null) {
       File metaDir = OMStorage.getOmDbDir(configuration);
 
+      // Check if there is a DB Inconsistent Marker in the metaDir. This
+      // marker indicates that the DB is in an inconsistent state and hence
+      // the OM process should be terminated.
+      File markerFile = new File(metaDir, DB_TRANSIENT_MARKER);
+      if (Files.exists(markerFile.toPath())) {

Review comment:
       Nitpick: why not just `markerFile.exists()`?

##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/OzoneManagerSnapshotProvider.java
##########
@@ -113,8 +113,10 @@ public OzoneManagerSnapshotProvider(MutableConfigurationSource conf,
   public DBCheckpoint getOzoneManagerDBSnapshot(String leaderOMNodeID)
       throws IOException {
     String snapshotTime = Long.toString(System.currentTimeMillis());
-    String snapshotFileName = Paths.get(omSnapshotDir.getAbsolutePath(),
-        snapshotTime, OM_DB_NAME).toFile().getAbsolutePath();
+    String snapshotFileName = OM_DB_NAME + "-" + leaderOMNodeID

Review comment:
       Did we change the snapshot file name convention in this jira?

##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/utils/OzoneManagerRatisUtils.java
##########
@@ -228,15 +228,34 @@ public static Status exceptionToResponseStatus(IOException exception) {
   }
 
   /**
-   * Obtain Transaction info from downloaded snapshot DB.
+   * Obtain OMTransactionInfo from Checkpoint.
+   */
+  public static OMTransactionInfo getTrxnInfoFromCheckpoint(
+      OzoneConfiguration conf, Path dbPath) throws Exception {
+    Path dbDir = dbPath.getParent();
+    String dbName = dbPath.getFileName().toString();
+    if (dbDir == null) {
+      throw new IOException("Checkpoint {} does not have proper DB location");

Review comment:
       Print the `dbPath` in the exception message to make debugging easier in case this happens. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org


[GitHub] [hadoop-ozone] arp7 commented on a change in pull request #1129: HDDS-3741. Reload old OM state if Install Snapshot from Leader fails

Posted by GitBox <gi...@apache.org>.
arp7 commented on a change in pull request #1129:
URL: https://github.com/apache/hadoop-ozone/pull/1129#discussion_r453138266



##########
File path: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
##########
@@ -3314,4 +3352,8 @@ private void startJVMPauseMonitor() {
     jvmPauseMonitor.init(configuration);
     jvmPauseMonitor.start();
   }
+
+  void setExitManagerForTesting(ExitManager exitManagerForTesting) {

Review comment:
       Can you add a `@VisibleForTesting` annotation here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org