Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/06/01 10:15:29 UTC

[GitHub] [ozone] bharatviswa504 opened a new pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

bharatviswa504 opened a new pull request #2294:
URL: https://github.com/apache/ozone/pull/2294


   ## What changes were proposed in this pull request?
   
   After an unclean SCM shutdown, SCM may never come out of safe mode.
   
   Proposal:
   1. Use leader ready to start the background services.
   2. In applyTransaction, after the apply completes, if SCM is in safe mode, refresh and validate the safe mode rules against the current state.
   3. For leader ready, run a background daemon thread that checks whether the leader is ready using the Ratis API. Once it is, call notifyStatusChanged in SCMServices and update isLeader in SCMContext.
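
Item 3 of the proposal can be pictured with a minimal sketch like the one below. It is an illustration only, not the code in this PR: `LeaderReadyPoller`, its fields, and the poll interval are hypothetical names, and the `BooleanSupplier` merely stands in for whatever Ratis call reports leader readiness.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from Ozone or Ratis. */
final class LeaderReadyPoller {
  private final BooleanSupplier leaderReadyCheck; // stands in for the Ratis query
  private final Runnable onLeaderReady;           // e.g. update context, notify services
  private final long pollIntervalMs;

  LeaderReadyPoller(BooleanSupplier leaderReadyCheck, Runnable onLeaderReady,
      long pollIntervalMs) {
    this.leaderReadyCheck = leaderReadyCheck;
    this.onLeaderReady = onLeaderReady;
    this.pollIntervalMs = pollIntervalMs;
  }

  /** Starts a daemon thread that polls until the check passes, then runs the callback. */
  Thread start() {
    Thread t = new Thread(() -> {
      try {
        while (!leaderReadyCheck.getAsBoolean()) {
          Thread.sleep(pollIntervalMs);
        }
        onLeaderReady.run();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop quietly on shutdown
      }
    }, "leader-ready-poller");
    t.setDaemon(true); // must not keep the JVM alive
    t.start();
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger polls = new AtomicInteger();
    LeaderReadyPoller poller = new LeaderReadyPoller(
        () -> polls.incrementAndGet() >= 3, // pretend readiness arrives on the 3rd poll
        () -> System.out.println("leader ready after " + polls.get() + " polls"),
        10L);
    poller.start().join();
  }
}
```

Making the thread a daemon means a stuck readiness check cannot block process exit, which is presumably why the proposal asks for a daemon thread.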
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-5263
   
   ## How was this patch tested?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] bharatviswa504 commented on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-852885323


   With this approach we see an issue:
   
   
   Before the restart, 2 pipelines were closed and, let's say, SCM removed them and created a new pipeline. The SCM pipeline table still has the 2 old pipelines, because the remove and create operations were not persisted to the DB when SCM was force killed.
   
   Because we call refresh and validate after each applied transaction, the pipeline rules are validated and SCM exits safe mode as soon as the 2nd pipeline removal is applied, without waiting for the remaining pending transactions.
   
   This causes problems such as reads and writes failing even after SCM is out of safe mode.
   ```
   2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
   2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 0, pipeline's with at least one datanode reported threshold count is 0
   ```
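
The arithmetic behind the log above can be reproduced with a small sketch. The threshold formula is the one shown in the `initializeRule` diff of this PR; the rest is assumed for illustration: the class name is hypothetical, the pass condition `current >= threshold` is a guess at how the rule validates, and the values `minHealthyPipelines = 0` and `healthyPipelinesPercent = 0.10` are chosen only because they match the logged thresholds.

```java
/** Hypothetical sketch of the threshold arithmetic; not the Ozone class. */
final class HealthyPipelineThreshold {

  /** Same formula as initializeRule in the diff of this PR. */
  static int threshold(int minHealthyPipelines, double healthyPipelinesPercent,
      int pipelineCount) {
    return Math.max(minHealthyPipelines,
        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));
  }

  /** Assumed pass condition: enough healthy pipelines have been reported. */
  static boolean validate(int currentHealthyPipelineCount, int thresholdCount) {
    return currentHealthyPipelineCount >= thresholdCount;
  }

  public static void main(String[] args) {
    int min = 0;          // assumption, chosen to match the logged thresholds
    double percent = 0.10; // assumption

    // One open pipeline in the table: at least one healthy report is required.
    int withOnePipeline = threshold(min, percent, 1);
    System.out.println(withOnePipeline + " -> " + validate(0, withOnePipeline));

    // After the removal is replayed the table is empty: the threshold drops
    // to 0 and the rule passes with no datanode report at all.
    int withNoPipelines = threshold(min, percent, 0);
    System.out.println(withNoPipelines + " -> " + validate(0, withNoPipelines));
  }
}
```

Under these assumptions, refreshing in the middle of the replay lets the rule pass against a transient empty pipeline table, which is the early exit described above.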
   
   After an offline discussion with @bshashikant, we plan to:
   1. Refresh the SCM safe mode rules once, after leader ready, on all SCMs.
   2. Start the DN RPC port only after leader ready, so that SCM does not come out of safe mode early based on a DB that is not up to date.
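
The ordering in these two points could look roughly like the sketch below. It only illustrates "refresh once, then open the datanode endpoint"; `LeaderReadyStartup` and both callbacks are hypothetical and do not correspond to classes in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; the names are illustrative, not from the patch. */
final class LeaderReadyStartup {
  private final AtomicBoolean done = new AtomicBoolean(false);
  private final Runnable refreshSafeModeRules;
  private final Runnable startDatanodeRpc;

  LeaderReadyStartup(Runnable refreshSafeModeRules, Runnable startDatanodeRpc) {
    this.refreshSafeModeRules = refreshSafeModeRules;
    this.startDatanodeRpc = startDatanodeRpc;
  }

  /**
   * Runs the two steps in order, at most once per process lifetime.
   * Returns true only for the call that actually ran them.
   */
  boolean onLeaderReady() {
    if (!done.compareAndSet(false, true)) {
      return false; // already handled by an earlier notification
    }
    refreshSafeModeRules.run(); // rules now reflect the replayed DB state
    startDatanodeRpc.run();     // only now can datanode reports arrive
    return true;
  }

  public static void main(String[] args) {
    List<String> events = new ArrayList<>();
    LeaderReadyStartup startup = new LeaderReadyStartup(
        () -> events.add("refresh rules"),
        () -> events.add("start DN RPC"));
    startup.onLeaderReady();
    startup.onLeaderReady(); // repeated notification is a no-op
    System.out.println(events);
  }
}
```

Because no datanode report can be processed before the RPC endpoint is open, the rules are never evaluated against reports while the table is still being replayed.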
   



[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 edited a comment on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-854422133


   Tested this on a cluster with the following scenario: close a pipeline, let the scrubber remove it and create a new pipeline, then restart SCM. (SCM still has the old closed/removed pipeline in its DB. Before the fix, SCM would never come out of safe mode, because it reads the old pipeline info from the DB during rule setup.)
   
   Now the rules are refreshed once after leader ready and, after the datanodes register, the rules are successfully validated. (Unlike before, the pipeline rules are not validated as soon as the pipeline removal is applied.)
   
   ```
   2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
   2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
   ```
   ```
   2021-06-04 07:05:29,184 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 9642160f-49ab-4400-a66c-e7f4210f4ca0{ip: 172.27.131.64, host: bv-unsec-2.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   2021-06-04 07:05:30,038 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : b44ed911-3a93-4b83-9e81-88fe47369a84{ip: 172.27.102.9, host: bv-unsec-3.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 37d28fa1-41db-4f44-963e-45675b822884{ip: 172.27.97.68, host: bv-unsec-1.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   
   ```
   
   ```
   2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:31,146 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:32,009 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:32,059 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,841 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,888 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,921 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,331 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 1, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.
   ```



[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648010195



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java
##########
@@ -130,30 +124,61 @@ protected void process(Pipeline pipeline) {
       SCMSafeModeManager.getLogger().info(
           "SCM in safe mode. Healthy pipelines reported count is {}, " +
               "required healthy pipeline reported count is {}",
-          currentHealthyPipelineCount, healthyPipelineThresholdCount);
+          currentHealthyPipelineCount, getHealthyPipelineThresholdCount());
+
     }
   }
 
+
+  public synchronized void refresh() {
+    if (!validate()) {
+      initializeRule(true);
+    }
+  }
+
+  private synchronized void initializeRule(boolean refresh) {
+    int pipelineCount = pipelineManager.getPipelines(
+        new RatisReplicationConfig(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN).size();
+
+    healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
+        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));
+
+    if (refresh) {
+      LOG.info("Refreshed total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    } else {
+      LOG.info("Total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    }
+
+    getSafeModeMetrics().setNumHealthyPipelinesThreshold(
+        healthyPipelineThresholdCount);
+  }
+
+
   @Override
-  protected void cleanup() {
+  protected synchronized void cleanup() {
     processedPipelineIDs.clear();
   }
 
   @VisibleForTesting
-  public int getCurrentHealthyPipelineCount() {
+  public synchronized int getCurrentHealthyPipelineCount() {

Review comment:
       To avoid contention between refresh and process reports.
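
A stripped-down sketch of the point being made: if `refresh()` rewrites the threshold while `process()` updates the counter on another thread, guarding all accessors with the same monitor keeps every reader on a consistent pair of values. `PipelineRuleSketch` is a hypothetical stand-in and omits everything else the real rule does.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the rule; only the locking shape is the point. */
final class PipelineRuleSketch {
  private final Set<String> processedPipelineIds = new HashSet<>();
  private int currentHealthyPipelineCount;
  private int thresholdCount;

  PipelineRuleSketch(int initialThreshold) {
    this.thresholdCount = initialThreshold;
  }

  /** Report path: counts each pipeline id at most once. */
  synchronized void process(String pipelineId) {
    if (processedPipelineIds.add(pipelineId)) {
      currentHealthyPipelineCount++;
    }
  }

  synchronized boolean validate() {
    return currentHealthyPipelineCount >= thresholdCount;
  }

  /** Refresh path: like the diff, only recompute while the rule is unmet. */
  synchronized void refresh(int newThreshold) {
    if (!validate()) { // intrinsic locks are reentrant, so this nested call is safe
      thresholdCount = newThreshold;
    }
  }

  synchronized int getCurrentHealthyPipelineCount() {
    return currentHealthyPipelineCount;
  }

  public static void main(String[] args) {
    PipelineRuleSketch rule = new PipelineRuleSketch(2);
    rule.process("pipeline-1");
    rule.process("pipeline-1"); // duplicate report, ignored
    System.out.println(rule.validate());
    rule.refresh(1);            // threshold recomputed from a fresher table
    System.out.println(rule.validate());
  }
}
```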








[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648010195



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java
##########
@@ -130,30 +124,61 @@ protected void process(Pipeline pipeline) {
       SCMSafeModeManager.getLogger().info(
           "SCM in safe mode. Healthy pipelines reported count is {}, " +
               "required healthy pipeline reported count is {}",
-          currentHealthyPipelineCount, healthyPipelineThresholdCount);
+          currentHealthyPipelineCount, getHealthyPipelineThresholdCount());
+
     }
   }
 
+
+  public synchronized void refresh() {
+    if (!validate()) {
+      initializeRule(true);
+    }
+  }
+
+  private synchronized void initializeRule(boolean refresh) {
+    int pipelineCount = pipelineManager.getPipelines(
+        new RatisReplicationConfig(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN).size();
+
+    healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
+        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));
+
+    if (refresh) {
+      LOG.info("Refreshed total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    } else {
+      LOG.info("Total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    }
+
+    getSafeModeMetrics().setNumHealthyPipelinesThreshold(
+        healthyPipelineThresholdCount);
+  }
+
+
   @Override
-  protected void cleanup() {
+  protected synchronized void cleanup() {
     processedPipelineIDs.clear();
   }
 
   @VisibleForTesting
-  public int getCurrentHealthyPipelineCount() {
+  public synchronized int getCurrentHealthyPipelineCount() {

Review comment:
       This is due to FindBugs warnings.





[GitHub] [ozone] bshashikant commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bshashikant commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648000416



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +308,23 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);
+
+    if (currentLeaderTerm.get() == term &&

Review comment:
       Can we add comments here on why the logic is necessary?

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -173,6 +187,7 @@ public void notifyNotLeader(Collection<TransactionContext> pendingEntries) {
 
     scm.getScmContext().updateLeaderAndTerm(false, 0);
     scm.getSCMServiceManager().notifyStatusChanged();
+

Review comment:
       Unintended change

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMContext.java
##########
@@ -96,6 +96,38 @@ public void updateLeaderAndTerm(boolean leader, long newTerm) {
     }
   }
 
+  /**
+   * Update isLeader flag.
+   * @param leader
+   */
+  public void updateLeader(boolean leader) {
+    lock.writeLock().lock();
+    try {

Review comment:
       This looks a bit confusing. Ideally it should be leaderReady. Let's maintain the notions of leader and leaderReady separately.
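
One way to read this suggestion is sketched below, under the assumption that "leader" means the Raft role and "leaderReady" means the leader has caught up and can serve. `ContextSketch` and its method names are hypothetical and are not the `SCMContext` API; only the write-lock pattern mirrors the diff above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; not the SCMContext API. */
final class ContextSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private boolean isLeader;
  private boolean isLeaderReady;
  private long term;

  /** Role change: losing leadership also clears readiness. */
  void updateLeaderAndTerm(boolean leader, long newTerm) {
    lock.writeLock().lock();
    try {
      isLeader = leader;
      term = newTerm;
      if (!leader) {
        isLeaderReady = false;
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Readiness is tracked separately and only makes sense for a leader. */
  void setLeaderReady() {
    lock.writeLock().lock();
    try {
      if (isLeader) {
        isLeaderReady = true;
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  boolean isLeader() {
    lock.readLock().lock();
    try {
      return isLeader;
    } finally {
      lock.readLock().unlock();
    }
  }

  boolean isLeaderReady() {
    lock.readLock().lock();
    try {
      return isLeader && isLeaderReady;
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    ContextSketch ctx = new ContextSketch();
    ctx.updateLeaderAndTerm(true, 5L);
    System.out.println(ctx.isLeader() + " " + ctx.isLeaderReady());
    ctx.setLeaderReady();
    System.out.println(ctx.isLeader() + " " + ctx.isLeaderReady());
  }
}
```

Keeping the two flags apart lets callers that only care about the role keep using `isLeader()`, while services that must wait for a caught-up leader check `isLeaderReady()`.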

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After restart ratis replay logs from last snapshot index.
+      // So if some transactions which need to be updated to DB will not be
+      // applied to DB. After a restart of SCM container/pipeline managers
+      // have setup the safemode rules with not to update DB. Due to this
+      // some time safemode rules are not validated and SCM does not exit
+      // safe mode. So, once after restart as transactions are applied, we
+      // check whether safe mode rules are validated to solve the issue of
+      // SCM not coming out of safemode.
+      if (scm.isInSafeMode()) {
+        scm.getScmSafeModeManager().refreshAndValidate();

Review comment:
       This check should happen irrespective of whether it's the leader or not, right?

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java
##########
@@ -130,30 +124,61 @@ protected void process(Pipeline pipeline) {
       SCMSafeModeManager.getLogger().info(
           "SCM in safe mode. Healthy pipelines reported count is {}, " +
               "required healthy pipeline reported count is {}",
-          currentHealthyPipelineCount, healthyPipelineThresholdCount);
+          currentHealthyPipelineCount, getHealthyPipelineThresholdCount());
+
     }
   }
 
+
+  public synchronized void refresh() {
+    if (!validate()) {
+      initializeRule(true);
+    }
+  }
+
+  private synchronized void initializeRule(boolean refresh) {
+    int pipelineCount = pipelineManager.getPipelines(
+        new RatisReplicationConfig(HddsProtos.ReplicationFactor.THREE),
+        Pipeline.PipelineState.OPEN).size();
+
+    healthyPipelineThresholdCount = Math.max(minHealthyPipelines,
+        (int) Math.ceil(healthyPipelinesPercent * pipelineCount));
+
+    if (refresh) {
+      LOG.info("Refreshed total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    } else {
+      LOG.info("Total pipeline count is {}, healthy pipeline " +
+          "threshold count is {}", pipelineCount,
+          healthyPipelineThresholdCount);
+    }
+
+    getSafeModeMetrics().setNumHealthyPipelinesThreshold(
+        healthyPipelineThresholdCount);
+  }
+
+
   @Override
-  protected void cleanup() {
+  protected synchronized void cleanup() {
     processedPipelineIDs.clear();
   }
 
   @VisibleForTesting
-  public int getCurrentHealthyPipelineCount() {
+  public synchronized int getCurrentHealthyPipelineCount() {

Review comment:
       Why do these calls need to be synchronized?





[GitHub] [ozone] bshashikant merged pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bshashikant merged pull request #2294:
URL: https://github.com/apache/ozone/pull/2294


   



[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 edited a comment on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-854422133







[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648009982



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMContext.java
##########
@@ -96,6 +96,38 @@ public void updateLeaderAndTerm(boolean leader, long newTerm) {
     }
   }
 
+  /**
+   * Update isLeader flag.
+   * @param leader
+   */
+  public void updateLeader(boolean leader) {
+    lock.writeLock().lock();
+    try {

Review comment:
       Why do we need both leader and leader ready? Actually, we need leader ready everywhere.
   Do you see any use case for both?





[GitHub] [ozone] bshashikant commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bshashikant commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r649087180



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +305,33 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);
+
+    if (currentLeaderTerm.get() == term) {
+      // On leader SCM once after it is ready, notify SCM services and also set
+      // leader ready  in SCMContext.
+      if (scm.getScmHAManager().getRatisServer().getDivision().getInfo()
+          .isLeaderReady()) {
+        scm.getScmContext().setLeaderReady();
+        scm.getSCMServiceManager().notifyStatusChanged();
+      }
+
+      // Means all transactions before this term have been applied.
+      // This means after a restart, all pending transactions have been applied.
+      // Perform
+      // 1. Refresh Safemode rules state.
+      // 2. Start DN Rpc server.
+      if (!refreshedAfterLeaderReady.get()) {
+        scm.getScmSafeModeManager().refresh();
+        scm.getDatanodeProtocolServer().start();
+
+        refreshedAfterLeaderReady.set(true);
+      }

Review comment:
       Remove the empty lines.





[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r644715708



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After restart ratis replay logs from last snapshot index.
+      // So if some transactions which need to be updated to DB will not be
+      // applied to DB. After a restart of SCM container/pipeline managers
+      // have setup the safemode rules with not to update DB. Due to this
+      // some time safemode rules are not validated and SCM does not exit
+      // safe mode. So, once after restart as transactions are applied, we
+      // check whether safe mode rules are validated to solve the issue of
+      // SCM not coming out of safemode.
+      if (scm.isInSafeMode()) {
+        scm.getScmSafeModeManager().refreshAndValidate();

Review comment:
       Looks like we still need it. Have a look at the latest update.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -80,6 +81,8 @@
   // and reinitialize().
   private DBCheckpoint installingDBCheckpoint = null;
 
+  private Daemon leaderReady;

Review comment:
       Removed this





[GitHub] [ozone] bshashikant commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bshashikant commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r649086692



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +305,33 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);

Review comment:
       We probably need an isInitialized() check here, so that none of the code below gets executed during SCM init()?





[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r644715475



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
##########
@@ -437,7 +437,7 @@ private RpcType setRpcType(RaftProperties properties) {
 
   private void setPendingRequestsLimits(RaftProperties properties) {
 
-    final int pendingRequestsByteLimit = (int)conf.getStorageSize(
+    final long pendingRequestsByteLimit = (int)conf.getStorageSize(

Review comment:
       Done





[GitHub] [ozone] bshashikant commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bshashikant commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r644515940



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After restart ratis replay logs from last snapshot index.
+      // So if some transactions which need to be updated to DB will not be
+      // applied to DB. After a restart of SCM container/pipeline managers
+      // have setup the safemode rules with not to update DB. Due to this
+      // some time safemode rules are not validated and SCM does not exit
+      // safe mode. So, once after restart as transactions are applied, we
+      // check whether safe mode rules are validated to solve the issue of
+      // SCM not coming out of safemode.
+      if (scm.isInSafeMode()) {
+        scm.getScmSafeModeManager().refreshAndValidate();

Review comment:
       Probably, this code will go away in the next patch.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After restart ratis replay logs from last snapshot index.
+      // So if some transactions which need to be updated to DB will not be
+      // applied to DB. After a restart of SCM container/pipeline managers
+      // have setup the safemode rules with not to update DB. Due to this

Review comment:
       The comment is not clear here.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -80,6 +81,8 @@
   // and reinitialize().
   private DBCheckpoint installingDBCheckpoint = null;
 
+  private Daemon leaderReady;

Review comment:
       leaderReady -> leaderReadyDetector

##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
##########
@@ -437,7 +437,7 @@ private RpcType setRpcType(RaftProperties properties) {
 
   private void setPendingRequestsLimits(RaftProperties properties) {
 
-    final int pendingRequestsByteLimit = (int)conf.getStorageSize(
+    final long pendingRequestsByteLimit = (int)conf.getStorageSize(

Review comment:
       The change is not related. Let's revert this in this jira.
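
Independently of whether the change belongs in this jira, note that widening only the declared type does not change the value while the `(int)` cast stays in place. The sketch below illustrates this plain Java behavior; the class and method names are hypothetical, and it assumes the configured size arrives as a `double` (the hunk is truncated, so the rest of the statement is not shown here).

```java
// Sketch of Java narrowing semantics; names are hypothetical, not Ozone code.
final class NarrowingCastSketch {

  private NarrowingCastSketch() { }

  // Same shape as the "+" line in the hunk: the value is narrowed to int
  // first and only then widened back to long on assignment.
  static long declaredLongButCastToInt(double sizeInBytes) {
    final long limit = (int) sizeInBytes;
    return limit;
  }

  // Casting to long keeps values above Integer.MAX_VALUE intact.
  static long castToLong(double sizeInBytes) {
    return (long) sizeInBytes;
  }

  public static void main(String[] args) {
    double threeGiB = 3.0 * 1024 * 1024 * 1024;
    System.out.println(declaredLongButCastToInt(threeGiB));
    System.out.println(castToLong(threeGiB));
  }
}
```

A `double` above `Integer.MAX_VALUE` saturates at `Integer.MAX_VALUE` when cast to `int`, whereas a `long` of the same magnitude would wrap around instead.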

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -80,6 +81,8 @@
   // and reinitialize().
   private DBCheckpoint installingDBCheckpoint = null;
 
+  private Daemon leaderReady;

Review comment:
       Please add some doc on what this thread is supposed to do.

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After restart ratis replay logs from last snapshot index.
+      // So if some transactions which need to be updated to DB will not be
+      // applied to DB. After a restart of SCM container/pipeline managers
+      // have setup the safemode rules with not to update DB. Due to this

Review comment:
       It's better to mention that safeMode rules might work with stale data.





[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r649282318



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +305,33 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);

Review comment:
       Done

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +305,33 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);
+
+    if (currentLeaderTerm.get() == term) {
+      // On leader SCM once after it is ready, notify SCM services and also set
+      // leader ready  in SCMContext.
+      if (scm.getScmHAManager().getRatisServer().getDivision().getInfo()
+          .isLeaderReady()) {
+        scm.getScmContext().setLeaderReady();
+        scm.getSCMServiceManager().notifyStatusChanged();
+      }
+
+      // Means all transactions before this term have been applied.
+      // This means after a restart, all pending transactions have been applied.
+      // Perform
+      // 1. Refresh Safemode rules state.
+      // 2. Start DN Rpc server.
+      if (!refreshedAfterLeaderReady.get()) {
+        scm.getScmSafeModeManager().refresh();
+        scm.getDatanodeProtocolServer().start();
+
+        refreshedAfterLeaderReady.set(true);
+      }

Review comment:
       Done





[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648882009



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -297,6 +308,23 @@ public void notifyTermIndexUpdated(long term, long index) {
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);
+
+    if (currentLeaderTerm.get() == term &&

Review comment:
       Done






[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 edited a comment on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-852885323


   With this approach we see an issue.
   
   Before the restart, 2 pipelines were closed, and let's say SCM removed them and created a new pipeline. But the SCM pipeline table still has the old 2 pipelines, because the remove/create pipeline transactions were not persisted to the DB when SCM was force killed.
   
   As we call refresh and validate on each applyTransaction, the safe mode pipeline rules are validated as soon as the 2nd pipeline remove transaction is applied, without waiting for all the pending transactions. In this case SCM comes out of safe mode early.
   
   This causes problems: reads/writes can fail even after SCM is out of safe mode.
   ```
   2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
   2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
   2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 0, pipeline's with at least one datanode reported threshold count is 0
   ```
   
   After an offline discussion with @bshashikant:
   1. We thought we should refresh the SCM safe mode rules once, after leader ready, on all SCMs.
   2. And start the DN RPC port only after leader ready, so that SCM does not come out of safe mode early based on a DB that is not up to date.
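
The two points above amount to a one-shot action gated on the term. A minimal sketch, under stated assumptions: the class is hypothetical, the two `Runnable`s stand in for the safe mode refresh and the datanode RPC server start, and `compareAndSet` is used here instead of the separate `get()` / `set(true)` calls seen in the `notifyTermIndexUpdated` hunk quoted elsewhere in this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a "run once after the term matches" gate.
// It does not use any Ozone or Ratis classes.
final class LeaderReadyRefreshSketch {

  private final AtomicBoolean refreshedAfterLeaderReady =
      new AtomicBoolean(false);
  private final Runnable refreshSafeModeRules;
  private final Runnable startDatanodeRpcServer;

  LeaderReadyRefreshSketch(Runnable refreshSafeModeRules,
      Runnable startDatanodeRpcServer) {
    this.refreshSafeModeRules = refreshSafeModeRules;
    this.startDatanodeRpcServer = startDatanodeRpcServer;
  }

  // Returns true only for the single call that actually ran the actions.
  boolean onTermIndexUpdated(long currentLeaderTerm, long term) {
    if (currentLeaderTerm != term) {
      // Entries from earlier terms are still being applied.
      return false;
    }
    if (!refreshedAfterLeaderReady.compareAndSet(false, true)) {
      // The refresh already happened on an earlier call.
      return false;
    }
    refreshSafeModeRules.run();
    startDatanodeRpcServer.run();
    return true;
  }

  public static void main(String[] args) {
    LeaderReadyRefreshSketch sketch = new LeaderReadyRefreshSketch(
        () -> System.out.println("refresh safe mode rules"),
        () -> System.out.println("start datanode RPC server"));
    sketch.onTermIndexUpdated(2L, 1L); // older term: nothing runs
    sketch.onTermIndexUpdated(2L, 2L); // first match: both actions run
    sketch.onTermIndexUpdated(2L, 2L); // later matches: nothing runs
  }
}
```

Starting the datanode RPC server inside the same gate means no datanode report can reach the safe mode rules before they have been refreshed against the replayed state.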
   



[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r648882179



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMContext.java
##########
@@ -96,6 +96,38 @@ public void updateLeaderAndTerm(boolean leader, long newTerm) {
     }
   }
 
+  /**
+   * Update isLeader flag.
+   * @param leader
+   */
+  public void updateLeader(boolean leader) {
+    lock.writeLock().lock();
+    try {

Review comment:
       Updated to fix this confusion and also added 2 APIs.





[GitHub] [ozone] vivekratnavel commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

vivekratnavel commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r695259652



##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -286,17 +294,49 @@ public long takeSnapshot() throws IOException {
 
   @Override
   public void notifyTermIndexUpdated(long term, long index) {
-    if (transactionBuffer != null) {
-      transactionBuffer.updateLatestTrxInfo(
-          TransactionInfo.builder().setCurrentTerm(term)
-              .setTransactionIndex(index).build());
-    }
+
     // We need to call updateLastApplied here because now in ratis when a
     // node becomes leader, it is checking stateMachineIndex >=
     // placeHolderIndex (when a node becomes leader, it writes a conf entry
     // with some information like its peers and termIndex). So, calling
     // updateLastApplied updates lastAppliedTermIndex.
     updateLastAppliedTermIndex(term, index);
+
+    // Skip below part if state machine is not initialized.
+
+    if (!isInitialized) {
+      return;
+    }
+
+    if (transactionBuffer != null) {
+      transactionBuffer.updateLatestTrxInfo(
+          TransactionInfo.builder().setCurrentTerm(term)
+              .setTransactionIndex(index).build());
+    }
+
+    if (currentLeaderTerm.get() == term) {
+      // On leader SCM once after it is ready, notify SCM services and also set
+      // leader ready  in SCMContext.
+      if (scm.getScmHAManager().getRatisServer().getDivision().getInfo()
+          .isLeaderReady()) {
+        scm.getScmContext().setLeaderReady();
+        scm.getSCMServiceManager().notifyStatusChanged();
+      }
+
+      // Means all transactions before this term have been applied.
+      // This means after a restart, all pending transactions have been applied.
+      // Perform
+      // 1. Refresh Safemode rules state.
+      // 2. Start DN Rpc server.
+      if (!refreshedAfterLeaderReady.get()) {
+        scm.getScmSafeModeManager().refresh();
+        LOG.info("bharat starting from sm");

Review comment:
       @bharatviswa504 I just noticed that this line got committed by mistake. Can we rectify this in a separate jira or combine it in your next patch?





[GitHub] [ozone] bharatviswa504 commented on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-854422133


   Tested this on a cluster
   with the scenario: close pipeline, scrubber removes it and creates a new pipeline.
   
   Once the leader is ready, the rules are refreshed and the datanodes are registered, the rules are successfully validated. (Unlike before, the pipeline rules are not validated right after the remove pipeline transaction.)
   
   2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
   2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
   
   2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:31,146 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:32,009 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:32,059 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,841 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,888 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,921 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,331 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 1, required healthy pipeline reported count is 1



[GitHub] [ozone] bharatviswa504 commented on a change in pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 commented on a change in pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#discussion_r644715475



##########
File path: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
##########
@@ -437,7 +437,7 @@ private RpcType setRpcType(RaftProperties properties) {
 
   private void setPendingRequestsLimits(RaftProperties properties) {
 
-    final int pendingRequestsByteLimit = (int)conf.getStorageSize(
+    final long pendingRequestsByteLimit = (int)conf.getStorageSize(

Review comment:
       Done

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -132,6 +135,17 @@ public void initialize(RaftServer server, RaftGroupId id,
       final SCMRatisRequest request = SCMRatisRequest.decode(
           Message.valueOf(trx.getStateMachineLogEntry().getLogData()));
       applyTransactionFuture.complete(process(request));
+      // After a restart, Ratis replays the log from the last snapshot index,
+      // so transactions that had not reached the DB before shutdown are only
+      // applied now. The container/pipeline managers, however, set up the
+      // safe mode rules at startup from a DB state without those updates.
+      // Because of this, the safe mode rules are sometimes never validated
+      // and SCM does not exit safe mode. So, as transactions are applied
+      // after a restart, we refresh and re-validate the safe mode rules to
+      // solve the issue of SCM not coming out of safe mode.
+      if (scm.isInSafeMode()) {
+        scm.getScmSafeModeManager().refreshAndValidate();

Review comment:
       Looks like we still need it. Have a look at the latest update.
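
To illustrate why the refresh in the hunk above matters, here is a minimal, self-contained sketch. It is not the Ozone implementation: `HealthyPipelineRuleSketch`, `SafeModeRefreshDemo`, and the pipeline ids `p1` to `p3` are hypothetical, and the real rules use configurable thresholds rather than requiring every pipeline to report. It assumes the rule is built from the pipelines read from the DB at startup, and that log replay then removes the old pipelines and creates a new one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a pipeline safe mode rule; not the Ozone class. */
final class HealthyPipelineRuleSketch {
  // Pipelines the rule is waiting for (assumption: all must report).
  private Set<String> expected;
  private final Set<String> reported = new HashSet<>();

  HealthyPipelineRuleSketch(Set<String> pipelinesFromDb) {
    this.expected = new HashSet<>(pipelinesFromDb);
  }

  /** Rebuild the expectation from the current state, e.g. after log replay. */
  void refresh(Set<String> currentPipelines) {
    expected = new HashSet<>(currentPipelines);
    reported.retainAll(expected);
  }

  /** Count a report only if it is for a pipeline the rule knows about. */
  void onReport(String pipelineId) {
    if (expected.contains(pipelineId)) {
      reported.add(pipelineId);
    }
  }

  boolean validate() {
    return reported.size() >= expected.size();
  }
}

final class SafeModeRefreshDemo {
  /** Returns whether the rule validates once the new pipeline reports. */
  static boolean run(boolean refreshAfterReplay) {
    // State read from the DB at startup: two old pipelines.
    Set<String> pipelines = new HashSet<>(Arrays.asList("p1", "p2"));
    HealthyPipelineRuleSketch rule = new HealthyPipelineRuleSketch(pipelines);

    // Log replay applies the unpersisted remove/create transactions.
    pipelines.clear();
    pipelines.add("p3");

    if (refreshAfterReplay) {
      rule.refresh(pipelines);
    }
    rule.onReport("p3"); // datanodes report the pipeline that actually exists
    return rule.validate();
  }

  public static void main(String[] args) {
    System.out.println("without refresh: " + run(false));
    System.out.println("with refresh: " + run(true));
  }
}
```

In this sketch, the rule without a refresh keeps waiting for `p1` and `p2`, which no datanode will ever report, so it never validates. With the refresh it expects only `p3` and validates as soon as that report arrives.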

##########
File path: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
##########
@@ -80,6 +81,8 @@
   // and reinitialize().
   private DBCheckpoint installingDBCheckpoint = null;
 
+  private Daemon leaderReady;

Review comment:
       Removed this





[GitHub] [ozone] bharatviswa504 edited a comment on pull request #2294: HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

Posted by GitBox <gi...@apache.org>.

bharatviswa504 edited a comment on pull request #2294:
URL: https://github.com/apache/ozone/pull/2294#issuecomment-854422133


   Tested this on a cluster with the following scenario: a pipeline is closed, the scrubber removes it, and a new pipeline is created.
   
   Once the leader is ready, the rules are refreshed, and the datanodes have registered, the rules are successfully validated. (Previously, after a pipeline was removed, the pipeline rules were never successfully validated.)
   
   ```
   2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
   2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
   ```
   ```
   2021-06-04 07:05:29,184 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 9642160f-49ab-4400-a66c-e7f4210f4ca0{ip: 172.27.131.64, host: bv-unsec-2.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   2021-06-04 07:05:30,038 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : b44ed911-3a93-4b83-9e81-88fe47369a84{ip: 172.27.102.9, host: bv-unsec-3.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 37d28fa1-41db-4f44-963e-45675b822884{ip: 172.27.97.68, host: bv-unsec-1.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
   
   ```
   ```
   2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:31,146 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:32,009 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:32,059 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,841 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,888 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:33,921 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,331 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 1, required healthy pipeline reported count is 1
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
   2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.
   ```

