Posted to issues@ozone.apache.org by "duongkame (via GitHub)" <gi...@apache.org> on 2023/07/14 18:31:23 UTC

[GitHub] [ozone] duongkame opened a new pull request, #5068: HDDS-9020. Datanodes fails to start up when secret key has not yet been initialized in SCM.

duongkame opened a new pull request, #5068:
URL: https://github.com/apache/ozone/pull/5068

   ## What changes were proposed in this pull request?
   
   When a Datanode/OM starts up before SCM has finished initializing secret keys, startup fails because the Datanode/OM must prefetch the current active secret key.
   
   ```
   2023-07-13 16:06:22,369 [main] ERROR org.apache.hadoop.ozone.HddsDatanodeService: Exception in HddsDatanodeService.
   java.lang.RuntimeException: Can't start the HDDS datanode plugin
   	at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:361)
   	at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:235)
   	at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:203)
   	at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:93)
   	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
   	at picocli.CommandLine.access$1300(CommandLine.java:145)
   	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
   	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
   	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
   	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
   	at picocli.CommandLine.execute(CommandLine.java:2078)
   	at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
   	at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
   	at org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:185)
   Caused by: org.apache.hadoop.hdds.security.exception.SCMSecretKeyException: Secret key initialization is not finished yet.
   	at org.apache.hadoop.hdds.protocolPB.SecretKeyProtocolClientSideTranslatorPB.handleError(SecretKeyProtocolClientSideTranslatorPB.java:101)
   	at org.apache.hadoop.hdds.protocolPB.SecretKeyProtocolClientSideTranslatorPB.submitRequest(SecretKeyProtocolClientSideTranslatorPB.java:89)
   	at org.apache.hadoop.hdds.protocolPB.SecretKeyProtocolClientSideTranslatorPB.getCurrentSecretKey(SecretKeyProtocolClientSideTranslatorPB.java:127)
   	at org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeySignerClient.start(DefaultSecretKeySignerClient.java:72)
   	at org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeyClient.start(DefaultSecretKeyClient.java:50)
   	at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:312)
   	... 13 more
   2023-07-13 16:06:22,382 [shutdown-hook-0] INFO org.apache.hadoop.ozone.HddsDatanodeService: SHUTDOWN_MSG: 
   /************************************************************
   SHUTDOWN_MSG: Shutting down HddsDatanodeService at ....
   ************************************************************/  
   ```
   
   The current active secret key is mandatory for Datanode/OM startup because their background processes, e.g. EC reconstruction, need it. The solution is to retry the active secret key prefetch while SCM has not yet initialized secret keys.
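
As a rough illustration of the retry idea described above, here is a minimal, self-contained sketch in plain Java. It is not the code in this PR: `InitialKeyRetrySketch`, `NotInitializedException`, `backoffMillis` and `callWithBackoff` are hypothetical names chosen for the example, and the delay is a plain doubling without any randomization. Only the "not initialized yet" error is retried; any other failure propagates immediately.

```java
import java.util.function.Supplier;

final class InitialKeyRetrySketch {

  /** Hypothetical stand-in for the "secret key not initialized yet" error. */
  static final class NotInitializedException extends RuntimeException {
    NotInitializedException(String message) {
      super(message);
    }
  }

  /** Nominal delay before retry number {@code retry} (0-based): base * 2^retry. */
  static long backoffMillis(long baseMillis, int retry) {
    // Assumes a small retry index so the shift cannot overflow a long.
    return baseMillis << retry;
  }

  /**
   * Calls the task until it succeeds. After each NotInitializedException the
   * caller sleeps with exponential backoff; once maxRetries retries have been
   * used, the last exception is rethrown. Other exceptions are not retried.
   */
  static <T> T callWithBackoff(Supplier<T> task, int maxRetries, long baseMillis) {
    for (int retry = 0; ; retry++) {
      try {
        return task.get();
      } catch (NotInitializedException e) {
        if (retry >= maxRetries) {
          throw e;
        }
        sleep(backoffMillis(baseMillis, retry));
      }
    }
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      // Restore the interrupt flag and give up waiting.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting to retry", ie);
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated fetch: fails three times, then returns a key.
    String key = callWithBackoff(() -> {
      if (++calls[0] < 4) {
        throw new NotInitializedException("Secret key initialization is not finished yet.");
      }
      return "secret-key";
    }, 10, 1);
    System.out.println(key + " after " + calls[0] + " attempts");
  }
}
```

Retrying only the specific "not initialized" condition keeps genuine failures (for example, authentication errors) fast and visible instead of hiding them behind a long wait.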
   
   ## What is the link to the Apache JIRA?
   
   https://issues.apache.org/jira/browse/HDDS-9020
   
   
   ## How was this patch tested?
   
   I added custom code to SCM to delay secret key initialization, started a Docker cluster, and verified that datanodes/OMs retry on SCM secret key exceptions until they succeed.
   
   ```
   2023-07-14 17:45:18,748 [main] INFO symmetric.DefaultSecretKeyVerifierClient: Initializing secret key cache with size 16, TTL PT168H
   2023-07-14 17:45:18,875 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 4000 ms
   2023-07-14 17:45:22,882 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 6000 ms
   2023-07-14 17:45:28,895 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 19000 ms
   2023-07-14 17:45:47,938 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 32000 ms
   2023-07-14 17:46:19,972 [main] INFO utils.RetriableTask: Execution of task getCurrentSecretKey failed, will be retried in 91000 ms
   2023-07-14 17:47:51,007 [main] INFO symmetric.DefaultSecretKeySignerClient: Initial secret key fetched from SCM: SecretKey(id = f937b502-faca-4505-a894-72c19dde972b, creation at: 2023-07-14T17:47:16.371Z, expire at: 2023-07-21T17:47:16.371Z).
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] kerneltime merged pull request #5068: HDDS-9020. Datanodes fails to start up when secret key has not yet been initialized in SCM.

Posted by "kerneltime (via GitHub)" <gi...@apache.org>.
kerneltime merged PR #5068:
URL: https://github.com/apache/ozone/pull/5068




[GitHub] [ozone] duongkame commented on a diff in pull request #5068: HDDS-9020. Datanodes fails to start up when secret key has not yet been initialized in SCM.

Posted by "duongkame (via GitHub)" <gi...@apache.org>.
duongkame commented on code in PR #5068:
URL: https://github.com/apache/ozone/pull/5068#discussion_r1264704956


##########
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/security/symmetric/DefaultSecretKeySignerClient.java:
##########
@@ -68,13 +74,46 @@ public void refetchSecretKey() {
 
   @Override
   public void start(ConfigurationSource conf) throws IOException {
-    final ManagedSecretKey initialKey =
-        secretKeyProtocol.getCurrentSecretKey();
+    final ManagedSecretKey initialKey = loadInitialSecretKey();

Review Comment:
   This path is reachable only when block tokens are enabled.





[GitHub] [ozone] kerneltime commented on a diff in pull request #5068: HDDS-9020. Datanodes fails to start up when secret key has not yet been initialized in SCM.

Posted by "kerneltime (via GitHub)" <gi...@apache.org>.
kerneltime commented on code in PR #5068:
URL: https://github.com/apache/ozone/pull/5068#discussion_r1264261226


##########
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/security/symmetric/DefaultSecretKeySignerClient.java:
##########
@@ -68,13 +74,46 @@ public void refetchSecretKey() {
 
   @Override
   public void start(ConfigurationSource conf) throws IOException {
-    final ManagedSecretKey initialKey =
-        secretKeyProtocol.getCurrentSecretKey();
+    final ManagedSecretKey initialKey = loadInitialSecretKey();
+
     LOG.info("Initial secret key fetched from SCM: {}.", initialKey);
     cache.set(initialKey);
     scheduleSecretKeyPoller(conf, initialKey.getCreationTime());
   }
 
+  private ManagedSecretKey loadInitialSecretKey() throws IOException {
+    // Load initial active secret key from SCM, retries with exponential
+    // backoff when SCM has not initialized secret keys yet.
+
+    // Exponential backoff policy, 10 retries/1s will give maximum wait time
+    // around 10 min (2^9 = 512s).
+    int maxRetries = 10;

Review Comment:
   Process state != initialized. In a secure cluster, a Datanode can stay up indefinitely while it waits for SCM to be initialized and return the secret keys; the process need not stop.



##########
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/security/symmetric/DefaultSecretKeySignerClient.java:
##########
@@ -68,13 +74,46 @@ public void refetchSecretKey() {
 
   @Override
   public void start(ConfigurationSource conf) throws IOException {
-    final ManagedSecretKey initialKey =
-        secretKeyProtocol.getCurrentSecretKey();
+    final ManagedSecretKey initialKey = loadInitialSecretKey();

Review Comment:
   What should happen if block tokens are not enabled? Maybe Datanode should not wait for SCM to start up.





[GitHub] [ozone] kerneltime commented on a diff in pull request #5068: HDDS-9020. Datanodes fails to start up when secret key has not yet been initialized in SCM.

Posted by "kerneltime (via GitHub)" <gi...@apache.org>.
kerneltime commented on code in PR #5068:
URL: https://github.com/apache/ozone/pull/5068#discussion_r1264261226


##########
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/security/symmetric/DefaultSecretKeySignerClient.java:
##########
@@ -68,13 +74,46 @@ public void refetchSecretKey() {
 
   @Override
   public void start(ConfigurationSource conf) throws IOException {
-    final ManagedSecretKey initialKey =
-        secretKeyProtocol.getCurrentSecretKey();
+    final ManagedSecretKey initialKey = loadInitialSecretKey();
+
     LOG.info("Initial secret key fetched from SCM: {}.", initialKey);
     cache.set(initialKey);
     scheduleSecretKeyPoller(conf, initialKey.getCreationTime());
   }
 
+  private ManagedSecretKey loadInitialSecretKey() throws IOException {
+    // Load initial active secret key from SCM, retries with exponential
+    // backoff when SCM has not initialized secret keys yet.
+
+    // Exponential backoff policy, 10 retries/1s will give maximum wait time
+    // around 10 min (2^9 = 512s).
+    int maxRetries = 10;

Review Comment:
   Process state != isInitialized(). In a secure cluster, a Datanode can stay up indefinitely while it waits for SCM to be initialized and return the secret keys; the process need not stop.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] duongkame commented on a diff in pull request #5068: HDDS-9020. Datanodes fails to start up when secret key has not yet been initialized in SCM.

Posted by "duongkame (via GitHub)" <gi...@apache.org>.
duongkame commented on code in PR #5068:
URL: https://github.com/apache/ozone/pull/5068#discussion_r1264705169


##########
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/security/symmetric/DefaultSecretKeySignerClient.java:
##########
@@ -68,13 +74,46 @@ public void refetchSecretKey() {
 
   @Override
   public void start(ConfigurationSource conf) throws IOException {
-    final ManagedSecretKey initialKey =
-        secretKeyProtocol.getCurrentSecretKey();
+    final ManagedSecretKey initialKey = loadInitialSecretKey();
+
     LOG.info("Initial secret key fetched from SCM: {}.", initialKey);
     cache.set(initialKey);
     scheduleSecretKeyPoller(conf, initialKey.getCreationTime());
   }
 
+  private ManagedSecretKey loadInitialSecretKey() throws IOException {
+    // Load initial active secret key from SCM, retries with exponential
+    // backoff when SCM has not initialized secret keys yet.
+
+    // Exponential backoff policy, 10 retries/1s will give maximum wait time
+    // around 10 min (2^9 = 512s).
+    int maxRetries = 10;

Review Comment:
   Increased the max retries to 100. The exponential backoff wait time repeats after 10 retries, so the total maximum wait time is around 3h.
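
The "around 3h" figure can be checked with a quick calculation. The sketch below assumes one reading of "repeats after 10": the nominal delay starts at 1s, doubles on each retry, and the sequence starts over every 10 retries. `BackoffBudgetSketch` and `totalWaitSeconds` are hypothetical names for this example, and any randomization of the delays is ignored (the retry intervals in the test log above are not exact powers of two, so real totals will vary).

```java
final class BackoffBudgetSketch {

  /**
   * Sums the nominal delays for maxRetries retries, where retry i waits
   * baseSeconds * 2^(i mod cycle) seconds.
   */
  static long totalWaitSeconds(int maxRetries, int cycle, long baseSeconds) {
    long total = 0;
    for (int i = 0; i < maxRetries; i++) {
      total += baseSeconds << (i % cycle);
    }
    return total;
  }

  public static void main(String[] args) {
    // One cycle: 1 + 2 + 4 + ... + 512 seconds.
    long oneCycle = totalWaitSeconds(10, 10, 1);
    // 100 retries: ten such cycles back to back.
    long hundred = totalWaitSeconds(100, 10, 1);
    System.out.printf("10 retries: %d s (%.1f min)%n", oneCycle, oneCycle / 60.0);
    System.out.printf("100 retries: %d s (%.2f h)%n", hundred, hundred / 3600.0);
  }
}
```

Under these assumptions one cycle sums to 1023s (the largest single wait being 2^9 = 512s), and ten cycles give 10230s, roughly 2.8h, which is consistent with the estimate above.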


