You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "sumitagrawl (via GitHub)" <gi...@apache.org> on 2023/09/12 08:49:03 UTC

[GitHub] [ozone] sumitagrawl opened a new pull request, #5277: HDDS-9122. Container stuck in QUASI_CLOSED state causing re-replication failure

sumitagrawl opened a new pull request, #5277:
URL: https://github.com/apache/ozone/pull/5277

   ## What changes were proposed in this pull request?
   
   - close container command is executed async way
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-9122
   
   ## How was this patch tested?
   
   Existing integration testcase cover the close container behavior
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel merged pull request #5277: HDDS-9122. Make the Datanode CloseContainerCommandHandler async by queuing commands in an executor

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel merged PR #5277:
URL: https://github.com/apache/ozone/pull/5277


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #5277: HDDS-9122. Container stuck in QUASI_CLOSED state causing re-replication failure

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5277:
URL: https://github.com/apache/ozone/pull/5277#discussion_r1324780087


##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##########
@@ -130,6 +133,19 @@ public class DatanodeConfiguration {
   )
   private int containerDeleteThreads = CONTAINER_DELETE_THREADS_DEFAULT;
 
+  /**
+   * The maximum number of threads used to close containers on a datanode
+   * simultaneously.
+   */
+  @Config(key = "container.close.threads.max",
+      type = ConfigType.INT,
+      defaultValue = "3",
+      tags = {DATANODE},
+      description = "The maximum number of threads used to delete containers " +

Review Comment:
   Typo - Delete -> Close



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #5277: HDDS-9122. Container stuck in QUASI_CLOSED state causing re-replication failure

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5277:
URL: https://github.com/apache/ozone/pull/5277#discussion_r1324793118


##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CloseContainerCommandHandler.java:
##########
@@ -69,73 +85,76 @@ public CloseContainerCommandHandler() {
   @Override
   public void handle(SCMCommand command, OzoneContainer ozoneContainer,
       StateContext context, SCMConnectionManager connectionManager) {
-    invocationCount.incrementAndGet();
-    final long startTime = Time.monotonicNow();
-    final DatanodeDetails datanodeDetails = context.getParent()
-        .getDatanodeDetails();
-    final CloseContainerCommandProto closeCommand =
-        ((CloseContainerCommand)command).getProto();
-    final ContainerController controller = ozoneContainer.getController();
-    final long containerId = closeCommand.getContainerID();
-    LOG.debug("Processing Close Container command container #{}",
-        containerId);
-    try {
-      final Container container = controller.getContainer(containerId);
-
-      if (container == null) {
-        LOG.error("Container #{} does not exist in datanode. "
-            + "Container close failed.", containerId);
-        return;
-      }
-
-      // move the container to CLOSING if in OPEN state
-      controller.markContainerForClose(containerId);
-
-      switch (container.getContainerState()) {
-      case OPEN:
-      case CLOSING:
-        // If the container is part of open pipeline, close it via write channel
-        if (ozoneContainer.getWriteChannel()
-            .isExist(closeCommand.getPipelineID())) {
-          ContainerCommandRequestProto request =
-              getContainerCommandRequestProto(datanodeDetails,
-                  closeCommand.getContainerID(),
-                  command.getEncodedToken());
-          ozoneContainer.getWriteChannel()
-              .submitRequest(request, closeCommand.getPipelineID());
-        } else if (closeCommand.getForce()) {
-          // Non-RATIS containers should have the force close flag set, so they
-          // are moved to CLOSED immediately rather than going to quasi-closed.
-          controller.closeContainer(containerId);
-        } else {
-          controller.quasiCloseContainer(containerId,
-              "Ratis pipeline does not exist");
-          LOG.info("Marking Container {} quasi closed", containerId);
+    queuedCount.incrementAndGet();
+    CompletableFuture.runAsync(() -> {
+      invocationCount.incrementAndGet();
+      final long startTime = Time.monotonicNow();
+      final DatanodeDetails datanodeDetails = context.getParent()
+          .getDatanodeDetails();
+      final CloseContainerCommandProto closeCommand =
+          ((CloseContainerCommand) command).getProto();
+      final ContainerController controller = ozoneContainer.getController();
+      final long containerId = closeCommand.getContainerID();
+      LOG.debug("Processing Close Container command container #{}",
+          containerId);
+      try {
+        final Container container = controller.getContainer(containerId);
+
+        if (container == null) {
+          LOG.error("Container #{} does not exist in datanode. "
+              + "Container close failed.", containerId);
+          return;
         }
-        break;
-      case QUASI_CLOSED:
-        if (closeCommand.getForce()) {
-          controller.closeContainer(containerId);
+
+        // move the container to CLOSING if in OPEN state
+        controller.markContainerForClose(containerId);
+
+        switch (container.getContainerState()) {
+        case OPEN:
+        case CLOSING:
+          // If container is part of open pipeline, close it via write channel

Review Comment:
   Nit: Removing "the" from the sentence, makes it incorrect. The sentence should read:
   
   > If the container is part of **an** open pipeline, close it via write channel
   
   The typo is the missing "an" if you want to fix it but removing "the" is not correct



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on pull request #5277: HDDS-9122. Container stuck in QUASI_CLOSED state causing re-replication failure

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on PR #5277:
URL: https://github.com/apache/ozone/pull/5277#issuecomment-1716975156

   > Please add an outline of the problem, the symptoms, and how the solution addresses it to the PR description.
   > 
   > I think I had looked briefly at this problem before - it was not clear to me that the close container commands were so slow it was affecting things, especially since the problem was detected on a cluster with a very small number of containers. Were you able to prove the slow container closes were the root cause of the problem via debug logging, jstacks etc?
   > 
   > It does make sense to make this process async and not be able to block the command handler thread, but I am interested to know if this is the root cause of the problem or is it possible that some other command is blocking the queue preventing the container close commands from getting executed?
   
   Updated...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #5277: HDDS-9122. Container stuck in QUASI_CLOSED state causing re-replication failure

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5277:
URL: https://github.com/apache/ozone/pull/5277#discussion_r1324793718


##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/CloseContainerCommandHandler.java:
##########
@@ -69,73 +85,76 @@ public CloseContainerCommandHandler() {
   @Override
   public void handle(SCMCommand command, OzoneContainer ozoneContainer,
       StateContext context, SCMConnectionManager connectionManager) {
-    invocationCount.incrementAndGet();
-    final long startTime = Time.monotonicNow();
-    final DatanodeDetails datanodeDetails = context.getParent()
-        .getDatanodeDetails();
-    final CloseContainerCommandProto closeCommand =
-        ((CloseContainerCommand)command).getProto();
-    final ContainerController controller = ozoneContainer.getController();
-    final long containerId = closeCommand.getContainerID();
-    LOG.debug("Processing Close Container command container #{}",
-        containerId);
-    try {
-      final Container container = controller.getContainer(containerId);
-
-      if (container == null) {
-        LOG.error("Container #{} does not exist in datanode. "
-            + "Container close failed.", containerId);
-        return;
-      }
-
-      // move the container to CLOSING if in OPEN state
-      controller.markContainerForClose(containerId);
-
-      switch (container.getContainerState()) {
-      case OPEN:
-      case CLOSING:
-        // If the container is part of open pipeline, close it via write channel
-        if (ozoneContainer.getWriteChannel()
-            .isExist(closeCommand.getPipelineID())) {
-          ContainerCommandRequestProto request =
-              getContainerCommandRequestProto(datanodeDetails,
-                  closeCommand.getContainerID(),
-                  command.getEncodedToken());
-          ozoneContainer.getWriteChannel()
-              .submitRequest(request, closeCommand.getPipelineID());
-        } else if (closeCommand.getForce()) {
-          // Non-RATIS containers should have the force close flag set, so they
-          // are moved to CLOSED immediately rather than going to quasi-closed.
-          controller.closeContainer(containerId);
-        } else {
-          controller.quasiCloseContainer(containerId,
-              "Ratis pipeline does not exist");
-          LOG.info("Marking Container {} quasi closed", containerId);
+    queuedCount.incrementAndGet();
+    CompletableFuture.runAsync(() -> {
+      invocationCount.incrementAndGet();
+      final long startTime = Time.monotonicNow();
+      final DatanodeDetails datanodeDetails = context.getParent()
+          .getDatanodeDetails();
+      final CloseContainerCommandProto closeCommand =
+          ((CloseContainerCommand) command).getProto();
+      final ContainerController controller = ozoneContainer.getController();
+      final long containerId = closeCommand.getContainerID();
+      LOG.debug("Processing Close Container command container #{}",
+          containerId);
+      try {
+        final Container container = controller.getContainer(containerId);
+
+        if (container == null) {
+          LOG.error("Container #{} does not exist in datanode. "
+              + "Container close failed.", containerId);
+          return;
         }
-        break;
-      case QUASI_CLOSED:
-        if (closeCommand.getForce()) {
-          controller.closeContainer(containerId);
+
+        // move the container to CLOSING if in OPEN state
+        controller.markContainerForClose(containerId);
+
+        switch (container.getContainerState()) {
+        case OPEN:
+        case CLOSING:
+          // If container is part of open pipeline, close it via write channel
+          if (ozoneContainer.getWriteChannel()
+              .isExist(closeCommand.getPipelineID())) {
+            ContainerCommandRequestProto request =
+                getContainerCommandRequestProto(datanodeDetails,
+                    closeCommand.getContainerID(),
+                    command.getEncodedToken());
+            ozoneContainer.getWriteChannel()
+                .submitRequest(request, closeCommand.getPipelineID());
+          } else if (closeCommand.getForce()) {
+            // Non-RATIS containers should have force close flag set, so they

Review Comment:
   Nit: Same as above - we should not remove "the" and "than"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #5277: HDDS-9122. Container stuck in QUASI_CLOSED state causing re-replication failure

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on PR #5277:
URL: https://github.com/apache/ozone/pull/5277#issuecomment-1715466619

   Please add an outline of the problem, the symptoms, and how the solution addresses it to the PR description.
   
   I think I had looked briefly at this problem before - it was not clear to me that the close container commands were so slow it was affecting things, especially since the problem was detected on a cluster with a very small number of containers. Were you able to prove the slow container closes were the root cause of the problem via debug logging, jstacks etc?
   
   It does make sense to make this process async and not be able to block the command handler thread, but I am interested to know if this is the root cause of the problem or is it possible that some other command is blocking the queue preventing the container close commands from getting executed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org