You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "siddhantsangwan (via GitHub)" <gi...@apache.org> on 2023/09/22 12:16:58 UTC

[GitHub] [ozone] siddhantsangwan opened a new pull request, #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

siddhantsangwan opened a new pull request, #5348:
URL: https://github.com/apache/ozone/pull/5348

   ## What changes were proposed in this pull request?
   
   Legacy Replication Manager currently doesn't handle a CLOSING container when all its replicas are UNHEALTHY. RM needs to make decisions for such a container so it can move to one of the terminal states and be eligible for replication. This PR moves such a container to the QUASI_CLOSED state. This jira will fix the LegacyReplicationManager only.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-9327
   
   ## How was this patch tested?
   
   Added unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1337033128


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,21 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy) {

Review Comment:
   That does make sense, yes. I'll add a check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel merged pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel merged PR #5348:
URL: https://github.com/apache/ozone/pull/5348


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1337075543


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,21 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy) {

Review Comment:
   Updated in the latest commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "errose28 (via GitHub)" <gi...@apache.org>.
errose28 commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1337594484


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,22 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy && container.getReplicationType().equals(
+              HddsProtos.ReplicationType.RATIS)) {

Review Comment:
   I think we want a branch for EC that moves it from CLOSING to CLOSED in this case too, and a test as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "errose28 (via GitHub)" <gi...@apache.org>.
errose28 commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1337752743


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,22 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy && container.getReplicationType().equals(
+              HddsProtos.ReplicationType.RATIS)) {

Review Comment:
   Or actually, EC is not handled by the Legacy RM and the Ratis check here is just defensive, right? In that case just adding a comment explaining this might be good since it threw me off initially.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on PR #5348:
URL: https://github.com/apache/ozone/pull/5348#issuecomment-1735697898

   LGTM - Will commit tomorrow unless @errose28 has any further comments before then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1338378150


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,22 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy && container.getReplicationType().equals(
+              HddsProtos.ReplicationType.RATIS)) {

Review Comment:
   > EC is not handled by the Legacy RM
   You're right. A comment about EC containers elsewhere in that method threw me off and I didn't realise we don't even let EC containers be processed in legacy. I've removed that check entirely to avoid confusion.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,22 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy && container.getReplicationType().equals(
+              HddsProtos.ReplicationType.RATIS)) {

Review Comment:
   > EC is not handled by the Legacy RM
   
   You're right. A comment about EC containers elsewhere in that method threw me off and I didn't realise we don't even let EC containers be processed in legacy. I've removed that check entirely to avoid confusion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] errose28 commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "errose28 (via GitHub)" <gi...@apache.org>.
errose28 commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1334914989


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,20 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy) {
+            /* If we get here, then this container has replicas and all are
+            UNHEALTHY. Move it from CLOSING to QUASI_CLOSED so RM can then try
+            to maintain replication factor number of replicas.
+            */
+            containerManager.updateContainerState(container.containerID(),
+                HddsProtos.LifeCycleEvent.QUASI_CLOSE);
+            LOG.debug("Moved container {} to QUASI_CLOSED because it has only" +
+                " UNHEALTHY replicas: {}.", container, replicas);

Review Comment:
   nit. to clarify log message.
   ```suggestion
               LOG.debug("Moved container {} from CLOSING to QUASI_CLOSED because it has only" +
                   " UNHEALTHY replicas: {}.", container, replicas);
   ```



##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java:
##########
@@ -381,6 +382,69 @@ public void testClosingContainer() throws IOException, TimeoutException {
       Assertions.assertEquals(1, report.getStat(LifeCycleState.CLOSING));
     }
 
+    /**

Review Comment:
   I think we should move these tests to the `UnstableReplicas` inner class. That was my attempt to group the old tests together to make this more manageable until the new RM tests came along.



##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java:
##########
@@ -381,6 +382,69 @@ public void testClosingContainer() throws IOException, TimeoutException {
       Assertions.assertEquals(1, report.getStat(LifeCycleState.CLOSING));
     }
 
+    /**
+     * A CLOSING container which has only UNHEALTHY replicas should be moved
+     * to QUASI_CLOSED state so that RM can then maintain replication factor
+     * number of replicas.
+     */
+    @Test
+    public void testClosingContainerWithOnlyUnhealthyReplicas()
+        throws IOException, InvalidStateTransitionException {
+      final ContainerInfo container = getContainer(LifeCycleState.CLOSING);
+      final ContainerID id = container.containerID();
+      containerStateManager.addContainer(container.getProtobuf());
+
+      // all replicas are UNHEALTHY
+      final Set<ContainerReplica> replicas = getReplicas(id, UNHEALTHY,
+          randomDatanodeDetails(), randomDatanodeDetails(),
+          randomDatanodeDetails());
+      for (ContainerReplica replica : replicas) {
+        containerStateManager.updateContainerReplica(id, replica);
+      }
+
+      replicationManager.processAll();
+      Mockito.verify(containerManager, times(1))
+          .updateContainerState(container.containerID(),
+              LifeCycleEvent.QUASI_CLOSE);
+
+      containerStateManager.updateContainerState(
+          container.containerID().getProtobuf(), LifeCycleEvent.QUASI_CLOSE);
+
+      replicationManager.processAll();
+      Assertions.assertEquals(1,
+          replicationManager.getContainerReport().getStat(
+              ReplicationManagerReport.HealthState.QUASI_CLOSED_STUCK));
+    }
+
+    @Test
+    public void testClosingContainerWithSomeUnhealthyReplicas()
+        throws IOException, InvalidStateTransitionException {
+      final ContainerInfo container = getContainer(LifeCycleState.CLOSING);
+      final ContainerID id = container.containerID();
+      containerStateManager.addContainer(container.getProtobuf());
+
+      // 2 UNHEALTHY, 1 OPEN
+      final Set<ContainerReplica> replicas = getReplicas(id, UNHEALTHY,
+          randomDatanodeDetails(), randomDatanodeDetails());
+      final DatanodeDetails datanode = randomDatanodeDetails();
+      replicas.addAll(getReplicas(id, State.OPEN, datanode));
+      for (ContainerReplica replica : replicas) {
+        containerStateManager.updateContainerReplica(id, replica);
+      }
+
+      final int currentCloseCommandCount = datanodeCommandHandler
+          .getInvocationCount(SCMCommandProto.Type.closeContainerCommand);
+      replicationManager.processAll();
+      eventQueue.processAll(1000);
+
+      Mockito.verify(containerManager, times(0))
+          .updateContainerState(container.containerID(),
+              LifeCycleEvent.QUASI_CLOSE);
+      Assertions.assertEquals(currentCloseCommandCount + 1,
+          datanodeCommandHandler.getInvocationCount(
+              SCMCommandProto.Type.closeContainerCommand));

Review Comment:
   Can we verify that the close command is not forced (container should go to quasi-close if pipeline is not available)? A similar check was done to test the force delete command in #5286



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5348:
URL: https://github.com/apache/ozone/pull/5348#discussion_r1337001369


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -398,7 +400,21 @@ protected void processContainer(ContainerInfo container,
            */
           if (replicas.isEmpty() && (container.getNumberOfKeys() == 0)) {
             closeEmptyContainer(container);
+            return;
+          }
+
+          if (!foundHealthy) {

Review Comment:
   I wonder if we should check the container is a RATIS container here too, or at least not EC? EC containers have no handling for Quasi-Closed, so letting an EC container go to Quasi_Closed may cause more problems.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #5348: HDDS-9327. LegacyReplicationManager: Handle all UNHEALTHY replicas of a CLOSING container

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on PR #5348:
URL: https://github.com/apache/ozone/pull/5348#issuecomment-1732940813

   @errose28 Thanks for the review. I have addressed your comments in the latest commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org