You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "siddhantsangwan (via GitHub)" <gi...@apache.org> on 2023/11/27 11:16:25 UTC

[PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

siddhantsangwan opened a new pull request, #5678:
URL: https://github.com/apache/ozone/pull/5678

   ## What changes were proposed in this pull request?
   When all replicas of a `CLOSED` container are `UNHEALTHY` and more than replication factor, the container is reported as over replicated but not queued to the over replication queue. This jira will fix this and also define clear and separate responsibilities for `RatisReplicationCheckHandler` and `RatisUnhealthyReplicationCheckHandler` to avoid confusion.
   
   `RatisReplicationCheckHandler`: Checks replication for typical over replicated cases, such as 4 `CLOSED` replicas. And also for an excess of unhealthy replicas when we have enough matching replicas, such as 3 `CLOSED` and 1 `UNHEALTHY` or 3 `CLOSED` and 1 `QUASI_CLOSED` with incorrect sequence ID.
   
   `RatisUnhealthyReplicationCheckHandler`: Checks replication when only `UNHEALTHY` replicas are present or when the container is closed and only `QUASI_CLOSED` replicas with incorrect sequence IDs are present.
   
   This problem only affects the new `ReplicationManager`. The majority of changes and additions are in tests.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-9763
   
   ## How was this patch tested?
   Added unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1406106400


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java:
##########
@@ -566,8 +566,7 @@ public boolean isSafelyOverReplicated() {
       return false;
     }
 
-    return getMatchingReplicaCount() >= repFactor &&
-        getMatchingReplicaCount() - repFactor >= inFlightDel;
+    return getMatchingReplicaCount() >= repFactor;

Review Comment:
   I think the javadoc needs updated since you have change the criteria here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai merged PR #5678:
URL: https://github.com/apache/ozone/pull/5678


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on PR #5678:
URL: https://github.com/apache/ozone/pull/5678#issuecomment-1831215847

   The justification for deleting a `QUASI_CLOSED` replica can be that we need only preserve one replica per unique origin. On the other hand, what if delete a `QUASI_CLOSED` replica and the `UNHEALTHY` replica is severely corrupted? With 2 `QUASI_CLOSED` and 1 `UNHEALTHY`, the replication factor is then effectively 2. 
   
   It seems like we need to keep 3 copies of the healthy replica and also keep the unhealthy one. Keeping 4 replicas, however, has its own issues in small clusters...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1409472496


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisUnhealthyReplicationCheckHandler.java:
##########
@@ -366,29 +352,45 @@ public void overReplicationCheckWithQuasiClosedReplica() {
     final ContainerInfo container = ReplicationTestUtil.createContainerInfo(
         repConfig, 1, HddsProtos.LifeCycleState.CLOSED, sequenceID);
 
-    final Set<ContainerReplica> replicas = new HashSet<>(2);
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-
-    final ContainerReplica quasiClosedReplica =
-        createContainerReplica(container.containerID(), 0,
-            IN_SERVICE, State.QUASI_CLOSED, sequenceID - 1);
-    replicas.add(quasiClosedReplica);
+    Set<ContainerReplica> replicas = new HashSet<>(4);
+    for (int i = 0; i < 4; i++) {
+      replicas.add(createContainerReplica(container.containerID(), 0,
+          IN_SERVICE, State.QUASI_CLOSED, sequenceID - 1));

Review Comment:
   Here we have all replicas with a sequenceID less than the container, which is different to the original test - is this intentional?
   
   In this case, I guess we have 4 replicas all with "sequenceID - 1", so as they have the same sequence and it is less than that of the container, that is what makes the container unhealthy, right? Otherwise it would not be handled by this handler and the RatisReplicationCheckHandler would take care of it instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1410642765


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/health/RatisUnhealthyReplicationCheckHandler.java:
##########
@@ -31,18 +31,16 @@
 import static org.apache.hadoop.hdds.protocol.proto.HddsProtos.ReplicationType.RATIS;
 
 /**
- * This class handles UNHEALTHY RATIS replicas. Its responsibilities include:
- * <ul>
- *   <li>If only unhealthy replicas exist, check if there are replication
- *   factor number of replicas and queue them for under/over replication if
- *   needed.
- *   </li>
- *   <li>If there is a mixture of unhealthy and healthy containers this
- *   handler will only be called if there is no under or over replication after
- *   excluding the empty containers. If there is under or over replication the
- *   ECReplicationCheckHandler will take care of it first.
- *   </li>
- * </ul>
+ * This class handles RATIS containers which only have replicas in UNHEALTHY
+ * state, or CLOSED containers with replicas in QUASI_CLOSED state having
+ * less sequence ID. There are no other replicas. This class ensures that
+ * such containers have replication factor number of UNHEALTHY/QUASI_CLOSED
+ * replicas.
+ * <p>

Review Comment:
   Added closing tag.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java:
##########
@@ -566,8 +566,7 @@ public boolean isSafelyOverReplicated() {
       return false;
     }
 
-    return getMatchingReplicaCount() >= repFactor &&
-        getMatchingReplicaCount() - repFactor >= inFlightDel;
+    return getMatchingReplicaCount() >= repFactor;

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1410205380


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisReplicationCheckHandler.java:
##########
@@ -500,6 +501,43 @@ public void testOverReplicatedContainerWithMismatchedReplicas() {
         ReplicationManagerReport.HealthState.OVER_REPLICATED));
   }
 
+  @Test
+  public void shouldQueueForOverReplicationOnlyWhenSafe() {
+    ContainerInfo container =
+        createContainerInfo(repConfig, 1L, HddsProtos.LifeCycleState.CLOSED);
+    Set<ContainerReplica> replicas = createReplicas(container.containerID(),
+        ContainerReplicaProto.State.CLOSED, 0, 0);
+    ContainerReplica unhealthyReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.UNHEALTHY);
+    ContainerReplica mismatchedReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.QUASI_CLOSED);
+    replicas.add(mismatchedReplica);
+    replicas.add(unhealthyReplica);
+
+    requestBuilder.setContainerReplicas(replicas)
+        .setContainerInfo(container);
+
+    ContainerHealthResult.OverReplicatedHealthResult
+        result = (ContainerHealthResult.OverReplicatedHealthResult)
+        healthCheck.checkHealth(requestBuilder.build());
+
+    assertEquals(ContainerHealthResult.HealthState.OVER_REPLICATED,
+        result.getHealthState());
+    assertEquals(1, result.getExcessRedundancy());
+    assertFalse(result.isReplicatedOkAfterPending());
+
+    // not safe for over replication because we don't have 3 matching replicas
+    assertFalse(result.isSafelyOverReplicated());
+
+    assertTrue(healthCheck.handle(requestBuilder.build()));
+    assertEquals(0, repQueue.underReplicatedQueueSize());
+    assertEquals(0, repQueue.overReplicatedQueueSize());
+    assertEquals(1, report.getStat(

Review Comment:
   It would lead to questions, but I think those would be valid questions from an observability perspective? We expect the quasi closed replica to get closed soon, so there can be 3 closed replicas and the unhealthy one can be deleted. If that is not happening for a long time, then there's certainly a problem with the cluster - maybe the datanode hosting the quasi closed replica is going to be stale, the close command isn't reaching the datanode, there's some error in the datanode while processing the command etc. If not over replicated then what? How do we show there's an issue?
   
   Also I wonder if we should instead go ahead and delete the unhealthy replica without waiting for the quasi closed replica. I chose against doing this because the future improvements that we're planning to work on should make it so that the unhealthy replica does not block decommission. It also seems safer to not send any delete commands until we have 3 closed replicas. On the other hand, that unhealthy replica continues to take space on a DN if closing the quasi closed replica is blocked.
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5678:
URL: https://github.com/apache/ozone/pull/5678#issuecomment-1830188813

   @siddhantsangwan `TestReplicationManager` is passing with your code (https://github.com/siddhantsangwan/ozone/commit/757b0797df502c2000520090e35ce278662e9e2a).  However, PRs test code as it would be after being merged into `master`.  In that state `TestReplicationManager` is failing.  To reproduce, just merge `master` into your branch locally.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on PR #5678:
URL: https://github.com/apache/ozone/pull/5678#issuecomment-1831210001

   > @siddhantsangwan `TestReplicationManager` is passing with your code ([siddhantsangwan@757b079](https://github.com/siddhantsangwan/ozone/commit/757b0797df502c2000520090e35ce278662e9e2a)). However, PRs test code as it would be after being merged into `master`. In that state `TestReplicationManager` is failing. To reproduce, just merge `master` into your branch locally.
   
   @adoroszlai Yes, you're right. I was able to reproduce the failing test yesterday locally but @sodonnel and I have been discussing whether the test has the correct expectations at all. `testQuasiClosedContainerWithUnhealthyReplicaOnUniqueOrigin` sets up a `QUASI_CLOSED` container with 3 `QUASI_CLOSED` replicas with the same origin datanode and 1 `UNHEALTHY` replica on a unique origin datanode. It expects a replica other than the unique origin to get deleted. But I'm not sure if we should be deleting anything at all in this case. In fact, if one `QUASI_CLOSED` replica is deleted, RM will see the container as under replicated in the next iteration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1410186695


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisUnhealthyReplicationCheckHandler.java:
##########
@@ -366,29 +352,45 @@ public void overReplicationCheckWithQuasiClosedReplica() {
     final ContainerInfo container = ReplicationTestUtil.createContainerInfo(
         repConfig, 1, HddsProtos.LifeCycleState.CLOSED, sequenceID);
 
-    final Set<ContainerReplica> replicas = new HashSet<>(2);
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-
-    final ContainerReplica quasiClosedReplica =
-        createContainerReplica(container.containerID(), 0,
-            IN_SERVICE, State.QUASI_CLOSED, sequenceID - 1);
-    replicas.add(quasiClosedReplica);
+    Set<ContainerReplica> replicas = new HashSet<>(4);
+    for (int i = 0; i < 4; i++) {
+      replicas.add(createContainerReplica(container.containerID(), 0,
+          IN_SERVICE, State.QUASI_CLOSED, sequenceID - 1));

Review Comment:
   > Here we have all replicas with a sequenceID less than the container, which is different to the original test - is this intentional?
   
   Yes, I want to intentionally show that `RatisUnhealthyReplicationCheckHandler` only handles containers which have _all_ replicas unhealthy. Unhealthy meaning the replica is in `UNHEALTHY` state or the container is `CLOSED` with the replica in `QUASI_CLOSED` and having sequence id less than the container.
   
   > In this case, I guess we have 4 replicas all with "sequenceID - 1", so as they have the same sequence and it is less than that of the container, that is what makes the container unhealthy, right? Otherwise it would not be handled by this handler and the RatisReplicationCheckHandler would take care of it instead?
   
   Right. If the replicas had sequence id equal to the container, I'd expect the `MisMatchedReplicasHandler` to try and close the replicas and return false so the container can be checked for replication. Then `RatisReplicationCheckHandler` would queue it for over replication and return true.
   
   Let me know if I can add some more doc in the code to make this clearer!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1409456506


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisReplicationCheckHandler.java:
##########
@@ -500,6 +501,43 @@ public void testOverReplicatedContainerWithMismatchedReplicas() {
         ReplicationManagerReport.HealthState.OVER_REPLICATED));
   }
 
+  @Test
+  public void shouldQueueForOverReplicationOnlyWhenSafe() {
+    ContainerInfo container =
+        createContainerInfo(repConfig, 1L, HddsProtos.LifeCycleState.CLOSED);
+    Set<ContainerReplica> replicas = createReplicas(container.containerID(),
+        ContainerReplicaProto.State.CLOSED, 0, 0);
+    ContainerReplica unhealthyReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.UNHEALTHY);
+    ContainerReplica mismatchedReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.QUASI_CLOSED);
+    replicas.add(mismatchedReplica);
+    replicas.add(unhealthyReplica);
+
+    requestBuilder.setContainerReplicas(replicas)
+        .setContainerInfo(container);
+
+    ContainerHealthResult.OverReplicatedHealthResult
+        result = (ContainerHealthResult.OverReplicatedHealthResult)
+        healthCheck.checkHealth(requestBuilder.build());
+
+    assertEquals(ContainerHealthResult.HealthState.OVER_REPLICATED,
+        result.getHealthState());
+    assertEquals(1, result.getExcessRedundancy());
+    assertFalse(result.isReplicatedOkAfterPending());
+
+    // not safe for over replication because we don't have 3 matching replicas
+    assertFalse(result.isSafelyOverReplicated());
+
+    assertTrue(healthCheck.handle(requestBuilder.build()));
+    assertEquals(0, repQueue.underReplicatedQueueSize());
+    assertEquals(0, repQueue.overReplicatedQueueSize());
+    assertEquals(1, report.getStat(

Review Comment:
   I wonder if we should be reporting the container as over-replicated, if we are not prepared to delete any of the replicas?
   
   This will be confusing and result in questions about why the over replication count is stuck and won't reduce.
   
   If the container is in a state where we cannot remove any of the replicas, then even if it has 4 replicas, I am not sure we should count it as over replicated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1410624509


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisUnhealthyReplicationCheckHandler.java:
##########
@@ -366,29 +352,45 @@ public void overReplicationCheckWithQuasiClosedReplica() {
     final ContainerInfo container = ReplicationTestUtil.createContainerInfo(
         repConfig, 1, HddsProtos.LifeCycleState.CLOSED, sequenceID);
 
-    final Set<ContainerReplica> replicas = new HashSet<>(2);
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-    replicas.add(createContainerReplica(container.containerID(), 0,
-        IN_SERVICE, State.CLOSED, sequenceID));
-
-    final ContainerReplica quasiClosedReplica =
-        createContainerReplica(container.containerID(), 0,
-            IN_SERVICE, State.QUASI_CLOSED, sequenceID - 1);
-    replicas.add(quasiClosedReplica);
+    Set<ContainerReplica> replicas = new HashSet<>(4);
+    for (int i = 0; i < 4; i++) {
+      replicas.add(createContainerReplica(container.containerID(), 0,
+          IN_SERVICE, State.QUASI_CLOSED, sequenceID - 1));

Review Comment:
   I think this is OK. I was just confirming my understanding really.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5678:
URL: https://github.com/apache/ozone/pull/5678#issuecomment-1833909048

   Thanks @siddhantsangwan for the patch, @sodonnel for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1409440866


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/health/RatisUnhealthyReplicationCheckHandler.java:
##########
@@ -31,18 +31,16 @@
 import static org.apache.hadoop.hdds.protocol.proto.HddsProtos.ReplicationType.RATIS;
 
 /**
- * This class handles UNHEALTHY RATIS replicas. Its responsibilities include:
- * <ul>
- *   <li>If only unhealthy replicas exist, check if there are replication
- *   factor number of replicas and queue them for under/over replication if
- *   needed.
- *   </li>
- *   <li>If there is a mixture of unhealthy and healthy containers this
- *   handler will only be called if there is no under or over replication after
- *   excluding the empty containers. If there is under or over replication the
- *   ECReplicationCheckHandler will take care of it first.
- *   </li>
- * </ul>
+ * This class handles RATIS containers which only have replicas in UNHEALTHY
+ * state, or CLOSED containers with replicas in QUASI_CLOSED state having
+ * less sequence ID. There are no other replicas. This class ensures that
+ * such containers have replication factor number of UNHEALTHY/QUASI_CLOSED
+ * replicas.
+ * <p>

Review Comment:
   Nit: there is a stray <p> tag in the description.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on PR #5678:
URL: https://github.com/apache/ozone/pull/5678#issuecomment-1831874309

   Fixed the two failing tests.`testQuasiClosedContainerWithUnhealthyReplicaOnUniqueOrigin` now expects no replicas to be deleted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9763. Over Replication Check of all UNHEALTHY replicas is broken [ozone]

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #5678:
URL: https://github.com/apache/ozone/pull/5678#discussion_r1410625671


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/health/TestRatisReplicationCheckHandler.java:
##########
@@ -500,6 +501,43 @@ public void testOverReplicatedContainerWithMismatchedReplicas() {
         ReplicationManagerReport.HealthState.OVER_REPLICATED));
   }
 
+  @Test
+  public void shouldQueueForOverReplicationOnlyWhenSafe() {
+    ContainerInfo container =
+        createContainerInfo(repConfig, 1L, HddsProtos.LifeCycleState.CLOSED);
+    Set<ContainerReplica> replicas = createReplicas(container.containerID(),
+        ContainerReplicaProto.State.CLOSED, 0, 0);
+    ContainerReplica unhealthyReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.UNHEALTHY);
+    ContainerReplica mismatchedReplica =
+        createContainerReplica(container.containerID(), 0, IN_SERVICE,
+            ContainerReplicaProto.State.QUASI_CLOSED);
+    replicas.add(mismatchedReplica);
+    replicas.add(unhealthyReplica);
+
+    requestBuilder.setContainerReplicas(replicas)
+        .setContainerInfo(container);
+
+    ContainerHealthResult.OverReplicatedHealthResult
+        result = (ContainerHealthResult.OverReplicatedHealthResult)
+        healthCheck.checkHealth(requestBuilder.build());
+
+    assertEquals(ContainerHealthResult.HealthState.OVER_REPLICATED,
+        result.getHealthState());
+    assertEquals(1, result.getExcessRedundancy());
+    assertFalse(result.isReplicatedOkAfterPending());
+
+    // not safe for over replication because we don't have 3 matching replicas
+    assertFalse(result.isSafelyOverReplicated());
+
+    assertTrue(healthCheck.handle(requestBuilder.build()));
+    assertEquals(0, repQueue.underReplicatedQueueSize());
+    assertEquals(0, repQueue.overReplicatedQueueSize());
+    assertEquals(1, report.getStat(

Review Comment:
   After some discussion, we decided it is OK to leave this code as it is, as the over-replication should be corrected when the other replicas get closed, unless there is a problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org