You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/06/24 11:00:58 UTC

[GitHub] [ozone] sodonnel opened a new pull request, #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

sodonnel opened a new pull request, #3545:
URL: https://github.com/apache/ozone/pull/3545

   ## What changes were proposed in this pull request?
   
   Scan all containers in Replication Manager. For the EC containers, pass them to the EcContainerHealthCheck class to allow their health to be check. For any under or over replicated containers add them a list which later will be sorted by priority and turned into a queue for subsequent stages of the RM processing.
   
   ## What is the link to the Apache JIRA
   
   [ec-HDDS-6699](https://issues.apache.org/jira/browse/HDDS-6699)
   
   ## How was this patch tested?
   
   New unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] umamaheswararao merged pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
umamaheswararao merged PR #3545:
URL: https://github.com/apache/ozone/pull/3545


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r907303588


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaOp.java:
##########
@@ -36,6 +36,12 @@ public enum PendingOpType {
   private int replicaIndex;
   private long scheduledEpochMillis;
 
+  public static ContainerReplicaOp create(PendingOpType opType,
+      DatanodeDetails target, int replicaIndex) {
+    return new ContainerReplicaOp(opType, target, replicaIndex,
+        System.currentTimeMillis());
+  }

Review Comment:
   We can keep it this way then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] umamaheswararao commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r906293237


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);

Review Comment:
   What do you mean by below TODO? I thought we will just populate the under and over replicated lists above and later we process them with queues at the line 254: TODO.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);
+        // TODO - send any commands contained in the health result
+      } catch (ContainerNotFoundException e) {
+        LOG.error("Container {} not found", c.getContainerID(), e);
       }
     }
     report.setComplete();
+    // TODO - Sort the pending lists by priority and assign to the main queue,
+    //        which is yet to be defined.
     this.containerReport = report;
     LOG.info("Replication Monitor Thread took {} milliseconds for" +
             " processing {} containers.", clock.millis() - start,
         containers.size());
   }
 
+  protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
+      List<ContainerHealthResult.UnderReplicatedHealthResult> underRep,
+      List<ContainerHealthResult.OverReplicatedHealthResult> overRep,
+      ReplicationManagerReport report) throws ContainerNotFoundException {

Review Comment:
   variable and assignment can be in same statement.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);

Review Comment:
   What do you mean by below TODO? I thought we will just populate the under and over replicated lists above and later we process them with queues at the line 254: TODO.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);
+        // TODO - send any commands contained in the health result
+      } catch (ContainerNotFoundException e) {
+        LOG.error("Container {} not found", c.getContainerID(), e);
       }
     }
     report.setComplete();
+    // TODO - Sort the pending lists by priority and assign to the main queue,
+    //        which is yet to be defined.
     this.containerReport = report;
     LOG.info("Replication Monitor Thread took {} milliseconds for" +
             " processing {} containers.", clock.millis() - start,
         containers.size());
   }
 
+  protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
+      List<ContainerHealthResult.UnderReplicatedHealthResult> underRep,
+      List<ContainerHealthResult.OverReplicatedHealthResult> overRep,
+      ReplicationManagerReport report) throws ContainerNotFoundException {
+    Set<ContainerReplica> replicas;
+    replicas = containerManager.getContainerReplicas(
+        containerInfo.containerID());
+    List<ContainerReplicaOp> pendingOps =
+        containerReplicaPendingOps.getPendingOps(containerInfo.containerID());
+    ContainerHealthResult health = ecContainerHealthCheck
+        .checkHealth(containerInfo, replicas, pendingOps, 0);
+      // TODO - should the report have a HEALTHY state, rather than just bad
+      //        states? It would need to be added to legacy RM too.
+    if (health.getHealthState()
+        == ContainerHealthResult.HealthState.UNDER_REPLICATED) {
+      report.incrementAndSample(
+          HealthState.UNDER_REPLICATED, containerInfo.containerID());
+      ContainerHealthResult.UnderReplicatedHealthResult underHealth
+          = ((ContainerHealthResult.UnderReplicatedHealthResult) health);
+      if (underHealth.isUnrecoverable()) {
+        // TODO - do we need a new health state for unrecoverable EC?
+        report.incrementAndSample(
+            HealthState.MISSING, containerInfo.containerID());
+      }
+      if (!underHealth.isSufficientlyReplicatedAfterPending() &&
+          !underHealth.isUnrecoverable()) {
+        underRep.add(underHealth);
+      }
+    } else if (health.getHealthState()

Review Comment:
   I am wondering in EC case, a container can have both situations. Some indexes over-replicated while other replicas missing. In that situation what is the HealthCheck result? We may give preference to missing? meaning Underreplication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r907203555


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);
+        // TODO - send any commands contained in the health result
+      } catch (ContainerNotFoundException e) {
+        LOG.error("Container {} not found", c.getContainerID(), e);
       }
     }
     report.setComplete();
+    // TODO - Sort the pending lists by priority and assign to the main queue,
+    //        which is yet to be defined.
     this.containerReport = report;
     LOG.info("Replication Monitor Thread took {} milliseconds for" +
             " processing {} containers.", clock.millis() - start,
         containers.size());
   }
 
+  protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
+      List<ContainerHealthResult.UnderReplicatedHealthResult> underRep,
+      List<ContainerHealthResult.OverReplicatedHealthResult> overRep,
+      ReplicationManagerReport report) throws ContainerNotFoundException {
+    Set<ContainerReplica> replicas;
+    replicas = containerManager.getContainerReplicas(
+        containerInfo.containerID());
+    List<ContainerReplicaOp> pendingOps =
+        containerReplicaPendingOps.getPendingOps(containerInfo.containerID());
+    ContainerHealthResult health = ecContainerHealthCheck
+        .checkHealth(containerInfo, replicas, pendingOps, 0);
+      // TODO - should the report have a HEALTHY state, rather than just bad
+      //        states? It would need to be added to legacy RM too.
+    if (health.getHealthState()
+        == ContainerHealthResult.HealthState.UNDER_REPLICATED) {
+      report.incrementAndSample(
+          HealthState.UNDER_REPLICATED, containerInfo.containerID());
+      ContainerHealthResult.UnderReplicatedHealthResult underHealth
+          = ((ContainerHealthResult.UnderReplicatedHealthResult) health);
+      if (underHealth.isUnrecoverable()) {
+        // TODO - do we need a new health state for unrecoverable EC?
+        report.incrementAndSample(
+            HealthState.MISSING, containerInfo.containerID());
+      }
+      if (!underHealth.isSufficientlyReplicatedAfterPending() &&
+          !underHealth.isUnrecoverable()) {
+        underRep.add(underHealth);
+      }
+    } else if (health.getHealthState()

Review Comment:
   For now I thought it would be easier to have only a single state. The worse case is under-replication, so ideally we fix that first. When that is fixed, the container will get processed again and fix the over-replication. So yes, the if the container is both over and under replicated, under-replicated will be the result and over will be ignored until it is fixed.
   
   I guess there is an edge case, where the container is both missing (unrecoverable) and over-replicated. The missing will never get fixed and it will be stuck like that. I am not sure what the answer is here - probably the container needs to be removed from the system as it cannot be read anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r907271078


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaOp.java:
##########
@@ -36,6 +36,12 @@ public enum PendingOpType {
   private int replicaIndex;
   private long scheduledEpochMillis;
 
+  public static ContainerReplicaOp create(PendingOpType opType,
+      DatanodeDetails target, int replicaIndex) {
+    return new ContainerReplicaOp(opType, target, replicaIndex,
+        System.currentTimeMillis());
+  }

Review Comment:
   Nit: as far as I see this is only for test (`TestECContainerHealthCheck`), wouldn't it be better to define as helper in that class (or `ReplicationTestUtil`)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r907200968


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);
+        // TODO - send any commands contained in the health result
+      } catch (ContainerNotFoundException e) {
+        LOG.error("Container {} not found", c.getContainerID(), e);
       }
     }
     report.setComplete();
+    // TODO - Sort the pending lists by priority and assign to the main queue,
+    //        which is yet to be defined.
     this.containerReport = report;
     LOG.info("Replication Monitor Thread took {} milliseconds for" +
             " processing {} containers.", clock.millis() - start,
         containers.size());
   }
 
+  protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
+      List<ContainerHealthResult.UnderReplicatedHealthResult> underRep,
+      List<ContainerHealthResult.OverReplicatedHealthResult> overRep,
+      ReplicationManagerReport report) throws ContainerNotFoundException {

Review Comment:
   Ah yes, I think I had a try...catch block here before to handle containerNotFound, but changed it. I will fix this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] umamaheswararao commented on pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on PR #3545:
URL: https://github.com/apache/ozone/pull/3545#issuecomment-1165834898

   Overall patch looks good to me, I have just dropped few comments/questions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r907200048


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -233,24 +229,77 @@ public synchronized void processAll() {
     final List<ContainerInfo> containers =
         containerManager.getContainers();
     ReplicationManagerReport report = new ReplicationManagerReport();
+    List<ContainerHealthResult.UnderReplicatedHealthResult> underReplicated =
+        new ArrayList<>();
+    List<ContainerHealthResult.OverReplicatedHealthResult> overReplicated =
+        new ArrayList<>();
+
     for (ContainerInfo c : containers) {
       if (!shouldRun()) {
         break;
       }
-      switch (c.getReplicationType()) {
-      case EC:
-        break;
-      default:
+      report.increment(c.getState());
+      if (c.getReplicationType() != EC) {
         legacyReplicationManager.processContainer(c, report);
+        continue;
+      }
+      try {
+        processContainer(c, underReplicated, overReplicated, report);

Review Comment:
   At the moment the health check commands do not return any commands, but they might. Eg if we put in some logic about unhealthy containers, eg not all replicas in the same state - there is a check like this in the Legacy RM for ratis containers. Then the command may return some commands such as "closeContainer". We should fire those commands onto the event queue if they are returned.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3545: HDDS-6699. EC: ReplicationManager - collect under and over replicated containers

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3545:
URL: https://github.com/apache/ozone/pull/3545#discussion_r907283654


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaOp.java:
##########
@@ -36,6 +36,12 @@ public enum PendingOpType {
   private int replicaIndex;
   private long scheduledEpochMillis;
 
+  public static ContainerReplicaOp create(PendingOpType opType,
+      DatanodeDetails target, int replicaIndex) {
+    return new ContainerReplicaOp(opType, target, replicaIndex,
+        System.currentTimeMillis());
+  }

Review Comment:
   So far, we don't create any ContainerRepliaOps except in tests, but the main code will soon need to create them, and this static method should make it a little bit nicer to create them in the main code flow when we do. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org