You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/08/09 20:17:21 UTC

[GitHub] [ozone] aswinshakil opened a new pull request, #3668: HDDS-7022. EC: Open EC container are not closed when SCM container was already closed.

aswinshakil opened a new pull request, #3668:
URL: https://github.com/apache/ozone/pull/3668

   ## What changes were proposed in this pull request?
   
   When a DN is Shutdown, SCM closes the pipeline and the containers associated with the DN. The `NodeManager` sends `closePipeline` and `closeContainer` commands to the DNs when it comes back as a response to DN heartbeat. In the event the commands on the NodeManager are gone (eg: SCM restart). The EC Container is left OPEN forever. This patch aims to close these OPEN containers. 
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-7022
   
   ## How was this patch tested?
   
   This patch was manually tested on a cluster and also using integration test. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao merged pull request #3668: HDDS-7022. EC: Open EC container are not closed when SCM container was already closed.

Posted by GitBox <gi...@apache.org>.

umamaheswararao merged PR #3668:
URL: https://github.com/apache/ozone/pull/3668


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3668: HDDS-7022. EC: Open EC container are not closed when SCM container was already closed.

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3668:
URL: https://github.com/apache/ozone/pull/3668#issuecomment-1223117285

   We probably should have added a unit test for this change in TestReplicationManager, similar to testUnhealthyOpenContainerClosed. All the logic in the new RM code has quite clean unit tests so far.
   
   Also, I believe this new logic should either have gone into the ECContainerHealthyCheck class, or we should have created a new "CommonHealthCheck" class, that gets called for both Ratis and EC containers. We are trying to avoid the main RM class taking on too many roles by splitting the business logic out into the separate check classes. The logic in the `if (containerInfo.getState() == HddsProtos.LifeCycleState.OPEN)` block should probably be in the same common class too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a diff in pull request #3668: HDDS-7022. EC: Open EC container are not closed when SCM container was already closed.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on code in PR #3668:
URL: https://github.com/apache/ozone/pull/3668#discussion_r942457298


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -401,6 +407,16 @@ protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
       return new ContainerHealthResult.HealthyResult(containerInfo);
     }
 
+    if (containerInfo.getState() == HddsProtos.LifeCycleState.CLOSED) {
+      List<ContainerReplica> unhealthyReplicas = replicas.stream()
+          .filter(r -> !compareState(containerInfo.getState(), r.getState()))
+          .collect(Collectors.toList());
+
+      if (unhealthyReplicas.size() > 0) {
+        handleUnhealthyReplicas(containerInfo, unhealthyReplicas);
+      }

Review Comment:
   In Legacy Replication Manager, we are returning after sending close command. Should we return too? There is no point of proceeding with the current replica states I think. UnderReplication processing may work though, but overreplication may fail still until that replica closed. Did you think about this?
   ```
   **/*
            * If the container is in CLOSING state, the replicas can either
            * be in OPEN or in CLOSING state. In both of this cases
            * we have to resend close container command to the datanodes.
            */
           if (state == LifeCycleState.CLOSING) {
             for (ContainerReplica replica: replicas) {
               if (replica.getState() != State.UNHEALTHY) {
                 sendCloseCommand(
                     container, replica.getDatanodeDetails(), false);
               }
             }
             return;
           }**
   ```



##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestECContainerRecovery.java:
##########
@@ -213,35 +216,56 @@ public void testContainerRecoveryOverReplicationProcessing()
     // increased.
     cluster.restartHddsDatanode(pipeline.getFirstNode(), true);
     // Check container is over replicated.
-    waitForContainerCount(6, containerID, scm);
+    waitForContainerCount(6, container.containerID(), scm);
+    // Wait for all the replicas to be closed.
+    container = scm.getContainerInfo(container.getContainerID());
+    waitForDNContainerState(container, scm);
 
     // Resume RM and wait the over replicated replica deleted.
     scm.getReplicationManager().start();
-    waitForContainerCount(5, containerID, scm);
+    waitForContainerCount(5, container.containerID(), scm);
+  }
+
+  private void waitForDNContainerState(ContainerInfo container,

Review Comment:
   This method should be renamed. We are not waiting for any specific DN state.
   And this test does not make much difference as you are anyway waiting until unhealthy replicas 0?
   That means, DN already reported it closed state.
   
   What we need to do it, we just need check scm state is closed or not, but not the replicas state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a diff in pull request #3668: HDDS-7022. EC: Open EC container are not closed when SCM container was already closed.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on code in PR #3668:
URL: https://github.com/apache/ozone/pull/3668#discussion_r942621070


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestECContainerRecovery.java:
##########
@@ -213,35 +216,56 @@ public void testContainerRecoveryOverReplicationProcessing()
     // increased.
     cluster.restartHddsDatanode(pipeline.getFirstNode(), true);
     // Check container is over replicated.
-    waitForContainerCount(6, containerID, scm);
+    waitForContainerCount(6, container.containerID(), scm);
+    // Wait for all the replicas to be closed.
+    container = scm.getContainerInfo(container.getContainerID());
+    waitForDNContainerState(container, scm);
 
     // Resume RM and wait the over replicated replica deleted.
     scm.getReplicationManager().start();
-    waitForContainerCount(5, containerID, scm);
+    waitForContainerCount(5, container.containerID(), scm);
+  }
+
+  private void waitForDNContainerState(ContainerInfo container,

Review Comment:
   I got your point, I was assumed that this method used to ensure container closed after DN kill. Now I realized this is different method and used later to ensure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] umamaheswararao commented on a diff in pull request #3668: HDDS-7022. EC: Open EC container are not closed when SCM container was already closed.

Posted by GitBox <gi...@apache.org>.

umamaheswararao commented on code in PR #3668:
URL: https://github.com/apache/ozone/pull/3668#discussion_r942868843


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -401,6 +407,16 @@ protected ContainerHealthResult processContainer(ContainerInfo containerInfo,
       return new ContainerHealthResult.HealthyResult(containerInfo);
     }
 
+    if (containerInfo.getState() == HddsProtos.LifeCycleState.CLOSED) {
+      List<ContainerReplica> unhealthyReplicas = replicas.stream()
+          .filter(r -> !compareState(containerInfo.getState(), r.getState()))
+          .collect(Collectors.toList());
+
+      if (unhealthyReplicas.size() > 0) {
+        handleUnhealthyReplicas(containerInfo, unhealthyReplicas);
+      }

Review Comment:
   ok, even if we don't return from here, we may have chance that possible under/over replications can processed meanwhile. In case if its  can't find enough source replicas, it would not be able to form commands. That is ok as it would be able to form in next iteration once all DNs closed their replicas.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org