You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "djordje-mijatovic (via GitHub)" <gi...@apache.org> on 2023/01/25 08:55:46 UTC

[GitHub] [ozone] djordje-mijatovic opened a new pull request, #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

djordje-mijatovic opened a new pull request, #4207:
URL: https://github.com/apache/ozone/pull/4207

   ## What changes were proposed in this pull request?
   
   When the container goes into CLOSING state, if the data node is down than the container is stuck in CLOSING state forever, and the user does not know that this container is missing. In this PR we recalculate the HealthState of the CLOSING container to inform the user that container is also MISSING. The goal was to set the same HealthState as for CLOSED container.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-7210
   
   ## How was this patch tested?
   
   Unit test and manual dev tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1086829942


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1612,6 +1613,20 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (!replicas.stream().
+            anyMatch(r -> compareState(LifeCycleState.OPEN, r.getState()) ||
+                    compareState(LifeCycleState.CLOSED, r.getState()))) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());

Review Comment:
   The definition of MISSING is that the container is not able to be read because no replicas exist on any online nodes. If you have 3 replicas and then all 3 DNs go offline, then there will be zero replicas available. The replicas in SCM will be reduced to zero and hence `replicas.size() == 0` will be true. When a DN goes offline and is marked DEAD (after 10 minutes) its replicas are removed from SCM automatically - it does not remember the last location. If the node comes back online, then the node will sent a replica report which will add the replica location back to SCM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev merged pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev merged PR #4207:
URL: https://github.com/apache/ozone/pull/4207


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "siddhantsangwan (via GitHub)" <gi...@apache.org>.
siddhantsangwan commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1113903613


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1613,6 +1614,18 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (replicas.size() == 0) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());
+      report.incrementAndSample(HealthState.UNDER_REPLICATED,
+              container.containerID());
+      report.incrementAndSample(HealthState.MIS_REPLICATED,
+              container.containerID());

Review Comment:
   Yeah, I see that we're also setting the container as mis replicated, under replicated and missing further down in that method for closed containers.
   
   Looks like over time we've diverged from the original intention of the replication manager report:
   
   ```
    * This class is used by ReplicationManager. Each time ReplicationManager runs,
    * it creates a new instance of this class and increments the various counters
    * to allow for creating a report on the various container states within the
    * system. There is a counter for each LifeCycleState (open, closing, closed
    * etc) and the sum of each of the lifecycle state counters should equal the
    * total number of containers in SCM. Ie, each container can only be in one of
    * the Lifecycle states at any time.
   ```
   It specifies that each container should only be in one state at a time. 
   
   I think we need to decide what will best help with debugging. For example, if a container is missing, it's naturally also mis replicated and under replicated. We can choose to count it only once as missing or we can count it in all three categories, but that needs to be done consistently everywhere.
   
   The new RM does not count a missing container as mis replicated, but it does count it as under replicated in `RatisReplicationCheckHandler`. This is because it considers mis replication only when there is no under/over replication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1111923029


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1613,6 +1614,18 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (replicas.size() == 0) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());
+      report.incrementAndSample(HealthState.UNDER_REPLICATED,
+              container.containerID());
+      report.incrementAndSample(HealthState.MIS_REPLICATED,
+              container.containerID());

Review Comment:
   Why is the container categorized in all of these states?  Isn't `MISSING` enough?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] djordje-mijatovic commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "djordje-mijatovic (via GitHub)" <gi...@apache.org>.
djordje-mijatovic commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1113345171


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1613,6 +1614,18 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (replicas.size() == 0) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());
+      report.incrementAndSample(HealthState.UNDER_REPLICATED,
+              container.containerID());
+      report.incrementAndSample(HealthState.MIS_REPLICATED,
+              container.containerID());

Review Comment:
   In theory, MISSING could be enough but we wanted to have the same behavior as we have in Recon - this is way we have set all the states.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on PR #4207:
URL: https://github.com/apache/ozone/pull/4207#issuecomment-1438926676

   @siddhantsangwan , would you mind taking a look at this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] djordje-mijatovic commented on pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "djordje-mijatovic (via GitHub)" <gi...@apache.org>.
djordje-mijatovic commented on PR #4207:
URL: https://github.com/apache/ozone/pull/4207#issuecomment-1419421122

   @sodonnel Could you review this PR again? Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] djordje-mijatovic commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "djordje-mijatovic (via GitHub)" <gi...@apache.org>.
djordje-mijatovic commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1086610934


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1612,6 +1613,20 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (!replicas.stream().
+            anyMatch(r -> compareState(LifeCycleState.OPEN, r.getState()) ||
+                    compareState(LifeCycleState.CLOSED, r.getState()))) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());

Review Comment:
   It is not enough. For example, lets say we have 6 data nodes and replication factor 3. Container #1 can be stored on data nodes 1, 2 and 4, for example. If all 3 data nodes (data nodes 1, 2 and 4) go offline at the same moment, than the container #1 will be in CLOSING state and will have 2 replicas also in CLOSING state but it will still be MISSING. 
   I am new on this project, so I might be wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1113376266


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1613,6 +1614,18 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (replicas.size() == 0) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());
+      report.incrementAndSample(HealthState.UNDER_REPLICATED,
+              container.containerID());
+      report.incrementAndSample(HealthState.MIS_REPLICATED,
+              container.containerID());

Review Comment:
   > Isn't MISSING enough?
   
   This is also consistent with the replication manager detecting and reporting "`MISSING`" when the container is in the `CLOSED` state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1086422448


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1612,6 +1613,20 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (!replicas.stream().
+            anyMatch(r -> compareState(LifeCycleState.OPEN, r.getState()) ||
+                    compareState(LifeCycleState.CLOSED, r.getState()))) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());

Review Comment:
   The only way for a container to be "MISSING" is if there are no replicas at all, as otherwise there is a replica available to read. So can we not just say `if (replicas.size() == 0) ...` ?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] neils-dev commented on pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "neils-dev (via GitHub)" <gi...@apache.org>.
neils-dev commented on PR #4207:
URL: https://github.com/apache/ozone/pull/4207#issuecomment-1441069865

   Thanks @siddhantsangwan , filed a jira to handle closing containers and check for MISSING for the `ClosingContainerHandler`, HDDS-8017.  https://issues.apache.org/jira/browse/HDDS-8017
   
    
   > We will also need this fix in the new RM in ClosingContainerHandler. That can be done in a new PR or in this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] djordje-mijatovic commented on a diff in pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "djordje-mijatovic (via GitHub)" <gi...@apache.org>.
djordje-mijatovic commented on code in PR #4207:
URL: https://github.com/apache/ozone/pull/4207#discussion_r1086610934


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1612,6 +1613,20 @@ private boolean isOpenContainerHealthy(
         .allMatch(r -> compareState(state, r.getState()));
   }
 
+  private void setHealthStateForClosing(Set<ContainerReplica> replicas,
+                                        ContainerInfo container,
+                                        ReplicationManagerReport report) {
+    if (!replicas.stream().
+            anyMatch(r -> compareState(LifeCycleState.OPEN, r.getState()) ||
+                    compareState(LifeCycleState.CLOSED, r.getState()))) {
+      report.incrementAndSample(HealthState.MISSING, container.containerID());

Review Comment:
   It is not enough. For example, lets say we have 6 data nodes and replication factor 3. Container 1 can be stored on data nodes 1, 2 and 4, for example. If all 3 data nodes (data nodes 1, 2 and 4) go offline at the same moment, than the container 1 will be in CLOSING state and will have 2 replicas also in CLOSING state but it will still be MISSING. 
   I am new on this project, so I might be wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] arp7 commented on pull request #4207: HDDS-7210. Missing open containers show up as "Closing" on the container report.

Posted by "arp7 (via GitHub)" <gi...@apache.org>.
arp7 commented on PR #4207:
URL: https://github.com/apache/ozone/pull/4207#issuecomment-1568888546

   @djordje-mijatovic can you request an Apache jira account?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org