You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "devmadhuu (via GitHub)" <gi...@apache.org> on 2023/12/20 12:33:49 UTC

[PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

devmadhuu opened a new pull request, #5841:
URL: https://github.com/apache/ozone/pull/5841

   ## What changes were proposed in this pull request?
   This PR addresses the issue of potential. memory overflow in case there are millions of containers because ContainerHealthTask tries to load all containers available in memory using `containerManager.getContainers()`, this API loads maximum available containers. So to avoid any memory overflow, this PR changes the API to use paginated based API.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-9819
   
   ## How was this patch tested?
   This patch was tested using existing junit test cases in order not to break anything.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1446949791


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +158,14 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > Integer.parseInt(DEFAULT_FETCH_COUNT)) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();
+        containers = containerManager.getContainers(startID,
+            Integer.parseInt(DEFAULT_FETCH_COUNT));
+      }
+      containers.clear();

Review Comment:
   @ArafatKhan2198 I couldn't understand this comment, as I don't see `containers.clear()`  outside the loop. Can you pls clarify ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1451745457


##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/fsck/TestContainerHealthTask.java:
##########
@@ -299,7 +304,7 @@ public void testDeletedContainer() throws Exception {
         .thenReturn(new ContainerWithPipeline(mockContainers.get(0), null));
 
     List<UnhealthyContainers> all = unHealthyContainersTableHandle.findAll();
-    assertThat(all).isEmpty();
+    assertTrue(all.isEmpty());

Review Comment:
   Please keep `assertThat`, added by HDDS-10034.
   
   ```suggestion
       assertThat(all).isEmpty();
   ```



##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/fsck/TestContainerHealthTask.java:
##########
@@ -192,7 +195,8 @@ public void testRun() throws Exception {
 
     ReconTaskStatus taskStatus =
         reconTaskStatusDao.findById(containerHealthTask.getTaskName());
-    assertThat(taskStatus.getLastUpdatedTimestamp()).isGreaterThan(currentTime);
+    assertTrue(taskStatus.getLastUpdatedTimestamp() >
+        currentTime);

Review Comment:
   Please keep `assertThat`, added by HDDS-10034.
   
   ```suggestion
       assertThat(taskStatus.getLastUpdatedTimestamp()).isGreaterThan(currentTime);
   ```



##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/fsck/TestContainerHealthTask.java:
##########
@@ -137,7 +140,7 @@ public void testRun() throws Exception {
         .thenReturn(Collections.emptySet());
 
     List<UnhealthyContainers> all = unHealthyContainersTableHandle.findAll();
-    assertThat(all).isEmpty();
+    assertTrue(all.isEmpty());

Review Comment:
   Please keep `assertThat`, added by HDDS-10034.
   
   ```suggestion
       assertThat(all).isEmpty();
   ```



##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/fsck/TestContainerHealthTask.java:
##########
@@ -327,7 +332,8 @@ public void testDeletedContainer() throws Exception {
 
     ReconTaskStatus taskStatus =
         reconTaskStatusDao.findById(containerHealthTask.getTaskName());
-    assertThat(taskStatus.getLastUpdatedTimestamp()).isGreaterThan(currentTime);
+    assertTrue(taskStatus.getLastUpdatedTimestamp() >
+        currentTime);

Review Comment:
   Please keep `assertThat`, added by HDDS-10034.
   
   ```suggestion
       assertThat(taskStatus.getLastUpdatedTimestamp()).isGreaterThan(currentTime);
   ```



##########
hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/fsck/TestContainerHealthTask.java:
##########
@@ -19,9 +19,11 @@
 package org.apache.hadoop.ozone.recon.fsck;
 
 import static org.hadoop.ozone.recon.schema.ContainerSchemaDefinition.UnHealthyContainerStates.ALL_REPLICAS_UNHEALTHY;
-import static org.assertj.core.api.Assertions.assertThat;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.ArgumentMatchers.anyInt;

Review Comment:
   Please keep `assertThat` instead of restoring `assertTrue`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on PR #5841:
URL: https://github.com/apache/ozone/pull/5841#issuecomment-1864399900

   @sumitagrawl @adoroszlai pls review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5841:
URL: https://github.com/apache/ozone/pull/5841#issuecomment-1878346611

   Thanks @devmadhuu for updating the patch.
   
   @ArafatKhan2198 please review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1432910300


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -132,7 +133,21 @@ public void triggerContainerHealthCheck() {
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
       start = Time.monotonicNow();
-      final List<ContainerInfo> containers = containerManager.getContainers();
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, start,
+          currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long start, long currentTime) {

Review Comment:
   `start` should not be a parameter, it should be set at the start of each iteration.



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -437,7 +454,7 @@ public static List<UnhealthyContainers> generateUnhealthyRecords(
 
       if (container.isUnderReplicated()
           && !recordForStateExists.contains(
-              UnHealthyContainerStates.UNDER_REPLICATED.toString())) {
+          UnHealthyContainerStates.UNDER_REPLICATED.toString())) {

Review Comment:
   nit: Please avoid unnecessary space changes.



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -280,7 +297,7 @@ private long processExistingDBRecords(long currentTime,
   private void processContainer(ContainerInfo container, long currentTime,
                                 Map<UnHealthyContainerStates,
                                     Map<String, Long>>
-                                      unhealthyContainerStateStatsMap) {
+                                    unhealthyContainerStateStatsMap) {

Review Comment:
   nit: Please avoid unnecessary space changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "ArafatKhan2198 (via GitHub)" <gi...@apache.org>.
ArafatKhan2198 commented on PR #5841:
URL: https://github.com/apache/ozone/pull/5841#issuecomment-1884672721

   Thanks for the changes @devmadhuu LGTM!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1448879052


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -131,8 +133,23 @@ public void triggerContainerHealthCheck() {
       LOG.info("Container Health task thread took {} milliseconds to" +
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
+
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long currentTime) {
+    ContainerID startID = ContainerID.valueOf(1);
+    List<ContainerInfo> containers = containerManager.getContainers(startID,
+        FETCH_COUNT);
+    long start;

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1433412635


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -132,7 +133,21 @@ public void triggerContainerHealthCheck() {
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
       start = Time.monotonicNow();
-      final List<ContainerInfo> containers = containerManager.getContainers();
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, start,
+          currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long start, long currentTime) {

Review Comment:
   Ok, I made the changes. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447255726


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +158,14 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > Integer.parseInt(DEFAULT_FETCH_COUNT)) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();
+        containers = containerManager.getContainers(startID,
+            Integer.parseInt(DEFAULT_FETCH_COUNT));
+      }
+      containers.clear();

Review Comment:
   As discussed, we have a case where last iteration in batch will not execute the if condition , so need to clear -off any in memory.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "ArafatKhan2198 (via GitHub)" <gi...@apache.org>.
ArafatKhan2198 commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447212054


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +158,14 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > Integer.parseInt(DEFAULT_FETCH_COUNT)) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();
+        containers = containerManager.getContainers(startID,
+            Integer.parseInt(DEFAULT_FETCH_COUNT));
+      }
+      containers.clear();

Review Comment:
   Apologies for the earlier mistake in the comment. I have now corrected it. Please take a look.
   
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on PR #5841:
URL: https://github.com/apache/ozone/pull/5841#issuecomment-1878174466

   > @devmadhuu please check if test failure in `TestReconTasks` is related:
   > 
   > https://github.com/devmadhuu/ozone/actions/runs/7283607925/job/19848007957#step:6:167
   
   Thanks @adoroszlai for review. I have resolved the test case failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai merged PR #5841:
URL: https://github.com/apache/ozone/pull/5841


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1433410979


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -280,7 +297,7 @@ private long processExistingDBRecords(long currentTime,
   private void processContainer(ContainerInfo container, long currentTime,
                                 Map<UnHealthyContainerStates,
                                     Map<String, Long>>
-                                      unhealthyContainerStateStatsMap) {
+                                    unhealthyContainerStateStatsMap) {

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447377922


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {
+        startID = ContainerID.valueOf(

Review Comment:
   Yes, will correct that, startId is inclusive.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447362977


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();

Review Comment:
   Ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1446927856


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -131,8 +132,23 @@ public void triggerContainerHealthCheck() {
       LOG.info("Container Health task thread took {} milliseconds to" +
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
+
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long currentTime) {
+    ContainerID startID = ContainerID.valueOf(1);
+    List<ContainerInfo> containers = containerManager.getContainers(startID,
+        Integer.parseInt(DEFAULT_FETCH_COUNT));

Review Comment:
   Ok.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "ArafatKhan2198 (via GitHub)" <gi...@apache.org>.
ArafatKhan2198 commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1444269244


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +158,14 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > Integer.parseInt(DEFAULT_FETCH_COUNT)) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();
+        containers = containerManager.getContainers(startID,
+            Integer.parseInt(DEFAULT_FETCH_COUNT));
+      }
+      containers.clear();

Review Comment:
   The `containers.clear()`; statement inside the loop is meant to clear the list before fetching the next batch. However, the last `containers.clear();` statement outside the loop seems redundant since the loop condition `(while (!containers.isEmpty()))` ensures that it's only executed when containers is empty.?
   
   



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -131,8 +132,23 @@ public void triggerContainerHealthCheck() {
       LOG.info("Container Health task thread took {} milliseconds to" +
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
+
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long currentTime) {
+    ContainerID startID = ContainerID.valueOf(1);
+    List<ContainerInfo> containers = containerManager.getContainers(startID,
+        Integer.parseInt(DEFAULT_FETCH_COUNT));

Review Comment:
   Could we replace the usage of `Integer.parseInt(DEFAULT_FETCH_COUNT)` with a named constant or a method as `DEFAULT_FETCH_COUNT` is a constant. 
   
   ```
   private static final int FETCH_COUNT = Integer.parseInt(DEFAULT_FETCH_COUNT);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "ArafatKhan2198 (via GitHub)" <gi...@apache.org>.
ArafatKhan2198 commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1444269244


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +158,14 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > Integer.parseInt(DEFAULT_FETCH_COUNT)) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();
+        containers = containerManager.getContainers(startID,
+            Integer.parseInt(DEFAULT_FETCH_COUNT));
+      }
+      containers.clear();

Review Comment:
   The `containers.clear()`; statement inside the if block is meant to clear the list before fetching the next batch. However, the last `containers.clear();` statement just outside the if block seems redundant since the loop condition `(while (!containers.isEmpty()))` ensures that it's only executed when containers is empty.?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "sumitagrawl (via GitHub)" <gi...@apache.org>.
sumitagrawl commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447290507


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {

Review Comment:
   When number of containers greater than fetch_count, it will return FETCH_COUNT, what is possibility of this to be greater than FETCH_COUNT? check if going to second iteration



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {
+        startID = ContainerID.valueOf(
+            containers.get(containers.size() - 1).getContainerID());
+        containers.clear();

Review Comment:
   containers.clear() may not be requried as its local variable and overwritten on next assignment also



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -131,8 +133,23 @@ public void triggerContainerHealthCheck() {
       LOG.info("Container Health task thread took {} milliseconds to" +
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
+
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long currentTime) {
+    ContainerID startID = ContainerID.valueOf(1);
+    List<ContainerInfo> containers = containerManager.getContainers(startID,
+        FETCH_COUNT);
+    long start;

Review Comment:
   we can log iteration count also here for time taken, to identify if this log is for same task



##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {
+        startID = ContainerID.valueOf(

Review Comment:
   startID should be "lastID + 1", right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447376030


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {

Review Comment:
   Thanks @sumitagrawl for review. Yes you are right. I'll change the logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5841:
URL: https://github.com/apache/ozone/pull/5841#issuecomment-1866605641

   @devmadhuu please check if test failure in `TestReconTasks` is related:
   
   https://github.com/devmadhuu/ozone/actions/runs/7283607925/job/19848007957#step:6:167
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819.Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1433412290


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -437,7 +454,7 @@ public static List<UnhealthyContainers> generateUnhealthyRecords(
 
       if (container.isUnderReplicated()
           && !recordForStateExists.contains(
-              UnHealthyContainerStates.UNDER_REPLICATED.toString())) {
+          UnHealthyContainerStates.UNDER_REPLICATED.toString())) {

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "ArafatKhan2198 (via GitHub)" <gi...@apache.org>.
ArafatKhan2198 commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1444267264


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -131,8 +132,23 @@ public void triggerContainerHealthCheck() {
       LOG.info("Container Health task thread took {} milliseconds to" +
               " process {} existing database records.",
           Time.monotonicNow() - start, existingCount);
+
+      checkAndProcessContainers(unhealthyContainerStateStatsMap, currentTime);
+      processedContainers.clear();
+    } finally {
+      lock.writeLock().unlock();
+    }
+  }
+
+  private void checkAndProcessContainers(
+      Map<UnHealthyContainerStates, Map<String, Long>>
+          unhealthyContainerStateStatsMap, long currentTime) {
+    ContainerID startID = ContainerID.valueOf(1);
+    List<ContainerInfo> containers = containerManager.getContainers(startID,
+        Integer.parseInt(DEFAULT_FETCH_COUNT));

Review Comment:
   Could we replace the usage of `Integer.parseInt(DEFAULT_FETCH_COUNT)` with a named constant as `DEFAULT_FETCH_COUNT` is a constant. 
   
   ```
   private static final int FETCH_COUNT = Integer.parseInt(DEFAULT_FETCH_COUNT);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #5841:
URL: https://github.com/apache/ozone/pull/5841#issuecomment-1890994753

   Thanks @devmadhuu for the patch, @ArafatKhan2198, @sumitagrawl for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447362085


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {

Review Comment:
   Thanks @sumitagrawl for review. FETCH_COUNT is 1000 and in cluster we can have more than 1000 containers. so 1st iteration will fetch 1000 containers , 2nd iteration will fetch any containers from 1001th container Id to 2000th container and so on....Did I mis-understood your point ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9819. Recon - Potential memory overflow in Container Health Task. [ozone]

Posted by "devmadhuu (via GitHub)" <gi...@apache.org>.
devmadhuu commented on code in PR #5841:
URL: https://github.com/apache/ozone/pull/5841#discussion_r1447362085


##########
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java:
##########
@@ -142,9 +159,13 @@ public void triggerContainerHealthCheck() {
               " processing {} containers.", Time.monotonicNow() - start,
           containers.size());
       logUnhealthyContainerStats(unhealthyContainerStateStatsMap);
-      processedContainers.clear();
-    } finally {
-      lock.writeLock().unlock();
+      if (containers.size() > FETCH_COUNT) {

Review Comment:
   Thanks @sumitagrawl for review. FETCH_COUNT is 1000 and in cluster we can have more than 1000 containers. so 1st iteration will fetch 1000 containers , 2nd iteration will fetch any containers from 1001th container Id to 2000th container and so on....Did I mis-understood your point ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org