You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/09/15 13:56:50 UTC

[GitHub] [ozone] sumitagrawl opened a new pull request, #3751: HDDS-6492. Add metric for failed container moves

sumitagrawl opened a new pull request, #3751:
URL: https://github.com/apache/ozone/pull/3751

   ## What changes were proposed in this pull request?
   
   A metric is added for move container failed due to exception occurred in execution - numContainerMovesFailed. This does not include metrics for replication know failure.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-6492
   
   ## How was this patch tested?
   
   Patch is tested with unit test, here multiple type of exception is thrown using mock of replication manager - checkIterationResultException
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3751: HDDS-6492 Add metric for failed container moves.

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r970994760


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
           }).count();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+      metrics.incrementNumContainerMovesFailed(timeoutCounts);
     } catch (ExecutionException e) {
       LOG.error("Got exception while checkIterationMoveResults", e);
+      metrics.incrementNumContainerMovesFailed(moveSelectionToFutureMap.size());

Review Comment:
   Does getting an ExecutionException mean that all container moves failed? We probably still need to check individual move results like we're doing above for `TimeoutException`.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,6 +529,7 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      metrics.incrementNumContainerMovesFailed(1);

Review Comment:
   Not sure I understand this change - why are we increasing move failures by one if the current thread is interrupted?



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
           }).count();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+      metrics.incrementNumContainerMovesFailed(timeoutCounts);

Review Comment:
   If we're also considering timeouts as move failures, we need to mention this in the metrics description. We'd also need to update failure metrics in other places where timeouts are being calculated, such as `ContainerBalancerMetrics#incrementCurrentIterationContainerMoveMetric`.
   
   Maybe it's better to keep timeout and failures separate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979748292


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,10 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      LOG.warn("Container balancer is interrupted");
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   I remember I added this cancelation because of some issue in our cluster: even after the timeout, the request is also handled by datanodes, then many containers have 4 replicas, causing ReplicationManager to clean the redundant one.
   
   The cancelation here can help eliminate these cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on pull request #3751: HDDS-6492 Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1246291358

   @siddhantsangwan @lokeshj1703 @symious Please review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977516700


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);
+        }).count();
+  }
+
+  private long performMoveCancelAndFailCount() {
+    long cnt = moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> entry.getValue().isDone() && entry.getValue()
+            .isCompletedExceptionally()).count();
+    moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",

Review Comment:
   removed the specific method as completeExceptionally also not applicable for interrup exception and will be counted in whenComplete(), reused cancelAndCountPendingMoves()



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r972038381


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,6 +529,7 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      metrics.incrementNumContainerMovesFailed(1);

Review Comment:
   Will cancel moves which is in progress, and mark that as failed as move is not completed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan merged pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
siddhantsangwan merged PR #3751:
URL: https://github.com/apache/ozone/pull/3751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979734540


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,10 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      LOG.warn("Container balancer is interrupted");
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   The cancel part should be needed to handle timeout tasks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1248137129

   > Overall, I think it's better to count failed moves in the `whenComplete((result, ex)` stage of completable future in the `moveContainer` method. That way, we can count all the failures at one place. `checkIterationMoveResults` can be used for updating aggregate metrics (across all iterations). @sumitagrawl What do you think?
   
   I checked internet, replication.move() should return completable failure, and must not throw exception to get in whenComplete() handling, currently replication code is not written in this manner, so exception will never come.
   eg:
   CompletableFuture<String> cf0 =
       CompletableFuture.failedFuture(new RuntimeException("Oops"));


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r972038381


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,6 +529,7 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      metrics.incrementNumContainerMovesFailed(1);

Review Comment:
   Will cancel moves which is in progress, and mark that as failed as move is not completed. Also update with count for failed one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979793972


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,10 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      LOG.warn("Container balancer is interrupted");
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   reverted changes and raised PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1249051315

   @siddhantsangwan updated with all comments, please recheck


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977564607


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,17 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long cancelCount = cancelAndCountPendingMoves();
+      metrics.incrementNumContainerMovesFailed(cancelCount);
+      LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+          cancelCount);

Review Comment:
   Removed the case of counting as failed. Just cancelling the moves. But as cancel, possibly of exception caught in whenComplete(), that will be counted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979733008


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,10 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      LOG.warn("Container balancer is interrupted");
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   Its planned to reuse same in interrupt and Execution, so refactored to move to new function performing this task. But later using that is removed as discussion an removal of cancel partl



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1249362646

   > I checked internet, replication.move() should return completable failure, and must not throw exception to get in whenComplete() handling, currently replication code is not written in this manner, so exception will never come.
   
   Any checked exceptions thrown by `replicationManager.move` can be counted in their respective catch blocks. Other execution exceptions can be handled in `whenComplete`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977515659


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {

Review Comment:
   renamed as cancelAndCountPendingMoves as applicable for cancelling on exception case, not only timeout.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r978314884


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java:
##########
@@ -779,7 +780,52 @@ public void checkIterationResultTimeoutFromReplicationManager()
         .getNumContainerMovesTimeoutInLatestIteration() > 0);
     stopBalancer();
   }
-  
+
+  @Test
+  public void checkIterationResultException()
+      throws NodeNotFoundException, IOException,
+      IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException,
+      TimeoutException {
+
+    CompletableFuture<MoveResult> f = new CompletableFuture();
+    f.completeExceptionally(new RuntimeException("Runtime Exception"));
+    Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
+            Mockito.any(DatanodeDetails.class),
+            Mockito.any(DatanodeDetails.class)))
+        .thenThrow(new ContainerNotFoundException("Test Container not found"),
+            new NodeNotFoundException("Test Node not found"))
+        .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
+          try {
+            Thread.sleep(1000);

Review Comment:
   NIT: Is this deliberately set to a high value? Maybe we can reduce it to something like 5ms.



##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java:
##########
@@ -779,7 +780,52 @@ public void checkIterationResultTimeoutFromReplicationManager()
         .getNumContainerMovesTimeoutInLatestIteration() > 0);
     stopBalancer();
   }
-  
+
+  @Test
+  public void checkIterationResultException()
+      throws NodeNotFoundException, IOException,
+      IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException,
+      TimeoutException {
+
+    CompletableFuture<MoveResult> f = new CompletableFuture();
+    f.completeExceptionally(new RuntimeException("Runtime Exception"));
+    Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
+            Mockito.any(DatanodeDetails.class),
+            Mockito.any(DatanodeDetails.class)))
+        .thenThrow(new ContainerNotFoundException("Test Container not found"),
+            new NodeNotFoundException("Test Node not found"))
+        .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
+          try {
+            Thread.sleep(1000);
+          } catch (Exception ex) {
+          }
+          throw new RuntimeException("Throw");
+        }));
+
+    balancerConfiguration.setThreshold(10);
+    balancerConfiguration.setIterations(1);
+    balancerConfiguration.setMaxSizeEnteringTarget(10 * STORAGE_UNIT);
+    balancerConfiguration.setMaxSizeToMovePerIteration(100 * STORAGE_UNIT);
+    balancerConfiguration.setMaxDatanodesPercentageToInvolvePerIteration(100);
+    balancerConfiguration.setMoveTimeout(Duration.ofMillis(2000));
+
+    startBalancer(balancerConfiguration);
+    sleepWhileBalancing(2000);
+
+    /*
+    Immediate failure count is 2 and timeout count is 4 together
+     */

Review Comment:
   NIT: This comment can be removed now since we aren't counting timeouts.



##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java:
##########
@@ -779,7 +780,52 @@ public void checkIterationResultTimeoutFromReplicationManager()
         .getNumContainerMovesTimeoutInLatestIteration() > 0);
     stopBalancer();
   }
-  
+
+  @Test
+  public void checkIterationResultException()
+      throws NodeNotFoundException, IOException,
+      IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException,
+      TimeoutException {
+
+    CompletableFuture<MoveResult> f = new CompletableFuture();
+    f.completeExceptionally(new RuntimeException("Runtime Exception"));
+    Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
+            Mockito.any(DatanodeDetails.class),
+            Mockito.any(DatanodeDetails.class)))
+        .thenThrow(new ContainerNotFoundException("Test Container not found"),
+            new NodeNotFoundException("Test Node not found"))
+        .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
+          try {
+            Thread.sleep(1000);
+          } catch (Exception ex) {
+          }
+          throw new RuntimeException("Throw");
+        }));
+
+    balancerConfiguration.setThreshold(10);
+    balancerConfiguration.setIterations(1);
+    balancerConfiguration.setMaxSizeEnteringTarget(10 * STORAGE_UNIT);
+    balancerConfiguration.setMaxSizeToMovePerIteration(100 * STORAGE_UNIT);
+    balancerConfiguration.setMaxDatanodesPercentageToInvolvePerIteration(100);
+    balancerConfiguration.setMoveTimeout(Duration.ofMillis(2000));
+
+    startBalancer(balancerConfiguration);
+    sleepWhileBalancing(2000);

Review Comment:
   Same as above. Can be reduced to 500ms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r972041061


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
           }).count();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+      metrics.incrementNumContainerMovesFailed(timeoutCounts);

Review Comment:
   Will remove this metrics from timeout



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979742381


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,10 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      LOG.warn("Container balancer is interrupted");
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   @symious This is updated based on review comment: https://github.com/apache/ozone/pull/3751#discussion_r978370000



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r978422270


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java:
##########
@@ -779,7 +780,52 @@ public void checkIterationResultTimeoutFromReplicationManager()
         .getNumContainerMovesTimeoutInLatestIteration() > 0);
     stopBalancer();
   }
-  
+
+  @Test
+  public void checkIterationResultException()
+      throws NodeNotFoundException, IOException,
+      IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException,
+      TimeoutException {
+
+    CompletableFuture<MoveResult> f = new CompletableFuture();
+    f.completeExceptionally(new RuntimeException("Runtime Exception"));
+    Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
+            Mockito.any(DatanodeDetails.class),
+            Mockito.any(DatanodeDetails.class)))
+        .thenThrow(new ContainerNotFoundException("Test Container not found"),
+            new NodeNotFoundException("Test Node not found"))
+        .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
+          try {
+            Thread.sleep(1000);
+          } catch (Exception ex) {
+          }
+          throw new RuntimeException("Throw");
+        }));
+
+    balancerConfiguration.setThreshold(10);
+    balancerConfiguration.setIterations(1);
+    balancerConfiguration.setMaxSizeEnteringTarget(10 * STORAGE_UNIT);
+    balancerConfiguration.setMaxSizeToMovePerIteration(100 * STORAGE_UNIT);
+    balancerConfiguration.setMaxDatanodesPercentageToInvolvePerIteration(100);
+    balancerConfiguration.setMoveTimeout(Duration.ofMillis(2000));
+
+    startBalancer(balancerConfiguration);
+    sleepWhileBalancing(2000);

Review Comment:
   This also set to high value of 2 second as other cases deliberately, need to have timeout



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977563198


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,17 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long cancelCount = cancelAndCountPendingMoves();
+      metrics.incrementNumContainerMovesFailed(cancelCount);
+      LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+          cancelCount);
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);
-          }).count();
+      long timeoutCounts = cancelAndCountPendingMoves();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
     } catch (ExecutionException e) {
+      cancelAndCountPendingMoves();

Review Comment:
   Not doing any count of failed metrics in this exception case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r978362904


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +563,20 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long cancelAndCountPendingMoves() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {}.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);

Review Comment:
   Based on cancel javadoc, the cancel function completes the future exceptionally. I think if we call this fn on InterruptedException, then it would mark all the pending moves as failed in the whenComplete call. I do not think we should count this as move failure. It could trigger an unnecessary alarm for the administrator.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,12 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long cancelCount = cancelAndCountPendingMoves();
+      LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+          cancelCount);
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   Unrelated to PR, I think for timeout we have been cancelling the future. This would mean that timeout is considered as both move failure and move timeout. I think it would be better to consider the different metrics as disjoint sets (completed, timeout, failed)?
   @siddhantsangwan 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979837592


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java:
##########
@@ -779,7 +780,52 @@ public void checkIterationResultTimeoutFromReplicationManager()
         .getNumContainerMovesTimeoutInLatestIteration() > 0);
     stopBalancer();
   }
-  
+
+  @Test
+  public void checkIterationResultException()
+      throws NodeNotFoundException, IOException,
+      IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException,
+      TimeoutException {
+
+    CompletableFuture<MoveResult> f = new CompletableFuture();
+    f.completeExceptionally(new RuntimeException("Runtime Exception"));
+    Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
+            Mockito.any(DatanodeDetails.class),
+            Mockito.any(DatanodeDetails.class)))
+        .thenThrow(new ContainerNotFoundException("Test Container not found"),
+            new NodeNotFoundException("Test Node not found"))
+        .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
+          try {
+            Thread.sleep(1000);
+          } catch (Exception ex) {
+          }
+          throw new RuntimeException("Throw");
+        }));
+
+    balancerConfiguration.setThreshold(10);
+    balancerConfiguration.setIterations(1);
+    balancerConfiguration.setMaxSizeEnteringTarget(10 * STORAGE_UNIT);
+    balancerConfiguration.setMaxSizeToMovePerIteration(100 * STORAGE_UNIT);
+    balancerConfiguration.setMaxDatanodesPercentageToInvolvePerIteration(100);
+    balancerConfiguration.setMoveTimeout(Duration.ofMillis(2000));
+
+    startBalancer(balancerConfiguration);
+    sleepWhileBalancing(2000);

Review Comment:
   ok timeout changed to 500 milli sec, and task to 200 milli sec. No impact to overall execution.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3751: HDDS-6492 Add metric for failed container moves.

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1247730996

   Overall, I think it's better to count failed moves in the `whenComplete((result, ex)` stage of completable future in the `moveContainer` method. That way, we can count all the failures at one place. `checkIterationMoveResults` can be used for updating aggregate metrics (across all iterations). @sumitagrawl What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977465464


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);
+        }).count();
+  }
+
+  private long performMoveCancelAndFailCount() {
+    long cnt = moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> entry.getValue().isDone() && entry.getValue()
+            .isCompletedExceptionally()).count();
+    moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",

Review Comment:
   Should be "failure" instead of "timeout".



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);
+        }).count();
+  }
+
+  private long performMoveCancelAndFailCount() {

Review Comment:
   How does `cancelAndCountFailedMoves` sound?



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {

Review Comment:
   Let's rename this method to something like `cancelAndCountTimedOutMoves`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977530477


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,17 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long cancelCount = cancelAndCountPendingMoves();
+      metrics.incrementNumContainerMovesFailed(cancelCount);
+      LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+          cancelCount);

Review Comment:
   We do not need to handle this since InterruptedException would occur in case balancer is stopped. We shouldn't mark the moves as failed then. It could be misleading.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,17 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long cancelCount = cancelAndCountPendingMoves();
+      metrics.incrementNumContainerMovesFailed(cancelCount);
+      LOG.warn("Container balancer is interrupted and moves are cancelled {}",
+          cancelCount);
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);
-          }).count();
+      long timeoutCounts = cancelAndCountPendingMoves();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
     } catch (ExecutionException e) {
+      cancelAndCountPendingMoves();

Review Comment:
   Based on javadoc
   ExecutionException – if this future completed exceptionally
   This case would be handled in the respective whenComplete operation.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,22 +529,15 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      long failedCount = performMoveCancelAndFailCount();
+      metrics.incrementNumContainerMovesFailed(failedCount);

Review Comment:
   I feel this is not a failure scenario since the balancer is being stopped here.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);

Review Comment:
   Since we are cancelling the future, it can throw exception which would be handled by the whenComplete leading to metric being updated twice.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerMetrics.java:
##########
@@ -74,6 +74,10 @@ public final class ContainerBalancerMetrics {
       "Container Balancer.")
   private MutableCounterLong dataSizeMovedGB;
 
+  @Metric(about = "Total number container for which moves failed " +
+      "exceptionally across all iterations of Container Balancer.")
+  private MutableCounterLong numContainerMovesFailed;
+

Review Comment:
   We have numContainerMovesCompletedInLatestIteration. I think we can have similar metric numContainerMovesFailedInLatestIteration 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl closed pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl closed pull request #3751: HDDS-6492. Add metric for failed container moves
URL: https://github.com/apache/ozone/pull/3751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1251173555

   > moveContainer
   
   @siddhantsangwan The case is handled for ExecutionException which is having additional count for failed metrics. Its now removed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979596627


##########
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java:
##########
@@ -779,7 +780,52 @@ public void checkIterationResultTimeoutFromReplicationManager()
         .getNumContainerMovesTimeoutInLatestIteration() > 0);
     stopBalancer();
   }
-  
+
+  @Test
+  public void checkIterationResultException()
+      throws NodeNotFoundException, IOException,
+      IllegalContainerBalancerStateException,
+      InvalidContainerBalancerConfigurationException,
+      TimeoutException {
+
+    CompletableFuture<MoveResult> f = new CompletableFuture();
+    f.completeExceptionally(new RuntimeException("Runtime Exception"));
+    Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
+            Mockito.any(DatanodeDetails.class),
+            Mockito.any(DatanodeDetails.class)))
+        .thenThrow(new ContainerNotFoundException("Test Container not found"),
+            new NodeNotFoundException("Test Node not found"))
+        .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
+          try {
+            Thread.sleep(1000);
+          } catch (Exception ex) {
+          }
+          throw new RuntimeException("Throw");
+        }));
+
+    balancerConfiguration.setThreshold(10);
+    balancerConfiguration.setIterations(1);
+    balancerConfiguration.setMaxSizeEnteringTarget(10 * STORAGE_UNIT);
+    balancerConfiguration.setMaxSizeToMovePerIteration(100 * STORAGE_UNIT);
+    balancerConfiguration.setMaxDatanodesPercentageToInvolvePerIteration(100);
+    balancerConfiguration.setMoveTimeout(Duration.ofMillis(2000));
+
+    startBalancer(balancerConfiguration);
+    sleepWhileBalancing(2000);

Review Comment:
   Reducing them to the following works as expected:
   ```
       Mockito.when(replicationManager.move(Mockito.any(ContainerID.class),
               Mockito.any(DatanodeDetails.class),
               Mockito.any(DatanodeDetails.class)))
           .thenThrow(new ContainerNotFoundException("Test Container not found"),
               new NodeNotFoundException("Test Node not found"))
           .thenReturn(f).thenReturn(CompletableFuture.supplyAsync(() -> {
             try {
               Thread.sleep(50);
             } catch (Exception ex) {
             }
             throw new RuntimeException("Throw");
           }));
   
       balancerConfiguration.setThreshold(10);
       balancerConfiguration.setIterations(1);
       balancerConfiguration.setMaxSizeEnteringTarget(10 * STORAGE_UNIT);
       balancerConfiguration.setMaxSizeToMovePerIteration(100 * STORAGE_UNIT);
       balancerConfiguration.setMaxDatanodesPercentageToInvolvePerIteration(100);
       balancerConfiguration.setMoveTimeout(Duration.ofMillis(500));
   
       startBalancer(balancerConfiguration);
       sleepWhileBalancing(1000);
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3751:
URL: https://github.com/apache/ozone/pull/3751#issuecomment-1257533355

   @sumitagrawl Can you check if the CI failure is related?
   ```
   TestContainerBalancer.checkIterationResultTimeout:744 expected: <1> but was: <7>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] symious commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
symious commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r979615732


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,19 +529,10 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      LOG.warn("Container balancer is interrupted");
       Thread.currentThread().interrupt();
     } catch (TimeoutException e) {
-      long timeoutCounts = moveSelectionToFutureMap.entrySet().stream()
-          .filter(entry -> !entry.getValue().isDone())
-          .peek(entry -> {
-            LOG.warn("Container move canceled for container {} from source {}" +
-                    " to target {} due to timeout.",
-                entry.getKey().getContainerID(),
-                containerToSourceMap.get(entry.getKey().getContainerID())
-                    .getUuidString(),
-                entry.getKey().getTargetNode().getUuidString());
-            entry.getValue().cancel(true);

Review Comment:
   Can I ask why this line is removed in the new function?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sumitagrawl commented on a diff in pull request #3751: HDDS-6492. Add metric for failed container moves

Posted by GitBox <gi...@apache.org>.
sumitagrawl commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r977562762


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -568,6 +561,39 @@ private void checkIterationMoveResults() {
         metrics.getNumContainerMovesCompletedInLatestIteration());
   }
 
+  private long performTimeMoveCancelAndCount() {
+    return moveSelectionToFutureMap.entrySet().stream()
+        .filter(entry -> !entry.getValue().isDone())
+        .peek(entry -> {
+          LOG.warn("Container move canceled for container {} from source {}" +
+                  " to target {} due to timeout.",
+              entry.getKey().getContainerID(),
+              containerToSourceMap.get(entry.getKey().getContainerID())
+                  .getUuidString(),
+              entry.getKey().getTargetNode().getUuidString());
+          entry.getValue().cancel(true);

Review Comment:
   Its just a cancel, do not have any count and metrics update at this place. InterruptException case, its removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org