You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/09/15 05:45:28 UTC

[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3751: HDDS-6492 Add metric for failed container moves.

siddhantsangwan commented on code in PR #3751:
URL: https://github.com/apache/ozone/pull/3751#discussion_r970994760


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
           }).count();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+      metrics.incrementNumContainerMovesFailed(timeoutCounts);
     } catch (ExecutionException e) {
       LOG.error("Got exception while checkIterationMoveResults", e);
+      metrics.incrementNumContainerMovesFailed(moveSelectionToFutureMap.size());

Review Comment:
   Does getting an ExecutionException mean that all container moves failed? We probably still need to check individual move results like we're doing above for `TimeoutException`.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -529,6 +529,7 @@ private void checkIterationMoveResults() {
       allFuturesResult.get(config.getMoveTimeout().toMillis(),
           TimeUnit.MILLISECONDS);
     } catch (InterruptedException e) {
+      metrics.incrementNumContainerMovesFailed(1);

Review Comment:
   Not sure I understand this change - why are we increasing move failures by one if the current thread is interrupted?



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -544,8 +545,10 @@ private void checkIterationMoveResults() {
           }).count();
       LOG.warn("{} Container moves are canceled.", timeoutCounts);
       metrics.incrementNumContainerMovesTimeoutInLatestIteration(timeoutCounts);
+      metrics.incrementNumContainerMovesFailed(timeoutCounts);

Review Comment:
   If we're also considering timeouts as move failures, we need to mention this in the metrics description. We'd also need to update failure metrics in other places where timeouts are being calculated, such as `ContainerBalancerMetrics#incrementCurrentIterationContainerMoveMetric`.
   
   Maybe it's better to keep timeout and failures separate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org