You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/09/27 03:40:33 UTC

[GitHub] [ozone] neils-dev opened a new pull request, #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

neils-dev opened a new pull request, #3781:
URL: https://github.com/apache/ozone/pull/3781

   ## What changes were proposed in this pull request?
   To expose metrics from nodes entering the decommissioning and maintenance workflow to JMX and prom endpoint.  These metrics expose the number of datanodes in the workflow, the container replication state of tracked nodes and the number of pipelines waiting to close of tracked nodes.  With the following exposed metrics from the `NodeDecommissionManager` through the `DataAdminMonitorImpl` the progress of the decommission and maintenance workflow can be monitored.
   
   The progress of datanodes going though the workflow are monitored through aggregated counts of the number of tracked nodes, their number of pipelines waiting to close and the number of containers in each of sufficiently, under-replicated and unhealthy state.  The metrics collected are as discussed in the associated Jira comments,
   
   **As exposed to prom endpoint:**
   
   _aggregated total number of datanodes in workflow:_
   `node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes   
   `
   
   _Of tracked datanodes in workflow, the container replication state; total number of containers in each of sufficiently replicated, under-replicated and unhealthy state_
   ```
   node_decommission_metrics_total_tracked_containers_sufficiently_replicated
   node_decommission_metrics_total_tracked_containers_under_replicated
   node_decommission_metrics_total_tracked_containers_unhealthy
   
   ```
   _Of tracked datanodes in workflow, the aggregated number of pipelines waiting to close_
   `node_decommission_metrics_total_tracked_pipelines_waiting_to_close`
   
   
   _And, the number of datanodes in the workflow that are taken out and recommissioned._
   `node_decommission_metrics_total_tracked_recommission_nodes`
   
   **Similarly exposed via JMX:**
   ```
    {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "e68cfe1f098e",
       "TotalTrackedDecommissioningMaintenanceNodes" : 0,
       "TotalTrackedRecommissionNodes" : 0,
       "TotalTrackedPipelinesWaitingToClose" : 0,
       "TotalTrackedContainersUnderReplicated" : 0,
       "TotalTrackedContainersUnhealthy" : 0,
       "TotalTrackedContainersSufficientlyReplicated" : 0
     }
   ```
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-2642
   
   ## How was this  tested?
   Unit tests, CI workflow and manually tested with dev docker-cluster entering nodes in decommissioning workflow monitoring metrics collected in prom endpoint.
   
   **Unit tests:**
   
   `hadoop-hdds/server-scm$ mvn -Dtest=TestNodeDecommissionMetrics test`
   
   INFO] -------------------------------------------------------
   [INFO]  T E S T S
   [INFO] -------------------------------------------------------
   [INFO] Running org.apache.hadoop.hdds.scm.node.TestNodeDecommissionMetrics
   [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.072 s - in org.apache.hadoop.hdds.scm.node.TestNodeDecommissionMetrics
   [INFO] 
   [INFO] Results:
   [INFO] 
   [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0
   [INFO] 
   
   
   **Manual testing via dev docker-cluster:**
   modify the docker-config for scm serviceid and serviceid-address:
   `hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone$`
   OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
   OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
   
   set docker-compose for monitoring with prometheus:
   export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml
   `hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone$ docker-compose up -d --scale datanode=3`
   
   view metrics through prom endpoint : http://localhost:9090
   Decomission datanode from scm bash prompt:
   `$ ozone admin datanode decommission -id=scmservice --scm=172.26.0.3:9894 3224625960ec`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r999242000


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,245 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+      org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+          .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;
+
+  @Metric("Number of nodes tracked with pipelines waiting to close.")
+  private MutableGaugeLong trackedPipelinesWaitingToCloseTotal;
+
+  @Metric("Number of containers under replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnderReplicatedTotal;
+
+  @Metric("Number of containers unhealthy in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnhealthyTotal;
+
+  @Metric("Number of containers sufficiently replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersSufficientlyReplicatedTotal;
+
+  private MetricsRegistry registry;
+
+  private Map<String, MutableGaugeLong> trackedSufficientlyReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnderReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnhealthyContainersByHost;
+
+  private Map<String, MutableGaugeLong> trackedPipelinesWaitingToCloseByHost;
+
+  /** Private constructor. */
+  private NodeDecommissionMetrics() {
+    this.registry = new MetricsRegistry(METRICS_SOURCE_NAME);
+    trackedSufficientlyReplicatedByHost = new HashMap<>();
+    trackedUnderReplicatedByHost = new HashMap<>();
+    trackedUnhealthyContainersByHost = new HashMap();
+    trackedPipelinesWaitingToCloseByHost = new HashMap();
+  }
+
+  /**
+   * Create and returns NodeDecommissionMetrics instance.
+   *
+   * @return NodeDecommissionMetrics
+   */
+  public static NodeDecommissionMetrics create() {
+    return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
+        "Metrics tracking the progress of nodes in the "
+            + "Decommissioning and Maintenance workflows.  "
+            + "Tracks num nodes in mode and container "
+            + "replications state and pipelines waiting to close",
+        new NodeDecommissionMetrics());
+  }
+
+  /**
+   * Get aggregated gauge metrics.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector
+        .addRecord(METRICS_SOURCE_NAME);
+    trackedDecommissioningMaintenanceNodesTotal.snapshot(builder, all);
+    trackedRecommissionNodesTotal.snapshot(builder, all);
+    trackedPipelinesWaitingToCloseTotal.snapshot(builder, all);
+    trackedContainersUnderReplicatedTotal.snapshot(builder, all);
+    trackedContainersUnhealthyTotal.snapshot(builder, all);
+    trackedContainersSufficientlyReplicatedTotal.snapshot(builder, all);
+
+    registry.snapshot(builder, all);
+  }
+
+  /**
+   * Unregister the metrics instance.
+   */
+  public void unRegister() {
+    DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
+  }
+
+  public void setTrackedDecommissioningMaintenanceNodesTotal(
+            long numNodesTracked) {
+    trackedDecommissioningMaintenanceNodesTotal
+        .set(numNodesTracked);
+  }
+
+  public void setTrackedRecommissionNodesTotal(
+          long numNodesTrackedRecommissioned) {
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
+  }
+
+  public void setTrackedPipelinesWaitingToCloseTotal(
+          long numTrackedPipelinesWaitToClose) {
+    trackedPipelinesWaitingToCloseTotal
+        .set(numTrackedPipelinesWaitToClose);
+  }
+
+  public void setTrackedContainersUnderReplicatedTotal(
+          long numTrackedUnderReplicated) {
+    trackedContainersUnderReplicatedTotal
+        .set(numTrackedUnderReplicated);
+  }
+
+  public void setTrackedContainersUnhealthyTotal(
+          long numTrackedUnhealthy) {
+    trackedContainersUnhealthyTotal
+        .set(numTrackedUnhealthy);
+  }
+
+  public void setTrackedContainersSufficientlyReplicatedTotal(
+          long numTrackedSufficientlyReplicated) {
+    trackedContainersSufficientlyReplicatedTotal
+        .set(numTrackedSufficientlyReplicated);
+  }
+
+  public long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
+  }
+
+  public long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  public void metricRecordPipelineWaitingToCloseByHost(String host,
+                                                       long num) {
+    trackedPipelinesWaitingToCloseByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedPipelinesWaitingToClose-" + hostID,

Review Comment:
   This block is repeated 4 times with just the name and value changing:
   
   ```
   Interns.info("trackedSufficientlyReplicated-" + hostID,
                   "Number of sufficiently replicated containers "
                       + "for host in decommissioning and "
                       + "maintenance mode"), 0L)).set(sufficientlyReplicated);
   ```
   
   Could you move it into a private method to reduce the duplicated code?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004946930


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Yes, thanks,  I was looking to couple the `ContainerStateInWorkflow` for use both in the `DatanodeAdminMonitorImpl`and in the `NodeDecommissionMetrics` however there are a few issues with that and thus it is implemented this way.  Those issues are,
   
   i.) there needs to be separate stores for numbers collected for the metrics from the monitor and numbers stored in the `NodeDecommissionMetrics.`  This is so that we do not report incomplete intermediate numbers to the` NodeDecommissionMetrics` when the metrics are periodically pulled through `getMetrics().`  We flush the numbers on each run of the monitor thread to the `NodeDecommissionMetrics `once all the numbers have been collected (the calls to `metricRecordOfContainterStateByHost)`.  For this reason the `Map<string, ContainerStateInWorkflow>` cannot be used directly in the `NodeDecommisonMetrics`.
   
   ii.) With the two separate stores, we need to know which hosts stored are currently in the workflow and which are out of the workflow and stale.  Thus the check in the monitor code to collect those hosts that are stale and reporting that to the `NodeDecommissionMetrics.metricRemoveRecordOfContainerStateByHost()`.  For this reason as well, it looks like clearing the map on each run of the monitor instead of iterating to reset to 0, as suggested, is not possible.  We need to know which nodes (hosts) have become stale since the last run of the monitor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1300724900

   I'm out of the office this week. I will look at this again on Monday. I don't think this PR is essential for the 1.3 release as we have lived without these metrics up until now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1005334878


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   > On each scheduled run of the monitor, the implementation captures the current workflow state completely prior to flushing the metric update to the NodeDecommissionMetric object (DatanodeAdminMonitorImpl.setMetricsToGauge). In doing so, it tries to keep each metric refresh pull from getMetrics display a full snapshot of the last captured run of the monitor.
   
   The code in `DatanodeAdminMonitorImpl.setMetricsToGauge` is:
   
   ```
   synchronized void setMetricsToGauge() {
       metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
       metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
       metrics.setTrackedDecommissioningMaintenanceNodesTotal(
               trackedDecomMaintenance);
       metrics.setTrackedContainersUnderReplicatedTotal(
               underReplicatedContainers);
       metrics.setTrackedContainersSufficientlyReplicatedTotal(
               sufficientlyReplicatedContainers);
       metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
       for (Map.Entry<String, ContainerStateInWorkflow> e :
   ...
   ```
   
   It makes multiple calls to `metrics` and there is nothing stopping getMetrics() being called on the metrics object by another thread half way through the execution of `setMetricsToGauge`. This means the metrics can still be snapshot inconsistently.
   
   To make it consistent, you need to synchronize in the metrics object and set ALL the metrics in a single synchronized call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1294032163

   Why use threadLocal variables in the latest patch? Does this even work? If another thread reads the values set by the monitor thread, does it even get the values, as they should be local to the thread who set them?
   
   Lets take a step back. The Decommission Monitor thread is synchronised already, and at the end of a run, the accumulated values for the metrics can be sent to the gauge, while still under the lock.
   
   In that way, a consistent set of metrics can be sent to the gauge. However they are being sent to the gauge over a number of calls, so the gauge could be snapshot halfway through, giving inconsistent results.
   
   Instead, you can make the snapshot method in the metrics class synchronized. Then in the sendMetricsToGauge method, you can wrap all the calls to the metrics class in a synchronized block too:
   
   ```
     private synchronized void setMetricsToGauge() {
       synchronized(metrics) {
         metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers.get());
         metrics.setTrackedRecommissionNodesTotal(trackedRecommission.get());
         metrics.setTrackedDecommissioningMaintenanceNodesTotal(
                 trackedDecomMaintenance.get());
         metrics.setTrackedContainersUnderReplicatedTotal(
                 underReplicatedContainers.get());
         metrics.setTrackedContainersSufficientlyReplicatedTotal(
                 sufficientlyReplicatedContainers.get());
         metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose
             .get());
         metrics.metricRecordOfContainerStateByHost(containerStateByHost.get());
       }
     }
   ```
   That way, we get a consistent set of metrics into the metric class, and block the `getMetrics(...)` method from executing when the metrics are in the process of being updated, preventing any race conditions.
   
   I think you should revert that latest commit, and the synchronised part as I suggested here, and then we can see if we can clean up the reset side of things (which I think you already did here).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] michelsumbul commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

michelsumbul commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1301025104

   @captainzmc, having visibility on the decommissioning / maintenance process its something quite important from an operational point of view if you want to run a production cluster. our operational team will be very reluctant to use it in prod without that as they will not know what's happening or not. Its probably a fundamental feature from an operational point of view.  We want to use the 1.3 version as it will be widely use and not a master branch later. This is why we would like to have it included in the 1.3.
   Moreover I believe we are not far to be ready, @neils-dev told me that all the comments have been addressed in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1002288760


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,245 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+      org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+          .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;
+
+  @Metric("Number of nodes tracked with pipelines waiting to close.")
+  private MutableGaugeLong trackedPipelinesWaitingToCloseTotal;
+
+  @Metric("Number of containers under replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnderReplicatedTotal;
+
+  @Metric("Number of containers unhealthy in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnhealthyTotal;
+
+  @Metric("Number of containers sufficiently replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersSufficientlyReplicatedTotal;
+
+  private MetricsRegistry registry;
+
+  private Map<String, MutableGaugeLong> trackedSufficientlyReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnderReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnhealthyContainersByHost;
+
+  private Map<String, MutableGaugeLong> trackedPipelinesWaitingToCloseByHost;
+
+  /** Private constructor. */
+  private NodeDecommissionMetrics() {
+    this.registry = new MetricsRegistry(METRICS_SOURCE_NAME);
+    trackedSufficientlyReplicatedByHost = new HashMap<>();
+    trackedUnderReplicatedByHost = new HashMap<>();
+    trackedUnhealthyContainersByHost = new HashMap();
+    trackedPipelinesWaitingToCloseByHost = new HashMap();
+  }
+
+  /**
+   * Create and returns NodeDecommissionMetrics instance.
+   *
+   * @return NodeDecommissionMetrics
+   */
+  public static NodeDecommissionMetrics create() {
+    return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
+        "Metrics tracking the progress of nodes in the "
+            + "Decommissioning and Maintenance workflows.  "
+            + "Tracks num nodes in mode and container "
+            + "replications state and pipelines waiting to close",
+        new NodeDecommissionMetrics());
+  }
+
+  /**
+   * Get aggregated gauge metrics.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector
+        .addRecord(METRICS_SOURCE_NAME);
+    trackedDecommissioningMaintenanceNodesTotal.snapshot(builder, all);
+    trackedRecommissionNodesTotal.snapshot(builder, all);
+    trackedPipelinesWaitingToCloseTotal.snapshot(builder, all);
+    trackedContainersUnderReplicatedTotal.snapshot(builder, all);
+    trackedContainersUnhealthyTotal.snapshot(builder, all);
+    trackedContainersSufficientlyReplicatedTotal.snapshot(builder, all);
+
+    registry.snapshot(builder, all);
+  }
+
+  /**
+   * Unregister the metrics instance.
+   */
+  public void unRegister() {
+    DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
+  }
+
+  public void setTrackedDecommissioningMaintenanceNodesTotal(
+            long numNodesTracked) {
+    trackedDecommissioningMaintenanceNodesTotal
+        .set(numNodesTracked);
+  }
+
+  public void setTrackedRecommissionNodesTotal(
+          long numNodesTrackedRecommissioned) {
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
+  }
+
+  public void setTrackedPipelinesWaitingToCloseTotal(
+          long numTrackedPipelinesWaitToClose) {
+    trackedPipelinesWaitingToCloseTotal
+        .set(numTrackedPipelinesWaitToClose);
+  }
+
+  public void setTrackedContainersUnderReplicatedTotal(
+          long numTrackedUnderReplicated) {
+    trackedContainersUnderReplicatedTotal
+        .set(numTrackedUnderReplicated);
+  }
+
+  public void setTrackedContainersUnhealthyTotal(
+          long numTrackedUnhealthy) {
+    trackedContainersUnhealthyTotal
+        .set(numTrackedUnhealthy);
+  }
+
+  public void setTrackedContainersSufficientlyReplicatedTotal(
+          long numTrackedSufficientlyReplicated) {
+    trackedContainersSufficientlyReplicatedTotal
+        .set(numTrackedSufficientlyReplicated);
+  }
+
+  public long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
+  }
+
+  public long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  public void metricRecordPipelineWaitingToCloseByHost(String host,
+                                                       long num) {
+    trackedPipelinesWaitingToCloseByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedPipelinesWaitingToClose-" + hostID,
+                "Number of pipelines waiting to close for "
+                    + "host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(num);
+  }
+
+  public void metricRecordOfReplicationByHost(String host,
+                                              long sufficientlyReplicated,
+                                              long underReplicated,
+                                              long unhealthy) {
+    trackedSufficientlyReplicatedByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedSufficientlyReplicated-" + hostID,
+                "Number of sufficiently replicated containers "
+                    + "for host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(sufficientlyReplicated);
+
+    trackedUnderReplicatedByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedUnderReplicated-" + hostID,
+                "Number of under-replicated containers "
+                    + "for host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(underReplicated);
+
+    trackedUnhealthyContainersByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedUnhealthyContainers-" + hostID,
+                "Number of unhealthy containers "
+                    + "for host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(unhealthy);
+  }
+
+  @VisibleForTesting
+  public Long getTrackedPipelinesWaitingToCloseByHost(String host) {
+    if (!trackedPipelinesWaitingToCloseByHost.containsKey(host)) {

Review Comment:
   getters from metrics by host cleaned up in latest commit.  `getTrackedPipelinesWaitingToCloseByHost` , please check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1002290670


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   I've modified the code to dynamically (without using the helper `MetricsRegistry` class to add gauges) add to the collector as is done similarly in namenode topmetrics collections.  See https://github.com/apache/hadoop/blob/eefa664fea1119a9c6e3ae2d2ad3069019fbd4ef/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/top/metrics/TopMetrics.java#L167.
   Here the metrics are collected dynamically by host when the host node is in the workflow.  When the node exits the workflow, the metrics for that host are no longer collected.  In the `JMX,` the node metrics are no longer in the output.  Note this is true for `JMX`, the prom endpoint _seems to retain_ the last value pushed.  See the following metrics pushed out to JMX for the` NodeDecommissionMetrics` when `datanode-2` is decommissioned:
   
   ```
   before
   
   "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 0,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0
     }, {
   
   during
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 2,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "TrackedUnhealthyContainers-ozone-datanode-2.ozone_default" : 0,
       "TrackedSufficientlyReplicated-ozone-datanode-2.ozone_default" : 0,
       "TrackedPipelinesWaitingToClose-ozone-datanode-2.ozone_default" : 2,
       "TrackedUnderReplicated-ozone-datanode-2.ozone_default" : 0
     }, {
   
     }, {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 1,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "TrackedUnhealthyContainers-ozone-datanode-2.ozone_default" : 0,
       "TrackedSufficientlyReplicated-ozone-datanode-2.ozone_default" : 0,
       "TrackedPipelinesWaitingToClose-ozone-datanode-2.ozone_default" : 0,
       "TrackedUnderReplicated-ozone-datanode-2.ozone_default" : 1
     }, {
   
   after
    }, {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 0,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0
     }, {
   
   ```
   The host `datanode-2 `metrics no longer visible as the node exits the workflow.
   
   This seems to follow how hadoop handles metrics collected dynamically, however the `prom endpoint` seems to _**retain**_ the last pushed value for some reason.  Is this what we should expect when collecting metrics for hosts as they go in and out of the workflow?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998796755


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -82,6 +85,44 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private long unhealthyContainers = 0;
   private long underReplicatedContainers = 0;
 
+  @SuppressFBWarnings(value = "SIC_INNER_SHOULD_BE_STATIC")
+  private final class ContainerStateInWorkflow {

Review Comment:
   Thanks.  Removed the suppress annotation and properly converted _instead_ to` static final nested class` _from_ the` final inner class.`  As the inner class does not refer to the outer class instance, it should indeed be a static nested class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1281536891

   JMX and prom metrics collected for monitoring decommissioning and maintenance mode workflow updated to included collecting metrics by host.  Examples of metrics captured follow.  
   JMX sample with hosts _**ozone-datanode-3.ozone_default**_ and _**ozone-datanode-1.ozone_default**_ :
   ```
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Context" : "ozone",
       "tag.Hostname" : "e8772c7d7c4c",
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 2,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedPipelinesWaitingToCloseTotal" : 2,
       "TrackedRecommissionNodesTotal" : 0,
       "trackedPipelinesWaitingToClose-ozone-datanode-3.ozone_default" : 0,
       "trackedSufficientlyReplicated-ozone-datanode-3.ozone_default" : 0,
       "trackedUnderReplicated-ozone-datanode-3.ozone_default" : 0,
       "trackedUnhealthyContainers-ozone-datanode-3.ozone_default" : 0,
       "trackedPipelinesWaitingToClose-ozone-datanode-1.ozone_default" : 2
     }, { 
   ```
   
   And, prom endpoint capturing decommissioning datanodes 3 and 1 in captured image:
   ![prom_decommission_monitor_scale](https://user-images.githubusercontent.com/81126310/196290199-c9ab6aa1-d4d8-4309-913d-f0b442135cc2.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1283787129

   Thanks for updating the formatting. I have a few more comments around code reuse and also a possible bug to check on. Also a conflict has appeared against one class which needs resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] kerneltime commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

kerneltime commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r984819741


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,158 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+        org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+                .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong totalTrackedDecommissioningMaintenanceNodes;

Review Comment:
   The metric names should be renamed to follow standard conventions (we don't it everywhere in our code, but we should for new metrics). This way, when searching for metrics, we can first filter for what entity and then for which metric. Applicable to all metrics here.
   ```suggestion
     private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1295689285

   Pushed new changes that finish the clean up for the metrics reset and collection in the monitor.  In addition the metric in the NodeDecommisionMetrics changed for container state metrics per node in the workflow. 
   
    Now, a single metric name is used for the same metric collected for each datanode, within the metric is an associated tag that identifies the node for the metric reading, ie. `node_decommission_metrics_tracked_sufficiently_replicated_dn{datanode="ozone-datanode-2.ozone_default",hostname="39160451dea0"}.
   
   The decommissioning / maintenance workflow is tracked by JMX displaying each aggregated metric and displaying the node container state metrics only when the node is in the workflow.  Prometheus now also displays each aggregated metric but now under a unique metric name for each container state metric, displays each host associated with the reading as a tag.  This can be seen below in a workflow decommissioning `datanode-3:`
   
   during:
   ```
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetric
   s",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "39160451dea0",
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 1,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "tag.datanode.1" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.1" : "39160451dea0",
       "TrackedUnderReplicatedDN.1" : 1,
       "tag.datanode.2" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.2" : "39160451dea0",
       "TrackedSufficientlyReplicatedDN.2" : 0,
       "tag.datanode.3" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.3" : "39160451dea0",
       "TrackedPipelinesWaitingToCloseDN.3" : 0,
       "tag.datanode.4" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.4" : "39160451dea0",
       "TrackedUnhealthyContainersDN.4" : 0
     }, {
   ```
   
   after,
   ```
    }, {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "39160451dea0",
       "TrackedDecommissioningMaintenanceNodesTotal" : 0,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0
     }, {
   ```
   and Prometheus, decommission datanode-2 and datanode-3:
   
   ![decommission_prom_DNs](https://user-images.githubusercontent.com/81126310/198768113-f907b06a-6fc7-49c6-809d-8a9618c2c764.png)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1005149014


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   On each scheduled run of the monitor, the implementation captures the current workflow state completely prior to flushing the metric update to the `NodeDecommissionMetric` object (`DatanodeAdminMonitorImpl.setMetricsToGauge`).  In doing so, it tries to keep each metric refresh pull from `getMetrics `display a full snapshot of the last captured run of the monitor.
   
   > We can build up a Map<String, ContainerStateInWorkflow> for each iteration and at the end of the iteration pass it to the metrics object like new Map<>(mapJustBuiltUp)
   
   Yes, I currently have been coding something just like that based on your earlier comment.  With this, the `Map<String, ContainerStateInWorkflow>` is passed to the `NodeDecommisionMetric.metricRecordOfContainerStateByHost`.  Within it, it sets the internal metrics by host and _also_ removes stale nodes that exited the workflow since the last run of the monitor snapshot captured.  It looks like what we are discussing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1307006622

   Yes it makes sense to open a new Jira for the prometheus issue.
   
   The tagged metrics look better now - we have one tag per DN, which is what I would expect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1000808150


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   @sodonnel , with the metrics registry it appears that the metrics we track remain in the registry.  With this behavior, currently each datanode we add to track _**remains**_ unless we have an api to remove it from the `MetricsRegistry`.  Is there a way to delete/remove a gauge from the registry?  See `MetricsRegistry.java` https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/lib/MetricsRegistry.java#L40.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1305870740

   ```
       "tag.datanode.1" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.1" : "39160451dea0",
       "TrackedUnderReplicatedDN.1" : 1,
       "tag.datanode.2" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.2" : "39160451dea0",
       "TrackedSufficientlyReplicatedDN.2" : 0,
       "tag.datanode.3" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.3" : "39160451dea0",
       "TrackedPipelinesWaitingToCloseDN.3" : 0,
       "tag.datanode.4" : "ozone-datanode-3.ozone_default",
       "tag.Hostname.4" : "39160451dea0",
       "TrackedUnhealthyContainersDN.4" : 0
   ```
   Its a bit strange that we have 4 tags for the same DN. The way I'd expect this to work is we have 1 tag per DN, and then the 4 metrics (under, sufficiently, pipelines, unhealthy) all sharing that tag.
   
   I **think** this is due to the way you are adding the tags in `getMetrics` - you are building a new tag per metric, rather than a tag per DN and then tagging all its metrics with that tag.
   
   It might be easier if you simply stored a Set or Map of `ContainerStateInWorkflow` inside the metrics class, as then you can iterate it on a host by host basis.
   
   Example of what I mean, plus more simplifications - https://github.com/sodonnel/hadoop-ozone/commit/673109c0403c69265fd100499373002e37ddfbc0    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1305783099

   > Now, a single metric name is used for the same metric collected for each datanode, within the metric is an associated tag that identifies the node for the metric reading, ie. `node_decommission_metrics_tracked_sufficiently_replicated_dn{datanode="ozone-datanode-2.ozone_default",hostname="39160451dea0"}.
   
   I don't see how in the source code this tag gets set - can you point me at how the tag per DN is getting set please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r999254808


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   I might be wrong, but I think there is a bug here.
   
   Lets say we put a host to maintenance. It will have some metrics tracked in the ByHost maps.
   
   After each pass we reset these maps to have zero counts, but we don't remove the entries from the maps anywhere (unless I have missed it). Then we update the values accordingly.
   
   Later the node goes back into service and even though it is removed from the monitor, it will be tracked with zero counts forever.
   
   Over time on a long running cluster, we will build up a lot of "by host" metrics with zero values, when they really should be removed.
   
   I think the reset will need to remove them from the maps rather than zeroing them, and also when setting the values to the metric gauge, you will need to remove values no longer there from it too.
   
   It might be easier to pass a `Map<String, ContainerStateInWorkflow>` to the metrics class to facilitate removing the stale entries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004983578


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -74,6 +76,58 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private Queue<DatanodeDetails> pendingNodes = new ArrayDeque();
   private Queue<DatanodeDetails> cancelledNodes = new ArrayDeque();
   private Set<DatanodeDetails> trackedNodes = new HashSet<>();
+  private NodeDecommissionMetrics metrics;
+  private long pipelinesWaitingToClose = 0;
+  private long sufficientlyReplicatedContainers = 0;
+  private long trackedDecomMaintenance = 0;
+  private long trackedRecommission = 0;
+  private long unhealthyContainers = 0;
+  private long underReplicatedContainers = 0;
+
+  private static final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+    private long pipelinesWaitingToClose = 0;
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,

Review Comment:
   Yes, currently we instantiate zeroing the values then use the setters to update. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1295109620

   ThreadLocal should not be needed for this. I feel there is probably a simpler solution we can use, but I would need to see the original error first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1002288931


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,245 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+      org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+          .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;
+
+  @Metric("Number of nodes tracked with pipelines waiting to close.")
+  private MutableGaugeLong trackedPipelinesWaitingToCloseTotal;
+
+  @Metric("Number of containers under replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnderReplicatedTotal;
+
+  @Metric("Number of containers unhealthy in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnhealthyTotal;
+
+  @Metric("Number of containers sufficiently replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersSufficientlyReplicatedTotal;
+
+  private MetricsRegistry registry;
+
+  private Map<String, MutableGaugeLong> trackedSufficientlyReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnderReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnhealthyContainersByHost;
+
+  private Map<String, MutableGaugeLong> trackedPipelinesWaitingToCloseByHost;
+
+  /** Private constructor. */
+  private NodeDecommissionMetrics() {
+    this.registry = new MetricsRegistry(METRICS_SOURCE_NAME);
+    trackedSufficientlyReplicatedByHost = new HashMap<>();
+    trackedUnderReplicatedByHost = new HashMap<>();
+    trackedUnhealthyContainersByHost = new HashMap();
+    trackedPipelinesWaitingToCloseByHost = new HashMap();
+  }
+
+  /**
+   * Create and returns NodeDecommissionMetrics instance.
+   *
+   * @return NodeDecommissionMetrics
+   */
+  public static NodeDecommissionMetrics create() {
+    return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
+        "Metrics tracking the progress of nodes in the "
+            + "Decommissioning and Maintenance workflows.  "
+            + "Tracks num nodes in mode and container "
+            + "replications state and pipelines waiting to close",
+        new NodeDecommissionMetrics());
+  }
+
+  /**
+   * Get aggregated gauge metrics.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector
+        .addRecord(METRICS_SOURCE_NAME);
+    trackedDecommissioningMaintenanceNodesTotal.snapshot(builder, all);
+    trackedRecommissionNodesTotal.snapshot(builder, all);
+    trackedPipelinesWaitingToCloseTotal.snapshot(builder, all);
+    trackedContainersUnderReplicatedTotal.snapshot(builder, all);
+    trackedContainersUnhealthyTotal.snapshot(builder, all);
+    trackedContainersSufficientlyReplicatedTotal.snapshot(builder, all);
+
+    registry.snapshot(builder, all);
+  }
+
+  /**
+   * Unregister the metrics instance.
+   */
+  public void unRegister() {
+    DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
+  }
+
+  public void setTrackedDecommissioningMaintenanceNodesTotal(
+            long numNodesTracked) {
+    trackedDecommissioningMaintenanceNodesTotal
+        .set(numNodesTracked);
+  }
+
+  public void setTrackedRecommissionNodesTotal(
+          long numNodesTrackedRecommissioned) {
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
+  }
+
+  public void setTrackedPipelinesWaitingToCloseTotal(
+          long numTrackedPipelinesWaitToClose) {
+    trackedPipelinesWaitingToCloseTotal
+        .set(numTrackedPipelinesWaitToClose);
+  }
+
+  public void setTrackedContainersUnderReplicatedTotal(
+          long numTrackedUnderReplicated) {
+    trackedContainersUnderReplicatedTotal
+        .set(numTrackedUnderReplicated);
+  }
+
+  public void setTrackedContainersUnhealthyTotal(
+          long numTrackedUnhealthy) {
+    trackedContainersUnhealthyTotal
+        .set(numTrackedUnhealthy);
+  }
+
+  public void setTrackedContainersSufficientlyReplicatedTotal(
+          long numTrackedSufficientlyReplicated) {
+    trackedContainersSufficientlyReplicatedTotal
+        .set(numTrackedSufficientlyReplicated);
+  }
+
+  public long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
+  }
+
+  public long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  public void metricRecordPipelineWaitingToCloseByHost(String host,
+                                                       long num) {
+    trackedPipelinesWaitingToCloseByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedPipelinesWaitingToClose-" + hostID,

Review Comment:
   Thanks.  Moved to new private method : 
   `  private TrackedWorkflowContainerState createContainerMetricsInfo`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1002289058


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -90,6 +137,9 @@ public DatanodeAdminMonitorImpl(
     this.eventQueue = eventQueue;
     this.nodeManager = nodeManager;
     this.replicationManager = replicationManager;
+
+    containerStateByHost = new HashMap<>();
+    pipelinesWaitingToCloseByHost = new HashMap<>();

Review Comment:
   Combined all metrics collected by host in monitor to `ContainerStateInWorkflow` as suggested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] kerneltime commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

kerneltime commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1013673313


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,253 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+        org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+                .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;

Review Comment:
   I think we can drop tracked and improve the readability when visualizing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1295373631

   > some changes to make it simpler and fix the synchronization issue
   
   Thanks.  Pushed similar changes for simplification and fix for synchronization issue in latest commit after revert.  Some minor cleanup and changes to follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1306594864

   @sodonnel , for a possible fix for handling `prometheus` stale metrics, we should open a **_new_** jira to flush stale metrics on each refresh write in the PrometheusSink,
   https://github.com/apache/ozone/blob/79a9d39da7f33e71bc00183e280105562354cca4/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/PrometheusMetricsSink.java#L134 
   
   The flush of old stale metrics and populating the internal PrometheusMetricsSink Map can be handled similar to a resolved HDFS jira, HADOOP-17804.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] captainzmc commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

captainzmc commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1301583106

   Thanks @neils-dev @michelsumbul for the feedback. I got you point. Sure. let's keep this PR in 1.3 release blocked list.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] captainzmc commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

captainzmc commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1308118414

   Thanks @neils-dev for the patch, and thanks @sodonnel @kerneltime for the review, let's merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1283798159

   One other thing I spotted:
   
   ```
       "TrackedPipelinesWaitingToCloseTotal" : 2,
       "TrackedRecommissionNodesTotal" : 0,
       "trackedPipelinesWaitingToClose-ozone-datanode-3.ozone_default" : 0,
       "trackedSufficientlyReplicated-ozone-datanode-3.ozone_default" : 0,
       "trackedUnderReplicated-ozone-datanode-3.ozone_default" : 0,
       "trackedUnhealthyContainers-ozone-datanode-3.ozone_default" : 0,
   ```
   The host level metrics start with a lower case letter, but the others start with upper case. We should be consistent here inline with what other metrics do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998797490


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -363,6 +427,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    containerStateByHost.computeIfAbsent(dn.getHostName(),
+            hostID -> new ContainerStateInWorkflow(hostID,

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998408719


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -82,6 +85,44 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private long unhealthyContainers = 0;
   private long underReplicatedContainers = 0;
 
+  @SuppressFBWarnings(value = "SIC_INNER_SHOULD_BE_STATIC")
+  private final class ContainerStateInWorkflow {

Review Comment:
   Rather than suppress the FB warning, can this class be made `private final static class ...`?  I am not an expert in this area, but usually inner classes I've seen are static. The difference between static and non static inner classes seems to be that "non static" inner classes can directly access the enclosing classes instance variables and methods.
   
   A static inner class cannot directly access the enclosing classes methods. It has to do it via an object reference.
   
   In this case, the inner class is a simple wrapper around a set of variables and does not need to access the enclosing methods class, and therefore can be static I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r999240192


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,245 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+      org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+          .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;
+
+  @Metric("Number of nodes tracked with pipelines waiting to close.")
+  private MutableGaugeLong trackedPipelinesWaitingToCloseTotal;
+
+  @Metric("Number of containers under replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnderReplicatedTotal;
+
+  @Metric("Number of containers unhealthy in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnhealthyTotal;
+
+  @Metric("Number of containers sufficiently replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersSufficientlyReplicatedTotal;
+
+  private MetricsRegistry registry;
+
+  private Map<String, MutableGaugeLong> trackedSufficientlyReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnderReplicatedByHost;
+
+  private Map<String, MutableGaugeLong> trackedUnhealthyContainersByHost;
+
+  private Map<String, MutableGaugeLong> trackedPipelinesWaitingToCloseByHost;
+
+  /** Private constructor. */
+  private NodeDecommissionMetrics() {
+    this.registry = new MetricsRegistry(METRICS_SOURCE_NAME);
+    trackedSufficientlyReplicatedByHost = new HashMap<>();
+    trackedUnderReplicatedByHost = new HashMap<>();
+    trackedUnhealthyContainersByHost = new HashMap();
+    trackedPipelinesWaitingToCloseByHost = new HashMap();
+  }
+
+  /**
+   * Create and returns NodeDecommissionMetrics instance.
+   *
+   * @return NodeDecommissionMetrics
+   */
+  public static NodeDecommissionMetrics create() {
+    return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
+        "Metrics tracking the progress of nodes in the "
+            + "Decommissioning and Maintenance workflows.  "
+            + "Tracks num nodes in mode and container "
+            + "replications state and pipelines waiting to close",
+        new NodeDecommissionMetrics());
+  }
+
+  /**
+   * Get aggregated gauge metrics.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector
+        .addRecord(METRICS_SOURCE_NAME);
+    trackedDecommissioningMaintenanceNodesTotal.snapshot(builder, all);
+    trackedRecommissionNodesTotal.snapshot(builder, all);
+    trackedPipelinesWaitingToCloseTotal.snapshot(builder, all);
+    trackedContainersUnderReplicatedTotal.snapshot(builder, all);
+    trackedContainersUnhealthyTotal.snapshot(builder, all);
+    trackedContainersSufficientlyReplicatedTotal.snapshot(builder, all);
+
+    registry.snapshot(builder, all);
+  }
+
+  /**
+   * Unregister the metrics instance.
+   */
+  public void unRegister() {
+    DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
+  }
+
+  public void setTrackedDecommissioningMaintenanceNodesTotal(
+            long numNodesTracked) {
+    trackedDecommissioningMaintenanceNodesTotal
+        .set(numNodesTracked);
+  }
+
+  public void setTrackedRecommissionNodesTotal(
+          long numNodesTrackedRecommissioned) {
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
+  }
+
+  public void setTrackedPipelinesWaitingToCloseTotal(
+          long numTrackedPipelinesWaitToClose) {
+    trackedPipelinesWaitingToCloseTotal
+        .set(numTrackedPipelinesWaitToClose);
+  }
+
+  public void setTrackedContainersUnderReplicatedTotal(
+          long numTrackedUnderReplicated) {
+    trackedContainersUnderReplicatedTotal
+        .set(numTrackedUnderReplicated);
+  }
+
+  public void setTrackedContainersUnhealthyTotal(
+          long numTrackedUnhealthy) {
+    trackedContainersUnhealthyTotal
+        .set(numTrackedUnhealthy);
+  }
+
+  public void setTrackedContainersSufficientlyReplicatedTotal(
+          long numTrackedSufficientlyReplicated) {
+    trackedContainersSufficientlyReplicatedTotal
+        .set(numTrackedSufficientlyReplicated);
+  }
+
+  public long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
+  }
+
+  public long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  public void metricRecordPipelineWaitingToCloseByHost(String host,
+                                                       long num) {
+    trackedPipelinesWaitingToCloseByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedPipelinesWaitingToClose-" + hostID,
+                "Number of pipelines waiting to close for "
+                    + "host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(num);
+  }
+
+  public void metricRecordOfReplicationByHost(String host,
+                                              long sufficientlyReplicated,
+                                              long underReplicated,
+                                              long unhealthy) {
+    trackedSufficientlyReplicatedByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedSufficientlyReplicated-" + hostID,
+                "Number of sufficiently replicated containers "
+                    + "for host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(sufficientlyReplicated);
+
+    trackedUnderReplicatedByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedUnderReplicated-" + hostID,
+                "Number of under-replicated containers "
+                    + "for host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(underReplicated);
+
+    trackedUnhealthyContainersByHost.computeIfAbsent(host,
+        hostID -> registry.newGauge(
+            Interns.info("trackedUnhealthyContainers-" + hostID,
+                "Number of unhealthy containers "
+                    + "for host in decommissioning and "
+                    + "maintenance mode"), 0L)).set(unhealthy);
+  }
+
+  @VisibleForTesting
+  public Long getTrackedPipelinesWaitingToCloseByHost(String host) {
+    if (!trackedPipelinesWaitingToCloseByHost.containsKey(host)) {

Review Comment:
   These get methods could all be simplified to something like:
   
   ```
   private static final MutableGaugeLong ZERO_GAUGE = new MutableGaugeLong;
   
   return trackedPipelinesWaitingToCloseByHost.getOrDefault(host, ZERO_GAUGE).value()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004976856


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Does the implementation as it stands now, protect against incomplete intermediate metrics? The metrics are snapshot via the call to getMetrics, but the metrics are set over several calls from the Decommission monitor to the metrics class and there is no synchronisation. Could we not set `trackedPipelinesWaitingToCloseTotal` and then before `trackedContainersUnderReplicatedTotal` is set, `getMetrics` is called giving an inconsistent result?
   
   Probably, getMetrics() and any setters need synchronized, and even then you need to set everything in a single synchronized call.
   
   We can build up a `Map<String, ContainerStateInWorkflow>` for each iteration and at the end of the iteration pass it to the metrics object like new `Map<>(mapJustBuiltUp)` - then you can clear the original and the Map and ContainerStateInWorkflow objects are referenced only in the new metrics class and will not be changed again.
   
   Replace them all on the next call and then we don't need to worry about expiring individual nodes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1015590373


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,329 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl.ContainerStateInWorkflow;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsInfo;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.MetricsTag;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+      org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+          .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;
+
+  @Metric("Number of nodes tracked with pipelines waiting to close.")
+  private MutableGaugeLong trackedPipelinesWaitingToCloseTotal;
+
+  @Metric("Number of containers under replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnderReplicatedTotal;
+
+  @Metric("Number of containers unhealthy in tracked nodes.")
+  private MutableGaugeLong trackedContainersUnhealthyTotal;
+
+  @Metric("Number of containers sufficiently replicated in tracked nodes.")
+  private MutableGaugeLong trackedContainersSufficientlyReplicatedTotal;
+
+  private MetricsRegistry registry;
+
+  private enum MetricByHost {
+    PipelinesWaitingToClose("TrackedPipelinesWaitingToCloseDN",
+        "Number of pipelines waiting to close for "
+            + "host in decommissioning and "
+            + "maintenance mode"),
+    SufficientlyReplicated("TrackedSufficientlyReplicatedDN",
+        "Number of sufficiently replicated containers "
+            + "for host in decommissioning and "
+            + "maintenance mode"),
+    UnderReplicated("TrackedUnderReplicatedDN",
+        "Number of under-replicated containers "
+            + "for host in decommissioning and "
+            + "maintenance mode"),
+    UnhealthyContainers("TrackedUnhealthyContainersDN",
+        "Number of unhealthy containers "
+            + "for host in decommissioning and "
+            + "maintenance mode");
+
+    private final String metricName;
+    private final String desc;
+
+    MetricByHost(String name, String desc) {
+      this.metricName = name;
+      this.desc = desc;
+    }
+
+    public String getMetricName(String host) {
+      return metricName + "-" + host;
+    }
+
+    public String getMetricName() {
+      return metricName;
+    }
+
+    public String getDescription() {
+      return desc;
+    }
+  };
+
+  private static final class TrackedWorkflowContainerState {
+    private String metricByHost;
+    private MetricsInfo metricsInfo;
+    private Long value;
+
+    private TrackedWorkflowContainerState(String metricByHost,
+                                          MetricsInfo info,
+                                          Long val) {
+      this.metricByHost = metricByHost;
+      metricsInfo = info;
+      value = val;
+    }
+
+    public void setValue(Long val) {
+      value = val;
+    }
+
+    public String getHost() {
+      return metricByHost.substring(metricByHost.indexOf("-") + 1,
+          metricByHost.length());
+    }
+
+    public MetricsInfo getMetricsInfo() {
+      return metricsInfo;
+    }
+
+    public Long getValue() {
+      return value;
+    }
+  }
+
+  private Map<String, TrackedWorkflowContainerState>
+      trackedWorkflowContainerMetricByHost;
+
+  /** Private constructor. */
+  private NodeDecommissionMetrics() {
+    this.registry = new MetricsRegistry(METRICS_SOURCE_NAME);
+    trackedWorkflowContainerMetricByHost = new HashMap<>();
+  }
+
+  /**
+   * Create and returns NodeDecommissionMetrics instance.
+   *
+   * @return NodeDecommissionMetrics
+   */
+  public static NodeDecommissionMetrics create() {
+    return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME,
+        "Metrics tracking the progress of nodes in the "
+            + "Decommissioning and Maintenance workflows.  "
+            + "Tracks num nodes in mode and container "
+            + "replications state and pipelines waiting to close",
+        new NodeDecommissionMetrics());
+  }
+
+  /**
+   * Get aggregated gauge metrics.
+   */
+  @Override
+  public synchronized void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder builder = collector
+        .addRecord(METRICS_SOURCE_NAME);
+    trackedDecommissioningMaintenanceNodesTotal.snapshot(builder, all);
+    trackedRecommissionNodesTotal.snapshot(builder, all);
+    trackedPipelinesWaitingToCloseTotal.snapshot(builder, all);
+    trackedContainersUnderReplicatedTotal.snapshot(builder, all);
+    trackedContainersUnhealthyTotal.snapshot(builder, all);
+    trackedContainersSufficientlyReplicatedTotal.snapshot(builder, all);
+    builder.endRecord();
+
+    for (Map.Entry<String, TrackedWorkflowContainerState> e :
+        trackedWorkflowContainerMetricByHost.entrySet()) {
+      MetricsRecordBuilder recordBuilder = collector
+          .addRecord(METRICS_SOURCE_NAME);
+      recordBuilder.add(
+          new MetricsTag(Interns.info("datanode",
+              "datanode host in decommission maintenance workflow"),
+              e.getValue().getHost()));
+      recordBuilder.addGauge(e.getValue().getMetricsInfo(),
+          e.getValue().getValue());
+      recordBuilder.endRecord();
+    }
+  }
+
+  /**
+   * Unregister the metrics instance.
+   */
+  public void unRegister() {
+    DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
+  }
+
+  public synchronized void setTrackedDecommissioningMaintenanceNodesTotal(
+            long numNodesTracked) {
+    trackedDecommissioningMaintenanceNodesTotal
+        .set(numNodesTracked);
+  }
+
+  public synchronized void setTrackedRecommissionNodesTotal(
+          long numNodesTrackedRecommissioned) {
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
+  }
+
+  public synchronized void setTrackedPipelinesWaitingToCloseTotal(
+          long numTrackedPipelinesWaitToClose) {
+    trackedPipelinesWaitingToCloseTotal
+        .set(numTrackedPipelinesWaitToClose);
+  }
+
+  public synchronized void setTrackedContainersUnderReplicatedTotal(
+          long numTrackedUnderReplicated) {
+    trackedContainersUnderReplicatedTotal
+        .set(numTrackedUnderReplicated);
+  }
+
+  public synchronized void setTrackedContainersUnhealthyTotal(
+          long numTrackedUnhealthy) {
+    trackedContainersUnhealthyTotal
+        .set(numTrackedUnhealthy);
+  }
+
+  public synchronized void setTrackedContainersSufficientlyReplicatedTotal(
+          long numTrackedSufficientlyReplicated) {
+    trackedContainersSufficientlyReplicatedTotal
+        .set(numTrackedSufficientlyReplicated);
+  }
+
+  public synchronized long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
+  }
+
+  public synchronized long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public synchronized long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public synchronized long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public synchronized long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public synchronized long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  private TrackedWorkflowContainerState createContainerMetricsInfo(
+      String host,
+      MetricByHost metric) {
+    return new TrackedWorkflowContainerState(host,
+        Interns.info(metric.getMetricName(),
+            metric.getDescription()), 0L);
+  }
+
+  public synchronized void metricRecordOfContainerStateByHost(
+      Map<String, ContainerStateInWorkflow> containerStatesByHost) {
+    trackedWorkflowContainerMetricByHost.clear();
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+        containerStatesByHost.entrySet()) {
+      trackedWorkflowContainerMetricByHost
+          .computeIfAbsent(MetricByHost.SufficientlyReplicated

Review Comment:
   I don't think we need all the computIfAbsent calls here - we just cleared the map so they are always going to be absent, so just put the new value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1307543175

   Latest changes look good. Please have a check @kerneltime and we can commit if you are happy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r999244991


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -90,6 +137,9 @@ public DatanodeAdminMonitorImpl(
     this.eventQueue = eventQueue;
     this.nodeManager = nodeManager;
     this.replicationManager = replicationManager;
+
+    containerStateByHost = new HashMap<>();
+    pipelinesWaitingToCloseByHost = new HashMap<>();

Review Comment:
   Why split the pipelines into a seperate map? It looks like it would be easier overall to have a pipeline count setter on the `ContainerStateInWorkflow` object and just carry the pipeline count around with the containers etc too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1005724468


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   > To make it consistent, you need to synchronize in the metrics object and set ALL the metrics in a single synchronized call.
   
   Will do.  At least try to keep possible inconsistency to a minimum.  We won't report metrics that show one value in one sample, that clear, then go back to value due to our sampling from the monitor. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1306591369

   Latest push contains both simplification changes to `NodeDecommissionMetrics` and `DatanodeAdminMonitorImpl` to use the `ContainerStateInWorkflow` inner class and methods.  Thanks for the suggestion.  
   Also included is a fix to the jmx by host metrics collected:
   
   > The way I'd expect this to work is we have 1 tag per DN, and then the 4 metrics (under, sufficiently, pipelines, unhealthy) all sharing that tag.
   
   example with datanode 1 and datanode 3 in decommissioning workflow as captured by monitor:
   ```
       "TrackedPipelinesWaitingToCloseTotal" : 4,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "tag.datanode.1" : "ozone_datanode_1.ozone_default",
       "tag.Hostname.1" : "f27febaf1870",
       "TrackedPipelinesWaitingToCloseDN.1" : 2,
       "TrackedUnderReplicatedDN.1" : 0,
       "TrackedSufficientlyReplicatedDN.1" : 0,
       "TrackedUnhealthyContainersDN.1" : 0,
       "tag.datanode.2" : "ozone_datanode_3.ozone_default",
       "tag.Hostname.2" : "f27febaf1870",
       "TrackedPipelinesWaitingToCloseDN.2" : 2,
       "TrackedUnderReplicatedDN.2" : 0,
       "TrackedSufficientlyReplicatedDN.2" : 0,
       "TrackedUnhealthyContainersDN.2" : 0
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1279598223

   Thanks @sodonnel for your help to expose the decommission / maintenance metrics for monitoring the workflow.  As you suggested, I've added metrics to monitor the workflow progress by host.  These host based metrics are created dynamically and track the pipeline and container state for datanodes going through the decommissioning and maintenance workflow.
   
   The metrics include,
   node_decommission_metrics_tracked_pipelines_waiting_to_close_**ozone_datanode_3_ozone_default**
   node_decommission_metrics_tracked_sufficiently_replicated_**ozone_datanode_3_ozone_default**
   node_decommission_metrics_tracked_unhealthy_containers_**localhost**
   node_decommission_metrics_tracked_under_replicated_containers_**localhost**
   **(hostname marked in bold)**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998421890


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -82,6 +85,44 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private long unhealthyContainers = 0;
   private long underReplicatedContainers = 0;
 
+  @SuppressFBWarnings(value = "SIC_INNER_SHOULD_BE_STATIC")
+  private final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,

Review Comment:
   Formating seems off on this line by 1 space.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] kerneltime commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

kerneltime commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r997572384


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,253 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+        org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+                .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;

Review Comment:
   Is `tracked` a bit superfluous here? These can be `RecommissionNodesTotal` unless the prefix `tracked` is adding more context here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998795404


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,253 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+        org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+                .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;

Review Comment:
   The significance of `tracked` is coming from the convention in the monitor code which tracks the nodes in the decommissioning and maintenance workflow - 
    `DatanodeAdminMonitorImpl.java` javadoc: _Once an node is placed into tracked nodes, it goes through a workflow where
     the following happens:_ 
    and the corresponding log outputs,
   _INFO node.DatanodeAdminMonitorImpl: There are 1 nodes tracked for decommission and maintenance._
   Would it be better to continue to use the prefix `tracked`, remove altogether, or perhaps another keyword uniform to the decommissioning/maintenance mode metrics to easily search for with jmx and prometheus?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998797359


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -82,6 +85,44 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private long unhealthyContainers = 0;
   private long underReplicatedContainers = 0;
 
+  @SuppressFBWarnings(value = "SIC_INNER_SHOULD_BE_STATIC")
+  private final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,

Review Comment:
   Yes, using a new ide.  Adjusted formatting for files accordingly.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -82,6 +85,44 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private long unhealthyContainers = 0;
   private long underReplicatedContainers = 0;
 
+  @SuppressFBWarnings(value = "SIC_INNER_SHOULD_BE_STATIC")
+  private final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,
+                                    long underReplicatedContainers,
+                                    long unhealthyContainers) {
+      this.host = host;
+      this.sufficientlyReplicated = sufficientlyReplicated;
+      this.unhealthyContainers = unhealthyContainers;
+      this.underReplicatedContainers = underReplicatedContainers;
+    }
+
+    public void setAll(long sufficiently,
+                    long under,

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1263959745

   HDFS has a metric like this:
   
   ```
    "DecomNodes" : "{\"cdh-6x-of-1.cdh-6x-of.root.hwx.site:20002\":{\"xferaddr\":\"172.27.52.133:20002\",\"underReplicatedBlocks\":0,\"decommissionOnlyReplicas\":0,\"underReplicateInOpenFiles\":0}}",
   ```
   It seems to register a MBean instance in the FSNameSystem class. Then it has a few places it provides these JSON key values in the metrics.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r996216813


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,158 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+        org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+                .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong totalTrackedDecommissioningMaintenanceNodes;

Review Comment:
   Thanks @kerneltime for your review of this PR and for your comments.  I've pushed changes in latest commit for naming the metrics in that manner for all metrics collected.  ie. `trackedDecommissioningMaintenanceNodesTotal`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004464345


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   At the moment, to reset things, you need to iterate this map and clear all values. Then iterate it again and remove any entries that are no longer tracked.
   
   What if, you just clear the map, which is a one liner.
   
   Then make  ContainerStateInWorkflow a public inner class, and pass the `Map<String, ContainerStateInWorkflow>` directly to the metrics class and have it replace all its metrics internally with the new map.
   
   That way reset becomes a lot easier too.
   
   Also, if we add another metric for a host, we don't need to add another parameter to the `metricRecordOfContainerStateByHost` method, as it just receives the wrapper object anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1295050426

> Why use `threadLocal` variables in the latest patch? Does this even work?

With the changes to the collecting the metrics in the `DatanodeAdminMonitorImpl` with on each iteration clearing the Map and then updating the container state for each active node in the workflow, it triggered synchronization (blocked) warnings (findbugs) due to setting the global Maps and counters in the scheduled monitor threads. To resolve this, with the thread having its own thread context there is no conflict. Maybe I've misunderstood its behavior here,

> If another thread reads the values set by the monitor thread, does it even get the values, as they should be local to the thread who set them?

The monitor is the only thread setting the values and writing to the metrics. Each run of the monitor thread has its own context through threadlocal variables. Is this a problem? I can revert the threadlocal changes easily, however we will be seeing the synchronized block problems when setting the individual counters in the `DatanodeAdminMonitorImpl` such as for the `pipelinesWaitingToClose` .

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1301013933

   Hi @captainzmc , thanks for reaching out for this PR to be included in the 1.3 release.  This patch is actually something that we would like to have in the 1.3 release.  Such functionality is new and needed for our production environment.  We would like to use this included in the 1.3 stable release.  Please do continue to include this PR in  1.3 release blocked list.
   @kerneltime, as discussed in our past community call, please do review the changes.  @sodonnel please do so when you can.
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1016893538


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -0,0 +1,253 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with this
+ * work for additional information regarding copyright ownership.  The ASF
+ * licenses this file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations under
+ * the License.
+ */
+
+package org.apache.hadoop.hdds.scm.node;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+import org.apache.hadoop.metrics2.annotation.Metric;
+import org.apache.hadoop.metrics2.annotation.Metrics;
+import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
+import org.apache.hadoop.metrics2.lib.Interns;
+import org.apache.hadoop.metrics2.lib.MetricsRegistry;
+import org.apache.hadoop.metrics2.lib.MutableGaugeLong;
+import org.apache.hadoop.ozone.OzoneConsts;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Class contains metrics related to the NodeDecommissionManager.
+ */
+@Metrics(about = "Node Decommission Metrics", context = OzoneConsts.OZONE)
+public final class NodeDecommissionMetrics implements MetricsSource {
+  public static final String METRICS_SOURCE_NAME =
+        org.apache.hadoop.hdds.scm.node.NodeDecommissionMetrics
+                .class.getSimpleName();
+
+  @Metric("Number of nodes tracked for decommissioning and maintenance.")
+  private MutableGaugeLong trackedDecommissioningMaintenanceNodesTotal;
+
+  @Metric("Number of nodes tracked for recommissioning.")
+  private MutableGaugeLong trackedRecommissionNodesTotal;

Review Comment:
   Thanks @kerneltime.  Changes pushed see: https://github.com/apache/ozone/pull/3781#issuecomment-1307524690



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] captainzmc merged pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

captainzmc merged PR #3781:
URL: https://github.com/apache/ozone/pull/3781


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998434320


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -363,6 +427,14 @@ private boolean checkContainersReplicatedOnNode(DatanodeDetails dn)
     LOG.info("{} has {} sufficientlyReplicated, {} underReplicated and {} " +
         "unhealthy containers",
         dn, sufficientlyReplicated, underReplicated, unhealthy);
+    containerStateByHost.computeIfAbsent(dn.getHostName(),
+            hostID -> new ContainerStateInWorkflow(hostID,

Review Comment:
   The indent here again seems too long (8 spaces instead of 4?) and all the `0L` could be pulled onto the same line I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998652376


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -97,62 +109,138 @@ public void unRegister() {
     DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
   }
 
-  public void setTotalTrackedDecommissioningMaintenanceNodes(
+  public void setTrackedDecommissioningMaintenanceNodesTotal(
             long numNodesTracked) {
-    totalTrackedDecommissioningMaintenanceNodes
+    trackedDecommissioningMaintenanceNodesTotal
             .set(numNodesTracked);
   }
 
-  public void setTotalTrackedRecommissionNodes(
+  public void setTrackedRecommissionNodesTotal(
           long numNodesTrackedRecommissioned) {
-    totalTrackedRecommissionNodes.set(numNodesTrackedRecommissioned);
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
   }
 
-  public void setTotalTrackedPipelinesWaitingToClose(
+  public void setTrackedPipelinesWaitingToCloseTotal(
           long numTrackedPipelinesWaitToClose) {
-    totalTrackedPipelinesWaitingToClose
+    trackedPipelinesWaitingToCloseTotal
             .set(numTrackedPipelinesWaitToClose);
   }
 
-  public void setTotalTrackedContainersUnderReplicated(
+  public void setTrackedContainersUnderReplicatedTotal(
           long numTrackedUnderReplicated) {
-    totalTrackedContainersUnderReplicated
+    trackedContainersUnderReplicatedTotal
             .set(numTrackedUnderReplicated);
   }
 
-  public void setTotalTrackedContainersUnhealthy(
+  public void setTrackedContainersUnhealthyTotal(
           long numTrackedUnhealthy) {
-    totalTrackedContainersUnhealthy
+    trackedContainersUnhealthyTotal
             .set(numTrackedUnhealthy);
   }
 
-  public void setTotalTrackedContainersSufficientlyReplicated(
+  public void setTrackedContainersSufficientlyReplicatedTotal(
           long numTrackedSufficientlyReplicated) {
-    totalTrackedContainersSufficientlyReplicated
+    trackedContainersSufficientlyReplicatedTotal
             .set(numTrackedSufficientlyReplicated);
   }
 
-  public long getTotalTrackedDecommissioningMaintenanceNodes() {
-    return totalTrackedDecommissioningMaintenanceNodes.value();
+  public long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
   }
 
-  public long getTotalTrackedRecommissionNodes() {
-    return totalTrackedRecommissionNodes.value();
+  public long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  public void metricRecordPipelineWaitingToCloseByHost(String host,
+                                                       long num) {
+    trackedPipelinesWaitingToCloseByHost.computeIfAbsent(host,
+            hostID -> registry.newGauge(
+                    Interns.info("trackedPipelinesWaitingToClose-" + hostID,
+                            "Number of pipelines waiting to close for "
+                                    + "host in decommissioning and "
+                                    + "maintenance mode"),
+                    0L)
+    ).set(num);
+  }
+
+  public void metricRecordOfReplicationByHost(String host,
+                               long sufficientlyReplicated,
+                               long underReplicated,
+                               long unhealthy) {
+    trackedSufficientlyReplicatedByHost.computeIfAbsent(host,

Review Comment:
   Formatting is a bit strange here too. If you indent by 4 instead of 8 I think we can wrap one less line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1001614335


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   In #3791 Symious added a tag with a group of metrics in JSON form. For the metrics system, this is just a tag to string, rather than a gauge, but we could group all currently decommissioning / maintenence nodes into a JSON representation to expose the fine grained info. If no nodes are in the workflow, it would just be an empty json object, so nodes can come and go easily.
   
   Then you still have your aggregate metrics as they are now.
   
   It is unlikely that someone would want to chart an individual DN as they would have to create a new chart for each DN.
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004471235


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   I'm not sure how the prom end point works. Its not ideal that it keeps the last value pushed, but I am not sure where that code even comes from!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998797590


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeDecommissionMetrics.java:
##########
@@ -97,62 +109,138 @@ public void unRegister() {
     DefaultMetricsSystem.instance().unregisterSource(METRICS_SOURCE_NAME);
   }
 
-  public void setTotalTrackedDecommissioningMaintenanceNodes(
+  public void setTrackedDecommissioningMaintenanceNodesTotal(
             long numNodesTracked) {
-    totalTrackedDecommissioningMaintenanceNodes
+    trackedDecommissioningMaintenanceNodesTotal
             .set(numNodesTracked);
   }
 
-  public void setTotalTrackedRecommissionNodes(
+  public void setTrackedRecommissionNodesTotal(
           long numNodesTrackedRecommissioned) {
-    totalTrackedRecommissionNodes.set(numNodesTrackedRecommissioned);
+    trackedRecommissionNodesTotal.set(numNodesTrackedRecommissioned);
   }
 
-  public void setTotalTrackedPipelinesWaitingToClose(
+  public void setTrackedPipelinesWaitingToCloseTotal(
           long numTrackedPipelinesWaitToClose) {
-    totalTrackedPipelinesWaitingToClose
+    trackedPipelinesWaitingToCloseTotal
             .set(numTrackedPipelinesWaitToClose);
   }
 
-  public void setTotalTrackedContainersUnderReplicated(
+  public void setTrackedContainersUnderReplicatedTotal(
           long numTrackedUnderReplicated) {
-    totalTrackedContainersUnderReplicated
+    trackedContainersUnderReplicatedTotal
             .set(numTrackedUnderReplicated);
   }
 
-  public void setTotalTrackedContainersUnhealthy(
+  public void setTrackedContainersUnhealthyTotal(
           long numTrackedUnhealthy) {
-    totalTrackedContainersUnhealthy
+    trackedContainersUnhealthyTotal
             .set(numTrackedUnhealthy);
   }
 
-  public void setTotalTrackedContainersSufficientlyReplicated(
+  public void setTrackedContainersSufficientlyReplicatedTotal(
           long numTrackedSufficientlyReplicated) {
-    totalTrackedContainersSufficientlyReplicated
+    trackedContainersSufficientlyReplicatedTotal
             .set(numTrackedSufficientlyReplicated);
   }
 
-  public long getTotalTrackedDecommissioningMaintenanceNodes() {
-    return totalTrackedDecommissioningMaintenanceNodes.value();
+  public long getTrackedDecommissioningMaintenanceNodesTotal() {
+    return trackedDecommissioningMaintenanceNodesTotal.value();
   }
 
-  public long getTotalTrackedRecommissionNodes() {
-    return totalTrackedRecommissionNodes.value();
+  public long getTrackedRecommissionNodesTotal() {
+    return trackedRecommissionNodesTotal.value();
+  }
+
+  public long getTrackedPipelinesWaitingToCloseTotal() {
+    return trackedPipelinesWaitingToCloseTotal.value();
+  }
+
+  public long getTrackedContainersUnderReplicatedTotal() {
+    return trackedContainersUnderReplicatedTotal.value();
+  }
+
+  public long getTrackedContainersUnhealthyTotal() {
+    return trackedContainersUnhealthyTotal.value();
+  }
+
+  public long getTrackedContainersSufficientlyReplicatedTotal() {
+    return trackedContainersSufficientlyReplicatedTotal.value();
+  }
+
+  public void metricRecordPipelineWaitingToCloseByHost(String host,
+                                                       long num) {
+    trackedPipelinesWaitingToCloseByHost.computeIfAbsent(host,
+            hostID -> registry.newGauge(
+                    Interns.info("trackedPipelinesWaitingToClose-" + hostID,
+                            "Number of pipelines waiting to close for "
+                                    + "host in decommissioning and "
+                                    + "maintenance mode"),
+                    0L)
+    ).set(num);
+  }
+
+  public void metricRecordOfReplicationByHost(String host,
+                               long sufficientlyReplicated,
+                               long underReplicated,
+                               long unhealthy) {
+    trackedSufficientlyReplicatedByHost.computeIfAbsent(host,

Review Comment:
   Fixed - thanks again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] captainzmc commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

captainzmc commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1300584899

   Hi @sodonnel, currently the release of Ozone-1.3 is blocked on this PR. Could you help continue to review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1263753876

   The changes look good, but I think it would be much more useful if we could track metric at the decommissioning node level too. Ie:
   
   ```
   TotalTrackedContainersUnderReplicatedForHostname = xyz
   ```
   
   I had a look at the ReplicationManagerMetric class, and in there, is an example of how to form a metric "on the fly" using:
   
   ```
   private static final MetricsInfo INFLIGHT_REPLICATION = Interns.info(
         "InflightReplication",
         "Tracked inflight container replication requests.");
   ```
   
   I think it should be possible store the counts per hostname in a map or list, and then when the metrics are snapshot, form dynamic metric names for the host level under / over / unhealthy container counts.
   
   Also keep the aggregate metrics. These host level metrics would let people see if one host is stuck or if all are making progress etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1005737566


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Is this something we should look to add to the code, adding in a single call to the metrics object for all collected metrics to update?  Modify `DatanodeAdminMonitorImpl.setMetricsToGauge`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1295255196

   I took the code you have pushed here, reverted the last commit and implemented some changes to make it simpler and fix the synchronization issue. Please have a look here:
   
   https://github.com/sodonnel/hadoop-ozone/tree/neil-metrics


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1295689730

   Note filed jira for problem with the master branch for prometheus scraping scm metrics from prom endpoint.  This affects our metrics monitoring with prom as well as the scm related metrics such as the` scm_node_manager_decommission*` metrics.  HDDS-7437.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1001605393


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Huum, looks like you are correct. I wonder what the best approach is here.
   
   I don't think its a great user experience if we start with no individual nodes track, and then over time (in a long running SCM) more and more nodes get added for maintenance and decommission and the number builds up all with zero counts. I guess its not a major problem, but it would be nice to resolve it somehow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r999702123


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -90,6 +137,9 @@ public DatanodeAdminMonitorImpl(
     this.eventQueue = eventQueue;
     this.nodeManager = nodeManager;
     this.replicationManager = replicationManager;
+
+    containerStateByHost = new HashMap<>();
+    pipelinesWaitingToCloseByHost = new HashMap<>();

Review Comment:
   Split between replication state and pipelines was for grouping - they are initialized and set in separate parts of the monitor code that resulted in using two separate maps to store the two.  Looking to, as suggested, reuse the `ContainerStateInWorkflow `for the two, perhaps two different setters; one for the replication and the other for pipelines.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1006408976


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Latest commit contains the modifications we are discussing - 
   
   > We can build up a Map<String, ContainerStateInWorkflow> for each iteration and at the end of the iteration pass it to the metrics object like new Map<>(mapJustBuiltUp)
   
   Each iteration in the monitor collects snapshot of node metrics in workflow within threadLocal variables.
   
   > Then make ContainerStateInWorkflow a public inner class, and pass the Map<String, ContainerStateInWorkflow> directly to the metrics class and have it replace all its metrics internally with the new map.
   
   The `NodeDecommissionMetrics` uses the` ContainerStateInWorkflow` for each node to set the metric gauges pulled by `getMetrics`.  Each iteration clears the internal maps.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004983578


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -74,6 +76,58 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private Queue<DatanodeDetails> pendingNodes = new ArrayDeque();
   private Queue<DatanodeDetails> cancelledNodes = new ArrayDeque();
   private Set<DatanodeDetails> trackedNodes = new HashSet<>();
+  private NodeDecommissionMetrics metrics;
+  private long pipelinesWaitingToClose = 0;
+  private long sufficientlyReplicatedContainers = 0;
+  private long trackedDecomMaintenance = 0;
+  private long trackedRecommission = 0;
+  private long unhealthyContainers = 0;
+  private long underReplicatedContainers = 0;
+
+  private static final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+    private long pipelinesWaitingToClose = 0;
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,

Review Comment:
   Yes, currently we instantiate zeroing the values then use the setters to update the values.  The constructor contains the parameters in case we start initializing with non-zero values.  If we don't use it, we can remove and default the parameters to zero.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004946930


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, ContainerStateInWorkflow> e :
+            containerStateByHost.entrySet()) {
+      metrics.metricRecordOfContainerStateByHost(e.getKey(),
+          e.getValue().sufficientlyReplicated,
+          e.getValue().underReplicatedContainers,
+          e.getValue().unhealthyContainers,
+          e.getValue().pipelinesWaitingToClose);
+    }
+  }
+
+  void resetContainerMetrics() {
+    pipelinesWaitingToClose = 0;
+    sufficientlyReplicatedContainers = 0;
+    unhealthyContainers = 0;
+    underReplicatedContainers = 0;
+
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Yes, thanks,  I was looking to couple the `ContainerStateInWorkflow` for use both in the `DatanodeAdminMonitorImpl`and in the `NodeDecommissionMetrics` however there are a few issues with that and thus it is implemented this way.  Those issues are,
   
   i.) there needs to be separate stores for numbers collected for the metrics from the monitor and numbers stored in the `NodeDecommissionMonitor.`  This is so that we do not report incomplete intermediate numbers to the` NodeDecommissionMonitor` when the metrics are periodically pulled through `getMetrics().`  We flush the numbers on each run of the monitor thread to the `NodeDecommissionMetrics `once all the numbers have been collected (the calls to `metricRecordOfContainterStateByHost)`.  For this reason the `Map<string, ContainerStateInWorkflow>` cannot be used directly in the `NodeDecommisonMonitor`.
   
   ii.) With the two separate stores, we need to know which hosts stored are currently in the workflow and which are out of the workflow and stale.  Thus the check in the monitor code to collect those hosts that are stale and reporting that to the `NodeDecommissionMetrics.metricRemoveRecordOfContainerStateByHost()`.  For this reason as well, it looks like clearing the map on each run of the monitor instead of iterating to reset to 0, as suggested, is not possible.  We need to know which nodes (hosts) have become stale since the last run of the monitor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004458588


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -74,6 +76,58 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private Queue<DatanodeDetails> pendingNodes = new ArrayDeque();
   private Queue<DatanodeDetails> cancelledNodes = new ArrayDeque();
   private Set<DatanodeDetails> trackedNodes = new HashSet<>();
+  private NodeDecommissionMetrics metrics;
+  private long pipelinesWaitingToClose = 0;
+  private long sufficientlyReplicatedContainers = 0;
+  private long trackedDecomMaintenance = 0;
+  private long trackedRecommission = 0;
+  private long unhealthyContainers = 0;
+  private long underReplicatedContainers = 0;
+
+  private static final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+    private long pipelinesWaitingToClose = 0;
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,

Review Comment:
   Do we ever pass non-zeros for these values? Perhaps we can just drop these parameters and let them default to zero?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004992451


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   Thanks.  We should go forward with using this implementation that works for JMX metrics for completing this PR  to _expose decommission / maintenance metrics via JMX_ and open a new jira to look into supporting the prom endpoint.  This PR supports metrics tracking the decommission and maintenance workflow both with aggregated counts and DN host specific counts.  A jira will be filed to track prom endpoint behavior for the metrics.  What do you think?  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] captainzmc commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

captainzmc commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1300764566

   Thanks @sodonnel for the feedback.  
   Hi @neils-dev, I saw that you mentioned in the dev email that ozone 1.3 needs this patch. But I agree with Stephen that the current 1.3 release is more focused on the important bug fixes. So I'm going to remove this PR from the 1.3 block list, what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1308883027

   Thanks @sodonnel , @kerneltime , @captainzmc .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] neils-dev commented on pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

neils-dev commented on PR #3781:
URL: https://github.com/apache/ozone/pull/3781#issuecomment-1307524690

   Thanks @kerneltime , @sodonnel for the comment on removing the prefix "tracked" from metrics published through the NodeDecommisionMetrics.  Sounds good.  I've updated the code, now metrics are pushed as for jmx,
   
   >     "DecommissioningMaintenanceNodesTotal" : 1,
   >     "RecommissionNodesTotal" : 0,
   >     "PipelinesWaitingToCloseTotal" : 1,
   >     "ContainersUnderReplicatedTotal" : 0,
   >     "ContainersUnhealthyTotal" : 0,
   >     "ContainersSufficientlyReplicatedTotal" : 0,
   >     "tag.datanode.1" : "ozone_datanode_2.ozone_default",
   >     "tag.Hostname.1" : "30857068c05f",
   >     "PipelinesWaitingToCloseDN.1" : 1,
   >     "UnderReplicatedDN.1" : 0,
   >     "SufficientlyReplicatedDN.1" : 0,
   >     "UnhealthyContainersDN.1" : 0 
    
   and on the prom endpoint,
   
   > node_decommission_metrics_containers_under_replicated_total{hostname="30857068c05f"} 1
   > ...
   > node_decommission_metrics_decommissioning_maintenance_nodes_total gauge
   > node_decommission_metrics_decommissioning_maintenance_nodes_total{hostname="30857068c05f"} 1
   > 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org

[GitHub] [ozone] sodonnel commented on a diff in pull request #3781: HDDS-2642. Expose decommission / maintenance metrics via JMX

Posted by GitBox <gi...@apache.org>.

sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r998422819


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -82,6 +85,44 @@ public class DatanodeAdminMonitorImpl implements DatanodeAdminMonitor {
   private long unhealthyContainers = 0;
   private long underReplicatedContainers = 0;
 
+  @SuppressFBWarnings(value = "SIC_INNER_SHOULD_BE_STATIC")
+  private final class ContainerStateInWorkflow {
+    private long sufficientlyReplicated = 0;
+    private long unhealthyContainers = 0;
+    private long underReplicatedContainers = 0;
+    private String host = "";
+
+    private ContainerStateInWorkflow(String host,
+                                     long sufficientlyReplicated,
+                                    long underReplicatedContainers,
+                                    long unhealthyContainers) {
+      this.host = host;
+      this.sufficientlyReplicated = sufficientlyReplicated;
+      this.unhealthyContainers = unhealthyContainers;
+      this.underReplicatedContainers = underReplicatedContainers;
+    }
+
+    public void setAll(long sufficiently,
+                    long under,

Review Comment:
   Formatting here seems off again - should either be 4 spaces in from the line above or aligned with the other parameters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org