
Posted to issues@ozone.apache.org by "Neil Joshi (Jira)" <ji...@apache.org> on 2022/09/15 01:09:00 UTC

[jira] [Commented] (HDDS-2642) Expose decommission / maintenance metrics via JMX

    [ https://issues.apache.org/jira/browse/HDDS-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605020#comment-17605020 ] 

Neil Joshi commented on HDDS-2642:
----------------------------------

[~sodonnell] thanks for the offline discussion on exposing decommission / maintenance workflow progress metrics through the JMX and Prometheus endpoints.  As discussed, the metrics will be collected and aggregated in the DatanodeAdminMonitor.

On each tick (each execution of the monitor), the following metrics will be set as MutableGaugeLong gauges:

 
{code:java}
totalTrackedNodes - number of nodes in the decommission and maintenance workflow, taken from the tracked nodes queue.
totalRecommissionNodes - number of nodes in the canceled nodes queue on this tick.
totalTrackedPipelinesWaitingToClose - number of pipelines that still need to close, as seen on this tick.

Replication state, aggregated over the tracked nodes on every tick:
totalTrackedUnderReplicated
totalTrackedOverReplicated
totalTrackedSufficientlyReplicated
{code}
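To make the per-tick behaviour concrete, here is a minimal sketch of how these gauges could be recomputed on each tick and read back over JMX. It is not the DatanodeAdminMonitor code. To keep it runnable without Hadoop on the classpath it swaps in a plain javax.management standard MBean backed by AtomicLong in place of MutableGaugeLong. The TrackedNode class, the onTick method, the assumption that the replication gauges count containers, and the object name are all hypothetical and only there for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class DecommissionMetricsSketch {

  /** Hypothetical per-node snapshot gathered by the monitor during one tick. */
  static final class TrackedNode {
    final String host;
    final long pipelinesWaitingToClose;
    final long underReplicated;
    final long overReplicated;
    final long sufficientlyReplicated;

    TrackedNode(String host, long pipelines, long under, long over, long sufficient) {
      this.host = host;
      this.pipelinesWaitingToClose = pipelines;
      this.underReplicated = under;
      this.overReplicated = over;
      this.sufficientlyReplicated = sufficient;
    }
  }

  /** Standard MBean interface: each getter becomes a read-only JMX attribute. */
  public interface NodeAdminMetricsMBean {
    long getTotalTrackedNodes();
    long getTotalRecommissionNodes();
    long getTotalTrackedPipelinesWaitingToClose();
    long getTotalTrackedUnderReplicated();
    long getTotalTrackedOverReplicated();
    long getTotalTrackedSufficientlyReplicated();
  }

  /** Gauges are overwritten (not incremented) on every tick. */
  public static class NodeAdminMetrics implements NodeAdminMetricsMBean {
    private final AtomicLong trackedNodes = new AtomicLong();
    private final AtomicLong recommissionNodes = new AtomicLong();
    private final AtomicLong pipelinesWaiting = new AtomicLong();
    private final AtomicLong underReplicated = new AtomicLong();
    private final AtomicLong overReplicated = new AtomicLong();
    private final AtomicLong sufficientlyReplicated = new AtomicLong();

    /** Assumed hook: called once per monitor run with that run's queues. */
    void onTick(List<TrackedNode> tracked, long canceledCount) {
      long pipelines = 0, under = 0, over = 0, sufficient = 0;
      for (TrackedNode n : tracked) {
        pipelines += n.pipelinesWaitingToClose;
        under += n.underReplicated;
        over += n.overReplicated;
        sufficient += n.sufficientlyReplicated;
      }
      trackedNodes.set(tracked.size());
      recommissionNodes.set(canceledCount);
      pipelinesWaiting.set(pipelines);
      underReplicated.set(under);
      overReplicated.set(over);
      sufficientlyReplicated.set(sufficient);
    }

    @Override public long getTotalTrackedNodes() { return trackedNodes.get(); }
    @Override public long getTotalRecommissionNodes() { return recommissionNodes.get(); }
    @Override public long getTotalTrackedPipelinesWaitingToClose() { return pipelinesWaiting.get(); }
    @Override public long getTotalTrackedUnderReplicated() { return underReplicated.get(); }
    @Override public long getTotalTrackedOverReplicated() { return overReplicated.get(); }
    @Override public long getTotalTrackedSufficientlyReplicated() { return sufficientlyReplicated.get(); }
  }

  public static void main(String[] args) throws Exception {
    NodeAdminMetrics metrics = new NodeAdminMetrics();
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    // Hypothetical object name, chosen for this sketch only.
    ObjectName name = new ObjectName("example.scm:type=NodeAdminMetrics");
    server.registerMBean(metrics, name);

    // One simulated tick: two tracked nodes, one canceled node.
    metrics.onTick(List.of(
        new TrackedNode("dn1", 2, 5, 0, 10),
        new TrackedNode("dn2", 0, 1, 1, 20)), 1);

    // Read the gauges back through the MBean server, as a JMX client would.
    System.out.println(server.getAttribute(name, "TotalTrackedNodes"));           // prints 2
    System.out.println(server.getAttribute(name, "TotalTrackedUnderReplicated")); // prints 6
  }
}
```

Because every gauge is set from scratch on each tick, a node that leaves the tracked queue drops out of the totals on the next run without any explicit decrement.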
 

In addition to collecting and exposing these metrics on every monitor tick, what other metrics should be collected to make the progress of the decommission and maintenance workflow observable and useful?

> Expose decommission / maintenance metrics via JMX
> -------------------------------------------------
>
>                 Key: HDDS-2642
>                 URL: https://issues.apache.org/jira/browse/HDDS-2642
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Neil Joshi
>            Priority: Major
>
> As nodes transition through the decommission and maintenance workflow, we should expose the hosts going through admin via JMX, along with possibly:
> 1. The stage of the process (close pipelines, replicate containers etc)
> 2. The number of sufficiently replicated, under replicated and unhealthy containers


