Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2020/01/07 09:02:00 UTC

[jira] [Commented] (HDDS-2113) Update JMX metrics in SCMNodeMetrics for Decommission and Maintenance

    [ https://issues.apache.org/jira/browse/HDDS-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009529#comment-17009529 ] 

Stephen O'Donnell commented on HDDS-2113:
-----------------------------------------

For the count of nodes in various states:

In offline discussions some time ago, I think it was [~nanda] who suggested that, rather than having only the HEALTHY, STALE and DEAD metrics, we should have nested metrics, i.e.:
{code}
IN_SERVICE: 
  HEALTHY: 10
  STALE: 0
  DEAD: 0
DECOMMISSIONING:
  HEALTHY: 1
  STALE: 0
  DEAD: 0
...
{code}

That way we avoid having the cross product of all states at the top level, which makes the metrics easier to read.
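
As a rough illustration of the nested layout, a minimal Java sketch might look like the following. The class name, the two enums and the methods are hypothetical and are not taken from the SCM code; this only shows one way to hold a per-operational-state count of each health state.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not the real SCM types.
class NestedNodeCounts {
  enum OpState {
    IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE, DECOMMISSIONING, DECOMMISSIONED
  }

  enum Health { HEALTHY, STALE, DEAD }

  // Outer key: operational state. Inner key: health state.
  private final Map<OpState, Map<Health, Integer>> counts = new EnumMap<>(OpState.class);

  NestedNodeCounts() {
    // Pre-fill every combination with zero so all 15 entries are always reported.
    for (OpState op : OpState.values()) {
      Map<Health, Integer> inner = new EnumMap<>(Health.class);
      for (Health h : Health.values()) {
        inner.put(h, 0);
      }
      counts.put(op, inner);
    }
  }

  // Count one node in the given operational and health state.
  void record(OpState op, Health h) {
    counts.get(op).merge(h, 1, Integer::sum);
  }

  int get(OpState op, Health h) {
    return counts.get(op).get(h);
  }

  // Summary across health states for one operational state.
  int total(OpState op) {
    int sum = 0;
    for (int c : counts.get(op).values()) {
      sum += c;
    }
    return sum;
  }
}
```

Keeping the health counts grouped under each operational state also makes a per-state summary (the `total` method here) cheap to derive, should summary metrics be wanted.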

For calculating capacity, I think it is safe to assume:

For any node not IN_SERVICE, the free space on the node is not usable, as the node is effectively read-only. Therefore we should not count space on these nodes toward the cluster capacity.

Things are less clear when you look at space used. For a decommissioning node, the space used on the node effectively needs to be transferred to other nodes via container replication before decommission can complete, but this is difficult to track from a space usage perspective. When a node completes decommission, we can assume it provides no capacity to the cluster and uses none. Therefore, for decommissioning + decommissioned nodes, the simplest calculation is to exclude the node completely, in a similar way to a dead node.

For maintenance nodes, things are even less clear. A maintenance node is read-only, so it cannot provide capacity to the cluster, but it is expected to return to service, so excluding it completely probably does not make sense.

Perhaps the simplest solution is to do the following:

1. For any node not IN_SERVICE, do not include its usage or space in the cluster capacity totals.
2. Introduce some new metrics to account for the maintenance and perhaps decommission capacity, e.g.:

{code}
# Existing metrics
"DiskCapacity" : 62725623808,
"DiskUsed" : 4096,
"DiskRemaining" : 50459619328,

# Suggested new ones
"MaintenanceDiskCapacity": 0,
"MaintenanceDiskUsed": 0,
"MaintenanceDiskRemaining": 0,
"DecommissionedDiskCapacity": 0,
"DecommissionedDiskUsed": 0,
"DecommissionedDiskRemaining": 0,
...
{code}

That way, the cluster totals are only what is currently "online", but we have the other metrics to track what has been removed etc.
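
The split could be sketched roughly as below. This is only an illustration under assumptions: `ClusterCapacity`, `NodeStat` and `Totals` are hypothetical names, ENTERING_MAINTENANCE is grouped with IN_MAINTENANCE, DECOMMISSIONING is grouped with DECOMMISSIONED, and node health is ignored for brevity.

```java
import java.util.List;

// Hypothetical sketch: names and bucket choices are assumptions, not SCM code.
class ClusterCapacity {
  enum OpState {
    IN_SERVICE, ENTERING_MAINTENANCE, IN_MAINTENANCE, DECOMMISSIONING, DECOMMISSIONED
  }

  // Per-node disk figures in bytes.
  static final class NodeStat {
    final OpState state;
    final long capacity;
    final long used;
    final long remaining;

    NodeStat(OpState state, long capacity, long used, long remaining) {
      this.state = state;
      this.capacity = capacity;
      this.used = used;
      this.remaining = remaining;
    }
  }

  // Running sums for one bucket of nodes.
  static final class Totals {
    long capacity;
    long used;
    long remaining;

    void add(NodeStat n) {
      capacity += n.capacity;
      used += n.used;
      remaining += n.remaining;
    }
  }

  final Totals online = new Totals();          // DiskCapacity / DiskUsed / DiskRemaining
  final Totals maintenance = new Totals();     // MaintenanceDisk*
  final Totals decommissioned = new Totals();  // DecommissionedDisk*

  static ClusterCapacity summarize(List<NodeStat> nodes) {
    ClusterCapacity c = new ClusterCapacity();
    for (NodeStat n : nodes) {
      switch (n.state) {
        case IN_SERVICE:
          // Only IN_SERVICE nodes contribute to the cluster totals.
          c.online.add(n);
          break;
        case ENTERING_MAINTENANCE:
        case IN_MAINTENANCE:
          c.maintenance.add(n);
          break;
        default:
          // DECOMMISSIONING and DECOMMISSIONED
          c.decommissioned.add(n);
          break;
      }
    }
    return c;
  }
}
```

If the decommissioned metrics turn out not to be needed, the `default` branch could simply skip the node instead of accumulating it.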

There could also be an argument that the new DecommissionedDisk* metrics are not needed, as that capacity is technically lost from the cluster forever.

> Update JMX metrics in SCMNodeMetrics for Decommission and Maintenance
> ---------------------------------------------------------------------
>
>                 Key: HDDS-2113
>                 URL: https://issues.apache.org/jira/browse/HDDS-2113
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> Currently the class SCMNodeMetrics exposes JMX metrics for the number of HEALTHY, STALE and DEAD nodes.
> It also exposes the disk capacity of the cluster and the amount of space used and available.
> We need to decide how we want to display things in JMX when nodes are in and entering maintenance, decommissioning and decommissioned.
> We now have 15 states rather than the previous 3, as we can have nodes in:
>  * IN_SERVICE
>  * ENTERING_MAINTENANCE
>  * IN_MAINTENANCE
>  * DECOMMISSIONING
>  * DECOMMISSIONED
> And in each of these states, nodes can be:
>  * HEALTHY
>  * STALE
>  * DEAD
> The simplest case would be to expose these 15 states directly in JMX, as it gives the complete picture, but I wonder if we need any summary JMX metrics too?
>  
> We also need to consider how to count disk capacity and usage. For example:
>  # Do we count capacity and usage on a DECOMMISSIONING node? This is not a clear cut answer, as a decommissioning node does not provide any capacity for writers in the cluster, but it does use capacity.
>  # For a DECOMMISSIONED node, we probably should not count capacity or usage
>  # For an ENTERING_MAINTENANCE node, do we count capacity and usage? I suspect we should include the capacity and usage in the totals, however a node in this state will not be available for writes.
>  # For an IN_MAINTENANCE node that is healthy?
>  # For an IN_MAINTENANCE node that is dead?
> I would welcome any thoughts on this before changing the code.


