
Posted to dev@storm.apache.org by revans2 <gi...@git.apache.org> on 2018/09/22 17:00:32 UTC

[GitHub] storm pull request #2845: STORM-3234: Replace old metrics with better docume...

GitHub user revans2 opened a pull request:

    https://github.com/apache/storm/pull/2845

    STORM-3234: Replace old metrics with better documentation.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/revans2/incubator-storm STORM-3234

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2845
    

---

[GitHub] storm issue #2845: STORM-3234: Replace old metrics docs with better document...

Posted by revans2 <gi...@git.apache.org>.

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2845
  
    @govind-menon I updated the docs as I checked them in.


---

[GitHub] storm pull request #2845: STORM-3234: Replace old metrics docs with better d...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/2845


---

[GitHub] storm pull request #2845: STORM-3234: Replace old metrics docs with better d...

Posted by govind-menon <gi...@git.apache.org>.

Github user govind-menon commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2845#discussion_r220219980
  
    --- Diff: docs/ClusterMetrics.md ---
    @@ -0,0 +1,256 @@
    +---
    +title: Cluster Metrics
    +layout: documentation
    +documentation: true
    +---
    +
    +#Cluster Metrics
    +
    +There are lots of metrics to help you monitor a running cluster.  Many of these metrics are still a work in progress and so is the metrics system itself so any of them may change, even between minor version releases.  We will try to keep them as stable as possible, but they should all be considered somewhat unstable. Some of the metrics may also be for experimental features, or features that are not complete yet, so please read the description of the metric before using it for monitoring or alerting.
    +
    +Also be aware that depending on the metrics system you use, the names are likely to be translated into a different format that is compatible with the system.  Typically this means that the ':' separating character will be replaced with a '.' character.
    +
    +Most metrics should have the units that they are reported in as a part of the description.  For Timers often this is configured by the reporter that is uploading them to your system.  Pay attention because even if the metric name has a time unit in it, it may be false.
    +
    +Also most metrics, except for gauges and counters, are a collection of numbers, and not a single value.  Often these result in multiple metrics being uploaded to a reporting system, such as percentiles for a histogram, or rates for a meter.  It is dependent on the configured metrics reporter how this happens, or how the name here corresponds to the metric in your reporting system.
    +
    +## Cluster Metrics (From Nimbus)
    +
    +These are metrics that come from the active nimbus instance and report the state of the cluster as a whole, as seen by nimbus.
    +
    +| Metric Name | Type | Description |
    +|-------------|------|-------------|
    +| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a leader. This should really only ever be 1 in a health cluster, or 0 for a short period of time while a failover happens. |
    +| cluster:num-nimbuses | gauge | Number of nimbuses, leader or standby. |
    +| cluster:num-supervisors | gauge | Number of supervisors. |
    +| cluster:num-topologies | gauge | Number of topologies. |
    +| cluster:num-total-used-workers | gauge | Number of used workers/slots. |
    +| cluster:num-total-workers | gauge | Number of workers/slots. |
    +| cluster:total-fragmented-cpu-non-negative | gauge | Total fragmented CPU (% of core).  This is CPU that the system thinks it cannot use because other resources on the node are used up. |
    +| cluster:total-fragmented-memory-non-negative | gauge | Total fragmented memory (MB).  This is memory that the system thinks it cannot use because other resources on the node are used up.  |
    +| topologies:assigned-cpu | histogram | CPU scheduled per topology (% of a core) |
    +| topologies:assigned-mem-off-heap | histogram | Off heap memory scheduled per topology (MB) |
    +| topologies:assigned-mem-on-heap | histogram | On heap memory scheduled per topology (MB) |
    +| topologies:num-executors | histogram | Number of executors per topology. |
    +| topologies:num-tasks | histogram | Number of tasks per topology. |
    +| topologies:num-workers | histogram | Number of workers per topology. |
    +| topologies:replication-count | histogram | Replication count per topology. |
    +| topologies:requested-cpu | histogram | CPU requested per topology  (% of a core). |
    +| topologies:requested-mem-off-heap | histogram | Off heap memory requested per topology (MB). |
    +| topologies:requested-mem-on-heap | histogram | On heap memory requested per topology (MB). |
    +| topologies:uptime-secs | histogram | Uptime per topology (seconds). |
    +| nimbus:available-cpu-non-negative | gauge | Available cpu on the cluster (% of a core). |
    +| nimbus:total-cpu | gauge | total CPU on the cluster (% of a core) |
    +| nimbus:total-memory | gauge | total memory on the cluster MB |
    +| supervisors:fragmented-cpu | histogram | fragmented cpu per supervisor (% of a core) |
    +| supervisors:fragmented-mem | histogram | fragmented memory per supervisor (MB) |
    +| supervisors:num-used-workers | histogram | workers used per supervisor |
    +| supervisors:num-workers | histogram | number of workers per supervisor |
    +| supervisors:uptime-secs | histogram | uptime of supervisors |
    +| supervisors:used-cpu | histogram | cpu used per supervisor (% of a core) |
    +| supervisors:used-mem | histogram | memory used per supervisor MB |
    +
    +## Nimbus Metrics
    +
    +These are metrics that are specific to a nimbus instance.  In many instances only the active nimbus will be reporting these metrics, but they could come from standby nimbus instances as well.
    +
    +| Metric Name | Type | Description |
    +|-------------|------|-------------|
    +| nimbus:files-upload-duration-ms | timer | Time it takes to upload a file from start to finish (Not Blobs, but this may change) |
    +| nimbus:longest-scheduling-time-ms | gauge | Longest time ever taken so far to schedule. This includes the current scheduling run, which is intended to detect if scheduling is stuck for some reason. |
    +| nimbus:num-activate-calls | meter | calls to the activate thrift method. |
    +| nimbus:num-added-executors-per-scheduling | histogram | number of executors added after a scheduling run. |
    +| nimbus:num-added-slots-per-scheduling | histogram |  number of slots added after a scheduling run. |
    +| nimbus:num-beginFileUpload-calls | meter | calls to the beginFileUpload thrift method. |
    +| nimbus:num-blacklisted-supervisor | gauge | Number of supervisors currently marked as blacklisted because they appear to be somewhat unstable. |
    +| nimbus:num-deactivate-calls | meter | calls to deactivate thrift method. |
    +| nimbus:num-debug-calls | meter | calls to debug thrift method.|
    +| nimbus:num-downloadChunk-calls | meter | calls to downloadChunk thrift method. |
    +| nimbus:num-finishFileUpload-calls | meter | calls to finishFileUpload thrift method.|
    +| nimbus:num-gained-leadership | meter | number of times this nimbus gained leadership. |
    +| nimbus:num-getClusterInfo-calls | meter | calls to getClusterInfo thrift method. |
    +| nimbus:num-getComponentPageInfo-calls | meter | calls to getComponentPageInfo thrift method. |
    +| nimbus:num-getComponentPendingProfileActions-calls | meter | calls to getComponentPendingProfileActions thrift method. |
    +| nimbus:num-getLeader-calls | meter | calls to getLeader thrift method. |
    +| nimbus:num-getLogConfig-calls | meter | calls to getLogConfig thrift method. |
    +| nimbus:num-getNimbusConf-calls | meter | calls to getNimbusConf thrift method. |
    +| nimbus:num-getOwnerResourceSummaries-calls | meter | calls to getOwnerResourceSummaries thrift method. |
    +| nimbus:num-getSupervisorPageInfo-calls | meter | calls to getSupervisorPageInfo thrift method. |
    +| nimbus:num-getTopology-calls | meter | calls to getTopology thrift method. |
    +| nimbus:num-getTopologyConf-calls | meter | calls to getTopologyConf thrift method. |
    +| nimbus:num-getTopologyInfo-calls | meter | calls to getTopologyInfo thrift method. |
    +| nimbus:num-getTopologyInfoWithOpts-calls | meter | calls to getTopologyInfoWithOpts thrift method includes calls to getTopologyInfo. |
    +| nimbus:num-getTopologyPageInfo-calls | meter | calls to getTopologyPageInfo thrift method. |
    +| nimbus:num-getUserTopology-calls | meter | calls to getUserTopology thrift method. |
    +| nimbus:num-isTopologyNameAllowed-calls | meter | calls to isTopologyNameAllowed thrift method. |
    +| nimbus:num-killTopology-calls | meter | calls to killTopology thrift method. |
    +| nimbus:num-killTopologyWithOpts-calls | meter | calls to killTopologyWithOpts thrift method includes calls to killTopology. |
    +| nimbus:num-launched | meter | number of times a nimbus was launched |
    +| nimbus:num-lost-leadership | meter | number of times this nimbus lost leadership |
    +| nimbus:num-negative-resource-events | meter | any time a resource goes negative (either CPU or Memory)  Not consistent as it is used for internal calculations that may go negative and does not represent over scheduling of resources. |
    --- End diff --
    
    Do you mean "not inconsistent"?
    
    Nit: missing a semicolon or period.
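
For reference, the quoted docs say that a metrics system will typically replace the ':' separating character with a '.' character. A minimal sketch of that translation follows, assuming a plain string replacement. The function name `translate_metric_name` and its `separator` parameter are hypothetical and are not part of Storm or of any reporter API; the metric names are taken from the diff above.

```python
def translate_metric_name(name: str, separator: str = ".") -> str:
    """Replace the ':' separating character with a reporter-friendly one.

    Hypothetical helper for illustration only; a real reporter may apply
    additional, system-specific rewriting.
    """
    return name.replace(":", separator)


# Metric names as they appear in the quoted ClusterMetrics.md diff.
names = [
    "cluster:num-nimbus-leaders",
    "topologies:assigned-cpu",
    "nimbus:num-getLeader-calls",
]

for original in names:
    print(f"{original} -> {translate_metric_name(original)}")
```

Only the ':' is rewritten here; hyphens and camel-case method names are left untouched, which may or may not match what a given reporting system does.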


---

[GitHub] storm pull request #2845: STORM-3234: Replace old metrics docs with better d...

Posted by govind-menon <gi...@git.apache.org>.

Github user govind-menon commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2845#discussion_r220219222
  
    --- Diff: docs/ClusterMetrics.md ---
    @@ -0,0 +1,256 @@
    +---
    +title: Cluster Metrics
    +layout: documentation
    +documentation: true
    +---
    +
    +#Cluster Metrics
    +
    +There are lots of metrics to help you monitor a running cluster. Many of these metrics are still a work in progress and so is the metrics system itself so any of them may change, even between minor version releases. We will try to keep them as stable as possible, but they should all be considered somewhat unstable. Some of the metrics may also be for experimental features, or features that are not complete yet, so please read the description of the metric before using it for monitoring or alerting.
    +
    +Also be aware that depending on the metrics system you use, the names are likely to be translated into a different format that is compatible with the system. Typically this means that the ':' separating character will be replaced with a '.' character.
    +
    +Most metrics should have the units that they are reported in as a part of the description. For Timers often this is configured by the reporter that is uploading them to your system. Pay attention because even if the metric name has a time unit in it, it may be false.
    +
    +Also most metrics, except for gauges and counters, are a collection of numbers, and not a single value. Often these result in multiple metrics being uploaded to a reporting system, such as percentiles for a histogram, or rates for a meter. It is dependent on the configured metrics reporter how this happens, or how the name here corresponds to the metric in your reporting system.
    +
    +## Cluster Metrics (From Nimbus)
    +
    +These are metrics that come from the active nimbus instance and report the state of the cluster as a whole, as seen by nimbus.
    +
    +| Metric Name | Type | Description |
    +|-------------|------|-------------|
    +| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a leader. This should really only ever be 1 in a health cluster, or 0 for a short period of time while a failover happens. |
    --- End diff --
    
    Nit: healthy
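
The quoted docs also note that a histogram is a collection of numbers and often results in several metrics being uploaded, such as percentiles. The sketch below illustrates that fan-out under stated assumptions: the function `expand_histogram`, the suffixes (`.min`, `.max`, `.mean`, `.p50`, `.p99`), and the nearest-rank percentile rule are all hypothetical choices for illustration, not what Storm or any particular reporter actually emits.

```python
import math


def expand_histogram(name: str, samples: list[float]) -> dict[str, float]:
    """Turn one histogram into several named values, as a reporter might.

    Hypothetical illustration: the suffixes and the nearest-rank
    percentile rule are assumptions, not a real reporter's behavior.
    """
    if not samples:
        raise ValueError("histogram has no samples")
    ordered = sorted(samples)

    def percentile(p: float) -> float:
        # Nearest-rank percentile: smallest value with at least p% of
        # the samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        f"{name}.min": ordered[0],
        f"{name}.max": ordered[-1],
        f"{name}.mean": sum(ordered) / len(ordered),
        f"{name}.p50": percentile(50),
        f"{name}.p99": percentile(99),
    }


# Example: workers per topology for four topologies (made-up sample data).
for key, value in expand_histogram("topologies.num-workers", [1, 2, 3, 4]).items():
    print(key, value)
```

The point is only that a single documented name such as `topologies:num-workers` can correspond to several series in a reporting system, so the exact names to alert on depend on the configured reporter.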

---