You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@storm.apache.org by bo...@apache.org on 2018/09/25 17:22:02 UTC

[1/3] storm git commit: STORM-3234: Replace old metrics with better documentation.

Repository: storm
Updated Branches:
  refs/heads/master d4a5fa149 -> 9d083fa91


STORM-3234: Replace old metrics with better documentation.


Project: http://git-wip-us.apache.org/repos/asf/storm/repo
Commit: http://git-wip-us.apache.org/repos/asf/storm/commit/0db99bcb
Tree: http://git-wip-us.apache.org/repos/asf/storm/tree/0db99bcb
Diff: http://git-wip-us.apache.org/repos/asf/storm/diff/0db99bcb

Branch: refs/heads/master
Commit: 0db99bcbf4e9edc8f7f1eb16143075bb4ee80806
Parents: b2a1a58
Author: Robert (Bobby) Evans <ev...@yahoo-inc.com>
Authored: Sat Sep 22 11:56:26 2018 -0500
Committer: Robert (Bobby) Evans <ev...@yahoo-inc.com>
Committed: Sat Sep 22 11:56:26 2018 -0500

----------------------------------------------------------------------
 docs/ClusterMetrics.md                          | 256 +++++++++++++++++++
 docs/Metrics.md                                 |   2 +
 docs/index.md                                   |   2 +-
 .../storm-metrics-profiling-internal-actions.md |  76 ------
 .../org/apache/storm/daemon/nimbus/Nimbus.java  |   4 -
 5 files changed, 259 insertions(+), 81 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/storm/blob/0db99bcb/docs/ClusterMetrics.md
----------------------------------------------------------------------
diff --git a/docs/ClusterMetrics.md b/docs/ClusterMetrics.md
new file mode 100644
index 0000000..d32d0b1
--- /dev/null
+++ b/docs/ClusterMetrics.md
@@ -0,0 +1,256 @@
+---
+title: Cluster Metrics
+layout: documentation
+documentation: true
+---
+
+#Cluster Metrics
+
+There are lots of metrics to help you monitor a running cluster.  Many of these metrics are still a work in progress and so is the metrics system itself so any of them may change, even between minor version releases.  We will try to keep them as stable as possible, but they should all be considered somewhat unstable. Some of the metrics may also be for experimental features, or features that are not complete yet, so please read the description of the metric before using it for monitoring or alerting.
+
+Also be aware that depending on the metrics system you use, the names are likely to be translated into a different format that is compatible with the system.  Typically this means that the ':' separating character will be replaced with a '.' character.
+
+Most metrics should have the units that they are reported in as a part of the description.  For Timers often this is configured by the reporter that is uploading them to your system.  Pay attention because even if the metric name has a time unit in it, it may be false.
+
+Also most metrics, except for gauges and counters, are a collection of numbers, and not a single value.  Often these result in multiple metrics being uploaded to a reporting system, such as percentiles for a histogram, or rates for a meter.  It is dependent on the configured metrics reporter how this happens, or how the name here corresponds to the metric in your reporting system.
+
+## Cluster Metrics (From Nimbus)
+
+These are metrics that come from the active nimbus instance and report the state of the cluster as a whole, as seen by nimbus.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a leader. This should really only ever be 1 in a health cluster, or 0 for a short period of time while a failover happens. |
+| cluster:num-nimbuses | gauge | Number of nimbuses, leader or standby. |
+| cluster:num-supervisors | gauge | Number of supervisors. |
+| cluster:num-topologies | gauge | Number of topologies. |
+| cluster:num-total-used-workers | gauge | Number of used workers/slots. |
+| cluster:num-total-workers | gauge | Number of workers/slots. |
+| cluster:total-fragmented-cpu-non-negative | gauge | Total fragmented CPU (% of core).  This is CPU that the system thinks it cannot use because other resources on the node are used up. |
+| cluster:total-fragmented-memory-non-negative | gauge | Total fragmented memory (MB).  This is memory that the system thinks it cannot use because other resources on the node are used up.  |
+| topologies:assigned-cpu | histogram | CPU scheduled per topology (% of a core) |
+| topologies:assigned-mem-off-heap | histogram | Off heap memory scheduled per topology (MB) |
+| topologies:assigned-mem-on-heap | histogram | On heap memory scheduled per topology (MB) |
+| topologies:num-executors | histogram | Number of executors per topology. |
+| topologies:num-tasks | histogram | Number of tasks per topology. |
+| topologies:num-workers | histogram | Number of workers per topology. |
+| topologies:replication-count | histogram | Replication count per topology. |
+| topologies:requested-cpu | histogram | CPU requested per topology  (% of a core). |
+| topologies:requested-mem-off-heap | histogram | Off heap memory requested per topology (MB). |
+| topologies:requested-mem-on-heap | histogram | On heap memory requested per topology (MB). |
+| topologies:uptime-secs | histogram | Uptime per topology (seconds). |
+| nimbus:available-cpu-non-negative | gauge | Available cpu on the cluster (% of a core). |
+| nimbus:total-cpu | gauge | total CPU on the cluster (% of a core) |
+| nimbus:total-memory | gauge | total memory on the cluster MB |
+| supervisors:fragmented-cpu | histogram | fragmented cpu per supervisor (% of a core) |
+| supervisors:fragmented-mem | histogram | fragmented memory per supervisor (MB) |
+| supervisors:num-used-workers | histogram | workers used per supervisor |
+| supervisors:num-workers | histogram | number of workers per supervisor |
+| supervisors:uptime-secs | histogram | uptime of supervisors |
+| supervisors:used-cpu | histogram | cpu used per supervisor (% of a core) |
+| supervisors:used-mem | histogram | memory used per supervisor MB |
+
+## Nimbus Metrics
+
+These are metrics that are specific to a nimbus instance.  In many instances only the active nimbus will be reporting these metrics, but they could come from standby nimbus instances as well.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| nimbus:files-upload-duration-ms | timer | Time it takes to upload a file from start to finish (Not Blobs, but this may change) |
+| nimbus:longest-scheduling-time-ms | gauge | Longest time ever taken so far to schedule. This includes the current scheduling run, which is intended to detect if scheduling is stuck for some reason. |
+| nimbus:num-activate-calls | meter | calls to the activate thrift method. |
+| nimbus:num-added-executors-per-scheduling | histogram | number of executors added after a scheduling run. |
+| nimbus:num-added-slots-per-scheduling | histogram |  number of slots added after a scheduling run. |
+| nimbus:num-beginFileUpload-calls | meter | calls to the beginFileUpload thrift method. |
+| nimbus:num-blacklisted-supervisor | gauge | Number of supervisors currently marked as blacklisted because they appear to be somewhat unstable. |
+| nimbus:num-deactivate-calls | meter | calls to deactivate thrift method. |
+| nimbus:num-debug-calls | meter | calls to debug thrift method.|
+| nimbus:num-downloadChunk-calls | meter | calls to downloadChunk thrift method. |
+| nimbus:num-finishFileUpload-calls | meter | calls to finishFileUpload thrift method.|
+| nimbus:num-gained-leadership | meter | number of times this nimbus gained leadership. |
+| nimbus:num-getClusterInfo-calls | meter | calls to getClusterInfo thrift method. |
+| nimbus:num-getComponentPageInfo-calls | meter | calls to getComponentPageInfo thrift method. |
+| nimbus:num-getComponentPendingProfileActions-calls | meter | calls to getComponentPendingProfileActions thrift method. |
+| nimbus:num-getLeader-calls | meter | calls to getLeader thrift method. |
+| nimbus:num-getLogConfig-calls | meter | calls to getLogConfig thrift method. |
+| nimbus:num-getNimbusConf-calls | meter | calls to getNimbusConf thrift method. |
+| nimbus:num-getOwnerResourceSummaries-calls | meter | calls to getOwnerResourceSummaries thrift method. |
+| nimbus:num-getSupervisorPageInfo-calls | meter | calls to getSupervisorPageInfo thrift method. |
+| nimbus:num-getTopology-calls | meter | calls to getTopology thrift method. |
+| nimbus:num-getTopologyConf-calls | meter | calls to getTopologyConf thrift method. |
+| nimbus:num-getTopologyInfo-calls | meter | calls to getTopologyInfo thrift method. |
+| nimbus:num-getTopologyInfoWithOpts-calls | meter | calls to getTopologyInfoWithOpts thrift method includes calls to getTopologyInfo. |
+| nimbus:num-getTopologyPageInfo-calls | meter | calls to getTopologyPageInfo thrift method. |
+| nimbus:num-getUserTopology-calls | meter | calls to getUserTopology thrift method. |
+| nimbus:num-isTopologyNameAllowed-calls | meter | calls to isTopologyNameAllowed thrift method. |
+| nimbus:num-killTopology-calls | meter | calls to killTopology thrift method. |
+| nimbus:num-killTopologyWithOpts-calls | meter | calls to killTopologyWithOpts thrift method includes calls to killTopology. |
+| nimbus:num-launched | meter | number of times a nimbus was launched |
+| nimbus:num-lost-leadership | meter | number of times this nimbus lost leadership |
+| nimbus:num-negative-resource-events | meter | any time a resource goes negative (either CPU or Memory)  Not consistent as it is used for internal calculations that may go negative and does not represent over scheduling of resources. |
+| nimbus:num-net-executors-increase-per-scheduling | histogram | added executors minus removed executors after a scheduling run |
+| nimbus:num-net-slots-increase-per-scheduling | histogram | added slots minus removed slots after a scheduling run |
+| nimbus:num-rebalance-calls | meter | calls to rebalance thrift method. |
+| nimbus:num-removed-executors-per-scheduling | histogram | number of executors removed after a scheduling run |
+| nimbus:num-removed-slots-per-scheduling | histogram | number of slots removed after a scheduling run |
+| nimbus:num-setLogConfig-calls | meter | calls to setLogConfig thrift method. |
+| nimbus:num-setWorkerProfiler-calls | meter | calls to setWorkerProfiler thrift method. |
+| nimbus:num-shutdown-calls | meter | times nimbus is shut down (this may not actually be reported as nimbus is in the middle of shutting down) |
+| nimbus:num-submitTopology-calls | meter | calls to submitTopology thrift method. |
+| nimbus:num-submitTopologyWithOpts-calls | meter | calls to submitTopologyWithOpts thrift method includes calls to submitTopology. |
+| nimbus:num-uploadChunk-calls | meter | calls to uploadChunk thrift method. |
+| nimbus:num-uploadNewCredentials-calls | meter | calls to uploadNewCredentials thrift method. |
+| nimbus:process-worker-metric-calls | meter | calls to processWorkerMetrics thrift method. |
+| nimbus:topology-scheduling-duration-ms | timer | time it takes to do a scheduling run. |
+| nimbus:total-available-memory-non-negative | gauge | available memory on the cluster MB |
+| nimbuses:uptime-secs | histogram | uptime of nimbuses |
+| MetricsCleaner:purgeTimestamp | gauge | last time metrics were purged (Unfinished Feature) |
+| RocksDB:metric-failures | meter | generally any failure that happens in the rocksdb metrics store. (Unfinished Feature) |
+
+
+## DRPC Metrics
+
+Metrics related to DRPC servers.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| drpc:HTTP-request-response-duration | timer | how long it takes to execute an http drpc request |
+| drpc:num-execute-calls | meter | calls to execute a DRPC request |
+| drpc:num-execute-http-requests | meter | http requests to the DRPC server |
+| drpc:num-failRequest-calls | meter | calls to failRequest |
+| drpc:num-fetchRequest-calls | meter | calls to fetchRequest |
+| drpc:num-result-calls | meter | calls to returnResult |
+| drpc:num-server-timedout-requests | meter | times a DRPC request timed out without a response |
+| drpc:num-shutdown-calls | meter | number of times shutdown is called on the drpc server |
+
+## Logviewer Metrics
+
+Metrics related to the logviewer process. This process currently also handles cleaning up worker logs when they get too large or too old.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| logviewer:cleanup-routine-duration-ms | timer | how long it takes to run the log cleanup routine |
+| logviewer:deep-search-request-duration-ms | timer | how long it takes for /deepSearch/{topoId} |
+| logviewer:disk-space-freed-in-bytes | histogram | number of bytes cleaned up each time through the cleanup routine. |
+| logviewer:download-file-size-rounded-MB | histogram | size in MB of files being downloaded |
+| logviewer:num-daemonlog-page-http-requests | meter | calls to /daemonlog |
+| logviewer:num-deep-search-no-result | meter | number of deep search requests that did not return any results |
+| logviewer:num-deep-search-requests-with-archived | meter | calls to /deepSearch/{topoId} with ?search-archived=true |
+| logviewer:num-deep-search-requests-without-archived | meter | calls to /deepSearch/{topoId} with ?search-archived=false |
+| logviewer:num-download-daemon-log-exceptions | meter | num errors in calls to /daemondownload |
+| logviewer:num-download-dump-exceptions | meter | num errors in calls to /dumps/{topo-id}/{host-port}/{filename} |
+| logviewer:num-download-log-daemon-file-http-requests | meter | calls to /daemondownload |
+| logviewer:num-download-log-exceptions | meter | num errors in calls to /download |
+| logviewer:num-download-log-file-http-requests | meter | calls to /download |
+| logviewer:num-file-download-exceptions | meter | errors while trying to download files. |
+| logviewer:num-file-download-exceptions | meter | number of exceptions trying to download a log file |
+| logviewer:num-file-open-exceptions | meter | errors trying to open a file (when deleting logs) |
+| logviewer:num-file-open-exceptions | meter | number of exceptions trying to open a log file for serving |
+| logviewer:num-file-read-exceptions | meter | number of exceptions trying to read from a log file for serving |
+| logviewer:num-file-removal-exceptions | meter | number of exceptions trying to cleanup files. |
+| logviewer:num-files-cleaned-up | histogram | number of files cleaned up each time through the cleanup routine. |
+| logviewer:num-files-scanned-per-deep-search | histogram | number of files scanned per deep search |
+| logviewer:num-list-dump-files-exceptions | meter | num errors in calls to /dumps/{topo-id}/{host-port} |
+| logviewer:num-list-logs-http-request | meter | calls to /listLogs |
+| logviewer:num-log-page-http-requests | meter | calls to /log |
+| logviewer:num-other-cleanup-exceptions | meter | number of exception in the cleanup loop, not directly deleting files. |
+| logviewer:num-page-read | meter | number of pages (parts of a log file) that are served up |
+| logviewer:num-read-daemon-log-exceptions | meter | num errors in calls to /daemonlog |
+| logviewer:num-read-log-exceptions | meter | num errors in calls to /log |
+| logviewer:num-search-exceptions | meter | num errors in calls to /search |
+| logviewer:num-search-log-exceptions | meter | num errors in calls to /listLogs |
+| logviewer:num-search-logs-requests | meter | calls to /search |
+| logviewer:num-search-request-no-result | meter | number of regular search results that were empty |
+| logviewer:num-set-permission-exceptions | meter | num errors running set permissions to open up files for reading. |
+| logviewer:num-shutdown-calls | meter | number of times shutdown was called on the logviewer |
+| logviewer:search-requests-duration-ms | timer | how long it takes for /search |
+| logviewer:worker-log-dir-size | gauge | size in bytes of the worker logs directory. |
+
+## Supervisor Metrics
+
+Metrics associated with the supervisor, which launches the workers for a topology.  The supervisor also has a state machine for each slot.  Some of the metrics are associated with that state machine and can be confusing if you do not understand the state machine.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| supervisor:blob-cache-update-duration | timer | how long it takes to update all of the blobs in the cache (frequently just check if they have changed, but may also include downloading them.) |
+| supervisor:blob-fetching-rate-MB/s | histogram | Download rate of a blob in MB/sec.  Blobs are downloaded rarely so it is very bursty. |
+| supervisor:blob-localization-duration | timer | Approximately how long it takes to get the blob we want after it is requested. |
+| supervisor:current-reserved-memory-mb | gauge | total amount of memory reserved for workers on the supervisor (MB) |
+| supervisor:current-used-memory-mb | gauge | memory currently used as measured by the supervisor (this typically requires cgroups) (MB) |
+| supervisor:num-blob-update-version-changed | meter | number of times a version of a blob changes. |
+| supervisor:num-cleanup-exceptions | meter | exceptions thrown during container cleanup. |
+| supervisor:num-force-kill-exceptions | meter | exceptions thrown during force kill. |
+| supervisor:num-kill-exceptions | meter | exceptions thrown during kill. |
+| supervisor:num-launched | meter | number of times the supervisor is launched. |
+| supervisor:num-shell-exceptions | meter | number of exceptions calling shell commands. |
+| supervisor:num-slots-used-gauge | gauge | number of slots used on the supervisor. |
+| supervisor:num-worker-transitions-into-empty | meter | number of transitions into empty state. |
+| supervisor:num-worker-transitions-into-kill | meter | number of transitions into kill state. |
+| supervisor:num-worker-transitions-into-kill-and-relaunch | meter | number of transitions into kill-and-relaunch state |
+| supervisor:num-worker-transitions-into-kill-blob-update | meter | number of transitions into kill-blob-update state |
+| supervisor:num-worker-transitions-into-running | meter | number of transitions into running state |
+| supervisor:num-worker-transitions-into-waiting-for-blob-localization | meter | number of transitions into waiting-for-blob-localization state |
+| supervisor:num-worker-transitions-into-waiting-for-blob-update | meter | number of transitions into waiting-for-blob-update state |
+| supervisor:num-worker-transitions-into-waiting-for-worker-start | meter | number of transitions into waiting-for-worker-start state |
+| supervisor:num-workers-force-kill | meter | number of times a worker was force killed.  This may mean that the worker did not exit cleanly/quickly. |
+| supervisor:num-workers-killed-assignment-changed | meter | workers killed because the assignment changed. |
+| supervisor:num-workers-killed-blob-changed | meter | workers killed because the blob changed and they needed to be relaunched. |
+| supervisor:num-workers-killed-hb-null | meter | workers killed because there was no hb at all from the worker. This would typically only happen when a worker is launched for the first time. |
+| supervisor:num-workers-killed-hb-timeout | meter | workers killed because the hb from the worker was too old.  This often happens because of GC issues in the worker that prevents it from sending a heartbeat, but could also mean the worker process exited, and the supervisor is not the parent of the process to know that it exited. |
+| supervisor:num-workers-killed-memory-violation | meter | workers killed because the worker was using too much memory.  If the supervisor can monitor memory usage of the worker (typically through cgroups) and the worker goes over the limit it may be shot. |
+| supervisor:num-workers-killed-process-exit | meter | workers killed because the process exited and the supervisor was the parent process |
+| supervisor:num-workers-launched | meter | number of workers launched |
+| supervisor:single-blob-localization-duration | timer | how long it takes for a blob to be updated (downloaded, unzipped, inform slots, and make the move) |
+| supervisor:time-worker-spent-in-state-empty-ms | timer | time spent in empty state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-kill-and-relaunch-ms | timer | time spent in kill-and-relaunch state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-kill-blob-update-ms | timer | time spent in kill-blob-update state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-kill-ms | timer | time spent in kill state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-running-ms | timer | time spent in running state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-waiting-for-blob-localization-ms | timer | time spent in waiting-for-blob-localization state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-waiting-for-blob-update-ms | timer | time spent in waiting-for-blob-update state as it transitions out. Not necessarily in ms. |
+| supervisor:time-worker-spent-in-state-waiting-for-worker-start-ms | timer | time spent in waiting-for-worker-start state as it transitions out. Not necessarily in ms. |
+| supervisor:worker-launch-duration | timer | Time taken for a worker to launch. |
+| supervisor:worker-per-call-clean-up-duration-ns | meter | how long it takes to cleanup a worker (ns). |
+| supervisor:worker-shutdown-duration-ns | meter | how long it takes to shutdown a worker (ns). |
+
+
+## UI Metrics
+
+Metrics associated with a single UI daemon.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| ui:num-activate-topology-http-requests | meter | calls to /topology/{id}/activate |
+| ui:num-all-topologies-summary-http-requests | meter | calls to /topology/summary |
+| ui:num-build-visualization-http-requests | meter | calls to /topology/{id}/visualization |
+| ui:num-cluster-configuration-http-requests | meter | calls to /cluster/configuration |
+| ui:num-cluster-summary-http-requests | meter | calls to /cluster/summary |
+| ui:num-component-op-response-http-requests | meter | calls to /topology/{id}/component/{component}/debug/{action}/{spct} |
+| ui:num-component-page-http-requests | meter | calls to /topology/{id}/component/{component} |
+| ui:num-deactivate-topology-http-requests | meter | calls to topology/{id}/deactivate |
+| ui:num-debug-topology-http-requests | meter | calls to /topology/{id}/debug/{action}/{spct} |
+| ui:num-get-owner-resource-summaries-http-request | meter | calls to /owner-resources or /owner-resources/{id} |
+| ui:num-log-config-http-requests | meter | calls to /topology/{id}/logconfig |
+| ui:num-main-page-http-requests | meter | number of requests to /index.html |
+| ui:num-mk-visualization-data-http-requests | meter | calls to /topology/{id}/visualization-init |
+| ui:num-nimbus-summary-http-requests | meter | calls to /nimbus/summary |
+| ui:num-supervisor-http-requests | meter | calls to /supervisor |
+| ui:num-supervisor-summary-http-requests | meter | calls to /supervisor/summary |
+| ui:num-topology-lag-http-requests | meter | calls to /topology/{id}/lag |
+| ui:num-topology-metric-http-requests | meter | calls to /topology/{id}/metrics |
+| ui:num-topology-op-response-http-requests | meter | calls to /topology/{id}/logconfig or /topology/{id}/rebalance/{wait-time} or /topology/{id}/kill/{wait-time} |
+| ui:num-topology-page-http-requests | meter | calls to /topology/{id} |
+| num-web-requests | meter | nominally the total number of web requests being made. |
+
+## Pacemaker Metrics (Deprecated)
+
+The pacemaker process is deprecated and only still exists for backwards compatibility.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| pacemaker:get-pulse=count | meter | number of times getPulse was called.  yes the = is in the name, but typically this is mapped to a '-' by the metrics reporters. |
+| pacemaker:heartbeat-size | histogram | size in bytes of heartbeats |
+| pacemaker:send-pulse-count | meter | number of times sendPulse was called |
+| pacemaker:size-total-keys | gauge | total number of keys in this pacemaker instance |
+| pacemaker:total-receive-size | meter | total size in bytes of heartbeats received |
+| pacemaker:total-sent-size | meter | total size in bytes of heartbeats read |

http://git-wip-us.apache.org/repos/asf/storm/blob/0db99bcb/docs/Metrics.md
----------------------------------------------------------------------
diff --git a/docs/Metrics.md b/docs/Metrics.md
index 09ed2fc..e92f5db 100644
--- a/docs/Metrics.md
+++ b/docs/Metrics.md
@@ -6,6 +6,8 @@ documentation: true
 Storm exposes a metrics interface to report summary statistics across the full topology.
 The numbers you see on the UI come from some of these built in metrics, but are reported through the worker heartbeats instead of through the IMetricsConsumer described below.
 
+If you are looking for cluster wide monitoring please see [Cluster Metrics](ClusterMetrics.html).
+
 ### Metric Types
 
 Metrics have to implement [`IMetric`]({{page.git-blob-base}}/storm-client/src/jvm/org/apache/storm/metric/api/IMetric.java) which contains just one method, `getValueAndReset` -- do any remaining work to find the summary value, and reset back to an initial state. For example, the MeanReducer divides the running total by its running count to find the mean, then initializes both values back to zero.

http://git-wip-us.apache.org/repos/asf/storm/blob/0db99bcb/docs/index.md
----------------------------------------------------------------------
diff --git a/docs/index.md b/docs/index.md
index 2cd741f..17b0a7f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -65,7 +65,7 @@ But small change will not affect the user experience. We will notify the user wh
 * [CGroup Enforcement](cgroups_in_storm.html)
 * [Pacemaker reduces load on zookeeper for large clusters](Pacemaker.html)
 * [Resource Aware Scheduler](Resource_Aware_Scheduler_overview.html)
-* [Daemon Metrics/Monitoring](storm-metrics-profiling-internal-actions.html)
+* [Daemon Metrics/Monitoring](ClusterMetrics.html)
 * [Windows users guide](windows-users-guide.html)
 * [Classpath handling](Classpath-handling.html)
 

http://git-wip-us.apache.org/repos/asf/storm/blob/0db99bcb/docs/storm-metrics-profiling-internal-actions.md
----------------------------------------------------------------------
diff --git a/docs/storm-metrics-profiling-internal-actions.md b/docs/storm-metrics-profiling-internal-actions.md
deleted file mode 100644
index b128365..0000000
--- a/docs/storm-metrics-profiling-internal-actions.md
+++ /dev/null
@@ -1,76 +0,0 @@
----
-title: Storm Metrics for Profiling Various Storm Internal Actions
-layout: documentation
-documentation: true
----
-
-# Storm Metrics for Profiling Various Storm Internal Actions
-
-With the addition of these metrics, Storm users can collect, view, and analyze the performance of various internal actions.  The actions that are profiled include thrift rpc calls and http quests within the storm daemons. For instance, in the Storm Nimbus daemon, the following thrift calls defined in the Nimbus$Iface are profiled:
-
-- submitTopology
-- submitTopologyWithOpts
-- killTopology
-- killTopologyWithOpts
-- activate
-- deactivate
-- rebalance
-- setLogConfig
-- getLogConfig
-
-Various HTTP GET and POST requests are marked for profiling as well such as the GET and POST requests for the Storm UI daemon (ui/core.cj)
-To implement these metrics the following packages are used: 
-- io.dropwizard.metrics
-- metrics-clojure
-
-## How it works
-
-By using packages io.dropwizard.metrics and metrics-clojure (clojure wrapper for the metrics Java API), we can mark functions to profile by declaring (defmeter num-some-func-calls) and then adding the (mark! num-some-func-calls) to where the function is invoked. For example:
-
-    (defmeter num-some-func-calls)
-    (defn some-func [args]
-        (mark! num-some-func-calls)
-        (body))
-        
-What essentially the mark! API call does is increment a counter that represents how many times a certain action occured.  For instantanous measurements user can use gauges.  For example: 
-
-    (defgauge nimbus:num-supervisors
-         (fn [] (.size (.supervisors (:storm-cluster-state nimbus) nil))))
-         
-The above example will get the number of supervisors in the cluster.  This metric is not accumlative like one previously discussed.
-
-A metrics reporting server needs to also be activated to collect the metrics. You can do this by calling the following function:
-
-    (defn start-metrics-reporters []
-        (jmx/start (jmx/reporter {})))
-
-## How to collect the metrics
-
-Metrics can be reported via JMX or HTTP.  A user can use JConsole or VisualVM to connect to the jvm process and view the stats.
-
-To view the metrics in a GUI use VisualVM or JConsole.  Screenshot of using VisualVm for metrics: 
-
-![Viewing metrics with VisualVM](images/viewing_metrics_with_VisualVM.png)
-
-For detailed information regarding how to collect the metrics please reference: 
-
-https://dropwizard.github.io/metrics/3.1.0/getting-started/
-
-If you want use JMX and view metrics through JConsole or VisualVM, remember launch JVM processes your want to profile with the correct JMX configurations.  For example in Storm you would add the following to conf/storm.yaml
-
-    nimbus.childopts: "-Xmx1024m -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=3333  -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
-    
-    ui.childopts: "-Xmx768m -Dcom.sun.management.jmxremote.port=3334 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
-    
-    logviewer.childopts: "-Xmx128m -Dcom.sun.management.jmxremote.port=3335 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
-    
-    drpc.childopts: "-Xmx768m -Dcom.sun.management.jmxremote.port=3336 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
-   
-    supervisor.childopts: "-Xmx256m -Dcom.sun.management.jmxremote.port=3337 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
-
-### Please Note:
-Since we shade all of the packages we use, additional plugins for collecting metrics might not work at this time.  Currently collecting the metrics via JMX is supported.
-   
-For more information about io.dropwizard.metrics and metrics-clojure packages please reference their original documentation:
-- https://dropwizard.github.io/metrics/3.1.0/
-- http://metrics-clojure.readthedocs.org/en/latest/

http://git-wip-us.apache.org/repos/asf/storm/blob/0db99bcb/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java
----------------------------------------------------------------------
diff --git a/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java b/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java
index 5b65d72..d66c0ef 100644
--- a/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java
+++ b/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java
@@ -248,7 +248,6 @@ public class Nimbus implements Iface, Shutdownable, DaemonCommon {
     private final Meter beginFileUploadCalls;
     private final Meter uploadChunkCalls;
     private final Meter finishFileUploadCalls;
-    private final Meter beginFileDownloadCalls;
     private final Meter downloadChunkCalls;
     private final Meter getNimbusConfCalls;
     private final Meter getLogConfigCalls;
@@ -263,7 +262,6 @@ public class Nimbus implements Iface, Shutdownable, DaemonCommon {
     private final Meter getTopologyPageInfoCalls;
     private final Meter getSupervisorPageInfoCalls;
     private final Meter getComponentPageInfoCalls;
-    private final Histogram scheduleTopologyTimeMs;
     private final Meter getOwnerResourceSummariesCalls;
     private final Meter shutdownCalls;
     private final Meter processWorkerMetricsCalls;
@@ -494,7 +492,6 @@ public class Nimbus implements Iface, Shutdownable, DaemonCommon {
         this.beginFileUploadCalls = metricsRegistry.registerMeter("nimbus:num-beginFileUpload-calls");
         this.uploadChunkCalls = metricsRegistry.registerMeter("nimbus:num-uploadChunk-calls");
         this.finishFileUploadCalls = metricsRegistry.registerMeter("nimbus:num-finishFileUpload-calls");
-        this.beginFileDownloadCalls = metricsRegistry.registerMeter("nimbus:num-beginFileDownload-calls");
         this.downloadChunkCalls = metricsRegistry.registerMeter("nimbus:num-downloadChunk-calls");
         this.getNimbusConfCalls = metricsRegistry.registerMeter("nimbus:num-getNimbusConf-calls");
         this.getLogConfigCalls = metricsRegistry.registerMeter("nimbus:num-getLogConfig-calls");
@@ -510,7 +507,6 @@ public class Nimbus implements Iface, Shutdownable, DaemonCommon {
         this.getTopologyPageInfoCalls = metricsRegistry.registerMeter("nimbus:num-getTopologyPageInfo-calls");
         this.getSupervisorPageInfoCalls = metricsRegistry.registerMeter("nimbus:num-getSupervisorPageInfo-calls");
         this.getComponentPageInfoCalls = metricsRegistry.registerMeter("nimbus:num-getComponentPageInfo-calls");
-        this.scheduleTopologyTimeMs = metricsRegistry.registerHistogram("nimbus:time-scheduleTopology-ms");
         this.getOwnerResourceSummariesCalls = metricsRegistry.registerMeter(
             "nimbus:num-getOwnerResourceSummaries-calls");
         this.shutdownCalls = metricsRegistry.registerMeter("nimbus:num-shutdown-calls");


[3/3] storm git commit: STORM-3234: Addressed review comments

Posted by bo...@apache.org.
STORM-3234: Addressed review comments


Project: http://git-wip-us.apache.org/repos/asf/storm/repo
Commit: http://git-wip-us.apache.org/repos/asf/storm/commit/9d083fa9
Tree: http://git-wip-us.apache.org/repos/asf/storm/tree/9d083fa9
Diff: http://git-wip-us.apache.org/repos/asf/storm/diff/9d083fa9

Branch: refs/heads/master
Commit: 9d083fa919e0000d9b74938b80459c8833b116b3
Parents: 5334e0c
Author: Robert Evans <ev...@yahoo-inc.com>
Authored: Tue Sep 25 11:53:50 2018 -0500
Committer: Robert Evans <ev...@yahoo-inc.com>
Committed: Tue Sep 25 11:53:50 2018 -0500

----------------------------------------------------------------------
 docs/ClusterMetrics.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/storm/blob/9d083fa9/docs/ClusterMetrics.md
----------------------------------------------------------------------
diff --git a/docs/ClusterMetrics.md b/docs/ClusterMetrics.md
index d32d0b1..17c4ca1 100644
--- a/docs/ClusterMetrics.md
+++ b/docs/ClusterMetrics.md
@@ -20,7 +20,7 @@ These are metrics that come from the active nimbus instance and report the state
 
 | Metric Name | Type | Description |
 |-------------|------|-------------|
-| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a leader. This should really only ever be 1 in a health cluster, or 0 for a short period of time while a failover happens. |
+| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a leader. This should really only ever be 1 in a healthy cluster, or 0 for a short period of time while a failover happens. |
 | cluster:num-nimbuses | gauge | Number of nimbuses, leader or standby. |
 | cluster:num-supervisors | gauge | Number of supervisors. |
 | cluster:num-topologies | gauge | Number of topologies. |
@@ -87,7 +87,7 @@ These are metrics that are specific to a nimbus instance.  In many instances onl
 | nimbus:num-killTopologyWithOpts-calls | meter | calls to killTopologyWithOpts thrift method includes calls to killTopology. |
 | nimbus:num-launched | meter | number of times a nimbus was launched |
 | nimbus:num-lost-leadership | meter | number of times this nimbus lost leadership |
-| nimbus:num-negative-resource-events | meter | any time a resource goes negative (either CPU or Memory)  Not consistent as it is used for internal calculations that may go negative and does not represent over scheduling of resources. |
+| nimbus:num-negative-resource-events | meter | Any time a resource goes negative (either CPU or Memory).  This metric is not ideal as it is measured in a data structure that is used for internal calculations that may go negative and not actually represent over scheduling of a resource. |
 | nimbus:num-net-executors-increase-per-scheduling | histogram | added executors minus removed executors after a scheduling run |
 | nimbus:num-net-slots-increase-per-scheduling | histogram | added slots minus removed slots after a scheduling run |
 | nimbus:num-rebalance-calls | meter | calls to rebalance thrift method. |


[2/3] storm git commit: Merge branch 'STORM-3234' of https://github.com/revans2/incubator-storm into STORM-3234

Posted by bo...@apache.org.
Merge branch 'STORM-3234' of https://github.com/revans2/incubator-storm into STORM-3234

STORM-3234: Replace old metrics docs with better documentation.

This closes #2845


Project: http://git-wip-us.apache.org/repos/asf/storm/repo
Commit: http://git-wip-us.apache.org/repos/asf/storm/commit/5334e0c2
Tree: http://git-wip-us.apache.org/repos/asf/storm/tree/5334e0c2
Diff: http://git-wip-us.apache.org/repos/asf/storm/diff/5334e0c2

Branch: refs/heads/master
Commit: 5334e0c2f13e282ff6c4a8e536268816b97f81c6
Parents: d4a5fa1 0db99bc
Author: Robert Evans <ev...@yahoo-inc.com>
Authored: Tue Sep 25 11:50:53 2018 -0500
Committer: Robert Evans <ev...@yahoo-inc.com>
Committed: Tue Sep 25 11:50:53 2018 -0500

----------------------------------------------------------------------
 docs/ClusterMetrics.md                          | 256 +++++++++++++++++++
 docs/Metrics.md                                 |   2 +
 docs/index.md                                   |   2 +-
 .../storm-metrics-profiling-internal-actions.md |  76 ------
 .../org/apache/storm/daemon/nimbus/Nimbus.java  |   4 -
 5 files changed, 259 insertions(+), 81 deletions(-)
----------------------------------------------------------------------