Posted to commits@zookeeper.apache.org by ma...@apache.org on 2021/06/05 10:01:19 UTC

[zookeeper] branch master updated: ZOOKEEPER-3907: add a documentation about alerting on metrics

This is an automated email from the ASF dual-hosted git repository.

maoling pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zookeeper.git


The following commit(s) were added to refs/heads/master by this push:
     new 5e787c5  ZOOKEEPER-3907: add a documentation about alerting on metrics
5e787c5 is described below

commit 5e787c5990091b2d1fc560eba88d3c25b04690a2
Author: maoling <ma...@sina.com>
AuthorDate: Sat Jun 5 18:01:10 2021 +0800

    ZOOKEEPER-3907: add a documentation about alerting on metrics
    
    - more details in the [ZOOKEEPER-3907](https://issues.apache.org/jira/browse/ZOOKEEPER-3907)
    
    Author: maoling <ma...@sina.com>
    
    Reviewers: Enrico Olivelli <eo...@apache.org>
    
    Closes #1425 from maoling/ZOOKEEPER-3907 and squashes the following commits:
    
    27f640f2e [maoling] change title to: Alerting with Prometheus
    95a1269bd [maoling] ZOOKEEPER-3907: add a documentation about alerting on metrics
---
 .../main/resources/markdown/zookeeperMonitor.md    | 120 +++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/zookeeper-docs/src/main/resources/markdown/zookeeperMonitor.md b/zookeeper-docs/src/main/resources/markdown/zookeeperMonitor.md
index 83dfb6b..eb50a04 100644
--- a/zookeeper-docs/src/main/resources/markdown/zookeeperMonitor.md
+++ b/zookeeper-docs/src/main/resources/markdown/zookeeperMonitor.md
@@ -19,6 +19,7 @@ limitations under the License.
 * [New Metrics System](#Metrics-System)
     * [Metrics](#Metrics)
     * [Prometheus](#Prometheus)
+    * [Alerting with Prometheus](#Alerting)
     * [Grafana](#Grafana)
     * [InfluxDB](#influxdb)
 
@@ -73,6 +74,125 @@ All the metrics are included in the `ServerMetrics.java`.
 
 - Now Prometheus will scrape zk metrics every 10 seconds.
 
+<a name="Alerting"></a>
+
+### Alerting with Prometheus
+- We recommend that you read the [Prometheus Official Alerting Page](https://prometheus.io/docs/practices/alerting/) to learn
+  the general principles of alerting.
+
+- We recommend that you use [Prometheus Alertmanager](https://www.prometheus.io/docs/alerting/latest/alertmanager/), which
+  helps users receive alerting emails or instant messages (via webhook) in a more convenient way; a minimal configuration
+  sketch is shown after the rules example below.
+
+- We provide an alerting example covering metrics that deserve special attention. Note: the rules below are for reference only;
+  adjust the thresholds to your actual workload and resource environment.
+
+
+        # Validate the rules file with: ./promtool check rules rules/zk.yml
+        # Contents of rules/zk.yml:
+
+        groups:
+        - name: zk-alert-example
+          rules:
+          - alert: ZooKeeper server is down
+            expr:  up == 0
+            for: 1m
+            labels:
+              severity: critical
+            annotations:
+              summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
+              description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]."
+
+          - alert: too many znodes
+            expr: znode_count > 1000000
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} create too many znodes"
+              description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]."
+
+          - alert: too many connections
+            expr: num_alive_connections > 50 # assuming the default maxClientCnxns of 60
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} create too many connections"
+              description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]."
+
+          - alert: znode total occupied memory is too big
+            expr: approximate_data_size / 1024 / 1024 > 1 * 1024 # more than 1024 MB (1 GB)
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
+              description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB."
+
+          - alert: too many watches
+            expr: watch_count > 10000
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} set too many watch"
+              description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]."
+
+          - alert: a leader election happened
+            expr: increase(election_time_count[5m]) > 0
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} a leader election happens"
+              description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]."
+
+          - alert: too many open files
+            expr: open_file_descriptor_count > 300
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} open too many files"
+              description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]."
+
+          - alert: fsync time is too long
+            expr: rate(fsynctime_sum[1m]) > 100
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} fsync time is too long"
+              description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]."
+
+          - alert: snapshot time is too long
+            expr: rate(snapshottime_sum[5m]) > 100
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} take snapshot time is too long"
+              description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]."
+
+          - alert: avg latency is too high
+            expr: avg_latency > 100
+            for: 1m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Instance {{ $labels.instance }} avg latency is too high"
+              description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]."
+
+          - alert: JvmMemoryFillingUp
+            expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
+            for: 5m
+            labels:
+              severity: warning
+            annotations:
+              summary: "JVM memory filling up (instance {{ $labels.instance }})"
+              description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }}  value = {{ $value }}\n"
+
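+- To make Prometheus load these rules and send the firing alerts to Alertmanager, reference the rules file and an
+  Alertmanager instance from `prometheus.yml`. The snippet below is only a minimal sketch: the Alertmanager address
+  `localhost:9093` is a placeholder to be replaced with your own value.
+
+
+        # prometheus.yml (fragment)
+        # load the alerting rules defined above
+        rule_files:
+          - "rules/zk.yml"
+
+        # forward firing alerts to a local Alertmanager (placeholder address)
+        alerting:
+          alertmanagers:
+            - static_configs:
+                - targets: ["localhost:9093"]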
+
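+- On the Alertmanager side, a minimal `alertmanager.yml` sketch such as the one below routes every alert to a single
+  receiver that sends an email and calls a webhook. This is for illustration only: all addresses, credentials and URLs
+  are placeholders that you need to adapt to your own environment.
+
+
+        # alertmanager.yml (fragment); every address, credential and URL below is a placeholder
+        route:
+          receiver: 'zk-admins'
+
+        receivers:
+          - name: 'zk-admins'
+            email_configs:
+              - to: 'zk-admins@example.com'
+                from: 'alertmanager@example.com'
+                smarthost: 'smtp.example.com:587'
+                auth_username: 'alertmanager@example.com'
+                auth_password: 'changeme'
+            webhook_configs:
+              - url: 'http://example.com/zookeeper-alert-hook'
+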
 <a name="Grafana"></a>
 
 ### Grafana