Posted to issues@solr.apache.org by "janhoy (via GitHub)" <gi...@apache.org> on 2023/03/14 14:25:35 UTC

[GitHub] [solr] janhoy commented on a diff in pull request #395: SOLR-15767: Prometheus alert rules for monitoring SolrCloud clusters on Kubernetes

janhoy commented on code in PR #395:
URL: https://github.com/apache/solr/pull/395#discussion_r1135552528


##########
solr/prometheus-exporter/conf/solr-alert-rules.yaml:
##########
@@ -0,0 +1,243 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Use with the Prometheus stack (https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
+# to monitor SolrCloud clusters on Kubernetes. Prior to importing these rules into your K8s cluster, you'll need to
+# adjust the various thresholds for each rule for your specific use case(s). Moreover, you should set the "for" interval
+# for each rule based on how aggressive you want alerts to be raised. Lastly, you'll have to configure the "receivers",
+# e.g. Slack or PagerDuty, for the alerts using alertmanager.
+
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    prometheus: k8s
+    role: alert-rules
+  name: solr-alert-rules
+spec:
+  groups:
+  - name: SolrQuery
+    rules:
+    - alert: SolrHighQueryLatencyP95
+      annotations:
+        description: High latency (p95 > {{ $value }}ms) for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection) (solr_metrics_core_query_p95_ms) > 100
+      for: 5m
+      labels:
+        severity: major
+        impact: query performance
+    - alert: SolrHighQPSPerCore
+      annotations:
+        description: QPS (1-min rate) for {{ $labels.core }} in {{ $labels.shard }} for collection {{ $labels.collection }} is {{ $value }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection,shard,core) (solr_metrics_core_query_1minRate) > 50
+      for: 5m
+      labels:
+        severity: major
+        impact: query performance
+        workflow: solr-handle-qps
+    - alert: SolrReplicaNotActive
+      annotations:
+        description: Replica {{ $labels.replica }} for {{ $labels.shard }} for collection {{ $labels.collection }} is not active
+        runbook_url: link_to_runbook_for_this_problem
+      expr: count by (namespace, collection, shard, replica) (solr_collections_replica_state{state!="active"}) > 0
+      for: 10m
+      labels:
+        severity: major
+        impact: collection health
+    - alert: SolrReplicaLost
+      annotations:
+        description: Replica {{ $labels.replica }} for {{ $labels.shard }} for collection {{ $labels.collection }} on node {{ $labels.base_url }} was lost! 
+        runbook_url: link_to_runbook_for_this_problem
+      expr: count(solr_collections_replica_state{state="active"} offset 10m) by (namespace, collection, shard, replica, base_url) unless count(solr_collections_replica_state{state="active"}) by (namespace, collection, shard, replica, base_url)
+      for: 10m
+      labels:
+        severity: major
+        impact: collection health
+    - alert: SolrNoActiveReplicaForShard
+      annotations:
+        description: No active replicas for {{ $labels.shard }} for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: count by (namespace, collection, shard) (solr_collections_replica_state{state="active"}) < 1
+      for: 5m
+      labels:
+        severity: critical
+        impact: collection health
+    - alert: SolrCoreQueryErrors
+      annotations:
+        description: Too many errors from {{ $labels.core }} in {{ $labels.shard }} for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection,shard,core) (solr_metrics_core_query_errors_1minRate{searchHandler="/select"}) > 10
+      for: 5m
+      labels:
+        severity: warning
+        impact: core health
+    - alert: SolrUnbalancedQueryLoad
+      annotations:
+        description: Unbalanced query load ({{ $value }}% of total requests) sent to Solr pod {{ $labels.base_url }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: 100 * (sum by(namespace,base_url) (solr_metrics_core_query_1minRate) / ignoring(namespace,base_url) group_left() sum(solr_metrics_core_query_1minRate) > ignoring(namespace,base_url) group_left() max(1 / solr_collections_live_nodes) + 0.1)
+      for: 10m
+      labels:
+        severity: warning
+        impact: query performance
+
+  - name: SolrIndexing
+    rules:
+    - alert: SolrNoLeaderForShard
+      annotations:
+        description: No leader for {{ $labels.shard }} for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection, shard) (solr_collections_shard_leader) < 1
+      for: 2m
+      labels:
+        severity: critical
+        impact: collection health
+    - alert: SolrSlowRecovery
+      annotations:
+        description: Slow recovery for replica {{ $labels.replica }} for {{ $labels.shard }} for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: count by (namespace,collection, shard, replica) (solr_collections_replica_state{state="recovering"}) > 0
+      for: 15m
+      labels:
+        severity: major
+        impact: collection health
+    - alert: SolrHighCommitFrequency
+      annotations:
+        description: High commit rate for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection) (rate(solr_metrics_core_update_handler_commits_total[1m])) > 3
+      for: 5m
+      labels:
+        severity: major
+        impact: indexing
+    - alert: SolrUpdateErrors
+      annotations:
+        description: High update error rate (1-minute window) for collection {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection) (rate(solr_metrics_core_update_handler_errors_total[1m])) >= 20
+      for: 5m
+      labels:
+        severity: major
+        impact: indexing
+    - alert: SolrSlowCacheWarmup
+      annotations:
+        description: Slow {{ $labels.type }} warm-up time for core {{ $labels.core }} in {{ $labels.shard }} in {{ $labels.collection }}
+        runbook_url: link_to_runbook_for_this_problem
+      expr: max by (namespace,collection, shard, core, type) (solr_metrics_core_searcher_warmup_time_seconds{item="warmupTime"}) > 10
+      for: 5m
+      labels:
+        severity: major
+        impact: indexing
+
+  - name: SolrNodeHealth
+    rules:
+    - alert: SolrAvailableDiskVeryLow
+      expr: min by (namespace,base_url) (solr_metrics_core_fs_bytes{item="usableSpace"}) / max by (namespace,base_url) (solr_metrics_core_fs_bytes{item="totalSpace"}) <= 0.1
+      for: 10m
+      labels:
+        severity: major
+        impact: potential index corruption
+      annotations:
+        description: Available disk on Solr pod {{ $labels.base_url }} is very low; only {{ $value }}% of total disk is available
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrAvailableDiskLow
+      expr: min by (namespace,base_url) (solr_metrics_core_fs_bytes{item="usableSpace"}) / max by (namespace,base_url) (solr_metrics_core_fs_bytes{item="totalSpace"}) <= 0.2
+      for: 10m
+      labels:
+        severity: warning
+        impact: node health
+      annotations:
+        description: Available disk on Solr pod {{ $labels.base_url }} is low; only {{ $value }}% of total disk is available
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrHighHeapUsage
+      expr: sum by (namespace,base_url) (solr_metrics_jvm_memory_heap_bytes{item="used"}) / sum by (namespace,base_url) (solr_metrics_jvm_memory_heap_bytes{item="max"}) > 0.9
+      for: 5m
+      labels:
+        severity: major
+        impact: node health
+      annotations:
+        description: Solr pod {{ $labels.base_url }} high heap usage at {{ $value }}%
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrHighCPU
+      expr: max by (namespace,base_url) (solr_metrics_jvm_os_cpu_load{item="systemCpuLoad"}) > 20
+      for: 10m
+      labels:
+        severity: major
+        impact: node health
+      annotations:
+        description: Solr pod {{ $labels.base_url }} high system CPU usage {{ $value }}%
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrHighCPULoadAvg
+      expr: max by (namespace,base_url) (solr_metrics_jvm_os_load_average) > 20
+      for: 10m
+      labels:
+        severity: major
+        impact: node health
+      annotations:
+        description: Solr pod {{ $labels.base_url }} high system CPU load {{ $value }}%
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrBlockedOrDeadlockedThreads
+      expr: max by (namespace,base_url) (solr_metrics_jvm_threads{item=~"blocked|deadlock"}) > 0
+      for: 2m
+      labels:
+        severity: critical
+        impact: node health
+      annotations:
+        description: Solr pod {{ $labels.base_url }} has blocked / deadlocked threads
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrUnbalancedLoad

Review Comment:
   When would this trigger? If clients hit a node directly instead of going through the k8s service endpoint?
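   
   For reference, the analogous `SolrUnbalancedQueryLoad` expr above seems to fire when a single pod receives more than a `1/live_nodes + 10%` share of total query traffic. A simplified, untested decomposition of that pattern (assuming a single namespace, so the `ignoring`/`group_left` matching collapses to `scalar()`):
   
   ```promql
   # per-pod share of the total 1-minute query rate
   sum by (namespace, base_url) (solr_metrics_core_query_1minRate)
     / scalar(sum(solr_metrics_core_query_1minRate))
   # fires when that share exceeds an even split (1/N) by more than 10 percentage points
   > scalar(max(1 / solr_collections_live_nodes)) + 0.1
   ```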



##########
solr/prometheus-exporter/conf/solr-alert-rules.yaml:
##########
@@ -0,0 +1,243 @@
+spec:
+  groups:
+  - name: SolrQuery

Review Comment:
   Alert group names seem to follow a `foo-bar` pattern, i.e. lowercase and dash-separated, at least for the out-of-the-box Prometheus rules. Should we do the same?
   ```suggestion
     - name: solr-query
   ```
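   
   Presumably the other groups would follow the same convention, e.g. (names only, as a sketch):
   
   ```yaml
     - name: solr-query
     - name: solr-indexing
     - name: solr-node-health
     - name: solr-cluster-health
   ```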



##########
solr/prometheus-exporter/conf/solr-alert-rules.yaml:
##########
@@ -0,0 +1,243 @@
+    - alert: SolrUnbalancedLoad
+      # compares the number of requests per base_url to the expected balanced load 1/N (+10% for flexibility)
+      expr: 100 * (sum by(namespace,base_url) (solr_metrics_jetty_requests_total) / ignoring(namespace,base_url) group_left() sum(solr_metrics_jetty_requests_total) > ignoring(namespace,base_url) group_left() max(1 / solr_collections_live_nodes) + 0.1)
+      for: 15m
+      labels:
+        severity: major
+        impact: node performance
+      annotations:
+        description: Unbalanced load ({{ $value }}% of total requests) sent to Solr pod {{ $labels.base_url }}
+        runbook_url: link_to_runbook_for_this_problem
+
+  - name: SolrClusterHealth
+    rules:
+    - alert: SolrZkEnsembleBelowQuorum
+      expr: count(solr_zookeeper_nodestatus == 1) / sum(solr_zookeeper_ensemble_size) <= 0.5
+      for: 2m
+      labels:
+        severity: critical
+        impact: cluster health
+      annotations:
+        description: Healthy Zookeeper node count is below Quorum {{ $value }}

Review Comment:
   Description should perhaps mention namespace?
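   
   Something like this might work — a rough, untested sketch, assuming the expr is also grouped by namespace so the label is available on the firing alert:
   
   ```yaml
         # sketch (fragment): group by namespace so {{ $labels.namespace }} resolves in the description
         expr: count by (namespace) (solr_zookeeper_nodestatus == 1) / sum by (namespace) (solr_zookeeper_ensemble_size) <= 0.5
         annotations:
           description: Healthy Zookeeper node count in namespace {{ $labels.namespace }} is below quorum ({{ $value }})
   ```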



##########
solr/prometheus-exporter/conf/solr-alert-rules.yaml:
##########
@@ -0,0 +1,243 @@
+    - alert: SolrZkEnsembleStatus
+      expr: count by (status) (solr_zookeeper_status{status!="green"}) > 0

Review Comment:
   ```suggestion
         expr: count by (namespace,status,cluster_id,zk_host) (solr_zookeeper_status{status!="green"}) > 0
   ```
   
   Need more context for this rule
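   
   e.g. a rough, untested sketch of what the full rule could look like with the extra labels and a more descriptive annotation:
   
   ```yaml
       - alert: SolrZkEnsembleStatus
         expr: count by (namespace,status,cluster_id,zk_host) (solr_zookeeper_status{status!="green"}) > 0
         for: 2m
         labels:
           severity: major
           impact: cluster health
         annotations:
           description: Zookeeper health is degraded in namespace {{ $labels.namespace }} for cluster {{ $labels.cluster_id }}. Status is {{ $labels.status }}
           runbook_url: link_to_runbook_for_this_problem
   ```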



##########
solr/prometheus-exporter/conf/solr-alert-rules.yaml:
##########
@@ -0,0 +1,243 @@
+    - alert: SolrZkEnsembleStatus
+      expr: count by (status) (solr_zookeeper_status{status!="green"}) > 0
+      for: 2m
+      labels:
+        severity: major
+        impact: cluster health
+      annotations:
+        description: Zookeeper health is degraded {{ $labels.status }}
+        runbook_url: link_to_runbook_for_this_problem
+    - alert: SolrNoLiveNodes
+      expr: solr_collections_live_nodes < 1
+      for: 1m

Review Comment:
   Will `1m` be robust enough? Given that the prometheus-exporter scrapes the Solr cluster every minute or so, Prometheus scrapes the exporter once a minute, and Alertmanager evaluates this rule at its own interval, I can imagine a short outage being picked up as lasting a full minute even if it was shorter?
   ```suggestion
         for: 2m
   ```
   
   Not sure if the same worry applies to the `[1m]` exprs, but I have seen funny-looking saw-tooth graphs in Grafana due to a 60s scrape interval both in the exporter and in Prometheus.
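   
   For the update-handler rules specifically, it might be worth widening the `rate()` window so it spans several scrape intervals (Prometheus generally wants the range to cover a few scrapes). A possible tweak, untested:
   
   ```yaml
         expr: max by (namespace,collection) (rate(solr_metrics_core_update_handler_errors_total[5m])) >= 20
   ```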



##########
solr/prometheus-exporter/conf/solr-alert-rules.yaml:
##########
@@ -0,0 +1,243 @@
+    - alert: SolrZkEnsembleStatus
+      expr: count by (status) (solr_zookeeper_status{status!="green"}) > 0
+      for: 2m
+      labels:
+        severity: major
+        impact: cluster health
+      annotations:
+        description: Zookeeper health is degraded {{ $labels.status }}

Review Comment:
   ```suggestion
           description: Zookeeper health is degraded in namespace {{ $labels.namespace }} for cluster {{ $labels.cluster_id }}. Status is {{ $labels.status }}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org