---
# Source: gatus/charts/postgres-17-cluster/templates/prometheus-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gatus-postgresql-17-alert-rules
  namespace: gatus
  labels:
    helm.sh/chart: postgres-17-cluster-6.16.1
    app.kubernetes.io/name: gatus-postgresql-17
    app.kubernetes.io/instance: gatus
    app.kubernetes.io/part-of: gatus
    app.kubernetes.io/version: "6.16.1"
    app.kubernetes.io/managed-by: Helm
spec:
  groups:
    - name: cloudnative-pg/gatus-postgresql-17
      rules:
        - alert: CNPGClusterBackendsWaitingWarning
          annotations:
            summary: CNPG Cluster has a backend waiting for longer than 5 minutes.
            description: |-
              Pod {{ $labels.pod }}
              has been waiting for longer than 5 minutes
          expr: |
            cnpg_backends_waiting_total > 300
          for: 1m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterDatabaseDeadlockConflictsWarning
          annotations:
            summary: CNPG Cluster has over 10 deadlock conflicts.
            description: |-
              There are over 10 deadlock conflicts in
              {{ $labels.pod }}
          expr: |
            cnpg_pg_stat_database_deadlocks > 10
          for: 1m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterHACritical
          annotations:
            summary: CNPG Cluster has no standby replicas!
            description: |-
              CloudNativePG Cluster "{{ $labels.job }}" has no ready standby replicas. Your cluster is at severe
              risk of data loss and downtime if the primary instance fails.

              The primary instance is still online and able to serve queries, although connections to the `-ro` endpoint
              will fail. The `-r` endpoint is operating at reduced capacity and all traffic is being served by the primary.

              This can happen during a normal failover or automated minor version upgrades in a cluster with 2 or fewer
              instances. The replaced instance may need some time to catch up with the cluster primary instance.

              This alarm will always trigger if your cluster is configured to run with only 1 instance. In this
              case you may want to silence it.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHACritical.md
          expr: |
            max by (job) (cnpg_pg_replication_streaming_replicas{namespace="gatus"} - cnpg_pg_replication_is_wal_receiver_up{namespace="gatus"}) < 1
          for: 5m
          labels:
            severity: critical
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterHAWarning
          annotations:
            summary: CNPG Cluster has fewer than 2 standby replicas.
            description: |-
              CloudNativePG Cluster "{{ $labels.job }}" has only {{ $value }} standby replicas, putting
              your cluster at risk if another instance fails. The cluster is still able to operate normally, although
              the `-ro` and `-r` endpoints operate at reduced capacity.

              This can happen during a normal failover or automated minor version upgrades. The replaced instance may
              need some time to catch up with the cluster primary instance.

              This alarm will be constantly triggered if your cluster is configured to run with fewer than 3 instances.
              In this case you may want to silence it.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
          expr: |
            max by (job) (cnpg_pg_replication_streaming_replicas{namespace="gatus"} - cnpg_pg_replication_is_wal_receiver_up{namespace="gatus"}) < 2
          for: 5m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterHighConnectionsCritical
          annotations:
            summary: CNPG Instance maximum number of connections critical!
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" instance {{ $labels.pod }} is using {{ $value }}% of
              the maximum number of connections.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
          expr: |
            sum by (pod) (cnpg_backends_total{namespace="gatus", pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace="gatus", pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) * 100 > 95
          for: 5m
          labels:
            severity: critical
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterHighConnectionsWarning
          annotations:
            summary: CNPG Instance is approaching the maximum number of connections.
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" instance {{ $labels.pod }} is using {{ $value }}% of
              the maximum number of connections.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
          expr: |
            sum by (pod) (cnpg_backends_total{namespace="gatus", pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace="gatus", pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) * 100 > 80
          for: 5m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterHighReplicationLag
          annotations:
            summary: CNPG Cluster high replication lag
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" is experiencing a high replication lag of
              {{ $value }}ms.

              High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
          expr: |
            max(cnpg_pg_replication_lag{namespace="gatus",pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) * 1000 > 1000
          for: 5m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterInstancesOnSameNode
          annotations:
            summary: CNPG Cluster instances are located on the same node.
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" has {{ $value }}
              instances on the same node {{ $labels.node }}.

              A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
          expr: |
            count by (node) (kube_pod_info{namespace="gatus", pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) > 1
          for: 5m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterLongRunningTransactionWarning
          annotations:
            summary: CNPG Cluster query is taking longer than 5 minutes.
            description: |-
              CloudNativePG Cluster Pod {{ $labels.pod }}
              is taking more than 5 minutes (300 seconds) for a query.
          expr: |-
            cnpg_backends_max_tx_duration_seconds > 300
          for: 1m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterLowDiskSpaceCritical
          annotations:
            summary: CNPG Instance is running out of disk space!
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" is running extremely low on disk space. Check attached PVCs!
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
          expr: |
            max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"} / kubelet_volume_stats_capacity_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"})) > 0.9 OR
            max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-wal"} / kubelet_volume_stats_capacity_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-wal"})) > 0.9 OR
            max(sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-tbs.*"})
                /
                sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-tbs.*"})
                *
                on(namespace, persistentvolumeclaim) group_left(volume)
                kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}
            ) > 0.9
          for: 5m
          labels:
            severity: critical
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterLowDiskSpaceWarning
          annotations:
            summary: CNPG Instance is running out of disk space.
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" is running low on disk space. Check attached PVCs.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
          expr: |
            max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"} / kubelet_volume_stats_capacity_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"})) > 0.7 OR
            max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-wal"} / kubelet_volume_stats_capacity_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-wal"})) > 0.7 OR
            max(sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-tbs.*"})
                /
                sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="gatus", persistentvolumeclaim=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$-tbs.*"})
                *
                on(namespace, persistentvolumeclaim) group_left(volume)
                kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}
            ) > 0.7
          for: 5m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterOffline
          annotations:
            summary: CNPG Cluster has no running instances!
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" has no ready instances.

              Having an offline cluster means your applications will not be able to access the database, leading to
              potential service disruption and/or data loss.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterOffline.md
          expr: |
            (count(cnpg_collector_up{namespace="gatus",pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"}) OR on() vector(0)) == 0
          for: 5m
          labels:
            severity: critical
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterPGDatabaseXidAgeWarning
          annotations:
            summary: CNPG Cluster has a high number of transactions from the frozen XID to the current one.
            description: |-
              Over 300,000,000 transactions from frozen XID
              on pod {{ $labels.pod }}
          expr: |
            cnpg_pg_database_xid_age > 300000000
          for: 1m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterPGReplicationWarning
          annotations:
            summary: CNPG Cluster standby is lagging behind the primary.
            description: |-
              Standby is lagging behind by over 300 seconds (5 minutes)
          expr: |
            cnpg_pg_replication_lag > 300
          for: 1m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterReplicaFailingReplicationWarning
          annotations:
            summary: CNPG Cluster has a replica that is failing to replicate.
            description: |-
              Replica {{ $labels.pod }}
              is failing to replicate
          expr: |
            cnpg_pg_replication_in_recovery > cnpg_pg_replication_is_wal_receiver_up
          for: 1m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster
        - alert: CNPGClusterZoneSpreadWarning
          annotations:
            summary: CNPG Cluster has instances in the same zone.
            description: |-
              CloudNativePG Cluster "gatus/gatus-postgresql-17-cluster" has instances in the same availability zone.

              A disaster in one availability zone will lead to a potential service disruption and/or data loss.
            runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
          expr: |
            3 > count(count by (label_topology_kubernetes_io_zone) (kube_pod_info{namespace="gatus", pod=~"gatus-postgresql-17-cluster-([1-9][0-9]*)$"} * on(node,instance) group_left(label_topology_kubernetes_io_zone) kube_node_labels)) < 3
          for: 5m
          labels:
            severity: warning
            namespace: gatus
            cnpg_cluster: gatus-postgresql-17-cluster