We know Prometheus can collect all kinds of monitoring data. To alert on it, we need alerting rules, for example: when disk usage reaches 80%, raise a warning alert; when it reaches 90%, raise a critical alert. The alerting rules themselves live in rule files loaded through the Prometheus configuration, and once a threshold is crossed, Alertmanager can send the notifications (Email, Slack); a minimal notification configuration is sketched below.
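Sending the notifications is Alertmanager's job and is not the focus of this post. As a rough orientation only, a minimal alertmanager.yml sketch with placeholder SMTP and Slack values (every address, webhook and channel below is an assumed example, not part of this setup) could look like this:

global:
  smtp_smarthost: 'smtp.example.com:587'       # placeholder SMTP relay
  smtp_from: 'alertmanager@example.com'        # placeholder sender address
route:
  receiver: 'ops'                              # all alerts go to the default receiver in this sketch
receivers:
  - name: 'ops'
    email_configs:
      - to: 'ops@example.com'                  # placeholder recipient
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder incoming-webhook URL
        channel: '#alerts'                     # placeholder channel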
Next, let's start by writing alerting rules for Node_exporter.
1. Node_exporter alerting rules

# mkdir -p /etc/prometheus/rules.d
# vi /etc/prometheus/rules.d/host-status.rules

groups:
- name: host-status-rule
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  - alert: NodeFilesystemSpaceUsage
    expr: ( 1 - (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Filesystem space usage is high"
      description: "{{ $labels.instance }}: Filesystem space usage is more than 80% (current usage is: {{ $value }})"
  - alert: NodeFilesystemSpaceUsage
    expr: ( 1 - (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100 > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }}: Filesystem space usage is too high"
      description: "{{ $labels.instance }}: Filesystem space usage is more than 90% (current usage is: {{ $value }})"
  - alert: NodeFilesystemInodeUsage
    expr: ( 1 - (node_filesystem_files_free{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_files{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100 > 85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }}: Filesystem inode usage is high"
      description: "{{ $labels.instance }}: Filesystem inode usage is more than 85% (current usage is: {{ $value }})"
  - alert: NodeFilesystemReadOnly
    expr: node_filesystem_readonly{job="node-exporter",device!~'rootfs'} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }}: Filesystem read only"
      description: "{{ $labels.instance }}: Filesystem read only"
  - alert: NodeMemoryUsage_Warning
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Host memory usage is high"
      description: "{{ $labels.instance }}: Host memory usage is more than 80% (current usage is: {{ $value }})"
  - alert: NodeMemoryUsage_Critical
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }}: Host memory usage is too high"
      description: "{{ $labels.instance }}: Host memory usage is more than 90% (current usage is: {{ $value }})"
  - alert: NodeCPUUsage
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[1m])) * 100)) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: CPU usage is high"
      description: "{{ $labels.instance }}: CPU usage is more than 80% (current usage: {{ $value }})"
  - alert: NodeCPUUsage
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[1m])) * 100)) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }}: CPU usage is too high"
      description: "{{ $labels.instance }}: CPU usage is more than 90% (current usage: {{ $value }})"
  - alert: Network_Incoming
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Incoming network bandwidth is too high!"
      description: "{{ $labels.instance }}: Incoming network bandwidth has stayed above 100M. RX bandwidth usage: {{ $value }}"
Explanation of the fields:
- alert: InstanceDown                 # alert (rule) name
  expr: up == 0                       # alert condition
  for: 1m                             # the condition must hold for 1 minute before the alert fires
  labels:                             # extra labels attached to the alert
    severity: critical
  annotations:                        # human-readable alert details
    summary: "Instance {{ $labels.instance }} down"
    description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
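Before wiring the file into Prometheus, its syntax can be checked with promtool, which ships with Prometheus (path as created above):

promtool check rules /etc/prometheus/rules.d/host-status.rules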
For the NodeFilesystemSpaceUsage rule, two levels are configured, 80% and 90%: the same alert name with different thresholds. This setup has one problem: when filesystem usage exceeds 90%, Alertmanager ends up with two alert entries, one for the >80% rule and one for the >90% rule. To avoid this, an inhibition rule needs to be configured on the Alertmanager side later, as sketched below.
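A minimal sketch of such an inhibition rule for alertmanager.yml, assuming the severity labels used above: while a critical alert is firing, the warning alert with the same alert name and instance is suppressed.

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # only inhibit when both alerts share the same alert name and instance
    equal: ['alertname', 'instance']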
2. Edit /etc/prometheus/prometheus.yml

# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /etc/prometheus/rules.d/*.rules   # add this line: Prometheus will load every rule file in this directory

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
    - targets: ['9.98.12.85:9100']
      labels:
        env: production
        monitor: node_exporter
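It is worth validating the main configuration as well; this also checks the rule files referenced under rule_files:

promtool check config /etc/prometheus/prometheus.yml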
Restart Prometheus for the configuration to take effect:
systemctl restart prometheus
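A full restart is not strictly necessary. If Prometheus was started with the --web.enable-lifecycle flag, the configuration and rule files can be reloaded in place, and a SIGHUP works as well; a sketch assuming that flag is set:

curl -X POST http://localhost:9090/-/reload
# or send SIGHUP to the running process:
kill -HUP $(pidof prometheus)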
Open the Prometheus web UI to check whether the rules took effect. Seeing the following confirms the configuration succeeded.
http://192.168.0.107:9090/rules
host rules
3. Test the alerting rules
Check the Prometheus Alerts page: http://192.168.0.107:9090/alerts
Prometheus alert
As you can see, there are no alerts at the moment; everything is normal.
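A rule expression can also be evaluated by hand against the Prometheus HTTP query API to confirm it returns what you expect before anything fires; a small sketch, assuming Prometheus is reachable on localhost:9090:

# evaluate a simple expression (the API accepts the query as a form parameter)
curl http://localhost:9090/api/v1/query --data-urlencode 'query=up == 0'
# the filesystem-usage expression from the rules can be tested the same way
curl http://localhost:9090/api/v1/query --data-urlencode 'query=(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 80'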
Now let's trigger an alert. Since this is a test machine, creating a large file to push root filesystem usage above 90% is enough.
# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  8.3G  6.6G  1.8G  79% /
devtmpfs                 911M     0  911M   0% /dev
tmpfs                    921M     0  921M   0% /dev/shm
tmpfs                    921M   12M  909M   2% /run
tmpfs                    921M     0  921M   0% /sys/fs/cgroup
/dev/sda1                497M  165M  333M  34% /boot
none                     236G  214G   22G  91% /vagrant
tmpfs                    185M     0  185M   0% /run/user/1000

Create a 1 GB file:

# dd if=/dev/zero of=/tmp/test bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.68688 s, 400 MB/s

# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  8.3G  7.6G  764M  91% /
devtmpfs                 911M     0  911M   0% /dev
tmpfs                    921M     0  921M   0% /dev/shm
tmpfs                    921M   12M  909M   2% /run
tmpfs                    921M     0  921M   0% /sys/fs/cgroup
/dev/sda1                497M  165M  333M  34% /boot
none                     236G  214G   22G  91% /vagrant
tmpfs                    185M     0  185M   0% /run/user/1000
The root filesystem is now at 91% usage, which triggers the Prometheus alerts; both the 80% and the 90% rules fire.
Check the Prometheus Alerts page http://192.168.0.107:9090/alerts again.
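The same information is available from the HTTP API, which is handy on a headless server (localhost assumed):

curl http://localhost:9090/api/v1/alerts    # lists pending and firing alerts as JSON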
Alert state descriptions:
inactive: the alert expression is not currently satisfied
pending: intermediate state; the expression is satisfied, but has not yet been satisfied continuously for the full "for" duration (1m in our rules)
firing: the expression has been satisfied for at least the "for" duration; the alert is active and sent to Alertmanager
Alerts triggered by filesystem usage exceeding 90%
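Once the test is done, removing the file brings usage back under the thresholds, and the alerts return to inactive after the next evaluations:

rm -f /tmp/test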
With that, the host monitoring rule setup is complete.
Below I share the rules I use for MySQL, PostgreSQL and Confluence.
/etc/prometheus/rules.d/mysql-status.rules
groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQLIsDown
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: OpenFilesHigh
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: ReadBufferSizeBiggerThanMaxAllowedPacket
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
  - alert: SortBufferPossiblyMisconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: ThreadStackSizeTooSmall
    expr: mysql_global_variables_thread_stack < 196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs, for example. A typical value is 256k for thread_stack_size."
  - alert: UsedMoreThan80PercentOfMaxConnections
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} used more than 80% of max connections limit"
      description: "Used more than 80% of the max connections limit"
  - alert: InnoDBForceRecoveryEnabled
    expr: mysql_global_variables_innodb_force_recovery != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDBLogFileSizeTooSmall
    expr: mysql_global_variables_innodb_log_file_size < 16777216
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDBFlushLogAtTransactionCommit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
  - alert: TableDefinitionCacheTooSmall
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: TableOpenCacheTooSmall
    expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: ThreadStackSizePossiblyTooSmall
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs, for example. A typical value is 256k for thread_stack_size."
  - alert: InnoDBPluginEnabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: BinaryLogDisabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prevents you from doing Point in Time Recovery (PiTR)."
  - alert: BinlogCacheSizeTooSmall
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly too small. A value of 1 MB or higher is OK."
  - alert: BinlogTransactionCacheSizeTooSmall
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly too small. A value of 1 MB or higher is typically OK."
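These rules assume that the mysql_up, mysql_global_status_* and mysql_global_variables_* metrics are already being scraped from mysqld_exporter. A minimal scrape job for prometheus.yml could look like the sketch below; the target address is a placeholder:

  - job_name: 'mysqld_exporter'
    static_configs:
    - targets: ['192.168.0.110:9104']   # placeholder: host running mysqld_exporter (default port 9104)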
/etc/prometheus/rules.d/postgresql-status.rules
groups:
- name: PostgreSQL-Status-Alert
  rules:

  ########## EXPORTER RULES ##########
  - alert: PGExporterScrapeError
    expr: pg_exporter_last_scrape_error > 0
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      summary: 'Postgres Exporter running on {{ $labels.job }} (instance: {{ $labels.instance }}) is encountering scrape errors processing queries. Error count: ( {{ $value }} )'

  - alert: NodeExporterScrapeError
    expr: node_textfile_scrape_error > 0
    for: 60s
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      summary: 'Node Exporter running on {{ $labels.job }} (instance: {{ $labels.instance }}) is encountering scrape errors processing custom metrics. Error count: ( {{ $value }} )'

  ########## POSTGRESQL RULES ##########
  - alert: PGIsUp
    expr: pg_up < 1
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      summary: 'postgres_exporter running on {{ $labels.job }} is unable to communicate with the configured database'

# Whether a system switches from primary to replica or vice versa must be configured per named job.
# No way to tell what value a system is supposed to be without a rule expression for that specific system
# 2 to 1 means it changed from primary to replica. 1 to 2 means it changed from replica to primary
# Set this alert for each system that you want to monitor a recovery status change
# Below is an example for a target job called "Replica" and watches for the value to change above 1 which means it's no longer a replica
#
#  - alert: PGRecoveryStatusSwitch_Replica
#    expr: ccp_is_in_recovery_status{job="Replica"} > 1
#    for: 60s
#    labels:
#      service: postgresql
#      severity: critical
#      severity_num: 300
#    annotations:
#      summary: '{{ $labels.job }} has changed from replica to primary'

# Absence alerts must be configured per named job, otherwise there's no way to know which job is down
# Below is an example for a target job called "Prod"
#  - alert: PGConnectionAbsent
#    expr: absent(ccp_connection_stats_max_connections{job="Prod"})
#    for: 10s
#    labels:
#      service: postgresql
#      severity: critical
#      severity_num: 300
#    annotations:
#      description: 'Connection metric is absent from target (Prod). Check that postgres_exporter can connect to PostgreSQL.'

  - alert: PGIdleTxn
    expr: ccp_connection_stats_max_idle_in_txn_time > 300
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: '{{ $labels.job }} has at least one session idle in transaction for over 5 minutes.'
      summary: 'PGSQL Instance idle transactions'

  - alert: PGIdleTxn
    expr: ccp_connection_stats_max_idle_in_txn_time > 900
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: '{{ $labels.job }} has at least one session idle in transaction for over 15 minutes.'
      summary: 'PGSQL Instance idle transactions'

  - alert: PGQueryTime
    expr: ccp_connection_stats_max_query_time > 43200
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: '{{ $labels.job }} has at least one query running for over 12 hours.'
      summary: 'PGSQL Max Query Runtime'

  - alert: PGQueryTime
    expr: ccp_connection_stats_max_query_time > 86400
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: '{{ $labels.job }} has at least one query running for over 1 day.'
      summary: 'PGSQL Max Query Runtime'

  - alert: PGConnPerc
    expr: 100 * (ccp_connection_stats_total / ccp_connection_stats_max_connections) > 75
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: '{{ $labels.job }} is using 75% or more of available connections ({{ $value }}%)'
      summary: 'PGSQL Instance connections'

  - alert: PGConnPerc
    expr: 100 * (ccp_connection_stats_total / ccp_connection_stats_max_connections) > 90
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: '{{ $labels.job }} is using 90% or more of available connections ({{ $value }}%)'
      summary: 'PGSQL Instance connections'

  - alert: PGDBSize
    expr: ccp_database_size > 1.073741824e+11
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} over 100GB in size: {{ $value }} bytes'
      summary: 'PGSQL Instance size warning'

  - alert: PGDBSize
    expr: ccp_database_size > 2.68435456e+11
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} over 250GB in size: {{ $value }} bytes'
      summary: 'PGSQL Instance size critical'

  - alert: PGReplicationByteLag
    expr: ccp_replication_status_byte_lag > 5.24288e+07
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has at least one replica lagging over 50MB behind.'
      summary: 'PGSQL Instance replica lag warning'

  - alert: PGReplicationByteLag
    expr: ccp_replication_status_byte_lag > 1.048576e+08
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has at least one replica lagging over 100MB behind.'
      summary: 'PGSQL Instance replica lag warning'

  - alert: PGReplicationSlotsInactive
    expr: ccp_replication_slots_active == 0
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has one or more inactive replication slots'
      summary: 'PGSQL Instance inactive replication slot'

  - alert: PGXIDWraparound
    expr: ccp_transaction_wraparound_percent_towards_wraparound > 50
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 50% towards transaction id wraparound.'
      summary: 'PGSQL Instance {{ $labels.job }} transaction id wraparound imminent'

  - alert: PGXIDWraparound
    expr: ccp_transaction_wraparound_percent_towards_wraparound > 75
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 75% towards transaction id wraparound.'
      summary: 'PGSQL Instance transaction id wraparound imminent'

  - alert: PGEmergencyVacuum
    expr: ccp_transaction_wraparound_percent_towards_emergency_autovac > 75
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 75% towards emergency autovacuum processes beginning'
      summary: 'PGSQL Instance emergency vacuum imminent'

  - alert: PGEmergencyVacuum
    expr: ccp_transaction_wraparound_percent_towards_emergency_autovac > 90
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 90% towards emergency autovacuum processes beginning'
      summary: 'PGSQL Instance emergency vacuum imminent'

  - alert: PGArchiveCommandStatus
    expr: ccp_archive_command_status_seconds_since_last_fail > 300
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has a recent failing archive command'
      summary: 'Seconds since the last recorded failure of the archive_command'

  - alert: PGSequenceExhaustion
    expr: ccp_sequence_exhaustion_count > 0
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'Count of sequences on instance {{ $labels.job }} at over 75% usage: {{ $value }}. Run following query to see full sequence status: SELECT * FROM monitor.sequence_status() WHERE percent >= 75'

  ########## SYSTEM RULES ##########
  - alert: ExporterDown
    expr: avg_over_time(up[5m]) < 0.9
    for: 10s
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Metrics exporter service for {{ $labels.job }} running on {{ $labels.instance }} has been down at least 50% of the time for the last 5 minutes. Service may be flapping or down.'
      summary: 'Prometheus Exporter Service Down'

  - alert: DiskUsagePerc
    expr: (100 - 100 * sum(node_filesystem_avail_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"} / node_filesystem_size_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"}) BY (job,device)) > 70
    for: 2m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'Disk usage on target {{ $labels.job }} at {{ $value }}%'

  - alert: DiskUsagePerc
    expr: (100 - 100 * sum(node_filesystem_avail_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"} / node_filesystem_size_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"}) BY (job,device)) > 85
    for: 2m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Disk usage on target {{ $labels.job }} at {{ $value }}%'

  - alert: DiskFillPredict
    expr: predict_linear(node_filesystem_free_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: '(EXPERIMENTAL) Disk {{ $labels.device }} on target {{ $labels.job }} is predicted to fill in 4 hrs based on current usage'

  - alert: SystemLoad5m
    expr: node_load5 > 5
    for: 10m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'System load for target {{ $labels.job }} is high ({{ $value }})'

  - alert: SystemLoad5m
    expr: node_load5 > 10
    for: 10m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'System load for target {{ $labels.job }} is high ({{ $value }})'

  - alert: MemoryAvailable
    expr: (100 * (node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) < 25
    for: 1m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'Memory available for target {{ $labels.job }} is at {{ $value }}%'

  - alert: MemoryAvailable
    expr: (100 * (node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) < 10
    for: 1m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Memory available for target {{ $labels.job }} is at {{ $value }}%'

  - alert: SwapUsage
    expr: (100 - (100 * (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes))) > 60
    for: 1m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'Swap usage for target {{ $labels.job }} is at {{ $value }}%'

  - alert: SwapUsage
    expr: (100 - (100 * (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes))) > 80
    for: 1m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Swap usage for target {{ $labels.job }} is at {{ $value }}%'

########## PGBACKREST RULES ##########
#
# Uncomment and customize one or more of these rules to monitor your pgbackrest backups.
# Full backups are considered the equivalent of both differentials and incrementals since both are based on the last full
# And differentials are considered incrementals since incrementals will be based off the last diff if one exists
# This avoids false alerts, for example when you don't run diff/incr backups on the days that you run a full
# Stanza should also be set if different intervals are expected for each stanza.
# Otherwise rule will be applied to all stanzas returned on target system if not set.
# Otherwise, all backups returned by the pgbackrest info command run from where the database exists will be checked
#
# Relevant metric names are:
#   ccp_backrest_last_full_time_since_completion_seconds
#   ccp_backrest_last_incr_time_since_completion_seconds
#   ccp_backrest_last_diff_time_since_completion_seconds
#
# - alert: PGBackRestLastCompletedFull_main
#   expr: ccp_backrest_last_full_backup_time_since_completion_seconds{stanza="main"} > 604800
#   for: 60s
#   labels:
#     service: postgresql
#     severity: critical
#     severity_num: 300
#   annotations:
#     summary: 'Full backup for stanza [main] on system {{ $labels.job }} has not completed in the last week.'
#
# - alert: PGBackRestLastCompletedIncr_main
#   expr: ccp_backrest_last_incr_backup_time_since_completion_seconds{stanza="main"} > 86400
#   for: 60s
#   labels:
#     service: postgresql
#     severity: critical
#     severity_num: 300
#   annotations:
#     summary: 'Incremental backup for stanza [main] on system {{ $labels.job }} has not completed in the last 24 hours.'
#
## Runtime monitoring is handled with a single metric:
#
#   ccp_backrest_last_runtime_backup_runtime_seconds
#
# Runtime monitoring should have the "backup_type" label set.
# Otherwise the rule will apply to the last run of all backup types returned (full, diff, incr)
# Stanza should also be set if runtimes per stanza have different expected times
#
# - alert: PGBackRestLastRuntimeFull_main
#   expr: ccp_backrest_last_runtime_backup_runtime_seconds{backup_type="full", stanza="main"} > 14400
#   for: 60s
#   labels:
#     service: postgresql
#     severity: critical
#     severity_num: 300
#   annotations:
#     summary: 'Expected runtime of full backup for stanza [main] has exceeded 4 hours'
#
# - alert: PGBackRestLastRuntimeDiff_main
#   expr: ccp_backrest_last_runtime_backup_runtime_seconds{backup_type="diff", stanza="main"} > 3600
#   for: 60s
#   labels:
#     service: postgresql
#     severity: critical
#     severity_num: 300
#   annotations:
#     summary: 'Expected runtime of diff backup for stanza [main] has exceeded 1 hour'
#
### If the pgbackrest command fails to run, the metric disappears from the exporter output and the alert never fires.
## An absence alert must be configured explicitly for each target (job) that backups are being monitored.
## Checking for absence of just the full backup type should be sufficient (no need for diff/incr).
## Note that while the backrest check command failing will likely also cause a scrape error alert, the addition of this
## check gives a clearer answer as to what is causing it and that something is wrong with the backups.
#
# - alert: PGBackrestAbsentFull_Prod
#   expr: absent(ccp_backrest_last_full_backup_time_since_completion_seconds{job="Prod"})
#   for: 10s
#   labels:
#     service: postgresql
#     severity: critical
#     severity_num: 300
#   annotations:
#     description: 'Backup Full status missing for Prod. Check that pgbackrest info command is working on target system.'
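The PostgreSQL rules above attach service and severity_num labels; routing on them in Alertmanager lets database alerts go to a dedicated receiver. A hypothetical route fragment for alertmanager.yml, where the receiver names are placeholders that must be defined elsewhere in your config:

route:
  receiver: 'ops'                 # placeholder default receiver
  routes:
    - match:
        service: postgresql
      receiver: 'dba-team'        # placeholder receiver for database alerts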
/etc/prometheus/rules.d/confluence-status.rules
groups:
- name: ConfluenceStatsAlert
  rules:
  - alert: ConfluenceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} Confluence is down"
      description: "Confluence is down. This requires immediate action!"
  - alert: ConfluenceMemoryUsedPercentageOver60
    expr: process_resident_memory_bytes / process_virtual_memory_bytes > 0.6
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory used percentage over 60%"
      description: "Confluence memory used percentage is over 60%. This requires immediate action!"
  - alert: ConfluenceThreadsDeadlocked
    expr: jvm_threads_deadlocked > 0 or jvm_threads_deadlocked_monitor > 0 or jvm_threads_state{state="BLOCKED"} > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} threads deadlocked"
      description: "Confluence threads are deadlocked. This requires immediate action!"
  - alert: ConfluenceJmxSystemstatDbLatency
    expr: confluence_jmx_systemstat_db_latency > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} jmx systemstat db latency"
      description: "Confluence jmx systemstat db latency detected. This requires immediate action!"
  - alert: ConfluenceJmxRequestAvgExectimeOfTenRequestsOver200
    expr: confluence_jmx_request_avg_exectime_of_ten_requests > 200
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} confluence_jmx_request_avg_exectime_of_ten_requests over 200"
      description: "Confluence jmx request avg exectime of ten requests is over 200. This requires immediate action!"
  - alert: ConfluenceUserFailedLoginCountOver10
    expr: confluence_user_failed_login_count > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} confluence_user_failed_login_count over 10"
      description: "Confluence user failed login count is over 10. This requires immediate action!"
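With all four rule files in place, they can be validated in one pass and the loaded rule groups checked after a reload; a short sketch using the paths from this post:

promtool check rules /etc/prometheus/rules.d/*.rules    # validates every rule file at once
curl http://localhost:9090/api/v1/rules                 # confirms which rule groups Prometheus has loaded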