
We know that Prometheus can collect all kinds of monitoring data. To actually get alerted, we also need alerting rules — for example, fire a warning alert when disk usage reaches 80% and a critical alert when it reaches 90%. Alerting rules are defined in rule files referenced from the Prometheus configuration; once a threshold is crossed, notifications (Email, Slack, and so on) can be sent through Alertmanager.
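(For reference, a minimal alertmanager.yml with an e-mail receiver might look like the sketch below. The SMTP host, addresses and credentials are placeholders, not part of this article's setup — adjust them to your environment.)

# /etc/alertmanager/alertmanager.yml -- minimal sketch, all values are placeholders
global:
  smtp_smarthost: 'smtp.example.com:587'          # hypothetical SMTP relay
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'changeme'
route:
  receiver: 'ops-mail'                            # default receiver for all alerts
  group_by: ['alertname', 'instance']
receivers:
  - name: 'ops-mail'
    email_configs:
      - to: 'oncall@example.com'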

Next, let's write the alerting rules for node_exporter.

1. node_exporter alerting rules
# mkdir -p /etc/prometheus/rules.d
# vi /etc/prometheus/rules.d/host-status.rules

groups:
- name: host-status-rule
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  - alert: NodeFilesystemSpaceUsage
    expr: ( 1 - (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Filesystem space usage is high"
      description: "{{$labels.instance}}: Filesystem space usage is more than 80% (current usage is: {{ $value }})"
  - alert: NodeFilesystemSpaceUsage
    expr: ( 1 - (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100 > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: Filesystem space usage is too high"
      description: "{{$labels.instance}}: Filesystem space usage is more than 90% (current usage is: {{ $value }})"
  - alert: NodeFilesystemInodeUsage
    expr: ( 1 - (node_filesystem_files_free{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_files{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100 > 85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: Filesystem inode usage is high"
      description: "{{$labels.instance}}: Filesystem inode usage is more than 85% (current usage is: {{ $value }})"
  - alert: NodeFilesystemReadOnly
    expr: node_filesystem_readonly{job="node_exporter",device!~'rootfs'} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: Filesystem read only"
      description: "{{$labels.instance}}: Filesystem read only"
  - alert: NodeMemoryUsage_Warning
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Host memory usage is high"
      description: "{{$labels.instance}}: Host memory usage is more than 80% (current usage is: {{ $value }})"
  - alert: NodeMemoryUsage_Critical
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: Host memory usage is too high"
      description: "{{$labels.instance}}: Host memory usage is more than 90% (current usage is: {{ $value }})"
  - alert: NodeCPUUsage
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[1m])) * 100)) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: CPU usage is high"
      description: "{{$labels.instance}}: CPU usage is more than 80% (current usage: {{ $value }})"
  - alert: NodeCPUUsage
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode='idle'}[1m])) * 100)) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}}: CPU usage is too high"
      description: "{{$labels.instance}}: CPU usage is more than 90% (current usage: {{ $value }})"
  - alert: Network_Incoming
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Incoming network bandwidth is too high!"
      description: "{{$labels.instance}}: The incoming network bandwidth has exceeded 100M for more than a minute. RX bandwidth usage: {{$value}}"

Notes:

- alert: InstanceDown                   # alert (rule) name
  expr: up == 0                         # alert condition
  for: 1m                               # the condition must hold for 1 minute before the alert fires
  labels:                               # labels attached to the alert
    severity: critical
  annotations:                          # alert annotations (human-readable description)
    summary: "Instance {{ $labels.instance }} down"
    description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

For NodeFilesystemSpaceUsage, two levels (80% and 90%) are defined under the same alert name but with different thresholds. The drawback of this setup is that when filesystem usage exceeds 90%, Alertmanager receives two alerts — one for the >80% rule and one for the >90% rule. To avoid the duplicate notification, an inhibition rule needs to be configured in Alertmanager later, as sketched below.
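A minimal sketch of such an inhibition rule in alertmanager.yml, assuming the warning and critical alerts share the same alertname and instance labels (adjust the matchers to your own labels):

# alertmanager.yml (excerpt) -- suppress the warning alert while the matching critical alert is firing
inhibit_rules:
  - source_match:
      severity: critical      # the alert that does the inhibiting
    target_match:
      severity: warning       # the alert that gets suppressed
    equal: ['alertname', 'instance']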

2. Modify /etc/prometheus/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
- /etc/prometheus/rules.d/*.rules   # Add this line: it points at the rules directory, and Prometheus loads every rule file in it

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['9.98.12.85:9100']
        labels:
          env: production
          monitor: node_exporter
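Before restarting, the rule files and the main configuration can be checked for syntax errors with promtool (shipped alongside Prometheus):

# promtool check rules /etc/prometheus/rules.d/*.rules
# promtool check config /etc/prometheus/prometheus.yml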

Restart Prometheus so the configuration takes effect:

systemctl restart prometheus

Open the Prometheus web UI to check whether the rules have been loaded. If you see the following, the configuration was successful.

http://192.168.0.107:9090/rules

host rules

3. Test the alerting rules

Check the Prometheus Alerts page: http://192.168.0.107:9090/alerts

Prometheus alert

As you can see, there are currently no alerts — everything is normal.

Now let's trigger an alert. Since this is a test machine, it is enough to create a large file so that root filesystem usage goes above 90%.

# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  8.3G  6.6G  1.8G  79% /
devtmpfs                 911M     0  911M   0% /dev
tmpfs                    921M     0  921M   0% /dev/shm
tmpfs                    921M   12M  909M   2% /run
tmpfs                    921M     0  921M   0% /sys/fs/cgroup
/dev/sda1                497M  165M  333M  34% /boot
none                     236G  214G   22G  91% /vagrant
tmpfs                    185M     0  185M   0% /run/user/1000

# Create a 1 GB file
# dd if=/dev/zero of=/tmp/test bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.68688 s, 400 MB/s

# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  8.3G  7.6G  764M  91% /
devtmpfs                 911M     0  911M   0% /dev
tmpfs                    921M     0  921M   0% /dev/shm
tmpfs                    921M   12M  909M   2% /run
tmpfs                    921M     0  921M   0% /sys/fs/cgroup
/dev/sda1                497M  165M  333M  34% /boot
none                     236G  214G   22G  91% /vagrant
tmpfs                    185M     0  185M   0% /run/user/1000

The root filesystem is now at 91% usage, which triggers the Prometheus alerts — both the 80% and the 90% rules fire.
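You can also paste the rule expression into the Prometheus expression browser (the Graph page) to see the current value directly:

( 1 - (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) ) * 100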

Check the Prometheus Alerts page again: http://192.168.0.107:9090/alerts .

Alert rule states:

inactive: the alert condition is not met

pending: intermediate state; because for: 1m is set, the condition has started to match but has not yet held continuously for the full minute

firing: the condition has held for the configured duration and the alert is sent to Alertmanager

Filesystem usage above 90% has fired
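To clear the test alert, simply delete the temporary file; once usage drops back below the thresholds, the alerts return to inactive after the next evaluation cycle:

# rm -f /tmp/test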

At this point, the host monitoring rules are complete.

Below I also share the rule files I use for MySQL, PostgreSQL and Confluence.

/etc/prometheus/rules.d/mysql-status.rules

groups:
  - name: MySQLStatsAlert
    rules:
    - alert: MySQL is down
      expr: mysql_up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} MySQL is down"
        description: "MySQL database is down. This requires immediate action!"
    - alert: open files high
      expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} open files high"
        description: "Open files is high. Please consider increasing open_files_limit."
    - alert: Read buffer size is bigger than max. allowed packet size
      expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
        description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
    - alert: Sort buffer possibly misconfigured
      expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
        description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
    - alert: Thread stack size is too small
      expr: mysql_global_variables_thread_stack < 196608
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} Thread stack size is too small"
        description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical value is 256k for thread_stack_size."
    - alert: Used more than 80% of max connections limited
      expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
        description: "Used more than 80% of max connections limited"
    - alert: InnoDB Force Recovery is enabled
      expr: mysql_global_variables_innodb_force_recovery != 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
        description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
    - alert: InnoDB Log File size is too small
      expr: mysql_global_variables_innodb_log_file_size < 16777216
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
        description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
    - alert: InnoDB Flush Log at Transaction Commit
      expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
        description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
    - alert: Table definition cache too small
      expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} Table definition cache too small"
        description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
    - alert: Table open cache too small
      expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} Table open cache too small"
        description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
    - alert: Thread stack size is possibly too small
      expr: mysql_global_variables_thread_stack < 262144
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
        description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical value is 256k for thread_stack_size."
    - alert: InnoDB Plugin is enabled
      expr: mysql_global_variables_ignore_builtin_innodb == 1
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
        description: "InnoDB Plugin is enabled"
    - alert: Binary Log is disabled
      expr: mysql_global_variables_log_bin != 1
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} Binary Log is disabled"
        description: "Binary Log is disabled. This prohibits you to do Point in Time Recovery (PiTR)."
    - alert: Binlog Cache size too small
      expr: mysql_global_variables_binlog_cache_size < 1048576
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
        description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
    - alert: Binlog Transaction Cache size too small
      expr: mysql_global_variables_binlog_cache_size < 1048576
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
        description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."

/etc/prometheus/rules.d/postgresql-status.rules

groups:
- name: PostgreSQL-Status-Alert
  rules:

  ########## EXPORTER RULES ##########
  - alert: PGExporterScrapeError
    expr: pg_exporter_last_scrape_error > 0
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      summary: 'Postgres Exporter running on {{ $labels.job }} (instance: {{ $labels.instance }}) is encountering scrape errors processing queries. Error count: ( {{ $value }} )'

  - alert: NodeExporterScrapeError
    expr: node_textfile_scrape_error > 0
    for: 60s
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      summary: 'Node Exporter running on {{ $labels.job }} (instance: {{ $labels.instance }}) is encountering scrape errors processing custom metrics. Error count: ( {{ $value }} )'

  ########## POSTGRESQL RULES ##########
  - alert: PGIsUp
    expr: pg_up < 1
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      summary: 'postgres_exporter running on {{ $labels.job }} is unable to communicate with the configured database'

# Whether a system switches from primary to replica or vice versa must be configured per named job.
# No way to tell what value a system is supposed to be without a rule expression for that specific system
# 2 to 1 means it changed from primary to replica. 1 to 2 means it changed from replica to primary
# Set this alert for each system that you want to monitor a recovery status change
# Below is an example for a target job called "Replica" and watches for the value to change above 1 which means it's no longer a replica
#
#  - alert: PGRecoveryStatusSwitch_Replica
#    expr: ccp_is_in_recovery_status{job="Replica"} > 1
#    for: 60s
#    labels:
#      service: postgresql
#      severity: critical
#      severity_num: 300
#    annotations:
#      summary: '{{ $labels.job }} has changed from replica to primary'

# Absence alerts must be configured per named job, otherwise there's no way to know which job is down
# Below is an example for a target job called "Prod"
#  - alert: PGConnectionAbsent
#    expr: absent(ccp_connection_stats_max_connections{job="Prod"})
#    for: 10s
#    labels:
#      service: postgresql
#      severity: critical
#      severity_num: 300
#    annotations:
#      description: 'Connection metric is absent from target (Prod). Check that postgres_exporter can connect to PostgreSQL.'

  - alert: PGIdleTxn
    expr: ccp_connection_stats_max_idle_in_txn_time > 300
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: '{{ $labels.job }} has at least one session idle in transaction for over 5 minutes.'
      summary: 'PGSQL Instance idle transactions'

  - alert: PGIdleTxn
    expr: ccp_connection_stats_max_idle_in_txn_time > 900
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: '{{ $labels.job }} has at least one session idle in transaction for over 15 minutes.'
      summary: 'PGSQL Instance idle transactions'

  - alert: PGQueryTime
    expr: ccp_connection_stats_max_query_time > 43200
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: '{{ $labels.job }} has at least one query running for over 12 hours.'
      summary: 'PGSQL Max Query Runtime'

  - alert: PGQueryTime
    expr: ccp_connection_stats_max_query_time > 86400
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: '{{ $labels.job }} has at least one query running for over 1 day.'
      summary: 'PGSQL Max Query Runtime'

  - alert: PGConnPerc
    expr: 100 * (ccp_connection_stats_total / ccp_connection_stats_max_connections) > 75
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: '{{ $labels.job }} is using 75% or more of available connections ({{ $value }}%)'
      summary: 'PGSQL Instance connections'

  - alert: PGConnPerc
    expr: 100 * (ccp_connection_stats_total / ccp_connection_stats_max_connections) > 90
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: '{{ $labels.job }} is using 90% or more of available connections ({{ $value }}%)'
      summary: 'PGSQL Instance connections'

  - alert: PGDBSize
    expr: ccp_database_size > 1.073741824e+11
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} over 100GB in size: {{ $value }} bytes'
      summary: 'PGSQL Instance size warning'

  - alert: PGDBSize
    expr: ccp_database_size > 2.68435456e+11
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} over 250GB in size: {{ $value }} bytes'
      summary: 'PGSQL Instance size critical'

  - alert: PGReplicationByteLag
    expr: ccp_replication_status_byte_lag > 5.24288e+07
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has at least one replica lagging over 50MB behind.'
      summary: 'PGSQL Instance replica lag warning'

  - alert: PGReplicationByteLag
    expr: ccp_replication_status_byte_lag > 1.048576e+08
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has at least one replica lagging over 100MB behind.'
      summary: 'PGSQL Instance replica lag warning'

  - alert: PGReplicationSlotsInactive
    expr: ccp_replication_slots_active == 0
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has one or more inactive replication slots'
      summary: 'PGSQL Instance inactive replication slot'

  - alert: PGXIDWraparound
    expr: ccp_transaction_wraparound_percent_towards_wraparound > 50
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 50% towards transaction id wraparound.'
      summary: 'PGSQL Instance {{ $labels.job }} transaction id wraparound imminent'

  - alert: PGXIDWraparound
    expr: ccp_transaction_wraparound_percent_towards_wraparound > 75
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 75% towards transaction id wraparound.'
      summary: 'PGSQL Instance transaction id wraparound imminent'

  - alert: PGEmergencyVacuum
    expr: ccp_transaction_wraparound_percent_towards_emergency_autovac > 75
    for: 60s
    labels:
      service: postgresql
      severity: warning
      severity_num: 200
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 75% towards emergency autovacuum processes beginning'
      summary: 'PGSQL Instance emergency vacuum imminent'

  - alert: PGEmergencyVacuum
    expr: ccp_transaction_wraparound_percent_towards_emergency_autovac > 90
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} is over 90% towards emergency autovacuum processes beginning'
      summary: 'PGSQL Instance emergency vacuum imminent'

  - alert: PGArchiveCommandStatus
    expr: ccp_archive_command_status_seconds_since_last_fail > 300
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'PGSQL Instance {{ $labels.job }} has a recent failing archive command'
      summary: 'Seconds since the last recorded failure of the archive_command'

  - alert: PGSequenceExhaustion
    expr: ccp_sequence_exhaustion_count > 0
    for: 60s
    labels:
      service: postgresql
      severity: critical
      severity_num: 300
    annotations:
      description: 'Count of sequences on instance {{ $labels.job }} at over 75% usage: {{ $value }}. Run following query to see full sequence status: SELECT * FROM monitor.sequence_status() WHERE percent >= 75'

  ########## SYSTEM RULES ##########
  - alert: ExporterDown
    expr: avg_over_time(up[5m]) < 0.9
    for: 10s
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Metrics exporter service for {{ $labels.job }} running on {{ $labels.instance }} has been down at least 50% of the time for the last 5 minutes. Service may be flapping or down.'
      summary: 'Prometheus Exporter Service Down'

  - alert: DiskUsagePerc
    expr: (100 - 100 * sum(node_filesystem_avail_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"} / node_filesystem_size_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"}) BY (job,device)) > 70
    for: 2m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'Disk usage on target {{ $labels.job }} at {{ $value }}%'

  - alert: DiskUsagePerc
    expr: (100 - 100 * sum(node_filesystem_avail_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"} / node_filesystem_size_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"}) BY (job,device)) > 85
    for: 2m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Disk usage on target {{ $labels.job }} at {{ $value }}%'

  - alert: DiskFillPredict
    expr: predict_linear(node_filesystem_free_bytes{device!~"tmpfs|by-uuid",fstype=~"xfs|ext"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: '(EXPERIMENTAL) Disk {{ $labels.device }} on target {{ $labels.job }} is predicted to fill in 4 hrs based on current usage'

  - alert: SystemLoad5m
    expr: node_load5 > 5
    for: 10m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'System load for target {{ $labels.job }} is high ({{ $value }})'

  - alert: SystemLoad5m
    expr: node_load5 > 10
    for: 10m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'System load for target {{ $labels.job }} is high ({{ $value }})'

  - alert: MemoryAvailable
    expr: (100 * (node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) < 25
    for: 1m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'Memory available for target {{ $labels.job }} is at {{ $value }}%'

  - alert: MemoryAvailable
    expr: (100 * (node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) < 10
    for: 1m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Memory available for target {{ $labels.job }} is at {{ $value }}%'

  - alert: SwapUsage
    expr: (100 - (100 * (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes))) > 60
    for: 1m
    labels:
      service: system
      severity: warning
      severity_num: 200
    annotations:
      description: 'Swap usage for target {{ $labels.job }} is at {{ $value }}%'

  - alert: SwapUsage
    expr: (100 - (100 * (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes))) > 80
    for: 1m
    labels:
      service: system
      severity: critical
      severity_num: 300
    annotations:
      description: 'Swap usage for target {{ $labels.job }} is at {{ $value }}%'

########## PGBACKREST RULES ##########
# Uncomment and customize one or more of these rules to monitor your pgbackrest backups.
# Full backups are considered the equivalent of both differentials and incrementals since both are based on the last full
#   And differentials are considered incrementals since incrementals will be based off the last diff if one exists
#   This avoids false alerts, for example when you don't run diff/incr backups on the days that you run a full
# Stanza should also be set if different intervals are expected for each stanza.
#   Otherwise rule will be applied to all stanzas returned on target system if not set.
# Otherwise, all backups returned by the pgbackrest info command run from where the database exists will be checked
#
# Relevant metric names are:
#   ccp_backrest_last_full_time_since_completion_seconds
#   ccp_backrest_last_incr_time_since_completion_seconds
#   ccp_backrest_last_diff_time_since_completion_seconds
#
#  - alert: PGBackRestLastCompletedFull_main
#    expr: ccp_backrest_last_full_backup_time_since_completion_seconds{stanza="main"} > 604800
#    for: 60s
#    labels:
#       service: postgresql
#       severity: critical
#       severity_num: 300
#    annotations:
#       summary: 'Full backup for stanza [main] on system {{ $labels.job }} has not completed in the last week.'
#
#  - alert: PGBackRestLastCompletedIncr_main
#    expr: ccp_backrest_last_incr_backup_time_since_completion_seconds{stanza="main"} > 86400
#    for: 60s
#    labels:
#       service: postgresql
#       severity: critical
#       severity_num: 300
#    annotations:
#       summary: 'Incremental backup for stanza [main] on system {{ $labels.job }} has not completed in the last 24 hours.'
#
## Runtime monitoring is handled with a single metric:
#
#   ccp_backrest_last_runtime_backup_runtime_seconds
#
# Runtime monitoring should have the "backup_type" label set.
#   Otherwise the rule will apply to the last run of all backup types returned (full, diff, incr)
# Stanza should also be set if runtimes per stanza have different expected times
#
#  - alert: PGBackRestLastRuntimeFull_main
#    expr: ccp_backrest_last_runtime_backup_runtime_seconds{backup_type="full", stanza="main"} > 14400
#    for: 60s
#    labels:
#       service: postgresql
#       severity: critical
#       severity_num: 300
#    annotations:
#       summary: 'Expected runtime of full backup for stanza [main] has exceeded 4 hours'
#
#  - alert: PGBackRestLastRuntimeDiff_main
#    expr: ccp_backrest_last_runtime_backup_runtime_seconds{backup_type="diff", stanza="main"} > 3600
#    for: 60s
#    labels:
#       service: postgresql
#       severity: critical
#       severity_num: 300
#    annotations:
#       summary: 'Expected runtime of diff backup for stanza [main] has exceeded 1 hour'
#
### If the pgbackrest command fails to run, the metric disappears from the exporter output and the alert never fires.
## An absence alert must be configured explicitly for each target (job) that backups are being monitored.
## Checking for absence of just the full backup type should be sufficient (no need for diff/incr).
## Note that while the backrest check command failing will likely also cause a scrape error alert, the addition of this
## check gives a clearer answer as to what is causing it and that something is wrong with the backups.
#
#  - alert: PGBackrestAbsentFull_Prod
#    expr: absent(ccp_backrest_last_full_backup_time_since_completion_seconds{job="Prod"})
#    for: 10s
#    labels:
#      service: postgresql
#      severity: critical
#      severity_num: 300
#    annotations:
#      description: 'Backup Full status missing for Prod. Check that pgbackrest info command is working on target system.'

/etc/prometheus/rules.d/confluence-status.rules

groups:
- name: ConfluenceStatsAlert
  rules:
  - alert: Confluence is down
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} Confluence is down"
      description: "Confluence is down. This requires immediate action!"
  - alert: Confluence memory used percentage over 60%
    expr: process_resident_memory_bytes / process_virtual_memory_bytes > 0.6
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory used percentage over 60%"
      description: "Confluence memory used percentage over 60%. This requires immediate action!"
  - alert: Confluence threads deadlocked
    expr: jvm_threads_deadlocked > 0 or jvm_threads_deadlocked_monitor > 0 or jvm_threads_state{state="BLOCKED"} > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} threads deadlocked"
      description: "Confluence threads deadlocked. This requires immediate action!"
  - alert: Confluence jmx systemstat db latency
    expr: confluence_jmx_systemstat_db_latency > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} jmx systemstat db latency"
      description: "Confluence jmx systemstat db latency. This requires immediate action!"
  - alert: Confluence jmx request avg exectime of ten requests over 200
    expr: confluence_jmx_request_avg_exectime_of_ten_requests > 200
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} confluence_jmx_request_avg_exectime_of_ten_requests over 200"
      description: "Confluence jmx request avg exectime of ten requests over 200. This requires immediate action!"
  - alert: Confluence user failed login count over 10
    expr: confluence_user_failed_login_count > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} confluence_user_failed_login_count over 10"
      description: "Confluence user failed login count over 10. This requires immediate action!"
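These rule files assume the corresponding exporters are already scrape targets in prometheus.yml. A sketch of what the extra jobs might look like, using the exporters' usual default ports (9104 for mysqld_exporter, 9187 for postgres_exporter); the host names and the Confluence target are placeholders for whatever endpoint exposes the confluence_* metrics in your setup:

scrape_configs:
  - job_name: 'mysqld_exporter'
    static_configs:
      - targets: ['db-host:9104']           # mysqld_exporter default port (host is a placeholder)
  - job_name: 'postgres_exporter'
    static_configs:
      - targets: ['pg-host:9187']           # postgres_exporter default port (host is a placeholder)
  - job_name: 'confluence'
    static_configs:
      - targets: ['confluence-host:8090']   # placeholder; point at the endpoint that exports the confluence_* metrics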
