Alertmanager Alerting Rules Explained
This article covers the alerting and notification rules of Prometheus and Alertmanager. The Prometheus configuration file is prometheus.yml and the Alertmanager configuration file is alertmanager.yml.
Alert: refers to Prometheus sending a detected abnormal event to Alertmanager, not to the act of sending an email notification.
Notification: refers to Alertmanager sending out a notification (email, webhook, etc.) for the abnormal event.
Alerting rules
The interval at which alerting rules are evaluated is set in prometheus.yml:
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
The rule files are specified in prometheus.yml (wildcards are allowed, e.g. rules/*.rules):
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "/etc/prometheus/alert.rules"
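Taken together, a minimal prometheus.yml combining both settings could look like the sketch below (the scrape_interval value and the rule file path are illustrative placeholders):
global:
  # How often targets are scraped (illustrative value).
  scrape_interval: 15s
  # How often alerting/recording rules are evaluated.
  evaluation_interval: 1m

rule_files:
  - "/etc/prometheus/alert.rules"   # wildcards such as rules/*.rules also work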
Rules are based on the following template:
ALERT <alert name>
IF <expression>
[ FOR <duration> ]
[ LABELS <label set> ]
[ ANNOTATIONS <label set> ]
Where:
alert name is the identifier of the alert. It does not need to be unique.
expression is the condition that is evaluated to decide whether the alert fires. It typically uses existing metrics, i.e. the metrics returned by a /metrics endpoint.
duration is how long the condition must hold before the alert fires. For example, 5s means 5 seconds.
label set is a set of labels that will be available in the message template.
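For reference, these fields map directly onto the YAML rule format used by Prometheus 2.x (and by the ConfigMap below). This minimal sketch uses the standard up metric and an invented InstanceDown name purely as an illustration:
groups:
- name: example.rules
  rules:
  - alert: InstanceDown                # <alert name>
    expr: up == 0                      # <expression>
    for: 5m                            # <duration>
    labels:                            # <label set>
      severity: critical
    annotations:                       # additional <label set> for templates
      summary: "Instance {{ $labels.instance }} has been down for more than 5 minutes"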
A ruleSelector is created in the prometheus-k8s-statefulset.yaml file to mark the alerting-rule role; it is then referenced by the rule file prometheus-k8s-rules.yaml.
ruleSelector:
  matchLabels:
    role: prometheus-rulefiles
    prometheus: k8s
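For context, the selector sits under spec in the prometheus-operator Prometheus custom resource. The sketch below is an assumption about the surrounding manifest (resource name, namespace and apiVersion may differ in your deployment):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # ConfigMaps carrying rule files must match these labels.
  ruleSelector:
    matchLabels:
      role: prometheus-rulefiles
      prometheus: k8s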
prometheus-k8s-rules.yaml references prometheus-rulefiles through a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: k8s
data:
  pod.rules.yaml: |+
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 75% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev memory load alert
      - alert: Pod_all_network_receive_usage
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
          summary: network_receive load alert
Once the configuration file is in place, prometheus-operator automatically reloads it.
If you modify the ConfigMap again, a simple apply is enough:
kubectl apply -f prometheus-k8s-rules.yaml
Compare the email notifications you receive against the rules (alertmanager.yml still has to be configured before any email arrives).
Notification rules
Configure the route and receivers sections of alertmanager.yml:
global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  resolve_timeout: 5m

  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'xxxxx'
  smtp_from: 'xxxxxxx'
  smtp_auth_username: 'xxxxx'
  smtp_auth_password: 'xxxxxx'

  # The API URL to use for Slack notifications.
  slack_api_url: 'hooks.slack/services/some/api/token'

# The directory from which notification templates are read.
templates:
- '*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that
  # start firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a
  # batch of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  #repeat_interval: 1m
  repeat_interval: 15m

  # A default receiver.
  # If an alert isn't caught by a route, send it to default.
  receiver: default

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  - match:
      severity: critical
    receiver: email_alert

receivers:
- name: 'default'
  email_configs:
  - to: 'yi.hu@dianrong'
    send_resolved: true
- name: 'email_alert'
  email_configs:
  - to: 'yi.hu@dianrong'
    send_resolved: true
Terminology
Route
The route block defines the dispatch policy for alerts. It is a tree structure that is matched depth-first, from left to right.
// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
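To make the depth-first matching concrete, here is a small route tree sketch; the receiver names and matchers are invented for illustration. An alert with severity=critical and team=db descends to the deepest matching node and is sent to database-pager, while any other critical alert stops at ops-pager:
route:
  receiver: default            # used when no child route matches
  routes:
  - match:
      severity: critical
    receiver: ops-pager
    routes:
    - match:
        team: db
      receiver: database-pager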
Alert
An Alert is an alert as received by Alertmanager. Its type is as follows:
// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
    // Label value pairs for purpose of aggregation, matching, and disposition
    // dispatching. This must minimally include an "alertname" label.
    Labels LabelSet `json:"labels"`

    // Extra key/value information which does not define alert identity.
    Annotations LabelSet `json:"annotations"`

    // The known time range for this alert. Both ends are optional.
    StartsAt     time.Time `json:"startsAt,omitempty"`
    EndsAt       time.Time `json:"endsAt,omitempty"`
    GeneratorURL string    `json:"generatorURL"`
}
Only Alerts with identical labels (both keys and values) are considered the same alert. A single rule configured in the Prometheus rules file can therefore produce several different alerts.
Group
Alertmanager groups Alerts according to the group_by configuration. With the rules below, when go_goroutines equals 4, three alerts fire, and Alertmanager splits them into two groups before notifying the receivers (see the grouping sketch after the rules).
ALERT test1
IF go_goroutines > 1
LABELS {label1="l1", label2="l2", status="test"}
ALERT test2
IF go_goroutines > 2
LABELS {label1="l2", label2="l2", status="test"}
ALERT test3
IF go_goroutines > 3
LABELS {label1="l2", label2="l1", status="test"}
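The route used for this example is not shown in the original; assuming the alerts are grouped by label1 only, the relevant configuration and the resulting grouping would be:
route:
  group_by: ['label1']
  receiver: default
# Resulting groups when go_goroutines == 4:
#   {label1="l1"}: test1
#   {label1="l2"}: test2, test3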
Main processing flow
1. When an Alert is received, its labels determine which Routes it belongs to (it can match several Routes; a Route contains multiple Groups, and a Group contains multiple Alerts).
2. The Alert is assigned to a Group; if no matching Group exists, a new one is created.
3. A new Group waits for the time given by group_wait (Alerts for the same Group may arrive while waiting), checks against resolve_timeout whether the Alerts are resolved, and then sends the notification.
4. An existing Group waits for the time given by group_interval, checks whether its Alerts are resolved, and sends a notification when the time since the last notification exceeds repeat_interval or the Group has been updated.
Alertmanager
Alertmanager acts as a buffer for alerts. It has the following characteristics:
It can receive alerts through a dedicated endpoint (not specific to Prometheus).
It can redirect alerts to receivers such as HipChat, email, or others.
It is smart enough to detect that a similar notification has already been sent, so when something goes wrong you are not flooded with thousands of emails.
The Alertmanager client (Prometheus in this case) sends a POST request with all alerts to be processed to /api/v1/alerts. For example:
[
  {
    "labels": {
      "alertname": "low_connected_users",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance play-app:9000 under lower load",
      "summary": "play-app:9000 of job playframework-app is under lower load"
    }
  }
]
Alert workflow
Once these alerts are stored in Alertmanager, each of them is in one of the following states:
Inactive: nothing is happening here.
Pending: the client has told us that this alert must fire. However, the alert can still be grouped, inhibited or silenced. Once all checks have passed, it moves to Firing.
Firing: the alert is sent to the notification pipeline, which contacts all of the alert's receivers. When the client later reports that the alert is resolved, it transitions back to Inactive.
Prometheus has a dedicated endpoint that lists all alerts and follows their state transitions. Each state shown by Prometheus, and the condition that causes the transition, is as follows:
The rule does not match: the alert is not active.
The rule matches: the alert is now active. Some checks are performed to avoid flooding the receivers with messages.
The alert is sent to the receivers.
Receivers
As the name suggests, this is the configuration of where alert notifications are delivered.
Generic configuration format
# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
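As an example, a single receiver can combine several integrations; the receiver name, address and channel below are placeholders:
receivers:
- name: 'team-ops'
  email_configs:
  - to: 'ops@example.com'
    send_resolved: true
  slack_configs:
  - channel: '#alerts'
    send_resolved: true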
Email receiver: email_config
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]
# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
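A minimal concrete email receiver might look like this; the addresses and SMTP host are placeholders, and from/smarthost can be omitted when the global smtp_* settings are filled in:
receivers:
- name: 'email_alert'
  email_configs:
  - to: 'oncall@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    send_resolved: true
    headers:
      Subject: '[ALERT] {{ .CommonLabels.alertname }}'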
Slack receiver: slack_config
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]
# The channel or user to send notifications to.
channel: <tmpl_string>
# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]
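The defaults above can be overridden per receiver; the channel and template text below are only illustrative:
slack_configs:
- channel: '#prometheus-alerts'
  title: '{{ .CommonLabels.alertname }} ({{ .Status }})'
  text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
  send_resolved: true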
Webhook receiver: webhook_config
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>
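A typical use is forwarding alerts to an internal HTTP service; the URL below is a placeholder:
receivers:
- name: 'ops-webhook'
  webhook_configs:
  - url: 'http://alert-gateway.example.internal:8080/notify'
    send_resolved: true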
Alertmanager sends an HTTP POST request to the configured endpoint in the following format:
{
  "version": "2",
  "status": "<resolved|firing>",
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}
Inhibition
Inhibition is a mechanism that, once one alert has fired, suppresses notifications for other alerts that are caused by it.
For example, when an alert fires saying that an entire cluster is unreachable, Alertmanager can be configured to ignore all other alerts triggered by that outage. This prevents notifications for hundreds or thousands of alerts that are unrelated to the actual problem.
The inhibition mechanism is configured in the Alertmanager configuration file.
Inhibition allows the notifications for some alerts to be muted while other alerts are firing. For example, if the same alert (based on the alert name) is already firing at critical severity, an inhibition can be configured to mute any warning-level notification. The relevant part of alertmanager.yml looks like this:
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['low_connected_users']
An inhibition rule mutes the alerts matched by one set of matchers while another alert matching a second set of matchers is firing. The two alerts must share an identical set of the listed labels.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
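A more typical inhibition rule ties the source and target alerts to the same object, for example by requiring identical alertname and cluster labels; these label names are an assumption for illustration:
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply the inhibition only when both alerts share these label values.
  equal: ['alertname', 'cluster']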
Silences
Silences are a way to quickly mute alerts for a period of time. They are configured directly on a dedicated page of the Alertmanager web UI. This is useful to avoid being spammed with notifications while working on a serious production issue.