In this blog post we try to clear up some confusion by outlining the key differences between three commonly confused Alertmanager configuration options: group_wait, group_interval, and repeat_interval.
Before digging into these three options, let's recap some Prometheus alerting basics.
Prometheus itself has two global clocks:
scrape_interval is the time between each Prometheus scrape (i.e., when Prometheus pulls data from exporters and other targets), and
evaluation_interval is the time between each evaluation of Prometheus' alerting rules.
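As a rough illustration, both clocks are set in the global section of the Prometheus configuration file. The 15s values below are placeholders chosen for the example, not recommendations.

```yaml
# prometheus.yml (illustrative values only)
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often alerting rules are evaluated
```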
When a rule is evaluated, its state can change to inactive, pending, or firing.
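To make the three states concrete, here is a minimal sketch of an alerting rule. The rule name, expression, and labels are illustrative assumptions. While the expression returns no results the alert is inactive; once it matches, the alert is pending until the condition has held for the `for` duration, after which it is firing.

```yaml
# rules.yml (hypothetical example rule)
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0        # matches any target whose last scrape failed
        for: 5m              # pending for 5 minutes before becoming firing
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```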
Following evaluation, this state is sent to the connected Alertmanager to potentially start/stop the sending of alert notifications.
This is where group_by comes into play.
To avoid continuously sending notifications for similar alerts (such as the same process failing on multiple instances, nodes, and data centres), the Alertmanager may be configured to group these related alerts into a single notification:
group_by: ['alertname', 'job']
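For context, group_by lives on a route in the Alertmanager configuration file. The sketch below shows where it sits and, in comments, which alerts would end up in the same group. The receiver name and the example label values are hypothetical.

```yaml
# alertmanager.yml (sketch; 'team-pager' is a placeholder receiver)
route:
  receiver: 'team-pager'
  group_by: ['alertname', 'job']

# With this grouping, these two alerts share a group because alertname
# and job match, even though instance differs:
#   {alertname="InstanceDown", job="node", instance="host-a:9100"}
#   {alertname="InstanceDown", job="node", instance="host-b:9100"}
# This one forms a separate group because job differs:
#   {alertname="InstanceDown", job="mysql", instance="db-1:9104"}

receivers:
  - name: 'team-pager'
```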
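Once a group has been notified, new alerts joining it do not trigger an immediate notification. Instead, the Alertmanager waits for group_interval to elapse since the last notification was sent to the group, and then sends all firing alerts (and any resolved alerts) to the receiver.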
group_wait sets how long to initially wait to send a notification for a particular group of alerts.
This allows the Alertmanager to wait for an inhibiting alert to arrive or to collect more initial alerts for the same group. It essentially buffers alerts from Prometheus sent to the Alertmanager that are grouped by the same labels:
group_by: ['alertname', 'job']
group_wait: 45s # Usually set between ~0s and a few minutes.
While this reduces noisy alerts and saves the people receiving them some headache, it may introduce longer delays in receiving said alert notifications.
Another issue we must consider is that we'll receive the same grouped alert notification again next time the rules are evaluated.
This is where we use group_interval.
group_interval dictates how long to wait before sending a notification about new alerts that are added to a group that has already been notified about:
group_by: ['instance', 'job']
group_wait: 45s
group_interval: 10m # Usually ~5 mins or more.
So where does repeat_interval fit into all of this?
repeat_interval is used to determine the wait time before a firing alert that has already been successfully sent to the receiver is sent again.
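Putting the pieces together, a route using all of these options might look like the sketch below. The receiver name and the 3h value are assumptions made for the example. The timeline in the comments is approximate: it reflects our understanding that the Alertmanager re-checks a group every group_interval, so exact send times can vary slightly.

```yaml
# alertmanager.yml (sketch; values and receiver name are illustrative)
route:
  receiver: 'team-pager'
  group_by: ['alertname', 'job']
  group_wait: 45s       # buffer before the first notification for a new group
  group_interval: 10m   # wait before notifying about changes to the group
  repeat_interval: 3h   # wait before re-sending an unchanged, still-firing group

# Approximate timeline for one group:
#   00:00:00  alert A starts firing; a new group is created
#   00:00:45  first notification is sent (A), after group_wait
#   00:02:00  alert B joins the group; nothing is sent yet
#   00:10:45  second notification is sent (A and B), after group_interval
#   ~03:10    if nothing has changed, the notification is repeated
#             once repeat_interval has passed since the last send

receivers:
  - name: 'team-pager'
```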
To summarise:
group_wait is how long to wait to buffer alerts of the same group before sending the initial notification.
group_interval is how long to wait before sending a notification about an alert that has been added to a group which has already been notified about.
repeat_interval is how long to wait before re-sending a notification for an alert that has already been sent.