Reliable Insights

A blog on monitoring, scale and operational sanity

July 20, 2020

Delete All Your Alerts

Trying to improve alerting piecemeal can be difficult.

March 2, 2020

Setting Thresholds on Alerts

Alert thresholds can be surprisingly tricky to get right.

July 22, 2019

How should pipelines be monitored?

For online serving systems it's fairly well known that you should look for request rate, errors and duration. What about offline processing pipelines though?

June 17, 2019

Idempotent Cron Jobs are Operable Cron Jobs

Having to reconstruct how far a failed cron job had gotten and what exact parameters it was run with can be error-prone and time-consuming. There is a better way.

January 14, 2019

Why do resolved notifications contain old values?

Users are often confused about why resolved notifications don't contain updated annotation values. Let's dig into why.

December 31, 2018

Don’t put the value in alert labels

The labels of an alert are its identity, so you have to be a little careful about what you put in them.

November 26, 2018

Unit testing alerts with Prometheus

In the previous post we looked at testing rules. You can also test alerts.

October 8, 2018

Checking for HTTP 200s with the Blackbox Exporter

It's easy to check if HTTP and HTTPS endpoints are working with the Blackbox Exporter.

September 24, 2018

Alerting on approaching open file limits

In a previous post we looked at dealing with reaching the open file limit. How about alerting before it happens?

July 23, 2018

Absent Alerting for Scraped Metrics

In the previous post we looked at handling the case where all the targets for a job had disappeared. What if you wanted to alert on specific metrics disappearing from one target?
