A blog on monitoring, scale and operational Sanity
July 22, 2019
For online serving systems it's fairly well known that you should look for request rate, errors and duration. What about offline processing pipelines though?
June 17, 2019
Having to reconstruct how far a failed cron job had gotten and what exact parameters it was run with can be error-prone and time-consuming. There is a better way.
January 14, 2019
Users are often confused as to why resolved notifications don't contain updated annotation values. Let's dig into why.
December 31, 2018
The labels of an alert are its identity, so you have to be a little careful what you put in there.
November 26, 2018
In the previous post we looked at testing rules. You can also test alerts.
October 8, 2018
It's easy to check if HTTP and HTTPS endpoints are working with the Blackbox Exporter.
September 24, 2018
In a previous post we looked at dealing with reaching the open file limit. How about alerting before it happens?
July 23, 2018
In the previous post we looked at what to do when all the targets for a job have disappeared. What if you wanted to alert on specific metrics from one target disappearing?
July 16, 2018
Alerting on numbers being too big or small is easy with Prometheus. But what if the numbers go missing?
April 30, 2018
Since Prometheus 2.1 there is a feature to view alerting rule evaluation times in the rules UI. In this blog post we'll see an example of how this can be used to identify an expensive rule expression.