A blog on monitoring, scale and operational Sanity
July 23, 2018
In the previous post we looked at how to deal with all the targets for a job disappearing. What if you wanted to alert on specific metrics from one target disappearing?
July 16, 2018
Alerting on numbers being too big or small is easy with Prometheus. But what if the numbers go missing?
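Here is a toy model of why a missing metric slips past a threshold alert: comparing an empty vector against a threshold yields nothing, so nothing fires. This is a minimal sketch, assuming an instant vector can be modelled as a dict from label sets to values. `greater_than` and `absent_like` are hypothetical helpers. `absent_like` only approximates PromQL's `absent()`, which also copies equality matchers from the selector into its output labels; the sketch ignores that.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical model of an instant vector: label set -> sample value.
Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def greater_than(vector: Vector, threshold: float) -> Vector:
    """Keep only series above the threshold, like `metric > threshold`."""
    return {labels: value for labels, value in vector.items() if value > threshold}


def absent_like(vector: Vector) -> Vector:
    """Return one label-less series with value 1 if the input is empty,
    otherwise an empty vector (roughly what absent() does)."""
    return {} if vector else {frozenset(): 1.0}


# A target reporting a small value, and the same metric gone missing.
present: Vector = {frozenset({("job", "api")}): 3.0}
missing: Vector = {}

print(greater_than(present, 10))  # {} : no alert, the value is fine
print(greater_than(missing, 10))  # {} : no alert either, which hides the problem
print(absent_like(missing))       # {frozenset(): 1.0} : this would fire
```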
May 28, 2018
Sometimes you want the raw samples inside Prometheus for analysis or debugging. How do you get that?
April 30, 2018
Since Prometheus 2.1 there is a feature to view alerting rule evaluation times in the rules UI. In this blog post we'll see an example of how this can be used to identify an expensive rule expression.
April 23, 2018
When using the count aggregation operator you may have noticed that it sometimes returns nothing rather than 0. Why is this?
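A rough way to see the behaviour: an aggregation produces one output series per group, and groups only exist when at least one input series falls into them. With no input series there are no groups, so there is nothing to report a 0 for. This is a minimal Python sketch under that model; `count_by` is a hypothetical helper, not a Prometheus API.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def count_by(vector: Vector, by: Tuple[str, ...] = ()) -> Vector:
    """Count series per group of `by` labels, like count by (...) (vector)."""
    groups: Dict[Labels, float] = defaultdict(float)
    for labels in vector:
        # The group key keeps only the labels named in `by`.
        key = frozenset((name, value) for name, value in labels if name in by)
        groups[key] += 1
    # A group only exists if at least one input series landed in it.
    return dict(groups)


up = {
    frozenset({("job", "api"), ("instance", "a")}): 1.0,
    frozenset({("job", "api"), ("instance", "b")}): 1.0,
}
print(count_by(up))  # {frozenset(): 2.0}
print(count_by({}))  # {} rather than {frozenset(): 0.0}
```

In PromQL itself, a frequently used workaround when no grouping labels are involved is to append `or vector(0)`, for example `count(up == 0) or vector(0)`.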
March 19, 2018
If your applications are restarting regularly, whether due to segfaults or OOMs, it'd be nice to know.
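One common approach, assuming your client library exports the standard `process_start_time_seconds` metric (many do), is to count how often the start time changes over a window, for example with `changes(process_start_time_seconds[1h])`. The sketch below models only the counting part of `changes()` in Python; the sample values are made up for illustration.

```python
from typing import List


def changes(values: List[float]) -> int:
    """Count how often consecutive samples differ, like PromQL's changes()."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


# Hypothetical process_start_time_seconds samples over an hour:
# the start time jumps twice, i.e. the process restarted twice.
start_times = [1531000000.0, 1531000000.0, 1531001200.0, 1531001200.0, 1531002400.0]
print(changes(start_times))  # 2
```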
February 12, 2018
One of the major changes introduced in Prometheus 2.0 was staleness handling. Previously, for instant vectors Prometheus would return a point up to 5 minutes in the past, which caused a number of issues.
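To make the old behaviour concrete, here is a minimal Python sketch of a 5 minute lookback: an instant query returns the newest sample at or before the evaluation time, provided it is no older than the window. It models only the pre-2.0 behaviour and treats the window boundary as inclusive, which is an assumption. `instant_value` is a hypothetical helper.

```python
import bisect
from typing import List, Optional, Tuple

LOOKBACK_SECONDS = 300.0  # the 5 minute window


def instant_value(samples: List[Tuple[float, float]], t: float) -> Optional[float]:
    """Newest sample at or before t, if no older than the lookback window.

    `samples` are (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect.bisect_right(timestamps, t)
    if i == 0:
        return None  # nothing at or before t
    ts, value = samples[i - 1]
    return value if t - ts <= LOOKBACK_SECONDS else None


# A series scraped at t=0 and t=15, after which its target went away.
samples = [(0.0, 1.0), (15.0, 2.0)]
print(instant_value(samples, 200.0))  # 2.0: the stale value is still returned
print(instant_value(samples, 400.0))  # None: now outside the 5 minute window
```

The first query shows one of the issues: a series whose target has gone away keeps appearing in results for several minutes.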
February 5, 2018
Have you ever wondered what percentage of time a given service or application spends up or down?
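Because `up` is 1 for a successful scrape and 0 for a failed one, averaging it over a range gives the fraction of scrapes that succeeded, which is what `avg_over_time(up[1d])` computes in PromQL. This minimal Python sketch assumes regularly spaced scrapes, so each sample stands for an equal slice of time; `fraction_up` is a hypothetical helper.

```python
from typing import List


def fraction_up(up_samples: List[float]) -> float:
    """Mean of 0/1 `up` samples over a range, like avg_over_time(up[1d])."""
    if not up_samples:
        raise ValueError("no samples in range")
    return sum(up_samples) / len(up_samples)


# Four scrapes, one of which failed.
print(100 * fraction_up([1, 1, 0, 1]))  # 75.0
```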
January 1, 2018
Prometheus 2.0 brought with it rule groups, making hierarchical aggregation easier than ever.
December 11, 2017
Have you ever wondered why the buckets in histograms are not just counters of events that fall into each bucket?
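For reference, Prometheus histogram buckets are cumulative: the bucket with label `le="x"` counts every observation less than or equal to `x`, and the `+Inf` bucket equals the total count. The minimal Python sketch below builds such buckets and shows one property of this layout: removing a bucket leaves the remaining ones correct, because each is computed independently of its neighbours. `cumulative_buckets` is a hypothetical helper, and this is not a claim about the full reasoning in the post.

```python
import math
from typing import Dict, List


def cumulative_buckets(observations: List[float], bounds: List[float]) -> Dict[float, int]:
    """Count observations <= each upper bound, in the style of `le` buckets.

    A +Inf bucket is appended, so it always equals the total observation count.
    """
    uppers = sorted(bounds) + [math.inf]
    return {le: sum(1 for x in observations if x <= le) for le in uppers}


buckets = cumulative_buckets([0.05, 0.2, 0.2, 3.0], [0.1, 0.5, 1.0])
print(buckets)  # {0.1: 1, 0.5: 3, 1.0: 3, inf: 4}

# Dropping a bucket leaves every remaining count correct as-is,
# since no bucket's value depends on its neighbours.
del buckets[0.5]
print(buckets)  # {0.1: 1, 1.0: 3, inf: 4}
```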