Reliable Insights

A blog on monitoring, scale and operational Sanity

April 23, 2018

Why can count(x > 5) not return 0?

When using the count aggregation operator you may have noticed that it sometimes returns nothing rather than 0. Why is this?
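As a rough illustration of the behaviour described above, here is a minimal Python model of an instant vector, assuming it can be treated as a mapping from label sets to values. The helper names `filter_greater` and `count` are hypothetical and only mimic PromQL semantics; they are not part of any Prometheus API.

```python
# Toy model: an instant vector is a dict of label set -> sample value.

def filter_greater(vector, threshold):
    """Mimics `x > threshold` without `bool`: non-matching samples are dropped."""
    return {labels: v for labels, v in vector.items() if v > threshold}

def count(vector):
    """Mimics count() with no grouping: an empty input yields an empty output."""
    if not vector:
        return {}  # no input samples, so there is no group to emit a 0 for
    return {(): float(len(vector))}

x = {
    (("instance", "a"),): 3.0,
    (("instance", "b"),): 7.0,
}

some = count(filter_greater(x, 5))    # one sample passes the filter
none = count(filter_greater(x, 10))   # nothing passes, so nothing is returned
```

In this sketch `none` is an empty vector rather than a vector containing 0, because the filter removed every sample before the aggregation ran.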

Read more

March 19, 2018

Alerting on crash loops with Prometheus

If your applications are restarting regularly, whether due to segfaults or OOMs, it'd be nice to know.

Read more

February 12, 2018

Alerting on gauges in Prometheus 2.0

One of the major changes introduced in Prometheus 2.0 was staleness handling. Previously, for instant vectors, Prometheus would return a point from up to 5 minutes in the past, which caused a number of different issues.

Read more

February 5, 2018

What percentage of time is my service down for?

Have you ever wondered what percentage of time a given service or application spends up or down?

Read more

January 1, 2018

Rule groups for hierarchical aggregation

Prometheus 2.0 brought with it rule groups, making hierarchical aggregation easier than ever.

Read more

December 11, 2017

Why are Prometheus histograms cumulative?

Have you ever wondered why the buckets in histograms are not just counters of events that fall into each bucket?
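To make "cumulative" concrete, here is a small Python sketch, assuming a histogram can be modelled as a mapping from each bucket's upper bound (the `le` label) to a counter. The `observe` helper and the bucket bounds are hypothetical, chosen only for illustration.

```python
import math

def observe(buckets, value):
    """Increment every bucket whose upper bound is >= the observed value."""
    for le in buckets:
        if value <= le:
            buckets[le] += 1

# Upper bounds are illustrative; math.inf stands in for the +Inf bucket.
buckets = {0.1: 0, 0.5: 0, 1.0: 0, math.inf: 0}

for v in (0.05, 0.3, 0.3, 2.0):
    observe(buckets, v)
```

After these four observations each bucket holds the number of events at or below its bound, so the +Inf bucket equals the total number of observations.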

Read more

December 4, 2017

Using time series as alert thresholds

Usually alert thresholds are hardcoded in the alert. In more sophisticated setups, it would be useful to parameterise the threshold based on another time series.

Read more

October 23, 2017

Converting Rules to the Prometheus 2.0 Format

With the upcoming release of Prometheus 2.0 comes a new format for writing recording and alerting rules.

Read more

September 4, 2017

Functions to Avoid

As PromQL has evolved, there are some functions that should no longer be used.

Read more

August 28, 2017

Avoid irate() in alerts

While the irate() function is useful for granular graphs, it is not suitable for alerting.

Read more

