Reliable Insights

A blog on monitoring, scale and operational sanity

Rate then sum, never sum then rate

There’s a common misunderstanding when dealing with Prometheus counters, and that is how to apply aggregation and other operations when using the rate and other counter-only functions.

Aggregation is core functionality of Prometheus, and it’s most commonly applied to counters. As you’ll recall from a previous article counters only go up and reset. This has implications for what order you apply operations in.

Let’s say you are aggregating up the rate of requests across all of your Node exporters. The individual rates would be:

rate(http_requests_total{job="node"}[5m])

Now to aggregate those, you’d do:

sum by (job)(rate(http_requests_total{job="node"}[5m]))  # This is okay

Seems simple, right?

How not to do it

A common mistake is to try to take the sum and then the rate:

rate(sum by (job)(http_requests_total{job="node"})[5m])  # Don't do this

Even if you’ve worked around this being invalid expression with a recording rule, the real problem is what happens when one of the servers restarts. The counters from the restarted server will reset to 0, the sum will decrease, which will then be treated by rate as a counter reset and you’d get a large spurious spike in the result.

For the same reason:

rate(counter_a[5m] + counter_b[5m])        # Don't do this

is incorrect, instead do:

rate(counter_a[5m]) + rate(counter_b[5m])  # This is okay

 

Similar applies to all other functions, operators and aggregates such as min, max, avg, ceil, histogram_quantilepredict_linear, division etc.

Rule of Thumb

To help keep you on the straight and narrow, remember this: The only mathematical operations you can safely directly apply to a counter are rate, irateincrease, and resets. Anything else will cause you problems.

Brian BrazilRate then sum, never sum then rate
Share this post

Related Posts