Reliable Insights

A blog on monitoring, scale and operational Sanity

September 12, 2016

Who wants seconds?

The Prometheus instrumentation guidelines say to use seconds, and the timing functions in client libraries follow this. Why?

August 29, 2016

Undoing the benefits of labels

It can seem like a good idea to use recording rules to make the content of a time series more explicit, particularly for those not used to labels. However, this usually leads to confusing names and loses the benefits of labels.

August 8, 2016

On the naming of things

How you choose to name metrics is important. If everyone chose different schemes it'd lead to confusion and irritation, and prevent us from sharing and reusing each other's work. I'd like to share some guidelines to help keep things sane for everyone.

August 2, 2016

One agent to rule them all

Another not uncommon question we get about Prometheus is why we don't have a single per-machine agent that handles all the collection, and instead have one exporter per application. Doesn't that make it harder to manage?

July 25, 2016

Target labels are for life, not just for Christmas

How should you choose the labels to put on your Prometheus monitoring targets? Let's take a look.

July 14, 2016

Monitoring without Consensus

When designing a monitoring system and the datastore that goes with it, it can be tempting to go straight for a clustered, highly consistent approach. But is that the best approach?

July 1, 2016

How to Give a Tech Talk

Since starting Robust Perception a year ago I've given 20+ tech talks at various meetups and conferences across the world. I'd like to share some tips around the practicalities of speaking that I've learned along the way.

May 9, 2016

Rate then sum, never sum then rate

A common misunderstanding when dealing with Prometheus counters is how to apply aggregation and other operations when using rate and other counter-only functions.
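
As a rough illustration of why the order matters, here's a minimal Python sketch. It is not how Prometheus is implemented: the `increase` helper is a simplified stand-in that only models counter reset handling (any drop is treated as a restart from zero), and the two sample series are made-up values for two hypothetical instances.

```python
def increase(samples):
    """Total increase of a counter, treating any drop as a reset to zero.

    Simplified model for illustration only: no timestamps, no extrapolation.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted from 0, so the increase is `cur`.
        total += cur - prev if cur >= prev else cur
    return total


# Two hypothetical instances, each adding 10 per scrape; instance b restarts.
a = [100, 110, 120, 130]
b = [500, 510, 0, 10]

# Handle resets per series first, then aggregate.
rate_then_sum = increase(a) + increase(b)

# Aggregate first: the restart of b looks like a reset of the whole sum.
sum_then_rate = increase([x + y for x, y in zip(a, b)])

print(rate_then_sum)  # 50.0
print(sum_then_rate)  # 160.0
```

Summing first hides which series restarted, so the drop in the total is misread as one huge counter resetting and the result is inflated.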

March 31, 2016

The first step is to document

When you've a complicated manual process that you want to improve, your first instinct as a developer might be to jump in and start coding. Hold off a bit: the first step is to document.

February 9, 2016

I’ve got 99 Failure Modes, Yours is Just One

When running a production system there's an endless stream of issues that have the potential to cause you significant hassle. How should you deal with this?

