Reliable Insights

A blog on monitoring, scale and operational Sanity

June 22, 2020

Remote read and partial failures

What happens when your clustered storage fails?

Read more

May 18, 2020

Atomic Writes and the Textfile Collector

To avoid weirdness, write your files atomically.

Read more

April 20, 2020

Don’t federate instance labels

Federation can be quite useful, but it's not replication.

Read more

March 2, 2020

Setting Thresholds on Alerts

Alert thresholds can be surprisingly tricky to get right.

Read more

February 24, 2020

Regex Selectors are a Smell

Have you ever found yourself having to keep on updating and tweaking certain regexes in PromQL?

Read more

December 2, 2019

Target labels, not metric name prefixes

Services are not distinguished by their metric names in Prometheus.

Read more

November 4, 2019

Don’t cross the screams: Monitoring across failure domains

Scraping targets across datacenters will make things better, right?

Read more

September 23, 2019

Laying out Alertmanager routes

How should you design your Alertmanager routes for flexibility and growth?

Read more

September 16, 2019

Looking beyond retention

How can you view older data, while keeping your monitoring reliable?

Read more

September 2, 2019

Cardinality is key

Prometheus performance almost always comes down to one thing: label cardinality.

Read more


Blog   |   Training   |   Book   |   Careers   |   Privacy   |   Demo