Reliable Insights

A blog on monitoring, scale and operational sanity

December 2, 2019

Target labels, not metric name prefixes

Services are not distinguished by their metric names in Prometheus.


November 4, 2019

Don’t cross the streams: Monitoring across failure domains

Scraping targets across datacenters will make things better, right?


September 23, 2019

Laying out Alertmanager routes

How should you design your Alertmanager routes for flexibility and growth?


September 16, 2019

Looking beyond retention

How can you view older data, while keeping your monitoring reliable?


September 2, 2019

Cardinality is key

Prometheus performance almost always comes down to one thing: label cardinality.


August 5, 2019

Putting queues in front of Prometheus for reliability

On a regular basis a potential Prometheus user says they need a different architecture to make things reliable or scalable. Let's look at that.


July 22, 2019

How should pipelines be monitored?

For online serving systems it's fairly well known that you should look for request rate, errors and duration. What about offline processing pipelines though?


June 17, 2019

Idempotent Cron Jobs are Operable Cron Jobs

Having to reconstruct how far a failed cron job had gotten and what exact parameters it was run with can be error-prone and time-consuming. There is a better way.


May 13, 2019

Be discerning in what dashboards you share with users

There's no way that sharing metrics with your users or customers can go wrong. Right?


April 29, 2019

Avoid the Wall of Graphs

Data is not the same as information.

