Reliable Insights

A blog on monitoring, scale and operational Sanity

December 2, 2019

Target labels, not metric name prefixes

Services are not distinguished by their metric names in Prometheus.

Published by Brian Brazil in Posts

Tags: best practices, instrumentation, prometheus

November 4, 2019

Don’t cross the streams: Monitoring across failure domains

Scraping targets across datacenters will make things better, right?

Published by Brian Brazil in Posts

Tags: best practices, prometheus, reliability

September 23, 2019

Laying out Alertmanager routes

How should you design your Alertmanager routes for flexibility and growth?

Published by Brian Brazil in Posts

Tags: alertmanager, best practices, prometheus

September 16, 2019

Looking beyond retention

How can you view older data, while keeping your monitoring reliable?

Published by Brian Brazil in Posts

Tags: best practices, design, prometheus, reliability

September 2, 2019

Cardinality is key

Prometheus performance almost always comes down to one thing: label cardinality.

Published by Brian Brazil in Posts

Tags: best practices, prometheus

Putting queues in front of Prometheus for reliability

On a regular basis, a potential Prometheus user says they need a different architecture to make things reliable or scalable. Let's look at that.

Published by Brian Brazil in Posts

Tags: best practices, design, prometheus, push, reliability, scaling

How should pipelines be monitored?

For online serving systems it's fairly well known that you should look for request rate, errors and duration. What about offline processing pipelines though?

Published by Brian Brazil in Posts

Tags: alerting, best practices, prometheus, promql

Idempotent Cron Jobs are Operable Cron Jobs

Having to reconstruct how far a failed cron job had gotten and what exact parameters it was run with can be error prone and time consuming. There is a better way.

Published by Brian Brazil in Posts

Tags: alerting, best practices, prometheus, reliability

Be discerning in what dashboards you share with users

There's no way that sharing metrics with your users or customers can go wrong. Right?

Published by Brian Brazil in Posts

Tags: best practices, prometheus

Avoid the Wall of Graphs

Data is not the same as information.

Published by Brian Brazil in Posts

Tags: best practices, graphing, prometheus
