Reliable Insights

A blog on monitoring, scale and operational Sanity

September 23, 2019

Laying out Alertmanager routes

How should you design your Alertmanager routes for flexibility and growth?

Read more

September 16, 2019

Looking beyond retention

How can you view older data, while keeping your monitoring reliable?

Read more

September 2, 2019

Cardinality is key

Prometheus performance almost always comes down to one thing: label cardinality.

Read more

August 5, 2019

Putting queues in front of Prometheus for reliability

On a regular basis a potential Prometheus user says they need a different architecture to make things reliable or scalable. Let's look at that.

Read more

July 22, 2019

How should pipelines be monitored?

For online serving systems it's fairly well known that you should look for request rate, errors and duration. What about offline processing pipelines though?

Read more

June 17, 2019

Idempotent Cron Jobs are Operable Cron Jobs

Having to reconstruct how far a failed cron job had gotten and what exact parameters it was run with can be error-prone and time-consuming. There is a better way.

Read more

May 13, 2019

Be discerning in what dashboards you share with users

There's no way that sharing metrics with your users or customers can go wrong. Right?

Read more

April 29, 2019

Avoid the Wall of Graphs

Data is not the same as information.

Read more

October 29, 2018

How many metrics should an application return?

While each application is different, a rough idea of how many metrics there should be would be useful.

Read more

October 1, 2018

What is a job label for?

The job label is one of the labels your targets will always have. So how can you use it?

Read more
