Reliable Insights

A blog on monitoring, scale and operational Sanity

November 4, 2019

Don’t cross the screams: Monitoring across failure domains

Scraping targets across datacenters will make things better, right?


September 16, 2019

Looking beyond retention

How can you view older data, while keeping your monitoring reliable?


September 9, 2019

What queries were running when Prometheus died?

As of Prometheus 2.12.0 there's a new feature to help find problematic queries.


August 5, 2019

Putting queues in front of Prometheus for reliability

Potential Prometheus users regularly say they need a different architecture to make things reliable or scalable. Let's look at that.


June 17, 2019

Idempotent Cron Jobs are Operable Cron Jobs

Having to reconstruct how far a failed cron job had gotten and what exact parameters it was run with can be error-prone and time-consuming. There is a better way.


December 17, 2018

Limiting PromQL resource usage

Prometheus has gained a number of features to limit the impact of expensive PromQL queries.


August 27, 2018

Dealing with “too many open files”

While not a problem specific to Prometheus, being affected by the open files ulimit is something you're likely to run into at some point.


March 5, 2018

Using sample_limit to avoid overload

Worried that your application metrics might suddenly explode in cardinality? sample_limit can save you.


July 17, 2017

High Availability Prometheus Alerting and Notification

Prometheus is architected for reliable alerting. How do you set it up?


July 14, 2016

Monitoring without Consensus

When designing a monitoring system and the datastore that goes with it, it can be tempting to go straight for a clustered, highly consistent approach. But is that the best approach?

