Reliable Insights

A blog on monitoring, scale and operational sanity

March 5, 2018

Using sample_limit to avoid overload

Worried that your application metrics might suddenly explode in cardinality? sample_limit can save you.

July 17, 2017

High Availability Prometheus Alerting and Notification

Prometheus is architected for reliable alerting. How do you set it up?

July 14, 2016

Monitoring without Consensus

When designing a monitoring system and the datastore that goes with it, it can be tempting to go straight for a clustered, highly consistent approach. But is that the best approach?

December 11, 2015

It’s overloaded? Try harder!

Failed requests are a fact of life: network weirdness and machine failures are inevitable. It can be tempting to simply retry the request when this happens, but this may do more harm than good.

November 18, 2015

Do you have basic infrastructure?

When starting out, it's easy to think that you need Docker, Kubernetes, microservices, continuous deployment and all the other trending topics on Hacker News, Reddit and Lobsters. What do you really need?

November 2, 2015

Avoid outages: Beware the Knee

Your service's traffic is steadily growing. Latency has increased a bit, but it's within reason. One day you launch a new customer and latency jumps through the roof, causing an outage. What happened? You hit the knee.
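
One way to picture the knee is with a simple queueing model. The sketch below assumes an M/M/1 queue, where mean response time is service time divided by (1 - utilisation). That model is an assumption made here for illustration, not necessarily how the post itself explains it, and the function name and the 10ms service time are made up.

```python
def mm1_latency(service_time_s: float, utilisation: float) -> float:
    """Mean response time of an M/M/1 queue (illustrative model only).

    service_time_s: mean time to serve one request with no queueing.
    utilisation: fraction of capacity in use, 0 <= utilisation < 1.
    """
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time_s / (1.0 - utilisation)


if __name__ == "__main__":
    # Hypothetical service that takes 10ms per request when idle.
    for u in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilisation {u:.0%}: {mm1_latency(0.010, u) * 1000:.0f}ms")
```

Under this model, latency grows slowly at moderate utilisation and very quickly as utilisation approaches 100%, which is why a modest bump in traffic can turn "a bit slower" into an outage.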

October 29, 2015

Cooking a meal isn’t the same as running a restaurant

I enjoy cooking and regularly make scrumptious meals for myself. Does this mean that I'm capable of running a busy kitchen? Of course not! So why assume that all software engineers can automatically run production services?

September 28, 2015

Healthchecking is Not Transitive

Systems such as Consul perform healthchecking of local services and expose this information to other machines within the cluster. Does this mean that the service will work when you try to talk to it?

September 16, 2015

Dropping metrics at scrape time with Prometheus

It's easy to get carried away by the power of labels with Prometheus. In the extreme this can overload your Prometheus server, for example if you create a time series for each of hundreds of thousands of users. Thankfully, there's a way to deal with this without having to turn off monitoring or deploy a new version of your code.

September 14, 2015

Do you know your peak-to-mean ratio?

Traffic from users to your servers isn't a steady stream; it waxes and wanes over the day and week. The peak-to-mean ratio is your primary tool for avoiding the outages or unnecessary costs that this can cause.
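
As a rough illustration, the ratio is just the highest observed traffic divided by the average over the same period. The sketch below uses made-up request rates, and the function names and the headroom parameter are assumptions for the example rather than anything taken from the post.

```python
from statistics import mean


def peak_to_mean(samples):
    """Return max(samples) / mean(samples) for a series of request rates."""
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(samples)
    if avg <= 0:
        raise ValueError("mean traffic must be positive")
    return max(samples) / avg


def capacity_needed(mean_qps, ratio, headroom=0.25):
    """Estimate peak capacity from mean traffic, with extra headroom.

    headroom is a hypothetical safety margin (0.25 means 25% on top of peak).
    """
    return mean_qps * ratio * (1 + headroom)


if __name__ == "__main__":
    # Made-up queries-per-second samples across part of a day.
    qps = [100, 80, 60, 60, 80, 120, 200, 300]
    ratio = peak_to_mean(qps)
    print(f"peak-to-mean ratio: {ratio:.2f}")
    print(f"capacity to provision: {capacity_needed(mean(qps), ratio):.0f} qps")
```

Provisioning for the mean alone would leave this hypothetical service short at peak, while provisioning for a much higher ratio than you actually see wastes money.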

