Reliable Insights

A blog on monitoring, scale and operational Sanity

March 31, 2016

The first step is to document

When you have a complicated manual process that you want to improve, your first instinct as a developer might be to jump in and start coding. Hold off a bit: the first step is to document.

Read more

February 9, 2016

I’ve got 99 Failure Modes, Yours is Just One

When running a production system, there's an endless stream of issues that have the potential to cause you significant hassle. How should you deal with this?

Read more

January 20, 2016

Little Things Matter

As part of designing and building Prometheus, hundreds of technical decisions have to be made. Every one of them is important in building a sustainable, consistent ecosystem. Today, let's look at one small decision that was made by the Prometheus developers in Consul service discovery.

Read more

December 21, 2015

You look good, have you lost machines?

Whether you're on bare metal or using a cloud provider, there's a question you should always be able to answer. What machines do I have, and what is meant to be running on them?

Read more

December 11, 2015

It’s overloaded? Try harder!

Failed requests are a fact of life; network weirdness and machine failures are inevitable. It can be tempting to simply retry the request when this happens, but this may cause more harm than good.

Read more

November 28, 2015

Do you know what software you’re running?

When getting something working for the first time, it's easy to get caught up in Docker or Vagrant. Before you run it in production with full access and user data, do you know what code you're running?

Read more

November 18, 2015

Do you have basic infrastructure?

When starting out, it's easy to think that you need Docker, Kubernetes, Microservices, Continuous Deployment and all the other trending topics on Hacker News/Reddit/Lobsters. What do you really need?

Read more

November 4, 2015

Unlimited costs, Limited revenue

This week Microsoft removed unlimited storage from their OneDrive offering because, surprise surprise, people were using it as unlimited storage. Does your product have features that cost you time and money, without your users paying accordingly?

Read more

November 2, 2015

Avoid outages: Beware the Knee

Your service's traffic is steadily growing; latency has increased a bit, but it's within reason. One day you launch a new customer and the latency jumps through the roof, causing an outage. What happened? You hit the knee.

Read more

October 29, 2015

Cooking a meal isn’t the same as running a restaurant

I enjoy cooking and regularly make scrumptious meals for myself. Does this mean that I'm capable of running a busy kitchen? Of course not! So why assume that all software engineers can automatically run production services?

Read more
