Reliable Insights

A blog on monitoring, scale and operational Sanity

January 20, 2016

Little Things Matter

As part of designing and building Prometheus, hundreds of technical decisions have to be made. Every one of them is important in building a sustainable, consistent ecosystem. Today, let's look at one small decision the Prometheus developers made in Consul service discovery.

December 21, 2015

You look good, have you lost machines?

Whether you're on bare metal or using a cloud provider, there's a question you should always be able to answer: what machines do I have, and what is meant to be running on them?

December 11, 2015

It’s overloaded? Try harder!

Failed requests are a fact of life: network weirdness and machine failures are inevitable. It can be tempting to simply retry the request when this happens, but doing so may cause more harm than good.

November 28, 2015

Do you know what software you’re running?

When getting something working for the first time, it's easy to get caught up in Docker or Vagrant. Before you run it in production with full access and user data, do you know what code you're running?

November 18, 2015

Do you have basic infrastructure?

When starting out, it's easy to think that you need Docker, Kubernetes, Microservices, Continuous Deployment and all the other trending topics on Hacker News/Reddit/Lobsters. What do you really need?

November 4, 2015

Unlimited costs, Limited revenue

This week Microsoft removed unlimited storage from their OneDrive offering because, surprise surprise, people were using it as unlimited storage. Does your product have features that cost you time and money, without your users paying accordingly?

November 2, 2015

Avoid outages: Beware the Knee

Your service's traffic is steadily growing; latency has increased a bit, but it's within reason. One day you launch a new customer and the latency jumps through the roof, causing an outage. What happened? You hit the knee.

October 29, 2015

Cooking a meal isn’t the same as running a restaurant

I enjoy cooking and regularly make scrumptious meals for myself. Does this mean that I'm capable of running a busy kitchen? Of course not! So why assume that all software engineers can automatically run production services?

October 8, 2015

Monitoring: Not Just For Outages

It's common to think of monitoring as something just to alert you when things are going wrong. At Robust Perception we believe in Inclusive Monitoring, where all aspects of systems are monitored and available to provide insight and drive decisions.

September 28, 2015

Healthchecking is Not Transitive

Systems such as Consul perform healthchecking of local services and expose this information to other machines within the cluster. Does this mean that the service will work when you try to talk to it?

