Reliable Insights

A blog on monitoring, scale and operational Sanity

January 16, 2017

Federation, what is it good for?

There's various ways Prometheus federation can be used. To ensure your monitoring is scalable and reliable, let's look at how to best use it.

Read more

August 2, 2016

One agent to rule them all

Another not uncommon question we get about Prometheus is as to why we don't have a single per-machine agent that handles all the collection, and instead have one exporter per application. Doesn't that make it harder to manage?

Read more

November 18, 2015

Do you have basic infrastructure?

When starting out it's easy to think that you need Docker, Kubernetes, Microservices, Continuous Deployment and all the other trending topics on Hacker News/Reddit/Lobsters. What do you really need?

Read more

November 4, 2015

Unlimited costs, Limited revenue

This week Microsoft removed unlimited storage from their OneDrive offering, because surprise surprise people were using it as unlimited storage. Does your product have features that cost you time and money, without your users paying accordingly?

Read more

November 2, 2015

Avoid outages: Beware the Knee

Your service's traffic is steadily growing, latency has increased a bit but it's within reason. One day you launch a new customer and the latency jumps through the roof causing an outage. What happened? You hit the knee.

Read more

October 29, 2015

Cooking a meal isn’t the same as running a restaurant

I enjoy cooking and regularly make scrumptious meals for myself. Does this mean that I'm capable of running a busy kitchen? Of course not! So why assume that all software engineers can automatically run production services?

Read more

October 8, 2015

Monitoring: Not Just For Outages

It's common to think of monitoring as something just to alert you when things are going wrong.  At Robust Perception we believe in Inclusive Monitoring, where all aspects of systems are monitored and available to provide insight and drive decisions.

Read more

August 23, 2015

There are 100,000 Seconds in a Day

Just after you've launched is not the best time to find out that you can't handle the load you predicted, or that running costs are much higher than you'd like. By estimating the operational parameters of your system as you design you can gain confidence that the system will work as you expect.

Read more

August 14, 2015

Scaling and Federating Prometheus

A single Prometheus server can easily handle millions of time series. That's enough for a thousand servers with a thousand time series each scraped every 10 seconds. As your systems scale beyond that, Prometheus can scale too.

Read more

twitter
youtube
linkedin

Blog   |   Training   |   Book   |   Careers   |   Privacy   |   Demo