Reliable Insights

A blog on monitoring, scale and operational sanity

March 4, 2019

Measuring Java garbage collection with Prometheus

GC stats are one of the many metrics that the Java/JVM client library exposes.

February 25, 2019

Monthly reporting with Prometheus and Python

It's common to want reports from Prometheus, such as how many requests failed over an entire month.

February 18, 2019

How much of the time is my network usage over a certain amount?

The new subquery feature in Prometheus 2.7 makes this possible in one query.

December 17, 2018

Limiting PromQL resource usage

Prometheus has gained a number of features to limit the impact of expensive PromQL queries.

November 26, 2018

Unit testing alerts with Prometheus

In the previous post we looked at testing rules. You can also test alerts.

November 19, 2018

Unit testing rules with Prometheus

As of 2.5.0, promtool has a feature to allow you to test your recording rules.

October 15, 2018

Graph top N time series in Grafana

As of Grafana 5.3.0 there's a feature that allows correct graphing of the top N series over a duration.

September 24, 2018

Alerting on approaching open file limits

In a previous post we looked at dealing with reaching the open file limit. How about alerting before it happens?

August 6, 2018

Aggregating across batch job runs with push_time_seconds

To count how many times something has happened you can use a counter and rate(), but that doesn't work across batch job runs.

July 23, 2018

Absent Alerting for Scraped Metrics

In the previous post we looked at what to do when all the targets for a job have disappeared. What if you want to alert on specific metrics disappearing from one target?
