Reliable Insights

A blog on monitoring, scale and operational sanity

October 1, 2018

What is a job label for?

The job label is one of the labels your targets will always have. So how can you use it?

September 24, 2018

Alerting on approaching open file limits

In a previous post we looked at dealing with reaching the open file limit. How about alerting before it happens?

September 17, 2018

New Features in Prometheus 2.4.0

Prometheus 2.4.0 is now out, following on from 2.3.0 back in June with many fixes and improvements.

September 10, 2018

Using the textfile collector from a shell script

The textfile collector is handy for monitoring machine-level cronjobs. How would you go about that?

September 3, 2018

Deleting time series from Prometheus

If a misconfiguration leads to unwanted time series, it's good to know how to remove them.

August 27, 2018

Dealing with “too many open files”

While not a problem specific to Prometheus, being affected by the open files ulimit is something you're likely to run into at some point.

August 20, 2018

Using the Java client with Gradle

While the Java client library uses pom.xml and Maven, there's nothing stopping you from using other tools such as Gradle.

August 13, 2018

Why predeclare metrics?

The standard way to use metrics in Prometheus is to declare them at file level, before using them. Why?

August 6, 2018

Aggregating across batch job runs with push_time_seconds

For counting how many times a thing has happened you can use a counter and rate(), but that doesn't work across batch jobs.


July 30, 2018

Prometheus: Up and Running is out

After many months of work, Prometheus: Up & Running is now available for purchase!

