What does and doesn't make a good grouping key?

Pushgateway grouping keys are fundamentally target labels, and similar considerations apply: they should be minimal and they should be constant. What "constant" means may not be obvious for batch jobs, though.

The purpose of the Pushgateway is to hold metrics from the end of a run of a recurring service-level batch job. For example, you might have a daily cron job that does some cleanup against a Cassandra cluster. The main metric of interest will be when it last succeeded, so that you can alert if it fails several days in a row. A key point is that you only care about the last success, not every previous run of the cron job.

Accordingly, the grouping key should never contain an instance label, host port, container name, or anything else that varies from run to run of a cron job. The goal is that when the cron job pushes metrics to the Pushgateway, they replace the metrics from the previous push. That way you always have the same set of time series, which makes things easy to graph and alert on because series don't appear and disappear.
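To make the replacement behaviour concrete, here is a minimal sketch using a plain dictionary as a stand-in for the Pushgateway. It is a simplified model, not Pushgateway code: the `push` helper, the host names, and the timestamps are all hypothetical. It assumes only that a push replaces the group of metrics sharing the same job and grouping labels.

```python
# Simplified in-memory model of how pushes are grouped; not Pushgateway code.
def push(store, job, grouping_key, metrics):
    """Replace the metric group identified by job plus grouping labels."""
    group = frozenset({"job": job, **grouping_key}.items())
    store[group] = dict(metrics)
    return store


stable = {}   # groups when the grouping key is constant
varying = {}  # groups when the grouping key includes an instance label

# Three nightly runs that happen to land on different (hypothetical) hosts.
runs = [("host-a", 1700000000), ("host-b", 1700086400), ("host-c", 1700172800)]
for host, finished_at in runs:
    metrics = {"mybatchjob_last_success": finished_at}
    # Constant grouping key: each push overwrites the previous one.
    push(stable, "cassandra_cleanup", {"cluster": "main"}, metrics)
    # Grouping key with an instance label: every run creates a new group.
    push(varying, "cassandra_cleanup",
         {"cluster": "main", "instance": host}, metrics)

print(len(stable))   # 1
print(len(varying))  # 3
```

With the constant key there is always exactly one group holding the latest success time. With the instance label, stale groups from earlier hosts pile up until someone deletes them.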

If you shouldn't have an instance label, what labels should you have? A job label is required, and should follow the usual guidelines. Here, for example, it might be cassandra_cleanup. Further labels should follow your usual target label taxonomy, which in this case means however you usually label your Cassandra clusters. That might be a cluster name, an environment, owner/team, and/or location. Keep in mind, though, that there's no need to duplicate labels already covered by external_labels in your Prometheus.
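As a rough sketch of that last point, the helper below builds a grouping key from a set of target labels and drops any that external_labels already supplies. The `build_grouping_key` function and all label names and values are hypothetical, chosen only for illustration.

```python
def build_grouping_key(target_labels, external_labels):
    """Keep the target labels that external_labels does not already provide."""
    return {
        name: value
        for name, value in sorted(target_labels.items())
        if name not in external_labels
    }


# Hypothetical values: what Prometheus already attaches via external_labels,
# and how the Cassandra cluster is usually labelled.
external_labels = {"env": "prod", "region": "eu-west"}
target_labels = {"cluster": "main", "team": "storage", "env": "prod"}

grouping_key = build_grouping_key(target_labels, external_labels)
print(grouping_key)  # {'cluster': 'main', 'team': 'storage'}
```

The resulting dictionary could then be passed as the `grouping_key` argument of `push_to_gateway` from the Python client, alongside `job='cassandra_cleanup'`.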

 

One objection you might have is that without an instance label, how would you know which machine the cron job ran on? While this is a matter for event logging rather than metrics, you can still expose it in a way that doesn't interfere with the grouping key by using an info metric. For example, in Python:

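import socket

from prometheus_client import CollectorRegistry, Gauge, Info, push_to_gateway

# Use a dedicated registry so only this job's metrics are pushed.
registry = CollectorRegistry()

# Record when this run succeeded.
last_success = Gauge('mybatchjob_last_success',
    'Unixtime my batch job last succeeded', registry=registry)
last_success.set_to_current_time()

# Expose the host as a label on the mybatchjob_info metric,
# rather than as part of the grouping key.
extra_info = Info('mybatchjob', 'Information about this batchjob run',
    registry=registry)
extra_info.info({'instance': socket.getfqdn()})

# Only the job label is in the grouping key, so each push replaces the last.
push_to_gateway('localhost:9091', job='my_batch_job', registry=registry)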

 

Have questions about monitoring batch jobs? Contact us.