Prometheus monitoring is usually aimed at long-lived daemons, but what if you've a batch job that you want to monitor?
When monitoring batch jobs, such as cronjobs, the main thing you care about is when the job last succeeded. For example, if you've a cronjob that runs every hour and needs to work at least once every few hours, then you want to alert if it hasn't worked for at least two runs, rather than on every individual failure. It's also useful to track how long batch jobs take over time.
Prometheus is primarily a pull-based monitoring system, and a batch job may well have exited before Prometheus gets around to scraping it. Batch jobs should instead push their metrics to the Pushgateway, which persists them for Prometheus to scrape.
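If you don't already have a Pushgateway running, one quick way to try it out locally is via the official Docker image (a sketch assuming Docker is available; the Pushgateway listens on port 9091 by default):

docker run -d -p 9091:9091 prom/pushgateway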
First install the Prometheus Python client:
pip install prometheus_client
Then, in your batch job, push metrics to the Pushgateway:
from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

registry = CollectorRegistry()
duration = Gauge('mybatchjob_duration_seconds',
                 'Duration of batch job', registry=registry)
try:
    with duration.time():
        pass  # Your batch job's work goes here.
except Exception:
    raise  # Fail visibly; the finally block below still pushes the duration.
else:
    # Only reached if the job raised no exception.
    last_success = Gauge('mybatchjob_last_success',
                         'Unixtime my batch job last succeeded', registry=registry)
    last_success.set_to_current_time()
finally:
    pushadd_to_gateway('localhost:9091', job='my_batch_job', registry=registry)
The mybatchjob_last_success metric is only pushed when the job succeeds. As we're using pushadd_to_gateway rather than push_to_gateway, a failed run won't overwrite the value from a previous success. mybatchjob_duration_seconds is always pushed, so you can graph it over time.
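To make the difference concrete, here's a sketch of what a failed run ends up pushing, assuming a Pushgateway on localhost:9091 and a previous successful run of the same job:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway, pushadd_to_gateway

registry = CollectorRegistry()
duration = Gauge('mybatchjob_duration_seconds',
                 'Duration of batch job', registry=registry)
duration.set(12.3)  # Duration recorded for a run that then failed.

# pushadd_to_gateway issues an HTTP POST: only metrics present in this registry
# are replaced, so the mybatchjob_last_success value pushed by the last good run
# stays on the Pushgateway.
pushadd_to_gateway('localhost:9091', job='my_batch_job', registry=registry)

# push_to_gateway issues an HTTP PUT: it replaces everything under the
# job="my_batch_job" grouping key, which would delete the old success timestamp.
# push_to_gateway('localhost:9091', job='my_batch_job', registry=registry)

You can check what the Pushgateway currently holds by fetching http://localhost:9091/metrics.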
Once you've set up Prometheus to scrape the Pushgateway, you can add an alert in a rule file:
groups:
- name: test.rules
  rules:
  - alert: MyBatchJobNoRecentSuccess
    expr: time() - mybatchjob_last_success{job="my_batch_job"} > 3600 * 3.5
    annotations:
      description: 'mybatchjob last succeeded {{ humanizeDuration $value }} ago'
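The threshold of 3600 * 3.5 is 3.5 hours, so an hourly job has to miss a couple of runs in a row, plus some slack, before the alert fires.

For the scrape side, Prometheus needs a job pointing at the Pushgateway. A minimal sketch, assuming it runs on localhost:9091; honor_labels: true keeps the pushed job="my_batch_job" label rather than overwriting it with the scrape job's own:

scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']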



