Prometheus monitoring is usually against on long-lived daemons, but what if you've a batch job that you want to monitor?

When monitoring batch jobs, such as cronjobs, the main thing you care about is when it last succeeded. For example if you've a cronjob that runs every hour and needs to work at least once every few hours, then you want to alert if it hasn't worked for at least two runs - rather than on every individual failure. It's also useful to track how long batch jobs take over time.

Prometheus is primarily a pull-based monitoring system. Batch jobs should use the Pushgateway, which will persist metrics for them

First install the Prometheus Python client:

pip install prometheus_client

Then in your batch job push metrics to the Pushgateway:

from prometheus_client import Gauge,CollectorRegistry,pushadd_to_gateway

registry = CollectorRegistry()
duration = Gauge('mybatchjob_duration_seconds',
    'Duration of batch job', registry=registry)
  with duration.time():
    pass  # Your code here
except: pass
  last_success = Gauge('mybatchjob_last_success', 
      'Unixtime my batch job last succeeded', registry=registry)
  pushadd_to_gateway('localhost:9091', job='my_batch_job', registry=registry)

The mybatchjob_last_success metric is only pushed when we succeed. As we're using pushadd_to_gateway rather than push_to_gateway a failed run won't overwrite the value of a previous success. mybatchjob_duration_seconds is always pushed, so you can graph it over time.


Once you've setup Prometheus to scrape the Pushgateway, you can add an alert in a rule file:

- name: test.rules
  - alert: MyBatchJobNoRecentSuccess
    expr: time() - mybatchjob_last_success{job="my_batch_job"} > 3600 * 3.5
      description: mybatchjob last succeeded {{humanizeDuration $value}} ago