What does and doesn't make a good grouping key?
A blog on monitoring, scale and operational Sanity
To count how many times something has happened you can use a counter and
rate(), but that doesn't work across batch jobs.
Ephemeral jobs are often not around long enough for Prometheus to scrape their metrics. The Pushgateway was developed to remedy this: such jobs push their metrics to a metrics cache, which Prometheus can then scrape long after the original jobs have gone away. This blog post discusses some of the common pitfalls users fall into when adding the Pushgateway to their monitoring stack.
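To make the push model concrete, here is a minimal sketch of what a batch job might do just before it exits, using only the Python standard library. The Pushgateway accepts metrics in the text exposition format at a URL of the form `/metrics/job/<job>/<label>/<value>`, where the job name and any extra label pairs form the grouping key. The helper names, the address `http://localhost:9091`, the job name `nightly_backup` and the metric name are all assumptions made up for illustration; in practice you would more likely use an official client library.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def pushgateway_url(base_url, job, grouping_key=None):
    """Build the push URL: <base>/metrics/job/<job>/<label>/<value>/..."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in sorted((grouping_key or {}).items()):
        # Empty values and values containing "/" need special encoding
        # on the Pushgateway side; this sketch simply refuses them.
        if not value or "/" in value:
            raise ValueError(f"unsupported grouping value for {label!r}")
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def render_gauge(name, help_text, value):
    """Render a single gauge sample in the text exposition format."""
    return f"# HELP {name} {help_text}\n# TYPE {name} gauge\n{name} {value}\n"


def push(url, body, timeout=5.0):
    """Send the metrics with PUT, replacing the whole group at that URL."""
    request = Request(url, data=body.encode("utf-8"), method="PUT")
    request.add_header("Content-Type", "text/plain; version=0.0.4")
    with urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical job, instance and metric names, for illustration only.
url = pushgateway_url(
    "http://localhost:9091", "nightly_backup", {"instance": "db1"}
)
body = render_gauge(
    "backup_last_success_timestamp_seconds",
    "Unix time of the last successful backup.",
    1700000000,
)
# push(url, body)  # uncomment when a Pushgateway is listening at base_url
```

The grouping key is everything in the URL path after `/metrics/`: a later push to the same path replaces that group, while a push to a different path creates a separate group that sits in the cache until it is deleted.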