For online serving systems, it's fairly well known that you should monitor request rate, errors and duration. But what about offline processing pipelines?
A blog on monitoring, scale and operational Sanity
Alerting is an art. You must alert just enough to be aware of every problem arising in the monitored system, without drowning out the signal with excess noise. In this blog post we'll explain some of the best practices to use when alerting with Prometheus.