One simple rule can improve your alerting: alert on symptoms, not causes. If users are experiencing errors or high latency, that’s an alert we care about. If a server is down, that shouldn’t be an alert; maybe an engineer is doing repairs on that server. If it isn’t impacting real end-user performance, we shouldn’t care.
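To make the distinction concrete, here is a minimal sketch of a symptom-based check. The metric names, thresholds, and `fetch_metric` helper are invented for illustration; in practice this logic would live as rules in your alerting system rather than in application code.

```python
# Hypothetical sketch: alert on user-visible symptoms, not on machine state.
# fetch_metric() and the metric names below are made up for illustration.

def should_page(fetch_metric) -> bool:
    """Page the on-call engineer only when users are actually affected."""
    error_rate = fetch_metric("http_requests_error_ratio_5m")      # fraction of failed requests
    p99_latency_ms = fetch_metric("http_request_latency_p99_5m_ms")

    # Symptoms: real users are seeing errors or slow responses.
    if error_rate > 0.01 or p99_latency_ms > 500:
        return True

    # Deliberately NOT checked: "server X is down", "disk usage on host Y",
    # and other causes that may have no user-visible impact at all.
    return False
```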
Alerting on causes, like a server being down, will result in false positives, and it is easy to underestimate their danger. Often we think, “the on-call engineer will just ignore that alert if it is a false positive”. But they may not know it is a false positive, and may waste hours trying to track down the source of a problem that has no user impact. Or, in a crisis, they may misprioritize and investigate the wrong issue. The biggest danger, though, is that false positives train on-call engineers to ignore alerts.
Now, of course, when a symptom alert does fire, it is the on-call engineer’s job to figure out why. Don’t be tempted to pre-emptively help them by alerting on both the symptoms and the causes, or you risk false positives and a lower signal-to-noise ratio. It is far better to have helpful, complete, and consistent dashboards that let the person on call understand the system for themselves. Link to the relevant dashboard and metrics from the alert, as in the sketch below; from there the on-call engineer can investigate and understand the issue.
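One way to do this is to carry the dashboard link as part of the alert itself. The `Alert` shape and the URL below are assumptions for the sake of the sketch, not any specific tool’s API.

```python
# Hypothetical sketch: every symptom alert carries a link to the dashboard
# showing the relevant metrics, so the on-call engineer can investigate
# causes themselves. The dataclass and URL are illustrative only.

from dataclasses import dataclass


@dataclass
class Alert:
    title: str
    summary: str
    dashboard_url: str  # where to start digging


checkout_latency_alert = Alert(
    title="Checkout p99 latency above SLO",
    summary="Users are seeing slow checkouts (p99 > 500 ms for 5 minutes).",
    dashboard_url="https://dashboards.example.com/checkout-latency",
)
```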
As a side note, the mantra “symptoms, not causes” will also help you write robust tests that outlast the code they are testing.
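Applied to tests, the idea looks roughly like this: assert on the observable result (the symptom) rather than on how the code arrives at it (the cause). The `normalize_username` function here is invented for illustration.

```python
# Hypothetical example: test the observable behaviour, not the implementation.

def normalize_username(raw: str) -> str:
    return raw.strip().lower()


def test_normalize_username_symptom():
    # Symptom: callers get a trimmed, lowercased username.
    assert normalize_username("  Alice ") == "alice"

# A "cause" test would instead assert that .strip() is called before .lower(),
# e.g. via mocks; it breaks as soon as the implementation changes, even though
# the behaviour callers depend on is unchanged.
```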