That's the approximate number of alert emails that got auto-deleted while I was away on vacation. That number will rise further before I officially come back from vacation, but it's still a big number. The sad part is, 98% of those emails are for:
- Problems I don't care about.
- Unsnoozable known issues.
- Repeated alarms for the first two points (Puppet, I'm looking at you).
We've made a real effort to cut down on our monitoring-fatigue problem, but we're not there yet. In part that's because the old, verbose monitoring system is still up and running; in part it's due to limitations in the alerting systems we have access to; and in part it's due to organizational habits that over-notify for alarms on the theory of "if we tell everyone, someone will notice."
A couple of weeks ago, PagerDuty published a nice blog post about tackling alert fatigue, with a lot of good points to consider. I want to spend some time on point 6:
Make sure the right people are getting alerts.
How many of you have a mailing list you dump random auto-generated crap like cron errors and backup failure notices to?
This pattern is very common in sysadmin teams, especially teams that began as one or a very few people. It just doesn't scale. You also learn to ignore a bunch of things, like backup "failures" for always-open files. You can't build an effective alerting system on the assumption that alerts can be ignored; if you find yourself telling new hires, "oh, ignore those, they don't mean anything," you have a problem.
The failure mode of tell-everyone is that everyone can assume someone else saw it first and is working on it. And no one works on it.
I've seen exactly this failure mode many times. I've even perpetrated it, since I know certain coworkers are always on top of certain kinds of alerts, so I can safely ignore actually-critical alerts. It breaks down if those people have a baby and are out of the office for four weeks. Or are on the Interstate for three hours and not checking mail at that moment.
When this happens and big stuff gets dropped, technical management gets kind of cranky. Which leads to hypervigilance and...
The failure mode of tell-everyone is that everyone will pile into the problem at the same time and make things worse.
I've seen this one too. A major-critical alarm is sent to a big distribution list, and six admins immediately VPN in and start doing low-impact diagnostics. Diagnostics that aren't low impact if six people are doing them at the same time. Diagnostics that aren't meant to be run in parallel and can return non-deterministic results if run that way, which tells six admins different stories about what's actually wrong, sending them off in six different directions to solve not-actually-a-problem issues.
This is the Thundering Herd problem as it applies to sysadmins.
The usual fix for this is to build a culture of "I've got this" emails, and to look for those messages before working on a problem.
The usual fix for this fails if admins do a little "verify the problem is actually a problem" work before sending the email, and stomp on each other's toes in the process.
The usual fix for that is to build a culture of "I'm looking into it" emails.
Which breaks down if a sysadmin is reasonably sure they're the only one who saw the alert, skips the email, and works on it anyway. Oops.
Really, these are all examples of telling the right people about the problem, but you need to go into more detail than "the right people". You need "the right person". You need an on-call schedule that notifies one or two of the Right People about problems. Build it with the expectation that whoever is in the hot seat will answer ALL alerts, build a rotation so no one is in the hot seat long enough to start ignoring alarms, and you have a far more reliable alerting system.
PagerDuty sells such a scheduling system. But what if you can't afford X-dollars a seat for something like that? You have some options. Here is one:
An on-call distribution-list and scheduler tasks
This recipe will provide an on-call rotation using nothing but free tools. It won't work with all environments. Scripting or API access to the email system is required.
- 1 on-call distribution list.
- A list of names of people who can go into the DL.
- A task scheduler such as cron or Windows Task Scheduler.
- A database of who is supposed to be on-call when (can substitute a flat file if needed)
- A scripting language that can talk to both the email system's management interface and the database.
Build a script that can query the database (or flat file) to determine who is supposed to be on-call right now, and can update the distribution list with that name. PowerShell can do all of this for full MS-stack environments; for non-MS environments, more creativity may be needed.
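To make that concrete, here is a minimal sketch of what such a script might look like in Python, assuming a flat-file CSV schedule. The file path, the distribution-list address, and the set-dl-members helper are all hypothetical stand-ins for whatever your mail system's management interface actually exposes (Exchange cmdlets, a Google Workspace API call, and so on):

```python
#!/usr/bin/env python3
"""Rotate the on-call distribution list from a flat-file schedule.

A minimal sketch only. The schedule path, the DL address, and the
set-dl-members helper are hypothetical; adapt them to your environment.
"""
import csv
import subprocess
from datetime import datetime, timezone

SCHEDULE_FILE = "/etc/oncall/schedule.csv"      # hypothetical path
ONCALL_DL = "oncall-alerts@example.com"         # hypothetical distribution list
ALWAYS_NOTIFY = ["noc-manager@example.com"]     # people who always stay on the list


def current_oncall(schedule_file):
    """Return whoever is scheduled right now, or None if there's a gap.

    Expects rows of start,end,username with ISO-8601 timestamps that
    include a UTC offset (e.g. 2015-11-02T08:00:00+00:00).
    """
    now = datetime.now(timezone.utc)
    with open(schedule_file, newline="") as fh:
        for row in csv.DictReader(fh):
            start = datetime.fromisoformat(row["start"])
            end = datetime.fromisoformat(row["end"])
            if start <= now < end:
                return row["username"]
    return None


def set_oncall(dl, members):
    """Replace the DL membership.

    This is the environment-specific part, shown here as a call to a
    hypothetical set-dl-members helper; in a full MS-stack shop this is
    where the PowerShell/Exchange work would happen instead.
    """
    subprocess.run(["set-dl-members", dl, *members], check=True)


if __name__ == "__main__":
    who = current_oncall(SCHEDULE_FILE)
    if who is None:
        # A hole in the schedule is itself worth alarming about.
        raise SystemExit("No one is scheduled to be on-call right now!")
    set_oncall(ONCALL_DL, [who] + ALWAYS_NOTIFY)
```

The only genuinely environment-specific piece is set_oncall(); everything else is just reading the schedule and deciding who should be on the list.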
Populate the database (or flat-file) with the times and names of who is to be on-call.
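If you go the flat-file route, the schedule can be as simple as a CSV with one row per shift. The names and the week-long shifts here are purely illustrative:

```
start,end,username
2015-11-02T08:00:00+00:00,2015-11-09T08:00:00+00:00,jsmith
2015-11-09T08:00:00+00:00,2015-11-16T08:00:00+00:00,mjones
2015-11-16T08:00:00+00:00,2015-11-23T08:00:00+00:00,bwilliams
```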
Schedule execution of the script using a task scheduler.
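On a Unix-ish box that can be a single cron entry; the path and the 15-minute cadence are just examples, so run it as often as your shift boundaries require:

```
# Illustrative crontab entry: re-evaluate the on-call schedule every 15 minutes
*/15 * * * *  /usr/local/bin/update-oncall-dl.py
```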
Configure your alert-emailing system to send mail to the on-call distribution list.
Nice and free! You don't get a GUI to manage the schedule, and handling on-call shift swaps will be fully manual, but at least you're now sending alerts to people who know they need to respond to them. You can even build the watch-list so that it always includes certain people who want to know whenever something happens, such as managers. The thundering-herd and circle-of-not-me problems are abated.
This system doesn't handle escalations at all; that's going to cost you either money or internal development time. You kind of do get what you pay for, after all.
How long should on-call shifts be?
That depends on your alert-frequency, how long it takes to remediate an alert, and the response time required.
Alert Frequency and Remediation:
- Faster than once per 30 minutes:
    - They're a professional fire-fighter now. This is their full-time job; schedule them accordingly.
- One every 30 minutes to an hour:
    - If remediation takes longer than 1 minute on average, the watch-stander can't do much of anything else but wait for alerts to show up. 8-12 hours is probably the longest shift you can expect reasonable performance from.
    - If remediation takes less than a minute, 16 hours is the most you can expect, because this frequency ensures the watch-stander gets no sleep.
- One every 1-2 hours:
    - If remediation takes longer than 10 minutes on average, the watch-stander probably can't sleep on their shift. 16 hours is probably the maximum shift length.
    - If remediation takes less than 10 minutes, sleep is more possible. However, if your watch-standers are the kind of people who don't fall asleep fast, you can't rely on that. 1 day for people who sleep at the drop of a hat, 16 hours for the rest of us.
- One every 2-4 hours:
    - Sleep will be significantly disrupted by the watch. 2-4 days for people who sleep at the drop of a hat, 1 day for the rest of us.
- One every 4-6 hours:
    - If remediation takes longer than an hour, 1 week for people who sleep at the drop of a hat, 2-4 days for the rest of us.
- Slower than one every 6 hours:
This is a fuzzy one, since it's about work/life balance. If all alerts need to be responded to within 5 minutes of arrival, the watch-stander needs to be able to respond within 5 minutes. That means no driving, and no doing anything that requires ignoring the phone, such as kids' performances or after-work meetups. For a watch-stander who drives to work, the on-call shift can't overlap their commute.
For a 30-minute response requirement, things are easier. Short drives are fine, and longer ones work so long as the watch-stander pulls over to check each alert as it arrives. Kids' performances are still problematic, and so are longer commutes.
And then there is the curve-ball known as "define 'response'". If response means acking the alert, that's one thing, and much less disruptive to off-hours life. If response means "starts working on the problem," that's much more disruptive, since the watch-stander has to have a laptop and bandwidth available at all times.
The answers here will determine what a reasonable on-call shift looks like. A week of 5-minute time-to-work will leave the watch-stander house-bound for the entire week, and that sucks a lot; there had better be on-call pay associated with a schedule like that, or you're going to get turnover as sysadmins go work for someone less annoying.
It's more than just making sure the right people are getting alerts; it's building a system that notifies the Right People in such a way that alerts actually get responded to and handled.
This will build a better alerting system overall.