Even great engineering organizations have issues on occasion. When you inevitably have a service disruption you’ll probably see (at least) these five personalities in the chat.
- The Gawkers / Watchers:
Drawn to the spectacle. The folks who show up at a house fire or slow down on the highway to look at a wreck on the other side. That may sound mean, but it’s not meant to be. We’ve all been there. Watchers are harmless so long as they don’t impede the troubleshooting and recovery activities. It’s good for them to be there, to learn the processes and prepare for when it’s their turn to engage in some future incident. Grab some 🍿 and enjoy the show.
- The Detective 🕵️:
Looking for the root cause; the metaphorical “who done it?” Following leads in the form of errors; interrogating the logs and asking questions of the monitoring tools like suspects and witnesses. This person is tracing the symptoms back to their source. They’ll report findings which may help remediation but won’t stop digging until the culprit is known.
- ER Doc 🩺:
This person is an expert in triaging; assimilating lots of incoming info, asking questions and doling out actions rapid fire. Responding quickly by tackling the highest priorities first. The Product is the patient, the service outage is a critical wound, and the ER Doc is going to stabilize the situation ASAP.
- Fire Fighter 🧑‍🚒:
When things flare up, this person shows up ready to contain the 🔥. They are the one who works on remediation. They take on the manual tasks like restarting servers or executing the steps in a runbook. While they may be interested in what caused the fire, their main focus is doing anything they can to put it out, taking direction on how best to do that.
- The News Reporter 📰:
Leadership and the wider org need updates as the incident progresses. The Reporter is there to listen and observe as the action unfolds, then summarize for all interested parties. Just like the winter storm team at a New England news station, they’re periodically giving you the snow totals (impacted user counts), updates on road conditions (where it is getting better or worse), and letting you know the plows are still running (remediation activities).
To resolve an incident efficiently you’ll likely need a mix of these personalities. Having more Detectives and Fire Fighters helps, but there’s usually only room for one Doc (in a different analogy I’d say something about too many cooks in the kitchen). Some individuals may wear multiple hats or different hats across incidents.
I love a good mystery so I usually start out playing detective. I challenge myself to find the cause before the on-call or SME shows up. Depending on the incident management experience and technical depth of the other participants I (or someone like me) may assume the role of the ER Doc to start the triage and remediation process. In my experience this is where most folks hesitate; afraid of making the wrong call and making things worse. A good topic for another time would be a quick rubric or heuristic to help make decisions in these situations.
PS: yes I know it’s supposed to be ED (Emergency Department) and not ER (Emergency Room) these days. ED Doc just doesn’t evoke the same imagery for me. I’m picturing an actual room filled with folks trying to stabilize a patient, not a department.