Your incident response has a bus factor of one
There is a quick test for how fragile your incident response really is. Pick your most reliable engineer, the one who always gets pulled into the bad incidents. Now imagine them on a long flight with no wifi during your next outage. If the honest answer is that you would be in serious trouble, your bus factor is one.
A bus factor of one means that when a single person quits, goes on leave, or is just unreachable, a chunk of your ability to recover goes with them. Everyone is grateful for that engineer. Almost nobody notices how much risk is sitting on their shoulders.
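If you want a rough number rather than a thought experiment, you can count how many different people have actually resolved incidents for each service. The sketch below is only an illustration: the incident log, the service names, and the engineer names are all made up, and "distinct resolvers" is a crude proxy for bus factor, not a precise measure of who holds the knowledge.

```python
from collections import defaultdict

# Hypothetical incident log: (service, engineer who resolved it).
# In practice this would come from wherever you keep incident history.
incidents = [
    ("payments", "dana"),
    ("payments", "dana"),
    ("payments", "dana"),
    ("search", "dana"),
    ("search", "lee"),
    ("auth", "sam"),
    ("auth", "lee"),
    ("auth", "dana"),
]


def resolvers_by_service(log):
    """Map each service to the set of people who have resolved its incidents."""
    resolvers = defaultdict(set)
    for service, engineer in log:
        resolvers[service].add(engineer)
    return dict(resolvers)


def bus_factor_one(log):
    """Return services whose incidents have only ever been resolved by one person."""
    return {
        service: next(iter(people))
        for service, people in resolvers_by_service(log).items()
        if len(people) == 1
    }


print(bus_factor_one(incidents))  # {'payments': 'dana'}
```

Any service that shows up in that result is one where a single person has handled every recorded incident, which makes it a good first candidate for a runbook.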
Why one is a dangerous number
When the knowledge that resolves incidents lives in one head, that person never gets a real break, because every serious incident routes back to them. They burn out. Burnt-out people leave. And when they go, the knowledge goes too, and the team finds out exactly how much it was relying on someone's memory.
The team also stops learning. Why would anyone else dig into the service nobody else understands when that one person will just handle it? The dependency gets deeper with every incident.
Get it out of the one head
The fix is not to value that engineer less. It is to move what they know into a place the whole team can reach. Write the runbook for the service only they understand. After the next incident they solve, have them explain not just what they did but how they knew to do it, because that part almost never gets written down.
It is slower than letting them keep saving the day. It is the only thing that raises the number above one.
Shared records lower the bus factor on their own
A lot of that engineer's value is simply that they remember what happened last time and what fixed it. If that history lived somewhere the team could read, much of the gap would close by itself. That is why we keep every incident in Vigiles as a durable record, with its timeline, resolution, and postmortem, so the next person to hit a similar failure can read what happened instead of hunting for the one human who was awake last time.
If every incident routes to one person, that person is a single point of failure with a pulse. Start free, or see how incident management works.