After an incident, MTTR is the wrong thing to brag about
Mean time to recovery is easy to measure and easy to game. What to track after an incident instead, so your postmortems actually change something.
Posts tagged
Product updates, incident management notes, and lessons from building Vigiles.
Alert fatigue is a trust problem, not a volume problem. Why false positives erode your team, and how confirming failures across locations fixes it.
When your service is down, silence is worse than the outage. How to communicate during an incident in a way that keeps customer trust.
In an outage, restarting the unhealthy box is often the fix. In a breach, it is how you destroy the evidence you needed. Why your outage instincts betray you in a security incident.
An outage wants you fast. A security incident wants you careful. Why the two need different responses, and the backbone they share.
Your first on-call shift is less about knowing everything and more about staying calm, reading before you touch, and knowing when to escalate.