Runbooks people actually use
Most runbooks are written once, never opened, and useless by the time you need them. What separates a runbook that helps mid-incident from one that just exists.
Product updates, incident management notes, and lessons from building Vigiles.
Severity levels exist to tell people how hard to run. When every incident is a P1, they tell people nothing. How to keep severity meaningful.
Most incidents that fall through the cracks fall through at the handoff between shifts. How to hand off on-call so the context travels with the pager.
Your app can be perfectly healthy and still be down because something it relies on failed. Why you should monitor your dependencies, not only yourself.
Blameless postmortems are not about going easy. They are about getting honest answers, because you cannot fix what people are afraid to admit.
Knowing something broke is half the job. Knowing it came back, and how long it was down, is the other half. Why recovery notifications matter.