The fishbone diagram for incident root cause analysis
Most postmortems go looking for the cause, singular, and stop the moment they find one. A deploy went out, it broke, we reverted it, done. The trouble is that real incidents almost never have one cause. They have a handful of smaller things that lined up at the wrong moment, and if you only name one of them, the others are still sitting there waiting for the next bad day.
The fishbone diagram is a tool for finding all of them. It is old, it is simple, and it is one of the more useful things you can put in front of a team during an incident review.
What is a fishbone diagram
A fishbone diagram is a visual way to map the causes of a problem, sorted into categories. It is also called an Ishikawa diagram, after Kaoru Ishikawa, the Japanese quality expert who developed it in the 1960s. Its other name, the cause-and-effect diagram, describes what it actually is.
The name comes from the shape. You write the problem on the right as the head of the fish. A horizontal line runs left from it, the spine. Off the spine you draw a few large bones, one per category of cause. Off each of those you add the specific causes that fall under it. When you finish, the whole thing looks like a fish skeleton, and you can see at a glance which categories are crowded with causes and which are empty.
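If it helps to see the shape as data, here is a minimal sketch in Python. The `Fishbone` class, its method names, and the example causes are all made up for illustration; they are not part of any tool. The head is a problem statement, and each bone is a category holding a list of causes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Fishbone:
    """Hypothetical model: a problem (the head) plus causes grouped by category (the bones)."""

    problem: str
    bones: Dict[str, List[str]] = field(default_factory=dict)

    def add_cause(self, category: str, cause: str) -> None:
        # Create the category bone on first use, then attach the cause to it.
        self.bones.setdefault(category, []).append(cause)

    def render(self) -> str:
        # Plain-text outline: the head first, then each category with its cause count.
        lines = [f"PROBLEM: {self.problem}"]
        for category, causes in self.bones.items():
            lines.append(f"  {category} ({len(causes)})")
            lines.extend(f"    - {cause}" for cause in causes)
        return "\n".join(lines)


# Example causes are invented for illustration only.
diagram = Fishbone("Checkout API returned errors for forty minutes")
diagram.add_cause("Process", "Runbook was stale")
diagram.add_cause("Technology", "Missing timeout on a downstream call")
diagram.add_cause("Technology", "No retry on the payment client")
print(diagram.render())
```

The count next to each category is the text version of seeing at a glance which bones are crowded.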
It started in manufacturing and is one of the seven basic quality tools, but it works just as well on a software incident.
Why it beats asking for a single root cause
The most common root cause method is the 5 Whys. You take the failure, ask why it happened, ask why again about the answer, and keep going until you hit something fundamental. It is quick, and it is good for an incident with a single clean chain of events.
Its weakness is right there in how it works. It follows one thread. It pushes you toward a single root cause, and most incidents do not have one. If your system has any redundancy at all, an outage usually requires several things to fail together, so naming one of them and stopping leaves the rest in place.
A fishbone diagram fixes that by going wide before it goes deep. Instead of one chain, you brainstorm causes across several categories at once. It forces the team to look in places a single line of questioning would skip, which is where the contributing factors you did not expect tend to hide.
The two tools work well together. Use the fishbone to map the breadth of what contributed. Then take the branches that matter most and run a 5 Whys down each one.
The categories to use for software incidents
The original manufacturing version uses the six Ms: Man, Machine, Method, Material, Measurement, and Mother Nature. Those do not map cleanly to a software outage, so most engineering teams adapt them. One set that works well:
People. Knowledge gaps, who was on call and how experienced, staffing, handoffs, unclear ownership.
Process. Deploy practices, change management, review steps, missing or stale runbooks, whether anyone followed them.
Technology. The code, the architecture, a missing timeout or retry, the design choices that let one failure spread.
Tooling. Monitoring, alerting, dashboards, the deploy pipeline, whether you could even see what was happening.
Environment. Infrastructure, third-party dependencies, traffic spikes, the things outside your own code.
Communication. How the incident was coordinated, who knew what, where information got stuck.
You do not have to use exactly these. The point of the categories is to make the team look in more than one direction, so pick the ones that fit how you work and keep them stable enough to compare incidents over time.
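One way to keep the categories stable is to fix them in one place and reject anything else. The sketch below assumes the six categories above and uses invented incident data; the `Category` enum, the `tally` function, and the incident IDs are hypothetical names, not an existing API.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    PEOPLE = "People"
    PROCESS = "Process"
    TECHNOLOGY = "Technology"
    TOOLING = "Tooling"
    ENVIRONMENT = "Environment"
    COMMUNICATION = "Communication"


def tally(incidents):
    """Count how many causes landed in each category across all incidents."""
    # Start every category at zero so empty ones still show up in the result.
    counts = Counter({category: 0 for category in Category})
    for causes in incidents.values():
        for label, _description in causes:
            # Category(label) raises ValueError for a label outside the fixed set.
            counts[Category(label)] += 1
    return counts


# Hypothetical incidents, invented for illustration only.
incidents = {
    "INC-1": [("Process", "Stale runbook"), ("Tooling", "Alert fired late")],
    "INC-2": [("Tooling", "No dashboard for queue depth"), ("People", "Unclear ownership")],
}

for category, count in tally(incidents).most_common():
    print(f"{category.value:<14}{count}")
```

Because every diagram draws from the same fixed set, a category that keeps showing up across incidents becomes visible, which is the comparison over time the categories are meant to support.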
How to run one in a postmortem
Start with a clear problem statement at the head. Not "the site went down" but something specific, like "the checkout API returned errors for forty minutes across all regions." A vague head produces a vague diagram.
Draw the spine and the category bones, then brainstorm. For each category, ask what in here contributed, and write every answer on a bone without arguing yet about which mattered most. The goal at this stage is breadth. Get everything on the board.
Once it is full, look at the shape. A category crowded with bones is telling you something. Pick the contributing factors that actually moved the outcome and drill into those with a few whys each, until you reach something you can change.
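Reading the shape can be as simple as counting. This is a rough sketch with invented diagram contents; `shape` is a hypothetical helper, and the count is only a prompt for discussion, since a crowded category is not automatically the one that moved the outcome.

```python
def shape(bones):
    """Return (category, cause_count) pairs, most crowded category first."""
    return sorted(
        ((category, len(causes)) for category, causes in bones.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical diagram contents, invented for illustration only.
bones = {
    "People": ["On-call engineer was new to the service"],
    "Process": [],
    "Technology": ["Missing timeout", "No retry", "Shared connection pool"],
    "Tooling": ["Alert threshold too high", "No dashboard for error rate"],
}

ranked = shape(bones)
most_crowded = ranked[0][0]
empty = [category for category, count in ranked if count == 0]

print("Most crowded:", most_crowded)
print("Empty:", empty)
```

An empty category deserves a second look too. It may mean nothing contributed there, or it may mean nobody thought to ask.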
Then turn the significant causes into actions, each with an owner and a date. A fishbone diagram that does not end in changes is just a drawing.
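A quick check that the diagram actually ended in changes might look like the sketch below. The `Action` fields, both helper functions, the team names, and the dates are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Action:
    cause: str                    # the cause on the diagram this action addresses
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def unassigned(actions: List[Action]) -> List[Action]:
    """Return actions that are missing an owner or a due date."""
    return [a for a in actions if not a.owner or a.due is None]


def uncovered(significant_causes: List[str], actions: List[Action]) -> List[str]:
    """Return significant causes that no action addresses."""
    covered = {a.cause for a in actions}
    return [c for c in significant_causes if c not in covered]


# Hypothetical causes and actions, invented for illustration only.
significant = ["Missing timeout", "Stale runbook", "Alert threshold too high"]
actions = [
    Action("Missing timeout", "Add a timeout to the payment client", "payments-team", date(2025, 1, 31)),
    Action("Stale runbook", "Rewrite the checkout runbook"),  # no owner, no date yet
]

print("Needs an owner or date:", [a.description for a in unassigned(actions)])
print("No action yet:", uncovered(significant, actions))
```

Anything either check returns is a branch of the diagram that is still just a drawing.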
Keep it blameless. When a cause lands in the People category, the answer is almost never that someone was careless. It is that the system let a normal human mistake turn into an outage. If a branch reads "engineer ran the wrong command," keep going, because the real cause is usually that the wrong command was that easy to run with nothing to catch it.
Where it helps and where it does not
A fishbone diagram is a thinking tool, not a magic answer. It is at its best when an incident clearly had several contributing factors and you want to make sure none of them get missed. It is overkill for a small, obvious, single-cause issue, where a couple of whys will do.
It also has a limit worth naming. In a large distributed system, the idea of a tidy set of causes can itself be misleading. Complex systems fail in messy, interacting ways that do not always sort into neat categories. The diagram is still useful there, as a way to structure the conversation, but treat it as a map for the discussion, not as proof you have found everything.
It only works if you have the facts
Every part of a fishbone diagram depends on knowing what actually happened. The timeline, the order of events, what fired when, what changed just before the failure. If your team is reconstructing all of that from memory and scattered chat logs two days later, the diagram fills up with guesses.
That is the part Vigiles handles. Every incident carries its own timeline, built automatically as events happen, so when you sit down to map causes you are working from a record instead of arguing about who remembers what. The better your facts, the better your fishbone, and the more likely the actions that come out of it actually prevent the next one.
A fishbone diagram turns a postmortem from "what was the cause" into "what were all of them." Vigiles gives you the timeline that makes the exercise honest.
Common questions
- What is a fishbone diagram?
  - A fishbone diagram, also called an Ishikawa or cause-and-effect diagram, is a visual tool that maps the causes of a problem sorted into categories. The problem sits at the head of the fish and the categories branch off the spine like bones.
- What is the difference between a fishbone diagram and the 5 Whys?
  - The 5 Whys follows a single chain of cause and effect and suits incidents with one clear cause. A fishbone diagram explores many categories of cause at once, which fits incidents with several contributing factors. The two work well together.
- What categories should a fishbone diagram use for software incidents?
  - Most engineering teams adapt the original manufacturing categories to People, Process, Technology, Tooling, Environment, and Communication, so the team looks for causes in more than one direction.