Pick the right SLO before you worry about the error budget
Plenty of teams adopt service level objectives the same way. Someone reads the Google SRE book, the team agrees that 99.99 percent sounds appropriately serious, it goes into a document, and three months later the error budget is a number nobody looks at and nobody enforces. The mechanics were copied. The thing that makes them work was left behind.
The part that matters is the target underneath, and most teams pick it last when they should pick it first.
The terms, briefly
A service level indicator, the SLI, is a measurement of something real. The share of requests that succeed. How long a key page takes to load. A service level objective, the SLO, is the target you hold that measurement to, like 99.9 percent of requests succeeding over thirty days. A service level agreement, the SLA, is a promise you make to a customer in a contract, usually looser than your internal SLO because it carries penalties.
The error budget falls out of the SLO. If your objective allows one request in a thousand to fail across a month, that 0.1 percent is a budget. You spend it on failures, on risky deploys, on experiments, and when it runs low you are meant to slow down and stabilize.
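As a rough sketch of that arithmetic, assuming a request-based SLO over a single window, the budget and the share of it still unspent could be computed as below. The function names `error_budget` and `budget_remaining` are hypothetical, chosen for illustration only.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests allowed to fail in the window under the SLO."""
    # round() absorbs floating point noise in (1 - slo_target).
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A window with no allowed failures has nothing left to spend.
        return 0.0
    return 1 - failed_requests / budget
```

With a 99.9 percent objective and one million requests in the window, the budget is 1,000 failed requests, and 250 failures leave three quarters of it unspent.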
That is the whole idea, and it is a good one. It turns reliability from an argument into a number both sides can point at.
The SLO has to come from what users feel
This is where it goes wrong. The objective gets chosen because it sounds right, or because a larger company published it, rather than because it describes a moment your users care about.
Start the other way around. Find one thing a user feels when it breaks. A request that fails. A checkout that times out. A search that returns nothing. Turn that into an SLI you can measure, then set the objective at a level you can defend and sustain. An SLO you chose because you can hold it is worth more than a grander one you breach every week.
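One way to turn "a request that fails" into a measurable SLI is sketched below. It assumes you already have one record per request; the `Request` type, its fields, and the rule that 5xx responses and timeouts count as failures are illustrative assumptions, not a prescribed definition.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int       # HTTP status code returned to the user (assumed field)
    timed_out: bool   # True if the user gave up waiting (assumed field)


def availability_sli(requests: list[Request]) -> float:
    """Share of requests the user experienced as successful."""
    if not requests:
        raise ValueError("no requests in the window")
    # Assumption: server errors (5xx) and timeouts are what users feel as failure.
    good = sum(1 for r in requests if r.status < 500 and not r.timed_out)
    return good / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli >= slo_target
```

What counts as "good" is the real design decision here. The definition should match the moment the user notices, not whatever is easiest to count.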
Why copying a big number backfires
99.99 percent reads better than 99.9, so teams reach for it. The difference is not cosmetic. Counted as downtime over a thirty-day month, 99.9 percent is roughly forty-three minutes of error budget. 99.99 percent is about four. A small team running real software will spend four minutes on a single bad deploy, so the budget sits permanently in the red, and a budget that is always gone stops meaning anything. People learn to ignore it, which is the exact outcome the budget was meant to prevent.
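The minutes quoted above follow directly from the target and the window length. A minimal sketch, assuming a time-based budget over a thirty-day window (the function name is hypothetical):

```python
def downtime_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in thirty days
    return (1 - slo_target) * window_minutes
```

For thirty days this gives about 43.2 minutes at 99.9 percent and about 4.3 minutes at 99.99 percent, so each extra nine cuts the budget by a factor of ten.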
Pick a target you can live inside. You can tighten it later, when the system and the team can hold the tighter one. We wrote more about the price of each additional nine in "What five nines actually costs."
The budget is a conversation, not a chart
The value of an error budget is not the burn-down graph. It is the decision it forces. Budget left this month, ship the risky thing. Budget gone, the next stretch is about stability, and that is not a punishment, it is the agreement working as designed. When product and engineering both accept the budget in advance, the awkward question of whether now is a good time to deploy answers itself.
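Agreeing on the budget in advance amounts to writing the rule down before anyone needs it. A deliberately small sketch of such a rule follows; the function name and the `min_remaining` threshold are hypothetical, and a real policy would be whatever product and engineering actually agreed to.

```python
def next_step(budget_remaining: float, min_remaining: float = 0.0) -> str:
    """Return 'ship' or 'stabilize' from the share of error budget left.

    budget_remaining is a fraction: 1.0 is untouched, 0.0 or below is spent.
    min_remaining is an assumed, team-chosen floor below which risky work pauses.
    """
    if budget_remaining > min_remaining:
        return "ship"
    return "stabilize"
```

The point of encoding it is not automation. It is that the answer to "is now a good time to deploy" was decided before the question came up.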
None of this works without measurement you trust. You cannot hold an SLO you cannot see, which makes reliable uptime and latency data the floor the whole practice stands on. Get the measurement right first, pick one SLI that maps to real pain, set a target you can keep, and the error budget becomes useful instead of decorative.
If you want the data layer an SLO depends on, Vigiles monitors your endpoints from across the region and keeps the history you measure against. Start free.
Common questions
- What is the difference between an SLI, an SLO, and an SLA?
- An SLI is the measurement, such as the share of requests served successfully. An SLO is the target you hold that measurement to, such as 99.9 percent over thirty days. An SLA is a contractual promise to a customer, usually looser than the internal SLO, with penalties attached.
- What is an error budget?
- An error budget is the amount of unreliability your SLO allows. If the objective is 99.9 percent over a month, the remaining 0.1 percent is the budget you can spend on failures, risky deploys, and experiments before you are expected to slow down and stabilize.
- How do I choose a good SLO?
- Start from one thing users actually feel, like whether a request succeeds or how long a key page takes, measure it as an SLI, and set the target at a level you can defend and sustain. Avoid copying a number from a much larger company, since their traffic and staffing support targets you cannot hold.