Your uptime check passed while the service was down
The dashboard was green. Every monitor a calm shade of up. And the message on my screen said customers could not complete a payment.
Both things were true at the same time. The check was hitting our homepage, getting a 200, and marking the service healthy. The homepage was fine. The part customers actually needed, the payment call, had been failing for twenty minutes, and nothing we were watching had noticed.
Up is a number, working is a different question
A basic uptime check asks one small thing: did the page respond? It does not ask whether the response was correct, or whether the system behind it still works.
That gap is where partial outages live. The front door opens, so the building looks fine, while a room at the back is on fire. A homepage can return 200 while the login service is down. A cached error page returns 200. A page can load perfectly and the button on it can call a dead API. In every one of those cases a simple check sees green and says nothing.
The worst version is the one I had that day. The thing customers came to do was broken, and the only signal that something was wrong came from customers themselves, which is the slowest and most expensive alarm you can own.
Watch the path that matters, not just the front door
The fix is to stop monitoring the easy thing and start monitoring the important one.
Point checks at the endpoints customers actually depend on. The login route. The checkout or payment API. A health endpoint that touches the database instead of returning a static OK. If the payment path is what loses you money when it breaks, that is the thing that deserves a monitor, not the marketing page.
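As a rough illustration of a health endpoint that touches the database, here is a minimal sketch of the logic behind one. The function name deep_health is hypothetical, and sqlite3 stands in for whatever database your service really uses. The point is only that the handler runs a real query before it reports OK.

```python
import sqlite3
from typing import Tuple


def deep_health(conn: sqlite3.Connection) -> Tuple[int, dict]:
    """Return (http_status, body) for a health endpoint.

    Hypothetical sketch: sqlite3 is a stand-in for the real database.
    """
    try:
        # A trivial query still proves the connection is usable right now.
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error as exc:
        # Report unhealthy instead of a static OK when the database fails.
        return 503, {"status": "error", "database": str(exc)}
    return 200, {"status": "ok", "database": "reachable"}
```

Wire a function like this into whatever web framework you already run, and point the monitor at that route rather than at the marketing page.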
Then check more than the status code. A 200 with the wrong body is still a failure. Assert that the response contains what it should (an expected field, a known string, the shape of a real answer) so that a page returning the wrong content counts as down instead of up. That turns "it responded" into "it responded correctly".
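A content assertion can be very small. The sketch below assumes the endpoint returns a JSON object. The function name evaluate_response and the field names payment_id and status are made up for illustration, not taken from any real API.

```python
import json
from typing import Iterable, Tuple


def evaluate_response(
    status: int, body: str, required_fields: Iterable[str]
) -> Tuple[bool, str]:
    """Return (healthy, reason) for a single HTTP response."""
    if status != 200:
        return False, f"unexpected status {status}"
    try:
        payload = json.loads(body)
    except ValueError:
        # A cached HTML error page served with a 200 ends up here.
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    missing = [name for name in required_fields if name not in payload]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    return True, "ok"


# Hypothetical usage: field names are assumptions, not a real payment API.
healthy, reason = evaluate_response(
    200, "<html>Something went wrong</html>", ["payment_id", "status"]
)
print(healthy, reason)
```

With a check like this, the cached error page from earlier counts as down even though the server answered 200.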
And run those checks often and from more than one place, so you catch the failure within a minute rather than learning about it from a support ticket. A failure confirmed from several locations is a real outage, not a network blip between one probe and your server.
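One simple way to express "confirmed from several locations" is a failure threshold across probes. This is a sketch under assumptions: the function name confirmed_down, the location labels, and the default threshold of two are all illustrative choices, not how any particular monitoring product does it.

```python
from typing import Mapping


def confirmed_down(results: Mapping[str, bool], min_failures: int = 2) -> bool:
    """Decide whether to alert.

    results maps a probe location to True if the check passed there.
    The endpoint counts as down only when at least min_failures
    locations saw a failure in the same round.
    """
    failures = sum(1 for passed in results.values() if not passed)
    return failures >= min_failures


# Hypothetical probe locations.
print(confirmed_down({"frankfurt": False, "virginia": False, "singapore": True}))
print(confirmed_down({"frankfurt": False, "virginia": True, "singapore": True}))
```

The trade-off is the usual one: a higher threshold means fewer false alarms from a single flaky probe, at the cost of missing an outage that affects only one region.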
Define down as what your customers feel
Pick the handful of actions that define your product working. Someone can sign in. Someone can buy. Search returns results. Monitor those directly, check the content, and you close the gap between a green dashboard and a working service.
On the day I remember, the dashboard was never wrong about what it measured. The homepage really was up. The dashboard was simply measuring the wrong thing. Once we pointed checks at the payment path and looked at the body of the response, the next time that API failed we knew in under a minute, not after the first angry email.
Up tells you the server answered. Working tells you the customer got what they came for. Monitor the second one.
If your checks only watch the front page, a broken checkout can run for twenty minutes before anyone notices. Vigiles monitors the endpoints that matter, checks the response content, and confirms failures across locations. Start free, or see where we monitor from.