DNS fails silently, and you find out last
DNS is the failure that does not look like one. Your servers are up. Your application is healthy. Your usual checks may even pass. And yet a chunk of your users cannot reach you, because the names that point at your service stopped resolving for them.
It is the quietest kind of outage. Nothing in your infrastructure is on fire. The problem sits in a layer most monitoring ignores, and you usually learn about it from an annoyed customer, not a dashboard.
Why it slips through
A lot of monitoring assumes that if the server responds, everything is fine. But users do not connect to your server; they connect to a name that has to resolve to your server first. If a record was changed incorrectly, a change was slow to propagate, or a nameserver started failing, resolution breaks while the box behind it stays perfectly healthy. Your server-side checks see green the entire time.
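To make the gap concrete, here is a minimal sketch of a check that tests the name rather than the server. It uses only Python's standard library. The function names, the hostname, and the addresses (taken from the documentation IP ranges) are illustrative assumptions, not part of any particular product.

```python
import socket
from typing import Callable, Set, Tuple


def system_resolve(hostname: str) -> Set[str]:
    """Resolve a name through the operating system's resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return {info[4][0] for info in infos}


def check_resolution(
    hostname: str,
    expected: Set[str],
    resolve: Callable[[str], Set[str]] = system_resolve,
) -> Tuple[bool, str]:
    """Return (ok, message) for a name, independent of server health.

    `expected` is the set of addresses the name should point at
    (hypothetical values here). Pass an empty set to check only
    that the name resolves at all.
    """
    try:
        found = resolve(hostname)
    except socket.gaierror as exc:
        return False, f"{hostname} did not resolve: {exc}"
    if not found:
        return False, f"{hostname} resolved to no addresses"
    unexpected = found - expected
    if expected and unexpected:
        return False, f"{hostname} resolved to unexpected addresses: {sorted(unexpected)}"
    return True, f"{hostname} resolved to {sorted(found)}"


# Example call (hypothetical name and address):
# ok, message = check_resolution("app.example.com", {"203.0.113.10"})
```

A server-side health check would pass in every failing case above, because none of them involve the server at all. Note that this sketch asks whichever resolver the local machine is configured to use, so it only sees one view of the name.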
It is uneven, which makes it worse
DNS problems rarely hit everyone at once. They hit some resolvers and some regions while others keep working, so half your team says the site is fine and half says it is down, and you lose the first twenty minutes arguing about whether there is even an incident.
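One way to cut that argument short is to treat "works for some, fails for others" as its own state instead of forcing a yes-or-no answer. The sketch below assumes you already have one pass/fail result per vantage point; the vantage-point labels and the `classify` function are hypothetical names used for illustration.

```python
from typing import Dict, List, Tuple


def classify(results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Summarize per-vantage-point DNS results.

    `results` maps a vantage-point label (a region or resolver name)
    to True if the name resolved there, False otherwise.
    Returns (state, failing_points) where state is one of
    "ok", "partial", or "down".
    """
    if not results:
        raise ValueError("no results to classify")
    failing = sorted(name for name, ok in results.items() if not ok)
    if not failing:
        return "ok", failing
    if len(failing) == len(results):
        return "down", failing
    # Some points succeed and some fail: the uneven case that
    # a single check from a single place would miss half the time.
    return "partial", failing


# Example with made-up vantage points:
# classify({"eu-west": True, "us-east": False, "ap-south": True})
# returns ("partial", ["us-east"])
```

Reporting "partial" together with the list of failing points answers the question of whether there is an incident, and for whom, without anyone having to compare notes.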
Check resolution itself
The fix is to monitor DNS resolution directly, from more than one place, as its own check. Vigiles runs DNS checks from multiple locations, so a resolution failure shows up as a failure, not as a mystery you piece together from support tickets an hour later.
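As a rough illustration of what checking resolution directly can look like, the sketch below sends a plain DNS A-record query over UDP to a specific resolver and reports whether it came back with answers. It is not how Vigiles implements its checks, which the post does not describe. The function names are hypothetical, and querying several public resolvers from one machine only approximates checking from several locations.

```python
import random
import socket
import struct
from typing import Tuple


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a DNS query for an A record (type 1, class IN)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)


def parse_header(response: bytes) -> Tuple[int, int, int]:
    """Return (query_id, rcode, answer_count) from a DNS response."""
    if len(response) < 12:
        raise ValueError("DNS response shorter than its header")
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return query_id, flags & 0x000F, ancount


def resolves_via(hostname: str, resolver_ip: str, timeout: float = 3.0) -> bool:
    """True if the given resolver returns at least one answer record."""
    query_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname, query_id), (resolver_ip, 53))
            response, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and network errors both count as a failed check.
            return False
    try:
        response_id, rcode, ancount = parse_header(response)
    except ValueError:
        return False
    # rcode 0 means no error; a nonzero rcode covers NXDOMAIN, SERVFAIL, etc.
    return response_id == query_id and rcode == 0 and ancount > 0


# Example against three well-known public resolvers:
# for ip in ("8.8.8.8", "1.1.1.1", "9.9.9.9"):
#     print(ip, resolves_via("example.com", ip))
```

This sketch only confirms that some answer came back. It does not inspect the returned addresses, retry over TCP, or handle IPv6, so treat it as a starting point for understanding the check rather than a replacement for monitoring from genuinely separate locations.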
If your monitoring only watches the server, it cannot see the layer that decides whether anyone reaches it. Start free, or see where we monitor from.