When "all green" lies · James Gault

A support ticket taught me that a monitoring dashboard full of green checkmarks can sit calmly next to a customer who has been unable to complete a purchase for three days. The uptime graph was fine. The synthetic probe was fine. The checkout flow was, according to every instrument we had, working.

The customer was not imagining the problem. A rule had fired silently — somewhere between the CDN, the WAF, and a rate limiter — and the customer's particular combination of geography and header signature tripped it. Every time. Meanwhile our monitors hit the site from the same origin, with the same user agent, and got the same 200 OK they always got.

Why monitoring has blind spots by design

Monitors observe from a vantage point. That vantage point is never the same as a real customer's. It's usually privileged — a known IP, a generic user agent, a profile that's been implicitly whitelisted across the half-dozen security layers a modern site has accreted.

The failures that matter most aren't "the site is down for everyone." Those get noticed. The failures that eat at revenue are:

Rate limits that fire for a class of visitor nobody tested.
Geo-IP rules that misclassify legitimate traffic from a particular country as bot traffic.
WAF rules that were tightened after an incident and never re-reviewed for collateral damage.
Feature flags that roll out cleanly for internal accounts but misbehave for a demographic that isn't in the rollout sample.
TLS or cookie quirks that only surface in older browsers — the ones your less-technical customers are still using.

Each of these is invisible to "all green" because "all green" is reporting on a profile that doesn't experience them.

What "all green" actually means

"All green" means: the one specific visitor the monitor simulates can reach the site. That's the claim. Nothing more.

It is not the claim people read off the dashboard. People read it as "the site is working." And when you're running operations, you eventually make decisions on that belief — triage priorities, incident escalations, postmortem scope. The gap between what the dashboard says and what it actually proves is where the expensive failures live.

What to do instead

Vary the vantage point. The specific mechanism is less important than the principle: probe the same endpoints from multiple profiles, varying user agents, headers, geographies, cookie states, and request rates. The goal isn't to catch every discrepancy. The goal is to notice that discrepancies exist in the first place.
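As a rough illustration of probing one endpoint from several profiles, here is a minimal Python sketch using only the standard library. Everything named in it is an assumption made for the example: the Profile class, the PROFILES list, the header values, and the helper names http_status and probe_all are hypothetical, not part of any particular monitoring tool. Headers alone can't vary geography; for that you would run the same script from hosts in different regions.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass, field


@dataclass
class Profile:
    """One simulated visitor. Names and header values are illustrative."""
    name: str
    headers: dict = field(default_factory=dict)


# Hypothetical profiles; replace with ones that resemble your real customers.
PROFILES = [
    Profile("monitor-default", {"User-Agent": "uptime-probe/1.0"}),
    Profile("desktop-browser", {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US",
    }),
    Profile("older-mobile", {
        "User-Agent": "Mozilla/5.0 (Linux; Android 8.0.0)",
        "Accept-Language": "pt-BR",
    }),
    Profile("returning-customer", {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Cookie": "session=placeholder",
    }),
]


def http_status(url, headers, timeout=10.0):
    """GET the URL and return its status code, including 4xx/5xx responses.

    Connection-level failures (DNS, TLS, timeouts) are not caught here and
    will raise, so the caller can decide how to record them.
    """
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; for this purpose the code is the result.
        return err.code


def probe_all(url, profiles=PROFILES, fetch=http_status):
    """Probe one URL once per profile and return {profile name: status}."""
    return {profile.name: fetch(url, profile.headers) for profile in profiles}
```

Calling probe_all("https://shop.example/checkout") (a placeholder URL) would return one status per profile. The fetch parameter exists so the probing logic can be exercised with a stub instead of real network calls.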

When a profile starts returning something different from its peers, that's a signal. Maybe an expected one ("of course the bot profile gets 403'd"), maybe not ("why is this one region getting throttled at a rate none of the others are?"). The discrepancy is the alert. The dashboard can't tell you which is which, but it can tell you where to look.
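One simple way to turn "different from its peers" into something a script can check is sketched below, assuming the input is a plain mapping of profile name to status code. The function name find_outliers, the expected mapping, and the profile names are all hypothetical. The baseline here is just the most common status among the profiles with no declared expectation, which is a deliberate simplification: on a tie, whichever status appeared first wins.

```python
from collections import Counter


def find_outliers(results, expected=None):
    """Return {profile: status} for profiles whose result needs a look.

    results:  maps profile name -> observed status code.
    expected: maps profile name -> status that is known and accepted for it
              (for example a bot profile that should be blocked with 403).
    """
    expected = expected or {}

    # Profiles with a declared expectation are judged against it alone.
    outliers = {
        name: status
        for name, status in results.items()
        if name in expected and status != expected[name]
    }

    # Everyone else is compared with the most common status among peers.
    peers = {n: s for n, s in results.items() if n not in expected}
    if peers:
        baseline, _count = Counter(peers.values()).most_common(1)[0]
        outliers.update({n: s for n, s in peers.items() if s != baseline})

    return outliers


observed = {
    "monitor-default": 200,
    "desktop-browser": 200,
    "eu-residential": 429,
    "known-bot": 403,
}
print(find_outliers(observed, expected={"known-bot": 403}))
# {'eu-residential': 429}
```

The known-bot profile is blocked as intended, so it stays quiet. The throttled profile is the only one reported, which is the "where to look" the dashboard can give you.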

This doesn't replace your uptime monitors. It sits next to them. Uptime monitors answer "is the site reachable?" Profile-varied monitors answer the question you actually wanted answered: "is the site reachable for my customers?"

The harder lesson

You can't fix what you can't see. Most of the silent failures I've run into in shipping web products were invisible not because the instrumentation was bad, but because it was narrow: it measured one class of visitor, one code path, one moment, and then reported as confidently as if it had covered them all.

If your monitoring has never caught a surprise, it might be because it can't. Worth checking before the next support ticket is the one that tells you.
