How We Measure Email-Validation Accuracy (Honestly)

Every email validation service claims to be accurate. You have seen the headline number on the pricing page: some figure in the high nineties, followed by a percent sign and very little else. It is meant to end the conversation. In practice it should start one, because a bare accuracy percentage is one of the least falsifiable claims in this industry. Accurate at what? Measured against what? On whose data? Without answers to those questions, the number is decoration.

This is an essay about method rather than marketing. We want to explain how we think about email verification accuracy, what we actually measure, and why we would rather tell you we are not sure about an address than guess and be wrong. If you are evaluating any validation tool, including ours, these are the questions worth asking.

Why "99% accurate" means almost nothing

An accuracy claim is only meaningful if you can state three things: the prediction being graded, the ground truth it is graded against, and the population it was measured on. Most headline numbers leave all three unstated, which makes the claim impossible to check and impossible to disprove.

Consider the gaps a single percentage hides:

Accurate at what task? Catching obvious syntax errors is easy and almost everyone does it well. Correctly judging a live mailbox behind a catch-all domain is hard. A score that averages the easy cases with the hard ones tells you nothing about the cases you actually paid to resolve.
Measured against what truth? The only honest ground truth for "is this address deliverable" is whether mail sent to it was actually delivered or bounced. If a vendor grades its predictions against its own earlier predictions, the number is circular and self-congratulatory.
On which population? Accuracy measured on a clean, well-formed sample will look far better than accuracy on the messy, typo-ridden, half-abandoned lists that real customers upload. The interesting number is the one on real-world input.

When a number cannot be tied to a task, a truth source, and a population, it is not a measurement. It is a slogan.
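To make the point concrete, here is a small sketch in Python of how a blended figure can hide a weak segment. The segment labels and the sample counts are invented for illustration; they are not measurements from any real engine.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """records: iterable of (segment, predicted, actual) tuples.
    Returns (blended_accuracy, {segment: accuracy})."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, predicted, actual in records:
        totals[segment] += 1
        if predicted == actual:
            correct[segment] += 1
    if not totals:
        raise ValueError("no records to grade")
    per_segment = {s: correct[s] / totals[s] for s in totals}
    blended = sum(correct.values()) / sum(totals.values())
    return blended, per_segment

# Invented sample: 90 easy syntax cases, all graded correctly,
# and 10 hard catch-all cases, of which only 5 are graded correctly.
sample = (
    [("syntax", "invalid", "invalid")] * 90
    + [("catch_all", "valid", "valid")] * 5
    + [("catch_all", "valid", "invalid")] * 5
)
blended, per_segment = accuracy_by_segment(sample)
print(blended, per_segment)
```

On this invented sample the blended figure is 95 percent while the catch-all segment sits at 50 percent, which is the kind of gap a single headline number conceals.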

The two numbers that actually matter

We think about validation quality as a trade-off between two distinct measurements. Collapsing them into one figure is exactly how the meaningful detail gets lost.

Decisive accuracy

Of the addresses we give you a confident verdict on — a clear valid or invalid — how many of those verdicts turn out to be right? This is decisive accuracy. It only counts the calls we were willing to stand behind. It deliberately excludes the addresses we flagged as unknown, catch_all, or risky, because on those we did not make a confident claim and it would be dishonest to grade ourselves as if we had.

Coverage

Of every address you submit, what share do we resolve to a confident verdict at all? This is coverage. An engine that answered unknown for everything except the most obvious cases could post perfect decisive accuracy and useless coverage. An engine that forced a confident answer for everything would have total coverage and untrustworthy accuracy. Neither extreme is a product.
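The two definitions can be written down in a few lines. This is a minimal sketch, assuming each graded record is a pair of the engine's verdict and the observed outcome; the function name and the record shape are ours for illustration, not a description of a production pipeline.

```python
DECISIVE = {"valid", "invalid"}

def score(results):
    """results: list of (verdict, outcome) pairs.
    verdict is one of valid / invalid / unknown / catch_all / risky.
    outcome is the observed truth: "valid" (delivered) or "invalid" (hard bounce).
    Returns (decisive_accuracy, coverage)."""
    if not results:
        raise ValueError("no results to score")
    decisive = [(v, o) for v, o in results if v in DECISIVE]
    coverage = len(decisive) / len(results)
    if decisive:
        decisive_accuracy = sum(v == o for v, o in decisive) / len(decisive)
    else:
        decisive_accuracy = None  # no confident calls, so nothing to grade
    return decisive_accuracy, coverage

# Invented sample: 8 confident calls (7 right, 1 wrong) and 2 abstentions.
results = (
    [("valid", "valid")] * 6
    + [("valid", "invalid")]
    + [("invalid", "invalid")]
    + [("unknown", "valid"), ("catch_all", "invalid")]
)
decisive_accuracy, coverage = score(results)
print(decisive_accuracy, coverage)
```

Abstentions lower coverage but never enter the accuracy denominator, which is why the two figures have to be reported side by side.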

The trade-off

These two numbers pull against each other, and that tension is the whole game. You can always raise coverage by guessing on the hard cases — but every guess you are not sure about drags decisive accuracy down. You can always raise decisive accuracy by abstaining whenever you are unsure — but that lowers coverage. Anyone quoting one number without the other is hiding which lever they pulled.

If you optimize only for…	You can score…	But you hide…
Coverage	A high "we have an answer for everything" rate	That many of those answers are guesses
Decisive accuracy	A near-perfect "our confident calls are right" rate	That you abstained on most of the hard work
A single blended number	An impressive headline	Which of the two you actually traded away
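One way to see the lever being pulled is to sweep a confidence threshold over scored addresses. The sketch below assumes a hypothetical engine that emits a probability of deliverability per address; that field, the threshold rule, and the sample values are all assumptions made for illustration.

```python
def sweep(scored, thresholds):
    """scored: list of (p_valid, delivered) pairs, where p_valid is a
    hypothetical estimated probability that the address is deliverable.
    For each threshold t (0.5 <= t <= 1) the rule is:
      p_valid >= t      -> confident "valid"
      p_valid <= 1 - t  -> confident "invalid"
      otherwise         -> abstain
    Returns a list of (t, decisive_accuracy, coverage)."""
    if not scored:
        raise ValueError("no scored addresses")
    rows = []
    for t in thresholds:
        decided = 0
        correct = 0
        for p_valid, delivered in scored:
            if p_valid >= t:
                decided += 1
                correct += int(delivered)
            elif p_valid <= 1 - t:
                decided += 1
                correct += int(not delivered)
            # otherwise abstain: this only reduces coverage
        accuracy = correct / decided if decided else None
        rows.append((t, accuracy, decided / len(scored)))
    return rows

# Invented sample of ten scored addresses.
scored = [
    (0.99, True), (0.97, True), (0.90, True), (0.80, False), (0.65, True),
    (0.55, False), (0.40, True), (0.30, False), (0.08, False), (0.02, False),
]
rows = sweep(scored, [0.5, 0.75, 0.9])
for t, accuracy, coverage in rows:
    print(t, accuracy, coverage)
```

On this sample, forcing an answer for everything (threshold 0.5) gives full coverage at 70 percent accuracy, while the strictest threshold gets every confident call right but only decides half the list.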

Our position is that both numbers should be visible and improved together over time, not that one should be sacrificed quietly to flatter the other.

Not all errors cost the same

There is a second principle underneath the math: the two ways of being wrong are not equally bad. This asymmetry shapes how the engine behaves.

A false invalid, telling you a good address is dead, costs you one lost contact. That is unfortunate, but it is recoverable and contained. A false valid, rubber-stamping a dead address as safe to send, is far worse. You send to it, it produces a hard bounce, and that bounce is seen by the mailbox providers who decide whether your future mail reaches the inbox. Enough of them and your sender reputation drops, which quietly suppresses deliverability for the good addresses too. A handful of bad "valid" verdicts can poison the well for an entire send.

An honest abstention sits between the two and costs the least. When we tell you an address is unknown, you keep your options: hold it back, send to it cautiously, or warm it slowly. You are informed, not misled. Because the downside of a false valid is so much steeper than the downside of saying "we are not sure," the engine is tuned to abstain rather than guess whenever the evidence is thin.
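The asymmetry can be expressed as an expected-cost rule: pick whichever verdict has the lowest expected cost given how likely the address is to be deliverable. The cost values below are invented to illustrate the shape of the rule and are not tuned figures from any engine.

```python
def choose_verdict(p_valid,
                   cost_false_valid=20.0,   # assumed: a hard bounce hurts the whole send
                   cost_false_invalid=1.0,  # assumed: one lost contact
                   cost_abstain=0.2):       # assumed: small cost of an honest "unknown"
    """Return the verdict with the lowest expected cost.
    p_valid is a hypothetical probability that the address is deliverable."""
    expected = {
        "valid": (1 - p_valid) * cost_false_valid,
        "invalid": p_valid * cost_false_invalid,
        "unknown": cost_abstain,
    }
    return min(expected, key=expected.get)

for p in (0.995, 0.9, 0.5, 0.1):
    print(p, choose_verdict(p))
```

With these illustrative costs, an address needs better than a 99 percent chance of being deliverable before "valid" beats abstaining, while "invalid" wins below roughly 20 percent. The steep penalty on a false valid is what pushes the middle of the range toward "unknown".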

This is why, when an SMTP probe comes back ambiguous — the mailbox sits behind a catch-all, the server is greylisting us, we are being rate-limited, or outbound port 25 is blocked — the engine returns unknown or catch_all with a low confidence score instead of upgrading the address to valid. A confidence number that does not drop when the evidence is weak is not a confidence number at all.
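A simplified sketch of that behavior follows. The ProbeResult type, its field names, and the confidence values are hypothetical; the only external fact relied on is the standard SMTP convention that 2xx replies mean success, 4xx transient failure, and 5xx permanent failure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ProbeResult:
    connected: bool                    # False if outbound port 25 was unreachable
    rcpt_code: Optional[int]           # SMTP reply code to RCPT TO, if one was received
    domain_is_catch_all: bool = False  # domain accepts mail for any local part

def verdict_from_probe(probe: ProbeResult) -> Tuple[str, float]:
    """Map a probe result to (verdict, confidence). Confidence values are illustrative."""
    if not probe.connected or probe.rcpt_code is None:
        return ("unknown", 0.2)        # blocked or no answer: no evidence either way
    code = probe.rcpt_code
    if 400 <= code < 500:
        return ("unknown", 0.3)        # transient: greylisting or rate limiting
    if 200 <= code < 300:
        if probe.domain_is_catch_all:
            return ("catch_all", 0.4)  # acceptance proves nothing about this mailbox
        return ("valid", 0.95)
    if 500 <= code < 600:
        return ("invalid", 0.95)       # simplification: treats every 5xx as "no such mailbox"
    return ("unknown", 0.2)

print(verdict_from_probe(ProbeResult(connected=True, rcpt_code=451)))
```

A real engine would need to look further than the three-digit code, since a 5xx reply can also be a policy rejection of the prober rather than proof that the mailbox does not exist.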

Grading against real delivery, not against ourselves

If decisive accuracy is the number that matters, then the ground truth we grade it against has to be real. The only ground truth that counts for deliverability is what happened when mail was actually sent: did it deliver, did it bounce, did it draw a complaint?

This is where the crowdsourced reputation network earns its place. When customers connect their email service provider webhooks — from providers like Mailgun, SendGrid, Postmark, and Amazon SES — the real outcomes of their sends flow back to us. A delivery, a bounce, a complaint: each is a fact about the world, not a prediction. Those facts are how we check our predictions. Addresses are stored only as one-way cryptographic hashes; the raw address is never kept, so the network grows as a body of outcomes rather than a list of people.
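As an illustration of storing outcomes against a hash rather than an address, here is a minimal sketch. The choice of HMAC-SHA-256, the secret key, and the normalization step (trimming and lowercasing) are assumptions for the example, not a statement of the scheme actually in use.

```python
import hashlib
import hmac

def address_key(address: str, secret: bytes) -> str:
    """Derive a one-way lookup key for an address.
    Assumed normalization: strip whitespace and lowercase the whole address."""
    normalized = address.strip().lower()
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

secret = b"example-secret-not-for-production"  # hypothetical key
outcomes = {}                                  # key -> list of observed outcomes

def record_outcome(address: str, outcome: str) -> None:
    # Only the derived key is stored; the raw address is discarded.
    outcomes.setdefault(address_key(address, secret), []).append(outcome)

record_outcome(" User@Example.com ", "delivered")
record_outcome("user@example.com", "bounced")
print(outcomes)
```

A keyed hash is used in this sketch because a plain unsalted hash of an email address can be reversed by hashing a list of candidate addresses and comparing.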

Suppressing an address's own reputation when grading

There is a trap here, and avoiding it matters more than any single metric. Reputation data is one of the inputs the engine uses to reach a verdict. If we then grade that verdict against the very same reputation data, we are grading the engine against its own input. It would score beautifully and prove nothing: a closed loop congratulating itself.

So when we measure ourselves, we suppress an address's own reputation history before asking the engine to judge it, and only afterward compare its verdict to the real delivery outcome that was held out. We are asking the harder, fairer question: given everything except the answer, did the engine get it right? That is the only version of the question worth scoring, and it is the version that keeps us honest about what the pipeline can do on an address it has never seen.
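The held-out procedure can be sketched as follows. The engine signature, the dictionary shapes, and the deliberately leaky toy engine are hypothetical; plain addresses are used as keys only to keep the example readable.

```python
def grade_held_out(engine, reputation, outcomes):
    """For each address with a known outcome, hide that address's own
    reputation entry, ask the engine for a verdict, and pair the verdict
    with the held-out outcome. Other addresses' history stays visible."""
    graded = []
    for address, outcome in outcomes.items():
        masked = {k: v for k, v in reputation.items() if k != address}
        verdict = engine(address, masked)
        graded.append((address, verdict, outcome))
    return graded

def leaky_engine(address, reputation):
    # Toy engine that simply echoes stored reputation when it has any.
    return reputation.get(address, "unknown")

reputation = {"a@example.com": "valid", "b@example.com": "invalid"}
outcomes = {"a@example.com": "valid", "b@example.com": "invalid"}

# Without masking, the toy engine "predicts" every outcome perfectly.
unmasked = [(a, leaky_engine(a, reputation), o) for a, o in outcomes.items()]
# With masking, it has nothing to echo and has to abstain.
held_out = grade_held_out(leaky_engine, reputation, outcomes)
print(unmasked)
print(held_out)
```

The toy engine looks flawless when it can see its own answer and has nothing to say once the answer is hidden, which is exactly the circularity the held-out comparison is meant to expose.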

Why we return "unknown" on purpose

Put these principles together and a clear behavior falls out. When the cheaper, deterministic layers settle an address — broken syntax, a disposable domain, a missing MX record, a known bad reputation — we answer with confidence, because those signals are decisive. When the evidence runs out, we say so rather than inventing certainty.
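A rough sketch of that layering is below. The function name, the loose syntax pattern, the injected has_mx lookup, and the choice to label disposable domains as invalid are all assumptions made for illustration; the real layers and their verdict labels may differ.

```python
import re

# Deliberately loose pattern: something@something.tld with no spaces or extra "@".
SYNTAX = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def cheap_layers(address, disposable_domains, has_mx, bad_reputation):
    """Run the deterministic checks in order of cost.
    has_mx is a caller-supplied function(domain) -> bool (hypothetical).
    Returns (verdict, reason) if a layer settles the address, else None."""
    if not SYNTAX.fullmatch(address):
        return ("invalid", "syntax")
    domain = address.rsplit("@", 1)[1].lower()
    if domain in disposable_domains:
        return ("invalid", "disposable_domain")  # assumed label for this case
    if not has_mx(domain):
        return ("invalid", "no_mx_record")
    if address.lower() in bad_reputation:
        return ("invalid", "known_bad_reputation")
    return None  # evidence ran out: hand off to the costlier, less certain layers

# Invented fixtures for the example.
disposable = {"tempmail.example"}
known_mx = {"example.com", "tempmail.example"}
bad = {"dead@example.com"}
check = lambda a: cheap_layers(a, disposable, lambda d: d in known_mx, bad)

for a in ("not-an-email", "x@tempmail.example", "x@nowhere.example",
          "dead@example.com", "alice@example.com"):
    print(a, check(a))
```

Returning None rather than a default verdict keeps the hand-off explicit: an address that no cheap layer can settle is passed on unresolved instead of being quietly labeled.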

Returning unknown or catch_all is not the engine failing. It is the engine refusing to convert your reputation into a coin flip. An honest unknown protects your decisive accuracy and protects your sending, which is the entire point of paying for validation in the first place.

The honest way to raise coverage is not to start guessing — it is to gather more real ground truth so that yesterday's unknown can become today's confident answer. This is the quiet compounding value of the reputation network. As more outcomes flow in from connected mail streams, more catch-all and ambiguous addresses gain enough real-world signal to be resolved confidently. Coverage rises because we learned something true, not because we lowered the bar for what counts as a verdict.

What this means for you

You do not have to take our framing on faith. Test it. Submit a sample where you already know the outcomes and compare our verdicts against them, or send to the addresses we marked valid and watch your bounce rate. As a rule of thumb, a healthy hard-bounce rate is generally under 2 percent, and complaint rates above roughly 0.1 percent are dangerous. A validation tool earns trust by keeping those numbers low on the addresses it vouched for, not by quoting a percentage on a slide.
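If you run that test, the arithmetic is simple enough to script. This sketch uses the rule-of-thumb limits quoted above as defaults; the function name and the sample counts are invented for illustration.

```python
def send_health(sent, hard_bounces, complaints,
                max_bounce_rate=0.02,      # rule of thumb: under 2 percent
                max_complaint_rate=0.001): # rule of thumb: roughly 0.1 percent
    """Compare a send's hard-bounce and complaint rates with the given limits."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    bounce_rate = hard_bounces / sent
    complaint_rate = complaints / sent
    return {
        "bounce_rate": bounce_rate,
        "complaint_rate": complaint_rate,
        "bounce_ok": bounce_rate < max_bounce_rate,
        "complaint_ok": complaint_rate <= max_complaint_rate,
    }

# Invented example: 5,000 messages sent to addresses marked valid.
report = send_health(sent=5000, hard_bounces=40, complaints=3)
print(report)
```

Run it only over the addresses the tool marked valid, since those are the verdicts it vouched for.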

If you want the mechanics of how each layer reaches its verdict, see how email verification works. If you are cleaning a list before a send, how to clean an email list walks through the practical steps. Our promise here is narrower and, we hope, more durable than a headline number: we will tell you what we measure, grade ourselves against reality rather than against our own guesses, and abstain out loud when we are not sure — because an honest "I don't know" is worth more to your reputation than a confident wrong answer.