The Base Rate Problem
The vendor's headline is a good one: our assessment is 85% accurate at identifying top performers. The demo is clean, the validation study is real, and the number is true. The company buys it, screens its applicants, and starts hiring the people the test flags. A year later the new cohort looks about like every other cohort, and nobody can say why the 85% didn't show up in the results.
It didn't show up because of a number nobody put on the slide: how rare a top performer actually is. Work it through. Say genuine top performers are 10% of the applicant pool, and the test catches 85% of them and correctly clears 85% of everyone else. Run a thousand applicants through it. Of the 100 real top performers, the test flags 85. Of the 900 who aren't, it wrongly flags 15% — 135 people. So the test hands you 220 names, and only 85 of them are the real thing. The "85% accurate" test is wrong about 61% of the people it tells you to hire.
Nothing about the test changed. The base rate did the damage.
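The thousand-applicant arithmetic above can be checked in a few lines. This is a minimal sketch, not anyone's production tool: the function name `confusion_counts` and its parameter names are made up for illustration, and it assumes "85% accurate" means the test is right 85% of the time on both groups, as in the example.

```python
def confusion_counts(n, base_rate, sensitivity, specificity):
    """Expected confusion-matrix counts for n applicants, rounded to whole people."""
    positives = round(n * base_rate)      # real top performers in the pool
    negatives = n - positives             # everyone else
    tp = round(positives * sensitivity)   # top performers the test flags
    fn = positives - tp                   # top performers the test misses
    tn = round(negatives * specificity)   # others the test correctly clears
    fp = negatives - tn                   # others the test wrongly flags
    return tp, fp, fn, tn


# The example from the text: 1,000 applicants, 10% top performers, 85% / 85%.
tp, fp, fn, tn = confusion_counts(1000, 0.10, 0.85, 0.85)
flagged = tp + fp
precision = tp / flagged

print(f"flagged={flagged}, real={tp}, false alarms={fp}, missed={fn}")
print(f"share of flagged who are real:     {precision:.0%}")
print(f"share of flagged who are not real: {1 - precision:.0%}")
```

Changing only the `base_rate` argument and rerunning is the quickest way to see that the test itself never has to change for the result to change.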
They say: the test is X% accurate
It is the most natural credential in the world — for a hiring assessment, an attrition model, a fraud flag, an AI résumé screener. Ninety percent accurate. AUC of 0.88. Validated against outcomes. A single number that says the instrument works, and works well. And often the instrument really is good in the lab, against the balanced sample it was validated on.
Then it gets deployed into a population where the thing it's looking for is rare, and the same instrument starts producing mostly false alarms — not because it got worse, but because rarity is unforgiving in a way the accuracy number completely hides.
When the target is rare, accuracy lies
Here is the principal issue, and it is pure arithmetic, not opinion. When you're hunting for something uncommon, the people who don't have it vastly outnumber the people who do — so even a small error rate on that huge majority generates a flood of false positives that can swamp the true ones. The question that matters is not "how often is the test right" but "of the people it flags, how many are real," and at a low base rate those two numbers come apart violently. A test can be 90% accurate and still be wrong about most of the people it selects.
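To see the two numbers come apart, hold the test fixed and vary only the base rate. The sketch below assumes a hypothetical test that is right 90% of the time on both groups; the function names and the list of base rates are illustrative choices, not figures from any study.

```python
def accuracy(base_rate, sensitivity, specificity):
    """Share of all cases the test gets right."""
    return base_rate * sensitivity + (1 - base_rate) * specificity


def precision(base_rate, sensitivity, specificity):
    """Share of flagged cases that are real."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Same hypothetical 90% / 90% test at four different base rates.
rows = [
    (p, accuracy(p, 0.90, 0.90), precision(p, 0.90, 0.90))
    for p in (0.50, 0.10, 0.05, 0.01)
]

for base_rate, acc, prec in rows:
    print(f"base rate {base_rate:>4.0%}  accuracy {acc:.0%}  precision {prec:.0%}")
```

Accuracy stays at 90% in every row, while precision falls from 90% at a balanced base rate to roughly 8% when the target is 1 in 100.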
This is one of the oldest documented failures in human judgment. Kahneman and Tversky showed that people routinely neglect base rates, judging by how well a case fits a stereotype rather than how common the category is.[1] Even experts do it: when researchers gave physicians the textbook version (a disease in 1 of 1,000 people, a test with a 5% false-positive rate, what are the odds a positive result means disease), the most common answer was 95%. The right answer is about 2%. They had anchored on the test's accuracy and forgotten the rarity of the disease.[2]
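The 2% figure is Bayes' rule applied to the numbers in the problem. The problem as stated gives no detection rate, so the working below assumes the test catches every true case (s = 1), which is the usual reading; p is the prevalence and f the false-positive rate. The symbols are mine, chosen for illustration.

```latex
\[
P(\text{disease} \mid \text{positive})
  = \frac{s\,p}{s\,p + f\,(1 - p)}
  = \frac{1 \times 0.001}{1 \times 0.001 + 0.05 \times 0.999}
  = \frac{0.001}{0.05095}
  \approx 0.0196
\]
```

Out of every 1,000 people tested, about 1 true case is buried under roughly 50 false positives, which is where the "about 2%" comes from. A lower assumed detection rate would push the answer slightly below that.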
And psychometrics has known the selection-specific version since 1955. Meehl and Rosen showed that a test improves on simply betting the base rate only under particular conditions, and that for sufficiently rare (or sufficiently common) outcomes, a cutting score can actually do worse than ignoring the test and predicting the base rate every time.[3] A test that flags "future top performers" or "flight risks" or "high-potential leaders" (all genuinely rare) is operating in exactly the regime where accuracy flatters and base rates bite.
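The comparison is easy to run on the earlier example. The sketch below is my own framing in terms of overall accuracy, not Meehl and Rosen's notation: it compares the share of correct calls made with the test against the share you would get by always predicting the more common class. The function names are hypothetical.

```python
def accuracy_with_test(base_rate, sensitivity, specificity):
    """Share of correct calls when you follow the test's cutting score."""
    return base_rate * sensitivity + (1 - base_rate) * specificity


def accuracy_betting_base_rate(base_rate):
    """Share of correct calls when you ignore the test and always
    predict whichever class is more common."""
    return max(base_rate, 1 - base_rate)


def cutoff_beats_base_rate(base_rate, sensitivity, specificity):
    """True only if using the test is more accurate than betting the base rate."""
    with_test = accuracy_with_test(base_rate, sensitivity, specificity)
    return with_test > accuracy_betting_base_rate(base_rate)


# The 85% / 85% test at a 10% base rate, then at an assumed 40% base rate.
for p in (0.10, 0.40):
    with_test = accuracy_with_test(p, 0.85, 0.85)
    baseline = accuracy_betting_base_rate(p)
    verdict = cutoff_beats_base_rate(p, 0.85, 0.85)
    print(f"base rate {p:.0%}: test {with_test:.0%}, baseline {baseline:.0%}, test wins: {verdict}")
```

At a 10% base rate, predicting "not a top performer" for everyone is right 90% of the time, which is more than the test's 85%. For a rare outcome the inequality reduces to a simple check: the test wins on accuracy only when true positives outnumber false positives.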
Work the matrix at your base rate
The fix is not to abandon tests. It's to stop reading a single accuracy number and start working the confusion matrix at the prevalence you actually face.
Before you adopt anything, ask the precision question — of everyone this flags, what fraction will be the real thing — and compute it at your real base rate, not the validation study's. If your applicant pool is 10% top performers, an "85% accurate" test selects a group that's only ~39% top performers; that may still beat your status quo, or it may not, but now you can tell. Calibrate the threshold to the base rate instead of taking the vendor's default. Be especially wary of moving a test from the balanced sample it was validated on into a lopsided real population — that move alone can turn a great-looking instrument into a false-positive machine. And for the AI screeners now doing this at scale, the same arithmetic governs, with a darker corner: a model tuned to flag a rare "great hire" can quietly bury its false negatives — the good people it rejects — where no one ever measures them.
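One way to make "calibrate the threshold to the base rate" concrete is to ask how low the false-positive rate would have to go before the flagged group reaches a target precision. This is a rough sketch under a simplifying assumption, stated loudly: it holds sensitivity fixed, whereas in a real instrument tightening the cutoff usually lowers sensitivity as well. The function names and the 50% target are illustrative.

```python
def precision(base_rate, sensitivity, specificity):
    """Share of flagged cases that are real, at a given base rate."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def max_false_positive_rate(target_precision, base_rate, sensitivity):
    """Largest false-positive rate that still reaches the target precision.

    Solves target = s*p / (s*p + f*(1 - p)) for f, capped at 1.
    Assumes sensitivity stays fixed as the cutoff moves (a simplification).
    """
    f = (sensitivity * base_rate * (1 - target_precision)
         / (target_precision * (1 - base_rate)))
    return min(1.0, f)


base_rate, sensitivity = 0.10, 0.85

# Vendor default from the example: 85% specificity.
default_precision = precision(base_rate, sensitivity, 0.85)

# What it would take for at least half the flagged group to be real.
f_needed = max_false_positive_rate(0.50, base_rate, sensitivity)
recalibrated = precision(base_rate, sensitivity, 1 - f_needed)

print(f"precision at vendor default:  {default_precision:.1%}")
print(f"false-positive rate required: {f_needed:.1%}")
print(f"precision after recalibrating: {recalibrated:.1%}")
```

In this example the false-positive rate has to fall from 15% to under about 9.5% before half of the flagged candidates are real. Whether a given test can get there without giving up most of its true positives is a question for the vendor's validation data.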
The honest version is a worse sentence than "85% accurate." It sounds like "at your applicant mix, fewer than half the people this flags will pan out, and it barely beats your current process" — which is exactly the kind of true thing that ends a procurement conversation, and exactly why it's worth saying.
Why it's worth raising your voice about
Because the base-rate blind spot doesn't just waste money on a weak tool; it manufactures false confidence and real unfairness. A company trusts the 85% and treats the flagged candidates as a sure thing, when most of them aren't — and treats the un-flagged as safely excluded, when a chunk of the people it rejected were exactly who it was looking for. Scale that across every applicant, every quarter, run it through an automated screen that no one re-checks, and the test's confident accuracy becomes a confident, systematic error with people's careers inside it.
So when someone quotes you a test's accuracy, ask the one question that converts it into something you can use: how rare is the thing it's looking for, and of everyone it flags, how many will be real — here, in my population? Accuracy is a property of the test. Usefulness is a property of the test and the base rate, together. Quote one without the other and you haven't said whether it works.
Measurement-first method, useful whether or not you ever work with us. Reading the confusion matrix at the real base rate — precision over headline accuracy, thresholds calibrated to prevalence — is the posture behind the Principia measurement program and any honest selection or prediction work in the portfolio. The sixth of the field-craft traps, alongside Correlation Isn't a Driver, The Benchmark Trap, The Law of Small Numbers, What the Exit Data Can't See, and Significance on Demand. Every footnote names a real, checkable work.
Footnotes
1. Daniel Kahneman & Amos Tversky, "On the Psychology of Prediction," Psychological Review 80, no. 4 (1973): 237–251. Prediction by representativeness, and the systematic neglect of prior probabilities (base rates) even when they are known and relevant.
2. Ward Casscells, Arno Schoenberger & Thomas B. Graboys, "Interpretation by Physicians of Clinical Laboratory Results," New England Journal of Medicine 299 (1978): 999–1001. Given a base-rate problem (1/1,000 prevalence, 5% false-positive rate), the most common answer was 95%, where the correct positive predictive value is about 2%.
3. Paul E. Meehl & Albert Rosen, "Antecedent Probability and the Efficiency of Psychometric Signs, Patterns, or Cutting Scores," Psychological Bulletin 52, no. 3 (1955): 194–216. A test improves on predicting from the base rate only under specific conditions; for sufficiently rare or common outcomes, a cutting score can perform worse than ignoring the test.