Borrowed Validity
The assessment had been scoring candidates for three years, and no one had ever checked whether it worked.
Not out of negligence. The opposite — out of confidence. The tool had been bought on the strength of a number, and the number was real: the vendor's deck cited the research, and the research was good, and the research said this kind of assessment predicts job performance at a validity around one-half. So the company wired it into the funnel, set a cutoff, and moved on. Every applicant since had been scored, ranked, advanced or dropped, against a number that had been measured — carefully, defensibly, with decades of meta-analysis behind it — on other people, at other companies, doing other jobs.
What nobody had done was the boring thing: pull the people the tool scored high, pull the people it scored low, look at how they actually performed, and ask whether the line the assessment drew was the line that mattered here. The validity on the deck was an inheritance. It had never been earned in this building.
This is the most common measurement mistake in all of hiring, and it has a shape worth naming: borrowed validity. You adopt a predictor because it works on average, somewhere, and you treat the average as if it were yours. The number is borrowed, the decision is local, and the gap between them is where good people get screened out and bad bets get waved through — invisibly, because nobody is keeping score.
They say validity generalizes
The borrowing is not a mistake the field made by accident. It's the field's hardest-won victory, slightly misread.
For a long time, selection psychology believed in situational specificity: the idea that a test predicting performance in one setting told you nothing about another, so every employer had to validate every tool locally, on its own people, before trusting it. Then a generation of meta-analysts, Frank Schmidt and John Hunter foremost among them, took the accumulated validation studies and showed that most of the apparent setting-to-setting variation was a mirage: it was sampling error and measurement artifacts, not real differences. Correct for those, and the validities were remarkably stable across settings. General mental ability predicted performance at around .51 nearly everywhere you looked; structured interviews and work samples sat in the same neighborhood.[1] The practical upshot was liberating and mostly correct: you do not have to reinvent the wheel in every org. You can borrow the meta-analytic estimate and stand on it.
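The sampling-error argument can be checked with a small simulation. The sketch below is not Schmidt and Hunter's procedure, only a bare-bones illustration under stated assumptions: every simulated employer has the same true validity (`TRUE_RHO`), each runs a local study of `N_PER_STUDY` people (a hypothetical size), and the variance expected from sampling error alone uses the standard large-sample approximation (1 - mean_r^2)^2 / (N - 1). All names and values are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_study(rho, n, rng):
    """One local validation study drawn from a population with true correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_RHO = 0.5           # assumption: identical true validity in every setting
N_PER_STUDY = 60         # assumption: hypothetical local sample size
rs = [simulate_study(TRUE_RHO, N_PER_STUDY, rng) for _ in range(200)]

mean_r = sum(rs) / len(rs)
observed_var = sum((r - mean_r) ** 2 for r in rs) / len(rs)
# Variance we would expect across studies from sampling error alone.
sampling_var = (1 - mean_r ** 2) ** 2 / (N_PER_STUDY - 1)
share_explained = min(1.0, sampling_var / observed_var)

print(f"study validities range from {min(rs):.2f} to {max(rs):.2f}")
print(f"share of variance attributable to sampling error: {share_explained:.0%}")
```

The individual studies disagree widely even though nothing about the settings differs, which is the pattern that situational specificity had been reading as real variation.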
Validity generalization is real, and it earned its place. But it quietly taught a generation of practitioners the wrong lesson — not "the average is a good starting estimate" but "the average is the estimate, so local validation is a waste of money." The prior got promoted to the verdict. And a verdict you never check is just a belief with a citation.
The number moved
Here is the part that should make anyone who built a funnel on a borrowed number sit up.
In 2022, Paul Sackett and colleagues went back to the foundational meta-analytic estimates, the very numbers on every vendor deck, and examined how the range-restriction corrections had been done. Range restriction is the technical reason validity is hard to see, and we'll get to it; the point for now is that the corrections meant to fix it had been applied too aggressively, across the board, for years. Undo the overcorrection and the famous validities come down substantially. General mental ability, long crowned the single best predictor of job performance, turns out to be more modest than the canon claimed, and now sits behind structured interviews and biodata rather than atop the table.[2] The ranking everyone memorized was wrong, and the magnitudes were inflated.
Sit with what that means for borrowed validity. Organizations spent two decades designing selection systems around a hierarchy of predictors that a careful reanalysis just reordered. Nobody at those organizations did anything wrong by the standards of the day. They borrowed the best number the field had, and the field revised it from under them. That is the structural risk of borrowing: your most important measurement lives on someone else's balance sheet, and you find out it was restated when you read the journal — if you read the journal.
A number you didn't compute is a number you can't defend when it changes. And it will change.
You can't see your own validity
So why don't companies just check? Why did three years go by?
Because the data fights you, and the reason is the same one that made validity hard to estimate in the first place. You only get to observe job performance for the people you hired, and you hired, overwhelmingly, the ones the assessment scored well. The low scorers mostly never walked in the door, so you never learn how they'd have done. Your data is missing exactly the half of the picture that would reveal whether the tool's line was the right line. This is range restriction, given its classic treatment by Thorndike in 1949, and it does something cruel: the very act of selecting on a predictor truncates the evidence you'd need to evaluate that predictor.[3] Look at the validity in your hired population and it will look weak, not because the tool is weak, but because you've already cut the range it was sorting on.
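The truncation effect can be made concrete with a simulation. This is a minimal sketch under assumptions chosen for illustration: a true validity of 0.5 in the full applicant pool, normally distributed scores, and strict top-down hiring of the top 30 percent on the assessment alone. Real funnels select on more than one thing, so the numbers here are not a model of any actual organization.

```python
import math
import random


def pearson(pairs):
    """Pearson correlation for a list of (score, performance) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)   # fixed seed so the run is repeatable
TRUE_RHO = 0.5           # assumption: validity in the full applicant pool
HIRE_RATE = 0.30         # assumption: top 30% of scorers are hired

applicants = []
for _ in range(20000):
    score = rng.gauss(0, 1)
    perf = TRUE_RHO * score + math.sqrt(1 - TRUE_RHO ** 2) * rng.gauss(0, 1)
    applicants.append((score, perf))

# Strict top-down selection on the assessment score.
ranked = sorted(applicants, key=lambda p: p[0], reverse=True)
hired = ranked[: int(HIRE_RATE * len(ranked))]

r_all = pearson(applicants)   # what you could see if everyone were hired
r_hired = pearson(hired)      # what your HR data actually shows

print(f"validity among all applicants: {r_all:.2f}")
print(f"validity among hires only:     {r_hired:.2f}")
```

Under these assumptions the hires-only correlation comes out well below the applicant-pool value, even though the tool is working exactly as well as it did before anyone was selected.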
The result is a quiet trap. The honest local correlation looks disappointing, so the practitioner concludes local validation is uninformative, and falls back on — the borrowed number, which conveniently looks better because it was corrected for the very restriction that's deflating the local view. Round and round. The borrowing isn't just laziness. It's a rational response to data that's genuinely hard to read. The answer is not to stop looking. It's to look with the right math.
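One piece of "the right math" is the classic Thorndike Case II correction for direct restriction on the predictor: corrected r = r·U / sqrt(1 + r²(U² − 1)), where U is the applicant-pool standard deviation divided by the hired-group standard deviation. The sketch below assumes selection happened directly on this one score and that you kept every applicant's score, so both standard deviations are known. The function name and all input values are hypothetical.

```python
import math


def correct_case2(r_hired, sd_applicants, sd_hired):
    """Thorndike Case II correction for direct range restriction on the predictor.

    r_hired:       correlation observed among hires only
    sd_applicants: predictor SD in the full applicant pool
    sd_hired:      predictor SD among those actually hired
    """
    big_u = sd_applicants / sd_hired  # > 1 when hiring narrowed the score range
    return r_hired * big_u / math.sqrt(1 + r_hired ** 2 * (big_u ** 2 - 1))


# Hypothetical local numbers, for illustration only.
r_hired = 0.25
sd_applicants = 10.0
sd_hired = 6.0

corrected = correct_case2(r_hired, sd_applicants, sd_hired)
print(f"observed {r_hired:.2f} -> corrected {corrected:.2f}")

# Sensitivity: the same observed r under different assumed hired-group SDs.
# The more restriction you assume, the larger the corrected value, which is
# why an overstated restriction ratio inflates the result.
sensitivity = {
    sd: correct_case2(r_hired, sd_applicants, sd) for sd in (10.0, 8.0, 6.0, 4.0)
}
for sd, value in sensitivity.items():
    print(f"assumed hired SD {sd:>4}: corrected r = {value:.2f}")
```

The sensitivity loop is the design point: the correction is only as good as the restriction ratio fed into it, so the ratio should come from your own applicant and hire data rather than from a default. Indirect restriction (selection on something correlated with the score) needs a different correction and is outside this sketch.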
Validity isn't a constant. It's a posterior.
Here is the turn. The decades-long argument — borrow the meta-analytic validity versus validate locally — was framed as a fight, and it was never a fight. It's a Bayesian both/and, and stating it that way dissolves the whole problem.
The meta-analytic validity is your prior: the best estimate of how well this predictor works, before you've looked at your own outcomes. Your own predictor-and-performance data, range-restriction-corrected, is the likelihood: what your building actually shows. Fuse them and you get a posterior — your validity, anchored to the literature but updated by your evidence, sharper than either one alone. When your data is thin, the posterior leans on the prior and you've lost nothing. When your data is rich, it pulls toward what's true here. The org with two hires a year keeps the borrowed estimate; the org with two thousand earns its own. Nobody has to choose.
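The fusion can be sketched as a normal-normal update on the Fisher z scale. This is one simple way to do it, not the only one, and it rests on assumptions worth stating loudly: the prior spread `PRIOR_SD_Z` is a hypothetical value standing in for between-setting variability, the local correlation is treated as if its sampling variance were 1/(n − 3), and a range-restriction-corrected correlation is in fact less certain than that, so the local evidence is somewhat overweighted here. The prior mean of 0.51 is the figure quoted in the text; a revised estimate could be substituted.

```python
import math


def posterior_validity(prior_r, prior_sd_z, local_r, n_local):
    """Combine a literature prior with local evidence on the Fisher z scale.

    Returns (posterior mean, lower 95% bound, upper 95% bound) as correlations.
    """
    if n_local <= 3:
        raise ValueError("need more than 3 paired observations")
    prior_mean = math.atanh(prior_r)       # Fisher z of the prior validity
    prior_prec = 1.0 / prior_sd_z ** 2     # precision = 1 / variance
    data_mean = math.atanh(local_r)
    data_prec = n_local - 3.0              # sampling variance of z is 1 / (n - 3)

    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
    half_width = 1.96 / math.sqrt(post_prec)
    return (
        math.tanh(post_mean),
        math.tanh(post_mean - half_width),
        math.tanh(post_mean + half_width),
    )


PRIOR_R = 0.51      # meta-analytic figure quoted in the text
PRIOR_SD_Z = 0.15   # assumption: hypothetical prior spread on the z scale
LOCAL_R = 0.30      # assumption: hypothetical local corrected validity

small_org = posterior_validity(PRIOR_R, PRIOR_SD_Z, LOCAL_R, n_local=6)
large_org = posterior_validity(PRIOR_R, PRIOR_SD_Z, LOCAL_R, n_local=2000)

for name, (mean, lo, hi) in (("small org", small_org), ("large org", large_org)):
    print(f"{name}: posterior validity {mean:.2f} (95% interval {lo:.2f} to {hi:.2f})")
```

With six outcomes the posterior barely moves off the prior; with two thousand it sits almost on the local value and the interval tightens accordingly.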
And the posterior does something the prior never could: it tells you which of your predictors are pulling weight. Run several through the same loop and you can see each one's contribution against the others — and prune the ones that, in your data, add nothing distinguishable from what you already have. Borrowed validity gives you a ranking of predictors in general. Your posterior gives you the ranking in your org, for your roles, which is the only ranking that sets your cutoffs.
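A simple way to see whether a predictor is "pulling weight" beside one you already use is its incremental contribution: the squared semipartial correlation, (r_new − r_existing·r_between)² / (1 − r_between²). The sketch below is a point-estimate simplification that ignores uncertainty, so it is a screening aid rather than the full posterior comparison described above. The function name and all correlations are hypothetical.

```python
def incremental_r2(r_new, r_existing, r_between):
    """Extra share of criterion variance the new predictor explains beyond the existing one.

    r_new:      validity of the candidate predictor
    r_existing: validity of the predictor already in the funnel
    r_between:  correlation between the two predictors
    """
    if abs(r_between) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_new - r_existing * r_between) ** 2 / (1 - r_between ** 2)


# Hypothetical local values, for illustration only.
R_INTERVIEW = 0.40    # predictor already in use
R_ASSESSMENT = 0.35   # predictor under review

# Same validity for the assessment, two different overlaps with the interview.
redundant = incremental_r2(R_ASSESSMENT, R_INTERVIEW, r_between=0.80)
distinct = incremental_r2(R_ASSESSMENT, R_INTERVIEW, r_between=0.10)

print(f"adds {redundant:.4f} of variance explained when highly overlapping")
print(f"adds {distinct:.4f} of variance explained when mostly independent")
```

The same standalone validity can be worth almost nothing or quite a lot depending on overlap, which is why a general ranking of predictors cannot tell you what to prune locally.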
And it decays
There's one more reason a borrowed number is dangerous, and it's the one most easily forgotten: validity is not fixed in time. The thing being predicted moves. The job changes as tools and markets change; the applicant pool shifts; what made someone effective at hire is not always what makes them effective three years in. The performance criterion itself is dynamic: researchers have shown that the rank-order of performers reshuffles over time, which means a predictor's correlation with performance is a moving target, not a constant you measure once and file.[4]
So even a perfect local validation has a shelf life. A one-time study is a photograph of a relationship that keeps changing — useful the day you take it, slowly going stale after. This is why the answer can't be "validate locally, once, and you're done." It has to be a loop: re-estimate as new outcomes arrive, watch for the predictor whose validity is drifting toward zero, and retire it before it quietly starts costing you good candidates. A validity you established in 2023 and never revisited is just a fresher kind of borrowed.
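A decay flag can be as simple as comparing each new cohort's validity with the baseline cohort using the standard Fisher z test for two independent correlations. The sketch below assumes the cohorts are separate groups of hires, uses a one-sided threshold of 1.645, and makes no adjustment for checking several cohorts in a row. The cohort labels, correlations, and sample sizes are hypothetical.

```python
import math


def validity_drop_z(r_base, n_base, r_recent, n_recent):
    """z statistic for a drop in validity between two independent cohorts."""
    if n_base <= 3 or n_recent <= 3:
        raise ValueError("each cohort needs more than 3 paired observations")
    diff = math.atanh(r_base) - math.atanh(r_recent)
    se = math.sqrt(1.0 / (n_base - 3) + 1.0 / (n_recent - 3))
    return diff / se


def flag_decay(history, z_threshold=1.645):
    """Return labels of cohorts whose validity fell reliably below the baseline.

    history: list of (label, r, n) in time order; the first entry is the baseline.
    """
    _, r0, n0 = history[0]
    return [
        label
        for label, r, n in history[1:]
        if validity_drop_z(r0, n0, r, n) > z_threshold
    ]


# Hypothetical cohort-by-cohort validities, for illustration only.
history = [
    ("2023", 0.35, 400),   # baseline study
    ("2024", 0.31, 380),   # small dip, within sampling noise
    ("2025", 0.18, 420),   # larger drop
]

flagged = flag_decay(history)
print("cohorts flagged for decay:", flagged)
```

Run on a schedule as outcomes arrive, a check like this turns the one-time photograph into the loop the text calls for; a flagged cohort is a prompt to investigate, not an automatic verdict to retire the predictor.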
Closing the loop
Put the pieces together and the prescription is concrete, and it is not "trust the deck."
Take your own predictors. Pair them with your own outcomes. Correct for the range restriction that selection imposed, so the local signal is readable. Anchor the estimate to the meta-analytic prior — borrow it honestly, as a starting point — and then update, so the number converges on what's true in your organization as your evidence accumulates. Watch each predictor's contribution, prune the ones that don't earn their place, and flag the ones whose validity is decaying. What comes out the other end is no longer borrowed. It's a posterior you computed, that you can defend when someone asks "how do you know," and that gets more yours every quarter instead of going stale.
The assessment that scored candidates for three years was never the problem. The problem was treating a number measured on strangers as if it had been measured on you — and never building the loop that would turn the one into the other. Borrowed validity is where everyone starts. It should not be where anyone stays.
This is a companion piece in the Measurement Meets AI program, which argues that the unglamorous machinery of measurement science is the under-used answer to the decisions organizations are automating. Its siblings, The Reliability Problem and Themes Aren't Evidence, take up the same move for noisy raters and for open text. The closed-loop validity capability described here (anchoring your predictors to the literature prior and updating to your own posterior, with pruning and decay flags) belongs to the Principia measurement stack; the capability positioning carries the compressed version. No numbers here are invented: every figure traces to its cited source, and the opening scenario is a composite, not a client engagement.
Footnotes
[1] Frank L. Schmidt & John E. Hunter, "The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 85 Years of Research Findings," Psychological Bulletin, 124 (1998): 262–274. Source of the canonical operational-validity table (general mental ability ≈ .51, structured interviews ≈ .51, work-sample tests ≈ .54) and of the validity-generalization argument that situational specificity is largely an artifact of sampling error and measurement unreliability; builds on Schmidt & Hunter's earlier validity-generalization work from the late 1970s onward.
[2] Paul R. Sackett, Charlene Zhang, Christopher M. Berry & Filip Lievens, "Revisiting Meta-Analytic Estimates of Validity in Personnel Selection: Addressing Systematic Overcorrection for Restriction of Range," Journal of Applied Psychology, 107 (2022): 2040–2068. The authors find that standard approaches to range-restriction correction produced substantial overcorrection, so the operational validity of many widely used predictors had been overestimated; general mental ability in particular is more modest than long believed and falls behind predictors such as structured interviews and biodata in the revised matrix. See also Sackett et al. (2023) on redesigning selection systems in light of the revised estimates.
[3] Robert L. Thorndike, Personnel Selection: Test and Measurement Techniques (1949). The classic treatment of range restriction (the "Case II" direct-selection scenario): selecting on a predictor truncates its variance in the retained sample, attenuating the observed predictor–criterion correlation. You observe the criterion only for those selected, which is precisely the subgroup in which range is most restricted.
[4] David A. Hofmann, Rick Jacobs & Joseph E. Baratta, "Dynamic Criteria and the Measurement of Change," Journal of Applied Psychology, 78 (1993): 194–204. Evidence that job performance is not temporally stable: both average performance and the rank-ordering of individuals change over time, so a predictor's validity against a criterion measured at one point is not guaranteed to hold at another. The broader "dynamic criterion" literature makes the same point: what predicts performance can drift.