peopleanalyst

magazine · Thesis · the AI × people-analytics spine

Everyone says AI will transform people analytics. The under-told story is the reverse — an AI judgment is a measurement, and the century-old science of trustworthy measurement is exactly what AI lacks. The empty direction is where the moat is.

By Mike West

June 22, 2026

The Other Direction

A company decides to add AI to its people analytics. A copilot appears on the dashboard. A model starts ranking the résumés. Another one reads the engagement comments and hands back themes. Everywhere the arrow points the same way — AI is the upgrade, people analytics is the thing being upgraded, the junior partner finally getting some horsepower. That is the entire conversation, in every conference track and vendor deck, and it is half the story.

The other half, the one almost nobody is telling, is that the arrow runs the other way too. And the other direction is the one with the moat.

They say AI will transform people analytics

The one-directional story is not wrong, exactly. AI genuinely does useful things for people analytics: it drafts the survey, speeds the dashboard, flags the at-risk team, reads more open text in an afternoon than a research team could in a month. Real work, happening now. But casting people analytics as the beneficiary, the field about to be transformed, quietly smuggles in an assumption that turns out to be backwards: that the valuable thing in the room is the AI, and the measurement is just the data it feeds on.

Look at what the AI is actually doing in each of those examples, though. It is measuring.

An instrument arrived, and by our standards it's undisciplined

When a model scores a résumé, it is assigning a number to a person. When it rates a comment for sentiment, it is a rater. When it judges whether an answer is "good," or sorts eleven thousand verbatims into themes, it is running an assessment. That is the same kind of act a psychologist performs with a validated scale, except at a million times the volume and with none of the scrutiny.

And here is the principal issue, stated plainly: by the standards people analytics has used for a century to decide whether a measurement can be trusted, the model is an undisciplined instrument. Nobody checked whether it agrees with itself when you run it twice; that is reliability. Nobody checked whether it measures the construct it claims or some fluent artifact standing in for it; that is validity. Nobody checked whether it scores one group systematically lower than another for reasons that have nothing to do with the job; that is bias. We would never let a human assessment into a hiring decision without that scrutiny; it would be malpractice, and in the United States, where a selection procedure that produces adverse impact has to be validated, it could be unlawful. We are letting the model in anyway, because it is new and it is fluent, and fluent reads as competent.

The discipline it needs is the one we treated as the poor cousin

Here is the part that should change how the field sees itself. The science of measuring noisy, human, hard-to-pin-down things — and knowing how much to trust the result — already exists, and people analytics has been quietly carrying it the whole time.

Psychometrics was founded on exactly this problem. Spearman split a measurement into true score and error in 1904 because he understood that every instrument is unreliable and the only question is how much.[1] Reliability is, at bottom, the mathematics of whether two raters agree and how much of that agreement is just chance. Cohen's kappa handles categories and Krippendorff's alpha handles content coding; both were built decades ago for precisely the situation an LLM-as-judge now creates at scale.[2][3] Construct validity is the discipline of asking whether a score reflects the thing or its shadow.[4] And test fairness, meaning whether a selection instrument is biased against a group, is not a vibe in this tradition; it is a regulated, regression-grade science with a half-century of case law behind it.[5] That hundred-year-old toolkit is a direct answer to the hardest open question in applied AI: how do you know the machine measured what you think it measured, consistently, and fairly? The field that got treated as AI's poor cousin is holding the one thing AI cannot generate for itself, which is a reason to believe the output.
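
To make the chance correction concrete, here is a minimal sketch of Cohen's kappa applied to two labeling passes over the same comments, for example the same model run twice. The function name, the labels, and the ten data points are illustrative assumptions, not output from any real system. Kappa is observed agreement minus chance agreement, divided by one minus chance agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two sets of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where the two passes match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability of a match if each pass assigned labels
    # independently at its own observed base rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both passes used one identical label for everything; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical data: sentiment labels from two runs of the same model.
run_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
run_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "neg"]

kappa = cohens_kappa(run_1, run_2)
print(round(kappa, 2))  # raw agreement is 0.8, but kappa is 0.6
```

The two runs match on eight of ten comments, which sounds strong until you notice that two raters splitting labels fifty-fifty would match half the time by luck alone. Kappa reports what is left after that is removed.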

Run the arrow the other way on purpose

So run it the other way, deliberately. Treat every AI judgment as a measurement and put it through the same gates you would put a human rater through. Does the model agree with itself on a re-run, and with expert humans, often enough to trust — and is that agreement better than chance? Does it track the construct or a length-and-confidence artifact? Does it produce adverse impact across groups? These are not new questions anyone has to invent. They are the standing questions of measurement, pointed at a new kind of rater.
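
The adverse-impact gate can be sketched the same way. The snippet below applies the four-fifths rule of thumb from the Uniform Guidelines: each group's selection rate is compared with the highest group's rate, and a ratio under 0.8 is treated as a flag worth investigating. The group names and counts are hypothetical, and this is a first screen under those assumptions, not a full predictive-bias analysis.

```python
def selection_rates(selected, applicants):
    """Selection rate per group: number selected divided by number who applied."""
    return {group: selected[group] / applicants[group] for group in applicants}


def impact_ratios(selected, applicants):
    """Each group's selection rate relative to the highest group's rate."""
    rates = selection_rates(selected, applicants)
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group had any selections; ratios are undefined")
    return {group: rate / highest for group, rate in rates.items()}


def flagged_groups(selected, applicants, threshold=0.8):
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = impact_ratios(selected, applicants)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical counts: résumés a model advanced, by applicant group.
applicants = {"group_a": 200, "group_b": 150}
advanced = {"group_a": 60, "group_b": 30}

ratios = impact_ratios(advanced, applicants)
flagged = flagged_groups(advanced, applicants)
print(ratios)
print(flagged)
```

Here group_a advances at 30 percent and group_b at 20 percent, so group_b's ratio is about 0.67 and it is flagged. A flag is where the audit starts, not where it ends.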

This is the chocolate-and-peanut-butter of it, and the metaphor is exact because neither half is the dessert alone. AI gives measurement a reach it never had — you can finally code a million comments to a validated construct, run an adaptive instrument for the price of a few tokens, measure things that were too expensive to measure last year. And measurement gives AI the one discipline it has never had on its own — a principled reason to believe the number. Each genuinely makes the other better. The difference is that one direction is crowded and the other is nearly empty, and the empty one is where the defensibility lives: anyone can call an API, and almost no one can tell you whether the answer is valid.

The spine

This is the organizing idea underneath everything on this site, and it's worth saying out loud because it inverts the usual posture. The books, the registry that grades its own sources and carries priors, the diagnostic that is real psychometrics with AI riding shotgun as a consumer rather than driving as the engine — they are all the same move, the arrow run the under-told way.

It is, I'll admit, the less thrilling half. "Behavioral science quietly makes AI trustworthy" will never get the keynote slot that "AI changes everything" gets. It is the unglamorous side of the trade, and it requires saying the thing the room doesn't want to hear: that the magic box is an instrument, and instruments get audited, however fluent they sound. But people analytics is not about to be replaced by AI. The century of measurement science it has been carrying, half-ignored, is precisely what AI needs most. The field's future isn't being upgraded by the machine. It's that the machine finally needs the field. Run the arrow the other way.


The keystone of the AI × people analytics thesis: the under-told direction is behavioral science making AI trustworthy, not just AI making analytics faster, and that direction is where the moat is. Its instances are spread across this site: the source-graded Principia registry, the confirmatory-measurement argument in "Themes Aren't Evidence" and "The Reliability Problem," and the proof graph that holds our own claims to the same standard. Every footnote names a real, checkable work.

Footnotes

  1. Charles Spearman, "'General Intelligence,' Objectively Determined and Measured," American Journal of Psychology 15 (1904): 201–292 — the origin of classical test theory's split of an observed score into true score plus measurement error.

  2. Jacob Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement 20, no. 1 (1960): 37–46 — kappa, agreement between raters corrected for the agreement expected by chance.

  3. Klaus Krippendorff, Content Analysis: An Introduction to Its Methodology (Sage, 1980 and later editions) — Krippendorff's alpha, a general reliability coefficient for coding by multiple raters; the standard measure for whether a coding scheme (human or automated) is trustworthy.

  4. Lee J. Cronbach & Paul E. Meehl, "Construct Validity in Psychological Tests," Psychological Bulletin 52, no. 4 (1955): 281–302 — whether a measure reflects the theoretical construct it claims to, rather than an artifact.

  5. T. Anne Cleary, "Test Bias: Prediction of Grades of Negro and White Students in Integrated Colleges," Journal of Educational Measurement 5, no. 2 (1968): 115–124 — the regression model of predictive bias; codified for employment in the Uniform Guidelines on Employee Selection Procedures (1978), the U.S. standard (including the four-fifths rule for adverse impact) that governs whether a selection instrument may be used.
