The Reliability Problem
We ran the experiment four times, on purpose, and got four different books.
The text never changed. It was the same management title each time — the same chapters, the same sentences, the same author making the same arguments in the same order. We handed it to four readers and asked each the same plain question: what is this book actually claiming? Lay out its model — the moving parts and how they connect. Four careful readers. One book. The answers came back at thirty-two parts, twenty-two, forty, and sixteen.
Not four readings that differed at the edges. Four readings that couldn't agree on how big the thing was.
The readers were language models — two of them, plus the same model run two different ways — and that detail is the whole problem, because a language model is supposed to be the cure for this, not a fresh case of it. We reach for the machine precisely when we want the subjective made objective: read these ten thousand résumés, code these survey comments, pull the structure out of this document, and do it the same way every time. The promise is consistency. What we got was a reader who, asked twice, disagreed with himself — and a second reader who disagreed with the first by almost half.
So which one is right? Thirty-two or sixteen?
It is a genuinely uncomfortable question, and the instinct in the room — the AI room, in 2025 — is to answer it with more machine. A bigger model. A better prompt. A benchmark to rank them on. Take a vote and call the majority the truth. The field has a gold-rush energy about reliability right now, and most of that energy is pointed at building something new.
Here is the turn this essay wants to make: the uncomfortable question is not new, and it is not even open. A hundred years ago a different set of people walked into the exact same room. They were not measuring books with machines; they were measuring intelligence, and attitudes, and the severity of a diagnosis, and the quality of an essay — measuring people, with other people as the instrument. And they ran headfirst into the thing we just ran into: give two careful human raters the same case and they will disagree, and you cannot publish "well, one of them said sixteen." They could not wave the disagreement away, so they did something harder. They built the mathematics of trusting a noisy reader.
That mathematics has a name in every corner — reliability, agreement, generalizability — and it has been sitting, finished and load-bearing, in the social sciences for the better part of a century. It is the answer to thirty-two-versus-sixteen. The strange part of this moment is not that AI is unreliable. It is that we are reinventing, badly and at great expense, a manual someone already wrote.
They say it's new
Spend an afternoon in the current literature on trusting AI judgment and you will watch the manual get re-derived in real time. A team runs the same classification task through a model several times, notices the answers move, and lands on a fix: run it a few more times and take the majority answer. They give it a name (self-consistency, ensembling, majority voting) and report it as a contribution.[1] It is a contribution to that paper's accuracy. It is also exactly what a survey methodologist would have told them to do in 1955, under a different name, with the math already worked out for how many raters you need and how much the agreement is worth.
The tell is not that the field is wrong. The fixes mostly work. The tell is the vocabulary. "Trust" gets reported as a vibe, or as a single accuracy number against a benchmark, as though reliability were one scalar you print at the top of the model card. The question of which disagreements matter, and of how much of the variance is the model versus the prompt versus the run versus the difficulty of the item (the question that turns "it's noisy" into a number you can act on), mostly doesn't get asked. Not because it's unanswerable. Because the people asking it haven't met the people who answered it.
The first coefficient
In 1904 Charles Spearman wrote down the idea that every measurement is two things added together: the true value you're after, and error. Reliability is just the share of the variation in what you measured that isn't error.[2] It sounds obvious now. It was not obvious then, and its consequences are still routinely ignored, because the first one is counterintuitive: raw agreement lies.
Suppose two coders label survey comments as "complaint" or "not," and ninety percent of the comments aren't complaints. Two coders who both just write "not" every time will agree with each other on every comment and match the true label ninety percent of the time, and they will have learned nothing, measured nothing, demonstrated no skill whatsoever. Percent agreement rewards them anyway. The fix arrived in 1960, when Jacob Cohen published kappa: agreement corrected for the agreement you'd expect by chance alone.[3] Two coders who each write "not" about ninety-five percent of the time, and who agree on ninety percent of the comments, score a kappa near zero, which is the honest answer. Fleiss extended the idea to any number of raters; Krippendorff built a coefficient, alpha, that handles any number of coders, any level of measurement, and missing values without flinching.[4] These are not exotic instruments. They are the thermometers of anyone who has ever had to defend a measurement made by a human being.
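The chance correction is short enough to write out. What follows is a minimal sketch, not anyone's reference implementation: the function name `cohen_kappa` and the 100-comment data set are hypothetical, built so that each coder writes "not" ninety-five percent of the time and the two never flag the same comment.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (percent agreement, Cohen's kappa) for two coders' nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which the two coders match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match if each coder labelled
    # independently at their own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return p_observed, (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: 100 comments, each coder flags 5 complaints, never the same ones.
coder_a = ["complaint"] * 5 + ["not"] * 95
coder_b = ["not"] * 5 + ["complaint"] * 5 + ["not"] * 90

agreement, kappa = cohen_kappa(coder_a, coder_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.3f}")
# prints: percent agreement = 0.90, kappa = -0.053
```

Ninety percent raw agreement, and a kappa slightly below zero. If both coders wrote "not" on every single comment, chance agreement would be exactly one and kappa would be undefined, which is why the sketch raises an error in that case rather than returning a number.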
Then Lee Cronbach did the thing that matters most for the age we just entered. He took reliability (until then a single number, the share that isn't noise) and asked the better question: which noise? In the framework he and his colleagues called generalizability theory, the disagreement gets decomposed into its sources. How much of the wobble is the rater? How much is the occasion? How much is the particular item? Generalizability theory turns "it's unreliable" into a budget: here is exactly where your error is coming from, and therefore here is what to fix.[5]
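As a rough illustration of that budget, the sketch below estimates variance components for the simplest possible design: every rater scores every item exactly once. The ratings table, the function name `variance_components`, and the choice of a single facet (the rater) are assumptions made for the example. A real generalizability study of a model would likely cross more facets, such as prompt and run.

```python
import numpy as np


def variance_components(scores):
    """Estimate variance components for a fully crossed items x raters table.

    scores[i, j] is the rating rater j gave item i (one rating per cell).
    """
    scores = np.asarray(scores, dtype=float)
    n_items, n_raters = scores.shape
    grand = scores.mean()
    item_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Mean squares from a two-way ANOVA without replication.
    ms_item = n_raters * np.sum((item_means - grand) ** 2) / (n_items - 1)
    ms_rater = n_items * np.sum((rater_means - grand) ** 2) / (n_raters - 1)
    residual = scores - item_means[:, None] - rater_means[None, :] + grand
    ms_resid = np.sum(residual ** 2) / ((n_items - 1) * (n_raters - 1))

    # Solve the expected-mean-square equations; clip negative estimates to zero.
    return {
        "item": max((ms_item - ms_resid) / n_raters, 0.0),    # real differences between items
        "rater": max((ms_rater - ms_resid) / n_items, 0.0),   # leniency or severity of raters
        "residual": ms_resid,                                  # interaction plus leftover noise
    }


# Hypothetical 1-to-5 ratings: five items (rows) scored by three raters (columns).
ratings = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 4, 3],
])
components = variance_components(ratings)
total = sum(components.values())
for source, value in components.items():
    print(f"{source:>8}: {value:.3f} ({value / total:.0%} of total)")
```

The output reads as the budget the paragraph describes: the "item" line is signal, the "rater" line is systematic leniency or severity, and the "residual" line is everything this small design cannot separate.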
An LLM is a rater
Say it plainly, because the whole argument turns on it: a language model doing a classification or extraction task is a rater. It takes a thing in the world — a résumé, a comment, a chapter — and returns a judgment. That is the same act a human coder performs, and it inherits the same mathematics, one-to-one. Kappa and alpha for agreement. The intraclass correlation when the judgment is a number on a scale. Generalizability theory to ask whether your noise is the model, the prompt, the run, or the altitude you read at. Many-facet Rasch measurement to model the fact that one rater is simply more lenient than another — a thing that is as true of model-as-judge as it was of the professor who grades easy.
This isn't a metaphor I'm proposing. It is already happening, just without the discipline claiming credit. Open the recent papers and you find researchers reporting Fleiss's kappa and the intraclass correlation across several models on the same task, or Krippendorff's alpha across repeated runs of a single model.[6] They are using the social scientist's instruments. They mostly do not cite the social scientist, and, more importantly, they mostly do not inherit the interpretive discipline that comes with the instrument: the conventions for how high is high enough, the insistence that the rating task be bounded before the coefficient means anything, the warning that comes next.
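For the repeated-runs case, Krippendorff's alpha on nominal labels can be computed from a coincidence table in a few lines. This is a hedged sketch under stated assumptions: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per item, `None` for a missing run), and the example labels are all hypothetical, and only the nominal-data form of the coefficient is covered.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per item, holding the labels that item received across
    runs or coders. Use None where a run produced no label.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        # Every ordered pair of labels from different runs, weighted by 1/(m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable labels")

    # Observed and expected disagreement (nominal metric: any mismatch counts as 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected


# Hypothetical: three runs of one model labelling five comments; one run failed on item 4.
runs_per_item = [
    ["complaint", "complaint", "complaint"],
    ["not", "not", "complaint"],
    ["not", "not", "not"],
    ["not", None, "not"],
    ["complaint", "not", "complaint"],
]
print(f"alpha = {krippendorff_alpha_nominal(runs_per_item):.3f}")
```

The coefficient is only as meaningful as the task is bounded: the same function will happily return a number for labels that were never defined by a rubric.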
The wall everyone hits
Here is where the story almost always goes wrong, and where I want to be careful, because the lazy version of this essay writes itself and it is the wrong essay.
The lazy version says: AI is unreliable, so put a human back in the loop. That conclusion feels safe and it is mostly false, for a reason that should unsettle anyone who has run a performance-review cycle. Single human raters of other humans were never reliable either.
Look at performance ratings, the most-studied rater task in all of organizational life. When researchers decompose where a performance rating actually comes from, the largest single source of variance is not the employee's performance. It is the idiosyncrasy of the rater (the manager's personal, systematic, repeatable way of seeing), which in the canonical decomposition accounts for more of the score than the ratee's actual performance does.[7] The reliability of a single supervisor's rating, measured across raters, sits around one-half on a scale where one is perfect and zero is noise.[7] Not because managers are lazy. Because one human judging another human is a single rater, and a single rater is a noisy instrument, and we have known this with numbers since before most current managers were born.
So when the AI walks up to the same task and wobbles, it is not failing because it is AI. It is failing the way we fail, for the same structural reason: it is one rater, asked to do alone what the mathematics says no single rater can do reliably. The hope was that the machine would transcend human subjectivity. The gut-punch is that it walked into the identical wall — and the genuinely interesting, genuinely open question is not whether to trust it but how close to us it lands. AI at a given rating task may turn out to be a little worse than a person, about the same, or marginally better. It will not be error-free. We should measure that gap, honestly, per task — not assume it, and certainly not assume the human is the fix.
Agreement is not truth
There is a second wall, behind the first, and it is the more dangerous one.
Even if your raters agree — even if four models, or four people, return the same answer — they can be uniformly, confidently wrong. Reliability is agreement. Validity is being right about the thing you claim to measure. They are not the same, and agreement is necessary but nowhere near sufficient. Worse: models that share a base have correlated errors, so high inter-model agreement can be a measurement of their common blind spot rather than of the truth. Four readers nodding in unison is not evidence; sometimes it is just four copies of the same mistake.
The psychometricians left us a hard piece of math here, and it deserves to be a slogan: reliability caps validity. A measure can correlate with the real outcome you care about no better than the square root of its own reliability. A rater at reliability one-half can never correlate above about point-seven with anything real, and that is before a single bias even enters the picture.[8] The floor sets the height of the building. As one measurement text puts it, if you cannot reliably measure an attitude, you will never be able to predict behavior from it.[8] This is the bridge the AI-evaluation conversation keeps missing: you cannot validate your way out of an unreliable instrument. You have to fix the reliability first, and there is a known way to do it.
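The cap can be written out in one line. The notation here is an assumption introduced for illustration: X is the rating, Y the outcome, T_X and T_Y their true scores, and the primed correlations are the reliabilities of each measure. Under classical test theory with uncorrelated errors:

```latex
\[
  \rho_{XY} \;=\; \rho_{T_X T_Y}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}
  \qquad\Longrightarrow\qquad
  \lvert \rho_{XY} \rvert \;\le\; \sqrt{\rho_{XX'}}
\]
\[
  \rho_{XX'} = 0.5
  \quad\Longrightarrow\quad
  \lvert \rho_{XY} \rvert \;\le\; \sqrt{0.5} \;\approx\; 0.71
\]
```

The inequality follows because neither the true-score correlation nor the outcome's own reliability can exceed one. If the outcome is itself measured with error, the ceiling drops further.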
How many readers, and the marriage
Generalizability theory doesn't just diagnose the noise; it prices the fix. Its decision studies turn "add more raters" from a shrug into arithmetic: given where your error actually lives, here is how many raters of which kind it takes to reach a target reliability. That is the move our whole portfolio keeps returning to under a different name — value of information. Reliability stops being a mystery and becomes a dial with a posted price: this much certainty costs this many reads, and here is whether the decision in front of you is worth buying it down.
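A decision study for the simplest case fits in a dozen lines. This sketch assumes a single rater facet and uses the relative generalizability coefficient for an average over n raters; the function names and the variance split are hypothetical. The .52 and .48 shares are chosen only to echo the single-rater reliability of roughly .52 cited in the footnotes, treating that share as signal and the rest as rater-linked error.

```python
def g_coefficient(var_item, var_residual, n_raters):
    """Relative generalizability coefficient for the mean of n_raters ratings."""
    return var_item / (var_item + var_residual / n_raters)


def raters_needed(var_item, var_residual, target, max_raters=1000):
    """Smallest panel size whose averaged rating reaches the target reliability."""
    if not 0 < target < 1:
        raise ValueError("target reliability must be strictly between 0 and 1")
    for n in range(1, max_raters + 1):
        if g_coefficient(var_item, var_residual, n) >= target:
            return n
    return None  # target not reachable within max_raters


# Illustrative variance shares: 52% signal, 48% error for a single rater.
VAR_ITEM, VAR_RESIDUAL = 0.52, 0.48

for n in (1, 2, 4, 8):
    print(n, round(g_coefficient(VAR_ITEM, VAR_RESIDUAL, n), 3))

print(raters_needed(VAR_ITEM, VAR_RESIDUAL, target=0.80))  # prints 4
print(raters_needed(VAR_ITEM, VAR_RESIDUAL, target=0.90))  # prints 9
```

Under these assumed shares, moving from .80 to .90 more than doubles the panel, from four raters to nine. That is the posted price: each further increment of certainty costs more reads than the last.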
And the prescription the human-judgment literature wrote, a century in the making, is not "use a human." It is four things, and they apply identically to humans and to machines:
- Standardize the inputs. Bound the task. Give the rater a rubric. An agreement coefficient is only meaningful on a task that is rubric-bounded in the first place — which is exactly why "an AI is just a rater, compute its alpha" earns the coefficient only when the task is bounded classification and not open interpretation. Standardization isn't the polish; it's the gate that makes the statistic mean anything.
- Anchor to a criterion that matters. Tie the rating to a real outcome where one exists — the callback, the actual performance, the diagnosis confirmed — and lean on construct validity where it doesn't.
- Train the raters. Calibrate them against the rubric and against each other. For models, this is prompt design, examples, and instruction — the same act under a new name.
- Use a diverse, multi-rater panel. Not as an inefficiency to be optimized away. As the feature. The single rater is the disease; the panel is the cure, and it always was.
Do those four things and you get a different instrument, one that returns an answer with a measured reliability attached: a confidence tier per claim, and a queue of exactly the disagreements that are worth a human's scarce attention. There is early evidence this works at the high end: in one clinical study, an agreement-thresholded panel of model raters (answer only when they agree, withhold and escalate when they don't) exceeded the human gold standard on the large majority of cases.[9] Withhold-on-disagreement is not a hack. It is triage, and it is the oldest idea in measurement: know what you don't know, and route it.
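The withhold-on-disagreement rule is simple to sketch. The function name `triage`, the `min_agreement` parameter, and the example labels are hypothetical; this is one plausible shape for such a gate, not the procedure used in the clinical study cited.

```python
from collections import Counter


def triage(panel_labels, min_agreement=1.0):
    """Answer only when enough of the panel agrees; otherwise escalate to a human.

    panel_labels: the labels a panel of raters (models or people) gave one item.
    min_agreement: share of the panel that must back the top label (1.0 = unanimous).
    """
    if not panel_labels:
        raise ValueError("panel returned no labels")
    label, votes = Counter(panel_labels).most_common(1)[0]
    share = votes / len(panel_labels)
    if share >= min_agreement:
        return {"decision": label, "agreement": share, "escalate": False}
    # Below threshold: withhold the answer and route the item to review.
    return {"decision": None, "agreement": share, "escalate": True}


# Hypothetical panel of four raters on three items.
items = {
    "comment-1": ["complaint", "complaint", "complaint", "complaint"],
    "comment-2": ["complaint", "not", "complaint", "complaint"],
    "comment-3": ["not", "complaint", "not", "complaint"],
}
review_queue = []
for item_id, labels in items.items():
    result = triage(labels, min_agreement=1.0)
    if result["escalate"]:
        review_queue.append(item_id)
    print(item_id, result)
print("needs a human:", review_queue)
```

The threshold is the dial. A unanimous rule answers fewer items and escalates more; a looser rule answers more and accepts more risk. Note that agreement here is still only agreement: a panel of models that share a blind spot will pass this gate together.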
That is the marriage. Not "AI, then a human to check it." Human and machine assembled as a measurement system, with the reliability engineered in — which beats either one alone and, more to the point, beats the single-rater status quo that most organizations are still running on, every day, to decide who gets hired and who gets promoted.
Everything is a measurement now
Step back far enough and the screen you are reading this on is covered in instruments we never calibrated. Every AI-mediated decision — the résumé it ranked, the comment it flagged, the summary it handed your boss, the score it assigned a person — is a measurement, taken by a rater whose reliability nobody computed and whose validity nobody checked. We deployed the instruments first and asked what they measure second, if at all.
The century of work on trusting noisy human judgment is not a historical footnote to that problem. It is the manual for it. The people who measured intelligence and attitudes and pain and essays already lived through the disagreement, already built the math, already learned the hard lesson that agreement isn't truth and the single rater is a trap. We did not need to reinvent it badly. We needed to read it.
Thirty-two or sixteen. The answer was never one of the numbers. The answer is that you do not trust a single reading from a single reader — and that this was settled long before the reader was a machine.
This is the flagship essay of an ongoing program — Measurement Meets AI — arguing that a century of psychometric reliability theory is the under-used answer to AI's noisy-rater problem. The four-pass extraction figures above are from our own pipeline and are reported in full, including the uncomfortable parts. Companion work in progress: the encyclopedia entries on the measurement canon, and the Consensus Coder reliability-scored extraction brief.
Footnotes
1. The LLM-ensemble / self-consistency / majority-vote literature, 2024–2025 (e.g., Mackay 2025 on voting and yield; Niimi 2025, which notes that human annotation itself uses majority voting). Filed program insight cards, sourced via Consensus, 2026-06.
2. Charles Spearman, "The Proof and Measurement of Association between Two Things," American Journal of Psychology, 1904: the origin of classical test theory's true-score-plus-error decomposition and the first reliability correction.
3. Jacob Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 1960 (chance-corrected agreement, κ); Joseph Fleiss (1971) generalized chance-corrected agreement to any number of raters.
4. Klaus Krippendorff's α: a single chance-corrected reliability coefficient defined for any number of coders, any level of measurement, and incomplete data (see Krippendorff, Content Analysis, and the methodological literature thereon).
5. Lee J. Cronbach, "Coefficient Alpha and the Internal Structure of Tests," Psychometrika, 1951; Cronbach, Gleser, Nanda & Rajaratnam, The Dependability of Behavioral Measurements (generalizability theory), 1972: decomposition of measurement error into its sources, with decision (D-) studies for choosing a design that reaches a target reliability.
6. Recent model-evaluation papers reporting psychometric agreement statistics directly, e.g., Young 2025 (Fleiss's κ and ICC across models) and Ntinopoulos 2025 (Krippendorff's α across repeated runs of a model). Filed program insight cards, sourced via Consensus, 2026-06.
7. On the unreliability of single human raters of performance: Scullen, Mount & Goff (2000) decompose performance ratings and find idiosyncratic rater effects to be the single largest variance component, exceeding ratee performance; Viswesvaran, Ones & Schmidt (1996) estimate single-rater interrater reliability of job-performance ratings at roughly .52; Murphy & DeShon (2000) argue rater variance is systematic rather than random error. Multi-rater pooling and diverse source panels raise accuracy (Borman 1978; Hoffman et al. 2010).
8. The attenuation relationship from classical test theory: a measure's correlation with any criterion is bounded above by the square root of its reliability (Lord & Novick, Statistical Theories of Mental Test Scores, 1968; Carmines & Zeller, Reliability and Validity Assessment, 1979). The behavior-prediction corollary is Bohrnstedt's: an attitude you cannot reliably measure cannot be used to predict behavior.
9. A 2025 clinical-extraction study (Courvoisier et al.) in which an agreement-thresholded panel of model raters (answering only on agreement and withholding on disagreement) exceeded the human gold standard on roughly 85%+ of cases. Reported in a program insight card; presented as one supportive result, not a general claim. The general comparison of AI-to-human rater reliability is the subject of a preregistered study, not this essay.