peopleanalyst

Research substrate

Insight Cards

Atomic quantitative findings from the research underlying the magazine and the AI Human Interaction Guide. Each card carries a single headline finding, full source attribution, methodology, and framing claims. Cards cite into longer editorial work by ID.

analytics · Q7 · verified

Scullen, Mount & Goff 2000 (J. Applied Psychology) — idiosyncratic rater effects are the largest single source of variance in performance ratings

Decomposing managerial performance ratings into five postulated sources, idiosyncratic rater effects accounted for 62% and 53% of the rating variance across two large data sets — over half — while the ratee's actual performance (general + dimensional) accounted for only 21% and 25%. The single largest thing a performance rating measures is the rater, not the ratee.

Share of performance-rating variance attributable to idiosyncratic rater effects vs. ratee performance vs. random error
Idiosyncratic rater effects: 62% and 53% (two data sets). General + dimensional ratee performance: 21% and 25%. Random measurement error: 11% and 18%. Small perspective-related (organizational level) effects in boss and subordinate ratings, none in peer ratings.
Sample
Two data sets of managers (n = 2,350 and n = 2,142), each rated on 3 performance dimensions by 7 raters (2 bosses, 2 peers, 2 subordinates, self)
Methodology
Confirmatory factor analysis decomposing developmental multisource ratings into five variance components: ratee general performance, ratee dimensional performance, idiosyncratic rater tendencies, rater organizational perspective, and random error.

What this means

  • This is the empirical core of the 'humans were never reliable single raters either' argument: when you ask where a performance rating actually comes from, the rater's idiosyncratic way of seeing dominates the ratee's actual performance by more than 2-to-1. The instrument measures itself.
  • It reframes the AI-reliability conversation. A noisy LLM rater is not a regression from a reliable human baseline; the human single-rater baseline was already saturated with rater variance. The disease is single-rater measurement, in humans and machines alike.
  • It is the quantitative warrant for the prescription the literature already wrote: pool diverse raters. If 53-62% of a single rating is rater idiosyncrasy, averaging across independent raters is not an efficiency tradeoff — it is the only way to recover the ratee signal.
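
The averaging argument can be made concrete with the Spearman-Brown prophecy formula. The formula is not cited on this card and is brought in here as an assumption. This is a minimal sketch that treats the ratee-performance share of a single rating (21% to 25%) as a stand-in for single-rater reliability and assumes raters are independent and interchangeable; `pooled_reliability` is a hypothetical helper name.

```python
def pooled_reliability(single_rater_reliability: float, k: int) -> float:
    """Spearman-Brown: reliability of the mean of k parallel, independent raters."""
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)


# Ratee-signal shares reported on this card, used as illustrative single-rater reliabilities.
for share in (0.21, 0.25):
    curve = [round(pooled_reliability(share, k), 2) for k in (1, 2, 4, 8)]
    print(share, curve)
```

Under these assumptions the ratee signal in the pooled score rises with every added rater, with shrinking marginal gains. Correlated raters would gain less than the formula suggests.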

Source

Understanding the latent structure of job performance ratings

Journal of Applied Psychology · Steven E. Scullen et al. · 2000 · peer-reviewed

Context

What came before
Performance ratings were widely treated as a workable proxy for performance, with rater variance relegated to 'measurement error' to be minimized rather than understood as the dominant signal. The hope of AI raters inherits the same unexamined premise: that the human rating was a trustworthy gold standard.
What comes next
Sets up the inter-rater reliability figure (single-supervisor reliability ≈ .52, Viswesvaran/Ones/Schmidt 1996) and the attenuation ceiling. Cross-link to the LLM-rater cards (Young 2025, Ntinopoulos 2025) — the AI raters disagree for the same structural reason — and to the multi-rater / G-theory D-study fix.
Where this lands
Magazine: 'The Reliability Problem' §'The wall everyone hits' (footnote [^8]). Encyclopedia Part I (single-rater unreliability of human judgment) and Part II (variance decomposition / generalizability theory). Book 1 Unreliable, the human-failure lead case.

analytics · Q7 · verified

Viswesvaran, Ones & Schmidt 1996 (J. Applied Psychology) — single-supervisor interrater reliability of overall job performance ≈ .52

A meta-analysis of job-performance rating reliabilities found the mean interrater reliability of supervisory ratings of overall job performance to be .52 — i.e., two supervisors rating the same employee agree at roughly one-half on a 0-to-1 reliability scale. Supervisory ratings were more reliable than peer ratings; interrater reliability was uniformly lower than intrarater reliability.

Mean interrater reliability of single-supervisor ratings of overall job performance
.52 (overall job performance, supervisory single-rater). Supervisory > peer reliability; interrater reliability < intrarater reliability throughout. Corroborated: Conway & Huffcutt (1997) ≈ .50; Rothstein (1990) ≈ .55; Shen et al. (2014) confirm .52 as the best estimate. Updated meta-analyses revise it upward (Zhou et al. 2024 = .65; Speer et al. 2023 = .65, direct-supervisor designs).
Sample
Meta-analysis aggregating job-performance rating reliability studies (Viswesvaran et al. 1996); corroborating meta-analyses span 22-224 independent samples and tens of thousands of ratees
Methodology
Psychometric meta-analysis of interrater and intrarater reliabilities across 10 performance dimensions plus overall job performance.

What this means

  • The canonical number for 'how reliable is one human rater of another human's performance' — about one-half. It is the empirical floor that the attenuation theorem then operates on: a measure at reliability .52 can correlate no higher than ~.72 with any real outcome, before bias enters.
  • It anchors the corrected thesis. AI raters that disagree are not falling short of a reliable human baseline; the single-human baseline was ≈ .52 to begin with. The honest comparison is AI-rater reliability beside this number, per task — not AI against an assumed-perfect human.
  • The live scholarly debate strengthens rather than weakens the program's point: Murphy & DeShon (2000) argue interrater correlations are not reliability at all because rater variance is systematic (not random error) — which is exactly Scullen et al.'s 53-62% idiosyncratic-rater finding, and exactly why generalizability theory (decompose the facets) is the right instrument rather than a single coefficient.
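
The attenuation ceiling quoted above is the square root of the rating's reliability. A minimal sketch of that arithmetic, assuming the classical test theory bound (observed correlation no greater than the square root of the product of the two reliabilities) and a perfectly reliable outcome by default; `validity_ceiling` is a hypothetical helper name.

```python
import math


def validity_ceiling(reliability_x: float, reliability_y: float = 1.0) -> float:
    """Upper bound on an observed correlation given the reliabilities of both measures."""
    return math.sqrt(reliability_x * reliability_y)


# .52 is the 1996 estimate; .65 is the upward revision cited on this card.
for rxx in (0.52, 0.65):
    print(rxx, round(validity_ceiling(rxx), 2))
```

Passing a `reliability_y` below 1.0 lowers the ceiling further, which is the usual situation when the outcome is itself measured with error.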

Source

Comparative analysis of the reliability of job performance ratings

Journal of Applied Psychology · Chockalingam Viswesvaran et al. · 1996 · peer-reviewed

Context

What came before
Performance ratings were corrected for attenuation using intrarater reliabilities (a single rater rating twice), which overstates reliability; this meta-analysis established interrater reliability as the conceptually correct, and much lower, estimate.
What comes next
Feeds the attenuation ceiling (√.52 ≈ .72) and the multi-rater fix. Note the upward revision in newer meta-analyses (~.65) and the Murphy-DeShon dispute over whether interrater correlations estimate reliability at all — both belong in the encyclopedia validity entry. Cross-link to Scullen 2000 (variance decomposition) and the LLM-rater cards.
Where this lands
Magazine: 'The Reliability Problem' §'The wall everyone hits' (footnote [^8], 'around one-half'). Encyclopedia Part I (single-rater unreliability) and Part II (reliability estimation, interrater vs intrarater). Book 1 Unreliable.

analytics · Q7 · to verify

Bertrand & Mullainathan 2004 (AER) — identical resumes, white-sounding names get 50% more callbacks

In a field experiment sending ~5,000 fictitious resumes to Boston and Chicago help-wanted ads, resumes were identical except that each was randomly assigned a very white-sounding or very African-American-sounding name. White names received 50% more callbacks for interviews. The single human screener's response varied systematically with a feature (the name) that has no relationship to the candidate's qualifications.

Differential interview-callback rate by randomly assigned race-signaling name on otherwise-identical resumes
White-sounding names received 50% more callbacks than African-American-sounding names. A higher-quality resume raised callbacks 30% for white names but produced a far smaller increase for African-American names. The gap was uniform across occupation, industry, and employer size; Equal-Opportunity-Employer and federal-contractor ads discriminated as much as others.
Sample
~5,000 fictitious resumes sent to help-wanted ads in Boston and Chicago
Methodology
Resume correspondence / audit field experiment with random assignment of race-signaling first names to otherwise-matched resumes; outcome = employer callback for interview.

What this means

  • This is the canonical demonstration that single-rater resume screening is not reliable as a measurement of candidate qualification: holding the resume's substance constant, the screener's decision moves with an irrelevant attribute (the name). The 'rater' is reacting to construct-irrelevant variance, exactly the failure mode psychometrics names.
  • Because the names were randomly assigned to identical applications, the 50% callback gap is causal evidence of bias in the human screening judgment itself, not a reflection of true differences between applicants — the cleanest possible separation of rater variance from ratee variance in the screening domain.
  • Later meta-analysis (Quillian et al. 2017, PNAS, 28 studies / 55,842 applications) shows the effect is durable: whites averaged 36% more callbacks than African Americans with no decline over 25 years — establishing the human-failure baseline against which AI resume-screeners must be compared.
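
The "50% more callbacks" figure is a relative gap, not a percentage-point gap. A minimal sketch of the distinction, using the approximate callback rates this card still flags for verification (about 9.7% and 6.5%), with the higher rate assigned to white-sounding names as the headline requires. Both rates are treated as assumptions until checked against the full text; `relative_callback_gap` is a hypothetical helper name.

```python
def relative_callback_gap(rate_high: float, rate_low: float) -> float:
    """Relative advantage of the higher callback rate over the lower one."""
    return rate_high / rate_low - 1


# Approximate, still-unverified rates from this card's verification note (assumption).
white_rate, african_american_rate = 0.097, 0.065
gap = relative_callback_gap(white_rate, african_american_rate)
points = (white_rate - african_american_rate) * 100
print(f"relative gap: {gap:.0%}, absolute gap: {points:.1f} percentage points")
```

With these rounded inputs the relative gap lands close to one-half while the absolute gap is only a few percentage points, so the two should not be quoted interchangeably.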

Source

Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination

American Economic Review (NBER working-paper version; AER 2004) · Marianne Bertrand & Sendhil Mullainathan · 2004 · peer-reviewed

Context

What came before
Discrimination in hiring had been studied via wage-gap regressions and survey self-report, both confounded by unobserved differences between real applicants. The resume-audit design removed that confound by randomizing the race signal onto identical applications.
What comes next
Establishes the human-screener failure baseline for the resume-screening case study. Sets up the AI-side question: do LLM/embedding resume screeners reproduce the same name-driven callback gap (Wilson et al. 2024; Armstrong et al. 2024)? Verify exact callback rates (~9.7% white vs ~6.5% African-American; the direction must match the 50% white advantage in the headline) and N=4,870 against full text before drafting.

analytics · Q7 · to verify

Conway, Jako & Goodman 1995 (JAP) — interview interrater reliability rises with standardization; validity ceiling .67 structured vs .34 unstructured

Meta-analyzing 111 interrater-reliability coefficients and 49 coefficient alphas from selection interviews, the authors found that interview reliability is moderated by standardization of questions, standardization of response evaluation, and how multiple ratings are combined. The estimated upper limit of validity was .67 for highly structured interviews versus .34 for unstructured interviews — roughly double — and mechanically combining multiple ratings helped while subjective combination did not.

Estimated upper-limit validity of the selection interview by structure, and moderators of interrater reliability
Upper limit of validity ≈ .67 for highly structured interviews vs ≈ .34 for unstructured interviews. Interrater reliability moderated by standardization of questions, standardization of response evaluation, and method of combining multiple ratings; mechanical combination of multiple ratings was useful, subjective combination showed no evidence of usefulness. Standardizing questions had a stronger effect for separate (vs panel) interviews.
Sample
111 interrater-reliability coefficients + 49 coefficient alphas from selection-interview studies
Methodology
Psychometric meta-analysis of interrater reliability and internal-consistency reliability; moderator analysis on study design, interviewer training, and three dimensions of interview structure.

What this means

  • Direct quantification of the human-failure baseline for the interview case: the unstructured interview — the default in most organizations — tops out at validity ≈ .34, and its weakness is traced to low reliability driven by un-standardized inputs and idiosyncratic rater judgment.
  • The cure is named explicitly and matches the essay's shared prescription: standardize the questions, standardize how responses are scored, train raters, and combine multiple ratings mechanically rather than letting raters blend impressions subjectively. Structure roughly doubles the validity ceiling (.34 to .67).
  • The finding that mechanical combination of multiple ratings helps but subjective combination does not is the multi-rater discipline in its precise form — averaging raters buys reliability only when the aggregation is rule-governed, not when a dominant rater overwrites the panel.
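
"Mechanical combination" means the panel score comes from a fixed rule rather than from a discussion that one rater can dominate. A minimal sketch of one such rule, an unweighted mean over questions and then over raters. The rule, the rater names, and the scores are illustrative assumptions, not the procedure used in the studies Conway et al. aggregated.

```python
from statistics import mean


def mechanical_composite(ratings: dict[str, list[float]]) -> float:
    """Combine panel ratings by a fixed rule: mean over questions, then mean over raters."""
    per_rater = [mean(scores) for scores in ratings.values()]
    return mean(per_rater)


# Invented scores for three raters on the same three standardized questions.
panel = {
    "rater_a": [4, 3, 5],
    "rater_b": [3, 3, 4],
    "rater_c": [5, 4, 4],
}
print(round(mechanical_composite(panel), 2))
```

Because every rater enters with the same weight, no single impression can overwrite the panel, which is the property the card attributes to rule-governed aggregation.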

Source

A meta-analysis of interrater and internal consistency reliability of selection interviews

Journal of Applied Psychology · James M. Conway et al. · 1995 · peer-reviewed

Context

What came before
The employment interview is the most widely used selection method and is intuitively trusted by hiring managers, yet early reviews (e.g., Mayfield 1964; Hunter & Hunter 1984 put interview validity near .14) flagged its low reliability and validity. The open question was what made some interviews work.
What comes next
Establishes the structured-vs-unstructured reliability/validity gap and the standardization/training/multi-rater fix that the AI-interview case must be measured against. Pairs with Huffcutt et al. 2013 (panel .74 vs separate .44) and Gardner et al. 2022 (ICC .50 to ~.69 after structure+training) as the human-side fix evidence. Verify exact coefficient counts and the .67/.34 ceilings against full text.

analytics · Q6 · to verify

Courvoisier et al. 2025 (Research Synthesis Methods) — an N-of-M LLM agreement rule beats the human gold standard on ≥85% of abstracts and withholds the rest

A multimodel framework for abstract classification and information extraction decides only when at least N of M LLMs agree, and otherwise withholds. Several combinations (e.g., 3 of 5) reached >95% accuracy and exceeded the human gold standard on at least 85% of abstracts; the cases where the models disagreed were precisely the hard ones flagged for human review.

Accuracy of N-of-M agreement-thresholded LLM combinations, share of abstracts on which the framework exceeds the human gold standard, and the withhold-on-disagreement routing of hard cases
Several N-of-M combinations (e.g., 3 of 5) achieved >95% accuracy and exceeded the human gold standard on ≥85% of abstracts. Disagreement cases (the remainder) are withheld and routed to humans.
Sample
M LLMs combined under N-of-M agreement rules on a corpus of abstracts (exact abstract N not yet extracted for verification)
Methodology
Agreement-based multimodel framework: a decision is emitted only when ≥N of M LLMs agree, otherwise the item is withheld for human review; accuracy benchmarked against — and exceeding — a human gold standard.

What this means

  • This is the purest statement of the thesis in the cluster: 'beyond human gold standards' is literally the claim that an aggregation rule over noisy LLM raters can outperform the single human-coded gold standard — the same result that motivated decades of work showing that the mean of several imperfect raters beats any one rater.
  • The withhold-on-disagreement design is selective prediction made operational: the framework abstains exactly where its raters disagree, and disagreement is shown to concentrate on the hard cases. Reliability theory's 'low-agreement items are the ambiguous items' becomes a triage mechanism for routing work to humans.
  • The N-of-M knob is the same accuracy-vs-yield dial seen in the Mackay echocardiography ensemble: tightening N raises accuracy on accepted items and shrinks coverage. Two independent clinical-science teams converge on the same reliability-coverage frontier.
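
The N-of-M rule described above can be sketched in a few lines. This is an illustration of the decision rule as the card states it, not the authors' implementation. The function name, the label strings, and the restriction to thresholds above half of M (so that only one label can qualify) are assumptions.

```python
from collections import Counter
from typing import Optional


def n_of_m_decision(labels: list[str], n: int) -> Optional[str]:
    """Return the label at least n of the M models agree on, else None (withhold)."""
    if not labels:
        return None
    if 2 * n <= len(labels):
        # Assumption: require a strict majority threshold so two labels cannot both qualify.
        raise ValueError("n must exceed half the number of models")
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= n else None


print(n_of_m_decision(["include", "include", "exclude", "include", "exclude"], 3))
print(n_of_m_decision(["a", "b", "c", "a", "b"], 3))
```

A returned `None` is the withheld case, which the framework routes to a human reviewer.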

Source

Context

What came before
The human-coded gold standard is treated as the ceiling that automated extraction aspires to match. Single-LLM pipelines are scored against it and assumed to be bounded by it; the idea that an LLM ensemble could exceed it was not the default framing.
What comes next
Verify the exact N-of-M combinations, the >95% / ≥85% figures, and the corpus size against the full article. Cross-link to the measurement-concept entries on aggregation-beats-single-rater, selective prediction / abstention, and item difficulty as the driver of low agreement.
Where this lands
Encyclopedia Part II (measurement — aggregation exceeding the single human gold standard; withhold-on-disagreement as triage) and Part V (research frontier — agreement-thresholded human-in-the-loop routing).

analytics · Q7 · to verify

Huffcutt, Culbertson & Weyhrauch 2013 — interview interrater reliability .74 (panel) vs .44 (separate interviewers)

Updating the meta-analytic estimates of employment-interview interrater reliability with 125 coefficients (total N = 32,428), the authors found mean interrater reliability of .74 for panel interviews versus .44 for separate interviews conducted by different interviewers — and showed that credible estimates require accounting for all three sources of measurement error (random response, transient, and conspect/rater).

Mean interrater reliability of employment interviews by format (panel vs separate interviewers)
Mean interrater reliability ≈ .74 for panel interviews vs ≈ .44 for separate interviews by different interviewers. Estimates depend on modeling all three sources of measurement error (random response, transient, conspect); highly structured interviews conducted separately showed lower-than-expected reliability.
Sample
125 interrater-reliability coefficients; total sample size 32,428
Methodology
Psychometric meta-analysis of interrater reliability partitioned by interview structure and format, decomposing random-response, transient, and conspect (rater) error sources.

What this means

  • Quantifies the multi-rater fix in the interview domain: pooling raters into a panel raises interrater reliability from ≈ .44 (a single separate interviewer) to ≈ .74 — the same averaging-buys-reliability result seen in performance rating and in LLM ensembles, restated for interviews.
  • A single interviewer's judgment (.44) is a strikingly unreliable instrument, reinforcing that the disease is single-rater measurement; the panel is not bureaucratic overhead but the mechanism that makes the interview a defensible measurement.
  • The three-source error decomposition (random-response, transient, conspect) is generalizability-theory machinery applied to interviews: most reliability over-claims come from estimates that ignore transient and rater-specific (conspect) error, exactly the systematic-rather-than-random rater variance the essay foregrounds.

Source

Employment Interview Reliability: New Meta-Analytic Estimates by Structure and Format

International Journal of Selection and Assessment · Allen I. Huffcutt et al. · 2013 · peer-reviewed

Context

What came before
Earlier interview-reliability estimates often ignored transient and conspect error, inflating apparent reliability. Conway, Jako & Goodman 1995 had established structure as a reliability moderator; this study updated the magnitudes and isolated the panel-vs-separate gap.
What comes next
Supplies the precise multi-rater coefficients (.74 panel vs .44 separate) for the interview case study's fix section, alongside Conway 1995 (validity ceilings) and Gardner 2022 (ICC gain from structure + training). Sets the human reliability bar against which AI/async-video interview reliability should be measured. Verify the .74/.44 split and N=32,428 against full text.

analytics · Q5 · to verify

Mackay et al. 2025 (BJA) — 5-LLM consensus ensembles trade accuracy for yield across four voting strategies (unanimous→plurality)

In automated structured-data extraction from intraoperative echocardiography reports, a 5-LLM ensemble was scored under four voting strategies from strictest (unanimous) to loosest (plurality). The unanimous ensemble reached 99.4% consensus accuracy but accepted only ~81% of cases (the rest fell below the agreement threshold); the plurality strategy delivered the highest raw accuracy (96.1%) and highest yield (99.4%) but admitted more errors. The voting rule is an explicit, tunable accuracy-vs-yield dial.

Consensus accuracy and yield (% of cases reaching the agreement threshold) of a 5-LLM ensemble under four voting strategies, from unanimous to plurality
Unanimous ensemble: 99.4% consensus accuracy at ~81% yield. Plurality ensemble: 96.1% raw accuracy (highest) at 99.4% yield (highest), with higher error than the unanimous rule. Intermediate strategies fall between, tracing an accuracy-vs-yield frontier.
Sample
5 LLMs scored across four voting strategies on intraoperative echocardiography reports (exact report N not yet extracted for verification)
Methodology
Consensus-based multi-LLM ensemble for structured data extraction; four voting strategies (unanimous, then progressively looser, down to plurality) evaluated on the accuracy-vs-yield tradeoff.

What this means

  • This is a withhold-on-disagreement design rendered as a tunable knob: the stricter the agreement rule among raters, the higher the accuracy on accepted items and the more items get withheld. That is exactly the classic psychometric move of trading coverage for reliability — here the 'raters' are LLMs and the dial is the voting threshold rather than an item-discrimination cutoff.
  • The unanimous-vs-plurality span (99.4% accuracy / ~81% yield vs 96.1% accuracy / 99.4% yield) quantifies a frontier that reliability theory predicts and that a century of inter-rater-agreement work already knows how to characterize. The 'noisy LLM rater' problem is the noisy-human-rater problem with a new substrate.
  • Designers do not have to pick one operating point: the abstained cases (the ~19% the unanimous rule withholds) are precisely the hard cases an aggregation rule should route to a human or a stronger model — the disagreement signal is itself diagnostic.
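
The accuracy-vs-yield dial can be traced by scoring the same votes under different agreement thresholds. A minimal sketch on invented toy data (four items, five votes each). It does not reproduce the paper's four named strategies or its figures, and `accuracy_and_yield` is a hypothetical helper name.

```python
from collections import Counter


def accuracy_and_yield(votes, truth, min_agree):
    """Score an ensemble: items reaching min_agree votes are accepted, others withheld."""
    accepted = correct = 0
    for item_votes, gold in zip(votes, truth):
        label, count = Counter(item_votes).most_common(1)[0]
        if count >= min_agree:
            accepted += 1
            correct += label == gold
    yield_share = accepted / len(truth)
    accuracy = correct / accepted if accepted else float("nan")
    return accuracy, yield_share


# Invented votes from five hypothetical models on four items.
votes = [
    ["A", "A", "A", "A", "A"],
    ["A", "A", "A", "A", "B"],
    ["A", "A", "A", "B", "B"],  # the majority is wrong on this item
    ["B", "B", "B", "B", "B"],
]
truth = ["A", "A", "B", "B"]

for min_agree in (5, 4, 3):
    print(min_agree, accuracy_and_yield(votes, truth, min_agree))
```

In this toy set the strictest rule accepts half the items with no errors, while the loosest accepts everything and admits the one contested item where the majority is wrong.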

Source

Context

What came before
Single-LLM extraction is evaluated as one rater against a gold standard, with accuracy reported as a point estimate and no native mechanism to express 'this case is contested.' The conventional framing treats LLM error as a fixed property of the model rather than a controllable function of the aggregation rule.
What comes next
Verify exact report N, the four named voting strategies, and the per-strategy accuracy/yield pairs against the full text. Cross-link to the measurement-concept entries on inter-rater agreement and coverage-vs-reliability tradeoffs, where this paper is a clean modern instance.
Where this lands
Encyclopedia Part II (measurement — voting-threshold as an accuracy-vs-yield dial; the modern LLM analog of coverage-vs-reliability) and Part V (research frontier — consensus ensembles as the operational form of withhold-on-disagreement).

analytics · Q5 · to verify

Naik 2024 (arXiv) — model-consensus framework lifts precision 73%→96% while keeping enough independence to catch errors via disagreement

A probabilistic-consensus framework for LLM reliability improved extraction precision from 73.1% with a single model to 93.9% with two models and 95.6% with three. Inter-model agreement was κ > 0.76 — high enough to consense, but low enough that the models retained sufficient independence for their disagreements to surface errors.

Precision as a function of number of consensing models (1→3), and inter-model agreement (Cohen/Fleiss κ) of the ensemble
Precision: 73.1% (1 model) → 93.9% (2 models) → 95.6% (3 models). Inter-model agreement κ > 0.76, retaining enough independence that disagreements flag errors.
Sample
Ensemble-validation experiments across model counts of 1–3 (exact item N not yet extracted for verification)
Methodology
Probabilistic-consensus / ensemble-validation framework; precision measured as models are added; inter-model agreement quantified via κ; analysis of the independence-vs-agreement balance.

What this means

  • Names the central reliability tension explicitly: raters must agree enough to be aggregable, but not so much that they are redundant — perfectly correlated raters add no information and cannot catch each other's errors. This is the classic 'effective number of independent raters' point from generalizability theory, restated for LLMs.
  • The precision curve (73%→94%→96% as raters go 1→2→3) is a Spearman–Brown-shaped diminishing-returns climb: each added independent rater lifts reliability, with the marginal gain shrinking. A century-old prediction reproduced on a 2024 LLM stack.
  • κ > 0.76 as the operating band is the load-bearing detail for the thesis: the framework deliberately does not maximize agreement, because the residual disagreement is the error-detection channel. Reliability theory has always distinguished agreement from validity; here disagreement is harnessed as a diagnostic.
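
For readers who want to see what a κ value summarizes, here is a minimal sketch of Cohen's kappa for two raters. The card does not say whether the preprint used Cohen's or Fleiss' variant, so the two-rater form is an assumption, and the label lists are invented.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels from two hypothetical models on ten items.
model_1 = ["y", "y", "y", "n", "n", "n", "y", "n", "y", "n"]
model_2 = ["y", "y", "n", "n", "n", "n", "y", "n", "y", "y"]
print(round(cohens_kappa(model_1, model_2), 2))
```

The two items on which the models split are the ones a consensus framework would surface for review. Note that kappa is undefined (division by zero) when both raters use a single identical label throughout.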

Source

Context

What came before
Single-model LLM outputs are accepted or rejected against a gold standard with no native confidence-from-agreement signal. Ensembling was often framed purely as an accuracy booster, not as a reliability framework with an explicit independence requirement.
What comes next
Verify the exact precision values, the task/dataset, and the κ computation against the full preprint. Cross-link to the measurement-concept entries on generalizability theory, effective number of raters, and the agreement-vs-independence tradeoff.
Where this lands
Encyclopedia Part II (measurement — diminishing-returns reliability gain from added raters; agreement-without-redundancy) and Part V (research frontier — disagreement as an error-detection channel).

analytics · Q5 · to verify

Niimi 2025 — ensembling repeated medium-LLM inferences (majority-vote style) cuts RMSE 18.6% vs a single large-model attempt

Drawing the explicit analogy to human annotation — where majority voting resolves coder disagreements — Niimi shows that ensembling multiple inferences of a medium-sized LLM reduced text-classification RMSE by 18.6% relative to a single attempt by a larger model. Aggregating several cheap, noisy reads outperformed one expensive read.

RMSE reduction from ensembling repeated medium-LLM inferences (majority-vote-style aggregation) vs a single large-model inference, in text classification
18.6% RMSE reduction for the ensemble of repeated medium-model inferences vs a single large-model attempt.
Sample
Text-classification task with repeated inferences of a medium LLM aggregated and compared to a single large-model run (exact item N not yet extracted for verification)
Methodology
Simple ensemble strategy: multiple LLM inferences aggregated (analogous to majority voting across human annotators); RMSE compared against a single large-model inference.

What this means

  • The paper makes the bridge to the thesis explicit by name: human annotation resolves disagreement by majority vote, and the same procedure stabilizes a noisy LLM rater. The 'new' technique is the oldest reliability fix there is — average more raters.
  • An 18.6% RMSE cut from aggregating repeated reads of a smaller model, beating one read of a bigger model, is the measurement argument against scale-as-the-only-lever: reliability gained through replication can dominate capability gained through size, exactly as the multi-rater-averaging math predicts.
  • Repeated inferences of one model are the intra-rater (test–retest) version of ensembling; averaging them lowers variance the same way averaging several reads by one human coder would. It complements the inter-model κ work in this cluster: both are reliability-through-aggregation, one within a rater and one across raters.
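
The variance-reduction mechanism can be shown with a small simulation. This is a sketch under strong assumptions: every read is the true score plus independent Gaussian noise, and the comparison is one read against the mean of k reads of the same model. It does not model the medium-vs-large comparison in the paper, and the noise level, item count, and seed are arbitrary.

```python
import math
import random


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def simulate(k: int, noise_sd: float = 1.0, items: int = 5000, seed: int = 0):
    """RMSE of one noisy read vs the mean of k independent noisy reads (true score = 0)."""
    rng = random.Random(seed)
    single, pooled = [], []
    for _ in range(items):
        reads = [rng.gauss(0.0, noise_sd) for _ in range(k)]
        single.append(reads[0])          # error of a single inference
        pooled.append(sum(reads) / k)    # error of the averaged inferences
    return rmse(single), rmse(pooled)


single_rmse, pooled_rmse = simulate(k=5)
print(round(single_rmse, 2), round(pooled_rmse, 2))
```

With independent noise the pooled error shrinks roughly with the square root of k. Real repeated inferences share systematic error, so the gain in practice is smaller than this idealized case.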

Source

Context

What came before
The dominant framing for improving LLM classification quality is to use a larger or better model. Run-to-run variability is treated as a nuisance to suppress (lower temperature) rather than a signal to aggregate over.
What comes next
Verify the 18.6% figure, the task and dataset, the specific model pairing, and the exact aggregation rule against the preprint. Cross-link to the measurement-concept entries on majority voting, error-of-measurement reduction through replication, and reliability-vs-capability.
Where this lands
Encyclopedia Part II (measurement — majority voting / replication as the oldest reliability fix; reliability-through-aggregation vs capability-through-scale) and Part V (research frontier — intra-model ensembling for stability).

analytics · Q6 · to verify

Ntinopoulos et al. 2025 (BMJ HCI) — 18-LLM EHR-extraction benchmark reports multi-run consistency as Krippendorff's alpha up to 1.0

Evaluating 18 LLMs against a baseline for data extraction from electronic health records, top models exceeded 0.98 accuracy. The study reported intra-model multi-run consistency as Krippendorff's alpha — the same chance-corrected agreement coefficient used for human coders — reaching values up to 1.0, with Claude 3 Opus at alpha 0.996.

Extraction accuracy of top models and intra-model multi-run consistency (Krippendorff's alpha) across 18 LLMs
Top-model accuracy > 0.98. Multi-run consistency (Krippendorff's alpha) up to 1.0; Claude 3 Opus alpha 0.996.
Sample
18 LLMs evaluated vs a baseline on EHR data-extraction (exact record N not yet extracted for verification)
Methodology
Multiple-model performance evaluation; accuracy vs a baseline; intra-model consistency across repeated runs quantified with Krippendorff's alpha.

What this means

  • The reliability question here is intra-rater (test–retest) rather than inter-rater: does the same model give the same answer on repeated runs? The authors answer it with Krippendorff's alpha — a coefficient built for human coder agreement — making the point that LLM stochasticity is just rater inconsistency, and the discipline already has the instrument to measure it.
  • An alpha of 0.996 (Claude 3 Opus) is the LLM analog of a near-perfectly consistent human coder. Framing run-to-run variability as a measurable reliability coefficient, rather than an unquantified 'temperature' nuisance, is exactly the move the century-of-psychometrics thesis predicts.
  • High accuracy (>0.98) and high multi-run consistency (alpha up to 1.0) are reported as separate axes — the classic reliability-vs-validity distinction. A model can be perfectly consistent and still wrong; reporting both keeps that distinction visible instead of collapsing it into a single accuracy number.
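
A minimal sketch of Krippendorff's alpha for nominal labels, applied to repeated runs of one model. Each inner list holds the outputs of several runs on a single item. The nominal metric and the toy data are assumptions; the card does not state which variant of alpha the study computed.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels; each unit holds one item's repeated runs."""
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit with a single run carries no agreement information
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k) / (n - 1)
    if expected == 0:
        return float("nan")  # alpha is undefined when only one label ever appears
    return 1 - observed / expected


# Invented outputs: three runs of one hypothetical model on four items.
runs = [["a", "a", "a"], ["b", "b", "b"], ["a", "a", "b"], ["b", "b", "b"]]
print(round(krippendorff_alpha_nominal(runs), 3))
```

A single inconsistent run on one item is enough to pull alpha well below 1.0 in a set this small, which is why values near 1.0 over a large record set indicate very stable output.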

Source

Large language models for data extraction from electronic health records: a multiple model performance evaluation

BMJ Health & Care Informatics · Ntinopoulos et al. · 2025 · peer-reviewed

Context

What came before
LLM run-to-run variability is usually discussed informally as a function of sampling temperature, without a chance-corrected consistency coefficient. Benchmarks report accuracy but rarely report intra-model reliability as a named statistic.
What comes next
Verify the exact alpha values, the full 18-model table, and the record N against the published article. Cross-link to the measurement-concept entries on Krippendorff's alpha, test–retest reliability, and the reliability-vs-validity distinction.
Where this lands
Encyclopedia Part II (measurement — intra-model multi-run consistency as test–retest reliability; alpha as the chosen coefficient) and Part V (research frontier — separating consistency from accuracy in LLM evaluation).

analytics · Q6 · to verify

Wilson et al. 2024 — embedding-model resume screeners replicate name-based bias, favoring white-associated names in 85% of cases

Running a resume-audit study through a document-retrieval framework that simulates candidate selection, the authors tested Massive Text Embedding (MTE) models on 500+ resumes against 500+ job descriptions across nine occupations. The models significantly favored White-associated names in 85.1% of cases and female-associated names in only 11.1% of cases; Black males were disadvantaged in up to 100% of cases — replicating the human resume-audit pattern in the AI screener.

Share of resume-screening cases in which the embedding model favored a protected-group-associated name, by group
White-associated names favored in 85.1% of cases; female-associated names favored in only 11.1% of cases; Black males disadvantaged in up to 100% of cases. Document length and corpus frequency of names also affected selection.
Sample
500+ publicly available resumes x 500+ job descriptions across 9 occupations; selection of Massive Text Embedding (MTE) models
Methodology
Document-retrieval framework simulating candidate selection; resume-audit design (names varied by race/gender) ported to LLM-embedding retrieval; statistical comparison of selection rates across protected groups, testing three intersectionality hypotheses.

What this means

  • The AI screener walks into the same wall: the disease is single-rater judgment of construct-irrelevant signals, and swapping a human screener for an embedding model does not cure it. The name-driven bias reappears, here favoring White-associated names in 85.1% of cases.
  • The study is methodologically the AI analogue of Bertrand & Mullainathan: the same audit design (randomized race/gender name signals on otherwise-comparable applications) applied to the new substrate, which is precisely why the findings are directly comparable to the human baseline.
  • Intersectional structure persists (Black males disadvantaged up to 100% of cases), and the bias couples to surface features the model is sensitive to (document length, name corpus frequency) — evidence that the model is scoring text statistics, not the underlying construct of candidate fit.

Source

Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval

arXiv · Kyra Wilson & Aylin Caliskan · 2024 · peer-reviewed

Context

What came before
AI resume-screening tools are marketed as more objective than human reviewers. The human resume-audit literature (Bertrand & Mullainathan 2004; Quillian et al. 2017) established that human screeners exhibit large name-driven callback bias.
What comes next
Pairs directly with the Bertrand & Mullainathan card as the human-vs-AI comparison for the resume-screening case study. The shared fix is the same as the human case: standardize/anonymize inputs, validate selection criteria against an outcome (criterion validity), and audit for adverse impact. Verify exact model list and per-occupation breakdown against full text; note this is a preprint at capture time.
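The "audit for adverse impact" step can be sketched with the four-fifths rule of thumb, which compares each group's selection rate to the highest group's rate and flags ratios below 0.8. This is a minimal illustration under stated assumptions: the group names and counts are hypothetical, the function names are invented here, and the calculation is not the statistical test used by Wilson & Caliskan.

```python
def selection_rates(outcomes):
    """outcomes maps group name -> (number selected, number of applicants)."""
    return {group: sel / total for group, (sel, total) in outcomes.items()}


def impact_ratios(outcomes):
    """Each group's selection rate divided by the highest group's rate."""
    rates = selection_rates(outcomes)
    top = max(rates.values())
    if top == 0:
        raise ValueError("no group has any selections; ratios are undefined")
    return {group: rate / top for group, rate in rates.items()}


def flag_adverse_impact(outcomes, threshold=0.8):
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = impact_ratios(outcomes)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical counts, not data from the study.
outcomes = {"group_a": (40, 100), "group_b": (25, 100)}

ratios = impact_ratios(outcomes)
flagged = flag_adverse_impact(outcomes)
print(ratios, flagged)
```

The threshold is a screening heuristic rather than a significance test, so a flagged ratio is a prompt for closer statistical review, not a finding in itself.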
analytics · Q6 · to verify

Young et al. 2025 (Algorithms) — 5-LLM ensemble reports inter-model Fleiss κ and ICC as native reliability coefficients for clinical-trial extraction

Benchmarking five LLMs on automated clinical-trial data extraction in aging research, the authors reported inter-model agreement as classical reliability coefficients: Fleiss κ ≈ 0.92 on binary fields, κ ≈ 0.71 on categorical fields, and ICC 0.95–0.96 on numeric fields when reported. Ensemble consensus resolved model disagreements to κ ≈ 0.94, and the pipeline roughly doubled the trial yield versus keyword search.

Inter-model agreement among 5 LLMs by field type (Fleiss κ for binary and categorical; ICC for numeric), post-consensus agreement, and trial-yield gain vs keyword search
Fleiss κ ≈ 0.92 (binary fields); κ ≈ 0.71 (categorical fields); ICC 0.95–0.96 (numeric fields when reported). Ensemble consensus resolved disagreements to κ ≈ 0.94. Trial yield roughly doubled vs keyword search.
Sample
5 LLMs benchmarked on clinical-trial records in aging research (exact trial/record N not yet extracted for verification)
Methodology
Multi-LLM benchmark with inter-model agreement quantified via Fleiss κ (categorical/binary) and intraclass correlation (numeric); ensemble-consensus resolution of disagreements; yield comparison against keyword search.
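As a sketch of the agreement statistic named above, the following computes Fleiss' κ from scratch for a small set of hypothetical labels and takes a simple majority vote as the consensus. It is illustrative only: the data are invented, the function names are assumptions, and majority voting is a stand-in, since this card does not record how the authors' ensemble-consensus step works.

```python
from collections import Counter


def fleiss_kappa(labels_per_item):
    """Fleiss' kappa for items that each received the same number of labels.

    labels_per_item: list of lists; labels_per_item[i] holds one label per
    rater (here, one per model) for item i.
    """
    n_items = len(labels_per_item)
    n_raters = len(labels_per_item[0])
    if n_raters < 2 or any(len(l) != n_raters for l in labels_per_item):
        raise ValueError("every item needs the same number (>= 2) of labels")

    category_totals = Counter()
    p_bar = 0.0
    for labels in labels_per_item:
        counts = Counter(labels)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= n_items

    # Chance agreement from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)


def majority_vote(labels):
    """Most common label for one item (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical binary field extracted by five models from four trial records.
extractions = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

kappa = fleiss_kappa(extractions)
consensus = [majority_vote(labels) for labels in extractions]
print(round(kappa, 3), consensus)
```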

What this means

  • This is the thesis in miniature: the authors did not invent a new 'LLM reliability metric' — they reached for Fleiss κ and ICC, the same coefficients psychometrics has used for human raters for decades. The noisy-rater problem and its measurement vocabulary are imported wholesale.
  • The field-type gradient (κ ≈ 0.92 binary, κ ≈ 0.71 categorical, ICC ≈ 0.95–0.96 numeric) mirrors the long-known human-rater pattern that agreement is highest on low-ambiguity item formats and degrades as the response space gets richer and more interpretive. The substrate changed; the structure of disagreement did not.
  • Post-consensus κ ≈ 0.94 shows aggregation buying reliability — the ensemble's agreed answer is more reliable than any single model's, which is the multi-rater-averaging result (Spearman–Brown intuition) restated for LLMs.
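The Spearman–Brown intuition in the last bullet can be written out. The formula predicts the reliability of an average of k parallel raters from the single-rater reliability r as r_k = k·r / (1 + (k − 1)·r). The sketch below is a hedged illustration: it assumes parallel, equally reliable raters, uses hypothetical input values, and is not a calculation from Young et al. Because κ is an agreement coefficient rather than a reliability of averaged scores, the mapping to this formula is an analogy only.

```python
def spearman_brown(r_single, k):
    """Predicted reliability of the mean of k parallel raters.

    r_single: reliability of one rater (hypothetical input).
    k: number of raters whose judgments are averaged.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


# Hypothetical single-rater reliability of 0.6, averaged over 1, 3, and 5 raters.
for k in (1, 3, 5):
    print(k, round(spearman_brown(0.6, k), 3))
```

The gain from each additional rater shrinks as k grows, which is why small ensembles capture most of the available improvement.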

Source

Context

What came before
LLM-extraction benchmarks typically report accuracy against a single gold standard and treat between-model variation as noise to be averaged away rather than as a measurable reliability quantity. Reliability coefficients were rarely reported as first-class results.
What comes next
Verify the exact κ/ICC values, field counts, and the keyword-search baseline against the full text. Cross-link to the measurement-concept entries on Fleiss κ, ICC, and inter-rater reliability, where this is a clean LLM-substrate instance.
Where this lands
Encyclopedia Part II (measurement — LLM inter-rater agreement reported in classical coefficients; the field-type agreement gradient) and Part V (research frontier — consensus resolution as reliability gain).
analytics · Q6 · to verify

Zhang et al. 2024 (IEEE TAC) — GPT-3.5/GPT-4 rate async video interviews with insufficient test-retest reliability and emergent bias

Evaluating GPT-3.5 and GPT-4 as raters of personality and interview performance from asynchronous video interviews (simulated AVI responses of 685 participants), the LLMs achieved validity comparable to or better than a task-specific AI model for some traits, but suffered from uneven performance across traits, insufficient test-retest reliability, and emergent biases — leading the authors to urge caution before using LLMs for employment decisions.

Validity, test-retest reliability, and fairness of GPT-3.5/GPT-4 as AVI raters vs a task-specific AI model and human annotators
LLMs reached similar or better zero-shot validity than a task-specific AI model on some personality traits, but exhibited uneven performance across traits, insufficient test-retest reliability, and certain emergent biases. (Specific reliability/fairness coefficients reported in the paper have not yet been extracted for verification.)
Sample
Simulated AVI responses of 685 participants; raters = GPT-3.5 and GPT-4, compared against a task-specific AI model and human annotators
Methodology
Comprehensive psychometric evaluation (validity, reliability, fairness, rating patterns) of two LLMs as zero-shot raters of personality and interview performance from asynchronous video interviews, benchmarked against a task-specific model and human ratings.

What this means

  • The AI interviewer walks into the same wall the human interviewer did: insufficient test-retest reliability means the LLM rater gives different scores to the same response on different occasions — the single-rater instability problem, now in silicon. Swapping the human for an LLM did not deliver the hoped-for objectivity.
  • The authors evaluate the LLM with the classic psychometric quartet — validity, reliability, fairness, rating patterns — the same vocabulary the human-interview literature (Conway 1995; Huffcutt 2013) built. The measurement frame is imported wholesale; the substrate changed, the standards did not.
  • Comparable validity but unstable reliability is exactly the essay's open question rendered concrete: the LLM is not error-free and not obviously better than humans; the honest finding is how close to the human failure mode it lands. The implied fix is the same — standardize prompts/scoring, average multiple passes/raters, validate against an outcome.
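To make the test-retest idea concrete, here is a minimal sketch that correlates scores from two scoring passes over the same responses and then averages the passes. The scores are invented, the function names are hypothetical, and Pearson correlation is one common choice of test-retest coefficient; this card does not record which coefficient Zhang et al. used.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)


def average_passes(passes):
    """Per-response mean score across several scoring passes."""
    return [sum(scores) / len(scores) for scores in zip(*passes)]


# Hypothetical ratings of the same five responses on two occasions.
pass_1 = [3.0, 4.5, 2.0, 5.0, 3.5]
pass_2 = [3.5, 3.0, 2.5, 4.5, 4.0]

test_retest_r = pearson(pass_1, pass_2)
pooled = average_passes([pass_1, pass_2])
print(round(test_retest_r, 3), pooled)
```

A low correlation between passes signals that a single pass is an unstable basis for a decision; averaging passes is the same remedy the bullet above proposes for human raters.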

Source

Context

What came before
Automated video interviews (AVIs) are marketed as faster and more objective than human interviews. Machine-learning AVI research (Hickman et al. 2021; Koutsoumpis et al. 2024) had already found test-retest reliability below desired personnel-selection standards. This study extended the question to general-purpose LLMs (GPT-3.5/GPT-4) as raters.
What comes next
Anchors the AI side of the interview case study, set beside the human reliability baseline (single interviewer ≈ .44; structured/panel/trained pushes higher). Corroborated by Hickman et al. 2021 (AVI personality, mixed reliability) and Koutsoumpis et al. 2024 (test-retest below selection standards). Extract the exact reliability/fairness coefficients from full text before citing specific numbers.
analytics · Q5 · to verify

Pelikan & Broth 2016 (CHI) — humans adapt their turn designs when playing charades with a Nao humanoid robot

In a multimodal conversation-analytic study of participants playing a charade game with a Nao humanoid robot, humans systematically adjusted their turn designs in response to robot behavior — shortening turns, simplifying vocabulary, and adapting timing. The interactional achievement of 'the robot as an interlocutor' was transient, sustained or lapsing depending on what the robot did and how participants interpreted it.

Human turn-design adaptation patterns (length, vocabulary, timing) during charade gameplay with a Nao robot vs human-only baseline; characterizations of when robot is/isn't treated as an interlocutor
Consistent human turn-design adaptation across participants: shorter turns, simpler vocabulary, adjusted prosody and timing when addressing the Nao robot vs human co-participants (exact magnitudes and percentages not yet extracted for verification; this is primarily a qualitative CA study)
Sample
Charade-game sessions with participants and a Nao humanoid robot (participant and session counts not yet extracted for verification)
Methodology
Multimodal conversation analysis of recorded charade-game sessions; transcription at CA granularity (including pauses, overlap, gaze, gesture); sequential analysis of turn-design adaptation across rounds.

What this means

  • Foundational CA-of-HRI demonstration: humans adapt their turn designs to AI/robot interlocutors. This pattern recurs across subsequent CA-of-HAI work and has direct implications for model training — the conversational data the AI sees from users is already adapted.
  • The 'interactional achievement of agency as a transient phenomenon' framing is load-bearing for the AHI program: agency in HAI is not a designed-in property but is locally accomplished in interaction, and it can lapse. This is a measurement target the AHI program's multi-session data is well-positioned to capture.
  • Implication for AI evaluation: benchmark performance on user-curated test prompts (which are already adapted to AI's expected register) systematically overestimates real-deployment performance, because the deployed system sees user prompts that have been pre-adapted in ways the model's training data shaped.

Source

Why that Nao? How humans adapt to a conventional humanoid robot in taking turns-at-talk

Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (ACM) · Hannah R. M. Pelikan & Mathias Broth · 2016 · peer-reviewed

Context

What came before
Pre-2016 HRI work focused on robot-side capabilities (perception, recognition, synthesis). Pelikan & Broth shift the analytical focus to the human side: what humans do to make the interaction work, and how this differs systematically from human-human interaction.
What comes next
Verify session counts, participant N, and the specific CA-coded adaptation categories. Connect to Albert et al.'s voice-assistant repair work and to the broader CA-of-HAI literature where humans-adapt-to-AI is now a stable finding.
Where this lands
Encyclopedia Part II (workforce — implications for measurement: any benchmark using user-collected prompts inherits adapted-input bias), Part V (research frontier — CA-of-HAI as a methodological resource the mainstream HAI evaluation tradition has not absorbed).