analytics · Q6 · to verify
Young et al. 2025 (Algorithms) — 5-LLM ensemble reports inter-model Fleiss κ and ICC as native reliability coefficients for clinical-trial extraction
Benchmarking five LLMs on automated clinical-trial data extraction in aging research, the authors reported inter-model agreement as classical reliability coefficients: Fleiss κ ≈ 0.92 on binary fields, κ ≈ 0.71 on categorical fields, and ICC 0.95–0.96 on numeric fields when reported. Ensemble consensus resolved model disagreements to κ ≈ 0.94, and the pipeline roughly doubled the trial yield versus keyword search.
Inter-model agreement among 5 LLMs by field type (Fleiss κ for binary and categorical; ICC for numeric), post-consensus agreement, and trial-yield gain vs keyword search
Fleiss κ ≈ 0.92 (binary fields); κ ≈ 0.71 (categorical fields); ICC 0.95–0.96 (numeric fields when reported). Ensemble consensus resolved disagreements to κ ≈ 0.94. Trial yield roughly doubled vs keyword search.
- Sample
- 5 LLMs benchmarked on clinical-trial records in aging research (exact trial/record N not yet extracted for verification)
- Methodology
- Multi-LLM benchmark with inter-model agreement quantified via Fleiss κ (categorical/binary) and intraclass correlation (numeric); ensemble-consensus resolution of disagreements; yield comparison against keyword search.
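The agreement and consensus steps in the methodology can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the toy ratings, and the choice of simple majority vote as the consensus rule are all assumptions made here for clarity. The kappa computation follows the standard Fleiss formula (mean per-item agreement corrected for chance agreement from pooled category proportions).

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(n * n for n in c.values()) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Chance agreement: sum of squared pooled category proportions.
    total = n_items * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


def majority_consensus(item):
    """Most frequent label for one item; ties go to the label seen first."""
    return Counter(item).most_common(1)[0][0]


# Hypothetical toy data: 4 records, 5 models, one binary field (not from the paper).
toy_binary = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
]

kappa = fleiss_kappa(toy_binary)
consensus = [majority_consensus(item) for item in toy_binary]
```

With five raters and a binary field a majority always exists, so the tie rule only matters for categorical fields with three or more options. Numeric fields would need an intraclass correlation instead, which is not sketched here.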
What this means
- This is the thesis in miniature: the authors did not invent a new 'LLM reliability metric' — they reached for Fleiss κ and ICC, the same coefficients psychometrics has used for human raters for decades. The noisy-rater problem and its measurement vocabulary are imported wholesale.
- The field-type gradient on the κ scale (κ ≈ 0.92 binary, κ ≈ 0.71 categorical) mirrors the long-known human-rater pattern that agreement is highest on low-ambiguity item formats and degrades as the response space gets richer and more interpretive. The numeric-field ICC (≈ 0.95–0.96) is a different coefficient, so it should not be ranked directly against the κ values. The substrate changed; the structure of disagreement did not.
- Post-consensus κ ≈ 0.94 shows aggregation buying reliability: the ensemble's agreed answer is expected to be more reliable than a single model's, which is the multi-rater-averaging result (Spearman–Brown intuition) restated for LLMs.
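The Spearman–Brown intuition can be made concrete with a small calculation. The sketch below uses the classical prophecy formula, which predicts the reliability of an average of k parallel raters from the reliability of one rater. It is only an analogy for this entry: the formula assumes averaged scores from interchangeable raters, whereas the paper reports consensus on categorical labels, and the input value 0.6 is a hypothetical number, not a figure from the study.

```python
def spearman_brown(single_rater_reliability, k):
    """Predicted reliability of the mean of k parallel raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    r = single_rater_reliability
    return k * r / (1.0 + (k - 1) * r)


# Hypothetical single-rater reliability; 5 raters to match the 5-LLM ensemble size.
single = 0.6
pooled = spearman_brown(single, 5)
```

For any single-rater reliability strictly between 0 and 1, the predicted value rises with k and approaches 1, which is the sense in which aggregation buys reliability.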
Source
Algorithms · Young et al. · 2025 · peer-reviewed
Context
- What came before
- LLM-extraction benchmarks typically report accuracy against a single gold standard and treat between-model variation as noise to be averaged away rather than as a measurable reliability quantity. Reliability coefficients were rarely reported as first-class results.
- What comes next
- Verify the exact κ/ICC values, field counts, and the keyword-search baseline against the full text. Cross-link to the measurement-concept entries on Fleiss κ, ICC, and inter-rater reliability, where this is a clean LLM-substrate instance.
- Where this lands
- Encyclopedia Part II (measurement — LLM inter-rater agreement reported in classical coefficients; the field-type agreement gradient) and Part V (research frontier — consensus resolution as reliability gain).