analytics · Q6 · to verify
Ntinopoulos et al. 2025 (BMJ HCI) — 18-LLM EHR-extraction benchmark reports multi-run consistency as Krippendorff's alpha up to 1.0
In an evaluation of 18 LLMs against a baseline for data extraction from electronic health records, top models exceeded 0.98 accuracy. The study reported intra-model multi-run consistency as Krippendorff's alpha, the same chance-corrected agreement coefficient used for human coders, with values reaching up to 1.0 and Claude 3 Opus at alpha 0.996.
Extraction accuracy of top models and intra-model multi-run consistency (Krippendorff's alpha) across 18 LLMs
Top-model accuracy > 0.98. Multi-run consistency (Krippendorff's alpha) up to 1.0; Claude 3 Opus alpha 0.996.
- Sample
- 18 LLMs evaluated against a baseline on EHR data extraction (exact record count N not yet verified against the article)
- Methodology
- Multiple-model performance evaluation; accuracy vs a baseline; intra-model consistency across repeated runs quantified with Krippendorff's alpha.
What this means
- The reliability question here is intra-rater (test–retest) rather than inter-rater: does the same model give the same answer on repeated runs? The authors answer it with Krippendorff's alpha — a coefficient built for human coder agreement — making the point that LLM stochasticity is just rater inconsistency, and the discipline already has the instrument to measure it.
- An alpha of 0.996 (Claude 3 Opus) is the LLM analog of a near-perfectly consistent human coder. Framing run-to-run variability as a measurable reliability coefficient, rather than an unquantified 'temperature' nuisance, is exactly the move the century-of-psychometrics thesis predicts.
- High accuracy (>0.98) and high multi-run consistency (alpha up to 1.0) are reported as separate axes — the classic reliability-vs-validity distinction. A model can be perfectly consistent and still wrong; reporting both keeps that distinction visible instead of collapsing it into a single accuracy number.
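The separation of consistency from accuracy described above can be made concrete with a small sketch. It assumes nominal-level labels (the entry does not record which metric level the study used), and everything in it is hypothetical: the function names `krippendorff_alpha_nominal` and `accuracy`, the two toy "models", and the gold labels are illustrations only, not the study's code or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per item, holding the label produced by each
    repeated run of the same model (None marks a missing run).
    """
    # Coincidence matrix: each ordered pair of labels from different
    # runs of the same item contributes 1 / (m - 1), m = runs present.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two runs cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (nominal: any mismatch counts as 1).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined: only one label value occurs")
    return 1.0 - observed / expected


def accuracy(units, gold):
    """Fraction of all (item, run) labels that match the gold label."""
    pairs = [(v, g) for labels, g in zip(units, gold) for v in labels if v is not None]
    return sum(v == g for v, g in pairs) / len(pairs)


# Hypothetical toy data: 4 items, 3 repeated runs each.
gold = ["yes", "no", "no", "no"]
model_a = [["yes"] * 3, ["no"] * 3, ["yes"] * 3, ["no"] * 3]  # stable, wrong on item 3
model_b = [["yes", "yes", "no"], ["no"] * 3, ["no", "no", "yes"], ["no"] * 3]  # unstable

alpha_a, acc_a = krippendorff_alpha_nominal(model_a), accuracy(model_a, gold)
alpha_b, acc_b = krippendorff_alpha_nominal(model_b), accuracy(model_b, gold)
print(f"model A: alpha={alpha_a:.3f} accuracy={acc_a:.3f}")
print(f"model B: alpha={alpha_b:.3f} accuracy={acc_b:.3f}")
```

On this toy data, model A reaches alpha 1.0 with accuracy 0.75, while model B has higher accuracy (about 0.83) but alpha of only about 0.19. Neither number predicts the other, which is why the two are worth reporting as separate axes.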
Source
BMJ Health & Care Informatics · Ntinopoulos et al. · 2025 · peer-reviewed
Context
- What came before
- LLM run-to-run variability is usually discussed informally as a function of sampling temperature, without a chance-corrected consistency coefficient. Benchmarks report accuracy but rarely report intra-model reliability as a named statistic.
- What comes next
- Verify the exact alpha values, the full 18-model table, and the record N against the published article. Cross-link to the measurement-concept entries on Krippendorff's alpha, test–retest reliability, and the reliability-vs-validity distinction.
- Where this lands
- Encyclopedia Part II (measurement — intra-model multi-run consistency as test–retest reliability; alpha as the chosen coefficient) and Part V (research frontier — separating consistency from accuracy in LLM evaluation).