peopleanalyst


analytics · Q6 · to verify

Ntinopoulos et al. 2025 (BMJ HCI) — 18-LLM EHR-extraction benchmark reports multi-run consistency as Krippendorff's alpha up to 1.0

In an evaluation of 18 LLMs against a baseline for data extraction from electronic health records, the top models exceeded 0.98 accuracy. The study reported intra-model multi-run consistency as Krippendorff's alpha, the same chance-corrected agreement coefficient used for human coders, with values reaching up to 1.0 and Claude 3 Opus at alpha 0.996.

Extraction accuracy of top models and intra-model multi-run consistency (Krippendorff's alpha) across 18 LLMs
Top-model accuracy > 0.98. Multi-run consistency (Krippendorff's alpha) up to 1.0; Claude 3 Opus alpha 0.996.
Sample
18 LLMs evaluated against a baseline on EHR data extraction (exact record N not yet extracted for verification)
Methodology
Multiple-model performance evaluation; accuracy vs a baseline; intra-model consistency across repeated runs quantified with Krippendorff's alpha.
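
To make the consistency measure concrete, here is a minimal sketch of Krippendorff's alpha for nominal labels, treating each repeated run of one model as a "coder" and each extracted field as a unit. This is not the study's code: the function name, the toy labels, and the choice of the nominal-data variant are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that repeated
    runs produced for one item. Use None for a missing run.
    """
    # Build the coincidence matrix: every ordered pair of labels from
    # different runs on the same item counts 1 / (m - 1), where m is the
    # number of non-missing labels for that item.
    coincidences = Counter()
    for labels in units:
        labels = [v for v in labels if v is not None]
        m = len(labels)
        if m < 2:
            continue  # an item with fewer than two labels carries no pairing information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label ever appears, so alpha is undefined (0 / 0).
        return float("nan")
    return 1.0 - observed / expected


# Hypothetical toy data: 4 items, 3 repeated runs each.
toy_runs = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["A", "A", "B"],  # one run disagrees with the other two
    ["B", "B", "B"],
]
toy_alpha = krippendorff_alpha_nominal(toy_runs)
perfect_alpha = krippendorff_alpha_nominal([["A", "A"], ["B", "B"]])
print(round(toy_alpha, 3), perfect_alpha)
```

On the toy data a single disagreeing run out of twelve pulls alpha down to about 0.686, while identical runs across items that differ from one another give exactly 1.0. When every run on every item returns the same label, expected disagreement is zero and the coefficient is undefined, which the sketch reports as NaN rather than as perfect consistency.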

What this means

  • The reliability question here is intra-rater (test–retest) rather than inter-rater: does the same model give the same answer on repeated runs? The authors answer it with Krippendorff's alpha — a coefficient built for human coder agreement — making the point that LLM stochasticity is just rater inconsistency, and the discipline already has the instrument to measure it.
  • An alpha of 0.996 (Claude 3 Opus) is the LLM analog of a near-perfectly consistent human coder. Framing run-to-run variability as a measurable reliability coefficient, rather than an unquantified 'temperature' nuisance, is exactly the move the century-of-psychometrics thesis predicts.
  • High accuracy (>0.98) and high multi-run consistency (alpha up to 1.0) are reported as separate axes — the classic reliability-vs-validity distinction. A model can be perfectly consistent and still wrong; reporting both keeps that distinction visible instead of collapsing it into a single accuracy number.
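
The separation of the two axes can be illustrated with a small sketch on made-up data. The labels, variable names, and helper functions below are hypothetical, and the consistency measure is a plain unanimity rate, used here as a simpler uncorrected stand-in for Krippendorff's alpha rather than the coefficient the study reports.

```python
def mean_accuracy(runs, gold):
    """Share of all run outputs that match the reference label."""
    correct = sum(label == g for item, g in zip(runs, gold) for label in item)
    total = sum(len(item) for item in runs)
    return correct / total


def unanimity_rate(runs):
    """Share of items on which every repeated run gave the same label.

    Not chance-corrected, so it is only a rough proxy for alpha.
    """
    return sum(len(set(item)) == 1 for item in runs) / len(runs)


# Hypothetical reference labels for 4 items.
gold = ["yes", "no", "yes", "no"]

# Model 1: identical output on every run, but wrong on half the items.
consistent_but_wrong = [["no"] * 3, ["no"] * 3, ["yes"] * 3, ["yes"] * 3]

# Model 2: more accurate on average, but the runs rarely agree with each other.
accurate_but_unstable = [
    ["yes", "yes", "no"],
    ["no", "no", "yes"],
    ["yes", "no", "yes"],
    ["no", "no", "no"],
]

for name, runs in [
    ("consistent_but_wrong", consistent_but_wrong),
    ("accurate_but_unstable", accurate_but_unstable),
]:
    print(name, mean_accuracy(runs, gold), unanimity_rate(runs))
```

The first hypothetical model scores 1.0 on unanimity and only 0.5 on accuracy; the second scores 0.75 on accuracy and 0.25 on unanimity. Neither number can be recovered from the other, which is why the card treats them as separate axes.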

Source

Large language models for data extraction from electronic health records: a multiple model performance evaluation

BMJ Health & Care Informatics · Ntinopoulos et al. · 2025 · peer-reviewed

Context

What came before
LLM run-to-run variability is usually discussed informally as a function of sampling temperature, without a chance-corrected consistency coefficient. Benchmarks report accuracy but rarely report intra-model reliability as a named statistic.
What comes next
Verify the exact alpha values, the full 18-model table, and the record N against the published article. Cross-link to the measurement-concept entries on Krippendorff's alpha, test–retest reliability, and the reliability-vs-validity distinction.
Where this lands
Encyclopedia Part II (measurement — intra-model multi-run consistency as test–retest reliability; alpha as the chosen coefficient) and Part V (research frontier — separating consistency from accuracy in LLM evaluation).