analytics · Q5 · to verify
Mackay et al. 2025 (BJA) — 5-LLM consensus ensembles trade accuracy for yield across four voting strategies (unanimous→plurality)
In automated structured-data extraction from intraoperative echocardiography reports, a 5-LLM ensemble was scored under four voting strategies, from strictest (unanimous) to loosest (plurality). The unanimous ensemble reached 99.4% consensus accuracy but accepted only ~81% of cases; the rest fell below the agreement threshold and were withheld. The plurality strategy delivered the highest raw accuracy (96.1%) and the highest yield (99.4%) but admitted more errors than the unanimous rule. The voting rule is therefore an explicit, tunable accuracy-vs-yield dial.
Consensus accuracy and yield (% of cases reaching the agreement threshold) of a 5-LLM ensemble under four voting strategies, from unanimous to plurality
Unanimous ensemble: 99.4% consensus accuracy at ~81% yield. Plurality ensemble: 96.1% raw accuracy (highest) at 99.4% yield (highest), with higher error than the unanimous rule. Intermediate strategies fall between, tracing an accuracy-vs-yield frontier.
- Sample
- 5 LLMs scored across four voting strategies on intraoperative echocardiography reports (exact report N not yet extracted; pending verification)
- Methodology
- Consensus-based multi-LLM ensemble for structured data extraction; four voting strategies (unanimous, then progressively looser, down to plurality) evaluated on the accuracy-vs-yield tradeoff.
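The voting dial described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names (`consensus`, `score`), the toy `cases`, the integer vote thresholds standing in for the intermediate strategies, and the withhold-on-tie rule are all assumptions, since this entry names only the unanimous and plurality endpoints.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple, List

def consensus(votes: Sequence[str], min_votes: int) -> Optional[str]:
    """Return the winning answer, or None if the case is withheld.

    min_votes=5 models a unanimous rule for five voters; min_votes=1 models
    a plurality rule (any strict winner is accepted). Both mappings are
    assumptions made for this sketch.
    """
    counts = Counter(votes).most_common()
    top_answer, top_count = counts[0]
    # Assumed tie rule: a tie for first place means no consensus.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    return top_answer if top_count >= min_votes else None

def score(cases: List[Tuple[Sequence[str], str]], min_votes: int) -> Tuple[float, float]:
    """Return (yield, consensus accuracy) for a list of (votes, gold) pairs."""
    decided = [(consensus(votes, min_votes), gold) for votes, gold in cases]
    accepted = [(ans, gold) for ans, gold in decided if ans is not None]
    case_yield = len(accepted) / len(cases)
    # Consensus accuracy is measured on accepted cases only.
    accuracy = (
        sum(ans == gold for ans, gold in accepted) / len(accepted)
        if accepted else float("nan")
    )
    return case_yield, accuracy

# Toy data (hypothetical): five extracted fields, five model votes each.
cases = [
    (["A", "A", "A", "A", "A"], "A"),  # 5/5 agree, correct
    (["A", "A", "A", "A", "B"], "A"),  # 4/5 agree, correct
    (["B", "B", "B", "A", "C"], "A"),  # 3/5 agree, wrong
    (["A", "A", "B", "C", "D"], "A"),  # 2/5 plurality, correct
    (["A", "A", "B", "B", "C"], "A"),  # tie, withheld under every rule
]

for min_votes in (5, 4, 3, 1):
    case_yield, accuracy = score(cases, min_votes)
    print(f"min_votes={min_votes}: yield={case_yield:.2f}, accuracy={accuracy:.2f}")
```

On this toy data, loosening the threshold raises yield while accuracy on accepted cases falls, which is the same direction of tradeoff the entry reports; the toy numbers themselves carry no empirical weight.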
What this means
- This is a withhold-on-disagreement design rendered as a tunable knob: the stricter the agreement rule among raters, the higher the accuracy on accepted items and the more items get withheld. That is exactly the classic psychometric move of trading coverage for reliability — here the 'raters' are LLMs and the dial is the voting threshold rather than an item-discrimination cutoff.
- The unanimous-vs-plurality span (99.4% accuracy / ~81% yield vs 96.1% accuracy / 99.4% yield) quantifies a frontier that reliability theory predicts and that a century of inter-rater-agreement work already knows how to characterize. The 'noisy LLM rater' problem is the noisy-human-rater problem with a new substrate.
- Designers do not have to pick one operating point: the abstained cases (the ~19% the unanimous rule withholds) are precisely the hard cases an aggregation rule should route to a human or a stronger model — the disagreement signal is itself diagnostic.
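The routing idea in the last point can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the `route` function, the `accept_at` threshold, and the "accept"/"escalate" labels are hypothetical and do not come from the paper. It accepts a case automatically only when enough models agree, and otherwise escalates it along with the observed agreement level as a diagnostic.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

def route(votes: Sequence[str], accept_at: int = 5) -> Tuple[str, Optional[str], float]:
    """Return (decision, answer, agreement) for one case.

    accept_at=5 corresponds to a unanimous rule for five voters
    (an assumption of this sketch, not a documented setting).
    """
    top_answer, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)  # share of models backing the top answer
    if top_count >= accept_at:
        return ("accept", top_answer, agreement)
    # Disagreement is the signal: hand the case to a human or a stronger model.
    return ("escalate", None, agreement)

print(route(["A", "A", "A", "A", "A"]))
print(route(["A", "A", "A", "B", "C"]))
```

As rough arithmetic on the figures quoted above, a unanimous rule with 99.4% consensus accuracy at ~81% yield would leave about 0.994 × 0.81 ≈ 80% of all cases both accepted and correct, with the remaining ~19% forming the escalation queue.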
Source
British Journal of Anaesthesia · Mackay et al. · 2025 · peer-reviewed
Context
- What came before
- Single-LLM extraction is evaluated as one rater against a gold standard, with accuracy reported as a point estimate and no native mechanism to express 'this case is contested.' The conventional framing treats LLM error as a fixed property of the model rather than a controllable function of the aggregation rule.
- What comes next
- Verify exact report N, the four named voting strategies, and the per-strategy accuracy/yield pairs against the full text. Cross-link to the measurement-concept entries on inter-rater agreement and coverage-vs-reliability tradeoffs, where this paper is a clean modern instance.
- Where this lands
- Encyclopedia Part II (measurement — voting-threshold as an accuracy-vs-yield dial; the modern LLM analog of coverage-vs-reliability) and Part V (research frontier — consensus ensembles as the operational form of withhold-on-disagreement).