analytics · Q6 · to verify
Courvoisier et al. 2025 (Research Synthesis Methods) — an N-of-M LLM agreement rule beats the human gold standard on ≥85% of abstracts and withholds the rest
A multimodel framework for abstract classification and information extraction decides only when at least N of M LLMs agree, and otherwise withholds. Several combinations (e.g., 3 of 5) reached >95% accuracy and exceeded the human gold standard on at least 85% of abstracts; the cases where the models disagreed were precisely the hard ones flagged for human review.
Accuracy of N-of-M agreement-thresholded LLM combinations, share of abstracts on which the framework exceeds the human gold standard, and the withhold-on-disagreement routing of hard cases
Several N-of-M combinations (e.g., 3 of 5) achieved >95% accuracy and exceeded the human gold standard on ≥85% of abstracts. Disagreement cases (the remainder) are withheld and routed to humans.
- Sample
- M LLMs combined under N-of-M agreement rules on a corpus of abstracts (the exact number of abstracts was not extracted and remains to be verified)
- Methodology
- Agreement-based multimodel framework: a decision is emitted only when ≥N of M LLMs agree, otherwise the item is withheld for human review; accuracy benchmarked against — and exceeding — a human gold standard.
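The agreement rule described under Methodology can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `n_of_m_decision`, the label strings, and the choice to withhold when two labels tie at the top are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def n_of_m_decision(votes: Sequence[Hashable], n: int) -> Optional[Hashable]:
    """Return the label backed by at least n of the M votes, else None (withhold).

    Hypothetical helper for illustration; `votes` holds one label per model.
    """
    if not 1 <= n <= len(votes):
        raise ValueError("n must be between 1 and the number of models")
    ranked = Counter(votes).most_common(2)
    label, count = ranked[0]
    if count < n:
        return None  # not enough agreement: route the item to a human
    # Assumption: if two labels tie for first place (possible only when
    # n <= M/2), treat the item as disagreement and withhold it as well.
    if len(ranked) > 1 and ranked[1][1] == count:
        return None
    return label


# Illustrative votes from five hypothetical models on one abstract.
votes = ["include", "include", "exclude", "include", "exclude"]
decision_3_of_5 = n_of_m_decision(votes, 3)  # three models agree on "include"
decision_4_of_5 = n_of_m_decision(votes, 4)  # threshold not met, withheld
```

A `None` result stands for "withheld for human review"; any downstream queueing of those items is outside this sketch.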
What this means
- This is the purest statement of the thesis in the cluster: 'beyond human gold standards' is literally the claim that an aggregation rule over noisy LLM raters can outperform the single human-coded gold standard. It rests on the same logic as decades of reliability work showing that an aggregate of several imperfect raters is typically more reliable than a single rater.
- The withhold-on-disagreement design is selective prediction made operational: the framework abstains exactly where its raters disagree, and disagreement is shown to concentrate on the hard cases. Reliability theory's 'low-agreement items are the ambiguous items' becomes a triage mechanism for routing work to humans.
- The N-of-M knob is the same accuracy-vs-yield dial seen in the Mackay echocardiography ensemble: tightening N raises accuracy on accepted items and shrinks coverage. Two independent clinical-science teams converge on the same reliability-coverage frontier.
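The accuracy-versus-yield trade-off in the points above can be made concrete by sweeping N over a fixed set of votes. The sketch below uses entirely made-up toy votes and reference labels (not data from the paper), assumes M = 5 with N > M/2 so that ties cannot occur, and uses the hypothetical helper name `coverage_and_accuracy`.

```python
from collections import Counter


def coverage_and_accuracy(vote_rows, reference, n):
    """Return (coverage, accuracy on accepted items) for an N-of-M threshold.

    Hypothetical helper; `reference` holds labels assumed correct for the toy.
    """
    decided = 0
    correct = 0
    for votes, truth in zip(vote_rows, reference):
        label, count = Counter(votes).most_common(1)[0]
        if count >= n:  # enough agreement: the framework emits a decision
            decided += 1
            correct += int(label == truth)
    coverage = decided / len(vote_rows)
    accuracy = correct / decided if decided else None
    return coverage, accuracy


# Toy data (invented for illustration): six abstracts, five models each.
toy_votes = [
    ["inc", "inc", "inc", "inc", "inc"],
    ["inc", "inc", "inc", "inc", "exc"],
    ["exc", "exc", "exc", "inc", "inc"],  # bare majority, and it is wrong
    ["exc", "exc", "exc", "exc", "exc"],
    ["exc", "exc", "exc", "exc", "inc"],
    ["inc", "inc", "inc", "exc", "exc"],
]
toy_reference = ["inc", "inc", "inc", "exc", "exc", "inc"]

# Sweep the threshold: stricter N accepts fewer items.
sweep = {n: coverage_and_accuracy(toy_votes, toy_reference, n) for n in (3, 4, 5)}
for n, (coverage, accuracy) in sweep.items():
    print(f"{n} of 5: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

In this toy set, raising N from 3 to 4 drops the one low-agreement item that the majority got wrong, so accuracy on accepted items rises while coverage falls. Real corpora need not behave this cleanly.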
Source
Research Synthesis Methods · Courvoisier et al. · 2025 · peer-reviewed
Context
- What came before
- The human-coded gold standard was treated as the ceiling that automated extraction aspires to match. Single-LLM pipelines were scored against it and assumed to be bounded by it; the idea that an LLM ensemble could exceed it was not the default framing.
- What comes next
- Verify the exact N-of-M combinations, the >95% / ≥85% figures, and the corpus size against the full article. Cross-link to the measurement-concept entries on aggregation-beats-single-rater, selective prediction / abstention, and item difficulty as the driver of low agreement.
- Where this lands
- Encyclopedia Part II (measurement — aggregation exceeding the single human gold standard; withhold-on-disagreement as triage) and Part V (research frontier — agreement-thresholded human-in-the-loop routing).