peopleanalyst


analytics · Q6 · to verify

Courvoisier et al. 2025 (Research Synthesis Methods) — an N-of-M LLM agreement rule beats the human gold standard on ≥85% of abstracts and withholds the rest

A multimodel framework for abstract classification and information extraction decides only when at least N of M LLMs agree, and otherwise withholds. Several combinations (e.g., 3 of 5) reached >95% accuracy and exceeded the human gold standard on at least 85% of abstracts; the cases where the models disagreed were precisely the hard ones flagged for human review.

Accuracy of N-of-M agreement-thresholded LLM combinations, share of abstracts on which the framework exceeds the human gold standard, and the withhold-on-disagreement routing of hard cases
Several N-of-M combinations (e.g., 3 of 5) achieved >95% accuracy and exceeded the human gold standard on ≥85% of abstracts. Disagreement cases (the remainder) are withheld and routed to humans.
Sample
M LLMs combined under N-of-M agreement rules on a corpus of abstracts (the exact number of abstracts has not yet been extracted for verification)
Methodology
Agreement-based multimodel framework: a decision is emitted only when ≥N of M LLMs agree, otherwise the item is withheld for human review; accuracy benchmarked against — and exceeding — a human gold standard.
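The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `n_of_m_decision`, the string labels, and the restriction to strict-majority thresholds (so that two labels can never both reach N) are assumptions made for the example.

```python
from collections import Counter
from typing import Optional, Sequence


def n_of_m_decision(labels: Sequence[str], n: int) -> Optional[str]:
    """Return the label that at least n of the M models agree on, else None.

    `labels` holds one predicted label per model (M = len(labels)).
    None means "withhold and route to human review".
    Assumption for this sketch: n is a strict majority of M, so at most
    one label can ever reach the threshold.
    """
    m = len(labels)
    if not (m // 2 < n <= m):
        raise ValueError("n must be a strict majority of the number of models")
    # most_common(1) returns the single most frequent label and its vote count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= n else None
```

With five hypothetical model outputs and a 3-of-5 rule, `n_of_m_decision(["include", "include", "exclude", "include", "exclude"], 3)` emits `"include"`, while a 2-2-1 split returns `None` and the abstract is withheld.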

What this means

  • This is the purest statement of the thesis in the cluster: 'beyond human gold standards' is literally the claim that an aggregation rule over noisy LLM raters can outperform the single human-coded gold standard. It parallels the long-standing reliability result that the mean of several imperfect raters is typically more reliable than a single rater.
  • The withhold-on-disagreement design is selective prediction made operational: the framework abstains exactly where its raters disagree, and disagreement is shown to concentrate on the hard cases. Reliability theory's 'low-agreement items are the ambiguous items' becomes a triage mechanism for routing work to humans.
  • The N-of-M knob is the same accuracy-vs-yield dial seen in the Mackay echocardiography ensemble: tightening N raises accuracy on accepted items and shrinks coverage. Two independent clinical-science teams converge on the same reliability-coverage frontier.
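The accuracy-versus-yield dial in the last bullet can be illustrated by sweeping N over a small set of votes. The sketch below uses invented toy data (`ITEMS`) and a hypothetical helper `sweep`; the numbers it produces show the mechanism only and are not results from the paper.

```python
from collections import Counter

# Hypothetical toy data: (votes from 5 models, reference label) per abstract.
ITEMS = [
    (["inc", "inc", "inc", "inc", "inc"], "inc"),
    (["exc", "exc", "exc", "exc", "inc"], "exc"),
    (["inc", "inc", "inc", "exc", "exc"], "exc"),  # narrow majority, wrong
    (["exc", "exc", "exc", "exc", "exc"], "exc"),
    (["inc", "inc", "exc", "exc", "unc"], "inc"),  # no majority at all
    (["inc", "inc", "inc", "inc", "exc"], "inc"),
]


def sweep(items, m=5):
    """For each strict-majority N, return (N, coverage, accuracy on decided items)."""
    rows = []
    for n in range(m // 2 + 1, m + 1):
        decided = correct = 0
        for votes, truth in items:
            label, count = Counter(votes).most_common(1)[0]
            if count >= n:  # enough models agree: emit a decision
                decided += 1
                correct += label == truth
        coverage = decided / len(items)
        # Accuracy is undefined when every item is withheld.
        accuracy = correct / decided if decided else None
        rows.append((n, coverage, accuracy))
    return rows


for n, coverage, accuracy in sweep(ITEMS):
    print(f"{n} of 5: coverage={coverage:.2f}, accuracy={accuracy}")
```

On this toy set, raising N from 3 to 4 drops the one narrow-majority error from the accepted pool, so accuracy on accepted items rises while coverage falls; raising N to 5 shrinks coverage further without any additional accuracy gain.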

Source

Context

What came before
The human-coded gold standard is treated as the ceiling that automated extraction aspires to match. Single-LLM pipelines are scored against it and assumed to be bounded by it; the idea that an LLM ensemble could exceed it was not the default framing.
What comes next
Verify the exact N-of-M combinations, the >95% / ≥85% figures, and the corpus size against the full article. Cross-link to the measurement-concept entries on aggregation-beats-single-rater, selective prediction / abstention, and item difficulty as the driver of low agreement.
Where this lands
Encyclopedia Part II (measurement — aggregation exceeding the single human gold standard; withhold-on-disagreement as triage) and Part V (research frontier — agreement-thresholded human-in-the-loop routing).