peopleanalyst


analytics · Q5 · to verify

Naik 2024 (arXiv) — model-consensus framework lifts precision 73%→96% while keeping enough independence to catch errors via disagreement

A probabilistic-consensus framework for LLM reliability improved extraction precision from 73.1% with a single model to 93.9% with two models and 95.6% with three. Inter-model agreement was κ > 0.76: high enough to support consensus, yet low enough that the models stayed sufficiently independent for their disagreements to surface errors.

Precision as a function of the number of models in consensus (1→3), and inter-model agreement (Cohen/Fleiss κ) of the ensemble
Precision: 73.1% (1 model) → 93.9% (2 models) → 95.6% (3 models). Inter-model agreement κ > 0.76, retaining enough independence that disagreements flag errors.
Sample
Ensemble-validation experiments across model counts of 1–3 (exact item N not yet extracted for verification)
Methodology
Probabilistic-consensus / ensemble-validation framework; precision measured as models are added; inter-model agreement quantified via κ; analysis of the independence-vs-agreement balance.
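
The methodology above can be illustrated with a minimal sketch. It is not the preprint's implementation: the unanimous-agreement rule stands in for the probabilistic consensus step, pairwise Cohen's κ stands in for the agreement statistic, and the model names and labels are hypothetical data invented for illustration.

```python
from collections import Counter
from itertools import combinations


def consensus(labels):
    """Return (majority_label, unanimous) for one item, given one label per model."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    return top_label, top_count == len(labels)


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical outputs from three models on six items (illustrative only).
outputs = {
    "model_a": ["yes", "no", "yes", "yes", "no", "no"],
    "model_b": ["yes", "no", "yes", "no", "no", "no"],
    "model_c": ["yes", "no", "no", "yes", "no", "no"],
}

accepted, flagged = [], []
for i, labels in enumerate(zip(*outputs.values())):
    label, unanimous = consensus(labels)
    # Assumed rule: unanimous items are accepted, any disagreement is flagged.
    (accepted if unanimous else flagged).append((i, label))

kappas = {
    (m1, m2): cohen_kappa(outputs[m1], outputs[m2])
    for m1, m2 in combinations(outputs, 2)
}

print("accepted:", accepted)
print("flagged for review:", flagged)
for pair, k in kappas.items():
    print(pair, round(k, 3))
```

Flagged items are the ones where disagreement acts as the error signal. The pairwise κ values show how far the models are from being redundant copies of one another.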

What this means

  • Names the central reliability tension explicitly: raters must agree enough to be aggregable, but not so much that they are redundant — perfectly correlated raters add no information and cannot catch each other's errors. This is the classic 'effective number of independent raters' point from generalizability theory, restated for LLMs.
  • The precision curve (73%→94%→96% as raters go 1→2→3) follows a Spearman–Brown-shaped diminishing-returns climb: each added independent rater lifts performance, with the marginal gain shrinking. The parallel is qualitative, since precision is not a reliability coefficient, but the shape matches a century-old prediction, here on a 2024 LLM stack.
  • κ > 0.76 as the operating band is the load-bearing detail for the thesis: the framework deliberately does not maximize agreement, because the residual disagreement is the error-detection channel. Reliability theory has always distinguished agreement from validity; here disagreement is harnessed as a diagnostic.
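
The two measurement ideas in the points above can be written out as a short sketch. It uses the standard Spearman–Brown prophecy formula and the equicorrelation form of the effective number of raters. The single-rater reliability of 0.6 is a hypothetical input chosen for illustration, and the sketch does not attempt to reproduce the preprint's precision figures.

```python
def spearman_brown(rho_single, k):
    """Predicted reliability of the average of k parallel raters."""
    return k * rho_single / (1 + (k - 1) * rho_single)


def effective_raters(k, mean_corr):
    """Effective number of independent raters among k raters whose
    average pairwise correlation is mean_corr (equicorrelation assumption)."""
    return k / (1 + (k - 1) * mean_corr)


rho = 0.6  # hypothetical single-rater reliability, not taken from the source
curve = [spearman_brown(rho, k) for k in (1, 2, 3)]
gains = [later - earlier for earlier, later in zip(curve, curve[1:])]

print("reliability for k = 1, 2, 3:", [round(r, 3) for r in curve])
print("marginal gains:", [round(g, 3) for g in gains])

# Perfectly correlated raters collapse to one effective rater;
# uncorrelated raters each count in full.
print(effective_raters(3, 1.0), effective_raters(3, 0.0))
```

The second marginal gain is smaller than the first, which is the diminishing-returns shape. The effective-rater count shows why perfectly correlated models add no information.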

Source

Context

What came before
Single-model LLM outputs are accepted or rejected against a gold standard with no native confidence-from-agreement signal. Ensembling was often framed purely as an accuracy booster, not as a reliability framework with an explicit independence requirement.
What comes next
Verify the exact precision values, the task/dataset, and the κ computation against the full preprint. Cross-link to the measurement-concept entries on generalizability theory, effective number of raters, and the agreement-vs-independence tradeoff.
Where this lands
Encyclopedia Part II (measurement — diminishing-returns reliability gain from added raters; agreement-without-redundancy) and Part V (research frontier — disagreement as an error-detection channel).