analytics · Q5 · to verify
Naik 2024 (arXiv) — model-consensus framework lifts precision 73%→96% while keeping enough independence to catch errors via disagreement
A probabilistic-consensus framework for LLM reliability improved extraction precision from 73.1% with a single model to 93.9% with two models and 95.6% with three. Inter-model agreement was κ > 0.76: high enough to support consensus, yet low enough that the models retained sufficient independence for their disagreements to surface errors.
Precision as a function of the number of models in the consensus (1→3), and inter-model agreement (Cohen/Fleiss κ) of the ensemble
Precision: 73.1% (1 model) → 93.9% (2 models) → 95.6% (3 models). Inter-model agreement κ > 0.76, retaining enough independence that disagreements flag errors.
- Sample
- Ensemble-validation experiments across model counts of 1–3 (exact item count N not yet extracted for verification)
- Methodology
- Probabilistic-consensus / ensemble-validation framework; precision measured as models are added; inter-model agreement quantified via κ; analysis of the independence-vs-agreement balance.
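As a rough illustration of how inter-model agreement can be quantified, the minimal sketch below computes Cohen's κ for two models and lists the items on which they disagree. The label lists are invented toy data, not results from the paper, and the function names `cohen_kappa` and `disagreements` are hypothetical. How the paper itself computes κ still needs to be checked against the preprint.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def disagreements(a, b):
    """Indices where the two raters differ: candidates for review."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy labels, invented for illustration only (NOT data from the paper).
model_a = ["ok", "ok", "err", "ok", "err", "ok", "ok", "err", "ok", "ok"]
model_b = ["ok", "ok", "err", "ok", "ok", "ok", "ok", "err", "ok", "ok"]

kappa = cohen_kappa(model_a, model_b)
flagged = disagreements(model_a, model_b)
print(f"kappa = {kappa:.3f}, items flagged for review: {flagged}")
```

In this toy case the two models agree on nine of ten items, and the one disagreement is exactly the item that would be routed for review. This is the sense in which residual disagreement acts as an error-detection channel.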
What this means
- Names the central reliability tension explicitly: raters must agree enough to be aggregable, but not so much that they are redundant — perfectly correlated raters add no information and cannot catch each other's errors. This is the classic 'effective number of independent raters' point from generalizability theory, restated for LLMs.
- The precision curve (73%→94%→96% as raters go 1→2→3) has the shape of a Spearman–Brown diminishing-returns climb: each added independent rater lifts reliability, but the marginal gain shrinks (about 21 points for the second model, under 2 for the third). The pattern is consistent with a century-old prediction, here observed on a 2024 LLM stack.
- κ > 0.76 as the operating band is the load-bearing detail for the thesis: the framework deliberately does not maximize agreement, because the residual disagreement is the error-detection channel. Reliability theory has always distinguished agreement from validity; here disagreement is harnessed as a diagnostic.
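The diminishing-returns shape and the "effective number of independent raters" idea in the points above can be sketched with two textbook formulas: the Spearman–Brown prophecy formula and the effective sample size for equally correlated raters. The value `rho = 0.5` is an assumption chosen for illustration. The paper reports precision and κ, not a single-rater reliability, so this is not a fit to its results. The function names are hypothetical.

```python
def spearman_brown(rho, k):
    """Predicted reliability of the average of k parallel raters,
    given single-rater reliability rho (Spearman-Brown formula)."""
    if k < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need k >= 1 and 0 <= rho <= 1")
    return k * rho / (1 + (k - 1) * rho)


def effective_raters(k, rho):
    """Effective number of independent raters among k raters whose
    pairwise correlation is rho (design-effect formula)."""
    if k < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need k >= 1 and 0 <= rho <= 1")
    return k / (1 + (k - 1) * rho)


rho = 0.5  # ASSUMED single-rater reliability, for illustration only
curve = [spearman_brown(rho, k) for k in (1, 2, 3)]
gains = [later - earlier for earlier, later in zip(curve, curve[1:])]

for k, r in zip((1, 2, 3), curve):
    print(f"k={k}: predicted reliability {r:.3f}, "
          f"effective raters {effective_raters(k, rho):.2f}")
print("marginal gains:", [round(g, 3) for g in gains])
```

At the two extremes, `effective_raters(3, 1.0)` is 1 (perfectly correlated raters add nothing) and `effective_raters(3, 0.0)` is 3 (fully independent raters each count in full). The agreement-versus-independence tradeoff sits between these.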
Source
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability
arXiv · Naik · 2024 · peer-reviewed
Context
- What came before
- Single-model LLM outputs are accepted or rejected against a gold standard with no native confidence-from-agreement signal. Ensembling was often framed purely as an accuracy booster, not as a reliability framework with an explicit independence requirement.
- What comes next
- Verify the exact precision values, the task/dataset, and the κ computation against the full preprint. Cross-link to the measurement-concept entries on generalizability theory, effective number of raters, and the agreement-vs-independence tradeoff.
- Where this lands
- Encyclopedia Part II (measurement — diminishing-returns reliability gain from added raters; agreement-without-redundancy) and Part V (research frontier — disagreement as an error-detection channel).