analytics · Q5 · to verify
Niimi 2025 — ensembling repeated medium-LLM inferences (majority-vote style) cuts RMSE 18.6% vs a single large-model attempt
Drawing the explicit analogy to human annotation — where majority voting resolves coder disagreements — Niimi shows that ensembling multiple inferences of a medium-sized LLM reduced text-classification RMSE by 18.6% relative to a single attempt by a larger model. Aggregating several cheap, noisy reads outperformed one expensive read.
RMSE reduction from ensembling repeated medium-LLM inferences (majority-vote-style aggregation) vs a single large-model inference, in text classification
18.6% RMSE reduction for the ensemble of repeated medium-model inferences vs a single large-model attempt.
- Sample
- Text-classification task with repeated inferences of a medium LLM aggregated and compared to a single large-model run (exact item count N not extracted; still to be verified)
- Methodology
- Simple ensemble strategy: multiple LLM inferences aggregated (analogous to majority voting across human annotators); RMSE compared against a single large-model inference.
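The comparison described under Methodology can be sketched as follows. This is a minimal illustration, not the paper's code: the aggregation rule (majority vote, with the mean shown as an alternative), the tie-break, the function names, and all the ratings are assumptions invented for the example. The toy numbers are constructed so that the ensemble wins and say nothing about the reported 18.6% figure.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def majority_vote(labels):
    """Return the most frequent label; ties go to the smallest label.

    The tie-break is a hypothetical choice for this sketch only.
    """
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def rmse(predictions, targets):
    """Root-mean-square error between predicted and gold ratings."""
    return sqrt(mean((p - t) ** 2 for p, t in zip(predictions, targets)))


# Made-up data: gold ratings for four texts, five repeated runs of a
# "medium" model, and one run of a "large" model.
gold = [1, 3, 5, 2]
medium_runs = [
    [1, 3, 4, 2],
    [2, 3, 5, 2],
    [1, 2, 5, 3],
    [1, 3, 5, 2],
    [1, 4, 5, 1],
]
large_single = [1, 3, 4, 3]

# Regroup so that each tuple holds the five reads of one text.
reads_per_item = list(zip(*medium_runs))

ensemble_vote = [majority_vote(reads) for reads in reads_per_item]
ensemble_mean = [mean(reads) for reads in reads_per_item]

print("single medium run:", rmse(medium_runs[0], gold))
print("majority vote    :", rmse(ensemble_vote, gold))
print("mean of runs     :", rmse(ensemble_mean, gold))
print("single large run :", rmse(large_single, gold))
```

Majority vote suits nominal labels, while averaging only makes sense when the labels are ordinal or numeric. RMSE as the metric suggests the latter, but which rule the paper applies is among the items still to be verified.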
What this means
- The paper makes the bridge to the thesis explicit by name: human annotation resolves disagreement by majority vote, and the same procedure stabilizes a noisy LLM rater. The 'new' technique is the oldest reliability fix there is — average more raters.
- An 18.6% RMSE cut from aggregating repeated reads of a smaller model, beating one read of a bigger model, is the measurement argument against scale-as-the-only-lever: reliability gained through replication can dominate capability gained through size, exactly as the multi-rater-averaging math predicts.
- Repeated inference with one model is the intra-rater (test–retest) version of ensembling; averaging the runs lowers variance the same way averaging several reads by one human coder would. It complements the inter-model κ work in this cluster: both are reliability-through-aggregation, one within a rater and one across raters.
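The multi-rater-averaging math that the points above appeal to can be written out under a standard textbook error model. The symbols and assumptions here are ours, not taken from the paper: each read is a true score plus zero-mean noise with equal variance and a common pairwise error correlation.

```latex
% Assumption: read i of an item is the true score \tau plus zero-mean noise,
% with common variance \sigma^2 and pairwise error correlation \rho.
x_i = \tau + \varepsilon_i, \qquad
\mathbb{E}[\varepsilon_i] = 0, \quad
\operatorname{Var}(\varepsilon_i) = \sigma^2, \quad
\operatorname{Corr}(\varepsilon_i, \varepsilon_j) = \rho \ (i \neq j)

% Error variance of the mean of k reads.
\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i, \qquad
\operatorname{Var}(\bar{x}_k - \tau) = \sigma^2 \, \frac{1 + (k-1)\rho}{k}

% Independent errors (\rho = 0): RMSE shrinks as 1/\sqrt{k}.
\operatorname{RMSE}(\bar{x}_k) = \frac{\sigma}{\sqrt{k}}
```

Two caveats follow from the same model. Averaging removes only random error, so a bias shared by every run survives aggregation. And when errors across runs are correlated, which is plausible for repeated runs of one model, the error variance levels off at \rho\sigma^2 instead of falling to zero as k grows.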
Source
A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification
arXiv (preprint) · Niimi · 2025 · peer-reviewed
Context
- What came before
- The dominant framing for improving LLM classification quality is to use a larger or better model. Run-to-run variability is treated as a nuisance to suppress (lower temperature) rather than a signal to aggregate over.
- What comes next
- Verify the 18.6% figure, the task and dataset, the specific model pairing, and the exact aggregation rule against the preprint. Cross-link to the measurement-concept entries on majority voting, error-of-measurement reduction through replication, and reliability-vs-capability.
- Where this lands
- Encyclopedia Part II (measurement — majority voting / replication as the oldest reliability fix; reliability-through-aggregation vs capability-through-scale) and Part V (research frontier — intra-model ensembling for stability).