peopleanalyst


analytics · Q5 · to verify

Niimi 2025 — ensembling repeated medium-LLM inferences (majority-vote style) cuts RMSE 18.6% vs a single large-model attempt

Drawing the explicit analogy to human annotation — where majority voting resolves coder disagreements — Niimi shows that ensembling multiple inferences of a medium-sized LLM reduced text-classification RMSE by 18.6% relative to a single attempt by a larger model. Aggregating several cheap, noisy reads outperformed one expensive read.

RMSE reduction from ensembling repeated medium-LLM inferences (majority-vote-style aggregation) vs a single large-model inference, in text classification
18.6% RMSE reduction for the ensemble of repeated medium-model inferences vs a single large-model attempt.
Sample
Text-classification task with repeated inferences of a medium LLM aggregated and compared to a single large-model run (exact item count N not yet extracted for verification)
Methodology
Simple ensemble strategy: multiple LLM inferences aggregated (analogous to majority voting across human annotators); RMSE compared against a single large-model inference.
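
The aggregation step described above can be sketched as follows. This is a minimal illustration, not Niimi's implementation: the helper names (`majority_vote`, `rmse`), the tie-breaking rule, and the toy scores are all assumptions made for the example, and the exact aggregation rule in the preprint is still to be verified. The toy comparison also uses the first run of the same model as the "single attempt", whereas the paper compares against a larger model.

```python
import math
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label.

    Ties are broken by taking the smallest label. This is an arbitrary but
    deterministic choice made for this sketch, not a rule from the paper.
    """
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def rmse(predictions, targets):
    """Root mean squared error between two equal-length score lists."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data, invented for illustration only: each inner list holds three
# repeated inferences for one text item on a 1-5 ordinal scale.
repeated_runs = [
    [3, 4, 3],
    [1, 1, 2],
    [5, 4, 4],
    [2, 2, 2],
]
targets = [3, 1, 4, 2]

single_run = [runs[0] for runs in repeated_runs]            # one attempt per item
ensemble = [majority_vote(runs) for runs in repeated_runs]  # aggregated attempts

print("single-run RMSE:", rmse(single_run, targets))
print("ensemble RMSE:  ", rmse(ensemble, targets))
```

Swapping `majority_vote` for a plain mean of the runs gives the averaging variant of the same idea; which of the two the paper uses is among the details still to be checked.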

What this means

  • The paper makes the bridge to the thesis explicit by name: human annotation resolves disagreement by majority vote, and the same procedure stabilizes a noisy LLM rater. The 'new' technique is the oldest reliability fix there is — average more raters.
  • An 18.6% RMSE cut from aggregating repeated reads of a smaller model, beating one read of a bigger model, is the measurement argument against scale-as-the-only-lever: reliability gained through replication can dominate capability gained through size, exactly as the multi-rater-averaging math predicts.
  • Repeated inference from one model is the intra-rater (test–retest) version of ensembling; averaging the runs lowers variance the same way averaging several reads by one human coder would. It complements the inter-model κ work in this cluster: both are reliability-through-aggregation, one within a rater and one across raters.
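
The averaging math these points lean on can be written out under a simple classical-test-theory model. The model is an assumption of this card, not something taken from the paper: each read $x_i$ of an item equals its true score $\tau$ plus zero-mean noise $e_i$ with variance $\sigma^2$, and the errors of any two reads share a common correlation $\rho$.

```latex
\begin{align}
  x_i &= \tau + e_i, \qquad \bar{x}_k = \frac{1}{k}\sum_{i=1}^{k} x_i \\
  \operatorname{Var}(\bar{x}_k - \tau) &= \sigma^2\,\frac{1 + (k-1)\rho}{k} \\
  \operatorname{RMSE}(\bar{x}_k) &= \sigma\,\sqrt{\frac{1 + (k-1)\rho}{k}}
\end{align}
```

With independent errors ($\rho = 0$) the RMSE of the average falls as $\sigma/\sqrt{k}$. With correlated errors it cannot fall below $\sigma\sqrt{\rho}$, which is the likely case for repeated runs of one model, so the gain from adding runs flattens out. Averaging also leaves any systematic bias untouched, so replication buys reliability but not accuracy beyond what the model's average read already delivers.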

Source

Context

What came before
The dominant framing for improving LLM classification quality is to use a larger or better model. Run-to-run variability is treated as a nuisance to suppress (lower temperature) rather than a signal to aggregate over.
What comes next
Verify the 18.6% figure, the task and dataset, the specific model pairing, and the exact aggregation rule against the preprint. Cross-link to the measurement-concept entries on majority voting, error-of-measurement reduction through replication, and reliability-vs-capability.
Where this lands
Encyclopedia Part II (measurement — majority voting / replication as the oldest reliability fix; reliability-through-aggregation vs capability-through-scale) and Part V (research frontier — intra-model ensembling for stability).