peopleanalyst


analytics · Q7 · verified

Scullen, Mount & Goff 2000 (J. Applied Psychology) — idiosyncratic rater effects are the largest single source of variance in performance ratings

When managerial performance ratings were decomposed into five postulated sources, idiosyncratic rater effects accounted for 62% and 53% of the rating variance in the two large data sets, over half in each, while the ratee's actual performance (general + dimensional) accounted for only 21% and 25%. The single largest thing a performance rating measures is the rater, not the ratee.

Share of performance-rating variance attributable to idiosyncratic rater effects vs. ratee performance vs. random error
Idiosyncratic rater effects: 62% and 53% (two data sets). General + dimensional ratee performance: 21% and 25%. Random measurement error: 11% and 18%. Small perspective-related (organizational level) effects in boss and subordinate ratings, none in peer ratings.
Sample
Two data sets of managers (n = 2,350 and n = 2,142), each rated on 3 performance dimensions by 7 raters (2 bosses, 2 peers, 2 subordinates, self)
Methodology
Confirmatory factor analysis decomposing developmental multisource ratings into five variance components: ratee general performance, ratee dimensional performance, idiosyncratic rater tendencies, rater organizational perspective, and random error.
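
To make the logic of the decomposition concrete, here is a toy simulation. It is not the paper's confirmatory factor analysis. It assumes only three independent, normally distributed components (ratee, rater idiosyncrasy, random error), uses the first data set's shares, omits the perspective component, and lumps general and dimensional performance together. All names and the sample size are illustrative. It shows why having several raters and several dimensions lets the components be told apart: two raters of the same person share only the ratee component, while two dimensions from the same rater also share that rater's idiosyncrasy.

```python
import math
import random

# Variance shares from the card's first data set. The perspective component
# is left out (a simplifying assumption), so these three sum to 0.94.
VAR_RATEE = 0.21   # general + dimensional ratee performance, lumped together
VAR_RATER = 0.62   # idiosyncratic rater effect
VAR_ERROR = 0.11   # random measurement error


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_ratees=20000, seed=0):
    """Simulate raters A and B scoring each ratee (A on two dimensions)."""
    rng = random.Random(seed)
    sd_ratee = math.sqrt(VAR_RATEE)
    sd_rater = math.sqrt(VAR_RATER)
    sd_error = math.sqrt(VAR_ERROR)
    a1, a2, b1 = [], [], []
    for _ in range(n_ratees):
        ratee = rng.gauss(0, sd_ratee)    # shared by every rating of this ratee
        rater_a = rng.gauss(0, sd_rater)  # rater A's idiosyncratic view
        rater_b = rng.gauss(0, sd_rater)  # rater B's idiosyncratic view
        a1.append(ratee + rater_a + rng.gauss(0, sd_error))  # A, dimension 1
        a2.append(ratee + rater_a + rng.gauss(0, sd_error))  # A, dimension 2
        b1.append(ratee + rater_b + rng.gauss(0, sd_error))  # B, dimension 1
    return a1, a2, b1


a1, a2, b1 = simulate()
total = VAR_RATEE + VAR_RATER + VAR_ERROR
between_raters = pearson(a1, b1)  # model value: VAR_RATEE / total
within_rater = pearson(a1, a2)    # model value: (VAR_RATEE + VAR_RATER) / total
print(f"between-rater r = {between_raters:.3f} (model {VAR_RATEE / total:.3f})")
print(f"within-rater  r = {within_rater:.3f} (model {(VAR_RATEE + VAR_RATER) / total:.3f})")
```

Under these assumptions the between-rater correlation estimates the ratee share, and the gap between the within-rater and between-rater correlations estimates the rater share.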

What this means

  • This is the empirical core of the 'humans were never reliable single raters either' argument: when you ask where a performance rating actually comes from, the rater's idiosyncratic way of seeing dominates the ratee's actual performance by more than 2-to-1. The instrument measures itself.
  • It reframes the AI-reliability conversation. A noisy LLM rater is not a regression from a reliable human baseline; the human single-rater baseline was already saturated with rater variance. The disease is single-rater measurement, in humans and machines alike.
  • It is the quantitative warrant for the prescription the literature already wrote: pool diverse raters. If 53-62% of a single rating is rater idiosyncrasy, averaging across independent raters is not an efficiency tradeoff — it is the only way to recover the ratee signal.
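
The pooling argument in the last bullet can be sketched numerically. The sketch below assumes that the ratee component is common to every rater and that all remaining variance (idiosyncrasy, error, and the small perspective effect) is independent across raters, so it shrinks by 1/k in a mean of k ratings. That is optimistic for perspective effects, which raters at the same organizational level may share. The function names are illustrative and do not come from the paper.

```python
import math


def ratee_share_of_mean(ratee_share: float, k: int) -> float:
    """Share of variance in the mean of k raters that reflects the ratee.

    Assumes the ratee component is common to all raters and every other
    component is independent across raters (so it averages down by 1/k).
    """
    if not 0.0 < ratee_share < 1.0 or k < 1:
        raise ValueError("need 0 < ratee_share < 1 and k >= 1")
    other = 1.0 - ratee_share
    return ratee_share / (ratee_share + other / k)


def raters_needed(ratee_share: float, target: float) -> int:
    """Smallest number of raters whose mean reaches the target ratee share."""
    if not 0.0 < ratee_share < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("shares must lie strictly between 0 and 1")
    k = target * (1.0 - ratee_share) / (ratee_share * (1.0 - target))
    # Small epsilon guards against float noise when k is an exact integer.
    return max(1, math.ceil(k - 1e-9))


# Single-rater ratee shares reported on the card for the two data sets.
for share in (0.21, 0.25):
    for k in (1, 2, 4, 7):
        pooled = ratee_share_of_mean(share, k)
        print(f"single-rater share {share:.2f}, k={k}: pooled share {pooled:.2f}")
```

With a single-rater ratee share of 0.21, the mean of seven such raters would carry roughly 65% ratee variance under these assumptions, which is the sense in which pooling recovers the signal.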

Source

Understanding the latent structure of job performance ratings

Journal of Applied Psychology · Steven E. Scullen et al. · 2000 · peer-reviewed

Context

What came before
Performance ratings were widely treated as a workable proxy for performance, with rater variance relegated to 'measurement error' to be minimized rather than understood as the dominant signal. The hope of AI raters inherits the same unexamined premise: that the human rating was a trustworthy gold standard.
What comes next
Sets up the inter-rater reliability figure (single-supervisor reliability ≈ .52, Viswesvaran/Ones/Schmidt 1996) and the attenuation ceiling. Cross-link to the LLM-rater cards (Young 2025, Ntinopoulos 2025) — the AI raters disagree for the same structural reason — and to the multi-rater / G-theory D-study fix.
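
The attenuation ceiling can be illustrated with the classical test theory attenuation formula, in which an observed correlation equals the true correlation times the square root of the product of the two reliabilities. The card does not spell the formula out, so reading "attenuation ceiling" this way is an assumption, and the function names and the 0.50 example value are hypothetical.

```python
import math


def attenuation_ceiling(criterion_reliability: float,
                        predictor_reliability: float = 1.0) -> float:
    """Largest observable correlation when the true correlation is 1.0."""
    return math.sqrt(criterion_reliability * predictor_reliability)


def observed_correlation(true_correlation: float,
                         criterion_reliability: float,
                         predictor_reliability: float = 1.0) -> float:
    """Correlation expected after unreliability attenuates the true value."""
    return true_correlation * attenuation_ceiling(
        criterion_reliability, predictor_reliability)


def disattenuate(observed: float,
                 criterion_reliability: float,
                 predictor_reliability: float = 1.0) -> float:
    """Classical correction for attenuation (inverse of the function above)."""
    return observed / attenuation_ceiling(
        criterion_reliability, predictor_reliability)


SINGLE_SUPERVISOR_RELIABILITY = 0.52  # figure cited on the card

ceiling = attenuation_ceiling(SINGLE_SUPERVISOR_RELIABILITY)
example = observed_correlation(0.50, SINGLE_SUPERVISOR_RELIABILITY)
print(f"ceiling against a single supervisor rating: {ceiling:.2f}")
print(f"a true correlation of 0.50 would be observed as about {example:.2f}")
```

Under this reading, even a perfectly measured predictor could correlate only about .72 with a single supervisor's rating, which is why the ceiling matters when one human rating is used as the benchmark for an AI rater.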
Where this lands
Magazine: 'The Reliability Problem' §'The wall everyone hits' (footnote [^8]). Encyclopedia Part I (single-rater unreliability of human judgment) and Part II (variance decomposition / generalizability theory). Book 1 Unreliable, the human-failure lead case.