analytics · Q7 · verified
Scullen, Mount & Goff 2000 (J. Applied Psychology) — idiosyncratic rater effects are the largest single source of variance in performance ratings
Decomposing managerial performance ratings into five postulated sources, idiosyncratic rater effects accounted for 62% and 53% of the rating variance across two large data sets — over half — while the ratee's actual performance (general + dimensional) accounted for only 21% and 25%. The single largest thing a performance rating measures is the rater, not the ratee.
Share of performance-rating variance attributable to idiosyncratic rater effects vs. ratee performance vs. random error
- Idiosyncratic rater effects: 62% and 53% (two data sets)
- General + dimensional ratee performance: 21% and 25%
- Random measurement error: 11% and 18%
- Perspective-related (organizational level) effects: small in boss and subordinate ratings, none in peer ratings
- Sample
- Two data sets of managers (n = 2,350 and n = 2,142), each rated on 3 performance dimensions by 7 raters (2 bosses, 2 peers, 2 subordinates, self)
- Methodology
- Confirmatory factor analysis decomposing developmental multisource ratings into five variance components: ratee general performance, ratee dimensional performance, idiosyncratic rater tendencies, rater organizational perspective, and random error.
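The five-component decomposition can be sketched as an additive model. The notation below is an illustration chosen for this card, not the paper's own: i indexes the ratee, j the rater, d the performance dimension, and s(j) the rater's organizational level. It assumes unit loadings and mutually uncorrelated components, whereas the actual confirmatory factor analysis estimates the loadings.

```latex
% Illustrative notation (an assumption of this sketch, not taken from the paper)
\begin{align}
x_{ijd} &= g_i + d_{id} + r_{ij} + p_{i,s(j)} + e_{ijd} \\
\operatorname{Var}(x_{ijd}) &= \sigma^2_g + \sigma^2_d + \sigma^2_r + \sigma^2_p + \sigma^2_e \\
\text{idiosyncratic rater share} &= \frac{\sigma^2_r}{\operatorname{Var}(x_{ijd})},
\qquad
\text{ratee share} = \frac{\sigma^2_g + \sigma^2_d}{\operatorname{Var}(x_{ijd})}
\end{align}
```

Here g is ratee general performance, d is ratee dimensional performance, r is the idiosyncratic rater tendency, p is the rater's organizational perspective, and e is random error. The percentages reported above correspond to shares of this kind.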
What this means
- This is the empirical core of the 'humans were never reliable single raters either' argument: when you ask where a performance rating actually comes from, the rater's idiosyncratic way of seeing dominates the ratee's actual performance by more than 2-to-1. The instrument measures itself.
- It reframes the AI-reliability conversation. A noisy LLM rater is not a regression from a reliable human baseline; the human single-rater baseline was already saturated with rater variance. The disease is single-rater measurement, in humans and machines alike.
- It is the quantitative warrant for the prescription the literature already wrote: pool diverse raters. If 53–62% of the variance in a single rating is rater idiosyncrasy, averaging across independent raters is not an efficiency tradeoff; it is the only way to recover the ratee signal.
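The arithmetic behind the 2-to-1 claim and the pooling prescription can be sketched in a few lines. This is a minimal illustration, not the paper's analysis: it uses only the three percentages reported on this card, leaves perspective effects out, and assumes (hypothetically) that rater and error components are independent across raters, so their variance shrinks by 1/k when k ratings are averaged. The function names are invented for the sketch.

```python
# Variance shares reported on this card (percent of total rating variance).
SHARES = {
    "data set 1": {"rater": 62.0, "ratee": 21.0, "error": 11.0},
    "data set 2": {"rater": 53.0, "ratee": 25.0, "error": 18.0},
}


def rater_to_ratee_ratio(shares):
    """How many times larger the idiosyncratic share is than the ratee share."""
    return shares["rater"] / shares["ratee"]


def ratee_share_of_mean(shares, k):
    """Ratee share of the variance in the mean of k raters.

    Assumption (not from the paper): rater and error components are
    independent across raters, so their variance shrinks by 1/k, while the
    ratee component is common to all raters and does not shrink.
    Perspective effects are ignored.
    """
    noise = (shares["rater"] + shares["error"]) / k
    return shares["ratee"] / (shares["ratee"] + noise)


if __name__ == "__main__":
    for name, shares in SHARES.items():
        print(name, "rater/ratee ratio:", round(rater_to_ratee_ratio(shares), 2))
        for k in (1, 2, 4, 8):
            print("  k =", k, "ratee share:", round(ratee_share_of_mean(shares, k), 2))
```

Under these assumptions the ratee signal first exceeds half the variance of the pooled rating at three raters for the second data set and four for the first. With a single rater it stays near a quarter in both.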
Source
Understanding the latent structure of job performance ratings
Journal of Applied Psychology · Steven E. Scullen et al. · 2000 · peer-reviewed
Context
- What came before
- Performance ratings were widely treated as a workable proxy for performance, with rater variance relegated to 'measurement error' to be minimized rather than understood as the dominant signal. The hope of AI raters inherits the same unexamined premise: that the human rating was a trustworthy gold standard.
- What comes next
- Sets up the inter-rater reliability figure (single-supervisor reliability ≈ .52, Viswesvaran/Ones/Schmidt 1996) and the attenuation ceiling. Cross-link to the LLM-rater cards (Young 2025, Ntinopoulos 2025) — the AI raters disagree for the same structural reason — and to the multi-rater / G-theory D-study fix.
- Where this lands
- Magazine: 'The Reliability Problem' §'The wall everyone hits' (footnote [^8]). Encyclopedia Part I (single-rater unreliability of human judgment) and Part II (variance decomposition / generalizability theory). Book 1 Unreliable, the human-failure lead case.
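The .52 single-supervisor reliability and the attenuation ceiling mentioned under 'What comes next' can be made concrete with two textbook formulas. This is a hedged sketch: the Spearman-Brown projection assumes raters are interchangeable (parallel) measures, which real raters only approximate, and the function names are invented for illustration.

```python
import math

SINGLE_RATER_RELIABILITY = 0.52  # figure cited on this card


def spearman_brown(r1, k):
    """Projected reliability of the mean of k parallel raters."""
    return k * r1 / (1 + (k - 1) * r1)


def attenuation_ceiling(criterion_reliability, predictor_reliability=1.0):
    """Largest observable correlation when the true correlation is 1.0."""
    return math.sqrt(criterion_reliability * predictor_reliability)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        pooled = spearman_brown(SINGLE_RATER_RELIABILITY, k)
        print("k =", k,
              "reliability:", round(pooled, 2),
              "ceiling:", round(attenuation_ceiling(pooled), 2))
```

With one rater at .52, even a perfect predictor could correlate only about .72 with the observed rating. Pooling raters raises both the reliability and that ceiling.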