peopleanalyst


analytics · Q7 · verified

Viswesvaran, Ones & Schmidt 1996 (J. Applied Psychology) — single-supervisor interrater reliability of overall job performance ≈ .52

A meta-analysis of job-performance rating reliabilities found the mean interrater reliability of supervisory ratings of overall job performance to be .52. In other words, ratings of the same employee by two different supervisors correlate at about .52, roughly one-half on a 0-to-1 reliability scale. Supervisory ratings were more reliable than peer ratings, and interrater reliability was uniformly lower than intrarater reliability.

Mean interrater reliability of single-supervisor ratings of overall job performance: .52. Supervisory > peer reliability; interrater reliability < intrarater reliability throughout. Corroborated: Conway & Huffcutt (1997) ≈ .50; Rothstein (1990) ≈ .55; Shen et al. (2014) confirm .52 as the best estimate. Updated meta-analyses revise it upward (Zhou et al. 2024 = .65; Speer et al. 2023 = .65, direct-supervisor designs).
Sample
Meta-analysis aggregating job-performance rating reliability studies (Viswesvaran et al. 1996); corroborating meta-analyses span 22 to 224 independent samples and tens of thousands of ratees.
Methodology
Psychometric meta-analysis of interrater and intrarater reliabilities across 10 performance dimensions plus overall job performance.

What this means

  • The canonical number for 'how reliable is one human rater of another human's performance' is about one-half. It is the empirical floor that the attenuation theorem then operates on: a measure at reliability .52 can correlate no higher than √.52 ≈ .72 with any real outcome, before bias enters.
  • It anchors the corrected thesis. AI raters that disagree are not falling short of a reliable human baseline; the single-human baseline was ≈ .52 to begin with. The honest comparison is AI-rater reliability beside this number, per task — not AI against an assumed-perfect human.
  • The live scholarly debate strengthens rather than weakens the program's point: Murphy & DeShon (2000) argue interrater correlations are not reliability at all because rater variance is systematic (not random error) — which is exactly Scullen et al.'s 53-62% idiosyncratic-rater finding, and exactly why generalizability theory (decompose the facets) is the right instrument rather than a single coefficient.
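
The attenuation ceiling mentioned in the first bullet can be checked with a few lines of arithmetic. This is a minimal sketch assuming the classical test theory relation r_observed = r_true × √(rel_x × rel_y); the function names are hypothetical, chosen for illustration, and the outcome is assumed to be measured without error unless stated otherwise.

```python
import math


def attenuation_ceiling(rel_x: float, rel_y: float = 1.0) -> float:
    """Largest observed correlation possible given two reliabilities.

    Classical attenuation: r_observed = r_true * sqrt(rel_x * rel_y).
    Setting r_true = 1 gives the ceiling. rel_y defaults to 1.0, i.e. the
    outcome is assumed (for illustration) to be perfectly reliable.
    """
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("reliabilities must lie in [0, 1]")
    return math.sqrt(rel_x * rel_y)


def disattenuate(r_observed: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Estimate the true-score correlation from an observed correlation."""
    ceiling = attenuation_ceiling(rel_x, rel_y)
    if ceiling == 0.0:
        raise ValueError("cannot correct when a reliability is zero")
    return r_observed / ceiling


if __name__ == "__main__":
    # Single-supervisor estimate from the card (.52) and the revised one (.65).
    for rel in (0.52, 0.65):
        print(f"reliability {rel:.2f} -> ceiling {attenuation_ceiling(rel):.2f}")
```

With the card's figures this gives a ceiling of about .72 at reliability .52 and about .81 at the revised .65. The ceiling reflects random measurement error only; systematic rater bias, the subject of the Murphy & DeShon objection, is not modeled here.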

Source

Comparative analysis of the reliability of job performance ratings

Journal of Applied Psychology · Chockalingam Viswesvaran et al. · 1996 · peer-reviewed

Context

What came before
Performance ratings were corrected for attenuation using intrarater reliabilities (a single rater rating twice), which overstate reliability; this meta-analysis established interrater reliability as the conceptually correct, and much lower, estimate.
What comes next
Feeds the attenuation ceiling (√.52 ≈ .72) and the multi-rater fix. Note the upward revision in newer meta-analyses (~.65) and the Murphy-DeShon dispute over whether interrater correlations estimate reliability at all — both belong in the encyclopedia validity entry. Cross-link to Scullen 2000 (variance decomposition) and the LLM-rater cards.
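
One common reading of "the multi-rater fix" is to average several raters and project the reliability of that average with the Spearman-Brown formula. The card does not name the formula, so treat this as an illustrative sketch: it assumes raters are interchangeable (parallel), which is the very assumption the Murphy-DeShon dispute challenges, and the function names are hypothetical.

```python
import math


def spearman_brown(single_rater_rel: float, k: int) -> float:
    """Projected reliability of the mean of k parallel raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    r = single_rater_rel
    return k * r / (1 + (k - 1) * r)


def raters_needed(single_rater_rel: float, target_rel: float) -> int:
    """Smallest number of parallel raters whose average reaches target_rel."""
    r, t = single_rater_rel, target_rel
    if not (0.0 < r < 1.0 and 0.0 < t < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Spearman-Brown solved for k: k = t(1 - r) / (r(1 - t)).
    return max(1, math.ceil(t * (1 - r) / (r * (1 - t))))


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} rater(s): projected reliability {spearman_brown(0.52, k):.2f}")
    print("raters needed for .80:", raters_needed(0.52, 0.80))
```

Under these assumptions, starting from .52, two raters project to about .68, three to about .76, and four to about .81, so four raters are needed to reach .80.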
Where this lands
Magazine: 'The Reliability Problem' §'The wall everyone hits' (footnote [^8], 'around one-half'). Encyclopedia Part I (single-rater unreliability) and Part II (reliability estimation, interrater vs intrarater). Book 1 Unreliable.