The People Analyst Guide to Nine Lies About Work. Format: what the book argues → what the research actually says → how you run it → the analysis you can run → the AI-era turn → what to do Monday. No reproduction of the book's text; the substance is ours. Research anchors verified on read.
What the book argues
The sixth lie is the one a people analyst should have flagged first: people can reliably rate other people. Almost every talent system assumes they can — performance ratings, 360s, calibration sessions, succession grids, interview scorecards all rest on the premise that if you ask a competent person to rate someone on a defined scale, the number means something about the person being rated. Buckingham and Goodall's claim is that it largely doesn't. When you decompose where the variance in a rating actually comes from, most of it turns out to be about the rater — their leniency, their severity, their private theory of what the words on the scale mean — not the person under review. They call it the idiosyncratic rater effect, and it is, for measurement people, the most important sentence in the book.
What the research actually says
This is one of the rare popular-management claims that is understated, not over-sold. The rater-effect finding is old, replicated, and uncomfortable. When multi-rater performance data are decomposed into their variance components (true target performance, rater idiosyncrasy, dimension, error), the rater component is consistently the largest single source, and the ratee's actual performance accounts for a minority share. Synthesizing that literature, Buckingham and Goodall land on roughly 60% of the variance in a rating reflecting the rater rather than the person rated. Against every intuition, more granular scales make it worse, because more detail gives idiosyncrasy more room to operate. (Anchors: Scullen, Mount & Goff 2000; Scullen, Mount & Judge 2003; the broader interrater-reliability tradition.)
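To make "decomposing the variance" concrete, here is a minimal sketch in Python. It assumes a fully crossed design in which every rater scores every ratee exactly once; the function name `variance_components` and the toy matrix are illustrative and do not come from the book or the cited studies. With only one rating per cell, the rater-by-ratee interaction (the idiosyncratic part) cannot be separated from random error, so both land in the residual, while the rater main effect captures overall leniency or severity.

```python
def variance_components(ratings):
    """Estimate variance shares for a ratee x rater matrix (one score per cell).

    ratings[i][j] is the score rater j gave ratee i.
    Returns the share of total variance attributed to ratee, rater, residual.
    """
    n_p = len(ratings)       # number of ratees
    n_r = len(ratings[0])    # number of raters
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least 2 ratees and 2 raters")

    grand = sum(sum(row) for row in ratings) / (n_p * n_r)
    ratee_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n_p for j in range(n_r)]

    # Sums of squares for a two-way layout without replication
    ss_p = n_r * sum((m - grand) ** 2 for m in ratee_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected mean squares; negative estimates are truncated to 0
    var_res = max(ms_res, 0.0)
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)

    total = var_p + var_r + var_res
    if total == 0:
        raise ValueError("ratings have no variance to decompose")
    return {
        "ratee": var_p / total,
        "rater": var_r / total,
        "residual": var_res / total,
    }


# Hypothetical data: 4 ratees (rows) scored by 3 raters (columns) on a 1-5 scale
toy_ratings = [
    [3, 4, 2],
    [4, 5, 3],
    [2, 4, 2],
    [5, 5, 4],
]
print(variance_components(toy_ratings))
```

Separating true rater idiosyncrasy from error needs replication, for example several dimensions or occasions per rater-ratee pair, which is the kind of design the studies cited above rely on.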
Here is the part the book gestures at and a people analyst has to make precise. "Unreliable" is not a mood; it is a measurable quantity. A century of psychometrics built the instruments to quantify exactly how much of a rating is signal and how much is rater: interrater agreement (Cohen's κ, Krippendorff's α), and — the right tool here — Generalizability theory, which apportions variance across raters, items, and occasions in one model and tells you what would actually raise reliability. The lie isn't that ratings are worthless; it's that organizations use them as if their reliability were known and high when it is usually unknown and low. The fix is not to stop measuring. It is to measure the measurement.
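As one illustration of measuring the measurement, Cohen's κ compares the agreement two raters actually reach (p_o) with the agreement expected by chance from their own rating frequencies (p_e): κ = (p_o − p_e) / (1 − p_e). The sketch below is a plain-Python version with a hypothetical function name and made-up labels, not a reference implementation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (nominal categories)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal rate, summed over categories
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two managers rating the same six people
manager_1 = ["meets", "meets", "exceeds", "below", "meets", "exceeds"]
manager_2 = ["meets", "exceeds", "exceeds", "below", "meets", "meets"]
print(round(cohens_kappa(manager_1, manager_2), 3))
```

Plain κ treats categories as unordered and handles only two raters. For ordinal rating scales or more raters, weighted κ or Krippendorff's α is the more usual choice.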
How you run it
Three moves, in order. First, estimate reliability before you trust a rating-based decision: run a G-study (or at minimum κ/α) on the rating data you already have; you will almost always find the rater-variance share is large. Second, reduce idiosyncrasy by construction: fewer, behaviorally anchored items; structure; rater training; and, where it matters, more raters whose disagreement you can model rather than average away. Third, report ratings with their reliability attached: a number without its error band is exactly the false precision this chapter warns about.
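The second and third moves can be sketched numerically. Assuming you already have variance-component estimates from a G-study (the values below are invented for illustration), the relative G coefficient for the mean of k raters is σ²_ratee / (σ²_ratee + σ²_residual / k), and the matching standard error of measurement is √(σ²_residual / k). The helper names are hypothetical, and the band assumes roughly normal measurement error.

```python
import math


def g_coefficient(var_ratee, var_residual, n_raters):
    """Relative G coefficient for the average of n_raters ratings."""
    if n_raters < 1 or var_ratee < 0 or var_residual < 0:
        raise ValueError("need n_raters >= 1 and non-negative variances")
    return var_ratee / (var_ratee + var_residual / n_raters)


def raters_needed(var_ratee, var_residual, target=0.80, max_raters=50):
    """Smallest number of raters whose average reaches the target coefficient."""
    for k in range(1, max_raters + 1):
        if g_coefficient(var_ratee, var_residual, k) >= target:
            return k
    return None  # target not reachable within max_raters


def error_band(mean_rating, var_residual, n_raters, z=1.96):
    """Approximate band around an averaged rating: mean +/- z * SEM."""
    sem = math.sqrt(var_residual / n_raters)
    return mean_rating - z * sem, mean_rating + z * sem


# Hypothetical variance components (not real data)
VAR_RATEE, VAR_RESIDUAL = 0.30, 0.70

for k in (1, 3, 5):
    print(k, "raters ->", round(g_coefficient(VAR_RATEE, VAR_RESIDUAL, k), 2))
print("raters needed for 0.80:", raters_needed(VAR_RATEE, VAR_RESIDUAL))
print("band for a mean of 3.6 from 3 raters:", error_band(3.6, VAR_RESIDUAL, 3))
```

The relative coefficient ignores rater leniency, which is only defensible when everyone is scored by the same raters. When different people are rated by different raters, as in most review cycles, the rater variance also belongs in the error term, and reliability will be lower than this sketch suggests.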
The analysis you can run
This is the reliability program, and it is the one chapter where the analysis already exists in the toolbox: an interrater-reliability estimation that takes your rating data (performance, 360, interview, or open-text codes) and returns the variance decomposition (how much is the person, how much is the rater) with κ / α / G-coefficients and the practical lever (more raters? fewer items? training?) that would move it. It runs in performance-validity with research-methods; the same machinery scores the agreement of any set of raters, human or otherwise.
The AI-era turn
The book is pre-LLM, and this is where it compounds. The moment you put an AI in the loop to read résumés, score interviews, or code open-text feedback, you have added a rater — and a rater is exactly the thing this chapter says is the dominant source of noise. An AI rater has its own idiosyncrasy (prompt-, model-, and run-dependent), and it is silent about it; it returns a confident number with no error band by default. The good news is that the solved problem stays solved: an AI is just another rater, so the same reliability theory — agreement across runs, across models, across humans — measures whether its scores are trustworthy. Don't trust an AI rating you haven't reliability-tested any more than a human one. (This is the spine of our Reliability Problem work: a century of psychometrics is the answer to AI's noisy-rater problem.)
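A minimal sketch of what "agreement across runs" can look like: treat each repeated run of the same prompt as a separate rater, then check how often the runs agree with each other and which items flip. The labels below are invented stand-ins for stored model outputs; no model API is called, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(runs):
    """Mean share of items on which two runs give the same label, over all run pairs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    n_items = len(runs[0])
    per_pair = [sum(x == y for x, y in zip(a, b)) / n_items for a, b in pairs]
    return sum(per_pair) / len(per_pair)


def unstable_items(runs):
    """Indices of items that received more than one distinct label across runs."""
    return [i for i, labels in enumerate(zip(*runs)) if len(set(labels)) > 1]


def modal_labels(runs):
    """Most frequent label per item across runs (ties go to the first label seen)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*runs)]


# Hypothetical output: three runs of one model coding five open-text comments
ai_runs = [
    ["pos", "neg", "neg", "pos", "neu"],
    ["pos", "neg", "pos", "pos", "neu"],
    ["pos", "neg", "neg", "pos", "neg"],
]
human_codes = ["pos", "neg", "neg", "pos", "neu"]  # hypothetical human coder

print("run-to-run agreement:", round(pairwise_agreement(ai_runs), 3))
print("items that flip across runs:", unstable_items(ai_runs))
consensus = modal_labels(ai_runs)
matches = sum(c == h for c, h in zip(consensus, human_codes))
print("consensus vs human:", matches, "of", len(human_codes))
```

Raw agreement is not corrected for chance, so before a decision rides on the scores a chance-corrected coefficient such as κ or α, computed across runs, across models, and against human raters, is the safer summary.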
What to do Monday
- Pull last cycle's ratings and run the reliability estimate — see your own rater-variance share before you defend the system.
- Stop adding scale detail to "fix" ratings; it makes the rater effect worse. Add structure and raters instead.
- Wherever an AI is rating, scoring, or coding, reliability-test it like a human rater — agreement across runs and against people — before any decision rides on it.
- Attach an error band to every rating-derived number you report. A rating without its reliability is a guess wearing a uniform.