The PeopleAnalyst Guide to Work Rules · Ch 07

Why Everyone Hates Performance Management

What Bock argues

The chapter's claim is that performance management is hated because we make one ritual carry two jobs that fight each other. The annual review tries to be a development conversation (here's how to grow) and a judgment conversation (here's your number, and your pay rides on it) at the same time — and the judgment half poisons the development half, because no one hears coaching through the noise of being graded. Bock's fixes: separate the two conversations; set goals with OKRs (specific, visible, ambitious); calibrate ratings across managers so a "3" means the same thing on two different teams; and stop letting the annual event eat the year. Lighten the ritual, and make the parts that remain actually measure something.

Right instinct, and the research turns it from preference into proof — with a sharper edge than Bock puts on it, because the deepest problem isn't the ritual. It's the instrument.

What the research actually says (and where 2015 needs an update)

Start with the instrument, because it is the part everyone skips. A performance rating is a measurement: one observer watches a person across a year and emits a number. So ask the measurement question: where does that number actually come from? When Scullen, Mount, and Goff (2000) decomposed a large body of performance ratings, the single largest source of variance was not the employee's performance. It was the idiosyncrasy of the rater, the manager's personal, repeatable way of seeing, which accounted for more of the final score than the work the score was supposed to measure. The inter-rater reliability of a single supervisor's rating lands at about .52 (the meta-analytic estimate from Viswesvaran, Ones, and Schmidt, 1996), roughly a coin that has overheard a rumor. That is not a story about lazy managers; it is the structural fact that one human rating another is a single rater, and a single rater is a noisy instrument. (This is the spine of The Reliability Problem and Unreliable Ch 3: the same finding, here applied to the review.)
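To make the .52 concrete, here is a minimal sketch of what happens when you average more than one rater. It uses the Spearman-Brown prophecy formula, which assumes the raters are parallel (equally reliable, with independent errors); real managers and peers will only approximate that, so treat the numbers as an optimistic guide rather than a forecast. The function and variable names are illustrative.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Reliability of the mean of n parallel raters, given one rater's reliability."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# The single-supervisor figure quoted in the text.
SINGLE_SUPERVISOR = 0.52

# Assumption: added raters are as reliable as the supervisor and err independently.
panel_reliability = {
    k: round(spearman_brown(SINGLE_SUPERVISOR, k), 2) for k in range(1, 5)
}
print(panel_reliability)  # {1: 0.52, 2: 0.68, 3: 0.76, 4: 0.81}
```

Under those assumptions it takes about four readers to push the averaged score past .80, which is why the single rater is the expensive default to leave in place.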

That reframes Bock's reforms as exactly the right psychometric moves, even though he doesn't name them that way. Calibration is an attempt to subtract each manager's personal equation, the stable offset Unreliable Ch 7 is named for. A multi-rater view (peers, skip-level managers, self, alongside the manager) is the panel that Unreliable Ch 8 prescribes: the single rater is the disease; reading the spread is the cure. And separating development from judgment is backed by the most important feedback result there is: Kluger and DeNisi (1996), meta-analyzing 607 effect sizes from 131 studies, found that feedback interventions reduced performance in roughly a third of cases, most reliably when the feedback aimed at the self (your rating, your worth) rather than the task. The grade doesn't just fail to help; a meaningful share of the time it actively hurts, by pulling attention onto the ego. Bock's "separate the conversations" is the operational form of "keep feedback on the task, off the self."
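The "personal equation" idea can be sketched in a few lines: estimate each manager's stable offset from the overall mean and subtract it. This is a deliberately naive illustration, not how a calibration session actually works. It assumes every team has the same true average performance, so any difference between manager means is treated as leniency or severity; the manager and employee names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def manager_offsets(ratings):
    """ratings: list of (manager, employee, score).
    Returns each manager's mean rating minus the grand mean."""
    by_manager = defaultdict(list)
    for manager, _employee, score in ratings:
        by_manager[manager].append(score)
    grand = mean(score for _, _, score in ratings)
    return {m: mean(scores) - grand for m, scores in by_manager.items()}


def calibrate(ratings):
    """Subtract each manager's offset from the scores that manager gave."""
    offsets = manager_offsets(ratings)
    return [(m, e, score - offsets[m]) for m, e, score in ratings]


# Hypothetical data: one lenient manager, one strict manager.
ratings = [
    ("lenient_mgr", "a", 4.0), ("lenient_mgr", "b", 5.0),
    ("strict_mgr", "c", 2.0), ("strict_mgr", "d", 3.0),
]
offsets = manager_offsets(ratings)
calibrated = calibrate(ratings)
print(offsets)     # {'lenient_mgr': 1.0, 'strict_mgr': -1.0}
print(calibrated)  # a and c both land on 3.0; b and d both land on 4.0
```

The equal-teams assumption is the weak point: if one team really is stronger, this subtraction erases a true difference. That is the argument for calibrating with evidence in the room rather than by formula alone.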

On the goals half: goal-setting theory (Locke & Latham) is one of the most replicated findings in the field — specific, challenging goals with commitment and feedback beat "do your best." OKRs are a goal-setting instrument, and the research backs them with a caveat the cheerleaders skip: goals tunnel attention, so a measured goal can crowd out the unmeasured-but-important and invite gaming. Goals are a sharp tool; sharp tools cut both ways.

Where 2015 needs the update: the noisy-rater problem now has a second rater walking into the room. AI-assisted performance ratings and "calibration copilots" are arriving — and they are raters too, subject to the identical discipline. Used well, AI is a cheap second (and third) reader that finally makes the multi-rater panel affordable, and a debiasing aid that can flag a manager whose distribution is drifting. Used naively — one model, one pass, treated as objective — it is just a faster way to ship a single noisy rater with a confident face. Same wall, same fix.

How you run it

Three measurements, then read the gaps.

The analysis you can execute

Almost none of this is net-new work: it is the reliability / inter-rater program (the same Consensus Coder / G-theory machinery behind The Reliability Problem) pointed at performance ratings, plus OKR tracking. Decompose the variance, surface the rater effect per manager, and price the calibration with a D-study (a decision study: how many raters it takes to reach the reliability this decision deserves), with every estimate gated on a minimum sample size (min-N). The headline output is the rater-effect share, the fraction of your performance scores that is actually about the person holding the pen. Most leaders have never once seen that number for their own org.
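A minimal sketch of the decompose-then-price step, under assumptions that should be stated loudly: a fully crossed design (every rater scores every person, one score per cell), which real organizations rarely have; ANOVA-based variance components; and an absolute-decision coefficient (Phi) for the D-study. The data, the `min_persons` gate, and all function names are hypothetical, and this is not the Consensus Coder implementation.

```python
def variance_components(scores, min_persons=2):
    """scores[i][j] = rating of person i by rater j (fully crossed, no missing cells)."""
    n_p, n_r = len(scores), len(scores[0])
    # Crude min-N gate: refuse to estimate on too little data.
    # A production threshold would be far higher than this toy default.
    if n_p < max(2, min_persons) or n_r < 2:
        raise ValueError("too few persons or raters to estimate components")
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum(
        (scores[i][j] - p_means[i] - r_means[j] + grand) ** 2
        for i in range(n_p) for j in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),   # true differences between people
        "rater": max((ms_r - ms_res) / n_p, 0.0),    # rater leniency / severity
        "residual": ms_res,                          # person-by-rater interaction + noise
    }


def phi(components, n_raters):
    """Absolute-decision reliability of the mean of n_raters ratings."""
    error = (components["rater"] + components["residual"]) / n_raters
    return components["person"] / (components["person"] + error)


def raters_needed(components, target, max_raters=50):
    """D-study: smallest panel size whose Phi reaches the target, or None."""
    for n in range(1, max_raters + 1):
        if phi(components, n) >= target:
            return n
    return None


# Hypothetical ratings: 4 people (rows) scored by 3 raters (columns).
scores = [[2, 3, 4], [3, 4, 4], [3, 5, 5], [4, 5, 6]]
vc = variance_components(scores)
total = sum(vc.values())
shares = {name: round(v / total, 2) for name, v in vc.items()}
print(shares)                   # {'person': 0.43, 'rater': 0.48, 'residual': 0.09}
print(raters_needed(vc, 0.80))  # 6
```

In this toy data the rater main effect alone outweighs the person effect, and the D-study prices a .80 decision at six raters. With one score per cell the residual cannot separate person-by-rater idiosyncrasy from pure noise, so the "rater" share here is a floor on the total rater effect, not the whole of it.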

The AI-era turn

If you let AI rate or calibrate performance, treat it as what it is: another rater, with a reliability you must measure, not assume. Its gift is cheapness — the multi-rater panel and the calibration check used to cost more manager-hours than anyone would pay, and now they don't. Its trap is authority — a single model's confident score is still a single rater's score, and shipping it as "objective" repeats the oldest mistake in the file. Engineer the reliability in (panel, calibration, anchor), or inherit the single-rater disease at machine speed.
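One cheap way to start measuring an AI rater instead of assuming it: score the same evidence several times (repeated passes of one model, or several models) and check how well the passes agree before trusting any single one. The sketch below works on hypothetical, already-collected scores and calls no real model API. It uses mean pairwise Pearson correlation, which captures consistency of rank and spacing but ignores constant offsets, so a pass that is uniformly one point harsher would still look "reliable" by this measure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_agreement(passes):
    """Average correlation over every pair of passes (each pass = one rater)."""
    pairs = list(combinations(passes, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def panel_scores(passes):
    """Per-employee mean across passes: the panel score, not any single pass."""
    return [sum(col) / len(col) for col in zip(*passes)]


# Hypothetical: three passes scoring the same five employees.
passes = [
    [3, 4, 2, 5, 3],
    [3, 5, 2, 4, 3],
    [4, 4, 1, 5, 3],
]
agreement = mean_pairwise_agreement(passes)
panel = panel_scores(passes)
print(round(agreement, 2))  # 0.82
```

If the agreement number is low, the single-pass score is the noisy rater the text warns about. If it is high, ship the panel mean anyway, and still check the panel against human raters, because passes of one model can agree with each other while sharing the same bias.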

What to do Monday

  1. Pull last cycle's ratings and estimate the rater-effect share — even a crude decomposition. Show leadership how much of the "performance" number is the manager. It changes the conversation.
  2. Split the two meetings. Run the development conversation on a different day from the rating/pay conversation, and keep development feedback on the task, not the person (Kluger & DeNisi).
  3. If you rate, calibrate from the distribution data and add at least one more reader where the stakes justify it. The panel is the feature, not the inefficiency.
  4. Before adopting any AI rating/calibration tool, ask the reliability question first: what's its measured reliability, and is it one rater or a panel? If it's one model, one pass — it's a single noisy rater wearing a badge.
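For step 1, a crude decomposition is enough to start the conversation. The sketch below assumes the usual real-world layout: each employee is rated by exactly one manager, so the design is nested rather than crossed. It estimates the between-manager variance share with a one-way random-effects ANOVA. Two loud caveats: this share mixes manager leniency with genuine differences between teams, and it misses the idiosyncratic part of the rater effect that hides inside the within-manager residual. The data, the `min_n` gate, and the function name are hypothetical.

```python
def manager_share(ratings_by_manager, min_n=3):
    """ratings_by_manager: dict of manager -> list of scores that manager gave.
    Returns the estimated share of rating variance lying between managers."""
    # min-N gate: drop managers with too few ratings to say anything about.
    groups = [g for g in ratings_by_manager.values() if len(g) >= min_n]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("not enough managers or ratings after the min-N gate")
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size; equals the common size when groups are balanced.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    var_manager = max((ms_between - ms_within) / n0, 0.0)
    return var_manager / (var_manager + ms_within)


# Hypothetical last-cycle ratings.
last_cycle = {
    "mgr_a": [4, 5, 4, 5],
    "mgr_b": [2, 3, 3, 2],
    "mgr_c": [3, 4, 3, 4],
}
share = manager_share(last_cycle)
print(round(share, 2))  # 0.73
```

A number like this is a prompt for the leadership conversation, not a verdict. Separating leniency from real team differences requires overlapping raters, which is what adding a second reader in step 3 buys you.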

Cross-refs: content/magazine/the-reliability-problem.md (the spine); Book 1 Unreliable Ch 3 (the rater in the mirror), Ch 7 (the personal equation = calibration), Ch 8 (the panel), Ch 9 (the D-study = pricing the calibration); CHAPTER-MAP.md Ch 5 (selection) and Ch 8 (the two tails) share the reliability machinery.