peopleanalyst


A raw benchmark compares you to a population that differs on everything, then hands you the gap as if it were about the one thing you care about. The percentile is mostly a measurement of who you stood next to.

By Mike West

June 22, 2026

The Benchmark Trap

In the fall of 1973 the University of California, Berkeley appeared to have a problem. Across the graduate school, men were admitted at a noticeably higher rate than women, the kind of gap that ends up in a lawsuit. So the university did the responsible thing and looked closer, department by department. And the gap didn't just shrink. It reversed. Few departments showed a significant tilt either way, and once the departmental results were pooled properly, the small bias that remained favored women. The aggregate said bias against women; the parts said the opposite.¹

Both numbers were correct. Women had applied in larger numbers to the most competitive departments (the ones that rejected nearly everyone, of either sex), and that difference in where they applied swamped the within-department picture and flipped the overall rate. The famous name for this is Simpson's paradox, and the lesson under it is bigger than any one statistic: a comparison between two groups that differ in their composition is measuring the composition, not the thing you meant to compare.²
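The arithmetic is easier to believe once you watch it happen. The sketch below uses invented numbers, not the Berkeley figures: two hypothetical departments, one easy to enter and one selective. The `applications` table, the department names, and every count in it are assumptions made up for illustration.

```python
# Illustrative numbers only (NOT the Berkeley data).
# department -> group -> (applicants, admitted)
applications = {
    "easy":      {"men": (800, 480), "women": (200, 130)},
    "selective": {"men": (200, 20),  "women": (800, 120)},
}


def rate(applicants, admitted):
    """Admission rate for one group in one department."""
    return admitted / applicants


def aggregate_rate(group):
    """Admission rate for a group with all departments lumped together."""
    total_applicants = sum(dept[group][0] for dept in applications.values())
    total_admitted = sum(dept[group][1] for dept in applications.values())
    return rate(total_applicants, total_admitted)


# Within each department, compare the two groups directly.
for name, dept in applications.items():
    print(name, {group: round(rate(*counts), 3) for group, counts in dept.items()})

# Then compare them with the departments pooled naively.
print("aggregate", {g: round(aggregate_rate(g), 3) for g in ("men", "women")})
```

In this toy table women are admitted at the higher rate in each department (65% against 60%, and 15% against 10%), yet the pooled figures show men at 50% and women at 25%, because four in five of the women applied to the selective department.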

This is the trap built into nearly every benchmark a company runs.

They say: benchmark against the market

The instinct is everywhere because it feels like rigor. How do we compare? Pull the peer set, find the median, place yourself on the percentile curve. You pay at the 45th percentile of market — you're behind. Your turnover is above the industry benchmark — you have a retention problem. Your engagement score is four points under the norm — morale is slipping. The benchmark arrives as a single clean number with a direction attached, and the direction tells you what to do.

The number feels like a measurement of you. It is mostly a measurement of who you happened to be standing next to.

A percentile of what, exactly

Here is the principal issue, stated plainly. A raw benchmark compares your population to some other population that differs from yours on nearly everything that matters — and then hands you the difference as if it were about the one variable you care about. Your pay sits "below market" — but your workforce is younger, more junior, more concentrated in lower-cost cities than the peer set the median came from. The gap you're looking at is your role mix and geography, not your pay practice. A company stuffed with entry-level roles will always look underpaid against a senior-heavy comparison, and a company in three expensive metros will always look generous, and neither number is telling you whether you pay fairly for the work.
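A small sketch makes the point concrete. Every figure here is hypothetical: the pay levels in `pay_by_level` and the two headcount mixes are invented, and averages stand in for the medians and percentiles a real benchmark would report. The company pays exactly the market rate at every level; only the mix differs.

```python
# Hypothetical pay per level. The company and the market pay the SAME
# amount at each level, so there is no difference in pay practice.
pay_by_level = {"junior": 70_000, "mid": 100_000, "senior": 150_000}

# Hypothetical headcount shares. Only the composition differs.
company_mix = {"junior": 0.60, "mid": 0.30, "senior": 0.10}
market_mix = {"junior": 0.30, "mid": 0.40, "senior": 0.30}


def average_pay(mix, pay):
    """Headcount-weighted average pay for a given level mix."""
    return sum(share * pay[level] for level, share in mix.items())


company_avg = average_pay(company_mix, pay_by_level)
market_avg = average_pay(market_mix, pay_by_level)
raw_gap = company_avg / market_avg - 1

print(f"company average: {company_avg:,.0f}")
print(f"market average:  {market_avg:,.0f}")
print(f"raw gap:         {raw_gap:.1%}")
```

With these invented inputs the raw comparison puts the company roughly 18% "below market" even though it matches the market dollar for dollar at every level. The whole gap is role mix.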

The same problem runs through every domain that HR benchmarks. Your turnover is "high for the industry," but turnover is driven hard by occupation, geography, and tenure mix, and a benchmark that ignores them tells you about your industry's labor market, not your management. Your engagement is "below the norm," but published norms blend demographics, tenures, and functions that look nothing like yours. You are comparing apples not to apples but to a fruit basket, and reporting the difference as a verdict on the apple.

Adjust, or you're not measuring what you think

The fix is not to stop benchmarking. It is to compare like with like — to hold constant the things that differ before you read the gap that's left.

At its simplest that's stratification: don't compare your overall pay to the market median, compare your software engineers in Austin to the market's software engineers in Austin, role by role, place by place, and see where the real gaps are. Done properly it's a multivariate adjustment — a regression that controls for role, level, geography, tenure, industry, size, all at once — so what comes out the other side is the difference that remains after the composition is accounted for. That residual is interpretable. It is a statement about you. The raw percentile never was.
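Here is a minimal sketch of the stratified version, under loud assumptions: the `Cell` record, the three role-and-city cells, and all pay and headcount figures are invented for illustration. It reweights the market to the company's own mix (direct standardization), which is a simpler stand-in for the regression described above, not a replacement for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One like-for-like stratum: a role in a city (hypothetical data)."""
    role: str
    city: str
    company_headcount: int
    company_pay: float  # company average pay in this cell
    market_headcount: int
    market_pay: float   # market average pay in this cell


cells = [
    Cell("software engineer", "Austin", 60, 105_000, 200, 100_000),
    Cell("senior engineer", "Austin", 10, 140_000, 400, 160_000),
    Cell("support analyst", "Austin", 30, 62_000, 100, 60_000),
]


def weighted_avg(pairs):
    """Average of values weighted by headcount; pairs are (value, weight)."""
    pairs = list(pairs)
    total_weight = sum(weight for _, weight in pairs)
    return sum(value * weight for value, weight in pairs) / total_weight


company_avg = weighted_avg((c.company_pay, c.company_headcount) for c in cells)

# Raw benchmark: the market average at the market's own mix.
raw_market_avg = weighted_avg((c.market_pay, c.market_headcount) for c in cells)

# Adjusted benchmark: what the market would pay for the company's mix.
matched_market_avg = weighted_avg((c.market_pay, c.company_headcount) for c in cells)

raw_gap = company_avg / raw_market_avg - 1
adjusted_gap = company_avg / matched_market_avg - 1

# Cell-by-cell gaps show where the remaining difference actually sits.
cell_gaps = {(c.role, c.city): c.company_pay / c.market_pay - 1 for c in cells}

print(f"raw gap:      {raw_gap:.1%}")
print(f"adjusted gap: {adjusted_gap:.1%}")
for key, gap in cell_gaps.items():
    print(key, f"{gap:.1%}")
```

On these made-up numbers the raw gap is about 26% below market, the mix-adjusted gap is within two points of parity, and the one cell that genuinely lags is the senior engineers, at 12.5% under. A regression does the same job when there are too many controls to cross into cells by hand.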

This is the difference between a benchmark that's a number and a benchmark that's an argument you can act on. The adjusted version will often say something more useful and less dramatic than the headline: overall you're at parity once role mix is controlled — but your senior engineering band lags the market badly, and that's where your regretted attrition is. The raw number said "45th percentile, underpaid." The adjusted number told you the one place to spend.

Why the clean number keeps winning anyway

It costs something to do this, and the cost explains why the trap persists. "You're at the 45th percentile" fits on a slide and points at an action. "After adjusting for role mix, level, and geography, aggregate pay is at parity but the senior band sits at the 30th percentile" does not fit on a slide, can't be reduced to a single arrow, and requires the room to hold more than one idea at once. The clean number wins the meeting. The adjusted number wins the decision. Most benchmarking is optimized for the meeting.

And the temptation is getting cheaper to indulge, because pulling a peer median and a percentile takes seconds now, while doing the adjustment still takes thought about what actually differs and why. Speed makes the raw comparison even more seductive and no more correct.

So before you act on a benchmark — before you fund the raise, launch the retention program, or panic about the engagement dip — ask the Berkeley question: a percentile of what, exactly? If the comparison didn't hold constant the things that obviously differ between you and your peer set, you don't have a measurement of your practice. You have a measurement of your composition, wearing a percentile like a verdict. A benchmark you didn't adjust isn't telling you about you.


Measurement-first method, useful whether or not you ever work with us. The adjusted-comparison discipline — comparing like with like and reporting the gap that survives the controls — is the whole point of a multivariate adjusted index rather than a raw-median benchmark; it's the posture behind the Principia measurement program and the compensation work in the portfolio. Companion to Correlation Isn't a Driver, its sibling trap. Every footnote names a real, checkable work.

Footnotes

  1. P. J. Bickel, E. A. Hammel & J. W. O'Connell, "Sex Bias in Graduate Admissions: Data from Berkeley," Science 187, no. 4175 (1975): 398–404. The aggregate admissions rate favored men, but few departments showed a statistically significant bias in either direction, and the properly pooled departmental data showed a small but significant bias in favor of women; the misleading aggregate was driven by women applying disproportionately to departments that were harder for applicants of either sex to enter.

  2. E. H. Simpson, "The Interpretation of Interaction in Contingency Tables," Journal of the Royal Statistical Society, Series B 13, no. 2 (1951): 238–241 — the formal statement that an association present in aggregated data can disappear or reverse once the data are broken out by a relevant third variable.
