The Law of Small Numbers
The dashboard flags a team in red. Engagement on Maria's team of seven fell eighteen points this quarter — the biggest drop in the org — and now there is a meeting about it, and Maria is in the meeting, explaining herself. She doesn't have a good explanation, because there isn't one. One person on her team had a miserable quarter and rated everything at the bottom of the scale, and another left and didn't respond at all. That is the entire eighteen-point "collapse": one bad mood and one empty seat, divided by seven.
Across the building, the team that rose the most this quarter will be praised, and most of that rise is the same thing running the other way. The dashboard has taken sampling noise, painted it red and green, attached arrows, and handed each one to a manager as a verdict on their leadership. Everyone in the meeting is reacting, earnestly, to a coin flip with a color.
They say track every team's trend
The modern people-analytics dashboard is built to slice. Every team, every quarter, every cut — engagement by manager, by location, by tenure band — each cell with its own number and its own little arrow. Rank the teams. Find the movers. Coach the bottom decile. It feels like getting closer to the truth, because more granularity looks like more resolution.
But resolution and reliability are not the same thing, and the smaller you slice, the further apart they get. A number from a team of seven is not a high-resolution version of the org number. It is a low-reliability one, and the dashboard shows it at full confidence anyway.
Small samples don't behave like big ones
Here is the core issue, and it is one of the most reliably documented facts about how people misread data. We expect small samples to look like the population they came from, and they don't, not even close. Tversky and Kahneman named this the belief in the law of small numbers: even trained researchers routinely treat a handful of observations as if it carried the stability of a large one, reading a few data points as a pattern when they are mostly chance.[1] A fair coin flipped seven times comes up five-and-two or more lopsided about 45 percent of the time; nobody concludes the coin is broken. Seven responses on an engagement item swing just as wildly, and we convene a meeting.
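The coin-flip figure is easy to check directly. The following is a minimal sketch in Python; the function name is illustrative, and it assumes a fair coin and independent flips.

```python
from math import comb


def prob_split_at_least(n: int, k: int) -> float:
    """Probability that n fair flips show at least k heads or at least k tails.

    Assumes k > n / 2, so the "mostly heads" and "mostly tails" outcomes
    cannot overlap and their probabilities can simply be added.
    """
    if not n / 2 < k <= n:
        raise ValueError("k must satisfy n/2 < k <= n")
    # Count outcomes with k or more heads, then double for the tails side.
    one_sided = sum(comb(n, heads) for heads in range(k, n + 1))
    return 2 * one_sided / 2 ** n


# Seven flips, split of five-and-two or more lopsided in either direction.
print(f"{prob_split_at_least(7, 5):.3f}")  # 0.453
```

Nearly half of all seven-flip runs look at least that uneven, with nothing wrong with the coin.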
The math is not subtle. A proportion estimated from seven people carries an enormous margin of error: the honest confidence interval around it can span half the scale.[2] The same smallness is why a single response moves the average so far, since each person is a seventh of the score. The right picture of a small team's score is not a point. It is a wide band, and most quarter-to-quarter "movement" lives entirely inside that band: not a change in the team, a wobble in the estimate.
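To make the width of that band concrete, here is a rough sketch of the Wilson score interval from the second footnote. It assumes the engagement score is treated as a proportion of favorable responses and uses a 95 percent level (z = 1.96); both are assumptions for illustration, not something the dashboard in the story is known to do.

```python
from math import sqrt


def wilson_interval(favorable: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= favorable <= n:
        raise ValueError("need n > 0 and 0 <= favorable <= n")
    p = favorable / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Four favorable responses out of a team of seven.
low, high = wilson_interval(4, 7)
print(f"{low:.2f} to {high:.2f}")  # 0.25 to 0.84

# The same rate measured on 700 people, for comparison.
low, high = wilson_interval(400, 700)
print(f"{low:.2f} to {high:.2f}")  # 0.53 to 0.61
```

The team-of-seven interval covers more than half the scale, while one respondent changing their answer moves the point estimate by 1/7, about 14 points. A swing of that size sits comfortably inside the band.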
And there is a second trap waiting for anyone who acts on the extremes. The worst-looking teams this quarter and the best-looking ones are extreme partly because they got unlucky or lucky, and luck doesn't repeat, so next quarter they will tend to drift back toward the middle on their own. Galton documented this in 1886 as regression toward mediocrity, now known as regression to the mean.[3] It means that if you intervene on the bottom teams, most of them will probably improve next quarter regardless of what you did, and you will credit the intervention for a bounce that was coming anyway. The leaderboard manufactures both the crisis and the cure.
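A small simulation shows the bounce appearing with no intervention at all. This is a sketch under deliberately artificial assumptions: every team has the same true favorable rate, nothing changes between quarters, and the team count, team size, rate, and seed are all made-up parameters.

```python
import random


def simulate_bottom_decile(num_teams: int = 1000, team_size: int = 7,
                           true_rate: float = 0.6, seed: int = 0):
    """Score identical teams in two quarters and follow quarter one's bottom decile.

    Returns (mean Q1 score, mean Q2 score, share that improved) for the
    teams that ranked in the bottom tenth in the first quarter.
    """
    rng = random.Random(seed)

    def quarter_score() -> float:
        # Each respondent is independently favorable with probability true_rate.
        favorable = sum(rng.random() < true_rate for _ in range(team_size))
        return favorable / team_size

    q1 = [quarter_score() for _ in range(num_teams)]
    q2 = [quarter_score() for _ in range(num_teams)]

    # Pick the worst-looking tenth of teams using quarter-one scores only.
    ranked = sorted(range(num_teams), key=lambda i: q1[i])
    bottom = ranked[: num_teams // 10]

    mean_q1 = sum(q1[i] for i in bottom) / len(bottom)
    mean_q2 = sum(q2[i] for i in bottom) / len(bottom)
    improved = sum(q2[i] > q1[i] for i in bottom) / len(bottom)
    return mean_q1, mean_q2, improved


mean_q1, mean_q2, improved = simulate_bottom_decile()
print(f"bottom decile: Q1 {mean_q1:.2f}, Q2 {mean_q2:.2f}, improved {improved:.0%}")
```

Because the teams are identical by construction, any recovery in the bottom decile is pure regression to the mean. A real intervention would have to beat that baseline, not the unlucky starting score.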
Small-N honesty
The fix is not to stop measuring small teams. It is to measure them honestly, which mostly means letting them stay uncertain instead of forcing them to be precise.
Put the error bar on, and show it. A team-of-seven number that arrives with its band visible tells a manager the truth at a glance: this could be anywhere in here, so don't over-read the dot. Set a floor below which you don't publish a cell at all, not only for privacy, but because below some n there is simply no signal to report, and a greyed-out "not enough data to say" is more honest than a number. Don't rank tiny teams against each other; the ranking is mostly noise sorted into a column. Expect regression to the mean and refuse to credit or blame it: judge an intervention against what the extreme group would have done anyway, not against its unlucky starting point. And when you genuinely need an answer for a small group, pool it, widen the window, or wait for more data, rather than acting on one quarter of seven.
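As one possible shape for these rules, the sketch below combines a reporting floor, a visible interval, and pooling across quarters. Everything named here is hypothetical: the floor of 10, the `report_cell` helper, and the quarterly counts are illustrative, and pooling quarters this way treats responses as independent, which repeated respondents are not.

```python
from math import sqrt

MIN_N = 10  # hypothetical reporting floor; choose your own


def wilson_interval(favorable: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    p = favorable / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def report_cell(favorable: int, n: int, min_n: int = MIN_N) -> str:
    """Return a publishable label: suppressed below the floor, banded above it."""
    if n < min_n:
        return "not enough data to say"
    low, high = wilson_interval(favorable, n)
    return f"{favorable / n:.0%} (95% interval {low:.0%} to {high:.0%}, n={n})"


# One quarter of a seven-person team falls below the floor.
print(report_cell(4, 7))  # not enough data to say

# Widening the window: pool four quarters of made-up (favorable, n) counts.
quarters = [(4, 7), (5, 7), (3, 6), (5, 7)]
pooled_favorable = sum(f for f, _ in quarters)
pooled_n = sum(n for _, n in quarters)
print(report_cell(pooled_favorable, pooled_n))
```

Even the pooled cell comes back with a wide band, which is the point: the label carries its own uncertainty, and there is no bare number to rank.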
The honest output for a small team is frequently a shrug with a confidence interval. That is not the analysis failing. That is the analysis declining to make something up.
Why it's worth raising your voice about
This isn't fussiness, because the small cells are where the real damage gets done. The org-level engagement number is built on thousands of responses and is fairly stable; nobody's career turns on it. It's the team number — the one slice small enough to point at a specific manager — that is simultaneously the least reliable and the most consequential. We take the noisiest measurement in the whole system and attach a person's reputation to it. Maria spends a quarter managing to a fluctuation, and the genuinely struggling team three floors up sits at a quiet, stable, unremarkable score and never trips an arrow.
And the slicing only gets cheaper. Every tool will now break any metric into ever-finer cells in an instant, each rendered at the same confident saturation as the org roll-up. The granularity is free; the reliability is not, and nothing about the speed changes the statistics. A team of seven told you almost nothing last year and tells you almost nothing now, faster.
So before you act on the reddest cell on the board, ask how many people are actually in it, and what the error bar around that number would look like if anyone had drawn it. If it's a team of seven, you are most likely looking at weather, not climate. The smallest cells make the most dramatic charts and carry the least information. A number you can't put an honest error bar around isn't a finding. It's a coin flip wearing a color.
Measurement-first method, useful whether or not you ever work with us. Small-N honesty — visible error bars, reporting floors, no ranking of tiny cells, regression-to-the-mean discipline — is a posture, not a feature; it runs through the Principia measurement program and pairs with The Error Bar Is the Product (how much to trust a number) and its sibling traps, Correlation Isn't a Driver and The Benchmark Trap. Every footnote names a real, checkable work.
Footnotes
1. Amos Tversky & Daniel Kahneman, "Belief in the Law of Small Numbers," Psychological Bulletin 76, no. 2 (1971): 105–110. Even sophisticated researchers expect small samples to be representative of the population and over-read short runs as patterns.
2. Edwin B. Wilson, "Probable Inference, the Law of Succession, and Statistical Inference," Journal of the American Statistical Association 22, no. 158 (1927): 209–212. The score interval for a proportion, which (unlike the normal approximation) stays sensible at small n and makes plain how wide the uncertainty is when the sample is tiny.
3. Francis Galton, "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute 15 (1886): 246–263. Extreme observations tend to be followed by ones closer to the mean, the origin of "regression to the mean."