Significance on Demand
The survey team has good news and a slide to prove it. Buried in this year's engagement data is a finding: remote employees in the Western region, under thirty, in their first two years, are significantly less engaged than everyone else — p < .05, a real result, flagged in red. By the next planning cycle there is a working group, a targeted intervention, and a line in the budget aimed squarely at young remote Westerners.
Here is what the slide didn't say. To find that result, the team had crossed forty survey items against a dozen demographic cuts and a handful of regions and tenure bands — several hundred comparisons in all. At a p < .05 threshold, one in twenty pure-noise comparisons clears the bar by chance. Run five hundred of them and you will harvest roughly twenty-five "significant" findings from data with nothing in it at all. The young remote Westerners didn't pop because something is wrong on their teams. They popped because the team went looking, and a big enough search always finds something.
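The arithmetic above can be checked by simulation. What follows is a minimal sketch, not the survey team's actual analysis: it assumes every comparison pits two groups of 30 respondents against each other, that both groups are drawn from the same distribution (so there is truly nothing to find), and that a simple two-sided z-test with known variance stands in for whatever test the platform runs. The function names and group size are illustrative.

```python
import random
from statistics import NormalDist, fmean

def null_comparison_p_value(rng, n_per_group=30):
    """Two-sided z-test p-value for two groups drawn from the SAME distribution."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    # Known unit variance, so the standard error of the difference is sqrt(2/n).
    z = (fmean(group_a) - fmean(group_b)) / (2.0 / n_per_group) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def count_false_positives(n_comparisons=500, alpha=0.05, seed=0):
    """Count how many pure-noise comparisons clear the significance bar."""
    rng = random.Random(seed)
    return sum(null_comparison_p_value(rng) < alpha for _ in range(n_comparisons))

if __name__ == "__main__":
    # Twenty simulated "surveys", each slicing pure noise 500 ways.
    counts = [count_false_positives(seed=s) for s in range(20)]
    print("false positives per survey:", counts)
    print("average:", fmean(counts))
```

Each simulated survey should flag a couple of dozen cells, and the average across surveys should sit near 25, even though no group differs from any other by construction.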
They say: drill in and find what's significant
This is the promise of the modern survey platform: don't settle for the headline number, drill down. Slice by department, by manager, by location, by generation, by tenure, by remote status, and let the tool flag what's statistically significant. Each significant cell feels like a discovery — a specific, actionable place where something real is happening. The dashboard is built to manufacture these, and it presents each one at full confidence, stripped of the context that would let you judge it: how many other comparisons were run to find it.
That context is the whole story.
The p-value was built for one question, not five hundred
Here is the core problem. The 5% significance threshold has a precise meaning, and it holds for one pre-specified test: if there's really nothing there, you'll be fooled about one time in twenty. That error rate is a per-test promise. It says nothing about what happens when you run hundreds of tests and report only the ones that passed. What happens is that "significant" stops carrying information, because at a one-in-twenty false-alarm rate, noise alone supplies a steady harvest of winners. The threshold didn't fail. It was asked a question it was never designed to answer: what to make of the best result out of many.
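The per-test promise can be written out. This is a sketch under a stated assumption: m tests of hypotheses that are all truly null, each run at threshold α. The expected count holds regardless of how the tests relate to one another; the "at least one" formula additionally assumes the tests are independent, which overlapping survey cuts are not, so treat that figure as approximate.

```latex
\[
\mathbb{E}[\text{false positives}] = m\alpha,
\qquad
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m}.
\]
\[
\alpha = 0.05:\qquad
m = 1 \;\Rightarrow\; 1 - 0.95 = 0.05,
\qquad
m = 14 \;\Rightarrow\; 1 - 0.95^{14} \approx 0.51,
\qquad
m = 500 \;\Rightarrow\; m\alpha = 25,\;\; 1 - 0.95^{500} \approx 1.
\]
```

By about fourteen comparisons, a false alarm somewhere in the grid is already more likely than not. At five hundred, it is a near certainty.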
This is the engine behind one of the most uncomfortable findings in modern science: that a large fraction of published, statistically significant results are false, and that the more relationships a field tests, the lower the odds that any given significant one is real.[1] An engagement survey is a relationship-testing machine: dozens of items, many subgroups, multiplied together into a field of hundreds of hypotheses, almost none of them specified in advance. It is, structurally, a false-positive factory with a clean interface.
And it gets worse when the slicing happens after the data is in, because then the subgroup itself was chosen because it looked interesting. You didn't ask "are young remote Westerners less engaged" and test it; you scanned the whole grid, saw that cell light up, and drew the box around it afterward. The hypothesis was built from the noise it claims to have found.
Decide before you look, or correct for looking
None of this means stop analyzing the survey. It means be honest about how many questions you asked, and pay the statistical price for asking a lot of them.
The first move is the cheapest and the most powerful: decide your real questions before you open the data. A short list of hypotheses you committed to in advance ("we expect the new-manager cohort to lag"; "we expect the post-reorg teams to dip") can be tested at face value, because you specified them, not the data. Everything beyond that pre-committed list is exploration, and exploration generates hypotheses, not conclusions: candidates to confirm on next quarter's fresh data, never findings to act on today. When you genuinely must test many things at once, correct for it: adjust the threshold for the number of comparisons, or control the false-discovery rate so the list of "significant" cells isn't mostly chance.[2] And report the effect size and its interval, not just the asterisk: a difference can be statistically significant and still far too small to reorganize around.
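The two corrections named above can be sketched in a few lines. This is a minimal illustration, not a replacement for a vetted statistics library: `bonferroni` divides the threshold by the number of comparisons, and `benjamini_hochberg` applies the step-up procedure for controlling the false-discovery rate at level `q`. The function names and the ten p-values are hypothetical, invented only to show how the three approaches differ.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Everything ranked at or below k is flagged, even if it missed its own line.
    flags = [False] * m
    for idx in order[:largest_k]:
        flags[idx] = True
    return flags

if __name__ == "__main__":
    # Hypothetical p-values from ten subgroup comparisons (illustration only).
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print("uncorrected p < .05:", sum(x < 0.05 for x in p))      # 5
    print("Bonferroni:         ", sum(bonferroni(p)))            # 1
    print("Benjamini-Hochberg: ", sum(benjamini_hochberg(p)))    # 2
```

In this made-up example, five cells would be flagged on the dashboard, one survives Bonferroni, and two survive the false-discovery-rate procedure. The cells that drop out are exactly the kind that belong on the "retest next quarter" list.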
The honest version returns fewer exciting surprises. It tends to confirm a couple of things you already suspected and hand you a list of maybes to retest, which is a worse slide and a better decision. The dashboard's job is to make the surprising subgroup look like a discovery. Your job is to remember how many subgroups it had to check to find it.
Why it's worth raising your voice about
Because the false positive isn't free. The working group, the targeted intervention, the budget line: all of it gets spent on a pattern that was never there. A quarter later, engagement among young remote Westerners looks normal again (it always would have; see regression to the mean) and everyone quietly credits the program. Meanwhile the genuine signal, the one real difference that nobody pre-committed to looking for and that carries no more dramatic a p-value than its neighbors, goes unnoticed among two dozen fake ones.
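The "it always would have" claim is regression to the mean, and it can be illustrated with a small simulation. This is a sketch under simplifying assumptions: 500 subgroup cells that all share the same true engagement level, observed scores expressed in standard-error units so that each quarter's reading is independent standard-normal noise, and no intervention at all. The function names are illustrative.

```python
import random
from statistics import fmean

def worst_cell_then_retest(rng, n_cells=500):
    """Return (Q1 score, Q2 score) for the cell that looked worst in Q1."""
    # Every cell has the same true engagement (0); observed scores are pure noise.
    q1 = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
    q2 = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
    worst = min(range(n_cells), key=lambda i: q1[i])
    return q1[worst], q2[worst]

def simulate(n_trials=1000, seed=0):
    """Average Q1 and Q2 scores of the flagged cell across many simulated surveys."""
    rng = random.Random(seed)
    results = [worst_cell_then_retest(rng) for _ in range(n_trials)]
    return fmean(r[0] for r in results), fmean(r[1] for r in results)

if __name__ == "__main__":
    q1_mean, q2_mean = simulate()
    print(f"flagged cell, quarter 1: {q1_mean:+.2f} standard errors")
    print(f"same cell, quarter 2:    {q2_mean:+.2f} standard errors")
```

The flagged cell looks alarming in the first quarter, around three standard errors below average, and lands close to zero in the second with nothing done to it. Any program launched in between would appear to have worked.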
So when a survey hands you a surprising significant subgroup, ask the only question that matters: how many other cuts did we check to find this one? If the answer is "we sliced it every way the tool allows," you are not looking at a discovery. You are looking at the residue of a search. A survey sliced enough ways will always confess to something — and significance you went hunting for isn't evidence. It's the sound of the hunt.
Measurement-first method, useful whether or not you ever work with us. Pre-committed hypotheses, multiple-comparison discipline, and exploration-as-hypothesis-not-conclusion are the posture behind the Principia measurement program; its open-text cousin — confirmatory rather than exploratory reading of comments — is Themes Aren't Evidence. A sibling of the dashboard-trap pieces Correlation Isn't a Driver, The Benchmark Trap, The Law of Small Numbers, and What the Exit Data Can't See. Every footnote names a real, checkable work.
Footnotes
[1] John P. A. Ioannidis, "Why Most Published Research Findings Are False," PLoS Medicine 2, no. 8 (2005): e124. Among other results, it demonstrates that the more relationships tested in a field (and the more analytic flexibility), the lower the probability that any single statistically significant claim is true.
[2] Yoav Benjamini & Yosef Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B 57, no. 1 (1995): 289–300. This is the standard modern correction for testing many hypotheses at once; the older, stricter Bonferroni adjustment divides the threshold by the number of comparisons.