peopleanalyst


Every tool reads your open text for themes. None of them tells you whether your theory is right — and that's a confirmatory question with a century-old answer.

By Mike West

June 16, 2026

Themes Aren't Evidence

The dashboard was beautiful, and it could not answer the question.

A company had run its annual engagement survey, and this year the open-text box had done its job too well: eleven thousand employees had written something back. So they did what everyone does now — fed the whole pile to a model and asked it to make sense of the comments. Out came the artifact of the genre: a ranked list of themes, a sentiment score trending down two points, a word cloud with workload and pay and manager sized large. Real work, rendered cleanly. The room nodded.

Then the CHRO asked the only question that mattered: "So is it pay? Are we losing people over comp, or is that just what people always say?"

And the dashboard had nothing. It could tell you that pay came up a lot. It could not tell you whether pay was the thing. Those are different questions, and almost every tool sold for reading open text answers only the first one while letting you believe it answered the second.

That gap — between what people said and whether you're right — is the whole subject of this essay. It is not a gap in the tooling. It is a gap in the kind of question being asked, and like the reliability problem before it, the discipline for closing it was worked out long before anyone fed a comment to a machine.

They say the text has themes

Walk through the open-text products on the market and they are, underneath the branding, the same machine: take unstructured language, return structure you didn't have to specify. Cluster the comments. Score the sentiment. Surface the topics. The lineage is honest and old — this is topic modeling, the unsupervised discovery of latent themes from a corpus, the technique that made latent Dirichlet allocation a household word in data-science departments two decades ago.1 You don't tell it what to look for; it tells you what's there. That is the pitch, and it is genuinely useful for the thing it does.

The thing it does is discovery. And discovery has a famous, load-bearing weakness that the word cloud is built to hide: there is no ground truth in the room. A topic model optimized to fit the data best does not produce the topics a human finds most meaningful — the canonical demonstration of this even had a name, reading tea leaves, and showed that the models scoring best on statistical fit were often the ones whose topics made the least sense to people.2 The labels on a theme cluster are assigned after the fact, by a human looking at the top words and deciding what they mean. The structure feels like a finding. It is closer to a Rorschach blot with good production values.

None of this is a scandal. Discovery is supposed to be open-ended; that is its purpose. The scandal is what happens when discovery gets handed to an executive as though it were a test.

Description is not a test

Roughly half a century ago John Tukey drew the line this essay is about and gave both sides a name. There is exploratory data analysis: you go into the data to see what's there, generate hypotheses, let the patterns suggest themselves. And there is confirmatory data analysis: you arrive with a hypothesis specified in advance and ask the data to render a verdict on it.3 Both are real science. They are not interchangeable, and the cardinal sin of applied analytics is to run the first and report it as the second.

Psychometrics built the same wall into its own house and never took it down. When you don't know how the items hang together, you run an exploratory factor analysis and let the structure emerge. When you have a theory of the structure and want to know if it holds, you run a confirmatory one — you specify the model first and test the fit. The text-as-data field draws the identical border in its own terms: there is discovery, which finds categories, and there is classification, which sorts cases into categories you defined in advance — and the foundational treatment of these methods is blunt that no single approach is best for all purposes and that the results of automated methods must be validated, not trusted on sight.4

A theme is an output of the exploratory mode. "Pay comes up a lot" is a hypothesis the text generated. To treat it as the answer to "is it pay?" is to skip the entire confirmatory step, to mistake the question for its own answer. The word cloud doesn't test the pay theory. It launders it.

The garden, and what grows there

Here is why this is worth raising your voice about, and not just a tidy distinction for methodologists.

When you go looking through eleven thousand comments for the story, you will find one. You will always find one, because a corpus that large contains evidence for nearly any thesis you carry into it, and the human reading the theme list is not a neutral instrument — he is a person with a prior belief about why people are leaving, scanning a Rorschach blot for confirmation. The replication crisis gave this failure its modern anatomy. Researcher degrees of freedom — the small, defensible-looking choices about what to include and how to group and which cut to report — are enough to manufacture a statistically significant result for a hypothesis that is simply false.5 You need not be dishonest. You need only decide what counts after you've seen the data, down one of the many branching paths the analysis could have taken — the garden of forking paths, where the same well-meaning analyst would have made different choices given different data, and so the finding that emerges was never really tested at all.6

A descriptive open-text tool is a forking-paths engine pointed at your own convictions. It hands you the themes, and it hands the interpretation to whoever is already sure they know the answer. The output looks like evidence and behaves like a mirror. That is the principal issue underneath the beautiful dashboard: it cannot disagree with you, and a thing that cannot disagree with you cannot be evidence.
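The arithmetic behind the forking-paths problem can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of either cited paper: it assumes that no true effect exists and that the k analysis paths are statistically independent, and the function names are hypothetical. Real forks through a single dataset are correlated, so the exact rate will differ, but the direction is the same.

```python
import random


def chance_of_false_finding(k: int, alpha: float = 0.05) -> float:
    """Chance that at least one of k independent null tests looks 'significant'."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_best_of_k(k: int, trials: int = 20000,
                       alpha: float = 0.05, seed: int = 0) -> float:
    """Monte Carlo check: an analyst tries k cuts and reports the best one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true null, each cut's p-value is uniform on [0, 1].
        best_p = min(rng.random() for _ in range(k))
        if best_p < alpha:
            hits += 1
    return hits / trials
```

Under these assumptions, an analyst free to report the best of ten cuts finds a "significant" result about 40 percent of the time when nothing is there, against the nominal 5 percent.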

Code to the science, not to a word cloud

The fix is not to stop reading open text. It is to read it the other way — deductively — and the qualitative-methods literature named this lane decades ago. Set against the conventional analysis that lets categories emerge from the text, there is directed content analysis: you start from existing theory, define your codes before you read, and assign each piece of text to a construct that was specified in advance.7 The codes don't come from the corpus. They come from the science.

This is the move that turns language into a measurement. Instead of asking a model "what themes are here?", you ask it "to what degree does this comment express psychological safety, abusive supervision, distributive injustice, service climate?" These are constructs that exist in the literature, that have been validated, that mean the same thing across studies because someone did the work of defining them. Each verbatim gets coded to a canonical construct with an intensity and a confidence, and the ones that don't fit any construct go in an honest unresolved pile instead of being forced into a theme to round out the chart. Now the open text isn't a cloud of words. It's a column of measurements, anchored to constructs you can actually reason about.

And a column of measurements can be tested.
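One way to picture a row in that column of measurements is sketched below. Everything here is an assumption made for illustration: the record shape, the `code_comment` helper, the 0-to-1 scales, and the 0.6 confidence cutoff are hypothetical, and the ratings are presumed to come from some rater (human or model) that is not shown. Only the construct names are taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Construct names are the ones the essay mentions; a real codebook would also
# carry each construct's published definition and source.
CODEBOOK = frozenset({
    "psychological_safety",
    "abusive_supervision",
    "distributive_injustice",
    "service_climate",
})

MIN_CONFIDENCE = 0.6  # assumed cutoff, chosen for illustration only


@dataclass(frozen=True)
class CodedComment:
    comment_id: str
    construct: Optional[str]  # None means "unresolved"
    intensity: float          # 0.0 to 1.0: how strongly the comment expresses it
    confidence: float         # 0.0 to 1.0: how sure the rater is of the code


def code_comment(comment_id: str,
                 ratings: Dict[str, Tuple[float, float]]) -> CodedComment:
    """Turn a rater's {construct: (intensity, confidence)} into one coded row."""
    # Ignore anything outside the codebook: codes come from theory, not the corpus.
    valid = {name: r for name, r in ratings.items() if name in CODEBOOK}
    if not valid:
        return CodedComment(comment_id, None, 0.0, 0.0)
    best = max(valid, key=lambda name: valid[name][1])
    intensity, confidence = valid[best]
    if confidence < MIN_CONFIDENCE:
        # Honest unresolved pile rather than a forced fit.
        return CodedComment(comment_id, None, 0.0, confidence)
    return CodedComment(comment_id, best, intensity, confidence)
```

The design choice that matters is the `None` branch: a comment that matches no construct, or matches one only weakly, stays unresolved instead of being assigned to the nearest label.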

Against the prior

Here is where the two questions finally separate, and the dashboard's failure becomes a different kind of answer.

The CHRO's theory, "we're losing people over pay," is not a vibe. It is a claim about a relationship between two constructs: pay, and the intention to stay. Stated that way, it is exactly the kind of thing the published science has already studied to death. When researchers cumulated ninety-two samples on the link between pay level and how satisfied people actually are with their jobs, the correlation came back at about .15: present, real, and far weaker than the folk theory that pay is the thing.8 That number is a prior: the best estimate the field has of how strongly pay and satisfaction move together, before you've looked at your own data at all. It is a prior on satisfaction rather than on leaving itself, so it serves here as an illustration of the move; a full test would pair the pay construct with a published estimate for intent to stay.
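A rough way to see how weak a correlation of about .15 is: square it. The line below assumes the common reading of r² as the share of variance two linearly related measures have in common, which is a gloss on the cited figure and not a number reported in this essay.

```latex
r \approx 0.15 \quad\Longrightarrow\quad r^{2} \approx 0.15^{2} = 0.0225 \approx 2\%
```

On that reading, pay level and job satisfaction share roughly two percent of their variance across the cumulated samples.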

Confirmatory text analysis is what happens when you fuse the two. Take the comments, now coded to constructs rather than themes. Measure the association in your own corpus between the pay construct and the intent-to-leave construct. Then weigh that association against the published prior and return a verdict: supported, refined, contradicted, or simply too thin to say. The output is not a theme. It is a sentence the dashboard could never produce: "Your engagement comments do not support the pay-drives-attrition story; the construct that actually tracks with leaving in your data is the quality of the immediate manager, which is also what the literature would have predicted." Or the opposite. The point is that the text was allowed to disagree with the executive who brought the theory in the door.

That is the difference between description and a test. Description tells you pay came up a lot. The test tells you the pay theory does not survive your own data — and that costs something to say, which is exactly why it's worth saying.
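The weigh-against-the-prior step can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular product: the function names are hypothetical, the interval uses the standard Fisher z approximation, and the decision rule (including the 0.10 precision cutoff) is invented for the example. Here "supported" means the local interval contains the published value, "refined" means the same direction at a different size, and "contradicted" means a precise local estimate that neither contains the prior nor clearly shares its direction.

```python
import math
from typing import Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length, n >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no correlation")
    return sxy / math.sqrt(sxx * syy)


def fisher_interval(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for r via the Fisher z-transform."""
    if not -1.0 < r < 1.0 or n < 4:
        raise ValueError("need -1 < r < 1 and n >= 4")
    z = math.atanh(r)
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)


def verdict(r: float, n: int, prior_r: float, max_half_width: float = 0.10) -> str:
    """Compare a local correlation with a published prior (assumed rule)."""
    lo, hi = fisher_interval(r, n)
    if (hi - lo) / 2 > max_half_width:
        return "too thin to say"      # estimate too imprecise to judge
    if lo <= prior_r <= hi:
        return "supported"            # local data agree with the prior
    same_direction = (lo > 0 and prior_r > 0) or (hi < 0 and prior_r < 0)
    return "refined" if same_direction else "contradicted"
```

In use, `pearson` would be run on two coded columns (for example, pay-construct intensity and intent-to-leave intensity per respondent), and `verdict` would be called once per construct pair, each with its own published prior. Comparing rival explanations, such as pay against manager quality, means repeating the call for each construct.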

What you can finally answer

Step back to the room. The same eleven thousand comments are on the table. The descriptive pipeline turned them into a portrait of what employees said — useful, real, and silent on the one question that drives a budget. The confirmatory pipeline turns the same words into a verdict on the theory the organization was about to spend money on. One produces a mirror. The other produces a finding that can be wrong, which is the only kind of finding worth having.

The field keeps building better mirrors — finer sentiment, cleaner clusters, prettier clouds — because description is easy to sell and impossible to falsify. But the question every leader actually has about their open text is confirmatory in shape: is the thing I believe true? That question has had a methodology since Tukey named it, a coding discipline since directed content analysis, and a way to weigh new text against settled evidence the whole time. We didn't need a smarter word cloud. We needed to remember that themes were never evidence — and to start testing the theories instead of decorating them.


This is a companion in the Measurement Meets AI program: the argument that the unglamorous machinery of measurement science is the under-used answer to what AI is being asked to do. Its siblings, "The Reliability Problem" and "Borrowed Validity," take up the same move for noisy raters and for the predictors organizations borrow without checking. The confirmatory-text capability described here (coding open text to canonical constructs and testing a stated theory against the published prior) is the headline capability of the Principia measurement registry; the capability positioning carries the compressed version of this argument. No numbers in this essay are invented: every figure traces to the cited source, and where no prior exists, the honest answer is that there isn't one.

Footnotes

  1. David M. Blei, Andrew Y. Ng & Michael I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, 2003 — the canonical unsupervised topic model: themes as latent distributions inferred from a corpus without labeled training data.

  2. Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang & David Blei, "Reading Tea Leaves: How Humans Interpret Topic Models," NeurIPS, 2009 — found that topic models with better held-out likelihood often produced less human-interpretable topics, severing statistical fit from meaning. Topic labels are assigned by human inspection after the fact.

  3. John W. Tukey, Exploratory Data Analysis, 1977 — the foundational distinction between exploratory analysis (hypothesis-generating, open-ended) and confirmatory analysis (hypothesis-testing, specified in advance). Tukey's lifelong warning was against dressing the former as the latter.

  4. Justin Grimmer & Brandon M. Stewart, "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts," Political Analysis, 2013 — distinguishes discovery (unsupervised, finds categories) from classification (supervised, sorts into pre-defined categories); the paper's repeated injunction is that "no globally best method" exists and that automated results must be validated, never trusted on output alone. The confirmatory/exploratory split also lives in psychometrics as confirmatory vs exploratory factor analysis.

  5. Joseph P. Simmons, Leif D. Nelson & Uri Simonsohn, "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant," Psychological Science, 2011 — "researcher degrees of freedom" in analysis choices are sufficient to produce significant support for false hypotheses.

  6. Andrew Gelman & Eric Loken, "The Garden of Forking Paths" (working paper, 2013; American Scientist, 2014, as "The Statistical Crisis in Science") — even without intentional fishing, analysis choices made after seeing the data invalidate the test, because a different dataset would have led the same analyst down a different path.

  7. Hsiu-Fang Hsieh & Sarah E. Shannon, "Three Approaches to Qualitative Content Analysis," Qualitative Health Research, 2005. Distinguishes conventional (codes emerge inductively from the text), directed (deductive; codes derived from existing theory and applied), and summative analysis. Directed/deductive coding to a pre-specified construct framework is the qualitative counterpart of confirmatory analysis.

  8. Timothy A. Judge, Ronald F. Piccolo, Nathan P. Podsakoff, John C. Shaw & Bruce L. Rich, "The Relationship between Pay and Job Satisfaction: A Meta-Analysis of the Literature," Journal of Vocational Behavior, 77 (2010): 157–167. Cumulating 115 correlations across 92 independent samples, pay level correlated ≈ .15 with job satisfaction and ≈ .23 with pay satisfaction — "only marginally related," against the strength the folk theory assumes. Used here as an illustrative published prior; the pay/attrition narrative in the opening is a composite, not a specific client engagement.
