Effect Size
How big is the effect really — and was the study even powered to find it?
The method
Effect-size estimation and statistical power analysis (Cohen)
The vendor reports that their program 'significantly' improved engagement, p < .05, and the renewal decision is due this month. With three thousand respondents, a significant result is compatible with an effect too small for anyone to notice. Significance says only that the data would be surprising if there were no effect at all; the director needs to know whether the effect is big enough to pay for.
Jacob Cohen spent a career insisting on that distinction. The effect size — his d, the group difference in standard-deviation units — answers how big; statistical power answers whether the study could have detected the effect at all. The two together dissolve most abuses: a huge sample makes trivial effects significant, and an underpowered study that found nothing has not demonstrated absence, only its own inability to look.
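To make the large-sample point concrete, here is a minimal Python sketch that converts an independent-samples t statistic into Cohen's d using the standard relation d = t · sqrt(1/n1 + 1/n2). The even split of three thousand respondents and the t value at the two-sided .05 threshold are illustrative assumptions, not figures from any vendor report.

```python
import math


def d_from_t(t: float, n1: int, n2: int) -> float:
    """Cohen's d implied by an independent-samples t statistic."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Hypothetical: 3,000 respondents split evenly, with a t statistic
# sitting right at the two-sided .05 significance threshold.
d = d_from_t(t=1.96, n1=1500, n2=1500)
print(round(d, 3))  # roughly 0.07 standard deviations
```

Under these assumptions a result can clear p < .05 with a difference of about seven hundredths of a standard deviation, far below even Cohen's conventional "small" of 0.2.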
David Spiegelhalter's The Art of Statistics is candid about how the significance ritual went wrong — the reproducibility crisis, questionable research practices, findings tortured past p < .05 — and his prescription is the one this method operationalizes: report the size of the effect and the uncertainty around it, in language a person can weigh, like how many people out of a hundred the difference actually touches. Practical Statistics for Data Scientists gives the working analyst's honest accounting of the p-value — what it is, and pointedly what it is not: it is not the probability the finding is true. McNulty's regression handbook brings the power question home to people analytics, where samples are small and decisions are consequential: an analysis has to respect what its data can and cannot detect before anyone trusts its inferences.
One caution the literature itself makes: Cohen offered his small/medium/large thresholds as conventions for when nothing better exists, not laws. Across ten thousand employees, a small d on a cheap intervention can be worth real money; the label matters less than the decision it feeds.
The service computes d, partial eta squared, confidence intervals, achieved power, and the n per group you actually needed — all deterministically in code; the language model never does arithmetic — then writes the practitioner read: how big, how certain, and whether the study could ever have found what it claims.
The books behind this tool
- The Art of Statistics — David Spiegelhalter
- Practical Statistics for Data Scientists — Peter Bruce, Andrew Bruce & Peter Gedeck
- Handbook of Regression Modeling in People Analytics — Keith McNulty
How it works
Deterministic effect-size and power math: Cohen's d from two-group stats or a t statistic, partial eta squared from F and dfs, 95% CIs (Hedges–Olkin SE), achieved power, and required n per group — all in code (the LLM never does arithmetic). The LLM writes the plain-language practitioner read: magnitude in context (overlap language, out-of-100 framings), honest about CIs spanning zero and underpowered designs, no rigid label worship. Grounded in the statistics corpus.
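The calculations named above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's actual code: the function names are hypothetical, the standard error is the large-sample Hedges–Olkin form sqrt((n1+n2)/(n1·n2) + d²/(2(n1+n2))), and power uses a two-sided z approximation.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)


def d_confidence_interval(d, n1, n2, level=0.95):
    """Normal-approximation CI for d with the Hedges-Olkin standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


def achieved_power(d, n1, n2, alpha=0.05):
    """Two-sided z-based power to detect a true effect of size d."""
    delta = abs(d) * math.sqrt(n1 * n2 / (n1 + n2))
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(delta - z_crit) + _STD_NORMAL.cdf(-delta - z_crit)


# Hypothetical group summaries (means and SDs are made up for illustration).
d_example = cohens_d(3.9, 0.8, 120, 3.7, 0.8, 118)

# Interval and power for d = 0.242 with groups of 120 and 118.
low, high = d_confidence_interval(0.242, 120, 118)
power = achieved_power(0.242, 120, 118)
print(round(d_example, 3), round(low, 3), round(high, 3), round(power, 2))
```

With d = 0.242 and groups of 120 and 118, the interval comes out at about -0.013 to 0.497 and power near 0.46. Small differences from a reported figure are expected when the input d has been rounded or a different approximation is used.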
You bring
{ groups?|from_t?|from_f?, target_power?, context?, cluster? }
You get
{ computation (d · CI · r-equivalent · power · n-required · eta² · f), interpretation (narrative · caveats), grounded_in, provenance }
Use it for
- Vendor claims their program 'significantly' moved engagement: get the d, the CI, and whether it matters
- Before running the study: the n per group you actually need for 80% power at the effect you expect
- Translate a paper's partial eta squared into language an exec can weigh
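For the planning use case, a minimal sketch of the sample-size calculation follows. It assumes a two-sided test, equal group sizes, and the normal approximation n = 2 · ((z_alpha + z_beta) / d)²; the function name is hypothetical, and a t-based calculation would ask for slightly more people.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate n per group for a two-sided, two-group comparison."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Expected effects here are illustrative planning values.
print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # 393
```

Halving the expected effect roughly quadruples the required sample, which is why the expected d deserves more scrutiny than any other planning input.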
See it work
Example output: Did manager coaching move engagement? Two-group comparison (n=120 vs 118) plus a 3-cohort ANOVA: how big is the effect really, and is the study even powered to say?
| Effect | Value | 95% CI | Read |
|---|---|---|---|
| Cohen's d (coaching vs not) | 0.242 | [-0.013, 0.497] | r-equivalent 0.12 |
| Achieved power | 0.464 | — | n/group for 80%: 268 |
| Partial η² (cohort ANOVA) | 0.174 | — | Cohen's f 0.459 |
Two very different signals are sitting side by side here, and they shouldn't be blended. The head-to-head comparison — coached-manager teams vs not — gives d = 0.242, a small effect (r-equivalent 0.12). In plain terms, if you lined up a coached team and an uncoached team at random, the coached one has engagement scores that lean higher only about 57 out of 100 times — the two distributions overlap heavily. That's a nudge, not a transformation, though for a light-touch, low-cost intervention like manager coaching a genuine small lift can still be worth the money if it holds up. The critical problem: the 95% CI runs from -0.013 to 0.497, which just barely crosses zero. So the honest read is that this study cannot rule out no effect at all, while also being compatible with a moderate benefit. Achieved power is only 0.464 — a coin-flip's chance of detecting a true effect this size — so a non-significant result here tells you almost nothing. To settle it at 80% power you'd need roughly 268 per group. The much larger partial η² = 0.174 (Cohen's f = 0.459, d-equivalent 0.918) is from a separate ANOVA across the 3 cohorts and answers a different question — how much engagement variance tracks with cohort — not how much coaching helps. Don't read that big number as the coaching effect; it isn't.
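The "out of 100" framing and the r-equivalent in the read above can be reproduced with two standard conversions. The sketch below assumes normal distributions with equal variances and roughly equal group sizes, and it assumes the 57-in-100 figure is the common-language effect size Φ(d/√2); the function names are hypothetical.

```python
import math
from statistics import NormalDist


def probability_of_superiority(d: float) -> float:
    """Chance that a random draw from group 1 exceeds one from group 2."""
    return NormalDist().cdf(d / math.sqrt(2))


def r_equivalent(d: float) -> float:
    """Point-biserial r implied by d, assuming equal group sizes."""
    return d / math.sqrt(d**2 + 4)


ps = probability_of_superiority(0.242)
r = r_equivalent(0.242)
print(round(ps * 100), round(r, 2))  # about 57 in 100, r about 0.12
```

A value of 50 in 100 would mean no difference at all, so 57 is a useful way to show an executive how much the two distributions overlap.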
Caveats
- The d CI (-0.013 to 0.497) includes zero, so the coaching-vs-not effect is not statistically distinguishable from no effect. Because the study is underpowered (power 0.464), that is an inconclusive result, not evidence of absence.
- You'd need about 268 per group to test this at 80% power; the current sample is far short, so a null finding is uninformative.
- The partial η² = 0.174 is a cohort-difference effect, not the coaching effect — do not report it as the size of the coaching benefit; the two analyses answer different questions.
- partial η² is biased upward in small samples — report ω² alongside it, and note d_equivalent = 0.918 assumes a two-group case that doesn't match a 3-cohort ANOVA.
- CIs and power use normal approximations (Hedges–Olkin SE, z-based power); fine for planning but use noncentral intervals for publication.
- Even a real d = 0.242 is a small, heavily-overlapping effect — decide whether that magnitude of engagement lift justifies the coaching program's cost before scaling.
Method notes
- CIs and power use normal approximations (Hedges–Olkin SE; z-based power) — fine for planning and interpretation, not a substitute for noncentral intervals in publication.
- partial η² from F is sample-biased upward (report alongside ω² for small samples); d_equivalent assumes the two-group case.
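The ANOVA conversions in these notes can be sketched as follows. The F value and degrees of freedom are hypothetical, chosen only because they land near η² = 0.174; the ω² formula shown is the one-way between-subjects form, so it is an assumption about the design, and the function names are illustrative.

```python
import math


def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


def cohens_f(eta_sq: float) -> float:
    """Cohen's f from (partial) eta squared."""
    return math.sqrt(eta_sq / (1 - eta_sq))


def omega_squared_oneway(f_stat: float, df_effect: int, df_error: int) -> float:
    """Less biased omega squared for a one-way between-subjects design."""
    n_total = df_effect + df_error + 1
    numerator = df_effect * (f_stat - 1)
    return max(0.0, numerator / (numerator + n_total))


# Hypothetical 3-group ANOVA: F(2, 95) = 10.
eta_sq = partial_eta_squared(10.0, 2, 95)
omega_sq = omega_squared_oneway(10.0, 2, 95)
f_value = cohens_f(0.174)
d_equivalent = 2 * f_value  # only meaningful for the two-group case

print(round(eta_sq, 3), round(omega_sq, 3), round(f_value, 3), round(d_equivalent, 3))
```

In this hypothetical case ω² (about 0.155) sits below η² (about 0.174), which is the upward bias the note describes; the gap widens as the sample shrinks.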
Run it on your data
Call it on your own inputs — over the API, or hand it to your AI agent via MCP. Discovery is open; running it is metered.