Scale Validator
Upload a home-grown survey scale (items + a response matrix) — get a psychometric credibility report you can defend.
The method
Classical test theory scale validation (reliability · dimensionality · item analysis · DIF)
Somebody built the engagement survey in a workshop — twelve items, a Likert scale, a name — and the organization has been steering on its scores ever since. Nobody has checked whether the items hang together, whether they measure one thing or three, or whether the scale reads differently across groups you compare. Decisions are riding on an instrument that has never been inspected.
DeVellis and Thorpe's Scale Development is the field's standard walkthrough of a claim most survey-writers never confront: a scale is a measurement instrument for an unobservable construct, and its quality is an empirical question, not a matter of whether the items sound right. Reliability, in their classical-test-theory framing, is the proportion of score variance that is true-score rather than noise — estimated by coefficient alpha and its relatives — and it is earned through item quality, content sampling, and an adequate development sample, then verified, not assumed. Carmines and Zeller's slim Reliability and Validity Assessment supplies the distinction that keeps the whole exercise honest: reliability is threatened by random error, validity by systematic error, and the two are different failures. A scale can be beautifully consistent and consistently measure the wrong thing; a high alpha is where validation starts, not where it ends.
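To make the "proportion of true-score variance" idea concrete, here is a minimal sketch of coefficient alpha computed from a respondents × items matrix. It uses the textbook formula, alpha = k/(k-1) × (1 - sum of item variances / variance of the total score), and is not the tool's own engine; the function name `cronbach_alpha` and the toy matrix are illustrative assumptions.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Coefficient alpha for a respondents x items matrix (a list of rows)."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) and of the summed score (row totals).
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 5-respondent x 3-item matrix, for illustration only.
demo = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
alpha = cronbach_alpha(demo)  # roughly 0.945 for this toy matrix
```

The sketch does not guard against a total score with zero variance (every respondent giving identical answers), which would divide by zero.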
Watkins's Exploratory Factor Analysis adds the caution that applies directly to dimensionality checks: EFA is a century old and still routinely misapplied, largely because researchers accept software defaults. The classic trap is the Kaiser eigenvalue-greater-than-one rule for deciding how many factors a scale contains — a convenient default that Watkins's evidence-based-practice review treats as a first estimate at best. So when a dimensionality report gives you a Kaiser factor count, the literature's own advice is to read it alongside the first-factor share and the item loadings rather than as a verdict. That is the right posture for the whole report: psychometric statistics are diagnostics that tell you which items to keep, review, or drop — they do not certify that the construct is the one you named.
The textbooks end with formulas and an exhortation to go compute them; here you upload items and a response matrix and the statistics — alpha, omega, eigenvalues, item-level flags, and Mantel-Haenszel DIF across groups — run in deterministic code, with the model confined to narrating numbers already computed.
The books behind this tool
- Scale Development: Theory and Applications — Robert F. DeVellis & Carolyn T. Thorpe
- Reliability and Validity Assessment — Edward G. Carmines & Richard A. Zeller
- Exploratory Factor Analysis — Marley W. Watkins
How it works
The statistics come from code; the interpretation comes from the corpus. A deterministic engine (the vendored MF-158 reliability library) computes four things: internal consistency (Cronbach's α, McDonald's ω, mean inter-item r); dimensionality (correlation-matrix eigenvalues, Kaiser factor count, first-factor share); per-item quality (corrected item-total r, α-if-deleted, single-factor loading → keep/review/drop flags); and, when a two-group variable is supplied, Mantel-Haenszel differential item functioning with ETS A/B/C classification. The model then narrates the already-computed numbers in plain language for a non-statistician, grounded in the measurement corpus. This validates the instrument (survey-scale psychometrics), not rater or rating quality (rater convergence and calibration), which is a different question. Numbers always ship; the narrative degrades gracefully if the model is unavailable.
You bring
{ scaleName, items: [{ id?, text }], responses: number[][], groups?, referenceGroup?, cluster? } — responses is a respondents × items matrix; reverse-scored items already recoded
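As a rough illustration of that input shape, the sketch below builds a small payload and checks the constraints this page states: a rectangular respondents × items matrix, at least 3 respondents and 3 items, and exactly two distinct group labels when groups are given. The helper `check_payload`, the item texts, and the group labels are hypothetical; the tool's own validation may differ.

```python
import json

def check_payload(payload):
    """Return a list of shape problems; an empty list means the shape looks consistent."""
    problems = []
    n_items = len(payload["items"])
    rows = payload["responses"]
    if n_items < 3 or len(rows) < 3:
        problems.append("need at least 3 items and 3 respondents")
    for i, row in enumerate(rows):
        if len(row) != n_items:
            problems.append(f"row {i} has {len(row)} values, expected {n_items}")
    groups = payload.get("groups")
    if groups is not None:
        if len(groups) != len(rows):
            problems.append("groups must have one label per respondent row")
        if len(set(groups)) != 2:
            problems.append("groups must contain exactly two distinct labels")
    return problems

# Hypothetical payload; item texts and group labels are placeholders.
payload = {
    "scaleName": "Demo Scale",
    "items": [
        {"id": "Q1", "text": "Hypothetical item 1"},
        {"id": "Q2", "text": "Hypothetical item 2"},
        {"id": "Q3", "text": "Hypothetical item 3"},
    ],
    "responses": [[4, 5, 4], [2, 1, 2], [3, 3, 4], [5, 4, 5]],
    "groups": ["a", "a", "b", "b"],
    "referenceGroup": "a",
}
problems = check_payload(payload)
body = json.dumps(payload)  # JSON text ready to send
```

Running the check before submitting catches the most common mistake, a row whose length does not match the number of items.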
You get
{ scaleName, nItems, nRespondents, nDropped, reliability (cronbachAlpha · mcdonaldOmega · meanInterItemCorrelation · tier), dimensionality (eigenvalues · nFactorsKaiser · firstFactorShare · unidimensional), items[] (itemTotalCorrelation · alphaIfDeleted · loading · flags · recommendation), dif (Mantel-Haenszel · ETS A/B/C · flagged[]), verdict, narrative, grounded_in, provenance }
Use it for
- Validate a home-grown engagement/climate scale before you trust its scores: items + a response matrix → α, ω, dimensionality, and which items to keep/review/drop
- Check measurement invariance before comparing groups: supply a two-group variable → a DIF pass that flags items functioning differently across groups (review before comparing means)
- One-click demo: run the built-in UWES-9 sample data to see the full credibility report (α≈.92, unidimensional, one DIF-flagged item)
See it work
Example output
A home-grown 6-item "Manager Trust Scale" with a 220-respondent × 6-item response matrix and a two-group tenure variable for a DIF check.
Scale credibility report — Manager Trust Scale
6 items · 220 respondents · 4 dropped (listwise on non-finite rows) → n = 216
All statistics computed deterministically; the LLM narrates the already-computed numbers.
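The "listwise on non-finite rows" step can be pictured with a short sketch: any respondent row containing a NaN or an infinity is removed whole, and the count of removed rows is reported. This is an assumed reading of the phrase, not the engine's actual code; `drop_non_finite_rows` is a hypothetical name.

```python
import math

def drop_non_finite_rows(responses):
    """Listwise deletion: keep only rows in which every value is a finite number."""
    kept = [row for row in responses if all(math.isfinite(v) for v in row)]
    return kept, len(responses) - len(kept)

# Hypothetical raw matrix with two unusable rows.
raw = [
    [4, 5, 3],
    [2, float("nan"), 1],
    [5, 4, 4],
    [3, float("inf"), 2],
]
clean, n_dropped = drop_non_finite_rows(raw)  # 2 rows kept, 2 dropped
```

Non-numeric entries such as `None` would raise a `TypeError` here, so they would need converting to NaN first.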
Reliability — tier: good
| Metric | Value |
|---|---|
| Cronbach's α | 0.87 |
| McDonald's ω | 0.88 |
| Mean inter-item r | 0.53 |
Reading: internal consistency is strong enough to trust composite scores for group-level decisions.
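The three reliability figures can be sanity-checked against each other. A minimal sketch, assuming the standardized (Spearman-Brown) form of alpha, k·r̄ / (1 + (k-1)·r̄): with 6 items and a mean inter-item r of 0.53 it gives about 0.87, in line with the table. Whether the engine reports raw or standardized alpha is not stated, so treat this as a consistency check only.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from the item count k and the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

value = standardized_alpha(6, 0.53)  # 3.18 / 3.65, about 0.871
```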
Dimensionality — unidimensional: yes
Eigenvalues: [3.71, 0.78, 0.54, 0.40, 0.32, 0.25] · Kaiser factors: 1 · first-factor share: 62% · first-to-second ratio: 4.8. One dominant factor — the items measure one construct.
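The summary figures on that line follow directly from the eigenvalue list. The sketch below recomputes them: the Kaiser count is the number of eigenvalues above 1, the first-factor share is the first eigenvalue over the sum, and the ratio is first over second. The key `firstToSecondRatio` is a name chosen here for illustration; the other two keys mirror the output shape listed earlier.

```python
def dimensionality_summary(eigenvalues):
    """Kaiser count, first-factor share, and first-to-second ratio from eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)  # equals the number of items for a correlation matrix
    return {
        "nFactorsKaiser": sum(1 for ev in ordered if ev > 1.0),
        "firstFactorShare": ordered[0] / total,
        "firstToSecondRatio": ordered[0] / ordered[1],
    }

summary = dimensionality_summary([3.71, 0.78, 0.54, 0.40, 0.32, 0.25])
# nFactorsKaiser is 1, firstFactorShare is about 0.62, firstToSecondRatio is about 4.8
```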
Item quality
| Item | Item-total r | α-if-deleted | Loading | Flags | Rec |
|---|---|---|---|---|---|
| MT1 "My manager keeps promises" | 0.71 | 0.84 | 0.78 | — | keep |
| MT2 "…admits mistakes" | 0.66 | 0.85 | 0.73 | — | keep |
| MT3 "…has my back" | 0.74 | 0.83 | 0.80 | — | keep |
| MT4 "…shares context" | 0.69 | 0.84 | 0.75 | — | keep |
| MT5 "…plays favorites" (R) | 0.31 | 0.89 | 0.36 | weak_discrimination, drop_improves_alpha | review |
| MT6 "…is fair" | 0.63 | 0.85 | 0.71 | — | keep |
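The two item-level columns that drive most keep/review/drop calls can be sketched as follows. The corrected item-total r correlates each item with the sum of the other items (so the item does not correlate with itself), and α-if-deleted is coefficient alpha recomputed without that item. This is a generic classical-test-theory sketch on a hypothetical toy matrix, not the engine's code, and it does not reproduce the tool's flag thresholds, which are not given on this page.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def alpha(rows):
    """Coefficient alpha for a respondents x items matrix."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / variance([sum(r) for r in rows]))

def item_analysis(rows):
    """Corrected item-total r and alpha-if-deleted for every item."""
    results = []
    for j in range(len(rows[0])):
        item = [r[j] for r in rows]
        rest = [[v for i, v in enumerate(r) if i != j] for r in rows]
        results.append({
            "itemTotalCorrelation": pearson(item, [sum(r) for r in rest]),
            "alphaIfDeleted": alpha(rest),
        })
    return results

# Hypothetical matrix: the fourth item is deliberately out of step with the rest.
demo = [
    [1, 2, 1, 3],
    [2, 2, 3, 1],
    [3, 3, 3, 4],
    [4, 5, 4, 2],
    [5, 4, 5, 3],
]
overall = alpha(demo)
results = item_analysis(demo)
```

On the toy data the fourth item has the lowest item-total r and an α-if-deleted above the overall alpha, the same pattern MT5 shows in the table.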
DIF / measurement invariance — tenure (ref: tenured · focal: new-hire)
Likert items dichotomized at the median for Mantel-Haenszel. Summary: A = 5 · B = 1 · C = 0. Flagged: MT5 (ETS B, ΔMH = -1.06, p = .03, favors reference) — functions slightly differently across tenure; review before comparing tenured vs. new-hire means.
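For readers who want to see where ΔMH comes from, here is a minimal sketch of the Mantel-Haenszel calculation. Respondents are matched on a score level; each level yields a 2×2 table of group by dichotomized response; the common odds ratio α_MH pools those tables; and ΔMH = -2.35 · ln(α_MH), so a negative value favors the reference group. The band function uses magnitude cut-offs of 1.0 and 1.5 only. The full ETS rules also involve significance tests, so this is a deliberate simplification, and the strata counts are hypothetical.

```python
import math

def mantel_haenszel_delta(strata):
    """strata: (ref_yes, ref_no, focal_yes, focal_no) counts, one tuple per score level."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den  # common odds ratio pooled across strata
    return alpha_mh, -2.35 * math.log(alpha_mh)

def ets_band_by_magnitude(delta):
    """Simplified A/B/C band from |delta| alone (significance tests omitted)."""
    size = abs(delta)
    if size < 1.0:
        return "A"
    return "B" if size < 1.5 else "C"

# Hypothetical counts for three matched score levels.
strata = [(20, 10, 15, 15), (30, 10, 24, 16), (25, 5, 20, 10)]
alpha_mh, delta = mantel_haenszel_delta(strata)  # about 2.12 and -1.76
band = ets_band_by_magnitude(delta)              # "C" on magnitude alone
report_band = ets_band_by_magnitude(-1.06)       # "B", matching the MT5 flag
```

The sketch does not handle a zero denominator, which occurs when no stratum has both reference non-endorsers and focal endorsers.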
Verdict — acceptable
A trustworthy one-factor scale with one weak, possibly miskeyed reverse item.
- Review or replace MT5 ("plays favorites") — low discrimination and the lone DIF flag; dropping it would raise α to .89.
- Safe to report a composite; re-run invariance after revising MT5.
Grounded in the measurement corpus (Cronbach, McDonald, ETS DIF conventions).
Run it now
Validate a survey scale
Paste a home-grown scale's items and a respondents × items response matrix to get a psychometric credibility report — Cronbach's α, McDonald's ω, dimensionality (eigenvalues), per-item keep/review/drop flags, and a differential-item-functioning check across groups. The statistics are computed in code; the model only explains them. (The fields below are pre-filled with sample UWES-9 data — just hit run, or replace it with your own.)
One item per line, in the same order as the columns of your response matrix. Reverse-scored items must already be recoded.
One row per respondent; one number per item, comma- or space-separated. At least 3 respondents and 3 items.
Optional. One group label per line (exactly two distinct groups), same order as the rows above. Leave blank to skip the differential-item-functioning check.
Which group is the reference (the other is the focal group). Defaults to the first group seen.
Prefer code? Call it over the API (POST /api/bicycle/scale-validator) or hand it to your AI agent via MCP (validate_scale).
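A call over HTTP might be assembled like this. Only the path and the POST method come from this page; the host `example.invalid`, the JSON content type, and the absence of authentication are assumptions, so check the API documentation before relying on any of them. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholder host; the real base URL is not given on this page.
BASE_URL = "https://example.invalid"

def build_request(payload, base_url=BASE_URL):
    """Build (but do not send) a POST request for the scale validator endpoint."""
    return urllib.request.Request(
        url=base_url + "/api/bicycle/scale-validator",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )

# Hypothetical minimal payload: 3 items, 3 respondents, no group variable.
req = build_request({
    "scaleName": "Demo Scale",
    "items": [{"text": "Item 1"}, {"text": "Item 2"}, {"text": "Item 3"}],
    "responses": [[4, 5, 4], [2, 1, 2], [3, 3, 4]],
})

# To send it, assuming the endpoint returns JSON:
# with urllib.request.urlopen(req) as resp:
#     report = json.load(resp)
```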