Scale Drafter
A survey scale and self-assessment for any management concept — drafted to psychometric standards, honest that it's a draft.
The method
Classical scale development (DeVellis; classical test theory and construct validation)
An executive wants to measure psychological safety, and by Thursday someone has typed eight questions into a survey tool. A year of team interventions will ride on those numbers, and nobody has checked whether the items measure one thing, two things, or nothing at all.
DeVellis and Thorpe's Scale Development opens with the claim that should be printed above every survey tool: measurement is not a technicality, and poor measurement imposes an absolute ceiling on the conclusions any study can support. Their method is a sequence with no skippable steps — define the construct's boundary against theory, generate an item pool larger than you will keep, choose formats deliberately, subject items to expert review, field them to a development sample, and evaluate before you trust. They also draw a distinction most homegrown surveys never consider: scales (where items reflect an underlying construct) and indices (where items form it) are different instruments requiring different methods, and confusing them corrupts everything downstream.
Carmines and Zeller give the two-word vocabulary for what can go wrong. Reliability is consistency, threatened by random error; validity is accuracy, threatened by nonrandom error — and a scale can be reliably wrong, consistently measuring something other than what you named. For abstract concepts like safety or engagement, they argue construct validity is the standard that matters: the measure must behave, across studies, the way the theory says the construct should. Tourangeau, Rips, and Rasinski's The Psychology of Survey Response explains why item-writing rules exist at all: answering a question is four cognitive operations — comprehension, retrieval, judgment, response selection — and a double-barreled item or an ambiguous quantifier is not a style problem, it is a failure at a specific stage that different respondents resolve differently, which is where the noise comes from.
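The reliability/validity distinction above can be written compactly in standard classical test theory notation. The symbols are not taken from this page: $X$ is an observed score, $T$ the true score, $E$ random error, and $S$ a stable systematic component, with all components assumed uncorrelated.

$$
X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
$$

$$
X = T + S + E
$$

In the first line, reliability is the share of observed variance that is not random error. In the second, $S$ repeats from one administration to the next, so the scores stay consistent while partly tracking something other than the named construct. That is the "reliably wrong" case.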
The honest limit: drafting standards raise the floor; they validate nothing. Until items meet data — dimensionality, internal consistency, relations to criteria — the best-drafted scale is a disciplined hypothesis.
The drafter produces the instrument to those standards — refined construct boundary with a not-to-be-confused-with list, balanced keying, no double-barreled items, a response-scale recommendation with its rationale, and the validated instrument families you should consider before deploying anything original. Every draft says plainly that drafted is not validated, with the handoff to the validation pass built in.
The books behind this tool
- Scale Development: Theory and Applications — Robert F. DeVellis & Carolyn T. Thorpe
- Reliability and Validity Assessment — Edward G. Carmines & Richard A. Zeller
- The Psychology of Survey Response — Roger Tourangeau, Lance J. Rips & Kenneth Rasinski
How it works
Psychometric first-draft from the measurement corpus: refined construct boundary (with not-to-be-confused-with), 2–4 sub-dimensions, 6–12 original Likert items (balanced keying, no double-barreled items, plain reading level), response-scale recommendation with rationale, a parallel self-assessment variant, and validated-instrument adjacencies (instrument FAMILIES named — never licensed item text reproduced). Every draft carries drafted-≠-validated caveats and hands off to the scale-validator for the validation pass. The instrument substrate behind Mike's 'a Likert survey for every management concept' program.
You bring
{ concept, definition?, context?, cluster? }
You get
{ construct (refined_definition · not_to_be_confused_with), sub_dimensions[], items[6-12] (keying · dimension), response_scale, self_assessment, adjacencies[], caveats[], handoff, grounded_in, provenance }
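The structural limits stated for a draft (2–4 sub-dimensions, 6–12 items, every item tagged with a dimension and a keying) are easy to check mechanically. A minimal Python sketch follows. The `Item`, `Draft`, and `check_draft` names are hypothetical and are not the tool's API, only a subset of the output fields is modeled, and reading "balanced keying" as "at least one reverse-keyed item" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Item:
    id: str          # e.g. "i1"
    text: str
    dimension: str   # should name one of the draft's sub-dimensions
    keying: str      # "+" for positively keyed, "R" for reverse-keyed


@dataclass
class Draft:
    # Hypothetical subset of the output shape; other fields are omitted.
    sub_dimensions: list[str]
    items: list[Item]


def check_draft(draft: Draft) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    if not 2 <= len(draft.sub_dimensions) <= 4:
        problems.append("expected 2-4 sub-dimensions")
    if not 6 <= len(draft.items) <= 12:
        problems.append("expected 6-12 items")
    for item in draft.items:
        if item.dimension not in draft.sub_dimensions:
            problems.append(f"{item.id}: unknown dimension {item.dimension!r}")
        if item.keying not in {"+", "R"}:
            problems.append(f"{item.id}: keying must be '+' or 'R'")
    # Assumption: "balanced keying" is read loosely as "some items are reversed".
    if not any(item.keying == "R" for item in draft.items):
        problems.append("no reverse-keyed items")
    return problems
```

A check like this says nothing about item quality. It only confirms the draft has the shape described above.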
Use it for
- Any book-profile construct → a deployable draft instrument (draft → deploy small-scale → validate_scale)
- Exec wants to 'measure psychological safety': original items + the validated-instrument families to consider first
- Pair with the survey-orchestrator for the deploy step; validate with scale-validator after wave 1
See it work
Example output
Concept: "psychological ownership of AI tools", defined as the extent to which an employee feels an AI tool is 'theirs' (possession, responsibility, identity investment), for knowledge workers in orgs rolling out AI assistants; team pulse survey plus a self-assessment.
Drafted instrument: Psychological Ownership of AI Tools
Refined definition. The degree to which a knowledge worker feels a possessive, self-connected bond with an AI tool they use — experiencing it as "mine/ours", taking personal responsibility for how it is used and what it produces, and investing effort to shape and improve it. Felt ownership, not legal ownership.
Not to be confused with: technology acceptance / perceived usefulness · trust in AI · tool dependence or usage frequency · job engagement.
Sub-dimensions (3)
- Possessive Attachment — the felt sense that the AI tool (or the team's shared use of it) is "mine" or "ours".
- Responsibility for Outputs — felt personal accountability for how the tool is used and the quality of what it produces.
- Identity Investment — how far the tool and its outputs feel tied to one's professional self-image and effort.
Likert items (11 · 3 reverse-keyed) — original drafts
| # | Item | Dimension | Keying |
|---|---|---|---|
| i1 | I think of this AI tool as one of my own work tools. | Possessive Attachment | + |
| i2 | When I talk about the tool, I naturally call it "mine" or "ours." | Possessive Attachment | + |
| i3 | The tool still feels like something the company owns, not something that is mine. | Possessive Attachment | R |
| i4 | I feel personally responsible for how well the tool's answers are used in my work. | Responsibility for Outputs | + |
| i5 | If the tool produces a poor result in my work, I feel it is on me to fix it. | Responsibility for Outputs | + |
| i6 | What the tool produces is really the tool's doing, so I do not feel accountable for it. | Responsibility for Outputs | R |
| i7 | I put personal effort into shaping how the tool works for me, such as adjusting prompts or settings. | Responsibility for Outputs | + |
| i8 | The way I use this tool reflects who I am as a professional. | Identity Investment | + |
| i9 | I take pride in the results I create with the tool. | Identity Investment | + |
| i10 | The outputs I get from the tool feel like my own work. | Identity Investment | + |
| i11 | I feel little personal connection to this tool. | Identity Investment | R |
Response scale: 5-point agree–disagree (Strongly disagree → Strongly agree). Familiar to knowledge workers, quick for a pulse survey, and the midpoint respects genuinely neutral attachment early in an AI rollout.
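To make the keying column concrete, here is a hedged Python sketch of how the 11 items above might be scored on the 5-point scale. It assumes responses are coded 1 (Strongly disagree) to 5 (Strongly agree) and that each sub-dimension is scored as the mean of its items; neither choice is stated in the draft, and the function names are hypothetical.

```python
from statistics import mean

SCALE_MAX = 5  # assumed coding: 1 = Strongly disagree ... 5 = Strongly agree
REVERSE_KEYED = {"i3", "i6", "i11"}
DIMENSIONS = {
    "Possessive Attachment": ["i1", "i2", "i3"],
    "Responsibility for Outputs": ["i4", "i5", "i6", "i7"],
    "Identity Investment": ["i8", "i9", "i10", "i11"],
}


def recode(item_id: str, raw: int) -> int:
    """Flip reverse-keyed items so a higher value always means more ownership."""
    if not 1 <= raw <= SCALE_MAX:
        raise ValueError(f"{item_id}: response {raw} is outside 1..{SCALE_MAX}")
    return SCALE_MAX + 1 - raw if item_id in REVERSE_KEYED else raw


def score(responses: dict[str, int]) -> dict[str, float]:
    """Mean of recoded item responses for each sub-dimension."""
    return {
        dim: mean(recode(item_id, responses[item_id]) for item_id in item_ids)
        for dim, item_ids in DIMENSIONS.items()
    }


# Made-up respondent, for illustration only.
respondent = {"i1": 4, "i2": 5, "i3": 2, "i4": 4, "i5": 4, "i6": 1,
              "i7": 3, "i8": 3, "i9": 4, "i10": 3, "i11": 2}
print(score(respondent))
```

Recoding before averaging matters: without it, agreeing with i3 ("something the company owns") would raise the Possessive Attachment score instead of lowering it.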
Self-assessment variant: parallel 11-item first-person form (s1–s11) with the instruction "Think about the AI tool you use most at work, and rate how much you agree with each statement about your own experience."
Validated instruments exist near this construct — consider them first
- Psychological Ownership scales (Pierce/Van Dyne-style organizational ownership measures) — same facet skeleton, different target object; adapt the facet structure as a validity benchmark, never the item text.
- Technology Acceptance / UTAUT-style adoption scales — adjacent (usefulness, ease of use, intention); useful for discriminant-validity testing.
- Trust in Automation / Trust in AI scales — beliefs about reliability; include alongside and expect moderate, not high, correlations.
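The "moderate, not high" expectation in the list above can be checked once both measures have been fielded to the same respondents. The Python sketch below computes a Pearson correlation between two score lists and flags high overlap. The 0.85 cutoff is an illustrative assumption and not a figure from this page, and the function names are hypothetical.

```python
from math import sqrt

HIGH_OVERLAP = 0.85  # assumed cutoff, for illustration only


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two lists of equal length with at least 3 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant score list has no defined correlation")
    return sxy / sqrt(sxx * syy)


def overlaps_too_much(r: float, cutoff: float = HIGH_OVERLAP) -> bool:
    """True when two measures correlate so highly they may be one construct."""
    return abs(r) >= cutoff
```

A flagged pair is a prompt to revisit the construct boundary, not a verdict. Proper discriminant-validity testing belongs to the validation pass.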
Psychometric caveats (drafted ≠ validated)
- Early DRAFT: no reliability (α/ω), dimensionality (EFA/CFA), or validity evidence yet.
- Responsibility and Identity items may empirically load together — facets may need merging.
- Ownership items are prone to social desirability and common-method variance in a rollout-era pulse.
- The "mine/ours" wording mixes individual and collective ownership — split if team- vs individual-level ownership are separate targets.
- Low early scores may reflect limited exposure, not detachment — add a tenure/usage covariate.
- Reverse-keyed items (i3, i6, i11) guard against acquiescence; still screen for careless responding, since inattentive respondents often misread reversed items.
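One simple careless-responding screen is a "longstring" check: the longest run of identical answers in presentation order. The Python sketch below is a hedged illustration. The threshold of 8 is an assumption, the function names are hypothetical, and this is not presented as the scale-validator's method.

```python
def longest_run(answers: list[int]) -> int:
    """Length of the longest run of identical consecutive answers."""
    best = 0
    run = 0
    previous = None
    for answer in answers:
        run = run + 1 if answer == previous else 1
        best = max(best, run)
        previous = answer
    return best


def looks_careless(answers: list[int], threshold: int = 8) -> bool:
    """Flag a response for review when one answer repeats `threshold` times in a row.

    The default threshold is an illustrative assumption, not a validated cutoff.
    """
    return longest_run(answers) >= threshold


# Raw (not recoded) answers to i1..i11, made up for illustration.
straight_liner = [5] * 11
attentive = [4, 5, 2, 4, 4, 1, 3, 3, 4, 3, 2]
print(looks_careless(straight_liner), looks_careless(attentive))
```

Run the check on raw answers, before reverse-keyed items are recoded. An attentive respondent can hardly strongly agree with both i1 and i3, so a long run of extreme answers that crosses reversed items deserves a second look. A flag is a reason to review a response, not proof of carelessness.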
Next step: collect pilot responses (n ≥ 100 recommended) and run the response matrix through the scale-validator service (validate_scale) for reliability, dimensionality, and item-quality evidence before operational use.
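For orientation, the reliability evidence mentioned above usually starts with Cronbach's alpha. The Python sketch below implements the textbook formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), on a respondents-by-items matrix. It assumes reverse-keyed items are already recoded, and it illustrates the statistic only; it is not the scale-validator's implementation.

```python
from statistics import variance


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha for a matrix with one row per respondent.

    Each row holds that respondent's answers to the same k items,
    with reverse-keyed items already recoded.
    """
    if len(rows) < 2:
        raise ValueError("need at least 2 respondents")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("every row needs the same k >= 2 items")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Alpha summarizes internal consistency under the assumption that the items measure one thing. It does not test that assumption, which is why the dimensionality evidence is listed alongside it.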
Grounded in 17 canonical constructs from the measurement corpus (11 books, incl. Scale Development, The Psychology of Survey Response, Reliability and Validity Assessment). All item wordings are original drafts — no licensed/branded item sets reproduced.
Run it on your data
Call it on your own inputs — over the API, or hand it to your AI agent via MCP. Discovery is open; running it is metered.