The PeopleAnalyst Guide to Work Rules · Ch 05
Don't Trust Your Gut
What Bock argues
The chapter's spine is simple and correct: the unstructured "let's have a conversation and see how it feels" interview, the dominant hiring ritual at most companies, is a weak predictor of job performance. In its place Google leans on a small set of methods that predict better: work-sample tests (give the candidate a slice of the actual job), structured interviews (every candidate gets the same questions against the same scoring criteria), and tests of general cognitive ability. It combines them rather than betting on one. Google built a tool, qDroid, to hand interviewers pre-written, job-relevant structured questions so the gut never gets the wheel.
Bock's load-bearing move is to rank the methods by how much they predict, and to insist that the consistency of a structured interview — same questions, same rubric — is the whole point: it makes the variation in scores about the candidate, not the interviewer. He also notes the part most hiring managers skip: structured interviews are rare precisely because they're work — you have to write them, test them, enforce them, and refresh them.
What the research actually says (and where 2015 needs an update)
Bock's hierarchy comes from the most-cited paper in personnel selection: Schmidt & Hunter (1998), a summary of meta-analytic findings from 85 years of selection research. Its headline coefficients put work samples, structured interviews, and general mental ability (GMA) at the top as the strongest single predictors, with unstructured interviews well behind; they are the empirical backbone of this chapter.
Two honest refinements the Guide must add on top of Bock:
- "Percent of performance" is a simplification. Bock reports predictors as percentages; the underlying numbers are validity coefficients (correlations), and the meta-analytic values are corrected for range restriction and measurement unreliability. The corrected numbers are the right ones for comparing methods, but they overstate what any single tool delivers on your actual applicant pool. Report both the corrected coefficient and the operational reality.
- The field revised these numbers after the book. Sackett, Zhang, Berry & Lievens (2022) re-examined the corrections and argued many were overstated, notably pulling GMA's standalone validity down and leaving structured interviews looking relatively stronger. That doesn't overturn Bock's thesis; it sharpens it: structure is doing even more of the work than the 2015 framing implied. This is the chapter's "what's changed" hook, and a reason to trust the method (structure) over the magic of any one instrument.
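To make the gap between a corrected coefficient and what you observe concrete, here is a minimal Python sketch. It assumes two textbook corrections: disattenuation for unreliability in the performance measure, and the Thorndike Case II formula for direct range restriction. The function names and all input values are hypothetical, chosen only to show the mechanics; they are not figures from Bock, Schmidt & Hunter, or Sackett et al. The sketch also assumes the usual reading of a "percent of performance" figure as a squared correlation.

```python
import math

def disattenuate_criterion(r_observed: float, criterion_reliability: float) -> float:
    """Correct a correlation for unreliability in the performance measure only."""
    return r_observed / math.sqrt(criterion_reliability)

def correct_range_restriction(r: float, u: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    u = SD of the predictor among all applicants / SD among those hired (u >= 1).
    """
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

def variance_explained(r: float) -> float:
    """Share of performance variance a correlation accounts for (r squared)."""
    return r ** 2

# Hypothetical inputs, for illustration only.
r_observed = 0.25             # correlation seen among the people you actually hired
criterion_reliability = 0.60  # assumed reliability of the performance ratings
u = 1.3                       # assumed applicant-to-hire SD ratio on the predictor

r_corrected = correct_range_restriction(
    disattenuate_criterion(r_observed, criterion_reliability), u
)

# Report both numbers side by side, as the text recommends.
print(f"observed  r = {r_observed:.2f} ({variance_explained(r_observed):.0%} of variance)")
print(f"corrected r = {r_corrected:.2f} ({variance_explained(r_corrected):.0%} of variance)")
```

The corrected value depends heavily on the assumed `u` and `criterion_reliability`, which is why a debate over how large those corrections should be can move the published rankings.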
Underneath all of it sits a measurement fact Bock gestures at but doesn't name: a structured interview works because it raises inter-rater reliability — different interviewers converge on the same judgment of the same candidate. An unstructured interview is a single noisy rater. (That's the same disease the Reliability essay diagnoses; the cure is the same — standardize, anchor to a criterion, use multiple raters.)
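The "use multiple raters" part of that cure can be sketched with the Spearman-Brown formula, which projects the reliability of an averaged score from the reliability of a single rater. This is a minimal illustration, assuming raters are interchangeable and their errors independent; the starting reliability of 0.40 is a hypothetical value, not a figure from the chapter.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Projected reliability of the mean of n_raters parallel raters."""
    r = single_rater_reliability
    return (n_raters * r) / (1 + (n_raters - 1) * r)

single = 0.40  # hypothetical reliability of one interviewer's score
for k in (1, 2, 3, 4):
    print(f"{k} rater(s): projected reliability {spearman_brown(single, k):.2f}")
```

Each added rater helps less than the one before, so the formula also gives a rough sense of when a bigger panel stops being worth the interviewer time.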
And the fairness point is not a footnote: structured, consistent process is perceived as more fair by candidates (procedural justice), independent of who gets the offer. Bock observed it; the justice literature explains why (see the Show Your Work essay).
How you actually run it
The execution layer Bock leaves implicit:
- Define the work first. A work sample or a competency rubric is only valid against a specific basket of work. Write the job's real tasks before writing the interview.
- Build the structured guide. Behavioral ("tell me about a time…") + situational ("what would you do if…") questions, each with a behaviorally-anchored rating scale (what a 1 vs a 3 vs a 5 answer looks like). This is qDroid's job, and it's reproducible without Google's budget.
- Measure inter-rater reliability. Have ≥2 interviewers score independently before they confer, and compute agreement: Cohen's κ for two raters (a weighted κ suits ordinal 1–5 scores), Krippendorff's α for larger or incomplete panels, or a G-theory generalizability coefficient. Low agreement means the rubric, not the candidate, is the problem.
- Validate against outcomes. Track predictor scores against later performance; the correlation is your local validity. Most companies never close this loop, so they never learn whether their interview predicts anything at all.
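The last two steps above can be sketched in plain Python. This is one possible implementation under stated assumptions, not the Guide's tool: agreement is computed as a quadratic-weighted Cohen's κ for two interviewers on a 1–5 scale, and local validity as a Pearson correlation with a Fisher z confidence interval. All function names and the sample scores are hypothetical.

```python
import math
from statistics import NormalDist

def quadratic_weighted_kappa(scores_a, scores_b, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n, k = len(scores_a), len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Observed joint proportions: rows = rater A, columns = rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(scores_a, scores_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Disagreement weight grows with the squared distance between two ratings.
    weight = lambda i, j: ((i - j) / (k - 1)) ** 2
    obs = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("no variation in scores; kappa is undefined")
    return 1 - obs / exp

def pearson_r(x, y):
    """Pearson correlation between predictor scores and later performance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z transform."""
    if n <= 3:
        raise ValueError("need more than 3 paired observations")
    if abs(r) >= 1:
        return (r, r)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit / math.sqrt(n - 3)
    z = math.atanh(r)
    return (math.tanh(z - half_width), math.tanh(z + half_width))

# Hypothetical data: two interviewers' independent scores for eight candidates.
interviewer_a = [4, 3, 5, 2, 4, 3, 1, 5]
interviewer_b = [4, 2, 5, 3, 4, 4, 2, 5]
print("weighted kappa:", round(quadratic_weighted_kappa(interviewer_a, interviewer_b), 2))

# Hypothetical data: interview scores and performance ratings for ten hires.
interview_score = [3.5, 4.0, 2.5, 4.5, 3.0, 5.0, 2.0, 3.5, 4.0, 3.0]
performance = [3, 4, 3, 4, 2, 5, 3, 3, 5, 3]
r = pearson_r(interview_score, performance)
low, high = fisher_ci(r, len(performance))
print(f"local validity r = {r:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only a handful of hires the interval is wide, which is the practical reason to start logging scores early and to read a single-cohort correlation cautiously.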
The toolbox analysis to build
This chapter maps cleanly onto the reliability/inter-rater program (the "Consensus Coder"):
- Inter-rater reliability analysis: take a panel's independent scores, return κ/α and a G-theory variance decomposition (how much score variance is candidate vs interviewer vs question). The output names whether your interview is measuring the candidate or the interviewer's mood.
- Local selection-validity tracking: correlate predictor scores with downstream performance ratings (honest, small-N-aware CIs via the calculus spoke), so a team can see its own validity instead of importing Schmidt & Hunter's and hoping.
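A rough sketch of the variance decomposition named in the first item follows. It is a simplification, not the Consensus Coder itself: it assumes a fully crossed candidates × interviewers table with one score per cell and no missing data, estimates components from two-way ANOVA mean squares, and leaves out the question facet, so question effects end up in the residual. Function names and the sample panel are hypothetical.

```python
def variance_components(scores):
    """Estimate variance components from scores[p][r]: candidate p, interviewer r."""
    n_p, n_r = len(scores), len(scores[0])
    if n_p < 2 or n_r < 2 or any(len(row) != n_r for row in scores):
        raise ValueError("need a complete table with >= 2 candidates and >= 2 raters")
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    cand_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

    # Sums of squares for the two main effects and what is left over.
    ss_cand = n_r * sum((m - grand) ** 2 for m in cand_means)
    ss_rater = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = max(ss_total - ss_cand - ss_rater, 0.0)

    ms_cand = ss_cand / (n_p - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates are truncated to zero.
    return {
        "candidate": max((ms_cand - ms_resid) / n_r, 0.0),
        "interviewer": max((ms_rater - ms_resid) / n_p, 0.0),
        "residual": ms_resid,
    }

def g_coefficient(components, n_raters):
    """Relative generalizability coefficient for the mean of n_raters scores."""
    denom = components["candidate"] + components["residual"] / n_raters
    return components["candidate"] / denom if denom > 0 else float("nan")

# Hypothetical panel: four candidates, each scored by the same three interviewers.
panel = [
    [4, 3, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 2, 2],
]
parts = variance_components(panel)
total = sum(parts.values())
for name, value in parts.items():
    print(f"{name:12s} {value:.3f} ({value / total:.0%} of total)")
print("G coefficient (3 raters):", round(g_coefficient(parts, 3), 2))
```

A large interviewer share points to harsh or lenient raters; a large residual share points to disagreement the rubric is not resolving.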
Both are runnable analytics, not prose — the chapter ends in a tool, which is the Guide's whole bet.
The AI-era turn
Bock wrote before AI screened a single résumé. The translation is direct and a little uncomfortable: an AI that screens, scores, or ranks candidates is a rater — and everything this chapter says about raters applies to it. A black-box model is the new unstructured interviewer: confident, fast, and unaccountable. Before you trust it you owe it the same psychometric audit you'd owe a human panel — inter-rater reliability (does it agree with itself and with calibrated humans?), validity (does its score predict performance?), and adverse-impact checks. And because hiring is the most procedurally-sensitive decision a person faces, the model has to be legible to survive contact with candidates (the transparency thesis). "Don't trust your gut" becomes "don't trust the black box either — measure it."
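One piece of that audit, the adverse-impact check, can be sketched as a selection-rate comparison. This example assumes the four-fifths heuristic from the US Uniform Guidelines on Employee Selection Procedures as the threshold; that choice is an assumption of this sketch, not something the chapter specifies, and a ratio below 0.8 is a prompt for closer review rather than a verdict. Group labels, counts, and function names are hypothetical.

```python
def impact_ratios(outcomes):
    """outcomes maps group label -> (number advanced, number screened)."""
    rates = {}
    for group, (advanced, screened) in outcomes.items():
        if screened <= 0:
            raise ValueError(f"group {group!r} has no screened candidates")
        rates[group] = advanced / screened
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group had anyone advance; ratios are undefined")
    # Each group's selection rate relative to the highest-rate group.
    return {group: rate / best for group, rate in rates.items()}

def flagged_groups(outcomes, threshold=0.8):
    """Groups whose impact ratio falls below the threshold (four-fifths by default)."""
    return sorted(g for g, ratio in impact_ratios(outcomes).items() if ratio < threshold)

# Hypothetical screening results from an AI résumé screen.
screen_results = {"group_a": (30, 100), "group_b": (18, 100)}
print(impact_ratios(screen_results))
print("below four-fifths:", flagged_groups(screen_results))
```

The same table of AI scores can be fed to the reliability and validity checks used for human panels, treating the model as one more rater.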
Do this Monday
- Pick one role. Write down what the work actually is.
- Replace the unstructured interview with 4–5 structured questions + a 1–5 anchored rubric.
- Have two interviewers score independently, then compute their agreement. If it's low, fix the rubric, not the people.
- Start logging predictor scores so you can correlate them with performance in six months.
- If an AI tool is in your funnel, demand its reliability and validity numbers before it screens one more person.