peopleanalyst

analytics · Q7 · to verify

Conway, Jako & Goodman 1995 (JAP) — interview interrater reliability rises with standardization; validity ceiling .67 structured vs .34 unstructured

Meta-analyzing 111 interrater-reliability coefficients and 49 coefficient alphas from selection interviews, the authors found that interview reliability is moderated by standardization of questions, standardization of response evaluation, and how multiple ratings are combined. The estimated upper limit of validity was .67 for highly structured interviews versus .34 for unstructured interviews — roughly double — and mechanically combining multiple ratings helped while subjective combination did not.

Estimated upper-limit validity of the selection interview by structure, and moderators of interrater reliability
Upper limit of validity ≈ .67 for highly structured interviews vs ≈ .34 for unstructured interviews. Interrater reliability moderated by standardization of questions, standardization of response evaluation, and method of combining multiple ratings; mechanical combination of multiple ratings was useful, subjective combination showed no evidence of usefulness. Standardizing questions had a stronger effect for separate (vs panel) interviews.
Sample
111 interrater-reliability coefficients + 49 coefficient alphas from selection-interview studies
Methodology
Psychometric meta-analysis of interrater reliability and internal-consistency reliability; moderator analysis on study design, interviewer training, and three dimensions of interview structure.

What this means

  • Direct quantification of the human-failure baseline for the interview case: the unstructured interview — the default in most organizations — tops out at validity ≈ .34, and its weakness is traced to low reliability driven by un-standardized inputs and idiosyncratic rater judgment.
  • The cure is named explicitly and matches the essay's shared prescription: standardize the questions, standardize how responses are scored, train raters, and combine multiple ratings mechanically rather than letting raters blend impressions subjectively. Structure roughly doubles the validity ceiling (.34 to .67).
  • The finding that mechanical combination of multiple ratings helps but subjective combination does not is the multi-rater discipline in its precise form — averaging raters buys reliability only when the aggregation is rule-governed, not when a dominant rater overwrites the panel.
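
The link between rule-governed averaging, reliability, and a validity ceiling can be sketched with two standard classical-test-theory relationships: the Spearman-Brown formula for the reliability of the mean of several parallel raters, and the bound that validity cannot exceed the square root of the predictor's reliability when the criterion is assumed perfectly reliable. This is a minimal illustration under those assumptions, not a reconstruction of the authors' procedure; the function names are hypothetical and the single-rater reliability of .50 is an invented input, not a figure from the paper.

```python
from math import sqrt
from statistics import mean


def mechanical_composite(ratings):
    """Combine one candidate's ratings by a fixed rule: the unweighted mean.

    No rater can override the others, which is what makes the
    combination "mechanical" rather than subjective.
    """
    return mean(ratings)


def spearman_brown(single_rater_reliability, n_raters):
    """Reliability of the mean of n parallel raters (Spearman-Brown)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def validity_ceiling(reliability):
    """Upper bound on validity under classical test theory.

    Assumes a perfectly reliable criterion, so the bound reduces to
    the square root of the predictor's reliability.
    """
    return sqrt(reliability)


if __name__ == "__main__":
    # Hypothetical single-rater reliability; NOT taken from the paper.
    r_single = 0.50
    for k in (1, 2, 3):
        r_k = spearman_brown(r_single, k)
        print(f"raters={k}  reliability={r_k:.3f}  ceiling={validity_ceiling(r_k):.3f}")

    # Three raters score one candidate on a shared scale (invented scores).
    print("composite:", mechanical_composite([3, 4, 5]))
```

The Spearman-Brown step-up only applies when the ratings are pooled by a fixed rule such as the mean. A subjective blend gives no guarantee that every rater contributes, which is consistent with the card's point that only mechanical combination showed a benefit.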

Source

A meta-analysis of interrater and internal consistency reliability of selection interviews

Journal of Applied Psychology · James M. Conway et al. · 1995 · peer-reviewed

Context

What came before
The employment interview is the most widely used selection method and is intuitively trusted by hiring managers, yet earlier reviews flagged its low reliability and validity (e.g., Mayfield 1964; Hunter & Hunter 1984, which put interview validity near .14). The open question was what made some interviews work.
What comes next
Establishes the structured-vs-unstructured reliability/validity gap and the standardization, training, and multi-rater fix that the AI-interview case must be measured against. Pairs with Huffcutt et al. 2013 (panel .74 vs separate .44) and Gardner et al. 2022 (ICC .50 to ~.69 after structure and training) as the human-side fix evidence. To verify: exact coefficient counts and the .67/.34 ceilings against the full text.