peopleanalyst

analytics · Q6 · to verify

Zhang et al. 2024 (IEEE TAC) — GPT-3.5/GPT-4 rate async video interviews with insufficient test-retest reliability and emergent bias

Evaluating GPT-3.5 and GPT-4 as raters of personality and interview performance from asynchronous video interviews (simulated AVI responses of 685 participants), the LLMs achieved validity comparable to or better than a task-specific AI model for some traits, but suffered from uneven performance across traits, insufficient test-retest reliability, and emergent biases — leading the authors to urge caution before using LLMs for employment decisions.

Validity, test-retest reliability, and fairness of GPT-3.5/GPT-4 as AVI raters vs a task-specific AI model and human annotators
LLMs reached similar or better zero-shot validity than a task-specific AI model on some personality traits, but exhibited uneven performance across traits, insufficient test-retest reliability, and certain emergent biases. (The specific reliability and fairness coefficients reported in the paper have not yet been extracted for verification.)
Sample
Simulated AVI responses of 685 participants; raters = GPT-3.5 and GPT-4, compared against a task-specific AI model and human annotators
Methodology
Comprehensive psychometric evaluation (validity, reliability, fairness, rating patterns) of two LLMs as zero-shot raters of personality and interview performance from asynchronous video interviews, benchmarked against a task-specific model and human ratings.

What this means

  • The AI interviewer walks into the same wall the human interviewer did: insufficient test-retest reliability means the LLM rater gives different scores to the same response on different occasions — the single-rater instability problem, now in silicon. Swapping the human for an LLM did not deliver the hoped-for objectivity.
  • The authors evaluate the LLM with the classic psychometric quartet — validity, reliability, fairness, rating patterns — the same vocabulary the human-interview literature (Conway 1995; Huffcutt 2013) built. The measurement frame is imported wholesale; the substrate changed, the standards did not.
  • Comparable validity but unstable reliability is exactly the essay's open question rendered concrete: the LLM is not error-free and not obviously better than humans; the honest finding is how close to the human failure mode it lands. The implied fix is the same — standardize prompts/scoring, average multiple passes/raters, validate against an outcome.
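
The "average multiple passes" fix can be sketched numerically. The example below is a minimal illustration on synthetic data, not a reconstruction of the paper's analysis: `rate_once` is a hypothetical stand-in for one LLM rating pass (a true score plus random noise), the noise level and sample size are arbitrary assumptions, and test-retest reliability is taken as the Pearson correlation between two rating occasions. The Spearman-Brown formula gives the projected reliability of the mean of k parallel ratings.

```python
import math
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_single, k):
    """Projected reliability of the mean of k parallel ratings."""
    return k * r_single / (1 + (k - 1) * r_single)


def rate_once(true_scores, noise_sd, rng):
    """Hypothetical stand-in for one LLM rating pass: true score plus noise."""
    return [t + rng.gauss(0, noise_sd) for t in true_scores]


def rate_averaged(true_scores, noise_sd, rng, k):
    """Score each response as the mean of k independent passes."""
    passes = [rate_once(true_scores, noise_sd, rng) for _ in range(k)]
    return [mean(col) for col in zip(*passes)]


rng = random.Random(0)
# Synthetic "true" scores; assumed values, not the paper's data.
true_scores = [rng.gauss(0, 1) for _ in range(500)]

# Test-retest reliability with one pass per occasion.
r_single = pearson_r(
    rate_once(true_scores, 1.0, rng),
    rate_once(true_scores, 1.0, rng),
)

# Test-retest reliability when each occasion is the mean of 5 passes.
r_avg = pearson_r(
    rate_averaged(true_scores, 1.0, rng, 5),
    rate_averaged(true_scores, 1.0, rng, 5),
)

print(f"single-pass retest r = {r_single:.2f}")
print(f"5-pass mean retest r = {r_avg:.2f}")
print(f"Spearman-Brown projection = {spearman_brown(r_single, 5):.2f}")
```

As a usage example, `spearman_brown(0.44, 4)` comes out near 0.76. That borrows the human single-interviewer figure cited on this card purely as an illustrative input and assumes the four ratings are parallel (equally reliable, independent errors); it says nothing about the LLM coefficients, which are still to be extracted from the paper.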

Context

What came before
Automated video interviews (AVIs) are marketed as faster and more objective than human interviews. Machine-learning AVI research (Hickman et al. 2021; Koutsoumpis et al. 2024) had already found test-retest reliability below desired personnel-selection standards. This study extended the question to general-purpose LLMs (GPT-3.5/GPT-4) as raters.
What comes next
Anchors the AI side of the interview case study, set beside the human reliability baseline (single interviewer ≈ .44; structured/panel/trained pushes higher). Corroborated by Hickman et al. 2021 (AVI personality, mixed reliability) and Koutsoumpis et al. 2024 (test-retest below selection standards). Extract the exact reliability/fairness coefficients from full text before citing specific numbers.