analytics · Q6 · to verify
Zhang et al. 2024 (IEEE Trans. Affective Computing): GPT-3.5/GPT-4 rate async video interviews with insufficient test-retest reliability and emergent bias
When GPT-3.5 and GPT-4 were evaluated as raters of personality and interview performance from asynchronous video interviews (simulated AVI responses of 685 participants), they achieved validity comparable to or better than a task-specific AI model for some traits. They also showed uneven performance across traits, insufficient test-retest reliability, and emergent biases, which led the authors to urge caution before using LLMs for employment decisions.
Validity, test-retest reliability, and fairness of GPT-3.5/GPT-4 as AVI raters vs a task-specific AI model and human annotators
LLMs reached similar or better zero-shot validity than a task-specific AI model on some personality traits, but exhibited uneven performance across traits, insufficient test-retest reliability, and certain emergent biases. (The specific reliability and fairness coefficients reported in the paper have not yet been extracted for verification.)
- Sample
- Simulated AVI responses of 685 participants; raters = GPT-3.5 and GPT-4, compared against a task-specific AI model and human annotators
- Methodology
- Comprehensive psychometric evaluation (validity, reliability, fairness, rating patterns) of two LLMs as zero-shot raters of personality and interview performance from asynchronous video interviews, benchmarked against a task-specific model and human ratings.
What this means
- The AI interviewer walks into the same wall the human interviewer did: insufficient test-retest reliability means the LLM rater gives different scores to the same response on different occasions — the single-rater instability problem, now in silicon. Swapping the human for an LLM did not deliver the hoped-for objectivity.
- The authors evaluate the LLM with the classic psychometric quartet — validity, reliability, fairness, rating patterns — the same vocabulary the human-interview literature (Conway 1995; Huffcutt 2013) built. The measurement frame is imported wholesale; the substrate changed, the standards did not.
- Comparable validity but unstable reliability is exactly the essay's open question rendered concrete: the LLM is not error-free and not obviously better than humans; the honest finding is how close to the human failure mode it lands. The implied fix is the same — standardize prompts/scoring, average multiple passes/raters, validate against an outcome.
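A minimal sketch of what "test-retest reliability" and "average multiple passes" mean operationally, assuming the same responses are scored twice by the same rater under an identical prompt. The scores below are made-up illustration data, not values from Zhang et al., and the helper names `pearson` and `average_passes` are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_passes(passes):
    """Average several scoring passes response by response."""
    return [mean(scores) for scores in zip(*passes)]


# Hypothetical ratings of the same six responses on two occasions.
pass_1 = [3.0, 4.0, 2.0, 5.0, 3.0, 4.0]
pass_2 = [4.0, 3.0, 2.0, 4.0, 5.0, 3.0]

# Test-retest reliability: correlation of the rater with itself over time.
test_retest_r = pearson(pass_1, pass_2)

# The implied fix: report the per-response mean instead of a single pass.
averaged_scores = average_passes([pass_1, pass_2])

print(f"test-retest r = {test_retest_r:.2f}")
print("averaged scores:", averaged_scores)
```

A low correlation here means the rank order of candidates changes between passes, which is the instability the bullets above describe. Averaging only helps to the extent that the errors in separate passes are independent.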
Source
IEEE Transactions on Affective Computing · Tianyi Zhang et al. · 2024 · peer-reviewed
Context
- What came before
- Automated video interviews (AVIs) are marketed as faster and more objective than human interviews. Machine-learning AVI research (Hickman et al. 2021; Koutsoumpis et al. 2024) had already found test-retest reliability below desired personnel-selection standards. This study extended the question to general-purpose LLMs (GPT-3.5/GPT-4) as raters.
- What comes next
- Anchors the AI side of the interview case study, set beside the human reliability baseline (single interviewer ≈ .44; structured/panel/trained pushes higher). Corroborated by Hickman et al. 2021 (AVI personality, mixed reliability) and Koutsoumpis et al. 2024 (test-retest below selection standards). Extract the exact reliability/fairness coefficients from full text before citing specific numbers.
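A rough worked example of why panels or repeated passes push reliability above the single-rater baseline. It assumes the Spearman-Brown prophecy formula applies, meaning raters or passes are interchangeable and their errors are independent. That formula is not taken from the note's sources, and the assumption may not hold for repeated LLM passes. The .44 input is the single-interviewer figure quoted above; the function name `spearman_brown` is hypothetical.

```python
def spearman_brown(single_reliability: float, k: int) -> float:
    """Predicted reliability of the mean of k parallel ratings.

    Assumes each rating has the same reliability and independent error.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    r = single_reliability
    return k * r / (1 + (k - 1) * r)


# Single-interviewer baseline quoted in the note (approximate).
baseline = 0.44

for k in (1, 2, 3, 4):
    print(f"{k} rater(s): predicted reliability = {spearman_brown(baseline, k):.2f}")
```

Under these assumptions three averaged ratings would reach about .70. Whether LLM passes behave like parallel raters is an empirical question that the paper's coefficients, once extracted, would help answer.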