peopleanalyst

analytics · Q7 · to verify

Bertrand & Mullainathan 2004 (AER) — identical resumes, white-sounding names get 50% more callbacks

In a field experiment that sent ~5,000 fictitious resumes in response to help-wanted ads in Boston and Chicago, resumes were identical except that each was randomly assigned either a very white-sounding or a very African-American-sounding name. White-sounding names received 50% more interview callbacks. Each screener's response varied systematically with a feature (the name) that carries no information about the candidate's qualifications.

Differential interview-callback rate by randomly assigned race-signaling name on otherwise-identical resumes
White-sounding names received 50% more callbacks than African-American-sounding names. A higher-quality resume raised callbacks 30% for white names but produced a far smaller increase for African-American names. The gap was uniform across occupation, industry, and employer size; Equal-Opportunity-Employer and federal-contractor ads discriminated as much as others.
Sample
~5,000 fictitious resumes sent to help-wanted ads in Boston and Chicago
Methodology
Resume correspondence / audit field experiment with random assignment of race-signaling first names to otherwise-matched resumes; outcome = employer callback for interview.
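
The random-assignment step is what makes the design causal, so a minimal sketch of it may help. Everything here is illustrative rather than the authors' procedure: the function `assign_names`, the even split between groups, and the two-name pools (borrowed from the paper's title as placeholders) are assumptions.

```python
import random

# Placeholder name pools taken from the paper's title (assumption: illustrative only).
WHITE_NAMES = ["Emily", "Greg"]
AFRICAN_AMERICAN_NAMES = ["Lakisha", "Jamal"]

def assign_names(resume_ids, seed=0):
    """Randomly assign a race-signaling name to each resume.

    The assignment depends only on the shuffle, never on resume content,
    so any callback gap between groups cannot come from qualifications.
    """
    rng = random.Random(seed)
    ids = list(resume_ids)
    rng.shuffle(ids)
    half = len(ids) // 2  # assumption: an even split between the two groups
    assignment = {}
    for position, resume_id in enumerate(ids):
        if position < half:
            group, pool = "white", WHITE_NAMES
        else:
            group, pool = "african_american", AFRICAN_AMERICAN_NAMES
        assignment[resume_id] = (group, rng.choice(pool))
    return assignment

assignment = assign_names(range(8), seed=42)
for resume_id in sorted(assignment):
    print(resume_id, assignment[resume_id])
```

Because the name is attached after shuffling, resume quality is balanced across the two groups in expectation, which is the property the card's causal reading relies on.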

What this means

  • This is the canonical demonstration that single-rater resume screening is not reliable as a measurement of candidate qualification: holding the resume's substance constant, the screener's decision moves with an irrelevant attribute (the name). The 'rater' is reacting to what psychometrics calls construct-irrelevant variance.
  • Because the names were randomly assigned to identical applications, the 50% callback gap is causal evidence of bias in the human screening judgment itself, not a reflection of true differences between applicants — the cleanest possible separation of rater variance from ratee variance in the screening domain.
  • A later meta-analysis (Quillian et al. 2017, PNAS; 28 studies, 55,842 applications) shows the effect is durable: whites averaged 36% more callbacks than African Americans, with no decline over 25 years. This establishes the human-failure baseline against which AI resume screeners must be compared.

Source

Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination

American Economic Review (NBER working-paper version; AER 2004) · Marianne Bertrand & Sendhil Mullainathan · 2004 · peer-reviewed

Context

What came before
Discrimination in hiring had been studied via wage-gap regressions and survey self-report, both confounded by unobserved differences between real applicants. The resume-audit design removed that confound by randomizing the race signal onto identical applications.
What comes next
Establishes the human-screener failure baseline for the resume-screening case study. Sets up the AI-side question: do LLM/embedding resume screeners reproduce the same name-driven callback gap (Wilson et al. 2024; Armstrong et al. 2024)? Before drafting, verify against the full text the exact callback rates (~9.7% for white-sounding vs ~6.5% for African-American-sounding names; only this direction is consistent with the reported 50% gap) and N=4,870.
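
The direction of the two approximate rates can be sanity-checked with simple arithmetic before going back to the full text. The sketch below assumes the roughly 9.7% and 6.5% figures from the note above, which are still unverified, and uses a hypothetical helper `relative_gap`; it is a consistency check, not a reproduction of the paper's analysis.

```python
# Approximate callback rates from the card's note (assumption: not yet verified).
HIGHER_RATE = 0.097
LOWER_RATE = 0.065

def relative_gap(rate_a, rate_b):
    """Return how many more callbacks group A gets than group B, as a fraction."""
    return rate_a / rate_b - 1.0

# If white-sounding names have the higher rate, the gap is close to +50%.
gap_if_white_higher = relative_gap(HIGHER_RATE, LOWER_RATE)

# If the rates were assigned the other way round, the gap would be negative.
gap_if_white_lower = relative_gap(LOWER_RATE, HIGHER_RATE)

print(f"white higher: {gap_if_white_higher:+.0%}")
print(f"white lower:  {gap_if_white_lower:+.0%}")
```

Only the first assignment matches the headline "50% more callbacks" figure, so the higher rate must belong to white-sounding names if both rates are roughly right.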