peopleanalyst

Model Card

A model card and evaluation audit for any people-data model — documented like it should have been on day one.

The method

Model Cards for Model Reporting (Mitchell et al., 2019), combined with ML evaluation auditing

A vendor's attrition model arrives with a deck claiming ninety-five percent accuracy. On what population, against what base rate, checked across which subgroups, monitored for what drift — nobody in the room can say. Next month it starts scoring your employees.

Margaret Mitchell and her colleagues proposed the model card in 2019 as a small, stubborn discipline: a model ships with a standardized disclosure — intended use and explicit out-of-scope uses, the populations it was trained and evaluated on, performance broken out by subgroup rather than averaged into a single flattering number. The card is not paperwork. It forces exactly the questions a sales demo is designed to suppress.

Chip Huyen's two books explain why those questions decide outcomes in production. Designing Machine Learning Systems situates the model as one small component in a much larger system of data pipelines, deployment, and monitoring, and argues that transparency — model cards named explicitly — belongs early in the lifecycle rather than as post-deployment ethics theater. Her catalog of production failure modes is the audit's checklist: offline accuracy is not production performance, distributions shift after launch, labels carry their own error, and leakage manufactures results that evaporate on contact with reality. AI Engineering extends the argument to the foundation-model era with a sharper thesis: rigorous evaluation pipelines, not clever prompting, are the scarce discipline separating applications you can trust from applications that merely impress.

Ethan Mollick's Co-Intelligence supplies the working posture. His jagged frontier — AI capability does not follow intuition, so systems excel and fail in adjacent, unpredictable places — means you test where a model fails rather than extrapolate from where it shines. For models scoring people, the stakes are careers, which is why an audit has to track what was actually evidenced against what was merely claimed.

The service drafts the Mitchell-style card and runs the six-check evaluation audit (discrimination, calibration, subgroup performance, drift, label quality, leakage) from what your input actually evidences. Every performance figure is reported by the input, never invented; the gaps list is the evaluation workplan you hand the data science team.

How it works

The service drafts a Mitchell et al. model card (intended use and out-of-scope uses, population and data provenance, performance as reported by input, evaluation gaps, ethical considerations, monitoring plan) and runs an evaluation audit across six checks (discrimination, calibration, subgroup performance, drift monitoring, label quality, leakage risk). Each check carries an evidence status, a why-it-matters note, and a how-to-close step. Performance is reported by input only: the service never invents performance numbers. Quantitative disparity metrics are delegated to a dedicated fairness-monitoring engine; this tool is the documentation and audit-plan layer. It is grounded in the ai-applications corpus.

You bring

{ model_description, cluster? }

You get

{ model_summary, card[6 sections], audit[6 checks], priority_gaps[], grounded_in, provenance }
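
A minimal sketch of checking a response against the shape above. Only the top-level key names and the two counts of six come from this page; the helper name check_response_shape, the assumption that card and audit are sized containers, and the placeholder values are illustrative and not part of the documented API.

```python
# Top-level keys listed under "You get".
EXPECTED_KEYS = ("model_summary", "card", "audit",
                 "priority_gaps", "grounded_in", "provenance")


def check_response_shape(resp):
    """Return a list of shape problems; an empty list means the shape matches.

    Hypothetical helper: assumes `card` and `audit` support len().
    """
    problems = [f"missing key: {k}" for k in EXPECTED_KEYS if k not in resp]
    for key in ("card", "audit"):
        if key in resp and len(resp[key]) != 6:
            problems.append(f"{key} has {len(resp[key])} entries, expected 6")
    return problems


# Illustrative placeholder response, not real service output.
sample = {
    "model_summary": "placeholder",
    "card": ["section"] * 6,
    "audit": ["check"] * 6,
    "priority_gaps": [],
    "grounded_in": "placeholder",
    "provenance": "placeholder",
}
print(check_response_shape(sample))  # prints []
```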

See it work

example output

A vendor attrition-risk model scoring 4,200 employees monthly into manager watchlists — AUC 0.79 claimed, no subgroup data, never re-validated in 3 years.

Model: A vendor-supplied attrition-risk model scores all 4,200 employees monthly on a 0-100 scale using tenure, compa-ratio, manager changes, commute distance, and engagement scores, trained on this organization's 2019-2024 terminations. Scores above 70 populate manager watchlists, meaning the model directly shapes how managers focus retention attention on individual employees. The vendor reports a single overall AUC of 0.79 with no calibration, no subgroup breakdowns, no label-definition detail, and no leakage assessment. The model has run in production for three years without re-validation, so its current discrimination, drift, and fairness behavior are entirely unevidenced despite feeding consequential, person-level managerial decisions.

Intended Use & Out-of-Scope Uses

  • Intended use as deployed: produce a monthly 0-100 attrition-risk score for each of ~4,200 employees; scores above 70 feed manager watchlists intended to prompt retention attention.
  • This is a prioritization/attention-directing tool, not an adjudication tool; scores should inform manager conversations, not automated employment actions.
  • Out of scope: using scores in promotion, compensation, PIP, layoff-selection, or termination decisions — the model was trained on who left, not on performance or potential.
  • Not reported: Vendor's stated intended-use boundaries; Whether managers are trained on how to act on watchlist placement

Population & Data Provenance

  • Scored population: all ~4,200 current employees, scored monthly.
  • Training data: this organization's own terminations from 2019-2024 (reported by input).
  • Features: tenure, compa-ratio, manager changes, commute distance, engagement scores (reported by input).
  • Not reported: Training set size and number of termination events; Class balance (voluntary vs involuntary terminations; whether both are pooled as the label)

Performance Summary (As Reported)

  • Vendor-reported overall discrimination: AUC 0.79 (reported by input).
  • This single aggregate is the only performance figure supplied; it is a rank-ordering statistic and says nothing about calibration or per-subgroup behavior.
  • The AUC is not dated and its evaluation set (holdout, time period, population) is not described.
  • Not reported: Calibration metrics; Precision/recall or watchlist hit-rate at the 70 threshold
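
Why a lone AUC says nothing about calibration can be shown in a few lines. The sketch below uses invented toy data, not the vendor's: it cubes a set of predicted probabilities, which keeps their order (and so the AUC) intact while dragging the average prediction far from the observed attrition rate.

```python
def auc(scores, labels):
    """Probability that a random leaver outscores a random stayer (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean(values):
    return sum(values) / len(values)


# Synthetic example: 1 = left, 0 = stayed. Values are invented for illustration.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
squashed = [p ** 3 for p in probs]  # order-preserving, so rank order is unchanged

auc_original = auc(probs, labels)
auc_squashed = auc(squashed, labels)

# Calibration-in-the-large: gap between mean prediction and observed rate.
gap_original = abs(mean(probs) - mean(labels))
gap_squashed = abs(mean(squashed) - mean(labels))

print(auc_original, auc_squashed)   # identical AUCs
print(gap_original, gap_squashed)   # the squashed scores are far worse calibrated
```

Both score sets rank employees identically, so no AUC figure can distinguish them; only a calibration check can.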

Evaluation Gaps

  • No calibration evidence: cannot claim a score of 70 corresponds to any specific attrition probability.
  • No subgroup performance whatsoever — input explicitly states none provided; fairness across gender, age, race, geography, job family is entirely unevidenced.
  • No re-validation in 3 years of deployment — current AUC is unknown and may have decayed with workforce and labor-market change.
  • Not reported: Any evaluation performed by this organization (vs vendor); Any adverse-impact analysis for this employment-related tool
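
Closing the calibration gap starts with a table of observed attrition by score band. A minimal sketch, assuming you can join past scores to a later stayed/left outcome; the band edges, the function name reliability_table, and the data are all illustrative.

```python
def reliability_table(scores, left, upper_edges=(40, 70, 100)):
    """Observed attrition rate per score band.

    Bands are (previous edge, edge], so the last band is exactly the
    'above 70' watchlist population. `left` holds 1 for leavers, 0 for stayers.
    """
    rows = []
    lower = float("-inf")
    for upper in upper_edges:
        outcomes = [y for s, y in zip(scores, left) if lower < s <= upper]
        rate = sum(outcomes) / len(outcomes) if outcomes else None
        rows.append({"upper_edge": upper, "n": len(outcomes), "observed_rate": rate})
        lower = upper
    return rows


# Synthetic scores and outcomes, invented for illustration.
scores = [10, 35, 40, 55, 65, 70, 75, 80, 90, 95]
left = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

table = reliability_table(scores, left)
for row in table:
    print(row)
```

If the observed rate in the top band is nowhere near what managers assume a score above 70 means, the threshold needs recalibrating before it drives watchlists.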

Ethical Considerations & Caveats

  • This is an employment-related person-level model; watchlist placement can affect how managers perceive and treat individuals, creating self-fulfilling or stigmatizing dynamics.
  • Absence of any subgroup analysis means the model could systematically over-flag protected groups (e.g., via commute distance as a proxy for geography/socioeconomic status) with no way to detect it.
  • Managers may over-trust a numeric score without understanding its correlational, backward-looking nature.
  • Not reported: Employee notice/consent posture; Whether high scores are ever seen by employees

Monitoring Plan

  • Quantitative disparity/fairness auditing is delegated to a dedicated single-model fairness-monitoring engine that computes subgroup metrics (AUC, calibration, watchlist rate, flag rate) on live scored data; this card does NOT compute any statistic.
  • WHAT to monitor: overall and per-subgroup discrimination and calibration; watchlist (>70) rate by subgroup; feature-distribution drift (especially commute distance and engagement scores); score-distribution shift month-over-month; manager action rates on watchlisted employees.
  • Subgroups to monitor: gender, age band, race/ethnicity, job family, geography/site, tenure band.
  • Not reported: Existing monitoring tooling or owner; Whether any fairness monitor is currently in place
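
Two of the monitored quantities above can be sketched briefly. The card itself computes nothing, so treat this as an illustration of what a monitoring engine might run, not as this service's output. The Population Stability Index is one common drift measure chosen here for illustration (the plan does not name a metric), and the bin edges, function names, and data are assumptions.

```python
import math


def bin_shares(values, edges, floor=1e-6):
    """Share of values in each [edges[i], edges[i+1]) bin; the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    # Floor empty bins so the logarithm below is defined.
    return [max(c / len(values), floor) for c in counts]


def psi(baseline, current, edges):
    """Population Stability Index between two score samples (0 means identical shares)."""
    b = bin_shares(baseline, edges)
    c = bin_shares(current, edges)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def watchlist_rate_by_group(rows, threshold=70):
    """rows: iterable of (group, score). Share of each group scoring above the threshold."""
    totals, flagged = {}, {}
    for group, score in rows:
        totals[group] = totals.get(group, 0) + 1
        flagged[group] = flagged.get(group, 0) + (score > threshold)
    return {g: flagged[g] / totals[g] for g in totals}


# Synthetic data, invented for illustration.
edges = [0, 25, 50, 75, 100]
last_year = [10, 20, 30, 40, 55, 60, 65, 80]
this_month = [30, 45, 60, 70, 72, 80, 85, 95]
drift = psi(last_year, this_month, edges)

rates = watchlist_rate_by_group(
    [("site_a", 82), ("site_a", 40), ("site_b", 71), ("site_b", 90)]
)
print(drift, rates)
```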

Evaluation audit

  • Discrimination / rank-ordering validity — partial: The watchlist depends on the model correctly rank-ordering who is likely to leave; a single 3-year-old aggregate AUC gives no assurance the model still separates leavers from stayers in the current 4,200-employee population, and offers no view at the >70 operating point that actually drives decisions.
  • Calibration — missing: The 70 cutoff implicitly treats the score as a probability, but with no calibration evidence there is no basis to claim 70 means any particular attrition risk; managers may act on a threshold that is arbitrary relative to actual likelihood of leaving.
  • Subgroup performance — missing: Input explicitly states no subgroup performance was provided. For an employment-related tool flagging individuals, undetected disparities (e.g., commute distance proxying for geography/race, or age correlating with tenure) could produce adverse impact in who lands on watchlists with zero visibility.
  • Drift monitoring — missing: The model has run 3 years without re-validation and was trained on 2019-2024 (COVID-era) data; workforce composition, labor market, and especially commute/remote-work patterns have likely shifted, so the score distribution and feature relationships may have drifted materially without anyone noticing.
  • Label quality — missing: The label is 'our 2019-2024 terminations' but whether it pools voluntary resignations with involuntary/layoff terminations is unstated; if involuntary exits are in the label, the model partly predicts who the company chose to remove, which is a very different and ethically loaded target for a retention watchlist.
  • Leakage risk — missing: Features like manager changes and engagement scores can be recorded near or after a departure decision, and commute distance may correlate with outcomes captured post-hoc; leakage would inflate the reported 0.79 AUC and make the model look more predictive than it is in true prospective use for the current workforce.
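
The leakage check turns on timing: a feature value recorded after the date the score was supposed to be made cannot have been known in prospective use. A minimal sketch of that test, assuming each feature value carries a recorded-on date; the function name, field names, and dates are invented for illustration.

```python
from datetime import date


def features_recorded_after_cutoff(feature_dates, cutoff):
    """Names of features whose value was recorded after the scoring cutoff.

    feature_dates: dict of feature name -> date the value was recorded.
    """
    return sorted(name for name, recorded in feature_dates.items() if recorded > cutoff)


# Synthetic training record for one departed employee.
scoring_cutoff = date(2023, 3, 1)  # the model should only see data up to here
record = {
    "tenure": date(2023, 2, 28),
    "compa_ratio": date(2023, 1, 15),
    "manager_changes": date(2023, 3, 20),   # logged after the cutoff
    "engagement_score": date(2023, 4, 2),   # survey taken after the cutoff
}

suspect = features_recorded_after_cutoff(record, scoring_cutoff)
print(suspect)  # ['engagement_score', 'manager_changes']
```

Any feature this flags across a meaningful share of training rows is a candidate explanation for an AUC that will not hold up on the current workforce.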

Run it on your data

Call it on your own inputs — over the API, or hand it to your AI agent via MCP. Discovery is open; running it is metered.

REST  POST /api/bicycle/model-card
MCP   design_model_card
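
A hedged sketch of building the REST call in Python. The path and the input field names come from this page; the base URL is a placeholder, the JSON content type is an assumption, and authentication is left out because the page does not describe it.

```python
import json
import urllib.request


def build_model_card_request(base_url, model_description, cluster=None):
    """Build (but do not send) a POST request for the model-card endpoint."""
    payload = {"model_description": model_description}
    if cluster is not None:
        payload["cluster"] = cluster  # optional field, per "You bring"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/bicycle/model-card",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented here
        method="POST",
    )


# Placeholder host: substitute the real base URL for the service.
req = build_model_card_request(
    "https://example.invalid",
    "Vendor attrition-risk model, AUC 0.79 claimed, no subgroup data.",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it once a real host and credentials are in place.
```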
