peopleanalyst

HR Data Quality

Find out what your HR data can and can't answer — before the analysis embarrasses you.

The method

Data-quality dimensions audit with fitness-for-purpose assessment

The CHRO asks for an attrition analysis by Friday. The analyst opens the warehouse and finds three HRIS migrations, terminations coded five different ways, and a manager field that is one-third stale. The data's problems will surface either way; the only choice is whether they surface in profiling or in front of the executive team.

Ferrar and Green's Excellence in People Analytics, built on research with over a hundred organizations, treats data as one of nine dimensions of a working analytics function — and pointedly not the first. Their case-study organizations start from business questions and invest in governance and the data foundation deliberately, as infrastructure for value, rather than reactively after an analysis embarrasses someone. The ordering matters: data quality is not a virtue to maximize in the abstract, it is a capability you build toward the questions you intend to answer.

Guenole, Ferrar, and Feinzig's The Power of People makes that concrete with their eight-step model: frame the business question and build hypotheses before touching data, so that gathering and quality-checking serve the question. The implication practitioners live daily is that fitness is purpose-relative — the same dataset can be perfectly fit for a headcount report and unusable for a survival analysis, because the two make different demands on grain, history, and coding consistency. Edwards, Edwards, and Jang's Predictive HR Analytics shows the same truth from the trenches: their click-by-click case studies work only because messy organizational data gets converted, field by field, into something a statistical test can honestly run on. What the data cannot support, the analysis cannot claim.

The method's honest boundary: no audit can certify accuracy from a description. A described stack supports finding structural risk — join keys that will not join, coding drift across migrations, staleness in slowly-updated fields — but accuracy claims need profiling against the actual rows.
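Coding drift is one structural risk that becomes checkable once rows are available. The following is a minimal profiling sketch, not the tool's implementation: it lists codes that appear in some periods but not others. The row layout and the key names `year` and `term_code` are assumptions made for illustration.

```python
from collections import defaultdict


def codes_by_period(rows, period_key, code_key):
    """Map each period to the set of distinct codes observed in it."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[period_key]].add(row[code_key])
    return dict(seen)


def drifting_codes(rows, period_key, code_key):
    """Return codes absent from at least one period (candidates for drift)."""
    by_period = codes_by_period(rows, period_key, code_key)
    all_codes = set().union(*by_period.values()) if by_period else set()
    return sorted(
        code
        for code in all_codes
        if any(code not in codes for codes in by_period.values())
    )


# Hypothetical rows: the same concept coded two ways across a migration.
rows = [
    {"year": 2021, "term_code": "VOL"},
    {"year": 2021, "term_code": "INV"},
    {"year": 2023, "term_code": "Voluntary"},
    {"year": 2023, "term_code": "INV"},
]
print(drifting_codes(rows, "year", "term_code"))
```

A code that shows up in only some periods is a prompt to ask the HRIS owner whether the coding scheme changed. It is not proof of an error: a legitimately new or retired code looks the same.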

Describe the systems and the analyses you intend, and the audit runs the seven dimensions — every finding with severity, what it breaks downstream, and one concrete fix — plus a fitness verdict per intended analysis and a leverage-ordered remediation plan. Where only profiling can answer, it says so honestly: the service never invents facts about data it has not seen.

How it works

Audits a described HR dataset (systems, fields, known issues, and optionally pasted schema or profile stats) across the seven canonical data-quality dimensions: completeness, validity, consistency, uniqueness, timeliness, accuracy, and lineage/joinability, grounded in the people-analytics corpus. Every finding carries a severity, what it breaks downstream, and one concrete remediation; every intended analysis gets a fitness-for-purpose verdict; and the audit closes with a leverage-ordered remediation plan. It never invents facts about data it has not seen: anything it cannot judge from the description is flagged as cannot-assess or needs-profiling.

You bring

{ dataset, intended_analyses?, cluster? }

You get

{ dataset_summary, dimensions[] (findings · cannot_assess), fitness_for_purpose[] (verdict · blocking_issues), remediation_plan[], needs_profiling[], grounded_in, provenance }
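For readers who prefer typed shapes, the two summaries above can be sketched as Python `TypedDict`s. Only the key names come from this page (the remediation keys are taken from the example output); every value type here is an assumption, and the helper `missing_keys` is a hypothetical convenience, not part of the service.

```python
from typing import Any, TypedDict


class _AuditRequestRequired(TypedDict):
    dataset: str  # assumed: free-text description of systems, fields, issues


class AuditRequest(_AuditRequestRequired, total=False):
    intended_analyses: list[str]  # optional, per the `?` marker
    cluster: str  # optional; its meaning and type are not documented here


class Dimension(TypedDict, total=False):
    findings: list[dict[str, Any]]  # inner finding shape is not specified
    cannot_assess: Any


class FitnessVerdict(TypedDict, total=False):
    verdict: str  # e.g. "fit_with_caveats" or "not_fit" in the example
    blocking_issues: list[str]


class RemediationStep(TypedDict, total=False):
    priority: int
    action: str
    addresses: list[str]
    unblocks: list[str]


class AuditResponse(TypedDict):
    dataset_summary: str
    dimensions: list[Dimension]
    fitness_for_purpose: list[FitnessVerdict]
    remediation_plan: list[RemediationStep]
    needs_profiling: list[Any]
    grounded_in: Any
    provenance: Any


def missing_keys(response: dict) -> list[str]:
    """List top-level response keys named on this page that are absent."""
    return sorted(k for k in AuditResponse.__required_keys__ if k not in response)


request: AuditRequest = {"dataset": "Workday HRIS, four years of core HR"}
print(missing_keys({"dataset_summary": "...", "dimensions": []}))
```

Checking for missing top-level keys is a cheap guard for a client, since it fails loudly if the response shape differs from what this page describes.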

See it work

example output

Workday + Greenhouse + anonymous Qualtrics stack headed into attrition, source-of-hire, and engagement-linkage analyses — with backfilled term reasons and no ATS↔HRIS key.

Dataset: This is a three-source people-analytics dataset: Workday HRIS (4 years of core HR), Greenhouse ATS (3 years, with no shared employee key to Workday), and an annual Qualtrics engagement survey that is anonymous and only available as team-level rollups. Known defects include term_reason values backfilled from memory for 2021-2022, a 2024 re-leveling event that re-mapped job levels, and contractor records commingled into headcount tables. The intended analyses — attrition drivers by tenure and level, source-of-hire quality, and engagement-attrition linkage — each depend on joins and comparability that the current setup partially or wholly breaks.

Completeness

  • [high] term_reason for 2021-2022 was backfilled from memory rather than captured at the time of termination, so this field is functionally missing/unreliable for two of the four Workday years.
    • Breaks: attrition drivers by tenure and level; engagement-attrition linkage
    • Fix: HRIS owner should flag 2021-2022 term_reason as 'reconstructed' in Workday and exclude or separately model those records; capture prospective term_reason at exit going forward.

Validity

  • [high] Contractor records are mixed into headcount tables, meaning worker-type is not a clean valid categorical filter and headcount/attrition denominators include non-employees.
    • Breaks: attrition drivers by tenure and level
    • Fix: HRIS owner should add/validate a worker_type flag in Workday and exclude contractors from employee attrition denominators.

Consistency

  • [high] The 2024 re-leveling re-mapped job levels, so level values are not comparable across the 4-year window; the same person/role may sit at different level codes pre- and post-2024.
    • Breaks: attrition drivers by tenure and level
    • Fix: HRIS/comp owner should build a level-crosswalk mapping pre-2024 levels to the post-2024 scheme and apply a consistent normalized level for time-series analysis.

Timeliness

  • [medium] Engagement data is an annual snapshot, giving coarse temporal resolution that may not align in time with attrition events for linkage.
    • Breaks: engagement-attrition linkage
    • Fix: Analytics team should fix the survey wave date and align attrition windows to the survey period (e.g., attrition in the 12 months following each wave) when constructing the linkage.

Accuracy

  • [high] term_reason backfilled from memory for 2021-2022 is subject to recall bias and is likely inaccurate, corrupting any voluntary/involuntary or driver attribution for those years.
    • Breaks: attrition drivers by tenure and level; engagement-attrition linkage
    • Fix: Analytics team should treat 2021-2022 term_reason as low-confidence, report driver analysis restricted to prospectively-captured years, and validate a sample against exit documentation where it exists.

Lineage/joinability

  • [critical] Greenhouse ATS has no shared employee key with Workday, so hires cannot be reliably joined back to their eventual tenure/performance/attrition outcomes — the core input for source-of-hire quality.
    • Breaks: source-of-hire quality
    • Fix: Recruiting Ops/IT should establish a candidate-to-employee key (e.g., stamp Workday employee_id onto the Greenhouse hire record at offer-accept, or build a deterministic email/name+start-date match table).
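The deterministic match table mentioned in the fix above could look roughly like the sketch below. It is an illustration under stated assumptions, not the tool's method: the record keys (`candidate_id`, `employee_id`, `email`, `start_date`, `hire_date`) are hypothetical, and it assumes both systems hold the same email address for a person, which often fails when the ATS stores a personal address and the HRIS a work one.

```python
from collections import defaultdict


def norm_email(email: str) -> str:
    """Normalize an email address for exact comparison."""
    return email.strip().lower()


def build_match_table(hires, employees):
    """Match ATS hires to HRIS employees on exact email + start date.

    Returns (matches, unmatched): matches maps candidate_id to employee_id;
    unmatched lists candidate_ids with zero or several candidates, so that
    ambiguous cases are reviewed by a person rather than guessed.
    """
    index = defaultdict(list)
    for emp in employees:
        key = (norm_email(emp["email"]), emp["hire_date"])
        index[key].append(emp["employee_id"])

    matches, unmatched = {}, []
    for hire in hires:
        key = (norm_email(hire["email"]), hire["start_date"])
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matches[hire["candidate_id"]] = candidates[0]
        else:
            unmatched.append(hire["candidate_id"])
    return matches, unmatched


# Hypothetical records for illustration only.
hires = [
    {"candidate_id": "c1", "email": " Ana@Example.com ", "start_date": "2023-03-01"},
    {"candidate_id": "c2", "email": "bo@example.com", "start_date": "2023-04-01"},
]
employees = [
    {"employee_id": "e9", "email": "ana@example.com", "hire_date": "2023-03-01"},
]
matches, unmatched = build_match_table(hires, employees)
print(matches, unmatched)
```

Reporting the unmatched share alongside any source-of-hire result keeps the coverage of the join visible to whoever reads the analysis.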

Fitness for purpose

  • attrition drivers by tenure and level → fit_with_caveats (blocked by: 2024 level re-mapping breaks level comparability)
  • source-of-hire quality → not_fit (blocked by: no shared employee key between Greenhouse and Workday)
  • engagement-attrition linkage → fit_with_caveats (blocked by: survey is anonymous, team-level only)
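Because the survey is available only as team-level rollups, any engagement linkage has to be built at team grain. The sketch below illustrates the window alignment suggested in the medium-severity finding above: attrition per team in the roughly 12 months after a survey wave. The record keys (`team`, `hire_date`, `term_date`) and the 365-day window are assumptions for illustration.

```python
from datetime import date, timedelta


def team_attrition_after_wave(employees, wave, window_days=365):
    """Per-team attrition rate in the window (wave, wave + window_days].

    The denominator is everyone employed on the wave date; people hired
    after the wave are excluded because they could not have been surveyed.
    """
    end = wave + timedelta(days=window_days)
    active, leavers = {}, {}
    for emp in employees:
        term = emp["term_date"]  # None means still employed
        if emp["hire_date"] <= wave and (term is None or term > wave):
            team = emp["team"]
            active[team] = active.get(team, 0) + 1
            if term is not None and term <= end:
                leavers[team] = leavers.get(team, 0) + 1
    return {team: leavers.get(team, 0) / n for team, n in active.items()}


# Hypothetical records for illustration only.
staff = [
    {"team": "A", "hire_date": date(2022, 1, 1), "term_date": date(2023, 9, 1)},
    {"team": "A", "hire_date": date(2022, 1, 1), "term_date": None},
    {"team": "A", "hire_date": date(2023, 7, 1), "term_date": date(2023, 10, 1)},
    {"team": "B", "hire_date": date(2020, 1, 1), "term_date": date(2025, 1, 1)},
]
rates = team_attrition_after_wave(staff, wave=date(2023, 5, 1))
print(rates)
```

These team rates can then be set beside the team-level engagement scores for the same wave. With small teams the rates are noisy, so the linkage supports directional reading at best.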

Remediation plan (top 3)

  1. Recruiting Ops/IT establish a candidate-to-employee key (stamp Workday employee_id onto Greenhouse hire records at offer-accept, or build a deterministic match table).
    • Addresses: lineage_joinability
    • Unblocks: source-of-hire quality
  2. HRIS/comp owner build a pre-2024-to-post-2024 job-level crosswalk and apply a normalized level for time-series analysis.
    • Addresses: consistency
    • Unblocks: attrition drivers by tenure and level
  3. HRIS owner add/validate a worker_type flag in Workday and exclude contractors from employee attrition denominators.
    • Addresses: validity
    • Unblocks: attrition drivers by tenure and level; engagement-attrition linkage
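The job-level crosswalk in the second remediation item can be sketched as a lookup applied only to records that predate the re-leveling. The level codes and the cutover date below are hypothetical; the source says only that the re-leveling happened in 2024.

```python
from datetime import date


def normalize_level(level_code, effective_date, releveling_date, crosswalk):
    """Return the level on the post-re-leveling scheme.

    Codes dated on or after the re-leveling pass through unchanged; older
    codes are mapped through the crosswalk. An unmapped old code raises,
    so gaps in the crosswalk are found instead of silently dropped.
    """
    if effective_date >= releveling_date:
        return level_code
    try:
        return crosswalk[level_code]
    except KeyError:
        raise ValueError(
            f"no crosswalk entry for pre-re-leveling level {level_code!r}"
        ) from None


# Hypothetical crosswalk and cutover date for illustration only.
CROSSWALK = {"L3": "P2", "L4": "P3"}
CUTOVER = date(2024, 1, 1)

print(normalize_level("L3", date(2023, 6, 1), CUTOVER, CROSSWALK))
print(normalize_level("P2", date(2024, 6, 1), CUTOVER, CROSSWALK))
```

Raising on an unmapped code is a deliberate choice in this sketch: a silent default would reintroduce the comparability problem the crosswalk is meant to remove.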

Run it on your data

Call it on your own inputs — over the API, or hand it to your AI agent via MCP. Discovery is open; running it is metered.

REST  POST /api/bicycle/hr-data-quality
MCP   audit_hr_data_quality
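A minimal sketch of preparing a call to the REST endpoint with Python's standard library. Only the path and the input field names come from this page. The host in `BASE_URL` is a placeholder, the JSON body encoding is an assumption, and no authentication is shown because the page does not describe how metered access is authorized.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.invalid"  # placeholder; the real host is not given here


def build_audit_request(dataset, intended_analyses=None, cluster=None):
    """Build, but do not send, a POST request for the audit endpoint."""
    body = {"dataset": dataset}
    if intended_analyses is not None:
        body["intended_analyses"] = intended_analyses
    if cluster is not None:
        body["cluster"] = cluster
    return Request(
        BASE_URL + "/api/bicycle/hr-data-quality",
        data=json.dumps(body).encode("utf-8"),  # assumed JSON request body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_audit_request(
    "Workday HRIS (4 years), Greenhouse ATS (3 years), annual Qualtrics survey",
    intended_analyses=["attrition drivers by tenure and level"],
)
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)` once the real host and any required credentials are known.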
