Handbook of Regression Modeling in People Analytics
Keith McNulty · 2021
In a sentence
A practical handbook teaching analytics practitioners how to select, run, and interpret the full range of regression models for inferential analysis of people-related questions, with worked examples in R and Python.
Written by a mathematician-turned-practitioner, this open-source handbook fills a critical gap for people analytics professionals who need to move beyond gut instinct and borrowed best practices toward evidence-based decisions. It treats regression as the indispensable 'Swiss army knife' of people analytics, walking the reader from statistical foundations through linear, binomial, multinomial, ordinal, mixed, structural equation, and survival models. Each method is grounded in a relatable problem, demystified with just enough mathematics to interpret outputs credibly, and demonstrated with reproducible code on realistic data sets. The book emphasizes inference (understanding why something happens) over pure prediction, reflecting the reality of small, consequential people data sets, and equips analysts to defend, critique, and communicate their models to non-statistical stakeholders.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A framework model in which the analyst's design choices (outcome type identification, method selection, assumption checking, parsimony) drive the production of valid statistical inferences about people-related outcomes. Outcome characteristics and data structure condition the choice of regression technique, which through proper interpretation and validation yields trustworthy inference and stakeholder impact.
Outcome Variable Type · contextual condition
The measurement nature of the outcome being modeled (continuous, binary/dichotomous, nominal multi-category, ordinal, time-to-event, or hierarchical), which fundamentally determines which regression technique is appropriate to use.
Data Structure and Hierarchy · contextual condition
The structural characteristics of the data including presence of explicit grouping hierarchies, latent thematic structure among many items, missingness, collinearity, and independence assumptions, which condition the modeling approach and required adaptations.
Regression Method Selection · design lever
The analyst's deliberate choice of which regression technique to apply (linear, binomial logistic, multinomial, proportional odds, mixed, structural equation, or Cox survival) based on the outcome type and data structure, representing a primary design lever in the modeling workflow.
Coefficient Interpretation · behavioral pattern
The analyst's correct understanding and articulation of model coefficients, including odds ratios, effect directions, units, and significance, enabling translation of model output into meaningful statements about variable relationships.
Assumption Validation · design lever
The degree to which the analyst checks and confirms the underlying assumptions of the chosen model (linearity, homoscedasticity, normality of residuals, proportional odds, proportional hazards, collinearity) before trusting inferences.
Model Fit and Parsimony · outcome metric
The combined assessment of how well the model explains outcome variance (R-squared, pseudo-R-squared, goodness-of-fit tests) and how economically it does so (AIC, removal of non-significant variables), balancing explanatory power against interpretability.
Statistical Power and Sample Adequacy · contextual condition
The probability that the analysis will correctly detect a true effect given the sample size, effect size, and significance level, determining whether the data has a realistic chance of yielding valid inferences.
Valid Statistical Inference · outcome metric
The trustworthy, generalizable conclusion about the relationship between input variables and the outcome in the broader population, achieved when an appropriate, well-fitted, assumption-validated model is correctly interpreted.
Stakeholder Impact and Evidence-Based Decisions · outcome metric
The downstream organizational benefit when valid inferences are clearly communicated to non-statistical decision makers, enabling evidence-based people decisions and improved organizational outcomes.
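The link from outcome type to method selection described above can be sketched as a simple lookup. This is an illustrative sketch only: the name `suggest_method`, the string keys, and the `grouped` flag are hypothetical and not taken from the handbook, and the result is a starting point rather than a rule.

```python
# Hypothetical lookup: outcome type -> regression family named in the summary above.
METHOD_BY_OUTCOME = {
    "continuous": "linear regression",
    "binary": "binomial logistic regression",
    "nominal": "multinomial logistic regression",
    "ordinal": "proportional odds logistic regression",
    "time_to_event": "Cox proportional hazards regression",
}


def suggest_method(outcome_type: str, grouped: bool = False) -> str:
    """Return a starting-point method for a given outcome type.

    `grouped=True` flags an explicit grouping hierarchy in the data,
    in which case a mixed model is suggested (an assumption of this sketch).
    """
    try:
        method = METHOD_BY_OUTCOME[outcome_type]
    except KeyError:
        raise ValueError(f"unknown outcome type: {outcome_type!r}") from None
    if grouped:
        return f"{method}, fitted as a mixed model"
    return method


print(suggest_method("ordinal"))
print(suggest_method("binary", grouped=True))
```

Keeping the mapping in a dictionary makes the choice auditable: a reviewer can see at a glance which method each outcome type leads to before any model is run.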
How they connect
- outcome variable type → predicts method selection
- data structure and hierarchy → moderates method selection
- method selection → influences coefficient interpretation
- method selection → influences model fit and parsimony
- assumption validation → moderates valid inference
- coefficient interpretation → predicts valid inference
- model fit and parsimony → predicts valid inference
- statistical power → moderates valid inference
- valid inference → predicts stakeholder impact
A candidate measure
Handbook of Regression Modeling in People Analytics: derived measurement candidates
Outcome Variable Type
data type of outcome column; number of distinct outcome values; presence of ordering; presence of event timing
self-report suitability: none
Data Structure and Hierarchy
correlation matrix; VIF values; NA counts; pairplot inspection
self-report suitability: none
Regression Method Selection
function family invoked; correspondence of method to outcome type
self-report suitability: low
Coefficient Interpretation
accuracy of interpretation vs statistical convention; correct reference category handling
self-report suitability: low
Assumption Validation
proportion of relevant assumptions checked; results of Brant-Wald, Schoenfeld, VIF tests
self-report suitability: medium
Model Fit and Parsimony
R-squared / pseudo-R-squared; AIC; goodness-of-fit p-value; number of retained variables
self-report suitability: none
Statistical Power and Sample Adequacy
power value; minimum required n; power curves
self-report suitability: none
Valid Statistical Inference
p-values; confidence intervals; replicability
self-report suitability: none
Stakeholder Impact and Evidence-Based Decisions
adoption rate of recommendations; documented decision changes
self-report suitability: medium
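The "minimum required n" measure listed above under Statistical Power and Sample Adequacy can be approximated with a short calculation. The sketch below assumes a two-sided comparison of two group means with a standardized effect size (Cohen's d) and uses the normal approximation, which slightly understates the n given by an exact t-based calculation. The function name `min_n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def min_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-sample test.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Illustrative effect sizes, not values from the handbook.
for d in (0.2, 0.5, 0.8):
    print(d, min_n_per_group(d))
```

Smaller effects require much larger samples, which is why power matters for the small data sets common in people analytics.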
The story
The reader
A people analytics practitioner or analytics student who wants to deliver more targeted, credible, evidence-based insights to their organization.
External problem
They face messy, often small people data sets and need to explain what drives outcomes like promotion, attrition, performance, or satisfaction.
Internal problem
They feel under-equipped and lack confidence to run, interpret, and defend multivariate models, fearing they cannot respond to critique.
Philosophical problem
People decisions guided by gut instinct or borrowed best practices are just plain wrong when rigorous, data-driven understanding is achievable.
The plan
- Learn the statistical and programming foundations needed to model.
- Match the regression method to the type of outcome you are explaining.
- Run the model and interpret its coefficients and fit.
- Check the model's underlying assumptions and pursue parsimony.
- Communicate the inferences clearly to non-statistical stakeholders.
Success
- Confidently selecting and applying the right regression technique to varied people analytics problems.
- Producing clear, defensible, evidence-based inferences that influence organizational decisions.
- Communicating model results effectively to non-statistical audiences.
At stake
- Continuing to rely on gut instinct or borrowed best practices for critical people decisions.
- Running models without understanding them, leading to inaccurate or indefensible inferences.
- Wasting research effort on under-powered or mis-specified analyses.
Chapter by chapter
ch01 · The Importance of Regression in People Analytics
This chapter asserts that regression modeling is a crucial tool for making sound data-driven decisions in people analytics, addressing both theoretical frameworks and practical applications.
- Regression modeling is an essential tool for uncovering relationships within HR data, driving more informed decision-making.
- A structured approach to inferential modeling ensures that conclusions drawn from data are statistically valid and relevant.
- By utilizing regression analysis, HR professionals can transition from anecdotal decision-making to a data-driven framework that enhances strategic initiatives.
- The ability to predict future outcomes based on past data is a critical advantage in the competitive landscape of human resources.
ch02 · The Basics of the R Programming Language
This chapter serves as an introduction to the R programming language, detailing its fundamental concepts, data structures, and functionality while equipping readers with the necessary tools to begin their data analysis journey.
ch03 · Statistics Foundations
This chapter lays the groundwork in the foundational statistical concepts needed for data analysis, from descriptive statistics to hypothesis testing, and shows how to apply them in Python.
- An understanding of mean, variance, and standard deviation is essential for summarizing key aspects of any dataset.
- The t-distribution is fundamental for making inferences about population parameters based on sample data.
- Confidence intervals offer a reliable method for expressing the degree of uncertainty in statistical estimates.
- Hypothesis testing is crucial for validating claims made about data, with multiple tests available for different situations, such as Welch’s t-test and Chi-square tests.
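As a worked illustration of the Welch's t-test mentioned above, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name `welch_t` and the data are made up for illustration. Turning the statistic into a p-value requires the t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean, using sample (n - 1) variances.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up scores for two hypothetical groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Unlike the classic Student's t-test, this version does not assume the two groups share a variance, which is why the degrees of freedom are generally not a whole number.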
ch04 · Linear Regression for Continuous Outcomes
This chapter explores linear regression as a statistical method for predicting continuous outcomes, detailing its applications, assumptions, and methods of enhancement.
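A minimal sketch of the idea, assuming a single input variable and ordinary least squares: the slope and intercept come from the sample covariances, and R-squared measures the share of outcome variance explained. The function name `fit_simple_ols` and the data are hypothetical, and real analyses would normally use a statistics library that also reports standard errors.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares around the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up data: an input variable and a continuous outcome.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r_squared = fit_simple_ols(x, y)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.2f}")
```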
ch05 · Binomial Logistic Regression for Binary Outcomes
This chapter covers binomial logistic regression for binary outcomes, explaining how the model is derived, how to interpret its output, and how to implement it effectively.
ch06p01 · Multinomial Logistic Regression for Nominal Category Outcomes (part 1/2)
This chapter introduces multinomial logistic regression, emphasizing its application for modeling categorical outcomes with three or more levels and providing practical examples for effective implementation.
ch06p02 · Multinomial Logistic Regression for Nominal Category Outcomes (part 2/2)
This chapter continues with multinomial logistic regression, contrasting it with linear regression and showing how it is applied to model nominal categorical outcomes.
- Multinomial logistic regression offers a powerful alternative to linear regression when forecasting nominal categorical outcomes, providing nuanced insights into data classification challenges.
- Proper interpretation of model coefficients is crucial, as they inform how each predictor variable influences the odds of falling into each category.
- Validating model performance against baseline measures is essential to confirming the model's predictive power and enhancing trust in decision-making based on its results.
- Be vigilant about the assumptions underlying multinomial logistic regression to avoid common pitfalls such as collinearity, which can lead to skewed interpretations.
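The point about coefficient interpretation can be made concrete: logistic model coefficients are on the log-odds scale, so exponentiating one gives the odds ratio for a one-unit increase in that input, relative to the reference category in the multinomial case. The sketch below assumes a coefficient and standard error already taken from a fitted model. The numbers are made up, the helper names are hypothetical, and the interval is a simple Wald approximation.

```python
import math


def odds_ratio(coef):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coef)


def odds_ratio_interval(coef, std_error, z=1.96):
    """Approximate 95% Wald interval for the odds ratio (z = 1.96)."""
    return math.exp(coef - z * std_error), math.exp(coef + z * std_error)


# Made-up coefficient and standard error for one input variable.
coef, std_error = 0.25, 0.10
ratio = odds_ratio(coef)
low, high = odds_ratio_interval(coef, std_error)
print(f"odds ratio {ratio:.2f} (approx. 95% CI {low:.2f} to {high:.2f})")
print(f"change in odds per unit increase: {(ratio - 1) * 100:.0f}%")
```

An odds ratio above 1 means higher odds of the category in question versus the reference category. An interval that includes 1 indicates the effect is not distinguishable from zero at that confidence level.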
ch08 · Multinomial Logistic Regression for Nominal Category Outcomes
This chapter covers multinomial logistic regression for outcomes with multiple, non-ordered categories, along with the conditions that must hold for its results to be interpreted accurately.
ch09 · Proportional Odds Logistic Regression for Ordered Category Outcomes
Proportional odds logistic regression is a statistical technique for analyzing ordinal outcomes, showing how input variables influence the odds of falling into a higher ordered category.
ch10 · Modeling Explicit and Latent Hierarchy in Data
This chapter explores the significance of incorporating both explicit and latent hierarchies in data analysis, highlighting how mixed models and structural equation models enhance the accuracy and interpretability of insights derived from complex datasets.
- Recognizing explicit hierarchies in data yields more reliable modeling results compared to treating observations as independent.
- Mixed models facilitate the accommodation of variation at both observation and group levels, providing richer insights.
- Latent variable modeling permits the synthesis of numerous correlated items into comprehensible constructs, improving model interpretability.
- Rigorously validating measurement models through fit criteria is essential for trustworthy outcomes in SEM.
ch11 · Survival Analysis for Modeling Singular Events Over Time
This chapter demonstrates how to apply survival analysis techniques to model job retention and attrition over time, using practical examples and statistical methods in R.
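The chapter works in R. Purely as an illustration of the underlying idea, the sketch below computes a Kaplan-Meier survival curve in plain Python. It assumes each person has a duration (for example, months of tenure observed) and an event flag that is 1 if the departure was observed and 0 if the record is censored. The function name `kaplan_meier` and the data are made up.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed event time.

    events[i] is 1 if the event was observed at durations[i], 0 if censored.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone whose duration reaches t is still at risk just before t.
        at_risk = sum(1 for d in durations if d >= t)
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Made-up tenure data: months observed and whether the person left (1) or was censored (0).
months = [1, 2, 2, 3, 4]
left = [1, 1, 0, 1, 0]
for time, prob in kaplan_meier(months, left):
    print(f"month {time}: estimated retention {prob:.2f}")
```

Censored records still count in the at-risk group until their last observed time, which is what lets survival analysis use people who have not yet left.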