Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R

Paul Roback, Julie Legler

In a sentence

An applied textbook that teaches statisticians and data analysts how to move beyond standard linear regression to effectively model non-normal and correlated data using Generalized Linear Models and Multilevel Models in R.

For students and analysts who have mastered multiple linear regression, this book serves as the essential next step for tackling the complexities of real-world data. It provides an accessible, case-study-driven guide to statistical modeling when the core assumptions of linear regression don't hold. Through intuitive explanations and practical R code, readers will learn to model count data with Poisson regression, binary outcomes with logistic regression, and handle correlated data structures like repeated measures or nested groups with multilevel models. By grounding advanced topics like likelihood theory, overdispersion, and random effects in tangible examples, the book empowers readers to expand their analytical toolkit and conduct more appropriate, robust, and insightful analyses.

Building and Evaluating Advanced Regression Models

Purpose: To build, assess, and interpret a statistical model that explains the relationship between a response variable and predictors, specifically for data that do not meet the assumptions of multiple linear regression (e.g., non-normal responses, correlated observations).

When to use: When a researcher needs to model a response variable as a function of one or more explanatory variables and the data violate the assumptions of linearity, normality, equal variance, or independence.

Step 1: Frame the analysis and organize the data.
Entry: A raw dataset and a set of research questions are available.
Exit: The data is clean, organized, and ready for exploratory analysis.
In: Raw dataset, Research questions · Out: A tidy dataset ready for modeling
Step 2: Perform Exploratory Data Analysis (EDA).
Entry: A tidy dataset is available.
Exit: Initial insights into data structure, relationships, and potential model forms are gained.
In: Tidy dataset · Out: EDA plots and summary statistics, Hypotheses about variable relationships and model structure
Step 3: Specify and fit an initial, simple model.
Entry: EDA is complete and a modeling strategy is forming.
Exit: A baseline model is fitted and its parameters are estimated.
In: Tidy dataset, Hypothesized model structure · Out: Fitted baseline model
Step 4: Iteratively build and compare more complex models.
Entry: A baseline model has been fit.
Exit: A set of candidate models has been developed and compared.
- For nested models, is the drop-in-deviance (LRT) or F-test significant?
- For non-nested models, which model has a lower AIC or BIC?
In: Fitted baseline model, Additional candidate predictors and terms · Out: A set of fitted candidate models with comparison statistics
Step 5: Assess model fit and diagnose issues.
Entry: One or more candidate models have been fitted.
Exit: Potential model deficiencies like overdispersion, excess zeros, or unmodeled correlation are identified.
- Is the residual deviance significantly larger than its degrees of freedom, suggesting overdispersion or lack-of-fit?
- Does the proportion of zero counts exceed what is expected by the model?
- Does the study design imply correlated data that is not yet accounted for?
In: Fitted candidate model · Out: Residual plots, Goodness-of-fit statistics, Diagnosis of model issues
Step 6: Address diagnosed model issues.
Entry: A model issue has been diagnosed in the previous step.
Exit: A revised model that addresses the diagnosed issue has been fitted.
In: Fitted model with diagnosed issues · Out: A revised, better-fitting model
Step 7: Select and interpret the final model.
Entry: A well-fitting, robust model has been developed.
Exit: Actionable conclusions are drawn from the model, supported by statistical evidence.
In: Final fitted model · Out: Interpretations of model coefficients, Confidence intervals and p-values, Conclusions addressing research questions
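
The steps above can be sketched in R roughly as follows. This is a minimal illustration on simulated count data, not code taken from the book: the variables `x1`, `x2`, and `y` are hypothetical, and the quasi-Poisson refit is only one of several possible responses to overdispersion.

```r
# Simulated count data; every variable name here is hypothetical.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                                 # has no true effect on y
y  <- rpois(n, lambda = exp(0.5 + 0.6 * x1))

# Step 3: fit a simple baseline Poisson model.
fit0 <- glm(y ~ x1, family = poisson)

# Step 4: add a candidate predictor and compare the nested models.
fit1        <- glm(y ~ x1 + x2, family = poisson)
drop_in_dev <- anova(fit0, fit1, test = "Chisq")  # drop-in-deviance (LRT)
aic_table   <- AIC(fit0, fit1)                    # lower AIC is preferred

# Step 5: rough overdispersion check (Pearson chi-square / residual df).
dispersion <- sum(residuals(fit0, type = "pearson")^2) / df.residual(fit0)

# Step 6: if dispersion were well above 1, one option is a quasi-Poisson fit,
# which keeps the coefficients but inflates the standard errors.
fit0_quasi <- glm(y ~ x1, family = quasipoisson)

# Step 7: coefficients as rate ratios with approximate 95% Wald intervals.
est <- coef(summary(fit0))
rate_ratios <- exp(cbind(
  estimate = est[, "Estimate"],
  lower    = est[, "Estimate"] - qnorm(0.975) * est[, "Std. Error"],
  upper    = est[, "Estimate"] + qnorm(0.975) * est[, "Std. Error"]
))

print(drop_in_dev)
print(aic_table)
print(dispersion)
print(rate_ratios)
```

Because the data here are simulated from a Poisson model, the dispersion estimate should land near 1; with real data, that check is what decides whether Step 6 is needed.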

Parametric Bootstrap Testing for Model Comparison

Purpose: To obtain an accurate p-value for comparing two nested multilevel models, especially when testing variance components where standard likelihood ratio test theory does not apply.

When to use: When a standard likelihood ratio test using a chi-square approximation is known to be unreliable, such as when testing if a variance component is equal to zero (a test on the boundary of the parameter space).

Step 1: Fit the reduced (null) model to the original data.
Entry: A full model and a nested reduced model have been specified.
Exit: Parameter estimates for the reduced model are obtained.
In: Original dataset, Reduced model specification · Out: Fitted reduced model with parameter estimates
Step 2: Simulate a new set of response data from the fitted reduced model.
Entry: The reduced model has been fitted.
Exit: A single simulated dataset is generated.
In: Fitted reduced model · Out: Simulated response vector
Step 3: Fit both the reduced and full models to the simulated data.
Entry: A simulated dataset is available.
Exit: Both models are fitted to the simulated data.
In: Simulated dataset, Reduced and full model specifications · Out: Two fitted models based on simulated data
Step 4: Calculate and store the likelihood ratio test (LRT) statistic.
Entry: Both models have been fitted to the simulated data.
Exit: One simulated LRT statistic is calculated and stored.
In: Log-likelihood values from the two fitted models · Out: A single LRT statistic
Step 5: Repeat the simulation process many times.
Entry: The simulation and testing procedure for a single iteration is established.
Exit: A distribution of LRT statistics under the null hypothesis is generated.
In: Number of bootstrap repetitions · Out: A vector of simulated LRT statistics
Step 6: Calculate the final p-value.
Entry: The empirical null distribution of the LRT statistic is available, and the LRT statistic from the original data has been calculated.
Exit: A robust p-value for the model comparison is obtained.
In: Observed LRT statistic (from original data), Distribution of simulated LRT statistics · Out: Parametric bootstrap p-value
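
A minimal sketch of this procedure in R, assuming the `lme4` package is installed. The data are simulated and all names (`d`, `g`, `x`, `y`) are hypothetical; the comparison tests whether a random slope is needed, and the 200 repetitions are kept small only so the example runs quickly.

```r
library(lme4)  # assumed to be installed

set.seed(42)
# Hypothetical data: 20 groups of 10 observations with a random intercept only.
g            <- factor(rep(1:20, each = 10))
group_effect <- rnorm(20, sd = 0.5)
x            <- rnorm(200)
y            <- 1 + 0.5 * x + group_effect[as.integer(g)] + rnorm(200)
d            <- data.frame(y, x, g)

# Step 1: fit the reduced model (and the full model) by maximum likelihood.
# The full model adds a random slope; lme4 may report a singular fit here,
# which is expected when the tested variance is at or near zero.
reduced <- lmer(y ~ x + (1 | g),     data = d, REML = FALSE)
full    <- lmer(y ~ x + (1 + x | g), data = d, REML = FALSE)

lrt <- function(m0, m1) as.numeric(2 * (logLik(m1) - logLik(m0)))
observed <- lrt(reduced, full)

# Step 2: simulate responses from the fitted reduced model (one column each).
n_boot <- 200
sim_y  <- simulate(reduced, nsim = n_boot)

# Steps 3 to 5: refit both models to each simulated response, store the LRT.
boot_lrt <- vapply(seq_len(n_boot), function(i) {
  m0 <- refit(reduced, newresp = sim_y[[i]])
  m1 <- refit(full,    newresp = sim_y[[i]])
  lrt(m0, m1)
}, numeric(1))

# Step 6: share of simulated statistics at least as large as the observed one.
p_boot <- mean(boot_lrt >= observed)
print(c(observed_lrt = observed, bootstrap_p = p_boot))
```

In practice the number of repetitions would usually be larger (often 1,000 or more), and some analysts prefer the slightly conservative estimate `(1 + sum(boot_lrt >= observed)) / (n_boot + 1)`.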

A candidate measure

Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R (derived measurement candidates)

Data-Model Mismatch Conditions

- Histogram of the response variable shows significant skew or discreteness.
- Table of the response variable shows only two unique values.
- Within groups, the mean of the response variable is not equal to its variance.
- Intraclass Correlation Coefficient (ICC) is substantially greater than zero.
- Spaghetti plots show consistent patterns within subjects that differ between subjects.

self-report suitability: none
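
Several of these indicators can be checked with base R alone. The sketch below uses simulated repeated counts; the names `subject`, `cond`, and `count` are hypothetical, and the ICC is computed with the one-way ANOVA estimator, which is one of several possible estimators.

```r
set.seed(7)
# Hypothetical data: 30 subjects, 8 repeated counts each, two conditions.
k            <- 8
subject      <- factor(rep(1:30, each = k))
cond         <- factor(rep(c("a", "b"), each = 15 * k))
subject_rate <- rgamma(30, shape = 2, rate = 0.5)  # subject-specific mean
count        <- rpois(30 * k, lambda = subject_rate[as.integer(subject)])

# Discreteness and skew: tabulate (or plot a histogram of) the response.
print(table(count))

# Mean versus variance within each condition; a Poisson response would
# have the two roughly equal.
by_cond <- cbind(
  mean     = sapply(split(count, cond), mean),
  variance = sapply(split(count, cond), var)
)
print(by_cond)

# ICC from a one-way ANOVA on subject: (MSB - MSW) / (MSB + (k - 1) * MSW).
aov_tab <- anova(lm(count ~ subject))
msb <- aov_tab["subject",   "Mean Sq"]
msw <- aov_tab["Residuals", "Mean Sq"]
icc <- (msb - msw) / (msb + (k - 1) * msw)
print(icc)
```

An ICC well above zero, as expected for data built this way, signals that observations from the same subject are correlated and that a model assuming independence is a poor match.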

Appropriate Model Selection

- R code contains `glm()` with `family = poisson` or `family = binomial`.
- R code contains `lmer()` or `glmer()`.
- Model specification includes terms like `(1 | group_id)`.

self-report suitability: low

Model-Assumption Alignment

- Residual deviance divided by its degrees of freedom (an estimate of the dispersion parameter) is close to 1.
- Plot of deviance residuals vs. fitted values shows random scatter around zero.
- Goodness-of-fit test has a non-significant p-value.
- AIC/BIC values are lower than those of simpler, less appropriate models.
- Likelihood Ratio Test (drop-in-deviance test) shows significant improvement over a nested simpler model.

self-report suitability: none
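
The first three indicators can be computed directly from a fitted `glm` object. The sketch below deliberately fits a Poisson model to overdispersed counts so that the checks flag a problem; the data are simulated and the names `x`, `y`, and `fit` are hypothetical.

```r
set.seed(3)
# Hypothetical data: negative binomial counts, which are overdispersed
# relative to the Poisson model fitted below.
x <- runif(300)
y <- rnbinom(300, mu = exp(1 + x), size = 1.5)
fit <- glm(y ~ x, family = poisson)

# Residual deviance per degree of freedom; values near 1 suggest alignment.
dispersion_ratio <- deviance(fit) / df.residual(fit)

# Deviance goodness-of-fit test; a small p-value indicates lack of fit.
gof_p <- pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)

# Deviance residuals against fitted values; look for random scatter around 0.
plot(fitted(fit), residuals(fit, type = "deviance"),
     xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

print(c(dispersion_ratio = dispersion_ratio, gof_p_value = gof_p))
```

Here the ratio is well above 1 and the goodness-of-fit p-value is small, which is the pattern that would prompt a switch to a quasi-Poisson or negative binomial model.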

Inferential Validity

- (In simulation) Bias of parameter estimates (difference between estimate and true value) is near zero.
- (In simulation) Coverage probability of 95% confidence intervals is near 95%.
- Stability of coefficient estimates and standard errors across different, but theoretically justified, model specifications.
- Comparison of model-derived standard errors with bootstrapped standard errors.

self-report suitability: none
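
The two simulation-based indicators can be estimated with a short Monte Carlo study. This is a sketch under stated assumptions: a correctly specified Poisson model, a hypothetical true slope `true_beta`, Wald intervals, and 500 replications chosen for speed rather than precision.

```r
set.seed(11)
true_beta <- 0.4   # hypothetical true slope used to generate the data
n_sims    <- 500

results <- t(replicate(n_sims, {
  x   <- rnorm(100)
  y   <- rpois(100, lambda = exp(0.2 + true_beta * x))
  est <- coef(summary(glm(y ~ x, family = poisson)))["x", ]
  lower <- est["Estimate"] - qnorm(0.975) * est["Std. Error"]
  upper <- est["Estimate"] + qnorm(0.975) * est["Std. Error"]
  c(estimate = unname(est["Estimate"]),
    covered  = unname(lower <= true_beta && true_beta <= upper))
}))

bias     <- mean(results[, "estimate"]) - true_beta  # should be near 0
coverage <- mean(results[, "covered"])               # should be near 0.95
print(c(bias = bias, coverage = coverage))
```

Repeating the study with a misspecified model (for example, ignoring overdispersion or clustering in the data-generating step) shows how coverage can fall well below the nominal 95%.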

The story

The reader

A data analyst, researcher, or student who understands basic multiple linear regression but consistently encounters real-world data that violates its strict assumptions, such as count data, binary outcomes, or nested/longitudinal structures.

External problem

Standard statistical tools like multiple linear regression are inappropriate for their data, leading to invalid results, poor model fit, and an inability to answer key research questions.

Internal problem

They feel limited, frustrated, and uncertain about their analytical capabilities, worrying that their conclusions are flawed or that they are missing crucial insights hidden within complex data.

Philosophical problem

It's simply wrong to force data into a model that doesn't fit its fundamental nature; analysts should have the tools to model reality as it is, not as simplified theory dictates.

The plan

Revisit the foundations and limitations of multiple linear regression.
Grasp the unifying principle of likelihood for estimation and model comparison.
Master Generalized Linear Models (GLMs) to handle non-normal outcomes like counts and binary data.
Learn to identify and understand the implications of correlated data structures.
Master Multilevel Models (MLMs) to properly analyze nested and longitudinal data.
Combine GLMs and MLMs to tackle complex data that is both non-normal and correlated.

Success

The reader becomes a confident and versatile analyst, capable of choosing and applying the right model for a wide variety of complex data.
They can produce more accurate, valid, and nuanced insights from their analyses.
Their expanded toolkit opens up new research questions and career opportunities.

At stake

They continue to misapply linear regression to data that violates its assumptions, leading to incorrect inferences and discredited work.
They are forced to ignore or oversimplify complex datasets, missing out on important discoveries.
Their analytical skills stagnate, leaving them unable to tackle the challenges of modern data analysis.

Chapter by chapter

ch01: Review of Multiple Linear Regression
This chapter introduces the principles of multiple linear regression, emphasizing the importance of understanding assumptions, identifying violations, and exploring applications in various research contexts.
ch02: Beyond Least Squares: Using Likelihoods
This chapter asserts that likelihood methods offer a more flexible and context-aware approach to statistical modeling and estimation than traditional least squares methods, particularly when analyzing complex datasets.
ch03: Distribution Theory
This chapter introduces the statistical underpinnings of various probability distributions, emphasizing their relevance in modeling discrete and continuous random variables.
ch04p01: Poisson Regression (part 1/3)
This chapter introduces Poisson regression, outlining its applications in modeling count data while establishing a foundation of its assumptions and methodology.
ch04p02: Poisson Regression (part 2/3)
This part covers logistic regression, examining its structure, assumptions, and applications through case studies, and showing how it differs from linear regression and when to use it for binary outcomes.
ch04p03: Poisson Regression (part 3/3)
This part examines the modeling of correlated binary data with beta-binomial distributions, emphasizing their application in ecological studies of deformed offspring born to different dams.
ch05: Introduction to Multilevel Models
This chapter introduces the concept of multilevel models, elucidating their importance for analyzing data collected at different nested levels, particularly in the context of performance anxiety among musicians.
- Multilevel models offer a rigorous approach to analyzing nested data structures, providing better insights into individual-level variations.
- Proper identification of response variables and covariates at multiple levels enhances the accuracy of statistical interpretations.
- Ignoring the dependence between observations from the same subject can lead to significantly biased conclusions.
- Exploratory data analysis remains a critical step in understanding relationships within complex datasets.
ch06p01: Two-Level Longitudinal Data (part 1/2)
This chapter explores the complexities of analyzing longitudinal data through multilevel models, emphasizing how to manage missing data and compare school performance using the case study of charter versus public non-charter schools.
- Longitudinal data embodies a unique structure that requires careful analytical strategies to yield meaningful insights.
- Attention to missing data is crucial, as improper handling can significantly distort findings and lead to misinterpretations.
- Charter schools present a case study of performance variance, necessitating nuanced statistical approaches to understand their efficacy compared to public non-charter schools.
- Multilevel modeling can reveal trends overlooked in traditional analyses, especially regarding within- and between-school covariance structures.
ch06p02: Two-Level Longitudinal Data (part 2/2)
This chapter delves into the intricacies of modeling two-level longitudinal data, emphasizing multilevel models that account for both individual-level and group-level variability in contexts such as healthcare and education.
ch07: Initial Exploratory Analyses
This chapter presents an initial examination of foul calls in college basketball games, highlighting the relationships between various game conditions and the likelihood of fouls being called on home or visiting teams.
- The analysis of 4,972 fouls reveals a significant tendency for fouls to skew towards home teams under specific circumstances.
- Histograms demonstrated that both score and foul differentials are typically centered around home team advantages.
- Home teams are significantly more likely to accumulate fouls when they have the lead, suggesting a possible bias in officiating.
- Categorical analysis indicated that foul types and previous fouls strongly influence the likelihood of subsequent fouls.
ch08: Two-Level Modeling with a Generalized Response
This chapter explores multilevel modeling to analyze referee calls in basketball games, asserting the necessity of properly accounting for hierarchical data structures to reach accurate conclusions about bias in foul calls.
- Accurate analysis of referee behavior requires attention to the multilevel nature of game data; treating fouls as independent may mislead conclusions.
- Logistic regression models reveal significant trends, yet the complexity lies in accounting for game-specific variability.
- A unified multilevel framework that estimates parameters simultaneously offers a more robust understanding of referee decisions and biases.
- The statistical significance of foul distribution needs to be interpreted in the context of the game's dynamics and referee tendencies.
ch09: Crossed Random Effects
This chapter explores how crossed random effects in statistical modeling provide a more nuanced understanding of referee bias in college basketball by accounting for team and game variability.
ch10: Parametric Bootstrap for Model Comparisons
This chapter explores the application of parametric bootstrap methods for comparing models in multilevel analysis, particularly in assessing the significance of random effects for home and visiting teams in sports data.
- The inclusion of random effects for home and visiting teams significantly enhances the predictive performance of multilevel models (p=.0003).
- Parametric bootstrap methods provide a more reliable approach for hypothesis testing, particularly when traditional methods encounter boundary constraints.
- The analysis reveals no significant justification for adding complexity through additional random slope effects based on observed data variability.
- Relying solely on likelihood ratio tests may lead analysts to underappreciate the limitations of chi-square approximations in certain contexts.
ch11: A Final Model for Examining Referee Bias
This chapter presents a comprehensive model that quantifies and analyzes referee bias in college basketball, particularly through the lens of foul differentials and their interactions with game-specific factors.
ch12: Estimated Random Effects
This chapter explores the application of random effects in statistical models to account for variability among teams and games, revealing insights about home foul tendencies in basketball.
- Random effects allow for a nuanced analysis that reflects the unique circumstances of each game, rather than relying on generalized averages.
- Understanding the distribution of estimated random effects enables analysts to identify which teams have atypical foul tendencies.
- Models that incorporate variability through random effects lead to more accurate and strategic insights in sports analytics.
- Traditional fixed effects models may provide clarity but fail to capture the complex realities of team performance dynamics.
ch13: Notes on Using R (optional)
This chapter provides in-depth guidance on fitting multilevel generalized linear models using R, focusing on the extraction and visualization of both fixed and random effects through specific coding examples and statistical principles.
- Fitting multilevel models in R requires a nuanced understanding of how to specify error terms and their dependencies correctly.
- Extracting fixed and random effects using `fixef()` and `ranef()` is crucial for accurate model interpretation and a deeper insight into the data.
- Assumptions made during model fitting, such as causal relationships and error independence, can dramatically affect analysis outcomes.
- Visualizing random effects enhances understanding and communication of analysis results, particularly in complex datasets.
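
As an illustration of the extraction step mentioned above, the sketch below fits a random-intercept logistic model with `glmer()` and pulls out the fixed and random effects. It assumes the `lme4` package is installed; the simulated data and the names `game`, `x`, `y`, and `d` are hypothetical and are not the book's basketball dataset.

```r
library(lme4)  # assumed to be installed

set.seed(5)
# Hypothetical data: 40 games with 25 binary outcomes each.
game        <- factor(rep(1:40, each = 25))
game_effect <- rnorm(40, sd = 0.8)
x           <- rnorm(1000)
p           <- plogis(-0.3 + 0.5 * x + game_effect[as.integer(game)])
y           <- rbinom(1000, size = 1, prob = p)
d           <- data.frame(y, x, game)

# Two-level logistic model with a random intercept for each game.
fit <- glmer(y ~ x + (1 | game), data = d, family = binomial)

fixed <- fixef(fit)        # named vector of fixed-effect estimates (logit scale)
rand  <- ranef(fit)$game   # data frame: one estimated intercept shift per game

print(fixed)
print(head(rand))

# One simple way to visualize the estimated random intercepts.
hist(rand[["(Intercept)"]],
     main = "Estimated game-level random intercepts",
     xlab = "Deviation from overall intercept (logit scale)")
```

Games with estimated intercepts far from zero are the ones whose baseline log-odds differ most from the overall average.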
ch14: Exercises
This chapter provides a series of exercises aimed at deepening the reader's understanding of multilevel generalized linear models, particularly in the context of data analysis and interpretation.
- The importance of conceptual clarity in modeling begins with defining relevant datasets and research questions.
- Visual representations of data, like conditional density and empirical logit plots, are crucial for drawing insights.
- Relying solely on logistic regression for complex datasets can lead to oversights that multilevel models are designed to address.
- Distinguishing between crossed and nested random effects is critical in accurately modeling data.

Questions this book answers

How can one model response variables that are not normally distributed, such as counts (e.g., number of crimes) or binary outcomes (e.g., pass/fail)?
What are the limitations of ordinary least squares regression and when is it necessary to use alternative methods?
What is the theory and application of Generalized Linear Models (GLMs) like Poisson and logistic regression?
How can statistical models account for correlated data, where observations are not independent (e.g., repeated measurements on the same subject, or students nested within schools)?
What are multilevel (or hierarchical) models and how are they used to analyze longitudinal and nested data?

Glossary

Data-Model Mismatch Conditions: The presence of characteristics in a dataset, specifically the distributional form of the response variable or the dependence structure among observations, that are inconsistent with the assumptions of a default or naively chosen statistical model, typically standard Multiple Linear Regression.
Appropriate Model Selection: The analyst's action of choosing a statistical modeling framework (e.g., GLM, Multilevel Model) and specifying its components (e.g., link function, random effects) to match the known characteristics of the data, thereby addressing potential data-model mismatch conditions.
Model-Assumption Alignment: The extent to which the statistical assumptions of the fitted model are satisfied by the data. This includes assumptions about the distribution of residuals, the relationship between mean and variance, and the correctness of the specified model structure.
Inferential Validity: The degree to which the model's outputs—parameter estimates, standard errors, confidence intervals, and p-values—accurately reflect the true underlying relationships and their uncertainty. High validity means conclusions about significance and effect size are correct.
