Machine Learning and Data Science

In a sentence

A practical, math-light introduction to applying statistical learning and machine learning methods using the R programming environment across the full data science workflow.

Written for analysts, software developers, and researchers who want to move beyond spreadsheet-level analytics, this book casts machine learning as the scientific method applied to data. Organized to mirror an actual data science project, it walks readers step-by-step from data access and munging, through exploratory data analysis, into supervised methods (regression and a broad battery of classifiers), model performance evaluation, and finally unsupervised learning. Every concept is grounded in runnable R code using base packages and well-known data sets, deliberately omitting heavy mathematics so newcomers can become productive quickly while still understanding the workflow, the pitfalls (overfitting, bias-variance, confounders, data leakage), and how to iterate toward better predictive power.

The model

A causal-process model expressing how design levers (data access quality, data munging, feature engineering, model/algorithm selection, tuning) and contextual conditions (data volume/quality) drive intermediate analytical states (data tidiness, model fit, overfitting, bias-variance balance) which determine the outcome of generalization (predictive accuracy on new data) and ultimately actionable business value.

Data Volume and Quality (contextual condition)

The amount, completeness, and cleanliness of available raw data feeding a machine learning project; the book repeatedly asserts that more data and higher-quality data tend to improve predictive power more than algorithmic cleverness.

Data Access Capability (design lever)

The data scientist's toolset and effectiveness in locating and importing data from disparate sources (CSV, Excel, JSON, HTML, SQL, APIs) into the R environment as the first stage of the data pipeline.

Data Munging Effort (design lever)

The cleaning, transforming, sampling, reshaping, and missing-value handling applied to raw data to produce a tidy form; the book states this can consume up to 80% of project effort and is foundational to downstream success.

Data Tidiness (behavioral pattern)

The state of the data set in which each variable forms a column, each observation a row, names are informative, values are consistent, and missing values are minimized; an intermediate state produced by munging that enables effective modeling.
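
As a rough illustration of what producing a tidy data set can involve, the sketch below (in Python rather than the book's R, standard library only) normalizes column names, converts common missing-value markers to None, and parses numeric text. The function names, the marker set, and the sample records are assumptions made for this example and are not taken from the book.

```python
# Hypothetical sketch: names, markers, and data are illustrative assumptions.
MISSING_MARKERS = {"", "na", "n/a", "null", "?"}


def tidy_value(raw):
    """Return None for missing markers, a float for numeric text, else trimmed text."""
    text = str(raw).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    try:
        return float(text)
    except ValueError:
        return text


def tidy_rows(rows):
    """Give every observation (row) clean, consistently named variables (columns)."""
    tidy = []
    for row in rows:
        tidy.append({
            key.strip().lower().replace(" ", "_"): tidy_value(value)
            for key, value in row.items()
        })
    return tidy


raw = [
    {" Age ": "34", "Job Title": " Analyst ", "Salary": "NA"},
    {" Age ": "?", "Job Title": "Engineer", "Salary": "72000"},
]
clean = tidy_rows(raw)
print(clean)
```

Marking missing values explicitly, rather than leaving text such as "NA" or "?" in place, keeps later summaries from silently treating them as real categories.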

Exploratory Data Analysis Depth (design lever)

The degree to which numeric summaries and visualizations are used to understand data properties, find patterns, suggest modeling strategies, and refine feature selection before applying an algorithm.

Feature Engineering Quality (design lever)

The identification, creation, and selection of the most informative subset of predictor variables (and transformations) for a model, drawing on domain knowledge and EDA; the book argues good features can outweigh algorithm choice.

Model and Algorithm Selection (design lever)

The choice of an appropriate machine learning algorithm (regression, classifier, ensemble, clustering) suited to the problem type and data, recognizing that no single algorithm is best across all data sets.

Hyperparameter and Regularization Tuning (design lever)

The adjustment of algorithm tuning parameters and regularization (e.g., lambda, number of trees, hidden neurons, gamma/cost) to control model complexity and balance fit against generalization.
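
To make the role of a regularization parameter concrete, here is a minimal sketch (Python, standard library only, not the book's R code) of one-predictor ridge regression without an intercept, where the slope is sum(x*y) / (sum(x*x) + lambda). The data and the function name ridge_slope are illustrative assumptions.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate for y = slope * x (no intercept): Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Assumed toy data, roughly y = 2x with small deviations.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

With lambda set to 0 this is ordinary least squares; larger values shrink the slope toward zero, which lowers variance at the cost of some bias. In practice the value is usually chosen by cross validation rather than by hand.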

Model Complexity (psychological state)

The flexibility of the fitted model, such as polynomial degree, tree depth, or number of features; greater complexity reduces bias but increases variance and the risk of fitting noise.

Overfitting (psychological state)

The state in which a model is trained too specifically to a training set's noise, yielding small training error but large test error and reduced ability to generalize to new observations.
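
The gap between training and test error can be seen in a small simulation. The sketch below (Python, standard library only, swapped in for the book's R) uses k-nearest-neighbor regression on an assumed toy process, y = x plus Gaussian noise. The helper names, the seed, and the data are assumptions for illustration.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the y values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def rmse(train, data, k):
    """Root mean squared error of k-NN predictions on data."""
    errors = [(y - knn_predict(train, x, k)) ** 2 for x, y in data]
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed toy process: y = x + Gaussian noise.
train = [(i / 10, i / 10 + rng.gauss(0, 1)) for i in range(50)]
test = [(i / 10 + 0.03, i / 10 + 0.03 + rng.gauss(0, 1)) for i in range(50)]

# Columns: k, training RMSE, test RMSE.
for k in (1, 5, 15):
    print(k, round(rmse(train, train, k), 3), round(rmse(train, test, k), 3))
```

With k = 1 the model reproduces every training point, so its training error is exactly zero while its test error is not: the fit has memorized noise. Larger k gives a smoother, less flexible fit whose training error is a more honest guide to test error.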

Bias-Variance Balance (psychological state)

The degree to which a model simultaneously achieves low bias (not underfitting) and low variance (not overfitting); the central tradeoff that determines reducible error and predictive accuracy.
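
The tradeoff is commonly written as the standard decomposition of expected squared test error at a point x_0. It is given here as a general statistical identity, not as a formula quoted from the book, and the symbols (\hat{f} for the fitted model, \varepsilon for irreducible noise) are notation assumed for this note.

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

The first two terms make up the reducible error; the last term is a floor that no model can go below.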

Data Leakage (contextual condition)

The contaminating condition in which information that would not be available at prediction time (or the target itself) enters the training data, producing over-optimistic performance estimates and useless real-world models.
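
One common route to leakage is fitting a preprocessing step on all the data before splitting. The sketch below (Python, standard library only, in place of the book's R) contrasts scaling parameters learned from training data alone with parameters contaminated by a test value. The function names and numbers are assumptions for illustration.

```python
import statistics


def fit_scaler(values):
    """Learn centering and scaling parameters from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, params):
    """Standardize values with previously learned parameters."""
    mean, sd = params
    return [(v - mean) / sd for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an unusual future observation

# Leaky: the test value shapes the parameters used to prepare training data.
leaky_params = fit_scaler(train + test)

# Safer: parameters come from the training split alone, then are reused.
clean_params = fit_scaler(train)
test_scaled = apply_scaler(test, clean_params)

print(leaky_params, clean_params, test_scaled)
```

The same rule applies to any step that learns from data, such as imputation or feature selection: fit it on the training split, then apply it unchanged to held-out data.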

Confounding (contextual condition)

The presence of a variable correlated with both the response and the predictors, which distorts estimated relationships and reduces the internal validity of conclusions about causal effects.
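
A short simulation shows how a confounder can manufacture an apparent effect. In this sketch (Python, standard library only, not the book's R code) a variable z drives both the predictor x and the response y, while x has no direct effect on y. The data-generating process, the seed, and the helper names are assumptions for illustration.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs (with an intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def residuals(xs, ys):
    """What is left of ys after removing its linear dependence on xs."""
    b = slope(xs, ys)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [y - my - b * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(42)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]        # confounder
x = [zi + rng.gauss(0, 1) for zi in z]         # predictor driven by z
y = [2 * zi + rng.gauss(0, 1) for zi in z]     # response driven by z only

naive = slope(x, y)                                  # ignores z
adjusted = slope(residuals(z, x), residuals(z, y))   # holds z fixed
print(round(naive, 2), round(adjusted, 2))
```

Under these settings the naive slope is clearly nonzero (about 1 in expectation), while the slope after removing z from both variables is close to zero. Adjustment of this kind only works for confounders that were measured.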

Cross Validation Use (design lever)

The use of resampling methods (random subsampling, K-fold, leave-one-out) to estimate test error from training data and to objectively select and validate models without relying solely on training error.
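
The mechanics of K-fold cross validation can be sketched in a few lines. The example below (Python, standard library only, standing in for the book's R) splits indices into k folds and estimates test error for a deliberately trivial model that predicts the training mean. The function names and data are assumptions for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cv_mse_of_mean(ys, k):
    """K-fold estimate of test MSE for a model that predicts the training mean."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)
        mse = sum((ys[j] - prediction) ** 2 for j in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


ys = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 7.0]
print(cv_mse_of_mean(ys, k=5))
```

Setting k equal to the number of observations turns the same code into leave-one-out cross validation.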

Generalization Performance (outcome metric)

The model's predictive accuracy on new, independent data (out-of-sample / test error), measured by metrics such as RMSE, R-squared, or misclassification rate; the outcome the data scientist most cares about.
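
For reference, the three metrics named above can be computed as follows. This is a minimal sketch in Python (standard library only) rather than the book's R; the function names and the sample values are assumptions for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """1 - RSS/TSS; on test data this can be negative for a poor model."""
    mean = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean) ** 2 for a in actual)
    return 1 - rss / tss


def misclassification_rate(actual, predicted):
    """Share of observations whose predicted class is wrong."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


y_test = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 5.0]
print(rmse(y_test, y_hat), r_squared(y_test, y_hat))
print(misclassification_rate(["stay", "leave", "stay", "leave"],
                             ["stay", "stay", "stay", "leave"]))
```

All three should be computed on observations the model did not see during training; the same formulas applied to training data measure fit, not generalization.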

Actionable Business Value (outcome metric)

The ultimate organizational benefit realized when interpretable, accurate, and well-communicated model results drive decisions or are deployed into production systems with bottom-line impact.

How they connect

data volume and quality → predicts generalization performance
data access capability → influences data munging effort
data munging effort → predicts data tidiness
data tidiness → mediates generalization performance
exploratory data analysis → influences feature engineering quality
feature engineering quality → predicts generalization performance
feature engineering quality − influences overfitting
model algorithm selection → influences generalization performance
model complexity → predicts overfitting
model complexity → influences bias variance balance
hyperparameter tuning − moderates model complexity
overfitting − predicts generalization performance
bias variance balance → predicts generalization performance
cross validation use − moderates overfitting
cross validation use → predicts generalization performance
data leakage − moderates generalization performance
confounding − moderates generalization performance
generalization performance → predicts actionable business value

Generalization Performance

test RMSE; test R²; misclassification rate; accuracy/sensitivity/specificity

self-report suitability: none

Actionable Business Value

deployment status; report adoption rate; KPI changes (revenue, churn, cost)

self-report suitability: medium

The story

The reader

An analyst, software developer, or researcher who wants to expand beyond basic tools and become productive doing machine learning and data science.

External problem

They need to access, clean, model, and predict from real data but lack a practical end-to-end methodology and a usable toolset.

Internal problem

They feel intimidated by the mathematics and overwhelmed by the hype and jargon surrounding data science.

Philosophical problem

Powerful predictive insight should not be gatekept behind advanced mathematics; anyone willing to learn the workflow deserves to participate.

The plan

Understand the problem and access an appropriate data set into R.
Munge the data into a tidy, reproducible form.
Explore the data with numeric summaries and plots.
Engineer features and select a supervised or unsupervised model.
Train, validate, and evaluate the model with cross validation and proper error metrics.
Communicate results and deploy or iterate.

Success

The reader can confidently take a raw business problem, build and evaluate a predictive model in R, and communicate actionable results.
They have a reusable data pipeline and a growing personal library of machine learning techniques.
They can recognize and avoid overfitting, confounders, and data leakage.

At stake

The reader stalls in dirty data and never reaches usable predictions.
They build models that look accurate but fail to generalize on new data.
They draw misleading conclusions from confounded or leaked data and lose credibility.

Questions this book answers

What is machine learning and how does it enable data science?
What is the end-to-end process of a data science / machine learning project?
How do you access, clean, and explore data in R before modeling?
Which supervised algorithms predict quantitative vs. qualitative outcomes, and how do you choose among them?
How do you measure and improve a model's predictive performance and avoid overfitting?

Glossary

Data Volume and Quality: The quantity, completeness, and cleanliness of available data for a machine learning project.
Data Access Capability: The breadth and effectiveness of a data scientist's ability to ingest data from diverse sources into R.
Data Munging Effort: The cleansing and transformation work applied to raw data to prepare it for modeling.
Data Tidiness: The structural cleanliness of a data set ready for analysis.
Exploratory Data Analysis Depth: The thoroughness of statistical summarization and visualization used to understand data before modeling.
Feature Engineering Quality: The effectiveness of selecting and creating informative predictor variables for a model.
Model and Algorithm Selection: The choice of a learning algorithm appropriate to the problem and data.
Hyperparameter and Regularization Tuning: The adjustment of tuning parameters and regularization strength to control model behavior.
