What is PeopleAnalyst?

PeopleAnalyst is the front door for people-analytics research: 205+ works indexed and profiled, 40+ citation-grade findings extracted, and peer-reviewed behavioral science translated from academic to actionable — the missing manual for the people analytics you always meant to do.

What is people analytics?

People analytics is not a dashboard. It is behavioral science and statistical inference applied to workforce decisions — a discipline with its own methodology, spanning measurement, organizational design, talent, leadership, and analytics craft.

Why does AI in HR need measurement science?

AI is being deployed in high-stakes people decisions — hiring, performance, attrition — without the measurement science to evaluate whether it works or whom it harms. Construct validity, effect sizes, and criterion validity are the vocabulary for asking an AI vendor the right questions.

How is the research made accessible?

The evidence is indexed and searchable: 205+ works, 40+ citation-grade insight cards, and 8 research arcs, so the right finding reaches the right decision at the right time.

What separates good people measurement from assertion?

Good measurement has a method: construct validity, reliability, and effect-size interpretation are not optional — they are what separates evidence from assertion.


Data Mining for Business Analytics: Concepts, Techniques, and Applications

Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr. · 2017

In a sentence

A practical, hands-on guide that teaches the core concepts, techniques, and applications of data mining for business analytics using R, organized around a disciplined predictive-modeling process.

This book demystifies data mining for business students and practitioners by treating it as a structured, repeatable process rather than a black box of algorithms. Beginning with how to define a business problem and prepare data, it walks readers through the full predictive-analytics workflow: data exploration and visualization, dimension reduction, performance evaluation, and a comprehensive toolkit of supervised methods (multiple linear and logistic regression, k-nearest neighbors, naive Bayes, classification and regression trees, neural nets, discriminant analysis, ensembles) and unsupervised methods (association rules, collaborative filtering, cluster analysis). It also covers time-series forecasting and emerging data-analytics domains such as social network analytics and text mining. Throughout, the authors emphasize the central data-mining danger, overfitting, and the disciplined use of data partitioning (training/validation/test sets) and honest performance metrics to ensure that models generalize to new records. Rich with real business cases and R code, it equips the reader to build, evaluate, select, and deploy models that actually inform decisions.

The four lenses

Science
Statistics
Systems
Strategy

The model

A framework-and-path model expressing how design choices in the data-mining process (problem definition, data preparation, dimension reduction, partitioning, method selection) influence intermediate analytic states (model complexity, generalization) and ultimately predictive performance and business value, with overfitting and class/cost conditions acting as key moderators.

Quality of Problem/Purpose Definition · design lever

The degree to which the business purpose, intended use of results, affected stakeholders, and decision context are clearly understood and specified before modeling begins; the book stresses this as the most error-prone step.

Data Preparation and Cleaning Quality · design lever

The thoroughness and correctness of obtaining, exploring, cleaning, handling missing values and outliers, encoding categorical variables, and normalizing/rescaling data prior to modeling, which conditions all downstream analysis.
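
As a rough illustration of two of the preparation steps named above, the following sketch applies median imputation to missing values and then z-score rescaling, using only the Python standard library. The column name `tenure_years` and its values are hypothetical, and the book itself works in R, so this is an illustration under those assumptions rather than the authors' code.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def zscore(values):
    """Rescale to mean 0 and standard deviation 1 (population SD)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical column with one missing value.
tenure_years = [1.0, 3.0, None, 5.0, 7.0]

filled = impute_median(tenure_years)   # None becomes the median, 4.0
scaled = zscore(filled)                # centered on 0, unit spread

print(filled)
print(scaled)
```

In practice the fill value, mean, and standard deviation would typically be computed on the training partition only and then reused on the other partitions, so that holdout data does not leak into model building.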

Dimension Reduction Effort · design lever

The extent to which the number of variables is reduced and consolidated via domain knowledge, data summaries, correlation analysis, category reduction, principal components, or trees to combat the curse of dimensionality.
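
One of the listed techniques, correlation analysis, can be sketched as a scan for predictor pairs that carry nearly the same information. The column names, the data, and the 0.9 cutoff below are all hypothetical choices made for illustration; the source does not prescribe a threshold.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.9):
    """Return predictor pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical predictors: the two height columns measure the same thing.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 52, 29],
}

pairs = redundant_pairs(columns)
print(pairs)
```

A flagged pair is a candidate for dropping or combining one of the two variables; whether to do so remains a judgment call informed by domain knowledge.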

Use of Data Partitioning and Cross-Validation · design lever

The practice of splitting data into training, validation, and (often) test sets, or using cross-validation, so models are fit on one set and evaluated on unseen data to obtain honest performance estimates.
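
A minimal sketch of a random three-way partition, assuming a 60/20/20 split and a fixed seed for reproducibility. Both the proportions and the function name `partition` are assumptions made for this example, not values taken from the book.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=1):
    """Shuffle a copy of the records and split into train/validation/test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * train)
    n_valid = round(n * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],   # remainder becomes the test set
    )

records = list(range(100))   # stand-in for 100 data rows
train_set, valid_set, test_set = partition(records)

print(len(train_set), len(valid_set), len(test_set))
```

Every record lands in exactly one partition, which is what allows the validation and test sets to act as unseen data.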

Appropriateness of Method Selection · design lever

How well the chosen modeling technique (e.g., regression, k-NN, naive Bayes, trees, neural nets, discriminant analysis, ensembles, clustering) matches the data structure, goal, and class separation; the book notes methods coexist because each has different strengths.

Model Complexity · psychological state

The richness/flexibility of the fitted model (number of predictors, tree depth, hidden nodes, interaction terms), which mediates between design choices and generalization; too high invites overfitting, too low underfits.

Overfitting (Failure to Generalize) · behavioral pattern

The condition in which a model fits noise/idiosyncrasies of the training data, yielding low training error but poor performance on new records; identified as the central hazard of data mining.
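
The pattern of low training error and poor performance on new records can be shown on a toy data set. The sketch below compares a 1-nearest-neighbor classifier, which memorizes the training data, with a simple threshold rule. The data, the two noisy training labels, and the assumed true rule (label 1 when x is at least 5) are all hypothetical.

```python
def one_nn(train):
    """Return a classifier that copies the label of the nearest training point."""
    def predict(x):
        nearest = min(train, key=lambda rec: abs(rec[0] - x))
        return nearest[1]
    return predict

def threshold_rule(x):
    """A much simpler model: predict 1 when x is at least 5."""
    return 1 if x >= 5 else 0

def error_rate(predict, data):
    return sum(predict(x) != y for x, y in data) / len(data)

# Training labels follow the rule except for noise at x=2 and x=7.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Validation labels follow the rule exactly.
valid = [(1.9, 0), (2.2, 0), (4.0, 0), (6.8, 1), (7.1, 1), (9.0, 1)]

nn = one_nn(train)
nn_train_err = error_rate(nn, train)               # memorizes noise: 0.0
nn_valid_err = error_rate(nn, valid)               # noise misleads: 4 of 6 wrong
rule_train_err = error_rate(threshold_rule, train) # misses the 2 noisy points
rule_valid_err = error_rate(threshold_rule, valid) # generalizes: 0.0

print(nn_train_err, nn_valid_err, rule_train_err, rule_valid_err)
```

The flexible model looks perfect on the training data and fails on validation data, while the simpler rule does the reverse; the gap between the two error rates is the signal to watch.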

Model Generalization · psychological state

The degree to which a fitted model performs well on data it has not seen (validation/test/new records), reflecting that it captured signal rather than noise.

Class Imbalance and Misclassification-Cost Context · contextual condition

Contextual condition describing whether the class of interest is rare and whether misclassification costs are asymmetric, which moderates how performance should be measured and whether oversampling is needed.
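
To see why a rare class and asymmetric costs change how performance should be judged, consider a sketch with made-up numbers: 1,000 records, 20 of them in the class of interest, and a missed case assumed to cost 50 times as much as a false alarm. The confusion-matrix counts and cost values are hypothetical.

```python
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def average_cost(tp, fn, fp, tn, cost_fn, cost_fp):
    """Average misclassification cost per record."""
    return (fn * cost_fn + fp * cost_fp) / (tp + fn + fp + tn)

# Hypothetical confusion-matrix counts for two classifiers.
always_negative = dict(tp=0, fn=20, fp=0, tn=980)    # never flags anyone
flagging_model = dict(tp=15, fn=5, fp=100, tn=880)   # flags aggressively

costs = dict(cost_fn=50, cost_fp=1)   # assumed asymmetric costs

acc_naive = accuracy(**always_negative)              # 0.98
acc_model = accuracy(**flagging_model)               # 0.895
cost_naive = average_cost(**always_negative, **costs)  # 1.0 per record
cost_model = average_cost(**flagging_model, **costs)   # 0.35 per record

print(acc_naive, cost_naive)
print(acc_model, cost_model)
```

The classifier that never flags the rare class wins on accuracy yet costs almost three times as much per record, which is the situation the moderator describes.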

Appropriateness of Performance Metric Choice · design lever

Whether the chosen evaluation metric matches the task (numerical prediction vs. classification vs. ranking) and the importance/cost structure of classes (e.g., RMSE, accuracy, sensitivity/specificity, lift, ROC/AUC, cost-weighted measures).
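
For ranking tasks, lift compares the response rate among the highest-scored records with the overall response rate. The sketch below computes lift for the top 10% of a small, hypothetical scored list; the function name `top_fraction_lift` and the data are assumptions made for illustration.

```python
def top_fraction_lift(scores, actuals, fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(a for _, a in ranked[:k]) / k
    base_rate = sum(actuals) / len(actuals)
    return top_rate / base_rate

# Hypothetical model scores and true outcomes (1 = responder) for 20 records.
scores = [0.95, 0.91, 0.88, 0.80, 0.75, 0.70, 0.66, 0.60, 0.55, 0.50,
          0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

lift = top_fraction_lift(scores, actuals)
print(lift)
```

Here both of the two top-scored records are responders against a 20% base rate, so targeting the top decile finds responders five times as often as random selection would.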

Predictive Performance · outcome metric

The accuracy/quality of model predictions or classifications on holdout data, captured by metrics such as RMSE/MAE/MAPE for prediction, accuracy/sensitivity/specificity/AUC for classification, and lift for ranking.
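
The metrics named above have short standard definitions. The following sketch computes RMSE, MAE, and MAPE for numerical prediction and sensitivity and specificity for classification; the holdout values and confusion-matrix counts are invented for the example.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more than MAE."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be nonzero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def sensitivity(tp, fn):
    """Share of actual positives that the model catches."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that the model correctly leaves alone."""
    return tn / (tn + fp)

# Hypothetical holdout values for a numerical prediction task.
actual = [100, 200, 400]
predicted = [110, 180, 400]

print(mae(actual, predicted))    # 10.0
print(rmse(actual, predicted))   # about 12.91
print(mape(actual, predicted))   # about 6.67 (percent)

# Hypothetical holdout confusion-matrix counts for a classification task.
print(sensitivity(tp=15, fn=5))   # 0.75
print(specificity(tn=880, fp=100))
```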

Business Value / Decision Quality · outcome metric

The ultimate organizational benefit obtained when validated models are deployed to score new records and improve decisions (e.g., better targeting, profit, fraud detection); it materializes only through deployment and monitoring.

How they connect

problem definition quality → influences method selection fit
problem definition quality → influences business value
data preparation quality → influences predictive performance
dimension reduction → negatively influences model complexity
dimension reduction → influences predictive performance
method selection fit → influences predictive performance
method selection fit → influences model complexity
model complexity → predicts overfitting
overfitting → negatively influences generalization
data partitioning → negatively moderates overfitting
generalization → predicts predictive performance
class imbalance cost condition → moderates performance metric choice
performance metric choice → influences business value
predictive performance → predicts business value
class imbalance cost condition → influences data partitioning

A candidate measure

Data Mining for Business Analytics: Concepts, Techniques, and Applications — derived measurement candidates

Quality of Problem/Purpose Definition

Reviewer rubric score (low/medium/high); Presence/absence of defined success criteria

self-report suitability: medium

Data Preparation and Cleaning Quality

Post-cleaning missing-value rate; Count of outliers handled; Encoding completeness index

self-report suitability: low

Dimension Reduction Effort

% reduction in predictor count; Cumulative variance explained by retained components

self-report suitability: medium

Use of Data Partitioning and Cross-Validation

Partition proportions; Number of folds

self-report suitability: high

Appropriateness of Method Selection

Number of candidate methods compared; Validation metric of selected vs. alternatives

self-report suitability: medium

Model Complexity

AIC/BIC/Mallows Cp; Parameter count; Terminal-node count

self-report suitability: none

Overfitting (Failure to Generalize)

Training–validation error gap; Difference in confusion-matrix accuracy across partitions

self-report suitability: none

Model Generalization

Validation/test metric value; Cross-fold performance variance

self-report suitability: none

Class Imbalance and Misclassification-Cost Context

Minority-class %; False-negative-to-false-positive cost ratio

self-report suitability: medium

Appropriateness of Performance Metric Choice

Task-metric alignment flag; Presence of cost in evaluation

self-report suitability: medium

Predictive Performance

RMSE/MAE/MAPE; Accuracy/sensitivity/specificity/AUC; Lift/decile values

self-report suitability: none

Business Value / Decision Quality

Net profit / cost savings; Response-rate lift in deployment; Reduction in misclassification cost

self-report suitability: medium


The story

The reader

A business student, analyst, or manager who wants to use data to make better, more accurate decisions and to build predictive models that work in practice.

External problem

They face large, messy datasets and a bewildering array of methods, and need to build models that reliably predict or classify new records.

Internal problem

They feel overwhelmed by the complexity of algorithms and are unsure whether their models will actually work on new data or whether they are just fooling themselves.

Philosophical problem

It's wrong to deploy data mining as a 'solution in search of a problem' or to trust models that merely fit past data without proving they generalize.

The plan

Define the business purpose and obtain the right data.
Explore, clean, and reduce the dimension of the data.
Partition data into training, validation, and test sets to guard against overfitting.
Choose and run appropriate supervised or unsupervised methods.
Evaluate performance honestly on holdout data using task-appropriate metrics.
Select the best, parsimonious model and deploy it to score new records.
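
The partition, fit, evaluate, and select steps of the plan can be sketched end to end on synthetic data. This is a toy illustration under stated assumptions: the data generator, the 15% label-noise rate, the 180/60/60 split, and the candidate values of k are all made up, and a one-dimensional k-nearest-neighbors classifier stands in for whatever method a real project would use.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training records nearest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train, data, k):
    return sum(knn_predict(train, x, k) != y for x, y in data) / len(data)

# Synthetic records: true rule is x >= 5, with 15% of labels flipped as noise.
rng = random.Random(7)
records = []
for _ in range(300):
    x = rng.uniform(0, 10)
    y = 1 if x >= 5 else 0
    if rng.random() < 0.15:
        y = 1 - y
    records.append((x, y))

# Records are generated in random order, so slicing gives a random partition.
train, valid, test = records[:180], records[180:240], records[240:]

# The validation set chooses among candidates (odd k avoids tied votes).
candidates = [1, 5, 15]
valid_errors = {k: error_rate(train, valid, k) for k in candidates}
best_k = min(candidates, key=lambda k: valid_errors[k])

# The test set is touched once, to estimate performance of the chosen model.
test_error = error_rate(train, test, best_k)

print(valid_errors)
print(best_k, test_error)
```

The design point is the division of labor: validation error drives model selection, so it is no longer an unbiased estimate for the winner, and the untouched test set supplies the honest figure.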

Success

The reader builds models that generalize to new data and genuinely inform decisions.
They can choose, compare, and tune the right method for the problem and data at hand.
They reliably detect and avoid overfitting and measure performance with the right metrics.
They translate raw, complex data into actionable business insight and competitive advantage.

At stake

Models overfit the training data and fail in deployment, eroding trust and wasting resources.
Misleading accuracy metrics conceal poor performance on the cases that matter most.
Analytics becomes an expensive solution searching for a problem, delivering no business value.

Questions this book answers

What is the difference between business analytics, data mining, and data science?
How does one move through a disciplined data-mining process from problem definition to model deployment?
How can predictive models be built that generalize to new data rather than merely fitting the data at hand?
Which supervised and unsupervised methods are appropriate for which data structures and goals?
How should predictive performance be measured for classification, prediction, and ranking tasks?

Glossary

Quality of Problem/Purpose Definition: The clarity and completeness with which the business objective, intended use of results, stakeholders, and decision context are specified prior to modeling.
Data Preparation and Cleaning Quality: The degree to which data are correctly obtained, explored, cleaned, encoded, and normalized before modeling.
Dimension Reduction Effort: The extent to which predictor count and redundancy are reduced via domain knowledge, summaries, correlation analysis, category consolidation, PCA, or trees.
Use of Data Partitioning and Cross-Validation: The practice of dividing data into training, validation, and test sets (or using cross-validation folds) to obtain honest performance estimates and tune models.
Appropriateness of Method Selection: The fit between the chosen modeling technique and the data type, structure, goal, and class separability.
Model Complexity: The flexibility/richness of the fitted model as captured by number of parameters, depth, nodes, or terms.
Overfitting (Failure to Generalize): The phenomenon in which a model fits noise specific to the training data, producing low training error but degraded performance on new data.
Model Generalization: The capacity of a model to perform well on data it has not seen, reflecting capture of underlying signal.
