Practical Statistics for Data Scientists
In a sentence
A practitioner's guide to the 50+ core statistical concepts data scientists actually need, explaining which classical methods matter for modern data work and why.
Written for data scientists with some R/Python familiarity and patchy statistics background, this book distills a century and a half of statistical theory into the concepts that are genuinely useful for data science practice. Rather than burdening readers with the inferential machinery that dominates traditional courses, the authors—two statisticians turned data scientists and a computational scientist—triage statistical ideas, showing which are load-bearing for prediction and exploration and which can be safely de-emphasized. From exploratory data analysis and sampling distributions through significance testing, regression, classification, and statistical machine learning (KNN, trees, bagging, boosting) to unsupervised methods (PCA, clustering), every concept is paired with practical R and Python code and a clear-eyed verdict on its real-world relevance. The result is a navigable, reference-friendly bridge between the disciplines of statistics and data science.
The story it tells the reader
The reader
A data scientist with working knowledge of R and/or Python and spotty prior exposure to statistics who wants to apply the right statistical concepts effectively to real data problems.
External problem
Traditional statistics instruction is bloated with inferential machinery, and it is unclear which concepts actually matter for building and evaluating predictive models on real data.
Internal problem
The reader feels uncertain, intimidated, or fooled by random patterns, unsure whether their analyses are sound or whether they are overcomplicating things.
Philosophical problem
Statistics shouldn't be taught as a century-old monolith disconnected from data science; practitioners deserve guidance on what is useful and why.
The plan
- Start by exploring and visualizing your data to build intuition (EDA).
- Understand sampling and sampling distributions, and use the bootstrap to gauge uncertainty.
- Apply experimental design and significance testing judiciously, focusing on resampling intuition.
- Build regression models for prediction and explanation, watching for confounders and diagnostics.
- Build and rigorously evaluate classification models, especially with imbalanced data.
- Use statistical machine learning (KNN, trees, random forests, boosting) with cross-validated tuning.
- Apply unsupervised methods (PCA, clustering) with proper scaling to reduce dimensions and find structure.
Success
- You confidently choose the right statistical method for each problem and know why.
- You quantify uncertainty correctly and avoid being misled by random variation.
- You build predictive models that generalize well and evaluate them with appropriate metrics.
- You communicate results clearly, distinguishing signal from noise.
At stake
- You waste effort on irrelevant statistical formalism while missing what matters for prediction.
- You overfit models and draw conclusions from statistical noise.
- You misinterpret p-values, accuracy, and other metrics, leading to flawed business decisions.
- You are fooled by chance patterns and selection bias in your data.
Model of the world · 12 constructs · 15 relations
A framework capturing how methodological design choices and data conditions in a data science workflow drive intermediate analytical states (model fit, uncertainty quantification, overfitting risk) and ultimately determine predictive accuracy and generalization to new data.
Design levers
- Exploratory Data Analysis Rigor
- Use of Resampling Methods
- Model Complexity Control
- Feature Scaling Appropriateness
- Experimental Design Quality
- Evaluation Metric Appropriateness
Intermediate states & behaviors
- Model Validity and Bias Avoidance
- Uncertainty Quantification
- Overfitting Risk
Outcomes
- Predictive Accuracy on New Data
- Analytical Insight and Interpretability
Moderators / context: Sampling Quality and Representativeness
Exploratory Data Analysis Rigor (design lever)
The degree to which the analyst summarizes and visualizes data (location/variability estimates, distributions, correlations, multivariate plots) before modeling, building intuition about structure, outliers, and relationships in the data.
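As a minimal sketch of the location and variability estimates mentioned above, the following compares the mean and standard deviation with more robust alternatives. The `sample` values and the helper names `trimmed_mean` and `median_absolute_deviation` are illustrative assumptions, not taken from the book.

```python
import statistics


def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (a robust spread estimate)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


# Hypothetical sample with one extreme value (illustrative numbers only).
sample = [3, 4, 4, 5, 5, 5, 6, 6, 7, 120]

summary = {
    "mean": statistics.fmean(sample),
    "median": statistics.median(sample),
    "trimmed_mean": trimmed_mean(sample),
    "stdev": statistics.stdev(sample),
    "mad": median_absolute_deviation(sample),
}

for name, value in summary.items():
    print(f"{name:>12}: {value:.2f}")
```

In this made-up sample the single extreme value pulls the mean far from the median and the trimmed mean, which is the kind of discrepancy a summary step is meant to surface before modeling.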
Sampling Quality and Representativeness (contextual condition)
The extent to which the data sample is drawn randomly and representatively from the target population, minimizing bias and selection effects, and balancing size against quality for the analytical goal.
Use of Resampling Methods (design lever)
The application of bootstrap and permutation procedures to estimate sampling distributions, standard errors, confidence intervals, and to test hypotheses without relying on strong distributional assumptions.
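One way to picture the bootstrap part of this definition is the sketch below: resample the data with replacement many times, recompute the statistic each time, and read the standard error and a percentile interval off the resulting distribution. The data, the function names, and the choices of 2,000 resamples and a 90% interval are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_statistic(values, stat=statistics.fmean, n_resamples=2000, seed=0):
    """Recompute `stat` on n_resamples samples drawn with replacement from values."""
    rng = random.Random(seed)
    n = len(values)
    return [stat(rng.choices(values, k=n)) for _ in range(n_resamples)]


def percentile_interval(estimates, level=0.90):
    """Approximate percentile interval taken from the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1 - level) / 2
    low = ordered[int(tail * len(ordered))]
    high = ordered[int((1 - tail) * len(ordered)) - 1]
    return low, high


# Hypothetical measurements (illustrative numbers only).
sample = [12.1, 9.8, 11.4, 15.2, 10.0, 13.7, 9.1, 14.3, 12.8, 11.9,
          10.6, 16.0, 8.7, 12.2, 13.1, 11.0, 9.9, 14.8, 12.5, 10.3]

boot_means = bootstrap_statistic(sample)
standard_error = statistics.stdev(boot_means)  # spread of the bootstrap means
low, high = percentile_interval(boot_means)

print(f"sample mean:     {statistics.fmean(sample):.2f}")
print(f"bootstrap SE:    {standard_error:.2f}")
print(f"90% interval:    ({low:.2f}, {high:.2f})")
```

Passing `stat=statistics.median` would give the same kind of estimate for the median, without a separate formula for each statistic.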
Experimental Design Quality (design lever)
The soundness of the testing setup including randomization, control groups, predefined hypotheses, and appropriate significance testing, which protects conclusions from chance and bias.
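A permutation test is one resampling-based way to carry out the significance testing described here for a two-group comparison. The sketch below is a minimal version under stated assumptions: the `control` and `treatment` numbers are invented, the test statistic is the difference in means, and the p-value is two-sided.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Return (observed difference in means, two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_b) - statistics.fmean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two groups of the
        # original sizes, as if group labels had been assigned at random.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical outcomes for a control and a treatment group (illustrative only).
control = [21.0, 19.5, 23.1, 20.4, 22.8, 18.9, 21.7, 20.2]
treatment = [22.4, 24.0, 21.9, 25.3, 23.6, 22.1, 24.8, 23.0]

observed_diff, p_value = permutation_test(control, treatment)
print(f"observed difference in means: {observed_diff:.2f}")
print(f"permutation p-value:          {p_value:.4f}")
```

The p-value is the share of random relabelings that produce a difference at least as large as the observed one, so it answers the question of how often chance alone would look this convincing.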
Model Complexity Control (design lever)
The use of techniques such as variable selection, stepwise/penalized regression, pruning, regularization, and hyperparameter tuning to constrain model complexity in line with the principle of parsimony.
Feature Scaling Appropriateness (design lever)
The degree to which variables are appropriately normalized, standardized, or encoded (e.g., z-scores, Gower's distance, one-hot encoding) so that no variable unduly dominates distance- or variance-based methods.
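To illustrate why scaling matters for distance-based methods, the sketch below converts two columns to z-scores and compares the Euclidean distance between two records before and after. The `income` and `age` columns and the `standardize` helper are hypothetical and chosen only to show the effect.

```python
import math
import statistics


def standardize(column):
    """Convert a numeric column to z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mean) / sd for x in column]


# Hypothetical records on very different scales (illustrative numbers only).
income = [30000, 52000, 41000, 98000, 67000]
age = [25, 47, 33, 61, 52]

income_z = standardize(income)
age_z = standardize(age)

# Distance between the first two records, before and after scaling.
raw_distance = math.dist((income[0], age[0]), (income[1], age[1]))
scaled_distance = math.dist((income_z[0], age_z[0]), (income_z[1], age_z[1]))

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.2f}")
```

On the raw values the distance is driven almost entirely by income because its units are so much larger; after standardization both variables contribute on a comparable footing.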
Evaluation Metric Appropriateness (design lever)
The selection of performance metrics (e.g., recall, specificity, precision, AUC, lift, RMSE) that match the business objective and data structure, especially under class imbalance.
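The effect of class imbalance on metric choice can be sketched with a confusion matrix computed from scratch. The labels below are invented: 5 positives among 100 records, scored against a baseline that always predicts the negative class and against a hypothetical model that flags some records. Undefined ratios are returned as `None`, which is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),    # of predicted positives, share correct
        "recall": ratio(tp, tp + fn),       # of actual positives, share found
        "specificity": ratio(tn, tn + fp),  # of actual negatives, share found
    }


# Hypothetical imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
some_model = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, some_model)
print("baseline:", baseline)
print("model:   ", model)
```

In this example the do-nothing baseline has the higher accuracy even though it finds none of the positives, which is why recall, precision, and specificity are reported alongside accuracy when classes are imbalanced.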
Uncertainty Quantification (psychological state)
The intermediate analytical state in which the variability and reliability of estimates and predictions are accurately characterized (via standard errors, confidence/prediction intervals, sampling distributions).
Overfitting Risk (behavioral pattern)
The intermediate state reflecting the extent to which a model fits noise rather than signal in the training data, increasing variance and degrading performance on new data.
Model Validity and Bias Avoidance (psychological state)
The intermediate state in which conclusions and models are free from spurious relationships, confounding, selection bias, and being fooled by random chance.
Predictive Accuracy on New Data (outcome metric)
The outcome measuring how well the model predicts or classifies out-of-sample records, captured by metrics such as RMSE for regression and AUC/recall for classification.
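For concreteness, the two headline metrics named here can be computed in a few lines. RMSE is the square root of the mean squared prediction error. AUC is computed below through its pairwise interpretation, the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. All numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def auc(positive_scores, negative_scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical out-of-sample regression predictions.
regression_error = rmse([3, 5, 7], [2, 5, 9])

# Hypothetical classifier scores for held-out positive and negative records.
ranking_quality = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])

print(f"RMSE: {regression_error:.3f}")
print(f"AUC:  {ranking_quality:.3f}")
```

The pairwise loop is quadratic in the number of records, so it is meant as an explanation of what AUC measures rather than as an efficient implementation.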
Analytical Insight and Interpretability (outcome metric)
The outcome of gaining meaningful, communicable understanding of relationships in the data, including variable importance, structure, and explanatory mechanisms.
How they connect
- exploratory data analysis rigor → predicts analytical insight
- exploratory data analysis rigor → influences model validity
- sampling quality → predicts model validity
- resampling use → predicts uncertainty quantification
- resampling use → influences model validity
- experimental design quality → predicts model validity
- model complexity control → negatively predicts overfitting risk
- overfitting risk → negatively predicts predictive accuracy
- model complexity control → mediates predictive accuracy
- feature scaling appropriateness → moderates predictive accuracy
- evaluation metric appropriateness → moderates predictive accuracy
- uncertainty quantification → predicts model validity
- model validity → predicts predictive accuracy
- model validity → predicts analytical insight
- feature scaling appropriateness → influences analytical insight
Frameworks & instruments in this book
- Look at the data before modeling it.
- Account for uncertainty and random variation rather than being fooled by chance patterns.
- Prefer robust estimates that are not dominated by outliers.
- Evaluate models on out-of-sample data (holdout, cross-validation) rather than in-sample fit.
- Favor simpler models (Occam's razor) and penalize unnecessary complexity to avoid overfitting.
- Scale and normalize variables appropriately before distance-based or variance-based methods.
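The out-of-sample principle in the list above can be sketched with k-fold cross-validation on a small K-nearest-neighbors regressor. Everything in this example is an assumption made for illustration: the synthetic data (a linear trend plus Gaussian noise), the one-dimensional KNN, and the function names. It contrasts the in-sample error of a model that memorizes the training data with its cross-validated error.

```python
import random
import statistics


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x (1-D KNN regression)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def in_sample_rmse(data, k_neighbors):
    """RMSE when the model is scored on the same records it was fit on."""
    errors = [(knn_predict(data, x, k_neighbors) - y) ** 2 for x, y in data]
    return statistics.fmean(errors) ** 0.5


def cross_validated_rmse(data, k_neighbors, n_folds=5, seed=0):
    """RMSE where every record is predicted by a model that never saw it."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    squared_errors = []
    for i, holdout in enumerate(folds):
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        for x, y in holdout:
            squared_errors.append((knn_predict(train, x, k_neighbors) - y) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


# Synthetic data: y = 2x plus noise (purely illustrative).
noise = random.Random(42)
data = [(i / 10, 2 * (i / 10) + noise.gauss(0, 1)) for i in range(50)]

for k in (1, 5, 15):
    print(f"k={k:>2}  in-sample RMSE={in_sample_rmse(data, k):.2f}  "
          f"cross-validated RMSE={cross_validated_rmse(data, k):.2f}")
```

With one neighbor, each training point is its own nearest neighbor, so the in-sample error is zero by construction, while the cross-validated error is not. Comparing the cross-validated column across values of `k` is one simple way to choose model complexity.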
Several of these are operationalized as tools in the People Analytics Toolbox.
Topics
- applied statistics
Related in the library
- The Art of Statistics, by David Spiegelhalter (Statistics)
- Big Data: A Very Short Introduction (Very Short Introductions), by Dawn E. Holmes (Statistics · Systems)
- Networks: A Very Short Introduction (Very Short Introductions), by Guido Caldarelli & Michele Catanzaro (Statistics · Systems)
- People Analytics & Text Mining with R, by Cedric Ng Mong Shen (Statistics · Systems)
- People Analytics For Dummies, by Mike West (Statistics · Systems)
- Predictive HR Analytics, by Cedric Ng Mong Shen (Statistics · Systems)