Data Mining for Business Analytics: Concepts, Techniques, and Applications
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr. · 2017
In a sentence
A practical, hands-on guide that teaches the core concepts, techniques, and applications of data mining for business analytics using R, organized around a disciplined predictive-modeling process.
This book demystifies data mining for business students and practitioners by treating it as a structured, repeatable process rather than a black box of algorithms. Beginning with how to define a business problem and prepare data, it walks readers through the full predictive-analytics workflow: data exploration and visualization, dimension reduction, performance evaluation, and a comprehensive toolkit of supervised methods (multiple linear and logistic regression, k-nearest neighbors, naive Bayes, classification and regression trees, neural nets, discriminant analysis, ensembles) and unsupervised methods (association rules, collaborative filtering, cluster analysis). It also covers time-series forecasting and emerging data-analytics domains such as social network analytics and text mining. Throughout, the authors emphasize overfitting as the central data-mining danger, along with the disciplined use of data partitioning (training/validation/test sets) and honest performance metrics to ensure that models generalize to new records. Rich with real business cases and R code, it equips the reader to build, evaluate, select, and deploy models that actually inform decisions.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A framework-and-path model expressing how design choices in the data-mining process (problem definition, data preparation, dimension reduction, partitioning, method selection) influence intermediate analytic states (model complexity, generalization) and ultimately predictive performance and business value, with overfitting and class/cost conditions acting as key moderators.
Quality of Problem/Purpose Definition (design lever)
The degree to which the business purpose, intended use of results, affected stakeholders, and decision context are clearly understood and specified before modeling begins; the book stresses this as the most error-prone step.
Data Preparation and Cleaning Quality (design lever)
The thoroughness and correctness of obtaining, exploring, cleaning, handling missing values and outliers, encoding categorical variables, and normalizing/rescaling data prior to modeling, which conditions all downstream analysis.
Dimension Reduction Effort (design lever)
The extent to which the number of variables is reduced and consolidated via domain knowledge, data summaries, correlation analysis, category reduction, principal components, or trees to combat the curse of dimensionality.
Use of Data Partitioning and Cross-Validation (design lever)
The practice of splitting data into training, validation, and (often) test sets, or using cross-validation, so models are fit on one set and evaluated on unseen data to obtain honest performance estimates.
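A three-way random partition can be sketched in a few lines. This is a minimal illustration in Python rather than the book's R code; the function name `partition`, the 50/30/20 proportions, and the seed are assumptions chosen for the example, not values taken from the book.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test lists.

    The fractions and seed are illustrative defaults (assumptions).
    """
    if not 0 < train_frac + valid_frac <= 1:
        raise ValueError("train_frac + valid_frac must be in (0, 1]")
    shuffled = list(records)                # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains is held out
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Models would be fit on `train`, compared and tuned on `valid`, and given a final check on `test`, so that no record used for fitting also serves as evidence of performance.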
Appropriateness of Method Selection (design lever)
How well the chosen modeling technique (e.g., regression, k-NN, naive Bayes, trees, neural nets, discriminant analysis, ensembles, clustering) matches the data structure, goal, and class separation; the book notes methods coexist because each has different strengths.
Model Complexity (psychological state)
The richness/flexibility of the fitted model (number of predictors, tree depth, hidden nodes, interaction terms), which mediates between design choices and generalization; too high invites overfitting, too low underfits.
Overfitting (Failure to Generalize) (behavioral pattern)
The condition in which a model fits noise/idiosyncrasies of the training data, yielding low training error but poor performance on new records; identified as the central hazard of data mining.
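The pattern of low training error alongside poor performance on new records can be reproduced on a toy problem. The sketch below is an illustration in Python under stated assumptions: the data are hypothetical, the classifier is a hand-written one-dimensional k-nearest-neighbors rule, and the names `knn_predict` and `error_rate` are invented for the example. With k = 1 the model memorizes two noisy training labels; with k = 3 it smooths over them.

```python
def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points (1-D)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_1 = sum(label for _, label in nearest)
    return 1 if votes_for_1 > k / 2 else 0

def error_rate(train, data, k):
    """Share of records in data that the k-NN rule misclassifies."""
    wrong = sum(1 for x, label in data if knn_predict(train, x, k) != label)
    return wrong / len(data)

# Hypothetical data. Underlying rule: class 1 when x >= 5.
# The training points (3, 1) and (7, 0) are deliberate noise.
TRAIN = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 0), (8, 1)]
VALID = [(1.5, 0), (2.6, 0), (3.4, 0), (5.5, 1), (6.6, 1), (7.4, 1)]

for k in (1, 3):
    train_err = error_rate(TRAIN, TRAIN, k)
    valid_err = error_rate(TRAIN, VALID, k)
    print(f"k={k}: training error={train_err:.2f}, "
          f"validation error={valid_err:.2f}, gap={valid_err - train_err:+.2f}")
```

On this data the most flexible setting (k = 1) has zero training error but misclassifies four of the six validation records, while k = 3 makes training mistakes on the noisy points yet classifies every validation record correctly. The training error alone would have pointed to the wrong model.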
Model Generalization (psychological state)
The degree to which a fitted model performs well on data it has not seen (validation/test/new records), reflecting that it captured signal rather than noise.
Class Imbalance and Misclassification-Cost Context (contextual condition)
Contextual condition describing whether the class of interest is rare and whether misclassification costs are asymmetric, which moderates how performance should be measured and whether oversampling is needed.
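A small worked example shows why a rare class and asymmetric costs change how performance should be read. The counts and costs below are hypothetical (1,000 records, 2% fraud, a missed fraud assumed to cost 50 times a false alarm), and the helper names are assumptions made for this Python sketch.

```python
def accuracy(tp, fp, tn, fn):
    """Share of records classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def average_cost(tp, fp, tn, fn, cost_fp, cost_fn):
    """Average misclassification cost per record."""
    return (fp * cost_fp + fn * cost_fn) / (tp + fp + tn + fn)

# Hypothetical confusion-matrix counts for 1,000 records, 20 of them fraudulent.
naive = dict(tp=0, fp=0, tn=980, fn=20)    # classify everything as non-fraud
model = dict(tp=15, fp=60, tn=920, fn=5)   # catches 15 frauds, raises 60 false alarms

COST_FP, COST_FN = 1, 50                   # assumed cost structure

for name, counts in (("naive rule", naive), ("model", model)):
    print(f"{name}: accuracy={accuracy(**counts):.3f}, "
          f"average cost={average_cost(**counts, cost_fp=COST_FP, cost_fn=COST_FN):.2f}")
```

Under these assumptions the naive rule wins on accuracy (0.980 against 0.935) but costs about three times as much per record (1.00 against 0.31), which is the situation in which accuracy alone conceals poor performance on the class of interest.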
Appropriateness of Performance Metric Choice (design lever)
Whether the chosen evaluation metric matches the task (numerical prediction vs. classification vs. ranking) and the importance/cost structure of classes (e.g., RMSE, accuracy, sensitivity/specificity, lift, ROC/AUC, cost-weighted measures).
Predictive Performance (outcome metric)
The accuracy/quality of model predictions or classifications on holdout data, captured by metrics such as RMSE/MAE/MAPE for prediction, accuracy/sensitivity/specificity/AUC for classification, and lift for ranking.
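The metrics named here follow standard definitions, sketched below in plain Python. The holdout values are made up for illustration, and the function names are assumptions for this example rather than the book's R functions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def sensitivity(tp, fn):
    """Share of actual positives that were classified as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that were classified as negative."""
    return tn / (tn + fp)

# Hypothetical holdout values for a numerical prediction task.
actual = [100, 200, 400]
predicted = [110, 190, 400]
print(f"RMSE={rmse(actual, predicted):.3f}, MAE={mae(actual, predicted):.3f}, "
      f"MAPE={mape(actual, predicted):.1f}%")

# Hypothetical holdout counts for a classification task.
print(f"sensitivity={sensitivity(tp=15, fn=5):.2f}, specificity={specificity(tn=920, fp=60):.3f}")
```

All of these are meant to be computed on validation or test records, not on the data used to fit the model.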
Business Value / Decision Quality (outcome metric)
The ultimate organizational benefit realized when validated models are deployed to score new records and improve decisions (e.g., better targeting, profit, fraud detection), realized only through deployment and monitoring.
How they connect
- problem definition quality → influences method selection fit
- problem definition quality → influences business value
- data preparation quality → influences predictive performance
- dimension reduction − influences model complexity
- dimension reduction → influences predictive performance
- method selection fit → influences predictive performance
- method selection fit → influences model complexity
- model complexity → predicts overfitting
- overfitting − influences generalization
- data partitioning − moderates overfitting
- generalization → predicts predictive performance
- class imbalance cost condition → moderates performance metric choice
- performance metric choice → influences business value
- predictive performance → predicts business value
- class imbalance cost condition → influences data partitioning
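The connections listed above form a small directed graph, which makes it possible to trace how a design lever reaches business value. The sketch below is one possible encoding in Python: each link is stored with its marker and verb exactly as written, the markers are not interpreted (reading "−" as a negative relationship would be an assumption), and the helper `paths_to` is a hypothetical name for a simple depth-first search.

```python
# (source, marker, verb, target), copied from the list of connections.
EDGES = [
    ("problem definition quality", "→", "influences", "method selection fit"),
    ("problem definition quality", "→", "influences", "business value"),
    ("data preparation quality", "→", "influences", "predictive performance"),
    ("dimension reduction", "−", "influences", "model complexity"),
    ("dimension reduction", "→", "influences", "predictive performance"),
    ("method selection fit", "→", "influences", "predictive performance"),
    ("method selection fit", "→", "influences", "model complexity"),
    ("model complexity", "→", "predicts", "overfitting"),
    ("overfitting", "−", "influences", "generalization"),
    ("data partitioning", "−", "moderates", "overfitting"),
    ("generalization", "→", "predicts", "predictive performance"),
    ("class imbalance cost condition", "→", "moderates", "performance metric choice"),
    ("performance metric choice", "→", "influences", "business value"),
    ("predictive performance", "→", "predicts", "business value"),
    ("class imbalance cost condition", "→", "influences", "data partitioning"),
]

def paths_to(start, goal, edges=EDGES, seen=()):
    """Return every directed path from start to goal as a list of node names."""
    if start == goal:
        return [[start]]
    found = []
    for source, _marker, _verb, target in edges:
        if source == start and target not in seen:   # guard against cycles
            for tail in paths_to(target, goal, edges, seen + (start,)):
                found.append([start] + tail)
    return found

for path in paths_to("dimension reduction", "business value"):
    print(" -> ".join(path))
```

For dimension reduction this yields two routes: a direct one through predictive performance, and a longer one through model complexity, overfitting, and generalization.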
A candidate measure
Data Mining for Business Analytics: Concepts, Techniques, and Applications — derived measurement candidates
Quality of Problem/Purpose Definition
Reviewer rubric score (low/medium/high); Presence/absence of defined success criteria
self-report suitability: medium
Data Preparation and Cleaning Quality
Post-cleaning missing-value rate; Count of outliers handled; Encoding completeness index
self-report suitability: low
Dimension Reduction Effort
% reduction in predictor count; Cumulative variance explained by retained components
self-report suitability: medium
Use of Data Partitioning and Cross-Validation
Partition proportions; Number of folds
self-report suitability: high
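The "number of folds" measure refers to k-fold cross-validation, in which each record is held out exactly once. A minimal Python sketch of the index bookkeeping follows; the function name `k_fold_indices` is an assumption, and the records are assumed to be in random order already.

```python
def k_fold_indices(n_records, n_folds):
    """Yield (training_indices, holdout_indices) once per fold.

    Assumes the records have already been shuffled.
    """
    if not 2 <= n_folds <= n_records:
        raise ValueError("n_folds must be between 2 and n_records")
    indices = list(range(n_records))
    for fold in range(n_folds):
        holdout = indices[fold::n_folds]          # every n_folds-th record
        holdout_set = set(holdout)
        training = [i for i in indices if i not in holdout_set]
        yield training, holdout

for training, holdout in k_fold_indices(n_records=10, n_folds=5):
    print("fit on", training, "evaluate on", holdout)
```

Averaging the holdout metric across folds gives the performance estimate, and the spread across folds is one way to gauge how stable that estimate is.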
Appropriateness of Method Selection
Number of candidate methods compared; Validation metric of selected vs. alternatives
self-report suitability: medium
Model Complexity
AIC/BIC/Mallows Cp; Parameter count; Terminal-node count
self-report suitability: none
Overfitting (Failure to Generalize)
Training–validation error gap; Difference in confusion-matrix accuracy across partitions
self-report suitability: none
Model Generalization
Validation/test metric value; Cross-fold performance variance
self-report suitability: none
Class Imbalance and Misclassification-Cost Context
Minority-class %; False-negative-to-false-positive cost ratio
self-report suitability: medium
Appropriateness of Performance Metric Choice
Task-metric alignment flag; Presence of cost in evaluation
self-report suitability: medium
Predictive Performance
RMSE/MAE/MAPE; Accuracy/sensitivity/specificity/AUC; Lift/decile values
self-report suitability: none
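Lift for a ranking task compares the response rate among the highest-scored records with the overall response rate. The Python sketch below computes top-decile lift on hypothetical scores and outcomes; `top_decile_lift` is an assumed name for the example.

```python
def top_decile_lift(scores, outcomes):
    """Response rate in the top 10% of records by score, divided by the overall rate."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and the same length")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no responders")
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)             # size of the top decile
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    return top_rate / overall_rate

# Hypothetical holdout data: 20 records, 4 responders, already ordered by score.
scores = [i / 20 for i in range(20, 0, -1)]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1] + [0] * 12
print(f"top-decile lift = {top_decile_lift(scores, outcomes):.1f}")
```

Here both records in the top decile are responders while the overall response rate is 20%, so targeting the top decile reaches responders at five times the rate of random selection.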
Business Value / Decision Quality
Net profit / cost savings; Response-rate lift in deployment; Reduction in misclassification cost
self-report suitability: medium
The story
The reader
A business student, analyst, or manager who wants to use data to make better, more accurate decisions and to build predictive models that work in practice.
External problem
They face large, messy datasets and a bewildering array of methods, and need to build models that reliably predict or classify new records.
Internal problem
They feel overwhelmed by the complexity of the algorithms and are unsure whether their models will actually work on new data or whether they are just fooling themselves.
Philosophical problem
It's wrong to deploy data mining as a 'solution in search of a problem' or to trust models that merely fit past data without proving they generalize.
The plan
- Define the business purpose and obtain the right data.
- Explore, clean, and reduce the dimension of the data.
- Partition data into training, validation, and test sets to guard against overfitting.
- Choose and run appropriate supervised or unsupervised methods.
- Evaluate performance honestly on holdout data using task-appropriate metrics.
- Select the best, parsimonious model and deploy it to score new records.
Success
- The reader builds models that generalize to new data and genuinely inform decisions.
- They can choose, compare, and tune the right method for the problem and data at hand.
- They reliably detect and avoid overfitting and measure performance with the right metrics.
- They translate raw, complex data into actionable business insight and competitive advantage.
At stake
- Models overfit the training data and fail in deployment, eroding trust and wasting resources.
- Misleading accuracy metrics conceal poor performance on the cases that matter most.
- Analytics becomes an expensive solution searching for a problem, delivering no business value.
Related in the library
- Data Science from Scratch: First Principles with Python
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
- Introduction to Statistical and Machine Learning Methods for Data Science
- Machine Learning and Data Science
- Statistical Rethinking (McElreath)
- Statistics: A Very Short Introduction (Very Short Introductions)