Data Mining for Business Analytics: Concepts, Techniques, and Applications

Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr. · 2017

In a sentence

A practical, hands-on guide that teaches the core concepts, techniques, and applications of data mining for business analytics using R, organized around a disciplined predictive-modeling process.

This book demystifies data mining for business students and practitioners by treating it as a structured, repeatable process rather than a black box of algorithms. Beginning with how to define a business problem and prepare data, it walks readers through the full predictive-analytics workflow: data exploration and visualization, dimension reduction, performance evaluation, and a comprehensive toolkit of supervised methods (multiple linear and logistic regression, k-nearest neighbors, naive Bayes, classification and regression trees, neural nets, discriminant analysis, ensembles) and unsupervised methods (association rules, collaborative filtering, cluster analysis). It also covers time-series forecasting and emerging data-analytics domains such as social network analytics and text mining. Throughout, the authors emphasize the central data-mining danger, overfitting, and the disciplined use of data partitioning (training/validation/test sets) and honest performance metrics to ensure that models generalize to new records. Rich with real business cases and R code, it equips the reader to build, evaluate, select, and deploy models that actually inform decisions.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

The model

A framework-and-path model expressing how design choices in the data-mining process (problem definition, data preparation, dimension reduction, partitioning, method selection) influence intermediate analytic states (model complexity, generalization) and ultimately predictive performance and business value, with overfitting and class/cost conditions acting as key moderators.

Quality of Problem/Purpose Definition (design lever)

The degree to which the business purpose, intended use of results, affected stakeholders, and decision context are clearly understood and specified before modeling begins; the book stresses this as the most error-prone step.

Data Preparation and Cleaning Quality (design lever)

The thoroughness and correctness of obtaining, exploring, cleaning, handling missing values and outliers, encoding categorical variables, and normalizing/rescaling data prior to modeling, which conditions all downstream analysis.

Dimension Reduction Effort (design lever)

The extent to which the number of variables is reduced and consolidated via domain knowledge, data summaries, correlation analysis, category reduction, principal components, or trees to combat the curse of dimensionality.

Use of Data Partitioning and Cross-Validation (design lever)

The practice of splitting data into training, validation, and (often) test sets, or using cross-validation, so models are fit on one set and evaluated on unseen data to obtain honest performance estimates.
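
The three-way split described above can be sketched as follows. This is a minimal illustration in Python (the book itself works in R); the function name `partition`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example, not values prescribed by the book.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=1):
    """Shuffle records, then split them into training, validation, and test lists."""
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test

# Hypothetical data: 100 record IDs standing in for real rows.
train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 60 20 20
```

Models would be fit on `train_set`, compared on `valid_set`, and the final choice assessed once on `test_set`.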

Appropriateness of Method Selection (design lever)

How well the chosen modeling technique (e.g., regression, k-NN, naive Bayes, trees, neural nets, discriminant analysis, ensembles, clustering) matches the data structure, goal, and class separation; the book notes methods coexist because each has different strengths.

Model Complexity (psychological state)

The richness/flexibility of the fitted model (number of predictors, tree depth, hidden nodes, interaction terms), which mediates between design choices and generalization; too much complexity invites overfitting, while too little leads to underfitting.

Overfitting (Failure to Generalize) (behavioral pattern)

The condition in which a model fits noise/idiosyncrasies of the training data, yielding low training error but poor performance on new records; identified as the central hazard of data mining.
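
One way to see this gap between training and new-record performance is a 1-nearest-neighbor classifier, which reproduces its own training labels perfectly and can still misclassify unseen records. The sketch below is a hedged illustration in Python (the book uses R); the records are made-up toy values and the helper names `knn_predict` and `accuracy` are hypothetical.

```python
def knn_predict(train, x, k=1):
    """Return the majority label among the k training records closest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def accuracy(train, data, k=1):
    """Share of records in data whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

# Made-up records: (single predictor value, class label).
train = [(1.0, "no"), (2.0, "no"), (3.0, "yes"),
         (4.0, "no"), (5.0, "yes"), (6.0, "yes")]
valid = [(1.5, "no"), (3.2, "no"), (4.1, "yes"), (5.5, "yes")]

train_acc = accuracy(train, train, k=1)  # each record is its own nearest neighbor
valid_acc = accuracy(train, valid, k=1)
gap = train_acc - valid_acc
print(train_acc, valid_acc, gap)  # 1.0 0.5 0.5
```

The perfect training accuracy here says nothing about new records; only the validation figure is an honest estimate.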

Model Generalization (psychological state)

The degree to which a fitted model performs well on data it has not seen (validation/test/new records), reflecting that it captured signal rather than noise.

Class Imbalance and Misclassification-Cost Context (contextual condition)

Contextual condition describing whether the class of interest is rare and whether misclassification costs are asymmetric, which moderates how performance should be measured and whether oversampling is needed.

Appropriateness of Performance Metric Choice (design lever)

Whether the chosen evaluation metric matches the task (numerical prediction vs. classification vs. ranking) and the importance/cost structure of classes (e.g., RMSE, accuracy, sensitivity/specificity, lift, ROC/AUC, cost-weighted measures).
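
To illustrate why the metric must match the class structure, the sketch below computes accuracy, sensitivity, and specificity from a confusion matrix. It is a Python illustration under assumed, made-up labels (5 "fraud" records out of 100) and a deliberately naive model that predicts "ok" for everyone; the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for the class of interest."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def classification_metrics(actual, predicted, positive):
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Sensitivity: share of actual positives that were caught.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: share of actual negatives correctly left alone.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Made-up data: the class of interest ("fraud") is rare.
actual = ["fraud"] * 5 + ["ok"] * 95
predicted = ["ok"] * 100  # naive rule: never flag anything

m = classification_metrics(actual, predicted, positive="fraud")
print(m)  # accuracy 0.95, sensitivity 0.0, specificity 1.0
```

Accuracy looks strong at 95%, yet the model catches none of the cases that matter, which is what sensitivity exposes.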

Predictive Performance (outcome metric)

The accuracy/quality of model predictions or classifications on holdout data, captured by metrics such as RMSE/MAE/MAPE for prediction, accuracy/sensitivity/specificity/AUC for classification, and lift for ranking.
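
For numerical prediction, the three error measures named above can be computed as in this short Python sketch. The holdout values are made up for illustration, and the function names are not taken from the book.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average error magnitude in the original units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error; undefined when any actual value is zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE requires non-zero actual values")
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up holdout values and predictions.
actual = [100, 200, 400]
predicted = [110, 190, 400]

print(round(rmse(actual, predicted), 3))  # 8.165
print(round(mae(actual, predicted), 3))   # 6.667
print(round(mape(actual, predicted), 3))  # 5.0
```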

Business Value / Decision Quality (outcome metric)

The ultimate organizational benefit obtained when validated models are deployed to score new records and improve decisions (e.g., better targeting, profit, fraud detection); it materializes only through deployment and monitoring.

How they connect

  • problem definition quality influences method selection fit
  • problem definition quality influences business value
  • data preparation quality influences predictive performance
  • dimension reduction influences model complexity
  • dimension reduction influences predictive performance
  • method selection fit influences predictive performance
  • method selection fit influences model complexity
  • model complexity predicts overfitting
  • overfitting influences generalization
  • data partitioning moderates overfitting
  • generalization predicts predictive performance
  • class imbalance cost condition moderates performance metric choice
  • performance metric choice influences business value
  • predictive performance predicts business value
  • class imbalance cost condition influences data partitioning

A candidate measure

Data Mining for Business Analytics: Concepts, Techniques, and Applications — derived measurement candidates

Quality of Problem/Purpose Definition

Reviewer rubric score (low/medium/high); Presence/absence of defined success criteria

self-report suitability: medium

Data Preparation and Cleaning Quality

Post-cleaning missing-value rate; Count of outliers handled; Encoding completeness index

self-report suitability: low

Dimension Reduction Effort

% reduction in predictor count; Cumulative variance explained by retained components

self-report suitability: medium
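
Both dimension-reduction measures are simple ratios. The Python sketch below shows one way to compute them; the predictor counts and component variances are made-up inputs, and in practice the variances would come from a principal components analysis run elsewhere.

```python
def pct_reduction(n_before, n_after):
    """Percentage reduction in the number of predictors."""
    return 100 * (n_before - n_after) / n_before

def cumulative_variance_explained(component_variances):
    """Running share of total variance, taking components from largest to smallest."""
    total = sum(component_variances)
    running = 0.0
    shares = []
    for v in sorted(component_variances, reverse=True):
        running += v
        shares.append(running / total)
    return shares

# Made-up inputs: 20 predictors cut to 5; four component variances.
print(pct_reduction(20, 5))                                  # 75.0
print(cumulative_variance_explained([4.0, 2.0, 1.0, 1.0]))   # [0.5, 0.75, 0.875, 1.0]
```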

Use of Data Partitioning and Cross-Validation

Partition proportions; Number of folds

self-report suitability: high
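
The "number of folds" measure refers to k-fold cross-validation, where each record serves as validation data exactly once. A minimal Python sketch follows; the generator name `k_fold_indices` is hypothetical, and it assumes the records have already been shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs for k folds over record indices 0..n-1."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of records")
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

splits = list(k_fold_indices(10, 5))
for train_idx, valid_idx in splits:
    print(valid_idx, "held out;", len(train_idx), "records used for fitting")
```

Averaging a metric over the k validation folds, and looking at its spread, gives a performance estimate that does not depend on a single split.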

Appropriateness of Method Selection

Number of candidate methods compared; Validation metric of selected vs. alternatives

self-report suitability: medium

Model Complexity

AIC/BIC/Mallows Cp; Parameter count; Terminal-node count

self-report suitability: none
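
As a hedged illustration of how AIC and BIC trade fit against parameter count, the Python sketch below uses the common least-squares forms under a Gaussian-error assumption, with additive constants dropped. The sample size, SSE values, and parameter counts are made up; they are not results from the book.

```python
import math

def aic_least_squares(n, sse, k):
    """AIC for a least-squares fit, up to a constant: n*ln(SSE/n) + 2k."""
    return n * math.log(sse / n) + 2 * k

def bic_least_squares(n, sse, k):
    """BIC for a least-squares fit, up to a constant: n*ln(SSE/n) + k*ln(n)."""
    return n * math.log(sse / n) + k * math.log(n)

# Made-up comparison: the larger model fits slightly better but uses 7 more parameters.
n = 100
small = {"sse": 50.0, "k": 3}
large = {"sse": 49.0, "k": 10}

for name, m in (("small", small), ("large", large)):
    print(name,
          round(aic_least_squares(n, m["sse"], m["k"]), 2),
          round(bic_least_squares(n, m["sse"], m["k"]), 2))
```

Lower values are preferred, so in this made-up case both criteria favor the smaller model: the small gain in fit does not justify the extra parameters.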

Overfitting (Failure to Generalize)

Training–validation error gap; Difference in confusion-matrix accuracy across partitions

self-report suitability: none

Model Generalization

Validation/test metric value; Cross-fold performance variance

self-report suitability: none

Class Imbalance and Misclassification-Cost Context

Minority-class %; False-negative-to-false-positive cost ratio

self-report suitability: medium
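
The two context measures can be combined to compare models by cost rather than by error count. The Python sketch below assumes made-up labels and a hypothetical 10:1 false-negative-to-false-positive cost ratio; the function names are illustrative only.

```python
from collections import Counter

def minority_class_pct(labels):
    """Percentage of records belonging to the least frequent class."""
    counts = Counter(labels)
    return 100 * min(counts.values()) / len(labels)

def average_cost(n_fn, n_fp, n_records, cost_fn, cost_fp):
    """Average misclassification cost per record."""
    return (n_fn * cost_fn + n_fp * cost_fp) / n_records

# Made-up data: 5 "fraud" records among 100.
labels = ["fraud"] * 5 + ["ok"] * 95
print(minority_class_pct(labels))  # 5.0

# Assumed costs: a missed fraud costs 10, a false alarm costs 1.
cost_a = average_cost(n_fn=5, n_fp=0, n_records=100, cost_fn=10, cost_fp=1)
cost_b = average_cost(n_fn=1, n_fp=20, n_records=100, cost_fn=10, cost_fp=1)
print(cost_a, cost_b)  # 0.5 0.3
```

Model A makes 5 errors and model B makes 21, yet B is cheaper once the asymmetric costs are applied.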

Appropriateness of Performance Metric Choice

Task-metric alignment flag; Presence of cost in evaluation

self-report suitability: medium

Predictive Performance

RMSE/MAE/MAPE; Accuracy/sensitivity/specificity/AUC; Lift/decile values

self-report suitability: none
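
Lift measures how much better a model ranks records than random selection would. The Python sketch below computes top-decile lift from made-up scores and 0/1 outcomes; the function name and the data are assumptions for illustration.

```python
def top_decile_lift(scores, actual):
    """Response rate among the top 10% of records by score, divided by the overall rate."""
    overall_rate = sum(actual) / len(actual)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive records")
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    return top_rate / overall_rate

# Made-up holdout set: 20 records, 4 positives, higher score = more likely positive.
scores = list(range(20))
actual = [0] * 16 + [1] * 4

lift = top_decile_lift(scores, actual)
print(round(lift, 2))  # 5.0
```

A lift of 5 in the top decile means that targeting the highest-scored 10% of records finds positives at five times the base rate.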

Business Value / Decision Quality

Net profit / cost savings; Response-rate lift in deployment; Reduction in misclassification cost

self-report suitability: medium

The story

The reader

A business student, analyst, or manager who wants to use data to make better, more accurate decisions and to build predictive models that work in practice.

External problem

They face large, messy datasets and a bewildering array of methods, and need to build models that reliably predict or classify new records.

Internal problem

They feel overwhelmed by the complexity of algorithms and are unsure whether their models will actually work on new data or whether they are just fooling themselves.

Philosophical problem

It's wrong to deploy data mining as a 'solution in search of a problem' or to trust models that merely fit past data without proving they generalize.

The plan

  1. Define the business purpose and obtain the right data.
  2. Explore, clean, and reduce the dimension of the data.
  3. Partition data into training, validation, and test sets to guard against overfitting.
  4. Choose and run appropriate supervised or unsupervised methods.
  5. Evaluate performance honestly on holdout data using task-appropriate metrics.
  6. Select the best, parsimonious model and deploy it to score new records.

Success

  • The reader builds models that generalize to new data and genuinely inform decisions.
  • They can choose, compare, and tune the right method for the problem and data at hand.
  • They reliably detect and avoid overfitting and measure performance with the right metrics.
  • They translate raw, complex data into actionable business insight and competitive advantage.

At stake

  • Models overfit the training data and fail in deployment, eroding trust and wasting resources.
  • Misleading accuracy metrics conceal poor performance on the cases that matter most.
  • Analytics becomes an expensive solution searching for a problem, delivering no business value.