Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Foster Provost, Tom Fawcett · 2013
In a sentence
A conceptual guide that distills the fundamental principles underlying data science so that business people and aspiring data scientists can think data-analytically about extracting useful knowledge from data to improve business decisions.
Data Science for Business is a primer for understanding data science not as a grab-bag of algorithms but as a coherent set of fundamental principles that structure data-analytic thinking. Provost and Fawcett, both seasoned practitioners and researchers, argue that beneath the dizzying array of data mining techniques lies a relatively small set of unifying concepts: treating data as a strategic asset, framing problems with expected value, finding informative attributes, fitting models while controlling overfitting, and measuring similarity. The book is organized around the CRISP-DM data mining process and illustrated with real-world business cases, including customer churn, targeted marketing, fraud detection, charity solicitation, whiskey recommendation, and text mining of news. It teaches readers to decompose business problems into solvable data science tasks, to evaluate solutions in business terms, and to communicate across the technical/business divide. It equips managers to evaluate data science proposals and data scientists to align their work with business value, making both better at extracting competitive advantage from data.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A causal/path model expressing how organizational design levers (investment in data, data science talent, data-analytic management, sound process) and analytical practices (informative attribute selection, expected value framing, complexity control, proper evaluation) produce psychological/behavioral states (data-analytic thinking, model generalization) and ultimately business outcomes (decision quality, competitive advantage). Inferred from the book's recurring themes.
Investment in Data Assets (design lever)
The deliberate organizational decision to acquire, generate, and curate data (including incurring costs to obtain otherwise unavailable data) as a strategic asset rather than treating data only as a byproduct of operations.
Data Science Talent Quality (design lever)
The quality, depth, and breadth of the organization's data scientists and their professional networks, recognizing the large variance in data scientist ability and the importance of apprenticeship and connections.
Data-Analytic Management Capability (design lever)
The degree to which management understands fundamental data science principles, can ask probing questions, anticipates project outcomes, and creates a culture where data science thrives, bridging technical and business teams.
Adherence to Sound Data Mining Process (design lever)
The extent to which the organization follows a structured, iterative data mining process (CRISP-DM) with proper business understanding, data understanding, data preparation, modeling, evaluation, and deployment stages.
Informative Attribute Selection (design lever)
The practice of identifying and selecting descriptive attributes (variables/features) that reduce uncertainty about a target of interest, measured via information gain, entropy reduction, or variance reduction.
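As a rough illustration of the information gain measure mentioned above, the following minimal sketch computes entropy reduction for a categorical attribute. The toy churn data and the names `entropy`, `information_gain`, `contract`, and `region` are assumptions made up for this example, not material from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Parent entropy minus the size-weighted entropy of the child groups
    produced by splitting on each distinct attribute value."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    children = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - children

# Hypothetical toy data: eight customers, target = did the customer churn?
churned = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
contract = ["monthly"] * 3 + ["annual"] * 5          # separates the classes perfectly
region = ["east", "west", "east", "west", "east", "west", "east", "west"]

gain_contract = information_gain(contract, churned)
gain_region = information_gain(region, churned)
print(f"contract: {gain_contract:.3f} bits, region: {gain_region:.3f} bits")
```

In this made-up data, `contract` removes all uncertainty about the target, so its gain equals the full entropy of the target, while `region` removes very little.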
Expected Value Problem Framing (behavioral pattern)
The practice of structuring business problems using the expected value framework—decomposing them into probabilities (estimable from data) and values (from business knowledge) weighted across possible outcomes.
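The decomposition into probabilities and values can be sketched for a two-outcome targeting decision. This is a minimal sketch under stated assumptions: the function names and the dollar figures are hypothetical, chosen only to show how a decision threshold falls out of the expected value calculation.

```python
def expected_value(p_response, value_response, value_no_response):
    """EV = p(R) * v(R) + (1 - p(R)) * v(not R)."""
    return p_response * value_response + (1 - p_response) * value_no_response

def should_target(p_response, value_response, value_no_response):
    """Target a customer only when the expected value of doing so is positive."""
    return expected_value(p_response, value_response, value_no_response) > 0

# Hypothetical values supplied by business knowledge, not by the data:
PROFIT_IF_RESPONDS = 99.0     # assumed margin on a sale, net of the offer cost
COST_IF_NO_RESPONSE = -1.0    # assumed cost of an offer that is ignored

# Setting EV = 0 and solving for p gives the break-even response probability.
break_even = -COST_IF_NO_RESPONSE / (PROFIT_IF_RESPONDS - COST_IF_NO_RESPONSE)

# The probability would come from a model estimated on data; 0.05 is made up.
ev = expected_value(0.05, PROFIT_IF_RESPONDS, COST_IF_NO_RESPONSE)
print(f"EV at p=0.05: {ev:.2f}, break-even p: {break_even:.3f}")
```

The split mirrors the framing in the text: the probability is the part a model estimates, and the two values are the part the business has to supply.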
Model Complexity Control (design lever)
The deliberate management of model complexity (via tree pruning, feature selection, regularization, cross-validation) to find the trade-off between fitting data and generalizing, thereby avoiding overfitting.
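One way to see the trade-off is to compare training and holdout accuracy as complexity changes. The sketch below uses a k-nearest-neighbor classifier on synthetic one-dimensional data, a technique swapped in here because it needs only the standard library. It is not one of the methods listed above, and the data generator, noise level, and values of `k` are all assumptions. A smaller `k` means a more complex model.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` classified correctly using `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def make_data(n, rng, noise=0.2):
    """Synthetic data: label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)                      # fixed seed so runs are repeatable
train, holdout = make_data(200, rng), make_data(200, rng)

gaps = {}
for k in (1, 5, 25):                        # k=1 memorizes the training set
    train_acc = accuracy(train, train, k)
    holdout_acc = accuracy(train, holdout, k)
    gaps[k] = train_acc - holdout_acc       # the training-holdout gap
    print(f"k={k:>2}  train={train_acc:.2f}  holdout={holdout_acc:.2f}  gap={gaps[k]:.2f}")
```

With `k=1` every training point is its own nearest neighbor, so training accuracy is perfect by construction. That is exactly why training accuracy alone says nothing about generalization.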
Proper Evaluation Practice (design lever)
The use of evaluation methods aligned with the business goal—holdout testing, appropriate metrics (expected profit, ROC/AUC, lift), and meaningful baselines—rather than simplistic accuracy on training data.
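The metrics named above can be computed directly from holdout scores. The following is a small sketch, assuming hypothetical labels and scores for ten holdout customers. The function names are illustrative and the formulas are the standard rank-based definition of AUC and the usual top-fraction definition of lift.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative
    (ties count half). Equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def lift_at(scores, labels, fraction):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    top = ranked[:max(1, round(fraction * len(ranked)))]
    return (sum(top) / len(top)) / (sum(labels) / len(labels))

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

# Hypothetical holdout data: 2 positives among 10 instances (unbalanced classes).
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1, 0.05]

model_auc = auc(scores, labels)
model_lift = lift_at(scores, labels, 0.3)
baseline = majority_baseline_accuracy(labels)
print(f"AUC={model_auc:.4f}  lift@30%={model_lift:.2f}  majority baseline accuracy={baseline:.2f}")
```

On this made-up data the majority classifier already reaches 0.80 accuracy without identifying a single positive, which shows why plain accuracy is a weak yardstick when classes are unbalanced.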
Data-Analytic Thinking (psychological state)
The cognitive disposition and capability among managers and analysts to view business problems from a data perspective, assess whether and how data can improve performance, and reason systematically about analytics opportunities and threats.
Model Generalization Performance (behavioral pattern)
The degree to which a model's discovered patterns apply to previously unseen data drawn from the same population, as opposed to memorizing idiosyncrasies of the training data.
Decision Quality (outcome metric)
The improvement in business decision-making—accuracy, profitability, and effectiveness—achieved by basing decisions on data analysis rather than intuition alone, including at massive automated scale.
Competitive Advantage from Data Science (outcome metric)
The sustained business advantage a firm achieves when its data assets and data science capability are valuable, aligned with strategy, and difficult for competitors to replicate.
How they connect
- investment in data assets → influences model generalization
- data science talent → predicts competitive advantage
- data analytic management → influences data analytic thinking
- sound data mining process → influences model generalization
- informative attribute selection → influences model generalization
- complexity control → predicts model generalization
- complexity control → moderates model generalization
- proper evaluation practice → influences decision quality
- expected value framing → mediates decision quality
- data analytic thinking → predicts expected value framing
- data analytic thinking → predicts decision quality
- model generalization → influences decision quality
- decision quality → influences competitive advantage
- investment in data assets → influences competitive advantage
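The connections listed above form a small directed graph, and it can help to trace how a lever reaches an outcome. The sketch below encodes each bullet as a (source, relation, target) triple and enumerates simple paths. Treating every relation type, including "mediates" and "moderates", as a plain directed edge is a simplifying assumption, and the helper names are hypothetical.

```python
from collections import defaultdict

# One triple per bullet in the list above.
EDGES = [
    ("investment in data assets", "influences", "model generalization"),
    ("data science talent", "predicts", "competitive advantage"),
    ("data analytic management", "influences", "data analytic thinking"),
    ("sound data mining process", "influences", "model generalization"),
    ("informative attribute selection", "influences", "model generalization"),
    ("complexity control", "predicts", "model generalization"),
    ("complexity control", "moderates", "model generalization"),
    ("proper evaluation practice", "influences", "decision quality"),
    ("expected value framing", "mediates", "decision quality"),
    ("data analytic thinking", "predicts", "expected value framing"),
    ("data analytic thinking", "predicts", "decision quality"),
    ("model generalization", "influences", "decision quality"),
    ("decision quality", "influences", "competitive advantage"),
    ("investment in data assets", "influences", "competitive advantage"),
]

def build_graph(edges):
    """Adjacency sets keyed by source node; the relation label is ignored here."""
    graph = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
    return graph

def paths(graph, start, goal, prefix=()):
    """Yield every simple directed path from start to goal as a tuple of nodes."""
    prefix = prefix + (start,)
    if start == goal:
        yield prefix
        return
    for nxt in sorted(graph.get(start, ())):
        if nxt not in prefix:               # avoid revisiting a node
            yield from paths(graph, nxt, goal, prefix)

graph = build_graph(EDGES)
for path in paths(graph, "data analytic management", "competitive advantage"):
    print(" -> ".join(path))
```

Read this way, management capability reaches competitive advantage only indirectly, through data-analytic thinking and decision quality, whereas investment in data assets also has a direct edge.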
A candidate measure
Derived measurement candidates for each construct in the model:
Investment in Data Assets
Annual data acquisition/curation expenditure; Number of deliberate data-generating experiments; Count and uniqueness of integrated data sources; Ratio of data investment to total analytics budget
self-report suitability: medium
Data Science Talent Quality
Competition placements (KDD Cup, Netflix, Kaggle); Number of successful deployed projects; Apprenticeship lineage with recognized masters; Breadth of professional connections
self-report suitability: low
Data-Analytic Management Capability
Quality of probing questions (rated); Hit rate of selected projects; Maturity-model rating; Team perceptions of management support
self-report suitability: medium
Adherence to Sound Data Mining Process
Process maturity rating; Number of iteration cycles per project; Presence of documented business understanding artifacts; Use of pilot/prototype before full deployment
self-report suitability: medium
Informative Attribute Selection
Information gain (bits) per attribute; Variance reduction for numeric targets; Number of features retained vs. available; Accuracy improvement from selected feature set
self-report suitability: none
Expected Value Problem Framing
Presence/completeness of expected value decomposition (rated); Existence of cost-benefit matrix; Whether decision thresholds derived from EV; Decomposition of problem into subproblems
self-report suitability: medium
Model Complexity Control
Regularization parameter values used; Tree size / number of nodes relative to sweet spot; Training-holdout performance gap; Use of cross-validation for parameter selection
self-report suitability: none
Proper Evaluation Practice
Whether holdout/CV used (binary/rated); Metrics reported (expected profit, AUC, lift vs. only accuracy); Presence of cost-benefit incorporation; Baseline comparison (e.g., majority classifier)
self-report suitability: medium
Data-Analytic Thinking
Scored performance on proposal-critique tasks; Quality/depth of questions asked (rated); Accuracy in identifying proposal flaws; Use of frameworks (expected value, baselines) in reasoning
self-report suitability: medium
Model Generalization Performance
Holdout accuracy / AUC; Expected profit on test set; Lift over baseline; Cross-validation mean and variance
self-report suitability: none
Decision Quality
Profit per decision / expected profit realized; Productivity gain percentage; Reduction in churn/fraud losses; Return on assets/equity, market value
self-report suitability: low
Competitive Advantage from Data Science
Profitability relative to competitors (e.g., charge-off rates); Switching-cost indicators (retention, repeat usage); Uniqueness of data assets (qualitative + market valuation); Market share / valuation attributable to data
self-report suitability: low
The story
The reader
A business professional, manager, investor, or aspiring data scientist who wants to extract competitive advantage and better decisions from their organization's data.
External problem
They have vast amounts of data but lack a principled way to turn it into useful knowledge and better business decisions.
Internal problem
They feel intimidated by jargon and algorithms, unsure whether a proposed data science effort is sound or whether they are being misled.
Philosophical problem
In a data-rich world, it is simply wrong to make decisions on intuition alone or to treat data science as opaque magic when its principles can be understood and harnessed.
The plan
- Learn the small set of fundamental concepts that underlie data science.
- Adopt data-analytic thinking and the CRISP-DM process to structure problems.
- Decompose business problems into known data mining tasks using the expected value framework.
- Evaluate models in business terms, guarding against overfitting and misleading metrics.
- Build, nurture, and manage data science capability as a strategic asset.
Success
- The reader confidently frames business problems data-analytically and decomposes them into solvable tasks.
- The reader can evaluate data science proposals, spot flaws, and ask probing questions.
- The reader's organization invests wisely in data and data scientists, gaining and sustaining competitive advantage.
- Business and technical teams communicate with a shared vocabulary and deeper mutual understanding.
At stake
- The reader is bamboozled by jargon and makes wrong decisions or wastes resources on flawed projects.
- The organization deploys models that overfit, fail to generalize, or are misaligned with the business goal.
- Competitors who think data-analytically gain a strategic advantage that becomes hard to overcome.
- Data assets and talent are squandered for lack of understanding and proper management.