Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking

Foster Provost, Tom Fawcett · 2013

In a sentence

A conceptual guide that distills the fundamental principles underlying data science so that business people and aspiring data scientists can think data-analytically about extracting useful knowledge from data to improve business decisions.

Data Science for Business is the definitive primer for understanding data science not as a grab-bag of algorithms but as a coherent set of fundamental principles that structure data-analytic thinking. Provost and Fawcett, both seasoned practitioners and researchers, argue that beneath the dizzying array of data mining techniques lies a relatively small set of concepts (treating data as a strategic asset, framing problems with expected value, finding informative attributes, fitting models while controlling overfitting, measuring similarity) that unify the field. Organized around the CRISP-DM data mining process and richly illustrated with real-world business cases (customer churn, targeted marketing, fraud detection, charity solicitation, whiskey recommendation, text mining of news), the book teaches readers to decompose business problems into solvable data science tasks, to evaluate solutions in business terms, and to communicate across the technical/business divide. It is the rare book that equips managers to evaluate data science proposals and equips data scientists to align their work with business value, making both better at extracting competitive advantage from data.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

The model

A causal/path model expressing how organizational design levers (investment in data, data science talent, data-analytic management, sound process) and analytical practices (informative attribute selection, expected value framing, complexity control, proper evaluation) produce psychological/behavioral states (data-analytic thinking, model generalization) and ultimately business outcomes (decision quality, competitive advantage). Inferred from the book's recurring themes.

Investment in Data Assets (design lever)

The deliberate organizational decision to acquire, generate, and curate data (including incurring costs to obtain otherwise unavailable data) as a strategic asset rather than treating data only as a byproduct of operations.

Data Science Talent Quality (design lever)

The quality, depth, and breadth of the organization's data scientists and their professional networks, recognizing the large variance in data scientist ability and the importance of apprenticeship and connections.

Data-Analytic Management Capability (design lever)

The degree to which management understands fundamental data science principles, can ask probing questions, anticipates project outcomes, and creates a culture where data science thrives, bridging technical and business teams.

Adherence to Sound Data Mining Process (design lever)

The extent to which the organization follows a structured, iterative data mining process (CRISP-DM) with proper business understanding, data preparation, modeling, evaluation, and deployment stages.

Informative Attribute Selection (design lever)

The practice of identifying and selecting descriptive attributes (variables/features) that reduce uncertainty about a target of interest, measured via information gain, entropy reduction, or variance reduction.
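
As a rough illustration of the definition above, information gain can be computed as the entropy of the target minus the weighted entropy of the groups produced by splitting on an attribute. This is a minimal sketch; the customer records and the attribute names `plan`, `region`, and `churned` are hypothetical and not taken from the book.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical toy data: "plan" separates churners perfectly, "region" not at all.
customers = [
    {"plan": "prepaid", "region": "north", "churned": "yes"},
    {"plan": "prepaid", "region": "south", "churned": "yes"},
    {"plan": "contract", "region": "north", "churned": "no"},
    {"plan": "contract", "region": "south", "churned": "no"},
]

gain_plan = information_gain(customers, "plan", "churned")      # 1 bit
gain_region = information_gain(customers, "region", "churned")  # 0 bits
```

In this toy case an analyst would rank `plan` above `region`, since knowing the plan removes all uncertainty about churn while knowing the region removes none.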

Expected Value Problem Framing (behavioral pattern)

The practice of structuring business problems using the expected value framework—decomposing them into probabilities (estimable from data) and values (from business knowledge) weighted across possible outcomes.
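
A small sketch may make the decomposition concrete. For a targeted-marketing decision, the expected value of contacting a customer is the response probability times the value of a response, plus the non-response probability times the value of a non-response. The dollar figures and the function names below are assumptions chosen for illustration.

```python
def expected_value(p_response, value_response, value_no_response):
    """Probability-weighted value across the two possible outcomes."""
    return p_response * value_response + (1 - p_response) * value_no_response


def should_target(p_response, value_response, value_no_response):
    """Target the customer only when the expected value of doing so is positive."""
    return expected_value(p_response, value_response, value_no_response) > 0


# Hypothetical values: $99 profit if the customer responds, $1 lost if not.
VALUE_RESPONSE = 99.0
VALUE_NO_RESPONSE = -1.0

ev_likely = expected_value(0.02, VALUE_RESPONSE, VALUE_NO_RESPONSE)     # roughly +1.00
ev_unlikely = expected_value(0.005, VALUE_RESPONSE, VALUE_NO_RESPONSE)  # roughly -0.50
```

The probabilities would come from a model fitted to data, while the two values come from business knowledge. With these assumed values the break-even response probability is 1%.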

Model Complexity Control (design lever)

The deliberate management of model complexity (via tree pruning, feature selection, regularization, cross-validation) to find the trade-off between fitting data and generalizing, thereby avoiding overfitting.
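
One way to picture the trade-off is regularization strength chosen on held-out data. The sketch below assumes a one-dimensional ridge regression with no intercept, where the penalized least-squares slope has the closed form `sum(x*y) / (sum(x*x) + lam)`. The data points and candidate penalties are hypothetical, and the example is an illustration rather than a procedure taken from the book.

```python
def fit_ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - w*x)^2) + lam * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the line y = w*x on the given points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_lambda(train, holdout, candidates):
    """Pick the penalty whose fitted slope has the lowest holdout error."""
    return min(
        candidates,
        key=lambda lam: mse(fit_ridge_slope(train[0], train[1], lam), *holdout),
    )


# Hypothetical noisy training data around a true slope of 2, plus clean holdout points.
train = ([1.0, 2.0, 3.0], [2.5, 3.5, 7.0])
holdout = ([4.0, 5.0], [8.0, 10.0])
candidates = [0.0, 0.1, 1.0, 10.0]

best_lam = select_lambda(train, holdout, candidates)
slopes = [fit_ridge_slope(train[0], train[1], lam) for lam in candidates]
```

Here the unpenalized fit chases the noise and overshoots, the heaviest penalty underfits, and the holdout error selects the middle value `1.0`.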

Proper Evaluation Practice (design lever)

The use of evaluation methods aligned with the business goal—holdout testing, appropriate metrics (expected profit, ROC/AUC, lift), and meaningful baselines—rather than simplistic accuracy on training data.
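
Two of the metrics named above can be sketched in a few lines. AUC is computed here as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as half. Lift is the positive rate among the top-scored fraction divided by the overall positive rate. The scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(scores, labels, fraction):
    """Positive rate in the top-scored fraction, relative to the base rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical model scores for eight customers; 1 marks an actual responder.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

model_auc = auc(scores, labels)           # 14 of 15 pairs ranked correctly
top_lift = lift_at(scores, labels, 0.25)  # top 2 are both responders
```

Neither number depends on a single classification threshold, which is one reason such metrics can be more informative than plain accuracy when classes are unbalanced.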

Data-Analytic Thinking (psychological state)

The cognitive disposition and capability among managers and analysts to view business problems from a data perspective, assess whether and how data can improve performance, and reason systematically about analytics opportunities and threats.

Model Generalization Performance (behavioral pattern)

The degree to which a model's discovered patterns apply to previously unseen data drawn from the same population, as opposed to memorizing idiosyncrasies of the training data.
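
The contrast between memorizing and generalizing can be shown with a deliberately naive lookup-table model: it stores every training row, so it is perfect on the training data, yet it can only fall back on the majority class for rows it has never seen. The class name `TableModel` and the toy rows are assumptions made for this sketch.

```python
from collections import Counter


class TableModel:
    """Memorizes every training row; unseen rows get the majority class."""

    def fit(self, rows, labels):
        self.table = dict(zip(rows, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.table.get(row, self.default)


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model.predict(r) == y for r, y in zip(rows, labels)) / len(labels)


# Hypothetical customers described by (segment, months_active).
train_rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
train_labels = ["no", "yes", "no", "no"]
holdout_rows = [("e", 5), ("f", 6)]
holdout_labels = ["yes", "no"]

model = TableModel().fit(train_rows, train_labels)
train_acc = accuracy(model, train_rows, train_labels)        # 1.0, pure memorization
holdout_acc = accuracy(model, holdout_rows, holdout_labels)  # 0.5 on unseen rows
gap = train_acc - holdout_acc
```

The gap between training and holdout accuracy, not the training accuracy itself, is the signal that the model has learned idiosyncrasies rather than patterns.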

Decision Quality (outcome metric)

The improvement in business decision-making—accuracy, profitability, and effectiveness—achieved by basing decisions on data analysis rather than intuition alone, including at massive automated scale.

Competitive Advantage from Data Science (outcome metric)

The sustained business advantage a firm achieves when its data assets and data science capability are valuable, aligned with strategy, and difficult for competitors to replicate.

How they connect

  • investment in data assets influences model generalization
  • data science talent predicts competitive advantage
  • data analytic management influences data analytic thinking
  • sound data mining process influences model generalization
  • informative attribute selection influences model generalization
  • complexity control predicts model generalization
  • complexity control moderates model generalization
  • proper evaluation practice influences decision quality
  • expected value framing mediates the effect of data analytic thinking on decision quality
  • data analytic thinking predicts expected value framing
  • data analytic thinking predicts decision quality
  • model generalization influences decision quality
  • decision quality influences competitive advantage
  • investment in data assets influences competitive advantage
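
The connections listed above can be read as a directed graph, which makes it easy to trace how a lever reaches an outcome. The sketch below is one possible encoding: it keeps the construct names as they appear in the list and collapses the different relation verbs (influences, predicts, mediates, moderates) into plain directed edges, which is a simplifying assumption.

```python
# Directed edges taken from the list above; relation types are collapsed.
EDGES = {
    "investment in data assets": ["model generalization", "competitive advantage"],
    "data science talent": ["competitive advantage"],
    "data analytic management": ["data analytic thinking"],
    "sound data mining process": ["model generalization"],
    "informative attribute selection": ["model generalization"],
    "complexity control": ["model generalization"],
    "proper evaluation practice": ["decision quality"],
    "expected value framing": ["decision quality"],
    "data analytic thinking": ["expected value framing", "decision quality"],
    "model generalization": ["decision quality"],
    "decision quality": ["competitive advantage"],
}


def find_paths(graph, source, target, path=None):
    """Depth-first search returning every cycle-free path from source to target."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found = []
    for nxt in graph.get(source, []):
        if nxt not in path:
            found.extend(find_paths(graph, nxt, target, path))
    return found


routes = find_paths(EDGES, "data analytic management", "competitive advantage")
for route in routes:
    print(" -> ".join(route))
```

Under this encoding, management capability reaches competitive advantage only through data-analytic thinking and decision quality, either directly or by way of expected value framing.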

A candidate measure

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking — derived measurement candidates

Investment in Data Assets

Annual data acquisition/curation expenditure; Number of deliberate data-generating experiments; Count and uniqueness of integrated data sources; Ratio of data investment to total analytics budget

self-report suitability: medium

Data Science Talent Quality

Competition placements (KDD Cup, Netflix, Kaggle); Number of successful deployed projects; Apprenticeship lineage with recognized masters; Breadth of professional connections

self-report suitability: low

Data-Analytic Management Capability

Quality of probing questions (rated); Hit rate of selected projects; Maturity-model rating; Team perceptions of management support

self-report suitability: medium

Adherence to Sound Data Mining Process

Process maturity rating; Number of iteration cycles per project; Presence of documented business understanding artifacts; Use of pilot/prototype before full deployment

self-report suitability: medium

Informative Attribute Selection

Information gain (bits) per attribute; Variance reduction for numeric targets; Number of features retained vs. available; Accuracy improvement from selected feature set

self-report suitability: none

Expected Value Problem Framing

Presence/completeness of expected value decomposition (rated); Existence of cost-benefit matrix; Whether decision thresholds derived from EV; Decomposition of problem into subproblems

self-report suitability: medium
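
To show what a cost-benefit matrix contributes, the sketch below combines one with a classifier's confusion matrix: each cell's rate is multiplied by the value of that outcome and the products are summed into an expected profit per customer. The counts and dollar values are hypothetical.

```python
def expected_profit(confusion, cost_benefit):
    """Sum over outcome cells of (cell rate) * (value of that outcome)."""
    total = sum(confusion.values())
    return sum(confusion[cell] / total * cost_benefit[cell] for cell in confusion)


# Hypothetical confusion matrix for a targeting model on 1,000 customers.
confusion = {"TP": 30, "FP": 70, "FN": 20, "TN": 880}

# Hypothetical values: a targeted responder earns $99, a targeted
# non-responder costs $1, and customers who are not targeted yield nothing.
cost_benefit = {"TP": 99.0, "FP": -1.0, "FN": 0.0, "TN": 0.0}

profit_per_customer = expected_profit(confusion, cost_benefit)  # about 2.90
```

An assessor checking for this measure would look for both ingredients: outcome rates estimated from data and outcome values supplied by the business.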

Model Complexity Control

Regularization parameter values used; Tree size / number of nodes relative to sweet spot; Training-holdout performance gap; Use of cross-validation for parameter selection

self-report suitability: none

Proper Evaluation Practice

Whether holdout/CV used (binary/rated); Metrics reported (expected profit, AUC, lift vs. only accuracy); Presence of cost-benefit incorporation; Baseline comparison (e.g., majority classifier)

self-report suitability: medium
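
The baseline comparison mentioned above can be as simple as a majority classifier, which always predicts the most common class seen in training. This sketch uses hypothetical churn labels, and the model accuracy it is compared against is an assumed figure.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


# Hypothetical labels: 10% of customers churn in both samples.
train_labels = ["stay"] * 90 + ["churn"] * 10
test_labels = ["stay"] * 18 + ["churn"] * 2

baseline = majority_baseline_accuracy(train_labels, test_labels)  # 0.9
model_accuracy = 0.91  # assumed accuracy of some candidate model
improvement = model_accuracy - baseline
```

Against this baseline, a reported accuracy of 91% is only about one point better than doing nothing, which is why an evaluation without a baseline is hard to interpret.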

Data-Analytic Thinking

Scored performance on proposal-critique tasks; Quality/depth of questions asked (rated); Accuracy in identifying proposal flaws; Use of frameworks (expected value, baselines) in reasoning

self-report suitability: medium

Model Generalization Performance

Holdout accuracy / AUC; Expected profit on test set; Lift over baseline; Cross-validation mean and variance

self-report suitability: none
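
The cross-validation mean and variance listed above can be sketched with the standard library alone. The example assumes an interleaved k-fold split and uses a trivial majority-class model so that the mechanics stay visible. The labels are hypothetical, and `statistics.variance` returns the sample variance.

```python
from collections import Counter
from statistics import mean, variance


def cross_validate(rows, k, fit, score):
    """Hold out each of k interleaved folds in turn; return (mean, variance) of scores."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(score(fit(train), test))
    return mean(scores), variance(scores)


def fit_majority(labels):
    """The 'model' is just the most common label in the training folds."""
    return Counter(labels).most_common(1)[0][0]


def score_accuracy(predicted_label, labels):
    """Accuracy of predicting the same label for every held-out row."""
    return sum(y == predicted_label for y in labels) / len(labels)


# Hypothetical labels; both "yes" cases land in the same fold when k=5.
labels = ["no", "no", "no", "no", "yes", "no", "no", "no", "no", "yes"]

cv_mean, cv_variance = cross_validate(labels, 5, fit_majority, score_accuracy)
```

Four folds score 1.0 and one scores 0.0, giving a mean of 0.8 and a sample variance of 0.2. A high variance across folds is a warning that a single holdout estimate could be misleading.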

Decision Quality

Profit per decision / expected profit realized; Productivity gain percentage; Reduction in churn/fraud losses; Return on assets/equity, market value

self-report suitability: low

Competitive Advantage from Data Science

Profitability relative to competitors (e.g., charge-off rates); Switching-cost indicators (retention, repeat usage); Uniqueness of data assets (qualitative + market valuation); Market share / valuation attributable to data

self-report suitability: low

The story

The reader

A business professional, manager, investor, or aspiring data scientist who wants to extract competitive advantage and better decisions from their organization's data.

External problem

They have vast amounts of data but lack a principled way to turn it into useful knowledge and better business decisions.

Internal problem

They feel intimidated by jargon and algorithms, unsure whether a proposed data science effort is sound or whether they are being misled.

Philosophical problem

In a data-rich world, it is simply wrong to make decisions on intuition alone or to treat data science as opaque magic when its principles can be understood and harnessed.

The plan

  1. Learn the small set of fundamental concepts that underlie data science.
  2. Adopt data-analytic thinking and the CRISP-DM process to structure problems.
  3. Decompose business problems into known data mining tasks using the expected value framework.
  4. Evaluate models in business terms, guarding against overfitting and misleading metrics.
  5. Build, nurture, and manage data science capability as a strategic asset.

Success

  • The reader confidently frames business problems data-analytically and decomposes them into solvable tasks.
  • The reader can evaluate data science proposals, spot flaws, and ask probing questions.
  • The reader's organization invests wisely in data and data scientists, gaining and sustaining competitive advantage.
  • Business and technical teams communicate with a shared vocabulary and deeper mutual understanding.

At stake

  • The reader is bamboozled by jargon and makes wrong decisions or wastes resources on flawed projects.
  • The organization deploys models that overfit, fail to generalize, or are misaligned with the business goal.
  • Competitors who think data-analytically gain a strategic advantage that becomes hard to overcome.
  • Data assets and talent are squandered for lack of understanding and proper management.
