Data Smart: Using Data Science to Transform Information into Insight
John W. Foreman · 2013
In a sentence
A hands-on guide that teaches the core algorithms of data science from scratch using spreadsheets (and finally R), so business people can understand, prototype, and deploy these techniques without first buying tools or hiring consultants.
Data Smart strips away the hype, tools, and code that usually obscure data science and teaches the actual techniques—clustering, naive Bayes, optimization, network graphs, regression, ensemble models, forecasting, and outlier detection—by building each one by hand in a spreadsheet. Written conversationally by MailChimp's Chief Data Scientist, the book targets marketers, analysts, and executives who feel pressure to 'do data science' but don't understand what these techniques are or how to choose the right one for a problem. By the time readers finish, they can identify data science opportunities in their own organizations, prototype solutions, correctly evaluate vendors and developers, and graduate into a programming language like R to scale their work. The book's philosophy is that understanding beats button-pushing: once you've implemented an algorithm from the barest of tools, you can implement it anywhere.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
An inferred causal model describing how learning to build data science techniques by hand (design lever), combined with proper data preparation and problem framing, develops practitioner understanding and confidence (psychological states), which drives correct technique selection and hands-on prototyping (behavioral patterns), ultimately producing valid, useful models and business value—moderated by communication skill and the avoidance of tool/complexity/performance obsession.
Hands-On Technique Building in Spreadsheets (design lever)
The deliberate practice of implementing each data science algorithm from scratch in a transparent, vanilla tool (a spreadsheet) so that every transformation from input to output is visible and touchable, rather than hidden behind tools or code.
Data Preparation and Standardization Quality (design lever)
The degree to which raw data is cleaned, dummy-coded, standardized to comparable scales, balanced for class imbalance, and shaped (e.g., into matrices or distance/affinity matrices) so that it is fit for the chosen analytic technique.
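Two of the preparation steps named above, standardizing to a comparable scale and dummy coding, can be sketched in a few lines. This is a minimal illustration, not the book's spreadsheet workflow: the function names `standardize` and `dummy_code` and the sample columns are hypothetical, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; a sample SD is also common
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def dummy_code(categories):
    """Turn a categorical column into one 0/1 indicator column per category."""
    levels = sorted(set(categories))
    return {level: [1 if c == level else 0 for c in categories] for level in levels}


# Hypothetical example data, not taken from the book
ages = [20, 30, 40, 50]
channels = ["email", "web", "email", "store"]

z_ages = standardize(ages)      # now centered on 0 with SD 1
dummies = dummy_code(channels)  # {"email": [...], "store": [...], "web": [...]}
```

After standardization, columns measured in different units contribute on a comparable scale to distance-based techniques such as clustering.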
Correct Problem Framing (contextual condition)
The practice of engaging with the business context to understand what problem actually needs solving, rather than passively accepting a poorly posed problem thrown over the fence, ensuring the analytic effort targets the true objective.
Practitioner Understanding of Techniques (psychological state)
The internalized, mechanistic comprehension of how each data science technique works, what it requires, and what it produces, gained from building it rather than merely pushing buttons—enabling the practitioner to know what is going on behind the scenes.
Practitioner Confidence and Reduced Data Anxiety (psychological state)
The emotional shift from data science anxiety and intimidation toward excitement and self-assurance in approaching data problems, resulting from successfully implementing techniques by hand.
Appropriate Technique Selection (behavioral pattern)
The behavior of matching the right data science technique (e.g., optimization vs. AI, k-means vs. modularity, naive Bayes vs. ensemble) and the right distance measure or threshold to the specific business problem at hand.
Prototyping and Iteration Behavior (behavioral pattern)
The behavior of rapidly building, testing, and iterating on lightweight models (e.g., trying multiple k values, comparing models with ROC curves) to explore data and refine solutions before committing to production.
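Comparing candidate models with ROC curves often comes down to comparing the area under each curve (AUC). The sketch below uses the pairwise interpretation of AUC: the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counted as half. The function name `auc`, the labels, and both sets of model scores are hypothetical illustration data, not results from the book.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the share of positive/negative
    pairs in which the positive case receives the higher score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical holdout labels and scores from two prototype models
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
model_b = [0.6, 0.4, 0.3, 0.7, 0.5, 0.2]

auc_a = auc(labels, model_a)
auc_b = auc(labels, model_b)
better = "A" if auc_a > auc_b else "B"
```

On this toy data, model A ranks positives above negatives more consistently, so it would be the prototype to carry forward.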
Model Validity and Performance (outcome metric)
The degree to which a built model fits the data well, is statistically significant, generalizes to a holdout set, and performs at an acceptable balance of precision, recall, and other metrics for the business use case.
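The balance between precision and recall depends on the score threshold chosen for the business use case. As a rough sketch, the metrics can be computed from confusion counts at a given threshold. The function name `classification_metrics`, the data, and both thresholds are hypothetical, and returning `None` for an undefined ratio is an assumption of this example.

```python
def classification_metrics(labels, scores, threshold):
    """Confusion counts at a threshold, plus the metrics derived from them."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else None  # None when the metric is undefined

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical holdout labels and model scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

strict = classification_metrics(labels, scores, threshold=0.6)
lenient = classification_metrics(labels, scores, threshold=0.35)
```

Here the strict threshold yields perfect precision but misses one positive, while the lenient threshold catches every positive at the cost of one false alarm. Which trade-off is acceptable depends on the business use case.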
Business Value Created (outcome metric)
The ultimate organizational benefit—better targeting, forecasting, pricing, decisions, cost savings, revenue, and value—that results when valid models are correctly applied to the right problems and adopted by the business.
Communication and Translation Skill (contextual condition)
The practitioner's ability to translate between math, code, and plain business language—understanding others' challenges, articulating what is possible, and explaining the work—so that analytics gets embedded and adopted.
Tool/Complexity/Performance Obsession (contextual condition)
The self-sabotaging tendency to fixate on overly complex modeling, tool acquisition, or computational performance at the expense of practical usefulness and maintainability—the 'three-headed geek-monster' that derails analytics adoption.
How they connect
- hands on technique building → predicts practitioner understanding
- hands on technique building → predicts practitioner confidence
- practitioner understanding → predicts technique selection
- practitioner confidence → predicts prototyping behavior
- data preparation quality → predicts model validity
- technique selection → predicts model validity
- prototyping behavior → influences model validity
- correct problem framing → predicts business value created
- model validity → predicts business value created
- communication skill → moderates business value created
- geek monster obsession → negatively moderates business value created
A candidate measure
Data Smart: Using Data Science to Transform Information into Insight — derived measurement candidates
Hands-On Technique Building in Spreadsheets
number of techniques implemented; proportion of exercises reproduced from scratch; workbook completion rate
self-report suitability: high
Data Preparation and Standardization Quality
checklist score of preparation steps present; post-standardization mean/SD checks; missing-value rate after imputation
self-report suitability: medium
Correct Problem Framing
alignment rating between solved and true problem; count of stakeholder engagement touchpoints
self-report suitability: medium
Practitioner Understanding of Techniques
knowledge test score; rubric-rated explanation quality
self-report suitability: medium
Practitioner Confidence and Reduced Data Anxiety
self-efficacy attitude rating; anxiety attitude rating; task volunteering frequency
self-report suitability: high
Appropriate Technique Selection
appropriateness rating per decision; match score to problem characteristics
self-report suitability: medium
Prototyping and Iteration Behavior
number of variants tried; number of parameter settings explored; number of comparative evaluations
self-report suitability: high
Model Validity and Performance
R-squared; F/t test p-values; AUC; precision; recall; specificity; false positive rate; standard error; prediction interval coverage
self-report suitability: low
Business Value Created
revenue lift; cost reduction; conversion improvement; model usage/adoption rate
self-report suitability: low
Communication and Translation Skill
peer/manager clarity ratings; planning participation counts; stakeholder satisfaction
self-report suitability: medium
Tool/Complexity/Performance Obsession
post-mortem severity rating; model maintainability score; tool-before-problem incidence; non-adoption due to complexity
self-report suitability: low
The story
The reader
A business professional (marketing VP, analyst, CEO, or online marketer) who wants to use their transactional data strategically to make better decisions but doesn't understand the data science approaches being recommended to them.
External problem
They have valuable data but lack the knowledge to extract insight from it or to evaluate the tools, consultants, and techniques being pitched to them.
Internal problem
They feel anxiety, intimidation, and a fear of being left behind by competitors who are 'doing data science.'
Philosophical problem
It's just plain wrong that data science is gatekept behind hype, jargon, expensive tools, and code, when the underlying techniques are learnable and useful to anyone willing to engage.
The plan
- Shore up your spreadsheet fundamentals so you can follow along comfortably.
- Learn each technique by building it by hand on real example data in Excel.
- Understand how to prepare and standardize your data for analysis.
- Evaluate your models with statistical tests and performance metrics.
- Choose the right technique and threshold for your specific business problem.
- Graduate into R to scale and productionize what you've prototyped.
Success
- You can identify data science opportunities within your own organization.
- You can prototype solutions, correctly buy data science products, and delegate the right approaches to developers.
- Your data anxiety is replaced with excitement and ideas for taking your business to the next level.
- You gain a leg up on competitors who are wasting money on tools before knowing what they want.
At stake
- You lose out to competitors who understand and apply these techniques.
- You waste money buying tools and hiring consultants before knowing what you actually need.
- You remain unable to tell AI from BI from BS, vulnerable to slick pitches.
- Your valuable transactional data keeps going to waste, read and saved but never mined for insight.