Data Smart: Using Data Science to Transform Information into Insight

John W. Foreman · 2013

In a sentence

A hands-on guide that teaches the core algorithms of data science from scratch using spreadsheets (and finally R), so business people can understand, prototype, and deploy these techniques without first buying tools or hiring consultants.

Data Smart strips away the hype, tools, and code that usually obscure data science and teaches the actual techniques—clustering, naive Bayes, optimization, network graphs, regression, ensemble models, forecasting, and outlier detection—by building each one by hand in a spreadsheet. Written conversationally by MailChimp's Chief Data Scientist, the book targets marketers, analysts, and executives who feel pressure to 'do data science' but don't understand what these techniques are or how to choose the right one for a problem. By the time readers finish, they can identify data science opportunities in their own organizations, prototype solutions, correctly evaluate vendors and developers, and graduate into a programming language like R to scale their work. The book's philosophy is that understanding beats button-pushing: once you've implemented an algorithm from the barest of tools, you can implement it anywhere.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

The model

An inferred causal model: learning to build data science techniques by hand (the design lever), combined with proper data preparation and problem framing, develops practitioner understanding and confidence (psychological states). These drive correct technique selection and hands-on prototyping (behavioral patterns), which ultimately produce valid, useful models and business value. Communication skill and the avoidance of tool, complexity, and performance obsession moderate that final payoff.

Hands-On Technique Building in Spreadsheets (design lever)

The deliberate practice of implementing each data science algorithm from scratch in a transparent, vanilla tool (a spreadsheet) so that every transformation from input to output is visible and touchable, rather than hidden behind tools or code.

Data Preparation and Standardization Quality (design lever)

The degree to which raw data is cleaned, dummy-coded, standardized to comparable scales, balanced for class imbalance, and shaped (e.g., into matrices or distance/affinity matrices) so that it is fit for the chosen analytic technique.
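As a rough illustration of these preparation steps outside a spreadsheet, the sketch below dummy-codes a categorical column, standardizes two numeric columns to z-scores, and builds a pairwise Euclidean distance matrix. The column names and values (`spend`, `visits`, `channel`) are hypothetical, and the use of the population standard deviation is an assumption; the sample standard deviation is an equally reasonable choice.

```python
import math
import statistics

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # assumption: population SD, not sample SD
    return [(x - mean) / sd for x in column]

def dummy_code(values):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def distance_matrix(rows):
    """Pairwise Euclidean distances between rows of equal length."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical customer data, invented for illustration only.
spend = [10.0, 20.0, 30.0]
visits = [1.0, 2.0, 3.0]
channel = ["email", "web", "email"]

z_spend = standardize(spend)
z_visits = standardize(visits)
dummies = dummy_code(channel)

# After standardization both columns contribute on a comparable scale.
rows = list(zip(z_spend, z_visits))
dist = distance_matrix(rows)
```

Standardizing before computing distances keeps a large-valued column such as spend from dominating a small-valued one such as visits.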

Correct Problem Framing (contextual condition)

The practice of engaging with the business context to understand what problem actually needs solving, rather than passively accepting a poorly posed problem thrown over the fence, ensuring the analytic effort targets the true objective.

Practitioner Understanding of Techniques (psychological state)

The internalized, mechanistic comprehension of how each data science technique works, what it requires, and what it produces, gained from building it rather than merely pushing buttons—enabling the practitioner to know what is going on behind the scenes.

Practitioner Confidence and Reduced Data Anxiety (psychological state)

The emotional shift from data science anxiety and intimidation toward excitement and self-assurance in approaching data problems, resulting from successfully implementing techniques by hand.

Appropriate Technique Selection (behavioral pattern)

The behavior of matching the right data science technique (e.g., optimization vs. AI, k-means vs. modularity, naive Bayes vs. ensemble) and the right distance measure or threshold to the specific business problem at hand.

Prototyping and Iteration Behavior (behavioral pattern)

The behavior of rapidly building, testing, and iterating on lightweight models (e.g., trying multiple k values, comparing models with ROC curves) to explore data and refine solutions before committing to production.
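To make "trying multiple k values" concrete, here is a minimal sketch that runs k-means for several values of k and records the total squared distance of each point to its nearest center. It uses the standard Lloyd iteration with a deterministic start, which is an assumption for illustration and not the spreadsheet procedure from the book; the six points are invented.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centers (deterministic)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster mean.
        # An empty cluster leaves its center where it was.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    # Total squared distance from each point to its nearest center.
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse

# Hypothetical data: two visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

# Prototype several k values and compare the fit of each.
results = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
for k, sse in results.items():
    print(k, round(sse, 3))
```

The error generally falls as k grows, so the useful signal is where the improvement flattens out rather than the lowest number itself.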

Model Validity and Performance (outcome metric)

The degree to which a built model fits the data well, is statistically significant, generalizes to a holdout set, and performs at an acceptable balance of precision, recall, and other metrics for the business use case.
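The balance between precision, recall, and related metrics can be read straight off a confusion matrix computed on the holdout set. The sketch below assumes binary labels coded 1 (positive) and 0 (negative); the holdout labels and predictions are made up for illustration.

```python
def ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(actual, predicted):
    """Confusion-matrix metrics for binary labels coded 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }

# Hypothetical holdout labels and model predictions.
holdout_actual = [1, 1, 1, 0, 0, 0, 0, 1]
holdout_predicted = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = classification_metrics(holdout_actual, holdout_predicted)
print(metrics)
```

Which of these numbers matters most depends on the business use case: a false positive and a false negative rarely cost the same.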

Business Value Created (outcome metric)

The ultimate organizational benefit—better targeting, forecasting, pricing, decisions, cost savings, revenue, and value—that results when valid models are correctly applied to the right problems and adopted by the business.

Communication and Translation Skill (contextual condition)

The practitioner's ability to translate between math, code, and plain business language—understanding others' challenges, articulating what is possible, and explaining the work—so that analytics gets embedded and adopted.

Tool/Complexity/Performance Obsession (contextual condition)

The self-sabotaging tendency to fixate on overly complex modeling, tool acquisition, or computational performance at the expense of practical usefulness and maintainability—the 'three-headed geek-monster' that derails analytics adoption.

How they connect

  • hands-on technique building predicts practitioner understanding
  • hands-on technique building predicts practitioner confidence
  • practitioner understanding predicts technique selection
  • practitioner confidence predicts prototyping behavior
  • data preparation quality predicts model validity
  • technique selection predicts model validity
  • prototyping behavior influences model validity
  • correct problem framing predicts business value created
  • model validity predicts business value created
  • communication skill moderates business value created
  • geek monster obsession moderates business value created
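The connections listed above form a small directed graph, and one way to inspect it is to encode the "predicts" and "influences" links as edges and enumerate the chains that lead from the design lever to business value. This is only a sketch of the listed structure: the two moderators are kept in a separate mapping because they condition an outcome instead of feeding a chain, and the `paths` helper is a hypothetical name.

```python
# Directed links taken from the list above ("predicts" / "influences").
edges = [
    ("hands-on technique building", "practitioner understanding"),
    ("hands-on technique building", "practitioner confidence"),
    ("practitioner understanding", "technique selection"),
    ("practitioner confidence", "prototyping behavior"),
    ("data preparation quality", "model validity"),
    ("technique selection", "model validity"),
    ("prototyping behavior", "model validity"),
    ("correct problem framing", "business value created"),
    ("model validity", "business value created"),
]

# Moderators are stored apart from the causal chains.
moderators = {
    "business value created": ["communication skill", "geek monster obsession"],
}

# Build an adjacency list: node -> list of downstream nodes.
graph = {}
for source, target in edges:
    graph.setdefault(source, []).append(target)

def paths(graph, start, goal, trail=()):
    """Enumerate every directed path from start to goal (graph is acyclic)."""
    trail = trail + (start,)
    if start == goal:
        return [trail]
    found = []
    for nxt in graph.get(start, []):
        found.extend(paths(graph, nxt, goal, trail))
    return found

all_paths = paths(graph, "hands-on technique building", "business value created")
for p in all_paths:
    print(" -> ".join(p))
```

Run as written, this traces two chains from hands-on building to business value, one through understanding and technique selection and one through confidence and prototyping, and both pass through model validity.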

A candidate measure

Data Smart: Using Data Science to Transform Information into Insight — derived measurement candidates

Hands-On Technique Building in Spreadsheets

number of techniques implemented; proportion of exercises reproduced from scratch; workbook completion rate

self-report suitability: high

Data Preparation and Standardization Quality

checklist score of preparation steps present; post-standardization mean/SD checks; missing-value rate after imputation

self-report suitability: medium

Correct Problem Framing

alignment rating between solved and true problem; count of stakeholder engagement touchpoints

self-report suitability: medium

Practitioner Understanding of Techniques

knowledge test score; rubric-rated explanation quality

self-report suitability: medium

Practitioner Confidence and Reduced Data Anxiety

self-efficacy attitude rating; anxiety attitude rating; task volunteering frequency

self-report suitability: high

Appropriate Technique Selection

appropriateness rating per decision; match score to problem characteristics

self-report suitability: medium

Prototyping and Iteration Behavior

number of variants tried; number of parameter settings explored; number of comparative evaluations

self-report suitability: high

Model Validity and Performance

R-squared; F/t test p-values; AUC; precision; recall; specificity; false positive rate; standard error; prediction interval coverage

self-report suitability: low
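Two of the metrics listed for this construct, R-squared and AUC, are short enough to compute by hand. The sketch below assumes the pairwise definition of AUC (the share of positive/negative pairs in which the positive case receives the higher score, ties counting half) and the usual 1 - SS_res / SS_tot form of R-squared; all numbers are invented for illustration.

```python
def r_squared(actual, predicted):
    """Share of variance explained: 1 - residual SS / total SS."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical regression output.
fit = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# Hypothetical classifier scores on four holdout cases.
area = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

print(round(fit, 3), area)
```

The pairwise form is quadratic in the number of cases, which is fine for a prototype but worth replacing with a rank-based calculation on large holdout sets.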

Business Value Created

revenue lift; cost reduction; conversion improvement; model usage/adoption rate

self-report suitability: low

Communication and Translation Skill

peer/manager clarity ratings; planning participation counts; stakeholder satisfaction

self-report suitability: medium

Tool/Complexity/Performance Obsession

post-mortem severity rating; model maintainability score; tool-before-problem incidence; non-adoption due to complexity

self-report suitability: low

The story

The reader

A business professional (marketing VP, analyst, CEO, or online marketer) who wants to use their transactional data strategically to make better decisions but doesn't understand the data science approaches being recommended to them.

External problem

They have valuable data but lack the knowledge to extract insight from it or to evaluate the tools, consultants, and techniques being pitched to them.

Internal problem

They feel anxiety, intimidation, and a fear of being left behind by competitors who are 'doing data science.'

Philosophical problem

It's just plain wrong that data science is gatekept behind hype, jargon, expensive tools, and code, when the underlying techniques are learnable and useful to anyone willing to engage.

The plan

  1. Shore up your spreadsheet fundamentals so you can follow along comfortably.
  2. Learn each technique by building it by hand on real example data in Excel.
  3. Understand how to prepare and standardize your data for analysis.
  4. Evaluate your models with statistical tests and performance metrics.
  5. Choose the right technique and threshold for your specific business problem.
  6. Graduate into R to scale and productionize what you've prototyped.

Success

  • You can identify data science opportunities within your own organization.
  • You can prototype solutions, correctly buy data science products, and delegate the right approaches to developers.
  • Your data anxiety is replaced with excitement and ideas for taking your business to the next level.
  • You gain a leg up on competitors who are wasting money on tools before knowing what they want.

At stake

  • You lose out to competitors who understand and apply these techniques.
  • You waste money buying tools and hiring consultants before knowing what you actually need.
  • You remain unable to tell AI from BI from BS, vulnerable to slick pitches.
  • Your valuable transactional data keeps going to waste, read and saved but never mined for insight.