peopleanalyst

Data Science from Scratch: First Principles with Python

Joel Grus · 2015

In a sentence

A hands-on introduction to data science that teaches the core concepts, algorithms, and mathematics by implementing everything from scratch in Python rather than relying on existing libraries.

Data Science from Scratch teaches you data science by having you build the tools and algorithms yourself, from the ground up, in clear and readable Python. Rather than treating libraries like NumPy, scikit-learn, and pandas as magic black boxes, Joel Grus walks you through implementing linear algebra, statistics, probability, gradient descent, machine learning models, neural networks, deep learning, clustering, NLP, network analysis, and recommender systems by hand. Framed around the fictional 'DataSciencester' social network, the book grounds abstract concepts in concrete problems while developing the hacking skills and mathematical intuition that are at the core of doing real data science. By the end you understand not just how to use data science tools, but how and why they work.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

The model

A model expressing how the book's design levers (learning from scratch, mathematical foundations, data handling, modeling choices) drive learner and model states (understanding, code correctness, model fit) and outcomes (data science competence, predictive performance, ethical responsibility).

From-Scratch Implementation Practice · design lever

The pedagogical practice of building data science tools and algorithms by hand in clear Python rather than relying on existing libraries, in order to better understand how they work internally.

Mathematical Foundation (Linear Algebra, Statistics, Probability) · design lever

The learner's grounding in the core mathematics underpinning data science, including vectors, matrices, central tendency, dispersion, correlation, probability distributions, and inference, which the book treats as essential prerequisites.
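
As a rough illustration of what this grounding looks like in code, here is a minimal sketch of a dot product, mean, standard deviation, and correlation written with plain Python lists. The function names and the choice of the sample (n - 1) denominator are assumptions made for this example and are not quoted from the book.

```python
import math
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Sum of the componentwise products of two equal-length vectors."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def mean(xs: Vector) -> float:
    """Arithmetic mean (central tendency)."""
    return sum(xs) / len(xs)

def de_mean(xs: Vector) -> Vector:
    """Shift xs so that its mean is zero."""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def standard_deviation(xs: Vector) -> float:
    """Sample standard deviation (dispersion), using n - 1."""
    assert len(xs) >= 2, "need at least two points"
    deviations = de_mean(xs)
    return math.sqrt(dot(deviations, deviations) / (len(xs) - 1))

def correlation(xs: Vector, ys: Vector) -> float:
    """Pearson correlation; returns 0.0 if either input has no variation."""
    assert len(xs) == len(ys), "inputs must be the same length"
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x == 0 or stdev_y == 0:
        return 0.0
    covariance = dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)
    return covariance / (stdev_x * stdev_y)
```

Building correlation out of `dot` and `de_mean` shows how the statistics rest on the linear algebra helpers rather than on separate formulas.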

Data Acquisition and Cleaning Effort · behavioral pattern

The work of getting data via files, web scraping, and APIs, and then cleaning, munging, exploring, rescaling, and manipulating it into usable form, which consumes a large fraction of a data scientist's time.
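
The cleaning and rescaling steps can be sketched as follows. This is a hypothetical example: the helper names (`try_parse_float`, `clean`, `rescale`) and the sample input are invented for illustration, and real data would need checks specific to its source.

```python
import math
from typing import List, Optional, Tuple

def try_parse_float(text: str) -> Optional[float]:
    """Return the parsed number, or None if the text is not a valid float."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(raw: List[str]) -> Tuple[List[float], int]:
    """Parse raw strings, returning the valid values and the count dropped."""
    parsed = [try_parse_float(t.strip()) for t in raw]
    valid = [x for x in parsed if x is not None]
    return valid, len(parsed) - len(valid)

def rescale(xs: List[float]) -> List[float]:
    """Rescale to mean 0 and sample standard deviation 1."""
    n = len(xs)
    assert n >= 2, "need at least two points"
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    if sigma == 0:
        return [0.0 for _ in xs]  # no variation: leave everything at zero
    return [(x - mu) / sigma for x in xs]

# Hypothetical raw column with two unusable entries.
values, dropped = clean(["1.0", "oops", " 2.0 ", "", "3.0"])
scaled = rescale(values)
```

Tracking how many rows were dropped gives a simple handle on the "fraction of invalid rows handled" idea without hiding the loss of data.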

Model Complexity · design lever

The richness of the chosen model in terms of number of parameters, features, and flexibility, which influences how well the model can fit training data and how prone it is to overfitting or underfitting.

Gradient Descent Optimization · behavioral pattern

The iterative technique of computing gradients of a loss function and taking steps in the opposite direction to fit model parameters, serving as a unifying method for fitting many models throughout the book.
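
A minimal sketch of the idea, assuming a fixed learning rate and a fixed number of epochs: repeatedly compute the gradient and step against it. The toy loss (sum of squares, whose minimum is the zero vector) and all names here are chosen for illustration only.

```python
from typing import Callable, List

Vector = List[float]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move step_size along the gradient direction from v."""
    assert len(v) == len(gradient), "vector and gradient must match in length"
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

def sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of f(v) = sum(v_i ** 2), which is 2 * v."""
    return [2 * v_i for v_i in v]

def minimize(gradient_fn: Callable[[Vector], Vector],
             start: Vector,
             learning_rate: float = 0.1,
             epochs: int = 1000) -> Vector:
    """Plain gradient descent with a constant learning rate."""
    v = list(start)
    for _ in range(epochs):
        grad = gradient_fn(v)
        # A negative step size moves opposite the gradient, reducing the loss.
        v = gradient_step(v, grad, -learning_rate)
    return v

# Starting from an arbitrary point, the result should end up close to zero.
minimum = minimize(sum_of_squares_gradient, [3.0, -4.0, 5.0])
```

Fitting a real model follows the same loop, with the gradient of that model's loss in place of `sum_of_squares_gradient`.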

Conceptual Understanding · psychological state

The learner's genuine internalized grasp of how and why data science algorithms and mathematics work, as opposed to merely being able to invoke library functions, which the book treats as its central pedagogical goal.

Code Correctness · behavioral pattern

The degree to which implemented code does what it is intended to do, supported by clean coding, type annotations, and liberal use of assert statements and automated testing.
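
The combination of type annotations and assert statements might look like the sketch below. The `median` function is an arbitrary example picked for illustration. The asserts act as lightweight tests that fail loudly as soon as the function stops doing what was intended.

```python
from typing import List

def median(xs: List[float]) -> float:
    """Middle value of xs, averaging the two middle values for even lengths."""
    assert xs, "median of an empty list is undefined"
    ordered = sorted(xs)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Inline checks that run every time the module is loaded.
assert median([1, 10, 2, 9, 5]) == 5
assert median([1, 9, 2, 10]) == (2 + 9) / 2
```

The annotations document the intended inputs and outputs for a reader or a type checker, while the asserts verify actual behavior at run time.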

Model Fit Quality · outcome metric

How well a fitted model captures the patterns in data, reflected in goodness-of-fit measures like R-squared and loss, and balanced against generalization to unseen data.
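
For concreteness, here is a small sketch of two of these goodness-of-fit measures, assuming the usual definitions: R-squared as one minus the ratio of residual to total sum of squares, and mean squared error as the average squared residual. The function names are illustrative.

```python
from typing import List

def mean_squared_error(actual: List[float], predicted: List[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    assert len(actual) == len(predicted) and actual, "need matching, non-empty inputs"
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Fraction of the variation in actual that the predictions capture."""
    assert len(actual) == len(predicted) and actual, "need matching, non-empty inputs"
    y_bar = sum(actual) / len(actual)
    total_ss = sum((y - y_bar) ** 2 for y in actual)
    residual_ss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    assert total_ss > 0, "R-squared is undefined when actual values do not vary"
    return 1 - residual_ss / total_ss
```

A model that always predicts the mean of the actual values scores an R-squared of 0 under this definition, and a perfect fit scores 1. Both measures here are computed on whatever data is passed in, so a high value on training data says nothing by itself about generalization.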

Predictive Performance · outcome metric

How well a model performs on new, unseen data, measured by metrics such as accuracy, precision, recall, and F1 score, and judged against the risk of overfitting.
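
These four metrics can all be computed from the counts of a binary confusion matrix. The sketch below assumes the standard definitions. The counts in the usage lines are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    """Of the predicted positives, the fraction that were truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    """Of the true positives, the fraction the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts measured on held-out data.
counts = dict(tp=8, fp=2, fn=4, tn=86)
acc = accuracy(**counts)    # 94 correct out of 100
prec = precision(**counts)  # 8 of 10 predicted positives
rec = recall(**counts)      # 8 of 12 actual positives
f1 = f1_score(**counts)
```

With these counts accuracy is high mostly because negatives dominate, while recall shows that a third of the actual positives were missed, which is why accuracy alone can mislead.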

Data Science Competence · outcome metric

The learner's overall ability to do data science independently, combining hacking skills, mathematical intuition, and understanding of models, which is the ultimate aspiration the book sets for the reader.

Ethical Responsibility · outcome metric

The data scientist's commitment to considering and mitigating the ethical consequences of their work, including bias, fairness, privacy, interpretability, and the wide-reaching effects of scalable technology.

How they connect

  • from scratch implementation predicts conceptual understanding
  • mathematical foundation predicts conceptual understanding
  • from scratch implementation influences code correctness
  • code correctness influences model fit quality
  • data acquisition and cleaning influences model fit quality
  • gradient descent optimization mediates model fit quality
  • model complexity influences model fit quality
  • model complexity moderates predictive performance
  • model fit quality predicts predictive performance
  • conceptual understanding predicts data science competence
  • predictive performance influences data science competence
  • data science competence influences ethical responsibility
  • data acquisition and cleaning correlates with ethical responsibility

A candidate measure

Data Science from Scratch: First Principles with Python — derived measurement candidates

From-Scratch Implementation Practice

proportion of algorithms implemented by hand; count of library imports for core logic (inverse); presence of explanatory comments

self-report suitability: high

Mathematical Foundation (Linear Algebra, Statistics, Probability)

scores on math problem sets; self-rated topic familiarity; error rate in computations

self-report suitability: medium

Data Acquisition and Cleaning Effort

time spent on data preparation; number of cleaning steps performed; fraction of invalid rows handled

self-report suitability: medium

Model Complexity

parameter count; number of nonzero coefficients; polynomial degree; regularization strength

self-report suitability: low

Gradient Descent Optimization

loss reduction per epoch; number of epochs to converge; learning rate

self-report suitability: low

Conceptual Understanding

explanation quality scores; independent reimplementation success; method selection accuracy

self-report suitability: medium

Code Correctness

assertion pass rate; test pass rate; number of type errors (inverse)

self-report suitability: low

Model Fit Quality

R-squared; mean squared error; training loss

self-report suitability: none

Predictive Performance

accuracy; precision; recall; F1 score

self-report suitability: none

Data Science Competence

portfolio quality rubric; method selection accuracy; successful library usage

self-report suitability: medium

Ethical Responsibility

presence of ethical review processes; bias/fairness audit results; privacy safeguards in place

self-report suitability: high

The story

The reader

An aspiring data scientist with some mathematical aptitude and programming skill who wants to genuinely understand how data science works, not just call library functions.

External problem

They need to learn the core algorithms, mathematics, and tools of data science well enough to actually do the work.

Internal problem

They feel like an underachiever or impostor who can use libraries but doesn't truly understand what's happening under the hood.

Philosophical problem

Treating data science tools as magic black boxes is the wrong way to learn; true competence comes from understanding things from first principles.

The plan

  1. Get comfortable with Python and the language features that matter for data science.
  2. Build a foundation in linear algebra, statistics, and probability.
  3. Learn to get, clean, explore, and manipulate real data.
  4. Implement core machine learning models and evaluation techniques from scratch.
  5. Advance to neural networks, deep learning, clustering, NLP, and recommender systems.
  6. Consider the ethical consequences of your data work and then move on to using production libraries.

Success

  • You possess a solid understanding of the fundamentals of data science.
  • You can build, train, and evaluate models while understanding how they work.
  • You can confidently use production libraries because you know what they do under the hood.
  • You can find datasets that interest you and do your own data science projects.

At stake

  • You remain dependent on libraries you don't understand and can't debug or extend them.
  • You build models that overfit, mislead, or behave unethically without realizing it.
  • You stay stuck feeling like an impostor unable to do real data science work.
