Data Science from Scratch: First Principles with Python
Joel Grus · 2015
In a sentence
A hands-on introduction to data science that teaches the core concepts, algorithms, and mathematics by implementing everything from scratch in Python rather than relying on existing libraries.
Data Science from Scratch teaches you data science by having you build the tools and algorithms yourself, from the ground up, in clear and readable Python. Rather than treating libraries like NumPy, scikit-learn, and pandas as magic black boxes, Joel Grus walks you through implementing linear algebra, statistics, probability, gradient descent, machine learning models, neural networks, deep learning, clustering, NLP, network analysis, and recommender systems by hand. Framed around the fictional 'DataSciencester' social network, the book grounds abstract concepts in concrete problems while developing the hacking skills and mathematical intuition that are at the core of doing real data science. By the end you understand not just how to use data science tools, but how and why they work.
The four lenses
- Science
- Statistics
- Systems
- Strategy
The model
A model expressing how the book's design levers (learning from scratch, mathematical foundations, data handling, modeling choices) drive learner and model states (understanding, code correctness, model fit) and outcomes (data science competence, predictive performance, ethical responsibility).
From-Scratch Implementation Practice · design lever
The pedagogical practice of building data science tools and algorithms by hand in clear Python rather than relying on existing libraries, in order to better understand how they work internally.
Mathematical Foundation (Linear Algebra, Statistics, Probability) · design lever
The learner's grounding in the core mathematics underpinning data science, including vectors, matrices, central tendency, dispersion, correlation, probability distributions, and inference, which the book treats as essential prerequisites.
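As a rough illustration of what this grounding looks like in code, the sketch below builds central tendency, dispersion, and correlation on top of a hand-written dot product. The function names are illustrative choices for this summary, not quotations from the book, and the sample variance (dividing by n - 1) is an assumption.

```python
import math
from typing import List

Vector = List[float]


def dot(v: Vector, w: Vector) -> float:
    """Sum of componentwise products of two equal-length vectors."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def mean(xs: Vector) -> float:
    """Central tendency: the arithmetic mean."""
    return sum(xs) / len(xs)


def de_mean(xs: Vector) -> Vector:
    """Shift xs so that its mean is zero."""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]


def variance(xs: Vector) -> float:
    """Dispersion: sample variance (assumes the n - 1 denominator)."""
    assert len(xs) >= 2, "variance needs at least two points"
    deviations = de_mean(xs)
    return dot(deviations, deviations) / (len(xs) - 1)


def standard_deviation(xs: Vector) -> float:
    return math.sqrt(variance(xs))


def covariance(xs: Vector, ys: Vector) -> float:
    assert len(xs) == len(ys), "xs and ys must be the same length"
    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)


def correlation(xs: Vector, ys: Vector) -> float:
    """Covariance scaled by both standard deviations; 0.0 if either is constant."""
    sd_x = standard_deviation(xs)
    sd_y = standard_deviation(ys)
    if sd_x > 0 and sd_y > 0:
        return covariance(xs, ys) / sd_x / sd_y
    return 0.0
```

Each function reuses the ones before it, which is the point of the from-scratch approach: correlation is just a dot product of de-meaned vectors with some scaling.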
Data Acquisition and Cleaning Effort · behavioral pattern
The work of getting data via files, web scraping, and APIs, and then cleaning, munging, exploring, rescaling, and manipulating it into usable form, which consumes a large fraction of a data scientist's time.
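A minimal sketch of two of these steps, cleaning and rescaling, might look like the following. The helper names (`try_parse_float`, `rescale`) and the sample values are hypothetical, and returning zeros for a constant column is a design assumption made here for simplicity.

```python
import math
from typing import List, Optional


def try_parse_float(raw: str) -> Optional[float]:
    """Return the parsed number, or None if the field is not a valid float."""
    try:
        return float(raw.strip())
    except ValueError:
        return None


def rescale(xs: List[float]) -> List[float]:
    """Standardize xs to mean 0 and sample standard deviation 1."""
    n = len(xs)
    assert n >= 2, "need at least two values to rescale"
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    if sd == 0:
        # Assumption: a constant column carries no spread, so map it to zeros.
        return [0.0 for _ in xs]
    return [(x - mu) / sd for x in xs]


# Hypothetical raw column containing two unusable fields.
raw_heights = ["63", " 67.5", "n/a", "70", ""]
parsed = [try_parse_float(r) for r in raw_heights]
heights = [h for h in parsed if h is not None]  # drop the invalid rows
scaled = rescale(heights)
```

Counting how many entries of `parsed` are `None` gives the "fraction of invalid rows" style of measure directly.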
Model Complexity · design lever
The richness of the chosen model in terms of number of parameters, features, and flexibility, which influences how well the model can fit training data and how prone it is to overfitting or underfitting.
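Overfitting is usually detected by holding out data the model never sees during fitting and comparing performance on the two parts. The sketch below shows one way to make such a split; the function name, the 75/25 proportion, and the fixed seed are assumptions for illustration.

```python
import random
from typing import List, Tuple, TypeVar

X = TypeVar("X")  # generic type for data points


def split_data(data: List[X], prob: float, seed: int = 0) -> Tuple[List[X], List[X]]:
    """Shuffle a copy of data and split it into fractions [prob, 1 - prob]."""
    assert 0.0 <= prob <= 1.0, "prob must be a fraction between 0 and 1"
    rng = random.Random(seed)   # local generator so the split is reproducible
    shuffled = data[:]          # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prob)
    return shuffled[:cut], shuffled[cut:]


data = list(range(1000))
train, test = split_data(data, 0.75)
```

A model that scores much better on `train` than on `test` is likely more complex than the data supports.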
Gradient Descent Optimization · behavioral pattern
The iterative technique of computing gradients of a loss function and taking steps in the opposite direction to fit model parameters, serving as a unifying method for fitting many models throughout the book.
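The loop described above can be sketched on a toy problem: minimizing the sum of squares of a vector, whose gradient is simply twice the vector. The names, learning rate, and epoch count are illustrative assumptions rather than the book's exact code.

```python
from typing import Callable, List

Vector = List[float]


def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move step_size along the gradient direction, starting from v."""
    assert len(v) == len(gradient), "v and gradient must be the same length"
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]


def sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of f(v) = sum(v_i ** 2), which is 2 * v."""
    return [2 * v_i for v_i in v]


def minimize(gradient_fn: Callable[[Vector], Vector],
             start: Vector,
             learning_rate: float = 0.1,
             epochs: int = 1000) -> Vector:
    """Repeatedly step against the gradient to reduce the loss."""
    v = list(start)
    for _ in range(epochs):
        grad = gradient_fn(v)
        # Negative step size: move in the direction opposite to the gradient.
        v = gradient_step(v, grad, -learning_rate)
    return v


result = minimize(sum_of_squares_gradient, [3.0, -4.0, 5.0])
```

With these settings each step shrinks every component by a factor of 0.8, so `result` ends up very close to the zero vector, the true minimum. Fitting a real model follows the same pattern with the model's loss gradient in place of `sum_of_squares_gradient`.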
Conceptual Understanding · psychological state
The learner's genuine internalized grasp of how and why data science algorithms and mathematics work, as opposed to merely being able to invoke library functions, which the book treats as its central pedagogical goal.
Code Correctness · behavioral pattern
The degree to which implemented code does what it is intended to do, supported by clean coding, type annotations, and liberal use of assert statements and automated testing.
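In practice this style can be as light as a type-annotated function followed immediately by assertions that document and check its behavior. The example below is a hypothetical sketch of that habit, not code taken from the book.

```python
from typing import List

Vector = List[float]


def add(v: Vector, w: Vector) -> Vector:
    """Add two vectors componentwise."""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiply every element of v by c."""
    return [c * v_i for v_i in v]


# Assertions double as lightweight tests and as usage examples.
assert add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]
assert scalar_multiply(2, [1, 2, 3]) == [2, 4, 6]
```

The annotations let a type checker flag misuse before the code runs, while the asserts fail loudly at import time if a later edit breaks the function.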
Model Fit Quality · outcome metric
How well a fitted model captures the patterns in data, reflected in goodness-of-fit measures like R-squared and loss, and balanced against generalization to unseen data.
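R-squared can be computed directly from predictions as one minus the ratio of residual variation to total variation. The sketch below assumes that standard definition; the function and variable names are illustrative.

```python
from typing import List


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Fraction of variation in `actual` captured by `predicted`."""
    assert len(actual) == len(predicted), "inputs must be the same length"
    y_bar = sum(actual) / len(actual)
    # Variation the model fails to explain.
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # Total variation around the mean.
    ss_total = sum((y - y_bar) ** 2 for y in actual)
    assert ss_total > 0, "R-squared is undefined when actual values are constant"
    return 1.0 - ss_residual / ss_total
```

A perfect fit gives 1.0, and a model that always predicts the mean gives 0.0. Computed on training data alone, a high value says nothing about generalization.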
Predictive Performance · outcome metric
How well a model performs on new, unseen data, measured by metrics such as accuracy, precision, recall, and F1 score, and judged against the risk of overfitting.
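All four metrics follow from the counts in a binary confusion matrix. The sketch below uses the standard definitions; the counts in the usage lines are made up for illustration, and returning 0.0 when a denominator is zero is an assumption.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Of the predicted positives, the fraction that were truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of the actual positives, the fraction the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts: 8 true positives, 2 false positives,
# 4 false negatives, 86 true negatives.
acc = accuracy(8, 2, 4, 86)
prec = precision(8, 2)
rec = recall(8, 4)
f1 = f1_score(8, 2, 4)
```

Here accuracy is 0.94 while recall is only about 0.67, which shows why accuracy alone can flatter a model on imbalanced data.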
Data Science Competence · outcome metric
The learner's overall ability to do data science independently, combining hacking skills, mathematical intuition, and understanding of models, which is the ultimate aspiration the book sets for the reader.
Ethical Responsibility · outcome metric
The data scientist's commitment to considering and mitigating the ethical consequences of their work, including bias, fairness, privacy, interpretability, and the wide-reaching effects of scalable technology.
How they connect
- from scratch implementation → predicts conceptual understanding
- mathematical foundation → predicts conceptual understanding
- from scratch implementation → influences code correctness
- code correctness → influences model fit quality
- data acquisition and cleaning → influences model fit quality
- gradient descent optimization → mediates model fit quality
- model complexity → influences model fit quality
- model complexity − moderates predictive performance
- model fit quality → predicts predictive performance
- conceptual understanding → predicts data science competence
- predictive performance → influences data science competence
- data science competence → influences ethical responsibility
- data acquisition and cleaning → correlates with ethical responsibility
A candidate measure
Data Science from Scratch: First Principles with Python — derived measurement candidates
From-Scratch Implementation Practice
proportion of algorithms implemented by hand; count of library imports for core logic (inverse); presence of explanatory comments
self-report suitability: high
Mathematical Foundation (Linear Algebra, Statistics, Probability)
scores on math problem sets; self-rated topic familiarity; error rate in computations
self-report suitability: medium
Data Acquisition and Cleaning Effort
time spent on data preparation; number of cleaning steps performed; fraction of invalid rows handled
self-report suitability: medium
Model Complexity
parameter count; number of nonzero coefficients; polynomial degree; regularization strength
self-report suitability: low
Gradient Descent Optimization
loss reduction per epoch; number of epochs to converge; learning rate
self-report suitability: low
Conceptual Understanding
explanation quality scores; independent reimplementation success; method selection accuracy
self-report suitability: medium
Code Correctness
assertion pass rate; test pass rate; number of type errors (inverse)
self-report suitability: low
Model Fit Quality
R-squared; mean squared error; training loss
self-report suitability: none
Predictive Performance
accuracy; precision; recall; F1 score
self-report suitability: none
Data Science Competence
portfolio quality rubric; method selection accuracy; successful library usage
self-report suitability: medium
Ethical Responsibility
presence of ethical review processes; bias/fairness audit results; privacy safeguards in place
self-report suitability: high
The story
The reader
An aspiring data scientist with some mathematical aptitude and programming skill who wants to genuinely understand how data science works, not just call library functions.
External problem
They need to learn the core algorithms, mathematics, and tools of data science well enough to actually do the work.
Internal problem
They feel like an underachiever or impostor who can use libraries but doesn't truly understand what's happening under the hood.
Philosophical problem
Treating data science tools as magic black boxes is the wrong way to learn; true competence comes from understanding things from first principles.
The plan
- Get comfortable with Python and the language features that matter for data science.
- Build a foundation in linear algebra, statistics, and probability.
- Learn to get, clean, explore, and manipulate real data.
- Implement core machine learning models and evaluation techniques from scratch.
- Advance to neural networks, deep learning, clustering, NLP, and recommender systems.
- Consider the ethical consequences of your data work and then move on to using production libraries.
Success
- You possess a solid understanding of the fundamentals of data science.
- You can build, train, and evaluate models while understanding how they work.
- You can confidently use production libraries because you know what they do under the hood.
- You can find datasets that interest you and do your own data science projects.
At stake
- You remain dependent on libraries you don't understand and can't debug or extend them.
- You build models that overfit, mislead, or behave unethically without realizing it.
- You stay stuck feeling like an impostor unable to do real data science work.