Probabilistic Deep Learning with Python, Keras and TensorFlow Probability

Oliver Dürr, Beate Sick & Elvis Murina · 2020

In a sentence

A hands-on guide to building probabilistic deep learning models using the maximum likelihood principle and Bayesian inference, implemented in Python with Keras and TensorFlow Probability.

Probabilistic Deep Learning demystifies the statistical foundations that underpin nearly every neural network. It shows practitioners that the standard loss functions of traditional DL are instances of maximum likelihood estimation, and that extending models to output full probability distributions rather than single point predictions is both theoretically principled and practically achievable. The book moves from neural network architectures and gradient descent through maximum likelihood loss derivation, TensorFlow Probability, normalizing flows, and finally Bayesian neural networks via variational inference and MC dropout. Rich with Jupyter notebook exercises, real case studies (Bavarian roadkills, CIFAR-10 novel-class detection), and state-of-the-art examples (WaveNet, PixelCNN++, Glow), it equips readers to build models that not only predict accurately but also know when they don't know, an essential capability for safety-critical and decision-support applications.

The four lenses

  • Science
  • Statistics
  • Systems
  • Strategy

Tags

f1-systems

The model

A causal-structural model describing how design choices in neural network architecture, outcome distribution selection, loss function construction, and Bayesian extension propagate through training dynamics and predictive distribution quality to produce downstream outcomes of prediction accuracy, calibration, and uncertainty quantification.

Neural Network Architecture Choice · design lever

The selection of network type (fcNN, CNN, 1D CNN, RNN) and depth/width, which determines the inductive biases, parameter count, and ability to exploit structural properties of the input data such as local spatial correlations in images or temporal ordering in sequences.

Outcome Distribution Family Selection · design lever

The practitioner's choice of parametric probability distribution for the conditional outcome distribution (e.g., Normal for continuous data, Poisson or ZIP for count data, multinomial for categories, logistic mixture for complex discrete data), which determines the structure of the loss function and the expressiveness of the probabilistic predictions.
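Why the distribution family matters can be seen in a small standard-library sketch. It is not taken from the book: the count data are made up and the helper names are hypothetical. It fits a Normal and a Poisson to the same counts by matching moments and checks how much probability each places on impossible negative values.

```python
import math

def poisson_pmf(k, rate):
    """Probability of observing count k under Poisson(rate)."""
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

def normal_cdf(x, mean, std):
    """Cumulative distribution function of Normal(mean, std)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Made-up count data (illustrative only).
counts = [0, 1, 0, 2, 1, 0, 3, 1]
mean = sum(counts) / len(counts)
std = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))

# A Normal fitted to counts leaks probability mass below zero (about 0.16 here).
normal_mass_below_zero = normal_cdf(0.0, mean, std)

# A Poisson with the same mean keeps all of its mass on 0, 1, 2, ...
poisson_total_mass = sum(poisson_pmf(k, mean) for k in range(50))

print(f"Normal mass below zero: {normal_mass_below_zero:.4f}")
print(f"Poisson mass on non-negative counts: {poisson_total_mass:.6f}")
```

The Normal assigns a noticeable share of its probability to counts that cannot occur, which is one way a mismatched family shows up before any network is trained.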

Loss Function Construction via MaxLike · design lever

The process of deriving the training loss as the negative log-likelihood (NLL) of the chosen outcome distribution given the observed data, ensuring that minimizing the loss is equivalent to maximizing the probability of the observed data under the model. This is the central technical link between distribution choice and model fitting.
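As an illustration of this link, the sketch below (plain Python, with made-up numbers and hypothetical function names) checks that the mean Gaussian NLL with a fixed standard deviation equals a constant plus a scaled MSE, so both losses rank predictions identically.

```python
import math

def gaussian_nll(ys, mus, sigma):
    """Mean negative log-likelihood of ys under Normal(mu_i, sigma)."""
    total = 0.0
    for y, mu in zip(ys, mus):
        total += 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
    return total / len(ys)

def mse(ys, mus):
    """Mean squared error between observations and predicted means."""
    return sum((y - mu) ** 2 for y, mu in zip(ys, mus)) / len(ys)

# Made-up observations and two candidate sets of predicted means.
ys = [1.0, 2.0, 3.0]
pred_a = [1.1, 1.9, 3.2]
pred_b = [0.5, 2.5, 2.0]
sigma = 1.0  # fixed, not learned

# With sigma fixed, NLL = const + MSE / (2 * sigma^2).
const = 0.5 * math.log(2 * math.pi * sigma ** 2)
nll_a = gaussian_nll(ys, pred_a, sigma)
nll_b = gaussian_nll(ys, pred_b, sigma)

print(nll_a, const + mse(ys, pred_a) / (2 * sigma ** 2))
print(nll_b, const + mse(ys, pred_b) / (2 * sigma ** 2))
```

Once the standard deviation is also predicted by the network, the two losses no longer coincide and the NLL carries extra information about the spread.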

Bayesian Prior Specification · design lever

The choice of prior probability distribution over network weights before observing data in a Bayesian neural network, encoding domain knowledge or regularization preferences. A zero-mean Gaussian prior is the most common choice; maximum a posteriori estimation under this prior corresponds to L2 weight decay regularization in the non-Bayesian setting.
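The correspondence between a zero-mean Gaussian prior and L2 weight decay can be checked numerically. In this sketch the function names, the prior scale `tau`, and the weight values are illustrative assumptions: the negative log prior differs from an L2 penalty with strength 1 / (2 * tau^2) only by a constant that does not depend on the weights.

```python
import math

def neg_log_gaussian_prior(weights, tau):
    """Negative log-density of independent Normal(0, tau) priors on each weight."""
    return sum(
        0.5 * math.log(2 * math.pi * tau ** 2) + w ** 2 / (2 * tau ** 2)
        for w in weights
    )

def l2_penalty(weights, lam):
    """Classic weight decay term: lam * sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)

tau = 2.0                     # hypothetical prior standard deviation
lam = 1.0 / (2.0 * tau ** 2)  # matching weight decay strength

# Two arbitrary weight vectors of the same length.
w_small = [0.1, -0.2, 0.05]
w_large = [1.5, -2.0, 0.7]

# The gap is the same constant for both vectors, so it does not affect the optimum.
gap_small = neg_log_gaussian_prior(w_small, tau) - l2_penalty(w_small, lam)
gap_large = neg_log_gaussian_prior(w_large, tau) - l2_penalty(w_large, lam)
print(gap_small, gap_large)
```

A wider prior (larger `tau`) therefore means weaker weight decay, and a narrower prior means stronger weight decay.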

Bayesian Approximation Method · design lever

The choice of approximation technique for posterior inference in a Bayesian neural network: primarily variational inference (VI), which uses Gaussian weight distributions with reparameterization gradients, or MC dropout, which keeps dropout active at test time and so implies Bernoulli-distributed weights. Each method trades off computational cost, approximation fidelity, and distributional flexibility differently.
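The MC dropout idea can be sketched without any deep learning library. The tiny network below uses hand-picked, untrained weights and a hypothetical keep probability, all of which are assumptions for illustration. Dropout stays switched on at prediction time, and the spread across repeated forward passes serves as the uncertainty signal.

```python
import random
import statistics

random.seed(0)

# Hand-picked weights for a one-input, four-hidden-unit, one-output network.
W1 = [0.5, -1.0, 1.5, 0.8]   # input -> hidden
W2 = [1.0, 0.5, -0.7, 0.3]   # hidden -> output
KEEP_PROB = 0.8              # probability that a hidden unit is kept

def forward_with_dropout(x):
    """One stochastic forward pass with dropout applied to the hidden units."""
    out = 0.0
    for w1, w2 in zip(W1, W2):
        hidden = max(0.0, w1 * x)             # ReLU activation
        if random.random() < KEEP_PROB:       # Bernoulli mask, also at test time
            out += w2 * hidden / KEEP_PROB    # inverted-dropout rescaling
    return out

def mc_dropout_predict(x, n_samples=200):
    """Mean and standard deviation over repeated stochastic forward passes."""
    samples = [forward_with_dropout(x) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

mean_pred, std_pred = mc_dropout_predict(2.0)
print(f"prediction {mean_pred:.2f} +/- {std_pred:.2f}")
```

In a Keras model the same effect is usually obtained by calling the dropout layers in training mode at prediction time; the sketch only mirrors the sampling logic.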

Training Data Coverage and Volume · contextual condition

The extent to which the training dataset covers the input space likely to be encountered at test time, including the quantity of labeled examples and the diversity of input conditions. Limited coverage creates regions of high epistemic uncertainty; extensive coverage reduces epistemic uncertainty and causes the Bayesian posterior to concentrate near the MaxLike estimate.
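The concentration of the posterior near the MaxLike estimate can be illustrated with a coin-toss model, where the posterior is available in closed form. This is a sketch under assumptions: a flat Beta(1, 1) prior, made-up toss counts, and a hypothetical helper name.

```python
import math

def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Posterior mean and standard deviation of the heads probability
    under a Beta(a, b) prior and a Bernoulli likelihood."""
    a_post, b_post = a + heads, b + tails
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1))
    return mean, math.sqrt(var)

maxlike = 0.75  # 75% heads in both made-up data sets

small_mean, small_std = beta_posterior(3, 1)       # 4 tosses
large_mean, large_std = beta_posterior(300, 100)   # 400 tosses

print(f"4 tosses:   mean {small_mean:.3f}, std {small_std:.3f}")
print(f"400 tosses: mean {large_mean:.3f}, std {large_std:.3f}")
```

With 400 tosses the posterior is both narrower and closer to the MaxLike value than with 4 tosses, which is the same behavior described above for network weights.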

Gradient Descent Training Dynamics · behavioral pattern

The iterative process of updating network weights via stochastic gradient descent (SGD) or its variants (Adam, RMSProp) to minimize the loss function, including the effects of learning rate, mini-batch size, and optimizer momentum on convergence speed, stability, and the region of the loss landscape reached. Mini-batch noise in the gradient estimates can help the optimizer escape sharp local minima.
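A minimal mini-batch SGD loop for a straight-line fit is sketched below. The synthetic data, learning rate, and batch size are illustrative assumptions rather than values from the book.

```python
import random

random.seed(1)

# Synthetic data from y = 2x + 1 plus a little Gaussian noise.
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

a, b = 0.0, 0.0   # slope and intercept, initialised at zero
lr = 0.02         # learning rate
batch_size = 8

for epoch in range(300):
    indices = list(range(len(xs)))
    random.shuffle(indices)  # new mini-batch composition every epoch
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Gradients of the mean squared error on this mini-batch only.
        grad_a = sum(2 * (a * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum(2 * (a * xs[i] + b - ys[i]) for i in batch) / len(batch)
        a -= lr * grad_a
        b -= lr * grad_b

print(f"slope {a:.2f}, intercept {b:.2f}")
```

Each update uses a noisy gradient estimate from eight points, yet the parameters settle close to the values that generated the data.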

Posterior Approximation Quality · psychological state

The fidelity with which the variational distribution or MC dropout distribution approximates the true Bayesian posterior over network weights. High quality means the variational distribution closely matches the posterior, resulting in well-calibrated epistemic uncertainty. Quality is limited by the expressiveness of the approximating family (Gaussian for VI, Bernoulli for MC dropout).

Aleatoric Uncertainty Captured · psychological state

The degree to which the predicted conditional probability distribution (CPD) correctly represents the irreducible data-inherent variance of the outcome, including heteroscedastic effects where spread varies with input. Properly captured aleatoric uncertainty means the model's CPD width tracks the actual spread of outcomes across the input space.
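A small sketch of the heteroscedastic case, using constructed data whose spread grows with the input (the data and the two spread models are assumptions for illustration): a CPD whose width tracks the input achieves a lower mean NLL than one with a single constant width.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-density of y under Normal(mu, sigma)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# Constructed data: the mean is x and the deviations scale with x.
xs, ys = [], []
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    for sign in (-1.0, 1.0):
        xs.append(x)
        ys.append(x + sign * 0.5 * x)

def predicted_mean(x):
    return x  # both models share the same, correct mean

# Model 1: one constant standard deviation estimated from all residuals.
constant_sigma = math.sqrt(
    sum((y - predicted_mean(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
)
constant_nll = sum(
    gaussian_nll(y, predicted_mean(x), constant_sigma) for x, y in zip(xs, ys)
) / len(xs)

# Model 2: standard deviation that grows with the input.
hetero_nll = sum(
    gaussian_nll(y, predicted_mean(x), 0.5 * x) for x, y in zip(xs, ys)
) / len(xs)

print(f"constant spread NLL {constant_nll:.3f}, input-dependent spread NLL {hetero_nll:.3f}")
```

Both models have identical point predictions, so only a probabilistic metric can tell them apart.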

Epistemic Uncertainty Quantified · psychological state

The degree to which the model correctly represents uncertainty arising from limited training data and parameter uncertainty, manifesting as widening predictive distributions in extrapolation regions and for novel inputs. Of the approaches covered, only the Bayesian models (VI or MC dropout) capture this; non-Bayesian models produce constant or inappropriately narrow uncertainty in low-data regions.
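The widening in extrapolation regions can be shown with a one-parameter Bayesian linear regression, where the posterior is exact. The model (a line through the origin with known noise), the prior scale, and the data are all assumptions made for this sketch.

```python
import math

def posterior_slope(xs, ys, noise_std=1.0, prior_std=10.0):
    """Posterior mean and variance of the slope w in y = w * x + noise,
    with prior w ~ Normal(0, prior_std) and known Gaussian noise."""
    precision = 1.0 / prior_std ** 2 + sum(x * x for x in xs) / noise_std ** 2
    var = 1.0 / precision
    mean = var * sum(x * y for x, y in zip(xs, ys)) / noise_std ** 2
    return mean, var

def predictive_std(x_new, slope_var, noise_std=1.0):
    """Predictive spread: aleatoric noise plus slope uncertainty scaled by x."""
    return math.sqrt(noise_std ** 2 + x_new ** 2 * slope_var)

# Made-up training data, all close to the origin.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

slope_mean, slope_var = posterior_slope(xs, ys)
inside = predictive_std(2.0, slope_var)    # within the training range
outside = predictive_std(20.0, slope_var)  # far outside the training range

print(f"slope {slope_mean:.2f}; std at x=2: {inside:.2f}; std at x=20: {outside:.2f}")
```

A non-Bayesian fit with the same noise model would report a spread of 1.0 everywhere, whereas the Bayesian predictive spread grows with distance from the data.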

Validation Negative Log-Likelihood (NLL) · outcome metric

The mean negative log-likelihood of observed outcomes under the model's predicted conditional probability distribution, evaluated on held-out validation or test data not seen during training. The book treats this as the principal performance metric for probabilistic prediction models; lower values indicate better probabilistic predictions. Familiar losses are special cases: MSE corresponds to a Gaussian CPD with fixed variance, and cross-entropy to a categorical CPD.
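To see what validation NLL adds beyond accuracy, the sketch below scores two hypothetical classifiers on the same four held-out labels. The probabilities are made up: both models pick the same classes, but one is far more confident, including on the example it gets wrong.

```python
import math

def mean_nll(pred_probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(pred_probs, labels)) / len(labels)

def accuracy(pred_probs, labels):
    """Share of examples where the most probable class is the true one."""
    hits = 0
    for p, y in zip(pred_probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        hits += int(predicted == y)
    return hits / len(labels)

labels = [0, 1, 1, 0]

# Made-up predicted class probabilities on held-out data.
calibrated = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.7, 0.3]]
overconfident = [[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.99, 0.01]]

print("accuracy:", accuracy(calibrated, labels), accuracy(overconfident, labels))
print("mean NLL:", mean_nll(calibrated, labels), mean_nll(overconfident, labels))
```

Accuracy is identical for both, while the NLL penalizes the overconfident model heavily for its single confident mistake.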

Prediction Accuracy and Point-Estimate Performance · outcome metric

Standard point-estimate performance metrics including classification accuracy, RMSE, and MAE, which measure how well the mode or mean of the predicted CPD matches observed labels. These metrics are necessary but not sufficient for evaluating probabilistic models; they must be reported alongside validation NLL.

Novel Input and Out-of-Distribution Detection · outcome metric

The ability of a trained model to flag inputs that are outside the distribution of training data (novel classes in classification or extrapolation regions in regression) by expressing elevated predictive uncertainty. Of the approaches covered, only the Bayesian models (VI or MC dropout) provide this capability; non-Bayesian models produce inappropriately confident predictions on novel inputs.
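One common way to turn elevated predictive uncertainty into a flag is the entropy of the averaged class probabilities across stochastic forward passes. The sketch below uses made-up softmax outputs and a hypothetical threshold; it is one possible rule, not necessarily the one used in the book's case study.

```python
import math

def mean_probs(samples):
    """Average class probabilities over several stochastic forward passes."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Made-up softmax outputs from three stochastic passes per input.
known_input = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.92, 0.04, 0.04]]
novel_input = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]

THRESHOLD = 0.8  # hypothetical cut-off; in practice tuned on validation data

def is_novel(samples, threshold=THRESHOLD):
    return entropy(mean_probs(samples)) > threshold

print("known input flagged:", is_novel(known_input))
print("novel input flagged:", is_novel(novel_input))
```

For the novel input every single pass looks confident, but the passes disagree, so the averaged distribution is nearly uniform and its entropy is high.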

Inductive Bias Alignment with Data Structure · contextual condition

The degree to which the chosen neural network architecture's built-in assumptions (local connectivity for images, causal masking for sequences, no assumed feature ordering for tabular data) match the actual structural properties of the input data. High alignment means the network needs fewer parameters and less data to learn good representations.

Distribution Family Fit to Data Generating Process · contextual condition

The degree to which the chosen parametric outcome distribution correctly captures the structural properties of the data generating process, including support constraints (non-negativity for counts), tail behavior, multimodality, and zero-inflation. A poor fit between distribution family and true data process limits the achievable validation NLL regardless of model capacity.

How they connect

  • architecture choice influences gradient descent dynamics
  • architecture choice influences aleatoric uncertainty captured
  • outcome distribution selection predicts loss function construction
  • loss function construction influences gradient descent dynamics
  • gradient descent dynamics influences aleatoric uncertainty captured
  • prior specification influences posterior approximation quality
  • bayesian approximation method predicts posterior approximation quality
  • training data coverage influences epistemic uncertainty quantified
  • training data coverage influences validation nll
  • posterior approximation quality predicts epistemic uncertainty quantified
  • aleatoric uncertainty captured influences validation nll
  • epistemic uncertainty quantified influences validation nll
  • epistemic uncertainty quantified predicts novel input detection
  • distribution family fit moderates aleatoric uncertainty captured
  • distribution family fit moderates validation nll
  • inductive bias alignment moderates gradient descent dynamics
  • aleatoric uncertainty captured influences prediction accuracy
  • validation nll correlates with novel input detection
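The connections listed above can be held as a simple edge list and queried. In this sketch every link is treated as directed in the order written, which is an assumption, and `upstream` is a hypothetical helper that collects everything feeding into a given node.

```python
# Each tuple is (source, relation, target), copied from the list above.
EDGES = [
    ("architecture choice", "influences", "gradient descent dynamics"),
    ("architecture choice", "influences", "aleatoric uncertainty captured"),
    ("outcome distribution selection", "predicts", "loss function construction"),
    ("loss function construction", "influences", "gradient descent dynamics"),
    ("gradient descent dynamics", "influences", "aleatoric uncertainty captured"),
    ("prior specification", "influences", "posterior approximation quality"),
    ("bayesian approximation method", "predicts", "posterior approximation quality"),
    ("training data coverage", "influences", "epistemic uncertainty quantified"),
    ("training data coverage", "influences", "validation nll"),
    ("posterior approximation quality", "predicts", "epistemic uncertainty quantified"),
    ("aleatoric uncertainty captured", "influences", "validation nll"),
    ("epistemic uncertainty quantified", "influences", "validation nll"),
    ("epistemic uncertainty quantified", "predicts", "novel input detection"),
    ("distribution family fit", "moderates", "aleatoric uncertainty captured"),
    ("distribution family fit", "moderates", "validation nll"),
    ("inductive bias alignment", "moderates", "gradient descent dynamics"),
    ("aleatoric uncertainty captured", "influences", "prediction accuracy"),
    ("validation nll", "correlates with", "novel input detection"),
]

def upstream(target):
    """Return every node with a directed path into target."""
    parents = {}
    for source, _relation, destination in EDGES:
        parents.setdefault(destination, set()).add(source)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream("prediction accuracy")))
```

Read this way, prediction accuracy depends only on the architecture, loss, and distribution-fit branch, while the Bayesian levers reach validation NLL and novel input detection through epistemic uncertainty.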

The story

The reader

A data scientist, ML engineer, or quantitative researcher who can build neural networks but wants to understand their probabilistic foundations and make models that express reliable uncertainty.

External problem

Their neural networks produce confident-sounding predictions even in extrapolation regions or on novel inputs, and they lack principled tools to quantify or communicate prediction uncertainty.

Internal problem

They feel uneasy deploying models without knowing when to trust the predictions, and frustrated that loss functions seem arbitrary rather than principled.

Philosophical problem

Models that cannot say 'I don't know' are dangerous in safety-critical or high-stakes decision contexts, and practitioners deserve tools grounded in sound statistical theory.

The plan

  1. Understand neural network architectures and how gradient descent trains parametric models (Part 1).
  2. Learn the maximum likelihood principle and derive principled loss functions for any outcome distribution (Chapter 4).
  3. Build probabilistic DL models using TensorFlow Probability, selecting distributions matched to the data type (Chapters 5–6).
  4. Extend to complex distributions—mixtures and normalizing flows—for state-of-the-art generative models (Chapter 6).
  5. Understand Bayesian inference, the posterior, prior, and predictive distribution on simple examples (Chapter 7).
  6. Apply variational inference and MC dropout to fit tractable Bayesian neural networks that quantify epistemic uncertainty (Chapter 8).

Success

  • Reader can derive the correct loss function for any probabilistic model from first principles.
  • Reader's models output calibrated uncertainty estimates that widen appropriately in extrapolation and for novel inputs.
  • Reader can detect unreliable predictions and flag them before they cause harm in downstream decisions.
  • Reader understands and can adapt state-of-the-art architectures like WaveNet, PixelCNN++, and Glow.
  • Reader can choose between variational inference and MC dropout based on computational and accuracy trade-offs.

At stake

  • Continued deployment of overconfident models in safety-critical settings (self-driving cars, medical diagnosis) without any uncertainty signal.
  • Loss functions remain black boxes, preventing practitioners from adapting models to novel data types or distribution shifts.
  • Missed performance gains from mismatched outcome distributions (e.g., using MSE for count data instead of Poisson NLL).
  • Inability to detect novel classes or out-of-distribution inputs, leading to silent, high-confidence failures.