Machine Learning for Astroparticle Physics:
A Crash-course in SBI

Lecture 2b (extra) - Making deep learning work in practice

Christoph Weniger — University of Amsterdam (GRAPPA)

Optional bonus material — orthogonal deep-dive into training deep networks

Today's Roadmap

  1. Making gradients flow: backprop, activations, init, normalization
  2. Making training work: momentum, Adam, learning rates
  3. Why deep learning works: double descent, spectral bias, regularization
The thread: Lecture 3 built the MLP.
Today we learn to train it reliably. Each technique solves a specific failure mode.

Backpropagation

Computing gradients efficiently through the computation graph.

Backpropagation: The Computation Graph

SGD needs \(\partial E / \partial W^{(\ell)}\) for every layer. Backpropagation computes all gradients efficiently by traversing the computation graph:

\(\mathbf{x}\)
input
\(\mathbf{z}^{(1)} = W^{(1)}\mathbf{x}+\mathbf{b}^{(1)}\)
\(\mathbf{h} = g(\mathbf{z}^{(1)})\)
\(a = \mathbf{w}^{(2)T}\mathbf{h}+b^{(2)}\)
\(\hat{y} = \sigma(a)\)
\(E\)
Forward pass: compute values left → right
Backward pass: compute gradients right → left
using the chain rule

Computation Graph — Details

  • The model shown above is a simple 2-layer binary classification network — one hidden layer followed by a single output.
  • This can be seen from the sigmoid \(\sigma\) at the end, which squashes the output into \((0,1)\) to produce a probability.
  • The activation function \(g\) in the hidden layer is typically ReLU: \(g(z) = \max(0, z)\).
  • The loss \(E(\hat{y})\) is the binary cross-entropy averaged over training data: \[ E = -\frac{1}{N}\sum_{n=1}^{N}\Big[t_n \ln \hat{y}_n + (1 - t_n)\ln(1-\hat{y}_n)\Big] \] where \(t_n \in \{0,1\}\) are the class labels and \(\hat{y}_n = \sigma\big(\mathbf{w}^{(2)T} g(W^{(1)}\mathbf{x}_n + \mathbf{b}^{(1)}) + b^{(2)}\big)\).
  • Dimensionalities: \(\mathbf{x} \in \mathbb{R}^{D}\), \(W^{(1)} \in \mathbb{R}^{H \times D}\), \(\mathbf{b}^{(1)} \in \mathbb{R}^{H}\), \(\mathbf{z}^{(1)}, \mathbf{h} \in \mathbb{R}^{H}\), \(\mathbf{w}^{(2)} \in \mathbb{R}^{H}\), \(b^{(2)}, a, \hat{y} \in \mathbb{R}\).

Backpropagation: The Chain Rule

The chain rule factors the gradient into local derivatives:   \(\displaystyle\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial w}\)

Concrete 2-layer example (binary classification, cross-entropy loss \(E = -[t\ln\hat{y} + (1-t)\ln(1-\hat{y})]\)):

Forward
\(\mathbf{z} = W^{(1)}\mathbf{x}+\mathbf{b}^{(1)}\)
\(\mathbf{h} = g(\mathbf{z})\)
\(a = \mathbf{w}^{(2)T}\mathbf{h}+b^{(2)}\)
\(\hat{y} = \sigma(a)\)
Backward
\(\nabla_{\mathbf{w}^{(2)}} E = (\hat{y}-t)\,\mathbf{h}\)
\(\nabla_{b^{(2)}} E = \hat{y}-t\)
\(\nabla_{W^{(1)}} E = (\hat{y}-t)\,(\mathbf{w}^{(2)} \odot g'(\mathbf{z}))\,\mathbf{x}^T\)
\(\nabla_{\mathbf{b}^{(1)}} E = (\hat{y}-t)\,(\mathbf{w}^{(2)} \odot g'(\mathbf{z}))\)
Each layer needs only the gradient from above and its own local derivatives.
Total cost: a small constant multiple of the forward pass.

Notation: \(\odot\) denotes the element-wise (Hadamard) product: \((\mathbf{a} \odot \mathbf{b})_i = a_i \, b_i\). Here \(g'(\mathbf{z})\) is the vector \(\big(g'(z_1), \ldots, g'(z_H)\big)\).
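
The four backward formulas can be checked numerically. The following is a minimal sketch in plain Python, assuming \(g\) = ReLU and a single training sample; the sizes, the seed, and all function names are illustrative choices, not part of the lecture. It compares the analytic \(\nabla_{W^{(1)}} E\) with a central finite-difference estimate.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, w2, b2):
    """Forward pass; returns the intermediates the backward pass needs."""
    z = [sum(W1[j][k] * x[k] for k in range(len(x))) + b1[j] for j in range(len(b1))]
    h = [max(0.0, zj) for zj in z]                     # g = ReLU
    a = sum(w2[j] * h[j] for j in range(len(h))) + b2
    return z, h, sigmoid(a)

def loss(y_hat, t):
    """Binary cross-entropy for a single sample."""
    return -(t * math.log(y_hat) + (1.0 - t) * math.log(1.0 - y_hat))

def backward(x, t, z, h, y_hat, w2):
    """Gradients as listed in the 'Backward' column."""
    delta = y_hat - t                                  # dE/da
    grad_w2 = [delta * hj for hj in h]
    grad_b2 = delta
    # (y_hat - t) * (w2 ⊙ g'(z)), with g'(z) = 1 for z > 0 and 0 otherwise
    grad_b1 = [delta * w2[j] * (1.0 if z[j] > 0 else 0.0) for j in range(len(z))]
    grad_W1 = [[gj * xk for xk in x] for gj in grad_b1]  # outer product with x
    return grad_W1, grad_b1, grad_w2, grad_b2

# Small random problem (sizes, scale and seed are arbitrary illustration values)
rng = random.Random(0)
D, H = 3, 4
x = [rng.gauss(0, 1) for _ in range(D)]
W1 = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
b1 = [rng.gauss(0, 0.5) for _ in range(H)]
w2 = [rng.gauss(0, 0.5) for _ in range(H)]
b2, t = 0.0, 1.0

z, h, y_hat = forward(x, W1, b1, w2, b2)
grad_W1, grad_b1, grad_w2, grad_b2 = backward(x, t, z, h, y_hat, w2)

def finite_difference_W1(j, k, eps=1e-6):
    """Central-difference estimate of dE/dW1[j][k]."""
    W_plus = [row[:] for row in W1]
    W_minus = [row[:] for row in W1]
    W_plus[j][k] += eps
    W_minus[j][k] -= eps
    e_plus = loss(forward(x, W_plus, b1, w2, b2)[2], t)
    e_minus = loss(forward(x, W_minus, b1, w2, b2)[2], t)
    return (e_plus - e_minus) / (2 * eps)

max_error = max(abs(grad_W1[j][k] - finite_difference_W1(j, k))
                for j in range(H) for k in range(D))
print("largest deviation from finite differences:", max_error)
```

The analytic gradients reuse `z`, `h` and `y_hat` from the forward pass, which is why backpropagation stays cheap; the finite-difference check needs two extra forward passes per weight.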

Chain Rule — Deriving \(\nabla_{\mathbf{w}^{(2)}} E\)

Work through the gradient of \(E\) w.r.t. the second-layer weights \(\mathbf{w}^{(2)}\) step by step.

Step 1: Apply the chain rule through the computation graph:

\[ \frac{\partial E}{\partial w^{(2)}_j} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial w^{(2)}_j} \]

Step 2: Compute each factor:

  • \(\displaystyle\frac{\partial E}{\partial \hat{y}} = -\frac{t}{\hat{y}} + \frac{1-t}{1-\hat{y}} = \frac{\hat{y}-t}{\hat{y}(1-\hat{y})}\)
  • \(\displaystyle\frac{\partial \hat{y}}{\partial a} = \sigma'(a) = \sigma(a)(1-\sigma(a)) = \hat{y}(1-\hat{y})\)
  • \(\displaystyle\frac{\partial a}{\partial w^{(2)}_j} = h_j\)   (since \(a = \sum_j w^{(2)}_j h_j + b^{(2)}\))

Chain Rule — Deriving \(\nabla_{\mathbf{w}^{(2)}} E\) (cont.)

Step 3: Multiply — the \(\hat{y}(1-\hat{y})\) cancels:

\[ \frac{\partial E}{\partial w^{(2)}_j} = \frac{\hat{y}-t}{\hat{y}(1-\hat{y})} \;\hat{y}(1-\hat{y}) \; h_j = (\hat{y}-t)\,h_j \]

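In vector form: \(\nabla_{\mathbf{w}^{(2)}} E = (\hat{y}-t)\,\mathbf{h}\). This clean cancellation arises because the sigmoid output is matched to the cross-entropy loss (softmax with cross-entropy behaves the same way); other output/loss pairings generally leave an extra factor.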
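
The same steps extend one layer down. As a sketch, using only the symbols already defined (with \(z_j = \sum_k W^{(1)}_{jk} x_k + b^{(1)}_j\) and \(h_j = g(z_j)\)), the first-layer gradient quoted earlier follows from one more application of the chain rule:

```latex
\begin{align}
\frac{\partial E}{\partial W^{(1)}_{jk}}
  &= \frac{\partial E}{\partial a}\,
     \frac{\partial a}{\partial h_j}\,
     \frac{\partial h_j}{\partial z_j}\,
     \frac{\partial z_j}{\partial W^{(1)}_{jk}} \\
  &= (\hat{y}-t)\; w^{(2)}_j\; g'(z_j)\; x_k ,
\end{align}
```

where \(\partial E/\partial a = \hat{y}-t\) is the product of the first two factors from Step 2. Collecting all \(j, k\) gives \(\nabla_{W^{(1)}} E = (\hat{y}-t)\,(\mathbf{w}^{(2)} \odot g'(\mathbf{z}))\,\mathbf{x}^T\).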

Backprop in Detail: Regularized Loss

\[L = \underbrace{(\hat y - y)^2}_{L_\text{data}} + \underbrace{\lambda\,w^2}_{L_\text{reg}}, \qquad \hat y = w\,x\]
[Figure: computation graph with nodes x, w, y, z, L_d, L_r, L]
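PyTorch code
import torch

x, y = 2.0, 3.0                            # input and target (plain floats, no gradient)
w = torch.tensor(1., requires_grad=True)   # learnable weight
lam = 0.1                                  # regularization strength λ
z = w * x                                  # prediction ŷ = w x
L_d = (z - y)**2                           # data term
L_r = lam * w**2                           # regularization term
L = L_d + L_r                              # total loss
L.backward()                               # fills w.grad with ∂L/∂w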
[Interactive table: for each variable (x, y, w, z, L_d, L_r, L), the columns .data, .grad_fn and ∂L/∂· are filled in step by step during the forward and backward pass.]

In practice, you wrap this in a torch.nn.Module. Its .parameters() method returns all learnable tensors, which are then passed to an optimizer (e.g. torch.optim.Adam(model.parameters())).
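
For reference, the forward and backward pass through this small graph can be traced by hand. The sketch below uses plain Python floats instead of PyTorch tensors, with the same numbers as the code above; the `dL_d...` variable names are illustrative.

```python
x, y, w, lam = 2.0, 3.0, 1.0, 0.1

# Forward pass: values, left to right
z = w * x                 # 2.0
L_d = (z - y) ** 2        # data term
L_r = lam * w ** 2        # regularization term
L = L_d + L_r             # total loss, approximately 1.1

# Backward pass: gradients, right to left
dL_dL = 1.0
dL_dLd = dL_dL * 1.0      # L = L_d + L_r, so both terms get gradient 1
dL_dLr = dL_dL * 1.0
dL_dz = dL_dLd * 2.0 * (z - y)
# w feeds two branches (z and L_r); their contributions add up
dL_dw = dL_dz * x + dL_dLr * 2.0 * lam * w

print("L =", L, " dL/dz =", dL_dz, " dL/dw =", dL_dw)
```

Analytically, \(\partial L/\partial w = 2x(wx - y) + 2\lambda w = -4 + 0.2 = -3.8\), which is the value `w.grad` should hold after `L.backward()`.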

Making Gradients Flow

Activations, initialization, and normalization.

The Vanishing & Exploding Gradient Problem

Backprop multiplies local derivatives through layers. If each factor is < 1, the product vanishes; if > 1, it explodes.

Vanishing gradients
  • Sigmoid: \(\sigma'(z) \leq 0.25\)
  • Through \(L\) layers: \(0.25^L\) → \(10^{-6}\) for \(L{=}10\)
  • Early layers stop learning
Exploding gradients
  • Large weights → derivatives > 1
  • Product grows exponentially
  • Loss becomes NaN
Both problems stem from multiplicative gradient flow through layers.
Good practices (next slides):
  • ReLU activation variants
  • He initialization
  • Input normalization
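
The arithmetic behind both failure modes is a plain geometric product. A minimal sketch, assuming every layer contributes the same factor (a simplification; the function name is illustrative):

```python
def gradient_scale(per_layer_factor, num_layers):
    """Gradient magnitude after num_layers layers if each multiplies by the same factor."""
    return per_layer_factor ** num_layers

for factor in (0.25, 1.0, 1.5):
    scales = [gradient_scale(factor, L) for L in (1, 5, 10, 20)]
    print(f"factor {factor}: {scales}")
```

With the sigmoid bound of 0.25 the scale after 10 layers is \(0.25^{10} \approx 9.5 \times 10^{-7}\); a factor of 1.5 grows to about 58 over the same depth.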

Dying ReLU and Its Variants

ReLU mitigates vanishing gradients (\(g'(z)=1\) for \(z{>}0\)), but introduces the dying ReLU problem: a neuron whose pre-activation is negative (\(z{<}0\)) for every input receives exactly zero gradient and can no longer update.

LeakyReLU
\(\max(0.1z,\; z)\)
Small slope for \(z{<}0\).
Never fully dead.
ELU
\(e^z - 1\) for \(z{<}0\)
Smooth, pushes mean
activations toward zero.
GELU
\(z \, \Phi(z)\)
Smooth approximation to ReLU.
Default in transformers.

\(\Phi(z) = P(Z \leq z)\) is the CDF of the standard normal distribution.

In practice, ReLU works for most tasks. GELU is the default for modern transformers.
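
The four activations can be written down directly from the formulas above. This is a minimal plain-Python sketch for scalars; the ELU branch for \(z \geq 0\) (identity) and the use of `math.erf` for \(\Phi\) are standard choices that the slide does not spell out.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.1):
    # alpha = 0.1 follows the slide; other slopes are also common
    return max(alpha * z, z)

def elu(z):
    return z if z >= 0 else math.exp(z) - 1.0

def std_normal_cdf(z):
    """Phi(z) = P(Z <= z) for a standard normal Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z):
    return z * std_normal_cdf(z)

def numerical_derivative(f, z, eps=1e-6):
    """Central-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, -0.5, 0.5, 2.0):
    row = [f"{name}'={numerical_derivative(f, z):.3f}"
           for name, f in [("relu", relu), ("leaky", leaky_relu), ("elu", elu), ("gelu", gelu)]]
    print(z, row)
```

For negative inputs the ReLU derivative is zero, while the three variants keep a non-zero gradient, which is what prevents neurons from dying.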

Demo: Activations and Their Gradients


Weight Initialization: Xavier and He

Even with ReLU, bad initialization kills training. If weights are too large, activations explode. Too small, signal vanishes.

Goal: keep the variance of activations and gradients roughly constant across layers.
Xavier / Glorot
\[\text{Var}(W_{ij}^{(\ell)}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}\] Designed for sigmoid & tanh.
Balances forward and backward variance.
He / Kaiming
\[\text{Var}(W_{ij}^{(\ell)}) = \frac{2}{n_{\text{in}}}\] Designed for ReLU.
Accounts for half the neurons being zeroed.

Each \(W_{ij}^{(\ell)}\) is drawn independently from a zero-mean distribution (e.g. \(\mathcal{N}(0, \text{Var})\)). Here \(n_{\text{in}}\) is the number of columns of \(W^{(\ell)}\) (i.e. the layer input dimension), \(n_{\text{out}}\) the number of rows (output dimension). Biases \(\mathbf{b}^{(\ell)}\) are initialized to zero — symmetry is broken by the random weights.

Weight Initialization — Practical Notes

  • PyTorch's nn.Linear default: uses kaiming_uniform_(a=sqrt(5)), which reduces to \(W_{ij} \sim \mathcal{U}(-1/\sqrt{n_{\text{in}}},\; 1/\sqrt{n_{\text{in}}})\). This is a historical convention inherited from Lua Torch — not a principled choice.
  • He init is the theoretically motivated choice for ReLU: it accounts for the fact that ReLU zeroes out roughly half the neurons, so the variance needs to be doubled. The same argument applies approximately to GELU and other ReLU-like activations.
  • Xavier is designed for sigmoid/tanh: it balances forward and backward variance by averaging over \(n_{\text{in}}\) and \(n_{\text{out}}\). Appropriate when the activation doesn't zero out half the signal.
  • In practice, most people don't override the default: modern architectures use BatchNorm or LayerNorm after each layer, which rescales activations and largely compensates for suboptimal init. Combined with Adam (which adapts per-parameter learning rates), the exact init matters less.
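
To compare the three rules on an equal footing, the sketch below computes their weight variances and checks the He value against an empirical sample. It is plain Python; the function names are illustrative, and the uniform case uses the standard result that \(\mathcal{U}(-a, a)\) has variance \(a^2/3\).

```python
import math
import random

def he_variance(n_in):
    return 2.0 / n_in

def xavier_variance(n_in, n_out):
    return 2.0 / (n_in + n_out)

def uniform_default_variance(n_in):
    """Variance of U(-1/sqrt(n_in), 1/sqrt(n_in)), i.e. a^2 / 3 with a = 1/sqrt(n_in)."""
    return 1.0 / (3.0 * n_in)

def sample_he_matrix(n_out, n_in, rng):
    """Draw an (n_out x n_in) matrix with zero-mean Gaussian entries of He variance."""
    std = math.sqrt(he_variance(n_in))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

n_in = n_out = 128
print("He:             ", he_variance(n_in))
print("Xavier:         ", xavier_variance(n_in, n_out))
print("uniform default:", uniform_default_variance(n_in))

rng = random.Random(0)
W = sample_he_matrix(n_out, n_in, rng)
entries = [w for row in W for w in row]
empirical_variance = sum(w * w for w in entries) / len(entries)
print("empirical He variance:", empirical_variance)
```

For a square layer, He is twice the Xavier variance and six times the variance of the uniform default, so the default starts with noticeably smaller weights than the ReLU analysis suggests.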

Demo: He Initialization Explorer

[Interactive demo: activation histograms for the input \(\mathbf{x}\), the hidden layers \(\mathbf{h}^{(1)}\) to \(\mathbf{h}^{(4)}\), and the output \(\mathbf{y}\). Controls: Scale (default 1.00×), Input μ (default 0.0), Input σ (default 1.0).]

Scale = 1.0× is exact He init  |  4 hidden layers (ReLU) + linear output, 128 neurons each

He Init Demo — Setup Details

  • Architecture: a fully connected network with layer sizes \([128,\; 128,\; 128,\; 128,\; 128,\; 128]\) — an input of dimension 128, 4 hidden layers with ReLU activation (each 128 neurons), and a linear output layer (128 neurons, no activation).
  • Weight matrices: 5 matrices, each \(W^{(\ell)} \in \mathbb{R}^{128 \times 128}\) (16,384 parameters per layer). With He init: \(W_{ij}^{(\ell)} \sim \mathcal{N}(0,\; 2/128) = \mathcal{N}(0,\; 0.0156)\).
  • Biases: each \(\mathbf{b}^{(\ell)} \in \mathbb{R}^{128}\), initialized to zero.
  • Input: a single random vector \(\mathbf{x} \in \mathbb{R}^{128}\), sampled from \(\mathcal{N}(\mu,\; \sigma^2)\) where \(\mu\) and \(\sigma\) are controlled by the sliders. Default: \(\mathcal{N}(0, 1)\).
  • Scale slider: multiplies the He standard deviation by the shown factor. At 1.0× the init is exact He; smaller values shrink weights (activations vanish), larger values grow them (activations explode).
  • Histograms: show the distribution of activations \(\mathbf{h}^{(\ell)} = \text{ReLU}(W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)})\) at each layer after a single forward pass. With correct He init, the histograms should have similar spread across all layers.
  • What to observe: try scaling down (activations collapse to zero in later layers) or scaling up (activations grow exponentially). Also try shifting the input mean away from zero to see how unnormalized inputs interact with initialization.
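
The effect of the scale slider can be reproduced offline. The following is a rough sketch of the demo's forward pass, not its actual implementation: plain Python, Gaussian He weights, zero biases, input \(\mathcal{N}(0,1)\), and only the 4 ReLU layers (the linear output layer is left out). It tracks the mean square of the activations per layer.

```python
import math
import random

def activation_mean_squares(scale, width=128, hidden_layers=4, seed=0):
    """Mean of h_i^2 for the input and each hidden layer, for He std multiplied by `scale`."""
    rng = random.Random(seed)
    std = scale * math.sqrt(2.0 / width)          # He standard deviation times slider factor
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    mean_squares = [sum(v * v for v in h) / width]
    for _ in range(hidden_layers):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        # h <- ReLU(W h), biases are zero
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in W]
        mean_squares.append(sum(v * v for v in h) / width)
    return mean_squares

for scale in (0.5, 1.0, 2.0):
    values = activation_mean_squares(scale)
    print(f"scale {scale}: " + ", ".join(f"{v:.3g}" for v in values))
```

In expectation each ReLU layer multiplies the mean square by `scale**2`, so at 1.0× it stays roughly constant, at 0.5× it shrinks by about a factor 4 per layer, and at 2.0× it grows by about a factor 4 per layer. A single random draw fluctuates around these values.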

LayerNorm in This Demo

  • What it does: after each layer, the activation vector \(\mathbf{h} \in \mathbb{R}^{128}\) is normalized to zero mean and unit variance: \[\hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \mu = \frac{1}{D}\sum_i h_i, \quad \sigma^2 = \frac{1}{D}\sum_i (h_i - \mu)^2\]
  • Why LayerNorm and not BatchNorm? BatchNorm normalizes across a batch of samples (computing mean/variance over index \(n\) for each neuron \(i\)). With a single sample (\(N{=}1\)), those statistics are meaningless. LayerNorm normalizes across the feature dimension (the 128 neurons), which works for any batch size.
  • Effect in the demo: toggle LayerNorm on, then try extreme scale values. The histograms stay well-behaved regardless of initialization — LayerNorm rescales everything back to \(\sigma \approx 1\) at every layer.
  • In practice: BatchNorm (normalizing across the batch for each neuron) is standard in CNNs. LayerNorm (normalizing across features for each sample) is standard in transformers. Both achieve the same goal: preventing activation drift across layers.
  • Full LayerNorm also includes learnable scale \(\gamma\) and shift \(\beta\) parameters: \(y_i = \gamma \hat{h}_i + \beta\). These are omitted in this demo for simplicity.
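
A minimal sketch of the normalization formula above, without the learnable \(\gamma\) and \(\beta\). It is plain Python on a single activation vector; the value `eps=1e-5` is an assumed default, not taken from the demo.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize one activation vector to zero mean and (almost) unit variance."""
    D = len(h)
    mu = sum(h) / D
    var = sum((v - mu) ** 2 for v in h) / D
    return [(v - mu) / math.sqrt(var + eps) for v in h]

h = [1.0, 2.0, 3.0, 4.0]
h_hat = layer_norm(h)
mean_after = sum(h_hat) / len(h_hat)
var_after = sum((v - mean_after) ** 2 for v in h_hat) / len(h_hat)
print(h_hat, mean_after, var_after)
```

The statistics are computed over the features of one sample, so the result does not depend on the batch size. The variance ends up marginally below 1 because of `eps`.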

Demo: Gradient Flow Explorer

[Interactive demo: gradient histograms for \(\partial E / \partial \mathbf{x}\), \(\partial E / \partial \mathbf{h}^{(1)}\) to \(\partial E / \partial \mathbf{h}^{(4)}\), and \(\partial E / \partial \mathbf{y}\). Controls: Scale (default 1.00×) and activation function.]

Random unit gradient injected at output  |  Gradient histograms per layer  |  4 hidden layers + linear output, dim 128

Gradient Flow Demo — Details

  • Same architecture as the He init demo: layer sizes \([128,\; 128,\; 128,\; 128,\; 128,\; 128]\) — 4 hidden layers with the selected activation function, plus a linear output layer.
  • Forward pass: a single random input \(\mathbf{x} \sim \mathcal{N}(0,1)\) is propagated through the network. Weights are initialized with He init (scaled by the slider).
  • Gradient injection: a random unit vector \(\hat{\mathbf{g}} \in \mathbb{R}^{D}\) (\(\|\hat{\mathbf{g}}\| = 1\)) is injected at the output as \(\partial E / \partial \mathbf{y}\). This is independent of the forward pass, so the He analysis holds exactly. It measures the network's gradient attenuation without conflating it with a specific loss.
  • Histograms: show the distribution of \(\partial E / \partial \mathbf{h}^{(\ell)}\) at each layer. With healthy gradient flow, the distributions should have similar scale across layers.
  • Summary bar chart: shows the standard deviation of the gradient at each layer. A flat bar chart means gradients flow evenly; a decaying one means vanishing gradients.
  • What to observe: switch to Sigmoid or Tanh and notice how gradient scale shrinks exponentially toward the input (vanishing gradients). ReLU maintains scale because its derivative is exactly 1 for active neurons.
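
The comparison between ReLU and Sigmoid can be sketched in a few lines. This is a simplified stand-in for the demo, under stated assumptions: plain Python, He-initialized Gaussian weights, zero biases, and the random unit gradient injected directly at \(\mathbf{h}^{(4)}\) (the linear output layer is skipped). All names are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# activation name -> (function, derivative)
ACTIVATIONS = {
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
}

def gradient_norms(activation, width=128, hidden_layers=4, seed=0):
    """Gradient norm at h^(4), h^(3), ..., x for a unit gradient injected at h^(4)."""
    act, act_prime = ACTIVATIONS[activation]
    rng = random.Random(seed)
    std = math.sqrt(2.0 / width)                       # He initialization
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    weights, pre_activations = [], []
    for _ in range(hidden_layers):                     # forward pass, storing z per layer
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        z = [sum(w * v for w, v in zip(row, h)) for row in W]
        h = [act(v) for v in z]
        weights.append(W)
        pre_activations.append(z)
    g = [rng.gauss(0.0, 1.0) for _ in range(width)]    # random direction ...
    norm = math.sqrt(sum(v * v for v in g))
    g = [v / norm for v in g]                          # ... scaled to unit length
    norms = [1.0]
    for W, z in zip(reversed(weights), reversed(pre_activations)):
        delta = [gi * act_prime(zi) for gi, zi in zip(g, z)]          # dE/dz
        g = [sum(W[i][j] * delta[i] for i in range(width))            # W^T delta
             for j in range(width)]
        norms.append(math.sqrt(sum(v * v for v in g)))
    return norms

relu_norms = gradient_norms("relu")
sigmoid_norms = gradient_norms("sigmoid")
print("relu:   ", [f"{n:.3g}" for n in relu_norms])
print("sigmoid:", [f"{n:.3g}" for n in sigmoid_norms])
```

With ReLU the norm should stay of order one all the way down to the input, whereas with Sigmoid each layer multiplies it by a factor well below one, since \(\sigma'(z) \leq 0.25\).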

Input Normalization

He/Xavier assume inputs have ~zero mean, ~unit variance.   Normalize:   \(\displaystyle\hat{x}_j = \frac{x_j - \mu_j}{\sigma_j}\)   for each feature \(j\)
Without normalization
  • Feature scales differ by orders of magnitude (e.g. mass in eV vs distance in Mpc)
  • Variance propagation analysis breaks → initialization is wrong
  • Gradients dominated by large-scale features
With normalization
  • All features contribute equally at initialization
  • He/Xavier variance analysis holds
  • Faster, more stable convergence

Always normalize inputs. Compute \(\mu, \sigma\) on the training set and apply the same transform to test data.
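
A minimal sketch of this recipe in plain Python: the statistics come from the training rows only and are then reused unchanged for the test rows. The function names and the toy numbers are illustrative, and a constant feature (\(\sigma_j = 0\)) would need special handling that is omitted here.

```python
import math

def fit_standardizer(train_rows):
    """Per-feature mean and standard deviation, computed on the training set only."""
    n, d = len(train_rows), len(train_rows[0])
    mu = [sum(row[j] for row in train_rows) / n for j in range(d)]
    sigma = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in train_rows) / n)
             for j in range(d)]
    return mu, sigma

def standardize(rows, mu, sigma):
    """Apply x_hat_j = (x_j - mu_j) / sigma_j to every row."""
    return [[(row[j] - mu[j]) / sigma[j] for j in range(len(mu))] for row in rows]

# Two features on very different scales (toy values)
train = [[1e-3, 100.0], [2e-3, 300.0], [3e-3, 200.0]]
test = [[2e-3, 200.0]]

mu, sigma = fit_standardizer(train)
train_hat = standardize(train, mu, sigma)
test_hat = standardize(test, mu, sigma)    # same mu, sigma as for the training data
print(train_hat)
print(test_hat)
```

After the transform both features have zero mean and unit variance on the training set, regardless of their original units.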

Demo: Why Normalization Matters

MLP [1,64,64,64,64,1] trained on \(y = \sin(2\pi x)\). Shift the data and watch training break.

Normalization Demo — Details

  • Task: fit a 1D regression \(y = \sin(2\pi x)\) with 60 noisy training points (\(\sigma = 0.45\)) on \(x \in [-1, 1]\).
  • Network: MLP with layer sizes \([1,\; 64,\; 64,\; 64,\; 64,\; 1]\), GELU activations, trained with SGD (no momentum).
  • x-offset slider: shifts all inputs by a constant: \(x \to x + \Delta x\). This moves the data away from zero, breaking the assumption that inputs are centered. With large offsets, the first-layer pre-activations \(W^{(1)} x + b^{(1)}\) are dominated by the bias-like term \(W^{(1)} \Delta x\), and learning becomes difficult.
  • y-offset slider: shifts all targets by a constant: \(y \to y + \Delta y\). The network must now represent a non-zero mean output, which requires the biases to compensate.
  • What to observe: with zero offsets, training converges smoothly. Shift x or y by ±2 and training either fails or converges much more slowly. This is why input normalization (\(\hat{x} = (x - \mu)/\sigma\)) and target normalization are standard practice.
  • Connection to He init: the He variance analysis assumes \(\text{Var}(x) \approx 1\) and \(\mathbb{E}[x] \approx 0\). When inputs are shifted or scaled, the variance of activations at each layer no longer matches the design assumptions, leading to poor gradient flow from the very first step.

Making Training Work

Momentum, Adam, learning rates, and when to stop.

SGD with Momentum

SGD is noisy due to mini-batch sampling. Idea: smooth the gradients using an exponential moving average (EMA, defined next).

\[\mathbf{m} \leftarrow \beta\,\mathbf{m} + (1-\beta)\,\nabla E \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{m}\] \(\beta\): smoothing factor (typical 0.9)  |  \(\eta\): learning rate  |  \(\mathbf{m}\): EMA of gradients
SGD without momentum
Zig-zags across the valley.
Slow convergence along the long axis.
SGD with momentum
Velocity accumulates along consistent direction.
Noise gets damped.
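
The update rule above translates into a two-line function. The sketch below is plain Python; the elongated quadratic valley \(E = \tfrac{1}{2}(w_1^2 + 25\,w_2^2)\) and the hyperparameters are arbitrary illustration choices, and the gradients are exact (no mini-batch noise) to keep the example deterministic.

```python
def sgd_momentum_step(w, m, grad, lr=0.1, beta=0.9):
    """One update: m <- beta*m + (1-beta)*grad, then w <- w - lr*m."""
    m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, grad)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m

def valley_gradient(w):
    """Gradient of E(w) = 0.5 * (w1^2 + 25 * w2^2)."""
    return [w[0], 25.0 * w[1]]

w, m = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w, m = sgd_momentum_step(w, m, valley_gradient(w), lr=0.05, beta=0.9)
print("final w:", w)
```

Note that this is the EMA form used on the slide. Some libraries accumulate the gradient without the \((1-\beta)\) factor, which rescales the effective learning rate.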

Exponential Moving Average (EMA)

Both SGD with momentum and Adam (see next slides) use EMAs. Core idea: smooth a noisy signal by blending each new value with the running average.

\[m_0 = 0, \qquad m_t = \beta\,m_{t-1} + (1-\beta)\,g_t\] \(t\): step counter.   Higher \(\beta\) → more smoothing, but more lag.   Effective window ≈ \(1/(1-\beta)\) steps.
Example: \(\beta = 0.9\) gives a window of ≈ 10 steps.

EMA — Details

  • What it does: the EMA replaces each noisy gradient \(g_t\) with a smoothed estimate \(m_t = \beta\,m_{t-1} + (1-\beta)\,g_t\). This is a weighted average of all past gradients, where recent gradients carry more weight. The effective averaging window is \(\approx 1/(1-\beta)\) steps.
  • Bias correction: since \(m_0 = 0\), early values of \(m_t\) are biased towards zero (the EMA hasn't "warmed up" yet). Dividing by \((1-\beta^t)\) corrects this: \(\hat{m}_t = m_t/(1-\beta^t)\). The correction is large for small \(t\) and vanishes as \(t \to \infty\).
  • Where EMA appears:
    • SGD with momentum: EMA of gradients \(\to\) smoother update direction.
    • Adam, first moment: EMA of gradients (with bias correction) \(\to\) smoothed update direction.
    • Adam, second moment: EMA of squared gradients \(\to\) per-parameter step-size scaling.
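
The warm-up bias and its correction are easy to see on a constant signal. A minimal plain-Python sketch (the function name is illustrative):

```python
def ema(values, beta=0.9, bias_correction=False):
    """Exponential moving average with m_0 = 0, optionally divided by (1 - beta^t)."""
    m = 0.0
    smoothed = []
    for t, g in enumerate(values, start=1):
        m = beta * m + (1.0 - beta) * g
        smoothed.append(m / (1.0 - beta ** t) if bias_correction else m)
    return smoothed

signal = [1.0] * 5
raw = ema(signal)
corrected = ema(signal, bias_correction=True)
print("raw:      ", raw)
print("corrected:", corrected)
```

For a constant input of 1 the raw EMA equals \(1 - \beta^t\), so it starts at 0.1 and only slowly approaches 1, while the corrected value is 1 from the first step.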

Adam Optimizer

SGD with momentum smooths gradients. Adam additionally adapts the step size per parameter.

First moment (momentum):\(\mathbf{m} \leftarrow \beta_1\,\mathbf{m} + (1-\beta_1)\,\nabla E\)
Second moment (curvature):\(\mathbf{v} \leftarrow \beta_2\,\mathbf{v} + (1-\beta_2)\,(\nabla E)^2\)
Bias correction:\(\hat{\mathbf{m}} = \mathbf{m}/(1-\beta_1^t), \quad \hat{\mathbf{v}} = \mathbf{v}/(1-\beta_2^t)\)
Update:\(\mathbf{w} \leftarrow \mathbf{w} - \eta\,\hat{\mathbf{m}}\,/\,(\sqrt{\hat{\mathbf{v}}}+\epsilon)\)
All operations (square, division, sqrt) are element-wise, i.e. per parameter.
  • \(\hat{\mathbf{m}}\): EMA of gradients — smooths noise, gives a reliable update direction.
  • \(\hat{\mathbf{v}}\): EMA of squared gradients — rescales each parameter so that steep directions get smaller steps and flat directions get larger ones.
Defaults: \(\beta_1{=}0.9,\; \beta_2{=}0.999,\; \epsilon{=}10^{-8}\)
Adam is the default optimizer
in modern deep learning.

Adam — Details

  • First moment \(\hat{\mathbf{m}}\): EMA of gradients with bias correction. The same mechanism as SGD with momentum (apart from the bias correction): it smooths out mini-batch noise to give a more stable update direction.
  • Second moment \(\hat{\mathbf{v}}\): EMA of squared gradients. Tracks \(\mathbb{E}[g^2]\) per parameter. Near a minimum where \(\mathbb{E}[g] \approx 0\), we have \(\mathbb{E}[g^2] \approx \text{Var}(g)\), which relates to curvature: steep directions produce large gradients, flat directions produce small ones. Dividing by \(\sqrt{\hat{\mathbf{v}}}\) means parameters move relatively faster in flat directions (small \(g^2\)) and slower in steep directions (large \(g^2\)), leading to a more balanced evolution across all parameters.
  • Caveat: away from the minimum, \(\mathbb{E}[g^2]\) mixes curvature with the signal (squared mean gradient), so the curvature interpretation is less clean. The per-parameter rescaling still helps empirically.
  • The update \(\hat{\mathbf{m}}/\sqrt{\hat{\mathbf{v}}}\): can loosely be seen as a diagonal approximation to natural gradient descent, in which each parameter gets its own effective learning rate \(\eta / (\sqrt{\hat{v}_i} + \epsilon)\).
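
The four Adam equations map one to one onto code. The following is a minimal sketch in plain Python operating on lists of parameters, with the default hyperparameters quoted earlier; the quadratic test problem \(E = \tfrac{1}{2}(w_1^2 + 100\,w_2^2)\) and the learning rate are illustrative choices.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1). Returns new (w, m, v)."""
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]        # first moment
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]   # second moment
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]                            # bias correction
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps) for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

def quadratic_loss(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
initial_loss = quadratic_loss(w)
for t in range(1, 201):
    grad = [w[0], 100.0 * w[1]]          # exact gradient of the quadratic
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
final_loss = quadratic_loss(w)
print("loss:", initial_loss, "->", final_loss)
```

On the very first step the bias-corrected ratio is \(g/(|g| + \epsilon)\), so every parameter moves by almost exactly \(\eta\) regardless of the gradient magnitude. This is the per-parameter rescaling at work: the steep and the flat direction start out with equal step sizes.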

Demo: Adam Optimizer

Polynomial regression on noisy sine data. Model: \(\hat y = w_1\,x + w_2\,x^3\). Mini-batch gradient descent.

Adam Demo — Details

  • Task: polynomial regression \(\hat{y} = w_1\,x + w_2\,x^3\) on 500 noisy samples from \(y = \sin(\pi x)\) with \(\sigma = 0.2\), \(x \in [-1,1]\).
  • Loss surface: the contour plot shows \(\sqrt{\text{MSE}(w_1, w_2)}\). Because the model is linear in \(w_1, w_2\), the true loss is a quadratic bowl. The green star marks the analytic optimum. Contour levels are at \(\sqrt{L} = 1, 2, \ldots, 10\).
  • Left panel: optimizer trajectory in \((w_1, w_2)\) space, starting from the origin. The red dot is the current position.
  • Right panel: mini-batch loss (evaluated on the sampled batch at each step) vs. step number.
  • Sliders: \(\eta\) is the learning rate (log scale), \(\beta_1\) controls the first moment (momentum), \(\beta_2\) controls the second moment (adaptive scaling), and \(B\) is the mini-batch size (powers of 2).
  • Presets (Next button): cycles through four configurations — (1) vanilla SGD (\(\beta_1{=}\beta_2{=}0\)), (2) SGD + momentum (\(\beta_1{=}0.9\)), (3) RMSprop-like (\(\beta_2{=}0.999\)), (4) full Adam (\(\beta_1{=}0.9, \beta_2{=}0.999\)).
  • What to observe: vanilla SGD is noisy and slow. Momentum smooths the trajectory. The second moment adapts step sizes to the loss geometry. Full Adam combines both for fast, stable convergence.
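
Because the model is linear in \(w_1, w_2\), the analytic optimum follows from the normal equations of a 2×2 least-squares problem. The sketch below is plain Python and rests on assumptions: the inputs are drawn uniformly from \([-1,1]\) and the seed is arbitrary, neither of which is specified in the demo description.

```python
import math
import random

def make_data(n=500, noise=0.2, seed=0):
    """Noisy samples of y = sin(pi x); uniform x is an assumption."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(math.pi * x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def analytic_optimum(xs, ys):
    """Solve (Phi^T Phi) w = Phi^T y for the features phi = (x, x^3)."""
    a11 = sum(x ** 2 for x in xs)
    a12 = sum(x ** 4 for x in xs)
    a22 = sum(x ** 6 for x in xs)
    b1 = sum(x * y for x, y in zip(xs, ys))
    b2 = sum(x ** 3 * y for x, y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def mse_gradient(w1, w2, xs, ys):
    """Gradient of the mean squared error with respect to (w1, w2)."""
    n = len(xs)
    residuals = [w1 * x + w2 * x ** 3 - y for x, y in zip(xs, ys)]
    g1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    g2 = 2.0 / n * sum(r * x ** 3 for r, x in zip(residuals, xs))
    return g1, g2

xs, ys = make_data()
w1_opt, w2_opt = analytic_optimum(xs, ys)
grad_at_optimum = mse_gradient(w1_opt, w2_opt, xs, ys)
print("optimum:", w1_opt, w2_opt, " gradient there:", grad_at_optimum)
```

The full-batch gradient vanishes at this point, which is what any of the four optimizer presets should approach, up to mini-batch noise.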

Learning Rate Schedules

Even Adam needs a good learning rate. And the best rate changes during training.

Step decay
Drop by factor every \(N\) epochs
Cosine annealing
Smooth decrease following cosine
Warmup + decay
Ramp up, then decrease

Warmup prevents early instability; decay allows fine-grained convergence.

Learning rate scheduling is good practice, but it is a refinement rather than a prerequisite: it mainly helps reach convergent, low-noise results in the final stages of training.
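
The three schedules can be written as simple functions of the step counter. This is a sketch in plain Python with illustrative names and default values (drop factor, interval, linear warmup shape); it is not tied to any particular library's scheduler API.

```python
import math

def step_decay(epoch, base_lr, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def cosine_annealing(step, total_steps, base_lr, min_lr=0.0):
    """Smooth decrease from base_lr to min_lr following half a cosine period."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def warmup_cosine(step, total_steps, base_lr, warmup_steps):
    """Linear ramp-up for `warmup_steps` steps, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_annealing(step - warmup_steps, total_steps - warmup_steps, base_lr)

for step in (0, 5, 10, 50, 100):
    print(step,
          round(step_decay(step, 0.1), 5),
          round(cosine_annealing(step, 100, 0.1), 5),
          round(warmup_cosine(step, 100, 0.1, warmup_steps=10), 5))
```

The returned value would be assigned to the optimizer's learning rate before each update.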

Early Stopping

Monitor validation loss during training. Stop when it starts to rise.

One of the simplest and most effective regularization techniques.
Save the model at the best validation loss — everything after that is memorizing noise.

Why Deep Learning Works

Double descent, spectral bias, regularization, and the training trajectory.

Double Descent

Classical story: more parameters = more overfitting (U-shaped test error).

The surprise: test error decreases again when the model grows far beyond the interpolation threshold.

At the interpolation threshold, the model has just enough capacity to fit the training data exactly, so it is forced to memorize the noise.
Beyond it, many solutions exist, and SGD finds smooth ones. This challenges the classical bias-variance tradeoff.

Belkin et al., Reconciling modern machine-learning practice and the classical bias-variance trade-off, PNAS 2019  |  Nakkiran et al., Deep Double Descent, ICLR 2020

Double Descent — Details

  • Conceptual picture: double descent is primarily a conceptual framework. It has been demonstrated in concrete experiments (see references), but the clean U-then-descend shape depends on the model family and data.
  • Classical regime: in linear regression (Lecture 1), we saw the same effect — adding polynomial features first reduces error, then overfitting kicks in near the interpolation threshold.
  • Beyond interpolation: when the number of parameters far exceeds the number of data points, many different weight configurations achieve minimal training loss. Among these, gradient-based optimizers tend to select solutions with specific properties (e.g. small norm, smooth functions) that generalise well.
  • Model complexity (x-axis): this is a proxy for the effective capacity of the model (e.g. number of parameters, polynomial degree, network width). It is not the training epoch — each point on the x-axis represents a fully trained model of a given size.
  • Error (y-axis): this is the test error, i.e. the residual error evaluated on held-out test data, after the model has been fully optimised on the training data.

Why Overparameterization Works: Volume Arguments

At the interpolation threshold, there is essentially one solution — it must contort to fit noise. Beyond it, many interpolating solutions exist. Which one does SGD find?

The volume argument (Mingard et al., JMLR 2021)
  • SGD finds solutions with probability ≈ proportional to their volume in parameter space
  • Simple functions have exponentially more parameter configurations that implement them (higher degeneracy)
  • Therefore: SGD is biased toward simple, well-generalizing solutions — not by design, but by geometry
This explains why good solutions exist. But why does SGD find them?
The answer lies in the training dynamics — the order in which features are learned.

Further reading: Belkin et al. (PNAS 2019)  |  Mingard et al. (JMLR 2021)  |  Hastie et al. (Ann. Stat. 2022)  |  Rahaman et al. (ICML 2019)

Why Overparameterization Works: Spectral Bias

Neural networks learn low-frequency (smooth) features first, and only later fit high-frequency noise. This spectral bias means overparameterized models generalise well if training is stopped early enough.

Left: MLP fit (red) vs true sine (dashed). Train data (blue) & held-out validation data (orange). Right: train vs validation loss — the gap is overfitting.

Spectral Bias — Details

  • What is spectral bias? Neural networks trained with gradient descent learn low-frequency components of the target function before high-frequency ones. This is primarily a property of the training dynamics rather than of one specific architecture.
  • Demo setup: MLP with layer sizes \([1,\;64,\;64,\;64,\;64,\;1]\) and GELU activations, trained on 60 noisy samples from \(y = \sin(2\pi x)\) with \(\sigma = 0.45\), \(x \in [-1,1]\).
  • Left panel: the red curve shows the MLP prediction. Blue dots are training data, orange dots are held-out validation data, and the dashed line is the true sine function.
  • Right panel: training loss (blue) and validation loss (orange) vs. epoch. Training loss decreases monotonically. Validation loss first decreases (the network is learning the signal), then increases (the network is memorizing noise). The gap between the two curves indicates overfitting.
  • Why this matters for overparameterization: the network has far more parameters than data points, yet early in training it produces a good fit. The spectral bias of SGD acts as an implicit regularizer — it prefers smooth solutions. Overfitting only occurs later, and can be prevented by early stopping.
  • Reference: Rahaman et al., On the Spectral Bias of Neural Networks, ICML 2019.

Why Overparameterization Works: Early Stopping

The training trajectory goes from modelling signal to modelling noise. Early stopping picks the sweet spot.

The recipe:
  1. Split data into train and validation sets
  2. Train on train set, monitor loss on validation set
  3. Stop when validation loss starts increasing
  4. Use the model from the epoch with lowest validation loss
Why it works:
  • Generalizable features are learned first (in many architectures by design; for MLPs because of spectral bias)
  • Noise memorization comes later
  • Validation loss detects the transition
  • Acts like an implicit complexity penalty: fewer effective parameters

Caveat: this applies when you are data-limited. Modern LLMs are typically compute-limited — the training set is so large that overfitting to noise is not the bottleneck.
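
The recipe can be sketched as a small helper that scans a validation-loss curve. It is plain Python; the `patience` parameter (how many non-improving epochs to tolerate before stopping) is a common practical addition that is assumed here and not part of the recipe as stated, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: in a real training loop, save a model checkpoint here
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves, then drifts upward
val_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.56, 0.6, 0.7]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print("best epoch:", best_epoch, " stopped at epoch:", stop_epoch)
```

Using a patience larger than 1 avoids stopping on a single noisy fluctuation of the validation loss; the model that is kept is always the one from `best_epoch`.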

The Training Trajectory: Simple Features First

An overparameterized network doesn't learn everything at once. It follows a trajectory from simple to complex:

Early training
Low-frequency, smooth features — the generalizable signal
Mid training
Finer structure — still useful but diminishing returns
Late training
High-frequency noise — overfitting
This is not just SGD — it's the whole stack: ReLU activations create piecewise-linear functions that build complexity incrementally. The architecture (convolutions, skip connections) and regularization (weight decay, dropout) all shape which features are easy to learn first, biasing the trajectory toward features that generalize.

Key Takeaways

  • Gradients must flow: backpropagation applies the chain rule through the computation graph, and He initialization, ReLU variants, and input normalization keep signals from vanishing or exploding.
  • Smarter optimizers: momentum smooths the update direction with a running average of past gradients; Adam additionally adapts each parameter's step size using a running average of squared gradients.
  • Schedules matter: warming the learning rate up and then decaying it stabilizes early training and sharpens final convergence.
  • Overparameterization helps: networks far larger than needed still generalize well, defying the classical bias-variance tradeoff (double descent).
  • Simple features first: training learns smooth, generalizable patterns before fitting noise, so early stopping captures the good solution.
  • Next up: convolutional networks build translation equivariance directly into the architecture.