Machine Learning for Astroparticle Physics:
A Crash-course in SBI

Lecture 2b - Deep neural networks for density estimation

Christoph Weniger — University of Amsterdam (GRAPPA)

Today's Roadmap

Recap from 2a: a linear-basis fit gives \(q(\theta\mid x) = \mathcal{N}(\mu_\theta(x), \sigma_\theta^{2})\), but all the modelling burden sits on the hand-picked basis \(\boldsymbol{\phi}(x)\). Today we make \(\boldsymbol{\phi}\) learnable.

1
Neural networks as learnable bases
Replace fixed templates by neurons \(g(\mathbf{v}_j^T x + b_j)\); one hidden layer is a universal approximator

→

2
Training neural networks
No closed form, so descend the loss: gradient descent and mini-batch SGD

→

3
Going deep
Stack layers; each one composes richer effective basis functions

The thread: the basis was the bottleneck in 2a. Learn it, learn how to train it, then stack it deep.

Neural networks as learnable bases

Replace fixed templates with parametric ones; one hidden layer can already approximate anything.

The hand-picked basis problem

Everything so far depends on hand-picked basis functions \(\boldsymbol{\phi}(x)\).

Worked for 1D
We could visualise and choose polynomial, Gaussian, or sigmoidal templates.

Doesn't scale
What are good basis functions for images? For particle jets? For spectra in 1000 dimensions?

This limits both regression and classification — the bottleneck is always \(\boldsymbol{\phi}\).

\[ \underbrace{\text{output}}_{\text{prediction}} = \text{head}\!\Big( \underbrace{\mathbf{w}^T \boldsymbol{\phi}(x)}_{\text{this is the bottleneck}} \Big) \]

The head (identity or sigmoid) is trivial to swap. The hard part is \(\boldsymbol{\phi}\).

Towards Learnable Building Blocks

Idea: Replace fixed templates with parametric functions. A simple building block:

\[ \phi_j(\mathbf{x}) = g\!\left(\mathbf{v}_j^T \mathbf{x} + b_j\right) \]

\(g\) = nonlinear activation function — \(\mathbf{v}_j, b_j\) are learnable parameters

For a 1D input, what does this look like with random \(\mathbf{v}, b\)?

Activation: Neurons H = 6

Each coloured curve is one learnable basis function \(\phi_j(\mathbf{x}) = g(\mathbf{v}_j^T \mathbf{x} + b_j)\) with random \(\mathbf{v}_j, b_j\). Here: 1D input.

Common Activation Functions

The nonlinearity \(g\) determines the shape of each building block. Common choices:

Step
\(\mathbb{1}[x > 0]\)

Sigmoid
\(\frac{1}{1+e^{-x}}\)

Tanh
\(\tanh(x)\)

ReLU
\(\max(0, x)\)

GELU
\(x \cdot \sigma(1.7x)\)

ReLU — default for most networks. Simple, fast, sparse. Can "die" (zero gradient for \(x < 0\)).

GELU — smooth approximation of ReLU. Used in transformers (GPT, BERT). Slightly better in practice.

The Artificial Neuron

Each building block \(\phi_j(\mathbf{x}) = g(\mathbf{v}_j^T \mathbf{x} + b_j)\) is called a neuron, by analogy with biology:

\(x_1\)

\(x_2\)

\(\vdots\)

\(x_d\)

\(\xrightarrow{v_1}\)

\(\xrightarrow{v_2}\)

\(\xrightarrow{v_d}\)

\(\mathbf{v}^T\!\mathbf{x} + b\) \(\downarrow\) \(g(\cdot)\)

\(\uparrow\; b_j\) (bias)

→

output
\(\phi_j = g(\mathbf{v}_j^T\mathbf{x} + b_j)\)

Weights \(\mathbf{v}_j\) — determine which combination of inputs the neuron responds to

Bias \(b_j\) — sets the firing threshold

Activation \(g\) — shapes the response (on/off, smooth, etc.)

One hidden layer = a learnable basis

Stack \(H\) of these neurons and let a final linear layer combine them:

\[ f(\mathbf{x}) = \sum_{j=1}^{H} w_j \;\underbrace{g(\mathbf{v}_j^T \mathbf{x} + b_j)}_{\text{learnable }\phi_j(\mathbf{x})} \;+\; w_0 \]

This is exactly the linear-basis model \(f(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\) from earlier, except now the basis functions have learnable parameters.

Input
\(\mathbf{x}\)

→

Hidden layer
\(h_j = g(\mathbf{v}_j^T \mathbf{x} + b_j)\)

→

Output
linear or sigmoid head

This is a neural network with one hidden layer. Stack more layers → a multi-layer perceptron (MLP).

Universal Approximation Theorem

With enough neurons, a single hidden layer can approximate any continuous function.

Intuition: each ReLU neuron contributes a "kink". Combine enough kinks → any shape.

Activation: Neurons H = 4

Grey: individual neurons \(w_j \cdot g(v_j x + b_j)\). Black: their sum. More neurons = more expressive.

Training neural networks

A learnable basis has no closed-form fit. We descend the loss instead: gradient descent and mini-batch SGD.

No Closed-Form Solution

Fixed basis (linear regression)

\(\mu_\theta(x) = \mathbf{w}^T\boldsymbol{\phi}(x)\) with fixed \(\boldsymbol{\phi}\).

Loss is quadratic in \(\mathbf{w}\).

\(\nabla E = 0\) → closed form: \(\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\boldsymbol{\theta}\).

Learnable basis (neural net)

\(\mu_\theta(x) = \sum_j w_j\, g(\mathbf{v}_j^T x + b_j)\) with learnable \(\mathbf{v}_j, b_j\).

The nonlinearity \(g\) makes the loss nonlinear and non-convex in the parameters.

No closed form: \(\nabla E = 0\) gives a transcendental system, not a matrix equation.

When we cannot solve \(\nabla E = 0\) on paper, we need a numerical method to find the minimum of \(E(\mathbf{w})\). The default one is (stochastic) gradient descent.

What the Network Actually Is

Concretely, the one-hidden-layer network from the previous slide, as math (left) and as the PyTorch module you would actually run (right). Colours link each line:

The math (input \(\mathbf{x}\in\mathbb{R}^2\), width \(H\))

\( \mathbf{z} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)} \)

\( \mathbf{h} = g(\mathbf{z}) \)

\( \mu = W^{(2)}\mathbf{h} + b^{(2)} \)

\(W^{(1)}\!\in\!\mathbb{R}^{H\times 2}\), \(\mathbf{b}^{(1)}\!\in\!\mathbb{R}^{H}\), \(W^{(2)}\!\in\!\mathbb{R}^{1\times H}\), \(b^{(2)}\!\in\!\mathbb{R}\); \(g=\text{ReLU}\).

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, H=16):
        super().__init__()
        self.fc1 = nn.Linear(2, H)   # W⁽¹⁾ (H×2), b⁽¹⁾ (H)
        self.g   = nn.ReLU()         # activation g
        self.fc2 = nn.Linear(H, 1)   # W⁽²⁾ (1×H), b⁽²⁾ (1)

    def forward(self, x):
        z = self.fc1(x)              # z = W⁽¹⁾x + b⁽¹⁾
        h = self.g(z)                # h = g(z)
        return self.fc2(h)           # μ = W⁽²⁾h + b⁽²⁾

The learnable parameters are \(\boldsymbol\phi=\{W^{(1)},\mathbf{b}^{(1)},W^{(2)},b^{(2)}\}\). PyTorch collects them in model.parameters(); the optimizer descends the loss over all of them at once.

Gradient Descent

Bundle every trainable quantity into a single parameter vector \(\boldsymbol\phi\):

\( \boldsymbol\phi = \bigl(\mathbf{W}^{(1)},\mathbf{b}^{(1)},\ldots,\mathbf{W}^{(L)},\mathbf{b}^{(L)},\sigma_\theta,\ldots\bigr) \)

weights, biases, output-noise scale: everything the loss depends on, in one flat vector

Start somewhere, then repeatedly step in the direction that decreases the loss:

\( \boldsymbol\phi \leftarrow \boldsymbol\phi - \eta \,\nabla_{\!\boldsymbol\phi}\, E(\boldsymbol\phi) \)

\(\eta\) = learning rate. The gradient is taken w.r.t. every component of \(\boldsymbol\phi\) at once.

Click Next ▶ to see how the learning rate affects convergence.

Illustration: Step Size and Curvature

How large should \(\eta\) be? Consider a simplistic quadratic loss \(E(w) = \tfrac{1}{2}\,\kappa\,w^2\) with curvature \(\kappa = E''(w)\).

One GD step: \(w_{i+1} = w_i - \eta\,\kappa\,w_i = (1 - \eta\,\kappa)\,w_i\). Converges instantly when:

\[ \eta = \frac{1}{\kappa} = \frac{1}{E''(w)} \qquad \text{(ideal for this toy case)} \]

Practical challenges: the optimal step size is the inverse curvature of the loss, which is typically not known. In many dimensions, curvature differs per direction.

Mini-Batch Stochastic Gradient Descent (SGD)

Computing the full gradient over all \(N\) samples is expensive. Approximate it with a random mini-batch \(\mathcal{B}\) of size \(B \ll N\). The gradient is taken w.r.t. every component of \(\boldsymbol\phi = (\mathbf{W},\mathbf{b},\sigma_\theta,\ldots)\):

\[ \nabla_{\!\boldsymbol\phi}\, E(\boldsymbol\phi) = \frac{1}{N}\sum_{n=1}^{N} \nabla_{\!\boldsymbol\phi}\, E_n(\boldsymbol\phi) \;\approx\; \frac{1}{B}\sum_{n \in \mathcal{B}} \nabla_{\!\boldsymbol\phi}\, E_n(\boldsymbol\phi) \]

One full pass through the data = one epoch. Typical batch sizes: \(B\) = 32–256.

When people say "SGD" they almost always mean mini-batch SGD.

SGD = True Gradient + Noise

The mini-batch gradient is a noisy estimate of the true gradient:

\[ \underbrace{\frac{1}{B}\sum_{n \in \mathcal{B}} \nabla_{\!\boldsymbol\phi}\, E_n}_{\text{mini-batch gradient}} = \underbrace{\nabla_{\!\boldsymbol\phi}\, E(\boldsymbol\phi)}_{\text{true gradient}} + \underbrace{\boldsymbol{\epsilon}}_{\text{noise}}, \qquad \mathrm{Var}(\boldsymbol{\epsilon}) = \frac{\sigma^2_{\nabla}}{B} \]

\(\sigma^2_{\nabla}\) = variance of individual sample gradients \(\nabla_{\!\boldsymbol\phi}\, E_n\) across the dataset

Large \(B\)
Low noise, precise gradient
Expensive per step

Small \(B\)
High noise, rough gradient
Cheap per step, helps generalisation

MLP in Action: Regression

Train a one-hidden-layer MLP on 1D regression with the Gaussian NLL loss. The grey band is the trained \(\sigma\) — all of \(\boldsymbol\phi = (\mathbf{W},\mathbf{b},\log\sigma)\) is updated by SGD.

Target: Activation: Width = 8 η = 0.100

The hidden layer learns its own basis functions — no manual feature engineering.
Width controls expressiveness: too few neurons underfit, too many can overfit.

Three Functions to Infer

Imagine an unknown smooth field \(\theta(x_1,x_2)\) over the plane. We observe noisy samples at scattered locations and want to infer the whole surface. Three targets, easy → hard:

Blob \(\theta=e^{-\lVert\mathbf{x}-\mathbf{c}\rVert^2/2s^2}\)

Moons two curved ridges

Spiral \(\theta=\sin(\angle\mathbf{x}-k\lVert\mathbf{x}\rVert)\)

The prior is uniform over the square: training inputs are drawn uniformly, so the network sees the whole domain. All three are pure regression — map \(\mathbb{R}^2\!\to\!\mathbb{R}\), no classification needed.

MLP in Action: Inferring a 2D Function

A single hidden layer mapping \((x_1,x_2)\to\mu\), trained with mini-batch SGD on the MSE loss:

Target: Width = 8

η = 0.010

Colour = inferred value \(\mu(x_1,x_2)\); scattered points are the noisy training targets on the same scale.
One hidden layer fits the smooth Blob easily, but struggles with the high-frequency Spiral: motivation for going deep.

Going deep

One hidden layer is a universal approximator — but what happens when we stack more?

Stacking Layers

Apply the same idea repeatedly: the output of one layer becomes the input to the next.

Input
\(\mathbf{x}\)

→

Layer 1
\(\mathbf{h}^{(1)}\!=\!g(W^{(1)}\mathbf{x}\!+\!\mathbf{b}^{(1)})\)

→

Layer 2
\(\mathbf{h}^{(2)}\!=\!g(W^{(2)}\mathbf{h}^{(1)}\!+\!\mathbf{b}^{(2)})\)

→ …

… →

Layer \(L\)
\(\mathbf{h}^{(L)}\!=\!g(W^{(L)}\mathbf{h}^{(L-1)}\!+\!\mathbf{b}^{(L)})\)

→

Head
\(\mathbf{y}\!=\!\text{head}(\mathbf{h}^{(L)})\)

From the output head's perspective, the last hidden layer provides effective basis functions.

With more layers, these effective basis functions become richer and more complex.

Key question: what do these effective basis functions look like?
Let's visualise them with random weights for different depths.

Effective Basis Functions by Depth

Take networks with random weights. Plot what the last hidden layer neurons compute as functions of the 1D input:

Activation: Width = 8

1 hidden layer

2 hidden layers

3 hidden layers

Deeper = richer basis functions. Each additional layer composes nonlinearities, creating more complex shapes from simpler ones.

What a Deep Network Is

The same object, now with \(L\) hidden layers. Depth is just the one-layer rule repeated, a loop in code and a recursion in math:

The math (\(\mathbf{x}\in\mathbb{R}^2\), width \(H\), depth \(L\))

\( \mathbf{h}^{(0)} = \mathbf{x} \)

\( \mathbf{h}^{(\ell)} = g\big(W^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\big) \)

for \( \ell = 1,\dots,L \)

\( \mu = W^{(L+1)}\mathbf{h}^{(L)} + b^{(L+1)} \)

One hidden-layer rule, applied \(L\) times, then a linear head. Each layer has its own \(W^{(\ell)},\mathbf{b}^{(\ell)}\).

import torch.nn as nn

class DeepMLP(nn.Module):
    def __init__(self, L=4, H=16):
        super().__init__()
        dims = [2] + [H]*L
        self.layers = nn.ModuleList(
            nn.Linear(dims[l], dims[l+1]) for l in range(L))
        self.g    = nn.ReLU()              # activation g
        self.head = nn.Linear(H, 1)        # W⁽ᴸ⁺¹⁾, b⁽ᴸ⁺¹⁾

    def forward(self, x):
        h = x                              # h⁽⁰⁾ = x
        for layer in self.layers:        # stack L hidden layers
            h = self.g(layer(h))           # h⁽ˡ⁾ = g(W⁽ˡ⁾h⁽ˡ⁻¹⁾ + b⁽ˡ⁾)
        return self.head(h)             # μ = W⁽ᴸ⁺¹⁾h⁽ᴸ⁾ + b⁽ᴸ⁺¹⁾

How Do We Train Deep Networks?

We need \(\nabla_{\mathbf{W}^{(l)}} L\) for every layer. The chain rule propagates gradients backwards:

\[ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \mathbf{h}^{(L)}} \cdot \frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(L-1)}} \cdots \frac{\partial \mathbf{h}^{(l+1)}}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial W^{(l)}} \]

Each factor \(\partial \mathbf{h}^{(l+1)} / \partial \mathbf{h}^{(l)}\) involves the activation derivative \(g'(\cdot)\) and the weight matrix \(W^{(l+1)}\).

Forward pass
Compute \(\mathbf{h}^{(1)} \to \mathbf{h}^{(2)} \to \cdots \to \hat{y}\)

Backward pass
Propagate \(\nabla L\) from output back to each \(W^{(l)}\)

This is backpropagation — just the chain rule applied systematically.
PyTorch does this automatically via loss.backward().

MLP in Action: Stacking Hidden Layers

The same demo, now with a slider for the number of hidden layers

Target: Layers = 1 Width = 8