Christoph Weniger — University of Amsterdam (GRAPPA)
Read the joint cloud column-wise: the simplest \(q(\theta\mid x)\) comes out almost for free.
A simulator gives us pairs \((\theta, x)\): parameter in, observation out. Plotted together they form a joint cloud.
Our job is to describe the cloud in the \(\theta\) direction: at each value of \(x\), what is the distribution of \(\theta\) consistent with it?
Move the slider: the orange band picks out a slice; the points inside are samples of \(q(\theta\mid x_\mathrm{obs})\).
For now we assume \(q(\theta\mid x)\) is uni-modal at every \(x\). The multi-modal case waits until the last section.
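The slicing idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the toy simulator (uniform prior, additive Gaussian noise), the function names `simulate` and `slice_samples`, and the band half-width are all hypothetical choices, not the ones behind the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, noise=0.3):
    """Hypothetical toy simulator: theta ~ U(-2, 2), x = theta + Gaussian noise."""
    theta = rng.uniform(-2.0, 2.0, size=n)
    x = theta + noise * rng.normal(size=n)
    return theta, x

def slice_samples(theta, x, x_obs, half_width=0.05):
    """Return the theta values whose x falls inside the band around x_obs."""
    mask = np.abs(x - x_obs) < half_width
    return theta[mask]

theta, x = simulate(200_000)
post = slice_samples(theta, x, x_obs=0.5)
print(f"{post.size} points in slice, mean={post.mean():.3f}, std={post.std():.3f}")
```

For this toy setup the slice should be roughly Gaussian around \(x_\mathrm{obs}\) with a width close to the simulator noise. A narrower band gives a sharper approximation but keeps fewer points.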
Fit a curve \(\mu_\theta(x)\) through the cloud and assume the slice at every \(x\) is Gaussian with the same width \(\sigma_\theta\).
Trainable parameters \(\phi = (\mathbf{w}, \sigma_\theta)\): the curve and the band width.
Black line: \(\mu_\theta(x)\). Shaded: \(\mu_\theta(x) \pm \sigma_\theta\).
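Pick basis functions \(\phi_j(x)\) that encode plausible functional behaviour. Combine \(M\) of them with free parameters \(w_j\):
\[ \mu_\theta(x) \;=\; \sum_{j=1}^{M} w_j\,\phi_j(x) \;=\; \mathbf{w}^T\boldsymbol{\phi}(x) \]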
The modelling decision is the choice of \(\boldsymbol{\phi}\). Everything that follows in this lecture is one continuous attack on that choice.
Click a family to see its shape. Each \(\phi_j(x)\) is one template — stack \(M\) of them to build your model.
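As a rough illustration of what a basis family looks like in code, the sketch below builds the design matrix for three common choices. The three families (polynomials, Gaussian bumps, sinusoids), the function names and the example weights are assumptions made for illustration; the families offered on the slide may differ.

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns 1, x, x^2, ..., x^(M-1)."""
    return np.vander(x, N=M, increasing=True)

def gaussian_basis(x, M, lo=-1.0, hi=1.0):
    """M Gaussian bumps with evenly spaced centres and a shared width."""
    centres = np.linspace(lo, hi, M)
    width = (hi - lo) / M
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)

def fourier_basis(x, M):
    """Constant term followed by sin/cos columns of increasing frequency."""
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < M:
        cols.append(np.sin(np.pi * k * x))
        if len(cols) < M:
            cols.append(np.cos(np.pi * k * x))
        k += 1
    return np.stack(cols, axis=1)

x = np.linspace(-1.0, 1.0, 50)
w = np.array([0.5, -1.0, 2.0, 0.3])   # arbitrary example weights
for make in (polynomial_basis, gaussian_basis, fourier_basis):
    Phi = make(x, 4)                  # design matrix, shape (N, M)
    mu = Phi @ w                      # mu_theta(x_n) = w^T phi(x_n) for every n
    print(make.__name__, Phi.shape, mu.shape)
```

Whatever the family, the model stays linear in \(\mathbf{w}\): only the columns of the design matrix change.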
Pick random weights \(\mathbf{w}\); they define a function \(\mu_\theta(x) = \mathbf{w}^T\boldsymbol{\phi}(x)\). Observe noisy samples \(\theta_n = \mu_\theta(x_n) + \varepsilon\), \(\varepsilon \sim \mathcal{N}(0, \sigma_\theta^{2})\).
Black: true curve \(\mu_\theta(x)\). Red points: noisy samples \(\theta_n\). Given only the points, can you recover the black curve?
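The data-generating recipe above can be written out directly. This is a sketch with assumed values: the polynomial basis, \(M = 4\), \(N = 200\) and \(\sigma_\theta = 0.2\) are illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

M, N, sigma_theta = 4, 200, 0.2            # illustrative values
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, N=M, increasing=True)   # polynomial basis, assumed for illustration

w_true = rng.normal(size=M)                # random weights define the true curve
mu_true = Phi @ w_true                     # mu_theta(x_n) = w^T phi(x_n)
theta = mu_true + sigma_theta * rng.normal(size=N)   # noisy samples theta_n

print("empirical noise std:", round(float((theta - mu_true).std()), 3))
```

The printed spread of the residuals around the true curve should be close to the chosen \(\sigma_\theta\).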
The Gaussian likelihood becomes a loss; the weights and the variance follow in closed form.
With independent Gaussian noise, the joint likelihood factorises:
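\[ p(\boldsymbol\theta \mid \mathbf{w}, \sigma_\theta) \;=\; \prod_{n=1}^{N} \mathcal{N}\!\bigl(\theta_n \,\big|\, \mu_\theta(x_n;\mathbf{w}),\, \sigma_\theta^{2}\bigr) \]
Take the log and write the constant explicitly in terms of \(N\) and \(\sigma_\theta\):
\[ \ln p(\boldsymbol\theta \mid \mathbf{w}, \sigma_\theta) \;=\; -\frac{N}{2}\ln\!\bigl(2\pi\sigma_\theta^{2}\bigr) \;-\; \frac{1}{2\sigma_\theta^{2}}\sum_{n=1}^{N}\bigl(\theta_n - \mu_\theta(x_n;\mathbf{w})\bigr)^{2} \]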
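Negate to turn maximising the likelihood into minimising a loss:
\[ E(\mathbf{w}, \sigma_\theta) \;=\; -\ln p(\boldsymbol\theta \mid \mathbf{w}, \sigma_\theta) \;=\; \frac{N}{2}\ln\!\bigl(2\pi\sigma_\theta^{2}\bigr) \;+\; \frac{1}{2\sigma_\theta^{2}}\sum_{n=1}^{N}\bigl(\theta_n - \mu_\theta(x_n;\mathbf{w})\bigr)^{2} \]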
Split the loss into a \(\sigma_\theta\)-independent shape \(S(\mathbf{w})\) and a \(\sigma_\theta\)-only piece:
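\[ E(\mathbf{w}, \sigma_\theta) \;=\; \frac{N}{2}\ln\!\bigl(2\pi\sigma_\theta^{2}\bigr) \;+\; \frac{S(\mathbf{w})}{2\sigma_\theta^{2}}, \qquad S(\mathbf{w}) = \sum_{n}\bigl(\theta_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\bigr)^{2} \]
Step 1: weights. The prefactor \(1/(2\sigma_\theta^{2})\) is positive and the log term does not depend on \(\mathbf{w}\), so minimising \(E\) in \(\mathbf{w}\) is the same as minimising \(S(\mathbf{w})\). Set \(\nabla_\mathbf{w} S = 0\); with the design matrix \(\boldsymbol\Phi_{nj} = \phi_j(x_n)\), and assuming \(\boldsymbol\Phi^T\boldsymbol\Phi\) is invertible,
\[ \mathbf{w}_{\mathrm{ML}} \;=\; \bigl(\boldsymbol\Phi^T\boldsymbol\Phi\bigr)^{-1}\boldsymbol\Phi^T\boldsymbol\theta \]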
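Step 2: variance. Plug \(\mathbf{w}_{\mathrm{ML}}\) back and set \(\partial E/\partial\sigma_\theta = 0\):
\[ \frac{\partial E}{\partial\sigma_\theta} \;=\; \frac{N}{\sigma_\theta} - \frac{S(\mathbf{w}_{\mathrm{ML}})}{\sigma_\theta^{3}} \;=\; 0 \quad\Longrightarrow\quad \sigma_{\mathrm{ML}}^{2} \;=\; \frac{S(\mathbf{w}_{\mathrm{ML}})}{N} \;=\; \frac{1}{N}\sum_{n}\bigl(\theta_n - \mathbf{w}_{\mathrm{ML}}^T\boldsymbol{\phi}(x_n)\bigr)^{2} \]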
The MLE divides by \(N\). The unbiased estimator divides by \(N-M\) (Bessel-style correction for the \(M\) fitted weights); the two agree as \(N \to \infty\).
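The two closed-form steps can be checked numerically. The sketch below assumes a cubic polynomial basis and made-up true weights and noise level. It uses `np.linalg.lstsq`, which solves the same least-squares problem as the normal equations without forming the matrix inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a known cubic (illustrative values, not from the lecture).
N, M, sigma_true = 500, 4, 0.3
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, N=M, increasing=True)   # design matrix Phi[n, j] = phi_j(x_n)
w_true = np.array([0.2, -1.0, 0.5, 1.5])
theta = Phi @ w_true + sigma_true * rng.normal(size=N)

# Step 1: least squares, equivalent to solving Phi^T Phi w = Phi^T theta.
w_ml, *_ = np.linalg.lstsq(Phi, theta, rcond=None)

# Step 2: sigma^2_ML = S(w_ML) / N; the unbiased variant divides by N - M.
S = np.sum((theta - Phi @ w_ml) ** 2)
sigma_ml = np.sqrt(S / N)
sigma_unbiased = np.sqrt(S / (N - M))

def loss(w, sigma):
    """E(w, sigma) = N/2 ln(2 pi sigma^2) + S(w) / (2 sigma^2)."""
    resid = theta - Phi @ w
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

print("w_ML           =", np.round(w_ml, 3))
print("sigma_ML       =", round(float(sigma_ml), 4))
print("sigma_unbiased =", round(float(sigma_unbiased), 4))
print("E at optimum   =", round(float(loss(w_ml, sigma_ml)), 3))
```

Nudging `sigma_ml` up or down inside `loss` should only increase \(E\), which is a quick sanity check on Step 2.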
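Black curve: \(\mu_\theta(x;\mathbf{w}_{\mathrm{ML}})\). Shaded band: \(\mu_\theta(x;\mathbf{w}_{\mathrm{ML}})\pm\sigma_{\mathrm{ML}}\). The fitted Gaussian density is \(q(\theta\mid x)=\mathcal{N}\bigl(\theta \,\big|\, \mu_\theta(x;\mathbf{w}_{\mathrm{ML}}),\,\sigma_{\mathrm{ML}}^{2}\bigr)\).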
A flexible enough basis can memorise the noise. Held-out data is how we catch it.
RMSE = root mean square error: \(\sqrt{\frac{1}{N}\sum_n (\theta_n - \mu_\theta(x_n;\mathbf{w}))^2}\).
Model too simple — misses the pattern.
Training error: high
Validation error: high
Right complexity.
Training error: low
Validation error: low
Model too flexible — memorises noise.
Training error: very low
Validation error: high
The gap between training and validation error is the signature of overfitting.
Training set — fit parameters \(\mathbf{w}\).
Validation set — choose model complexity (\(M\), \(\lambda\), architecture, ...).
Test set — one-shot final score. Look once.
Never use test data for model selection: once it has influenced a choice, the test score is no longer an unbiased estimate of performance on new data.
Training error keeps decreasing with \(M\). Validation error is U-shaped — the minimum picks the sweet spot.
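A small experiment makes the train/validation/test workflow concrete. Everything here is an assumption for illustration: the sine ground truth, the noise level, the set sizes and the polynomial basis. The exact shape of the validation curve will depend on those choices and on the random seed.

```python
import numpy as np

rng = np.random.default_rng(3)

def rmse(theta, pred):
    """Root mean square error between targets and predictions."""
    return np.sqrt(np.mean((theta - pred) ** 2))

def make_data(n):
    """Hypothetical ground truth: sin(pi x) plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    theta = np.sin(np.pi * x) + 0.2 * rng.normal(size=n)
    return x, theta

def design(x, M):
    """Polynomial design matrix with M columns."""
    return np.vander(x, N=M, increasing=True)

x_train, th_train = make_data(20)     # fit the weights
x_val, th_val = make_data(200)        # choose M
x_test, th_test = make_data(200)      # final score, used once

results, weights = {}, {}
for M in range(1, 13):
    w, *_ = np.linalg.lstsq(design(x_train, M), th_train, rcond=None)
    weights[M] = w
    results[M] = (rmse(th_train, design(x_train, M) @ w),
                  rmse(th_val, design(x_val, M) @ w))
    print(f"M={M:2d}  train={results[M][0]:.3f}  val={results[M][1]:.3f}")

best_M = min(results, key=lambda M: results[M][1])   # validation picks the complexity
test_rmse = rmse(th_test, design(x_test, best_M) @ weights[best_M])
print(f"validation picks M={best_M}, test RMSE={test_rmse:.3f}")
```

Because the polynomial models are nested, the training error cannot increase with \(M\). The test set is touched only after `best_M` has been fixed.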
In a \(D\)-dimensional input \(\mathbf{x} = (x_1, \ldots, x_D)\) we still write \(\mu_\theta(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\), but the basis must be hand-picked to tile the whole space:
Tiling \(D\) dimensions with \(M\) basis functions per axis needs \(\sim M^D\) of them, and you must place every centre, width and frequency yourself: the curse of dimensionality. Lecture 2b's fix: let the network learn the basis.
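The \(M^D\) growth is easy to see by enumerating the centres of a regular grid. This is a sketch; the helper `grid_centres` and the ranges are hypothetical.

```python
import itertools
import numpy as np

def grid_centres(M, D, lo=-1.0, hi=1.0):
    """All M**D centres of a regular grid with M points per axis."""
    axis = np.linspace(lo, hi, M)
    return np.array(list(itertools.product(axis, repeat=D)))

centres = grid_centres(M=5, D=3)
print(centres.shape)   # (125, 3)

# Count only: building the grid itself quickly becomes infeasible.
for D in (1, 2, 5, 10):
    print(f"M=10, D={D:2d}: {10**D:,} basis functions")
```

With ten basis functions per axis, ten input dimensions already call for ten billion weights, far more than any realistic training set can constrain.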