Christoph Weniger — University of Amsterdam (GRAPPA)
Replace fixed templates with parametric ones; one hidden layer can already approximate any continuous function.
Everything so far depends on hand-picked basis functions \(\boldsymbol{\phi}(x)\).
This limits both regression and classification — the bottleneck is always \(\boldsymbol{\phi}\).
The head (identity or sigmoid) is trivial to swap. The hard part is \(\boldsymbol{\phi}\).
Idea: Replace fixed templates with parametric functions. A simple building block:
\[ \phi_j(\mathbf{x}) = g(\mathbf{v}_j^T \mathbf{x} + b_j) \]
\(g\) = nonlinear activation function; \(\mathbf{v}_j, b_j\) are learnable parameters.
For a 1D input, what does this look like with random \(\mathbf{v}, b\)?
Each coloured curve is one learnable basis function \(\phi_j(\mathbf{x}) = g(\mathbf{v}_j^T \mathbf{x} + b_j)\) with random \(\mathbf{v}_j, b_j\). Here: 1D input.
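The curves described above can be reproduced numerically. The following is a minimal sketch, assuming \(g = \tanh\) and Gaussian-distributed \(v_j, b_j\); the slide does not specify either choice, and the helper name `random_basis` is made up for this example.

```python
import math
import random

def random_basis(H, seed=0):
    """Draw H random 1D neurons phi_j(x) = g(v_j * x + b_j), with g = tanh (assumed)."""
    rng = random.Random(seed)
    # Assumed sampling scheme: v_j ~ N(0, 2^2), b_j ~ N(0, 1).
    params = [(rng.gauss(0.0, 2.0), rng.gauss(0.0, 1.0)) for _ in range(H)]
    # Bind v and b as default arguments so each lambda keeps its own parameters.
    return [lambda x, v=v, b=b: math.tanh(v * x + b) for v, b in params]

basis = random_basis(H=5)
xs = [-2.0 + 4.0 * i / 8 for i in range(9)]  # 9 grid points on [-2, 2]
for j, phi in enumerate(basis):
    print(j, [round(phi(x), 2) for x in xs])
```

Each printed row is one basis function sampled on the grid: a shifted, stretched, and possibly mirrored copy of the same sigmoidal shape.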
The nonlinearity \(g\) determines the shape of each building block. Common choices:
Each building block \(\phi_j(\mathbf{x}) = g(\mathbf{v}_j^T \mathbf{x} + b_j)\) is called a neuron, by analogy with biology:
Stack \(H\) of these neurons and let a final linear layer combine them:
\[ f(\mathbf{x}) = \sum_{j=1}^{H} w_j\, g(\mathbf{v}_j^T \mathbf{x} + b_j) \]
This is exactly the linear-basis model \(f(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\) from earlier, except now the basis functions have learnable parameters.
This is a neural network with one hidden layer. Stack more layers → a multi-layer perceptron (MLP).
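With enough neurons, a single hidden layer with a non-polynomial activation can approximate any continuous function on a bounded domain to arbitrary accuracy (the universal approximation theorem).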
Intuition: each ReLU neuron contributes a "kink". Combine enough kinks → any shape.
Grey: individual neurons \(w_j \cdot g(v_j x + b_j)\). Black: their sum. More neurons = more expressive.
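The "kink" intuition can be made concrete with a hand-built network instead of a trained one. The sketch below places \(H\) ReLU neurons at equally spaced knots and sets each output weight to the change of slope at that knot, which gives a piecewise-linear interpolant of the target. The target \(\sin x\), the interval \([0, 2\pi]\), the constant offset (an output bias), and the helper names are all assumptions made for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def fit_relu_kinks(target, a, b, H):
    """Sum of H ReLU neurons that interpolates target at H+1 equally spaced knots on [a, b]."""
    knots = [a + (b - a) * j / H for j in range(H + 1)]
    values = [target(k) for k in knots]
    slopes = [(values[j + 1] - values[j]) / (knots[j + 1] - knots[j]) for j in range(H)]
    # Neuron j switches on at knots[j]; its output weight is the slope change there.
    weights = [slopes[0]] + [slopes[j] - slopes[j - 1] for j in range(1, H)]
    offset = values[0]  # constant term, playing the role of an output bias

    def f(x):
        return offset + sum(w * relu(x - k) for w, k in zip(weights, knots[:H]))

    return f

def max_error(target, f, a, b, n=1000):
    """Largest absolute deviation between target and f on a fine grid."""
    return max(abs(target(x) - f(x)) for x in (a + (b - a) * i / n for i in range(n + 1)))

a, b = 0.0, 2.0 * math.pi
errors = {H: max_error(math.sin, fit_relu_kinks(math.sin, a, b, H), a, b) for H in (4, 16, 64)}
for H, err in errors.items():
    print(f"H={H:3d}  max error = {err:.4f}")
```

The error shrinks as \(H\) grows, in line with "more neurons = more expressive". A trained network finds its kink positions by gradient descent rather than by this fixed construction.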
A learnable basis has no closed-form fit. We descend the loss instead: gradient descent and mini-batch SGD.
\(\mu_\theta(x) = \mathbf{w}^T\boldsymbol{\phi}(x)\) with fixed \(\boldsymbol{\phi}\).
Loss is quadratic in \(\mathbf{w}\).
\(\nabla E = 0\) → closed form: \(\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\boldsymbol{\theta}\).
\(\mu_\theta(x) = \sum_j w_j\, g(\mathbf{v}_j^T x + b_j)\) with learnable \(\mathbf{v}_j, b_j\).
The nonlinearity \(g\) makes the loss non-quadratic and, in general, non-convex in the parameters.
No closed form: \(\nabla E = 0\) gives a transcendental system, not a matrix equation.
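For contrast, the closed-form fit of the fixed-basis case takes only a few lines. This is a sketch for exactly two basis functions, \(\boldsymbol{\phi}(x) = (1, x)\), so the normal equations reduce to a 2×2 system that can be inverted by hand. The data and the function name `fit_closed_form` are illustrative assumptions.

```python
def fit_closed_form(xs, thetas, basis_fns):
    """Solve (Phi^T Phi) w = Phi^T theta for exactly two basis functions."""
    Phi = [[phi(x) for phi in basis_fns] for x in xs]
    # Entries of the symmetric 2x2 matrix A = Phi^T Phi and the vector c = Phi^T theta.
    a00 = sum(r[0] * r[0] for r in Phi)
    a01 = sum(r[0] * r[1] for r in Phi)
    a11 = sum(r[1] * r[1] for r in Phi)
    c0 = sum(r[0] * t for r, t in zip(Phi, thetas))
    c1 = sum(r[1] * t for r, t in zip(Phi, thetas))
    det = a00 * a11 - a01 * a01
    # Explicit inverse of a symmetric 2x2 matrix applied to c.
    return [(a11 * c0 - a01 * c1) / det, (a00 * c1 - a01 * c0) / det]

basis_fns = [lambda x: 1.0, lambda x: x]       # fixed basis: constant and linear term
xs = [0.0, 1.0, 2.0, 3.0]
thetas = [1.0 + 2.0 * x for x in xs]           # noise-free targets on a straight line
w_ml = fit_closed_form(xs, thetas, basis_fns)
print(w_ml)
```

With noise-free data on a straight line the solve recovers intercept 1 and slope 2. No such one-shot solve exists once \(\mathbf{v}_j, b_j\) sit inside \(g\).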
When we cannot solve \(\nabla E = 0\) on paper, we need a numerical method to find a (local) minimum of \(E\). The default one is (stochastic) gradient descent.
Concretely, the one-hidden-layer network from the previous slide, as math and as the PyTorch module you would actually run:
\[ \mathbf{z} = W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \qquad \mathbf{h} = g(\mathbf{z}), \qquad \mu = W^{(2)}\mathbf{h} + b^{(2)} \]
\(W^{(1)}\!\in\!\mathbb{R}^{H\times 2}\), \(\mathbf{b}^{(1)}\!\in\!\mathbb{R}^{H}\), \(W^{(2)}\!\in\!\mathbb{R}^{1\times H}\), \(b^{(2)}\!\in\!\mathbb{R}\); \(g=\text{ReLU}\).
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, H=16):
        super().__init__()
        self.fc1 = nn.Linear(2, H)   # W⁽¹⁾ (H×2), b⁽¹⁾ (H)
        self.g = nn.ReLU()           # activation g
        self.fc2 = nn.Linear(H, 1)   # W⁽²⁾ (1×H), b⁽²⁾ (1)

    def forward(self, x):
        z = self.fc1(x)              # z = W⁽¹⁾x + b⁽¹⁾
        h = self.g(z)                # h = g(z)
        return self.fc2(h)           # μ = W⁽²⁾h + b⁽²⁾
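The learnable parameters are \(\boldsymbol\phi=\{W^{(1)},\mathbf{b}^{(1)},W^{(2)},b^{(2)}\}\). From here on \(\boldsymbol\phi\) denotes the parameter vector, not the basis functions \(\boldsymbol{\phi}(x)\) used earlier. PyTorch collects the parameters in model.parameters(); the optimizer descends the loss over all of them at once.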
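As a quick check of how many numbers the optimizer actually updates, the parameter count follows from the shapes listed above. This is worked arithmetic for the default \(H = 16\) of the module shown, not a statement about any particular experiment.

```latex
\[
\underbrace{2H}_{W^{(1)}} + \underbrace{H}_{\mathbf{b}^{(1)}}
+ \underbrace{H}_{W^{(2)}} + \underbrace{1}_{b^{(2)}}
= 4H + 1
\quad\Longrightarrow\quad
4 \cdot 16 + 1 = 65 \ \text{parameters for } H = 16 .
\]
```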
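Bundle every trainable quantity into a single parameter vector \(\boldsymbol\phi\):
\[ \boldsymbol\phi = (\mathbf{W}, \mathbf{b}, \sigma_\theta, \ldots) \]
Start somewhere, then repeatedly step in the direction that decreases the loss:
\[ \boldsymbol\phi_{i+1} = \boldsymbol\phi_i - \eta\, \nabla_{\!\boldsymbol\phi} E(\boldsymbol\phi_i) \]
\(\eta\) = learning rate. The gradient is taken w.r.t. every component of \(\boldsymbol\phi\) at once.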
How large should \(\eta\) be? Consider a simplistic quadratic loss \(E(w) = \tfrac{1}{2}\,\kappa\,w^2\) with curvature \(\kappa = E''(w)\).
One GD step: \(w_{i+1} = w_i - \eta\,\kappa\,w_i = (1 - \eta\,\kappa)\,w_i\). This converges in a single step when:
\[ \eta = \frac{1}{\kappa} = \frac{1}{E''(w)} \qquad \text{(ideal for this toy case)} \]
Practical challenges: the optimal step size is the inverse curvature of the loss, which is typically not known. In many dimensions, curvature differs per direction, so no single \(\eta\) is ideal for all of them.
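The step formula \(w_{i+1} = (1 - \eta\kappa)\,w_i\) also shows what happens for other step sizes: the iterates shrink whenever \(|1 - \eta\kappa| < 1\), that is for \(0 < \eta < 2/\kappa\), and grow otherwise. The sketch below runs the toy problem for a few values of \(\eta\). The curvature \(\kappa = 4\), the start value, and the step count are arbitrary choices for illustration.

```python
def gradient_descent(eta, kappa=4.0, w0=1.0, steps=20):
    """Run GD on E(w) = 0.5 * kappa * w**2, whose gradient is kappa * w."""
    w = w0
    for _ in range(steps):
        w = w - eta * kappa * w
    return w

kappa = 4.0  # stability threshold is 2 / kappa = 0.5
for eta in (0.05, 1.0 / kappa, 0.45, 0.55):
    w_final = gradient_descent(eta, kappa)
    print(f"eta={eta:.2f}  |w| after 20 steps = {abs(w_final):.3g}")
```

Small \(\eta\) converges slowly, \(\eta = 1/\kappa\) lands on the minimum in one step, \(\eta\) just below \(2/\kappa\) oscillates while shrinking, and \(\eta\) above \(2/\kappa\) diverges.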
Computing the full gradient over all \(N\) samples is expensive. Approximate it with a random mini-batch \(\mathcal{B}\) of size \(B \ll N\). The gradient is taken w.r.t. every component of \(\boldsymbol\phi = (\mathbf{W},\mathbf{b},\sigma_\theta,\ldots)\):
One full pass through the data = one epoch. Typical batch sizes: \(B\) = 32–256.
When people say "SGD" they almost always mean mini-batch SGD.
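The bookkeeping behind "one epoch" can be sketched without any model. The generator below assumes the common convention of shuffling the sample indices once per epoch and slicing them into consecutive batches. Other schemes, such as sampling with replacement, exist, and the helper name `minibatches` is made up for this example.

```python
import random

def minibatches(N, B, rng):
    """Yield index lists that together cover 0..N-1 exactly once (one epoch)."""
    idx = list(range(N))
    rng.shuffle(idx)                 # new random order for this epoch
    for start in range(0, N, B):
        yield idx[start:start + B]   # the last batch may be smaller than B

rng = random.Random(0)
batches = list(minibatches(N=1000, B=64, rng=rng))
print(len(batches), len(batches[0]), len(batches[-1]))  # prints: 16 64 40
```

With \(N = 1000\) and \(B = 64\), one epoch consists of 16 parameter updates instead of the single update that full-batch gradient descent would make.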
The mini-batch gradient is a noisy estimate of the true gradient:
\(\sigma^2_{\nabla}\) = variance of individual sample gradients \(\nabla_{\!\boldsymbol\phi}\, E_n\) across the dataset
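How the noise depends on the batch size can be checked empirically. The sketch below assumes the loss is an average over samples, so that the mini-batch gradient is the mean of \(B\) per-sample gradients. Under that assumption, standard sampling statistics predict a variance of roughly \(\sigma^2_{\nabla}/B\) for \(B \ll N\). The per-sample gradients here are synthetic scalars with \(\sigma_\nabla = 3\), not gradients of a real model.

```python
import random
import statistics

rng = random.Random(0)
N = 10_000
# Hypothetical per-sample gradients (one scalar parameter), standing in for grad E_n.
sample_grads = [rng.gauss(1.0, 3.0) for _ in range(N)]
full_grad = statistics.fmean(sample_grads)

def minibatch_grad(B):
    """Average the gradients of B samples drawn at random without replacement."""
    return statistics.fmean(rng.sample(sample_grads, B))

def noise_variance(B, trials=2000):
    """Empirical mean squared deviation of the mini-batch gradient from the full gradient."""
    return statistics.fmean((minibatch_grad(B) - full_grad) ** 2 for _ in range(trials))

for B in (4, 16, 64):
    print(f"B={B:3d}  empirical variance = {noise_variance(B):.3f}  sigma^2/B = {9.0 / B:.3f}")
```

Quadrupling the batch size cuts the variance by about a factor of four, so the noise amplitude falls only as \(1/\sqrt{B}\).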
Train a one-hidden-layer MLP on 1D regression with the Gaussian NLL loss. The grey band is the trained \(\sigma\) — all of \(\boldsymbol\phi = (\mathbf{W},\mathbf{b},\log\sigma)\) is updated by SGD.
Imagine an unknown smooth field \(\theta(x_1,x_2)\) over the plane. We observe noisy samples at scattered locations and want to infer the whole surface. Three targets, easy → hard:
The prior is uniform over the square: training inputs are drawn uniformly, so the network sees the whole domain. All three are pure regression — map \(\mathbb{R}^2\!\to\!\mathbb{R}\), no classification needed.
A single hidden layer mapping \((x_1,x_2)\to\mu\), trained with mini-batch SGD on the MSE loss:
One hidden layer is a universal approximator — but what happens when we stack more?
Apply the same idea repeatedly: the output of one layer becomes the input to the next.
From the output head's perspective, the last hidden layer provides effective basis functions.
With more layers, these effective basis functions become richer and more complex.
Take networks with random weights. Plot what the last hidden layer neurons compute as functions of the 1D input:
Panels: 1 hidden layer, 2 hidden layers, 3 hidden layers.
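The same object, now with \(L\) hidden layers. Depth is just the one-layer rule repeated, a loop in code and a recursion in math:
\[ \mathbf{h}^{(0)} = \mathbf{x}, \qquad \mathbf{h}^{(\ell)} = g\big(W^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\big) \quad (\ell = 1,\ldots,L), \qquad \mu = W^{(L+1)}\mathbf{h}^{(L)} + b^{(L+1)} \]
One hidden-layer rule, applied \(L\) times, then a linear head. Each layer has its own \(W^{(\ell)},\mathbf{b}^{(\ell)}\).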
import torch.nn as nn

class DeepMLP(nn.Module):
    def __init__(self, L=4, H=16):
        super().__init__()
        dims = [2] + [H]*L
        self.layers = nn.ModuleList(
            nn.Linear(dims[l], dims[l+1]) for l in range(L))
        self.g = nn.ReLU()               # activation g
        self.head = nn.Linear(H, 1)      # W⁽ᴸ⁺¹⁾, b⁽ᴸ⁺¹⁾

    def forward(self, x):
        h = x                            # h⁽⁰⁾ = x
        for layer in self.layers:        # stack L hidden layers
            h = self.g(layer(h))         # h⁽ˡ⁾ = g(W⁽ˡ⁾h⁽ˡ⁻¹⁾ + b⁽ˡ⁾)
        return self.head(h)              # μ = W⁽ᴸ⁺¹⁾h⁽ᴸ⁾ + b⁽ᴸ⁺¹⁾
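We need \(\nabla_{W^{(\ell)}} E\) for every layer \(\ell\) (writing \(E\) for the loss, since \(L\) already counts the hidden layers). The chain rule propagates gradients backwards:
\[ \frac{\partial E}{\partial W^{(\ell)}} = \frac{\partial E}{\partial \mathbf{h}^{(L)}}\, \frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(L-1)}} \cdots \frac{\partial \mathbf{h}^{(\ell+1)}}{\partial \mathbf{h}^{(\ell)}}\, \frac{\partial \mathbf{h}^{(\ell)}}{\partial W^{(\ell)}} \]
Each factor \(\partial \mathbf{h}^{(\ell+1)} / \partial \mathbf{h}^{(\ell)}\) involves the activation derivative \(g'(\cdot)\) and the weight matrix \(W^{(\ell+1)}\).
In PyTorch, this backward pass is a single call: loss.backward().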
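To see the chain rule at work without an autodiff library, the sketch below differentiates a tiny scalar network by hand and compares the result with a finite-difference estimate. It is an illustration under stated assumptions: two hidden "layers" of one neuron each, \(g = \tanh\) rather than ReLU so that the finite-difference check is smooth, a squared-error loss, and arbitrary parameter values.

```python
import math

def forward(params, x):
    """Scalar network: h1 = g(w1*x + b1), h2 = g(w2*h1 + b2), mu = w3*h2, with g = tanh."""
    w1, b1, w2, b2, w3 = params
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * h1 + b2)
    mu = w3 * h2
    return h1, h2, mu

def loss(params, x, t):
    """Squared error E = 0.5 * (mu - t)^2 for a single sample."""
    return 0.5 * (forward(params, x)[2] - t) ** 2

def grad_w1(params, x, t):
    """dE/dw1 as a product of chain-rule factors, read from the output backwards."""
    w1, b1, w2, b2, w3 = params
    h1, h2, mu = forward(params, x)
    dE_dmu = mu - t
    dmu_dh2 = w3
    dh2_dh1 = (1.0 - h2 ** 2) * w2   # g'(.) times the next layer's weight
    dh1_dw1 = (1.0 - h1 ** 2) * x    # g'(.) times the layer input
    return dE_dmu * dmu_dh2 * dh2_dh1 * dh1_dw1

params = [0.5, -0.1, 1.3, 0.2, -0.7]
x, t, eps = 0.8, 0.3, 1e-6
plus = [params[0] + eps] + params[1:]
minus = [params[0] - eps] + params[1:]
numeric = (loss(plus, x, t) - loss(minus, x, t)) / (2.0 * eps)  # central difference
analytic = grad_w1(params, x, t)
print(analytic, numeric)
```

The two numbers agree to many digits. An autodiff framework builds the same product of factors automatically for every parameter.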
The same demo, now with a slider for the number of hidden layers