Machine Learning for Astroparticle Physics: A Crash-course in SBI
Lecture 1b - Approximate Bayesian Computation
Christoph Weniger — University of Amsterdam (GRAPPA)
Today's Roadmap
1
Rejection ABC The simulator loop, the ε tolerance, the sample-efficiency cliff
→
2
Summary statistics Why a good summary buys orders of magnitude — and why we can't always pick one
→
3
KDE on the joint The ε → 0 limit done right, and the seed of neural SBI
The thread: one running example (a cannon ball). We will build intuition for ABC and discuss its limitations.
Tackling those limitations with deep learning will lead to modern SBI.
Simulation vs. inference — two directions
\(\boldsymbol\theta\)
parameters what we care about
e.g. \(H_0,\; \Omega_m,\) BH masses, lens profile
Simulator \(p(\mathbf{x}\mid\boldsymbol\theta)\)
forward — physics, easy
Inference \(p(\boldsymbol\theta\mid\mathbf{x})\)
inverse — what we want, hard
\(\mathbf{x}\)
data what we observe
images, time-series, power spectra, …
Simulation-based inference. We have only the forward direction (a simulator).
SBI builds the inverse by drawing \((\boldsymbol\theta, \mathbf{x})\) pairs from the simulator and training a neural network to output the posterior directly.
The inference problem — formally
Same story, written with physics notation. Parameters \(\boldsymbol\theta\) take the role of the class or parameters of interest.
Note. SBI only requires samples from \(p(\mathbf{x}\mid\boldsymbol\theta)\) — i.e. a running simulator. Likelihood-based methods (MCMC, HMC, VI) need pointwise evaluations of \(p(\mathbf{x}\mid\boldsymbol\theta)\), which are often unavailable when detector effects or systematics must be marginalised (= integrated) out.
Three demos on a ball throw
From accept/reject, to summary statistics, to modelling the joint density.
Running example — a cannon on flat ground
A cannon fires a ball at launch angle \(\theta\) with initial speed \(v\) (also uncertain). We observe only the noisy landing position \(x\), denoted \(x_\mathrm{obs}\) below.
If the likelihood \(p(x\mid\theta)\) is available analytically, we can sample the posterior directly with MCMC. Same observation, same posterior — different ingredients. This is the LBI sibling of rejection ABC.
Left: likelihood \(L(\theta) = \mathcal{N}(x_\mathrm{obs};\, r(\theta),\, \sigma^2)\) vs \(\theta\). Right: MCMC samples (blue) vs reference posterior (red, dashed). Dotted line at \(\theta_\mathrm{true}\). Ingredients: prior, pointwise likelihood evaluations.
Rejection ABC — the loop
Given the simulator output, a conceptually simple and clean approach to obtain samples from the posterior is:
for i in 1..N:
θ_i ∼ p(θ) # draw from the prior
v_i ∼ p(v) # draw the nuisance
x_i ∼ p(x | v_i, θ_i) # run the simulator
if d(x_i, x_obs) < ε: # accept if close enough
keep θ_i
The kept \(\theta\)'s are samples from an approximation to \(p(\theta\mid x_\mathrm{obs})\) — exact only in the limit \(\varepsilon \to 0\), \(N \to \infty\).
Ingredients: prior, simulator.
Analysis choice: distance \(d\) and tolerance \(\varepsilon\).
(For the live demos we replace \(\varepsilon\) by "keep top-\(M\) closest of \(N\)".)
Rubin 1984 (the loop on paper); Pritchard et al. 1999 (the name "ABC"); Sisson, Fan & Beaumont 2018 (Handbook of ABC).
Demo 1 — Rejection ABC
Results are amortised! Moving the \(x_\mathrm{obs}\) slider does not re-run the simulator — the same cloud is just re-ranked. Amortised = simulation cost paid once up front; every new observation is then essentially free.
Demo 1 — what just happened
Small \(M\) → noisy histogram. The two real peaks drown in Monte Carlo noise.
Large \(M\) (near \(N\)) → "accept everything"; the histogram leaks back toward the prior.
Small \(N\) → very few sims land near \(x_\mathrm{obs}\); even the top-\(M\) are far. Wasted simulator calls.
To sharpen the posterior we must shrink \(\varepsilon_\mathrm{eff}\). To shrink it we must throw more sims. This is the sample-efficiency cliff.
Open question. Each retained sample uses one simulator call. The thousands of rejected ones contribute nothing. Is there a way to use them?
What if we throw the ball \(n_\mathrm{balls}\) times at the same launch angle \(\theta\) and observe \(n_\mathrm{balls}\) landing positions \((x_1, \dots, x_{n_\mathrm{balls}})\)?
Each landing carries its own measurement noise. A natural summary is the mean; its variance shrinks by \(1/n_\mathrm{balls}\).
Toggle on: distance \(d = |\bar x_i - \bar x_\mathrm{obs}|\), \(\varepsilon_\mathrm{eff}\) shrinks as \(\sigma/\sqrt{n_\mathrm{balls}}\).
Toggle off: distance is the max over the raw landing vector, \(d=\max_k |x_{i,k}-x_{\mathrm{obs},k}|\); \(\varepsilon_\mathrm{eff}\) grows slowly (\(\propto \sigma\sqrt{\ln n_\mathrm{balls}}\)). The reference (red) is the same in both cases: it depends only on \(\bar x_\mathrm{obs}\).
Demo 2 — what just happened
Variance of the summary scales as \(\sigma^2/n_\mathrm{balls}\). The effective acceptance radius \(\varepsilon_\mathrm{eff}\) shrinks at fixed \(M/N\).
At the same simulator budget, a well-chosen summary
is critical for ABC to gain optimal precision.
Open question (the one that does not go away).
Here we chose the summary by hand: "take the mean." Easy, because the simulator is one-dimensional and we know what is informative.
When \(x\) is a 256×256 image, a waveform, a sky map, a graph — which function of \(x\) should we condition on? Hand-engineered summaries are difficult to obtain.
Demo 3 — Kernel density estimation (KDE) parameter-data space
Taking the $\epsilon \to 0$ limit can be done using interpolation in joined $(\theta, x)$-space.
Left: Gaussian KDE contours on \(p(\theta, x)\) over all \(N\) sims, with \(x_\mathrm{obs}\) marked. Right: weighted histogram \(w_i = K_{h_x}(x_\mathrm{obs}-x_i)\) and its smoothed KDE slice, vs the reference posterior. No accept/reject, no \(\varepsilon\).
Demo 3 — the \(\varepsilon \to 0\) limit done right
Uses all \(N\) simulations — nothing rejected.
No \(\varepsilon\) to tune. The bandwidth \(h_x\) plays a related role but is chosen automatically; for this 1D simulator, \(h_x \approx \sigma\) (the noise level) is the right answer.
At the same \(N\) as Demo 1, the bimodal structure is recovered cleanly instead of histogrammed-against.
Conceptually: we are modelling the joint density \(p(\theta, x)\) on the simulation cloud, then conditioning at \(x_\mathrm{obs}\). Bayes' theorem on a density model. This is the seed idea of every neural-SBI method that follows: replace ABC's accept/reject by a learned density.
Approximate Bayesian Computation: sample from the prior, run the simulator, keep what lands near the observation. A likelihood-free recipe whose three sliders walked us through the whole story of SBI.
What works generically (and will keep working):
Amortised inference. Pay simulation cost once; any new \(x_\mathrm{obs}\) is a free query.
No likelihood evaluations. Only simulator samples needed, never pointwise \(p(x\mid\theta)\).
Trivial marginalisation. Nuisances fold in for free by sampling them inside the simulator.
What is challenging (and what the rest of the school is about):
Density estimation in high D. KDE collapses; replace by a learned \(q_\phi(\theta\mid x)\) — normalising flows, NPE. (Lec 2–3)