Machine Learning for Astroparticle Physics:
A Crash-course in SBI

Lecture 1b - Approximate Bayesian Computation

Christoph Weniger — University of Amsterdam (GRAPPA)

Today's Roadmap

1
Rejection ABC
The simulator loop, the ε tolerance, the sample-efficiency cliff

→

2
Summary statistics
Why a good summary buys orders of magnitude — and why we can't always pick one

→

3
KDE on the joint
The ε → 0 limit done right, and the seed of neural SBI

The thread: one running example (a cannon ball). We will build intuition for ABC and discuss its limitations. Tackling those limitations with deep learning will lead to modern SBI.

Simulation vs. inference — two directions

$\boldsymbol\theta$

parameters
what we care about

e.g. $H_0,\; \Omega_m,$
BH masses, lens profile

Simulator $p(\mathbf{x}\mid\boldsymbol\theta)$

forward — physics, easy

Inference $p(\boldsymbol\theta\mid\mathbf{x})$

inverse — what we want, hard

$\mathbf{x}$

data
what we observe

images, time-series,
power spectra, …

Simulation-based inference. We have only the forward direction (a simulator). SBI builds the inverse by drawing $(\boldsymbol\theta, \mathbf{x})$ pairs from the simulator and training a neural network to output the posterior directly.

The inference problem — formally

Same story, written with physics notation. Parameters $\boldsymbol\theta$ take the role of the class or parameters of interest.

\[ p(\boldsymbol\theta \mid \mathbf{x}) \;=\; \frac{p(\mathbf{x}\mid\boldsymbol\theta)\, p(\boldsymbol\theta)}{p(\mathbf{x})} \]

Prior $p(\boldsymbol\theta)$: what we believed before seeing data.
Likelihood $p(\mathbf{x}\mid\boldsymbol\theta)$: what the physics model predicts.
Posterior $p(\boldsymbol\theta\mid\mathbf{x})$: what we want.
Evidence / Marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x}\mid\boldsymbol\theta)\, p(\boldsymbol\theta)\, d\boldsymbol\theta$.

Note. SBI only requires samples from $p(\mathbf{x}\mid\boldsymbol\theta)$ — i.e. a running simulator. Likelihood-based methods (MCMC, HMC, VI) need pointwise evaluations of $p(\mathbf{x}\mid\boldsymbol\theta)$, which are often unavailable when detector effects or systematics must be marginalised (= integrated) out.

Three demos on a ball throw

From accept/reject, to summary statistics, to modelling the joint density.

Running example — a cannon on flat ground

A cannon fires a ball at launch angle $\theta$ with initial speed $v$ (also uncertain). We observe only the noisy landing position $x$, denoted $x_\mathrm{obs}$ below.

\[ p(x\mid v,\theta) \;=\; \mathcal{N}\!\left(x;\; \mu(v,\theta),\; \sigma_\mathrm{x}^2\right) \] \[ p(v) \;=\; \mathcal{N}\!\left(v;\; \mu_v,\; \sigma_v^2\right) \] \[ p(\theta) \;=\; \mathcal{U}(5°,\, 85°) \]

with $\mu_v = 10$ m/s, $\sigma_v = 0.2$ m/s, $\sigma_\mathrm{x} = 0.5$ m.

Note: $\theta \leftrightarrow \tfrac{\pi}{2}-\theta$ gives the same expected landing — the posterior is generically bimodal.

$\theta$ = 0.60 rad (34°)

\[ \mu(v,\theta) \;=\; \frac{v^2}{g}\sin(2\theta) \]

Forward model — simulator vs likelihood

Simulator (forward sampling)

\[ \theta \sim p(\theta),\ \ v \sim p(v),\ \ x \sim \mathcal{N}\!\left(\mu(v,\theta),\, \sigma_\mathrm{x}^2\right) \]

Requires sampling distributions in order.

Likelihood (analytic)

\[ p(x_{\rm obs} \mid \theta) \;=\; \int p(x_{\rm obs} \mid v, \theta)\, p(v)\, dv \]

Requires integration over $v$.

Likelihood-based inference — MCMC

If the likelihood $p(x\mid\theta)$ is available analytically, we can sample the posterior directly with MCMC. Same observation, same posterior — different ingredients. This is the LBI sibling of rejection ABC.

$x_\mathrm{obs}$ = 9.50 m

Left: likelihood $L(\theta) = \mathcal{N}(x_\mathrm{obs};\, r(\theta),\, \sigma^2)$ vs $\theta$. Right: MCMC samples (blue) vs reference posterior (red, dashed). Dotted line at $\theta_\mathrm{true}$. Ingredients: prior, pointwise likelihood evaluations.

Rejection ABC — the loop

Given the simulator output, a conceptually simple and clean approach to obtain samples from the posterior is:

for i in 1..N:
    θ_i ∼ p(θ)                     # draw from the prior
    v_i ∼ p(v)                       # draw the nuisance
    x_i ∼ p(x | v_i, θ_i)            # run the simulator
    if d(x_i, x_obs) < ε:           # accept if close enough
        keep θ_i

The kept $\theta$'s are samples from an approximation to $p(\theta\mid x_\mathrm{obs})$ — exact only in the limit $\varepsilon \to 0$, $N \to \infty$.
Ingredients: prior, simulator.
Analysis choice: distance $d$ and tolerance $\varepsilon$.

(For the live demos we replace $\varepsilon$ by "keep top-$M$ closest of $N$".)

Rubin 1984 (the loop on paper); Pritchard et al. 1999 (the name "ABC"); Sisson, Fan & Beaumont 2018 (Handbook of ABC).

Demo 1 — Rejection ABC

$N$ = 2000 $M$ = 200 $x_\mathrm{obs}$ = 9.50 m

Results are amortised! Moving the $x_\mathrm{obs}$ slider does not re-run the simulator — the same cloud is just re-ranked. Amortised = simulation cost paid once up front; every new observation is then essentially free.

Demo 1 — what just happened

Small $M$ → noisy histogram. The two real peaks drown in Monte Carlo noise.
Large $M$ (near $N$) → "accept everything"; the histogram leaks back toward the prior.
Small $N$ → very few sims land near $x_\mathrm{obs}$; even the top-$M$ are far. Wasted simulator calls.
To sharpen the posterior we must shrink $\varepsilon_\mathrm{eff}$. To shrink it we must throw more sims. This is the sample-efficiency cliff.

Open question. Each retained sample uses one simulator call. The thousands of rejected ones contribute nothing. Is there a way to use them?

Beaumont, Zhang & Balding 2002; Marin et al. 2012, "Approximate Bayesian Computational methods," Stat. Comput. 22.

Setting up Demo 2 — many throws, one angle

What if we throw the ball $n_\mathrm{balls}$ times at the same launch angle $\theta$ and observe $n_\mathrm{balls}$ landing positions $(x_1, \dots, x_{n_\mathrm{balls}})$?

Each landing carries its own measurement noise. A natural summary is the mean; its variance shrinks by $1/n_\mathrm{balls}$.

$\theta$ = 0.60 rad

\[ s(x_1,\dots,x_{n_\mathrm{balls}}) \;=\; \frac{1}{n_\mathrm{balls}} \sum_{k=1}^{n_\mathrm{balls}} x_k \]

Demo 2 — ABC with a summary statistic

use summary $s = \tfrac{1}{n}\sum_k x_k$ $N$ = 2000 $M$ = 200 $n_\mathrm{balls}$ = 10

Toggle on: distance $d = |\bar x_i - \bar x_\mathrm{obs}|$, $\varepsilon_\mathrm{eff}$ shrinks as $\sigma/\sqrt{n_\mathrm{balls}}$. Toggle off: distance is the max over the raw landing vector, $d=\max_k |x_{i,k}-x_{\mathrm{obs},k}|$; $\varepsilon_\mathrm{eff}$ grows slowly ($\propto \sigma\sqrt{\ln n_\mathrm{balls}}$). The reference (red) is the same in both cases: it depends only on $\bar x_\mathrm{obs}$.

Demo 2 — what just happened

Variance of the summary scales as $\sigma^2/n_\mathrm{balls}$. The effective acceptance radius $\varepsilon_\mathrm{eff}$ shrinks at fixed $M/N$.
At the same simulator budget, a well-chosen summary is critical for ABC to gain optimal precision.

Open question (the one that does not go away). Here we chose the summary by hand: "take the mean." Easy, because the simulator is one-dimensional and we know what is informative.

When $x$ is a 256×256 image, a waveform, a sky map, a graph — which function of $x$ should we condition on? Hand-engineered summaries are difficult to obtain.

Demo 3 — Kernel density estimation (KDE) parameter-data space

Taking the $\epsilon \to 0$ limit can be done using interpolation in joined $(\theta, x)$-space.

$N$ = 800 $h_x = \sigma\!\cdot\!$1.00 $x_\mathrm{obs}$ = 9.50 m

Left: Gaussian KDE contours on $p(\theta, x)$ over all $N$ sims, with $x_\mathrm{obs}$ marked. Right: weighted histogram $w_i = K_{h_x}(x_\mathrm{obs}-x_i)$ and its smoothed KDE slice, vs the reference posterior. No accept/reject, no $\varepsilon$.

Demo 3 — the $\varepsilon \to 0$ limit done right

Uses all $N$ simulations — nothing rejected.
No $\varepsilon$ to tune. The bandwidth $h_x$ plays a related role but is chosen automatically; for this 1D simulator, $h_x \approx \sigma$ (the noise level) is the right answer.
At the same $N$ as Demo 1, the bimodal structure is recovered cleanly instead of histogrammed-against.

Conceptually: we are modelling the joint density $p(\theta, x)$ on the simulation cloud, then conditioning at $x_\mathrm{obs}$. Bayes' theorem on a density model. This is the seed idea of every neural-SBI method that follows: replace ABC's accept/reject by a learned density.

Blum 2010, "Approximate Bayesian Computation: a nonparametric perspective," JASA 105; Silverman 1986 (KDE).

Key Takeaways

Approximate Bayesian Computation: sample from the prior, run the simulator, keep what lands near the observation. A likelihood-free recipe whose three sliders walked us through the whole story of SBI.
What works generically (and will keep working):
- Amortised inference. Pay simulation cost once; any new $x_\mathrm{obs}$ is a free query.
- No likelihood evaluations. Only simulator samples needed, never pointwise $p(x\mid\theta)$.
- Trivial marginalisation. Nuisances fold in for free by sampling them inside the simulator.
What is challenging (and what the rest of the school is about):
- Density estimation in high D. KDE collapses; replace by a learned $q_\phi(\theta\mid x)$ — normalising flows, NPE. (Lec 2–3)
- Summary construction. Hand-picked summaries don't scale to images, waveforms, graphs — learn them. (Lec 4)

Machine Learning for Astroparticle Physics: A Crash-course in SBI