Machine Learning for Astroparticle Physics:
A Crash-course in SBI

Lecture 1b - Approximate Bayesian Computation

Christoph Weniger — University of Amsterdam (GRAPPA)

Today's Roadmap

1
Rejection ABC
The simulator loop, the ε tolerance, the sample-efficiency cliff
2
Summary statistics
Why a good summary buys orders of magnitude — and why we can't always pick one
3
KDE on the joint
The ε → 0 limit done right, and the seed of neural SBI
The thread: one running example (a cannon ball). We will build intuition for ABC and discuss its limitations. Tackling those limitations with deep learning will lead to modern SBI.

Simulation vs. inference — two directions

\(\boldsymbol\theta\)
parameters
what we care about
e.g. \(H_0,\; \Omega_m,\)
BH masses, lens profile
Simulator \(p(\mathbf{x}\mid\boldsymbol\theta)\)

forward — physics, easy

Inference \(p(\boldsymbol\theta\mid\mathbf{x})\)

inverse — what we want, hard

\(\mathbf{x}\)
data
what we observe
images, time-series,
power spectra, …

Simulation-based inference. We have only the forward direction (a simulator). SBI builds the inverse by drawing \((\boldsymbol\theta, \mathbf{x})\) pairs from the simulator and training a neural network to output the posterior directly.

The inference problem — formally

Same story, written with physics notation. Parameters \(\boldsymbol\theta\) take the role of the class or parameters of interest.

\[ p(\boldsymbol\theta \mid \mathbf{x}) \;=\; \frac{p(\mathbf{x}\mid\boldsymbol\theta)\, p(\boldsymbol\theta)}{p(\mathbf{x})} \]
  • Prior \(p(\boldsymbol\theta)\): what we believed before seeing data.
  • Likelihood \(p(\mathbf{x}\mid\boldsymbol\theta)\): what the physics model predicts.
  • Posterior \(p(\boldsymbol\theta\mid\mathbf{x})\): what we want.
  • Evidence / Marginal likelihood \(p(\mathbf{x}) = \int p(\mathbf{x}\mid\boldsymbol\theta)\, p(\boldsymbol\theta)\, d\boldsymbol\theta\).

Note. SBI only requires samples from \(p(\mathbf{x}\mid\boldsymbol\theta)\) — i.e. a running simulator. Likelihood-based methods (MCMC, HMC, VI) need pointwise evaluations of \(p(\mathbf{x}\mid\boldsymbol\theta)\), which are often unavailable when detector effects or systematics must be marginalised (= integrated) out.

Three demos on a ball throw

From accept/reject, to summary statistics, to modelling the joint density.

Running example — a cannon on flat ground

A cannon fires a ball at launch angle \(\theta\) with initial speed \(v\) (also uncertain). We observe only the noisy landing position \(x\), denoted \(x_\mathrm{obs}\) below.

\[ p(x\mid v,\theta) \;=\; \mathcal{N}\!\left(x;\; \mu(v,\theta),\; \sigma_\mathrm{x}^2\right) \] \[ p(v) \;=\; \mathcal{N}\!\left(v;\; \mu_v,\; \sigma_v^2\right) \] \[ p(\theta) \;=\; \mathcal{U}(5°,\, 85°) \]

with \(\mu_v = 10\) m/s, \(\sigma_v = 0.2\) m/s, \(\sigma_\mathrm{x} = 0.5\) m.

Note: \(\theta \leftrightarrow \tfrac{\pi}{2}-\theta\) gives the same expected landing — the posterior is generically bimodal.

\[ \mu(v,\theta) \;=\; \frac{v^2}{g}\sin(2\theta) \]

Forward model — simulator vs likelihood

Simulator (forward sampling)

\[ \theta \sim p(\theta),\ \ v \sim p(v),\ \ x \sim \mathcal{N}\!\left(\mu(v,\theta),\, \sigma_\mathrm{x}^2\right) \]

Requires sampling distributions in order.

Likelihood (analytic)

\[ p(x_{\rm obs} \mid \theta) \;=\; \int p(x_{\rm obs} \mid v, \theta)\, p(v)\, dv \]

Requires integration over \(v\).

Likelihood-based inference — MCMC

If the likelihood \(p(x\mid\theta)\) is available analytically, we can sample the posterior directly with MCMC. Same observation, same posterior — different ingredients. This is the LBI sibling of rejection ABC.

Left: likelihood \(L(\theta) = \mathcal{N}(x_\mathrm{obs};\, r(\theta),\, \sigma^2)\) vs \(\theta\). Right: MCMC samples (blue) vs reference posterior (red, dashed). Dotted line at \(\theta_\mathrm{true}\). Ingredients: prior, pointwise likelihood evaluations.

Rejection ABC — the loop

Given the simulator output, a conceptually simple and clean approach to obtain samples from the posterior is:

for i in 1..N:
    θ_i ∼ p(θ)                     # draw from the prior
    v_i ∼ p(v)                       # draw the nuisance
    x_i ∼ p(x | v_i, θ_i)            # run the simulator
    if d(x_i, x_obs) < ε:           # accept if close enough
        keep θ_i
  • The kept \(\theta\)'s are samples from an approximation to \(p(\theta\mid x_\mathrm{obs})\) — exact only in the limit \(\varepsilon \to 0\), \(N \to \infty\).
  • Ingredients: prior, simulator.
  • Analysis choice: distance \(d\) and tolerance \(\varepsilon\).

(For the live demos we replace \(\varepsilon\) by "keep top-\(M\) closest of \(N\)".)

Rubin 1984 (the loop on paper); Pritchard et al. 1999 (the name "ABC"); Sisson, Fan & Beaumont 2018 (Handbook of ABC).

Demo 1 — Rejection ABC

Results are amortised! Moving the \(x_\mathrm{obs}\) slider does not re-run the simulator — the same cloud is just re-ranked. Amortised = simulation cost paid once up front; every new observation is then essentially free.

Demo 1 — what just happened

  • Small \(M\) → noisy histogram. The two real peaks drown in Monte Carlo noise.
  • Large \(M\) (near \(N\)) → "accept everything"; the histogram leaks back toward the prior.
  • Small \(N\) → very few sims land near \(x_\mathrm{obs}\); even the top-\(M\) are far. Wasted simulator calls.
  • To sharpen the posterior we must shrink \(\varepsilon_\mathrm{eff}\). To shrink it we must throw more sims. This is the sample-efficiency cliff.
Open question. Each retained sample uses one simulator call. The thousands of rejected ones contribute nothing. Is there a way to use them?

Beaumont, Zhang & Balding 2002; Marin et al. 2012, "Approximate Bayesian Computational methods," Stat. Comput. 22.

Setting up Demo 2 — many throws, one angle

What if we throw the ball \(n_\mathrm{balls}\) times at the same launch angle \(\theta\) and observe \(n_\mathrm{balls}\) landing positions \((x_1, \dots, x_{n_\mathrm{balls}})\)?

Each landing carries its own measurement noise. A natural summary is the mean; its variance shrinks by \(1/n_\mathrm{balls}\).

\[ s(x_1,\dots,x_{n_\mathrm{balls}}) \;=\; \frac{1}{n_\mathrm{balls}} \sum_{k=1}^{n_\mathrm{balls}} x_k \]

Demo 2 — ABC with a summary statistic

Toggle on: distance \(d = |\bar x_i - \bar x_\mathrm{obs}|\), \(\varepsilon_\mathrm{eff}\) shrinks as \(\sigma/\sqrt{n_\mathrm{balls}}\). Toggle off: distance is the max over the raw landing vector, \(d=\max_k |x_{i,k}-x_{\mathrm{obs},k}|\); \(\varepsilon_\mathrm{eff}\) grows slowly (\(\propto \sigma\sqrt{\ln n_\mathrm{balls}}\)). The reference (red) is the same in both cases: it depends only on \(\bar x_\mathrm{obs}\).

Demo 2 — what just happened

  • Variance of the summary scales as \(\sigma^2/n_\mathrm{balls}\). The effective acceptance radius \(\varepsilon_\mathrm{eff}\) shrinks at fixed \(M/N\).
  • At the same simulator budget, a well-chosen summary is critical for ABC to gain optimal precision.
Open question (the one that does not go away). Here we chose the summary by hand: "take the mean." Easy, because the simulator is one-dimensional and we know what is informative.

When \(x\) is a 256×256 image, a waveform, a sky map, a graph — which function of \(x\) should we condition on? Hand-engineered summaries are difficult to obtain.

Demo 3 — Kernel density estimation (KDE) parameter-data space

Taking the $\epsilon \to 0$ limit can be done using interpolation in joined $(\theta, x)$-space.

Left: Gaussian KDE contours on \(p(\theta, x)\) over all \(N\) sims, with \(x_\mathrm{obs}\) marked. Right: weighted histogram \(w_i = K_{h_x}(x_\mathrm{obs}-x_i)\) and its smoothed KDE slice, vs the reference posterior. No accept/reject, no \(\varepsilon\).

Demo 3 — the \(\varepsilon \to 0\) limit done right

  • Uses all \(N\) simulations — nothing rejected.
  • No \(\varepsilon\) to tune. The bandwidth \(h_x\) plays a related role but is chosen automatically; for this 1D simulator, \(h_x \approx \sigma\) (the noise level) is the right answer.
  • At the same \(N\) as Demo 1, the bimodal structure is recovered cleanly instead of histogrammed-against.
Conceptually: we are modelling the joint density \(p(\theta, x)\) on the simulation cloud, then conditioning at \(x_\mathrm{obs}\). Bayes' theorem on a density model. This is the seed idea of every neural-SBI method that follows: replace ABC's accept/reject by a learned density.

Blum 2010, "Approximate Bayesian Computation: a nonparametric perspective," JASA 105; Silverman 1986 (KDE).

Key Takeaways

  • Approximate Bayesian Computation: sample from the prior, run the simulator, keep what lands near the observation. A likelihood-free recipe whose three sliders walked us through the whole story of SBI.
  • What works generically (and will keep working):
    • Amortised inference. Pay simulation cost once; any new \(x_\mathrm{obs}\) is a free query.
    • No likelihood evaluations. Only simulator samples needed, never pointwise \(p(x\mid\theta)\).
    • Trivial marginalisation. Nuisances fold in for free by sampling them inside the simulator.
  • What is challenging (and what the rest of the school is about):
    • Density estimation in high D. KDE collapses; replace by a learned \(q_\phi(\theta\mid x)\) — normalising flows, NPE. (Lec 2–3)
    • Summary construction. Hand-picked summaries don't scale to images, waveforms, graphs — learn them. (Lec 4)