class: left, middle

# Bayesian posterior estimation with classification networks

.right[
Christoph Weniger, GRAPPA University of Amsterdam

28 September 2020
]

.left[BSc Course Machine learning for Physics and Astronomy, 2020]

---

# All measurements are uncertain

.grid[
.kol-1-2[
## Causes of uncertainties
- Random fluctuations
  - photon shot noise, ...
- Detector systematics
  - calibration uncertainties, ...
- Uncertainties in theoretical models
  - cosmological constant, ...
- Imperfect/simplistic models
  - modeling complex astrophysical data
- Limitations of the analysis framework
  - assumptions about well-behaved noise
]
.kol-1-2[
.center[Example: Cosmological parameters]

.center[.width-80[]]
]
]

.footnote[Credit: ESO]

--
count: false

.red[.center[How can we describe these uncertainties mathematically?]]

---

# Probability distributions

.center.red[Continuous random numbers are described by probability density functions]

.grid[
.kol-1-2[
.center[1-dim standard normal distribution]

.width-100[]

$$ \int_{-\infty}^\infty dx\, P(x) = 1 $$
]
.kol-1-2[
.center[.width-80[]]

$$ \int_{\mathbb{R}^2} dx_1 dx_2\, P(x_1, x_2) = 1 $$
]
]

.center[Any non-negative function normalized to one can act as a probability distribution.]

---

# Sampling from a probability distribution

- Random draws $x$ from $P(x)$ can be written as $x \sim P(x)$

--
count: false

- Making a histogram of many draws recovers the shape of $P(x)$

.grid[
.kol-1-2[
.center[.width-100[]]
]
.kol-1-2[
Red line: $\propto P(x)$

Blue bars: histogram of many draws $x \sim P(x)$
]
]

---

class: center, middle, black-slide

# The Galton board / Bean machine

## An example of random sampling from the normal distribution
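---

# Code aside: sampling and histogramming

The sampling idea from the previous slides can be reproduced in a few lines. This is a minimal sketch, not part of the original lecture material, assuming `numpy` and `matplotlib`; the Galton board is emulated by summing many random left/right steps.

```python
import numpy as np
import matplotlib.pyplot as plt

# Direct draws from the 1-dim standard normal distribution, x ~ P(x)
x = np.random.randn(100_000)

# Galton-board emulation: each ball takes n_rows random left/right (+/-1) steps;
# the rescaled sum of many such steps is approximately standard normal
n_balls, n_rows = 100_000, 50
steps = np.random.choice([-1.0, 1.0], size=(n_balls, n_rows))
galton = steps.sum(axis=1) / np.sqrt(n_rows)

# Histogramming many draws recovers the shape of P(x)
grid = np.linspace(-4, 4, 200)
plt.hist(x, bins=60, density=True, alpha=0.5, label="x ~ N(0, 1)")
plt.hist(galton, bins=30, density=True, alpha=0.5, label="Galton board")
plt.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi), "r", label="P(x)")
plt.legend()
plt.show()
```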
---

# Frequentist vs Bayesian* interpretation

.center[.width-40[]]

**Bayesian statistics**: These distributions can also describe our **belief** (plausibility/probability) that a certain parameter has a certain value.

.footnote[Source: The Cartoon Guide to Statistics]

---

# Conditional probabilities

.grid[
.kol-1-2[
1) Probability of $x$ and $y$:
$$ P(x, y) $$

2) Probability of $y$, obtained by "marginalizing" over $x$:
$$ P(y) = \int_{-\infty}^\infty dx\, P(x, y) $$

3) .red[Conditional probability] of $x$ given $y$:
$$ P(x|y) = \frac{P(x, y)}{P(y)} $$
]
.kol-1-2[
.center[]

.width-100[]
]
]

--
count: false

.red[.center[By definition]]

$$ P(x|y) P(y) = P(y|x) P(x) $$

---

# Updating beliefs with Bayes' theorem

.center.red[Bayes' theorem provides a rule for updating beliefs in light of new data]

.grid[
.kol-1-2[
## Likelihood $P(D|H)$
How probable is the data $D$ given that our hypothesis $H$ is true?
]
.kol-1-2[
## Prior $P(H)$
How probable was our hypothesis $H$ before observing the evidence?
]
]

$$ P(H|D) = \frac{P(D|H) P(H)}{P(D)} $$

--
count: false
.grid[
.kol-1-2[
## Posterior $P(H|D)$
How probable is our hypothesis $H$ given the observed data $D$?
]
.kol-1-2[
## Marginal likelihood $P(D)$
How probable is the new data $D$ under all possible hypotheses $H$?
$$ P(D) \equiv \int dH\, P(D|H) P(H) $$
]
]

---

# Going from prior to posterior

.center.width-90[]

---

exclude: true

# Update rule in practice

.grid[
.kol-1-3[
- Prior beliefs: $P(H)$
- Evidence/likelihood: $P(D|H)$
- Posterior beliefs: $P(H|D)$
]
.kol-2-3[
.center[]

.width-100[]
]
]

---

# Marginal likelihoods

.center.red[However, in many cases the likelihood of observing $D$ given hypothesis $H$, $P(D|H)$, is not actually known, or is very hard to calculate.]

$$ P(D|H) = \int d\vec\eta \, P(D|H, \vec\eta) P(\vec\eta|H) $$

- You can think of $\vec\eta$ as, for instance, the different paths that a single ball can take in the Galton board example.
- The integral over $\vec\eta$ is often "intractable", i.e. not doable in practice because it is very high-dimensional, etc.

---

# Physics examples - gamma rays

.grid[
.kol-1-2[
.center[Gamma-ray sky]

.width-100[]
]
.kol-1-2[
.center[Observed gamma-ray sources]

.width-100[]
]
]

--
count: false

## It is difficult to determine the properties of source populations. E.g.:
- What is the spatial distribution of pulsars in the Galactic disk?

--
count: false

- What is the luminosity function of blazars at high Galactic latitudes?

---

# Physics examples - strong lensing

.grid[
.kol-1-2[
.center[Strong gravitational lensing due to dark and visible matter]

.width-100[]
]
.kol-1-2[
.center[Example: Distant galaxy lensed by a red lens galaxy along the line-of-sight]

.width-100[]
]
]

--
count: false

## Questions
- What does the unperturbed source look like, and what does the lens look like?

--
count: false

- What can we learn about the nature of dark matter? Small clumps of dark matter would lead to characteristic distortions in the image.

---

# Physics examples - collider physics

.grid[
.kol-1-2[
.center[Illustration of a collision at the ATLAS detector]

.width-100[]
]
.kol-1-2[
.center[Invariant mass of the 4-lepton channel]

.width-80[]
]
]

--
count: false

## Typical questions are:
- What processes have most likely contributed to the 4-lepton signal?

--
count: false

- How does this constrain the Higgs production cross section?

---

exclude: true

# The likelihood-to-evidence ratio

In order to evaluate the probability of any outcome, we often have to sum or integrate over a very large number of random variables, here called $z$, that we are not actually interested in.

$$ P(D|H) = \underbrace{\int dz}_\text{\color{red} intractable} P(D|H, z) P(z|H) $$

## Examples
- Positions of individual point sources on the sky
- Positions of individual dark matter halos, individual galaxy images
- Physical mechanisms that lead to a specific 4-lepton invariant mass

.red.center[There are many ways to solve the integral approximately. Here, we will use neural networks to help us out.]
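---

# Code aside: Bayes' theorem on a grid

For a one-dimensional hypothesis, the prior-to-posterior update can be carried out numerically by brute force. The following is a minimal sketch, not part of the original lecture material, assuming `numpy`; the data values and the Gaussian noise model are made up for illustration. It approximates the marginal likelihood $P(D)$ by a Riemann sum over a grid of hypotheses $H$.

```python
import numpy as np

# Hypothetical data D: three noisy measurements of an unknown mean H (noise sigma = 1)
D = np.array([1.2, 0.8, 1.5])
sigma = 1.0

# Grid of hypotheses H and a Gaussian prior P(H) with width 2, normalized on the grid
H = np.linspace(-5, 5, 1001)
dH = H[1] - H[0]
prior = np.exp(-H**2 / (2 * 2.0**2))
prior /= prior.sum() * dH

# Likelihood P(D|H): product of independent Gaussian measurement likelihoods
likelihood = np.prod(
    np.exp(-(D[None, :] - H[:, None])**2 / (2 * sigma**2))
    / np.sqrt(2 * np.pi * sigma**2),
    axis=1,
)

# Marginal likelihood P(D) = int dH P(D|H) P(H), approximated by a Riemann sum
evidence = np.sum(likelihood * prior) * dH

# Posterior P(H|D) = P(D|H) P(H) / P(D)
posterior = likelihood * prior / evidence
print("Posterior mean:", np.sum(H * posterior) * dH)
```

For a single parameter this brute-force integral is trivial; the point of the following slides is that it becomes intractable once the hypothesis (or the nuisance parameters $\vec\eta$) is high-dimensional.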
---

# Neural likelihood-free inference

.bold[Starting point]: for any pair of observation $\vec x$ and model parameters $\vec z$, the goal is to estimate the probability that this pair belongs to one of the following classes:

.grid[
.kol-1-2[
$H_0$: Data $\vec x$ corresponds to model parameters $\vec z$

$H_1$: Data $\vec x$ and model parameters $\vec z$ are unrelated
]
.kol-1-2[
$(\vec x, \vec z) \sim P(\vec x, \vec z) = P(\vec x|\vec z)P(\vec z)$

$(\vec x, \vec z) \sim P(\vec x)P(\vec z)$
]
]

.footnote[See Hermans+Louppe 2019]

---

# Joint vs marginal samples

1) Examples for $H_0$, jointly sampled from $(\vec x, z) \sim P(\vec x|z) P(z)$

.grid[
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-70[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-80[]
]
]

2) Examples for $H_1$, marginally sampled from $(\vec x, z) \sim P(\vec x) P(z)$

.grid[
.kol-1-6[
.center[Donkey]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-70[]
]
.kol-1-6[
.center[Donkey]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-80[]
]
]

.center[Data: $\vec x = \text{Image}$; Label: $z \in \{\text{Cat}, \text{Donkey}\}$]

---

# Loss function

.bold[Strategy:] We train a neural network $d_\phi(\vec x, z) \in [0, 1]$ as a binary classifier to estimate the probability of hypothesis $H_0$ versus $H_1$. The network output can be interpreted, for a given input pair $\vec x$ and $z$, as the probability that $H_0$ is true.

- $H_0$ is true: $d_\phi(\vec x, z) \simeq 1$
- $H_1$ is true: $d_\phi(\vec x, z) \simeq 0$

The corresponding loss function is the so-called "binary cross-entropy"

$$ L\left[d(\vec x, z)\right] = -\int d\vec x\, dz \left[ p(\vec x, z) \ln\left(d(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d(\vec x, z)\right) \right] $$

--
count: false

Minimizing that function (see next slide) w.r.t. the network parameters $\phi$ yields

$$ d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)} $$

---

# Properties of optimized network

## Some analytical estimates

$$ \frac{\partial}{\partial\phi} L\left[d_\phi(\vec x, z)\right] = -\frac{\partial}{\partial\phi} \int d\vec x\, dz \left[ p(\vec x, z) \ln\left(d_\phi(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d_\phi(\vec x, z)\right) \right] $$

$$ = -\int d\vec x\, dz \left[ \frac{p(\vec x, z)}{d_\phi(\vec x, z)} - \frac{p(\vec x)p(z)}{1-d_\phi(\vec x, z)} \right] \frac{\partial d_\phi(\vec x, z)}{\partial \phi} $$

Setting the part in square brackets to zero shows that the network is optimized once

$$ d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)} $$

---

# Likelihood-to-evidence ratio

.red.center[Training binary classification networks yields true Bayesian posterior estimates!]

With a bit of math one can show that

$$ r(\vec x, z) \equiv \frac{d(\vec x, z)}{1 - d(\vec x, z)} \simeq \frac{p(\vec x|z)}{p(\vec x)} = \frac{p(z|\vec x)}{p(z)} $$

Once we have trained the network $d_\phi(\vec x, z)$, we can **estimate the posterior**

$$ p(z|\vec x) \simeq r(\vec x, z) p(z) $$

---

# Example network

## An example network implementing $d_\phi(\vec x, z)$

.center.width-70[]

.center[(our $\vec x$ corresponds to the noisy 1-dim data; our $z$ corresponds to $\vartheta$ here)]

---

# Example: Strong lensing image analysis

Here we trained a neural network to recognize subhalos in an image (indicated by a red dot). The contours show the most likely regions inferred by the network.

.center[.width-100[]]

The shown posteriors are effectively marginalized over thousands of source and lens parameters.
Those marginal posteriors can be evaluated in seconds once the network is trained (and training takes about 20 minutes).

.footnote[Preliminary, Coogan+ 2020 in preparation]

---

# Exercise: Neural posterior estimation

The overall goal of the exercise is to perform posterior estimation with classification networks. This is broken down into several tasks.

1. Training of a parameter regression network
  - **Point estimation** of ring radii in an image.
2. Training of a classification network
  - Train a network to predict the **probability** of an image containing different numbers of rings.
3. Training of a posterior estimation network
  - **Posterior estimation** of ring radii in an image (a minimal sketch of the underlying training loop is given on the appendix slide).

.center.red[Enjoy!]
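---

# Appendix: minimal ratio-estimation sketch

This appendix slide is not part of the original exercise material. It is a minimal, hypothetical sketch, assuming PyTorch and a toy one-dimensional simulator in place of the ring images, of how the binary cross-entropy loss and the ratio $r = d/(1-d)$ from the earlier slides translate into a training loop.

```python
import math
import torch
import torch.nn as nn

# Toy simulator (stand-in for the ring images): prior z ~ N(0, 1), data x = z + noise
def simulate(n):
    z = torch.randn(n, 1)            # z ~ p(z)
    x = z + 0.3 * torch.randn(n, 1)  # x ~ p(x|z)
    return x, z

# Binary classifier d_phi(x, z) in [0, 1]
d_phi = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(d_phi.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    x, z = simulate(256)                # joint samples (x, z) ~ p(x, z)        -> H0, label 1
    z_marg = z[torch.randperm(len(z))]  # shuffling z gives (x, z) ~ p(x) p(z)  -> H1, label 0
    inputs = torch.cat([torch.cat([x, z], dim=1), torch.cat([x, z_marg], dim=1)])
    labels = torch.cat([torch.ones(len(x), 1), torch.zeros(len(x), 1)])
    loss = bce(d_phi(inputs), labels)   # the binary cross-entropy from the lecture
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Posterior estimate for one observation x0: p(z|x0) ~ r(x0, z) p(z), with r = d / (1 - d)
x0 = torch.tensor([[0.7]])
z_grid = torch.linspace(-3, 3, 301).unsqueeze(1)
with torch.no_grad():
    d = d_phi(torch.cat([x0.expand_as(z_grid), z_grid], dim=1)).clamp(1e-6, 1 - 1e-6)
r = (d / (1 - d)).squeeze()
prior = torch.exp(-z_grid.squeeze()**2 / 2) / math.sqrt(2 * math.pi)
posterior = r * prior  # unnormalized estimate of p(z | x0)
```

The key step is the shuffling of $z$ within a batch, which turns joint samples ($H_0$) into marginal samples ($H_1$) without running the simulator twice.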