class: left, middle

# Bayesian posterior estimation with classification networks

.right[
Christoph Weniger, GRAPPA University of Amsterdam

28 September 2020
]

.left[BSc Course Machine learning for Physics and Astronomy, 2020]

---

# All measurements are uncertain

.grid[
.kol-1-2[
## Causes of uncertainties
- Random fluctuations
  - photon shot noise, ...
- Detector systematics
  - calibration uncertainties, ...
- Uncertainties in theoretical models
  - cosmological constant, ...
- Imperfect/simplistic models
  - modeling complex astrophysical data
- Limitations of the analysis framework
  - assumptions about well-behaved noise
]
.kol-1-2[
.center[Example: Cosmological parameters]

.center[.width-80[]]
]
]

.footnote[Credit: ESO]

--
count: false

.red[.center[How can we describe these uncertainties mathematically?]]

---

# Probability distributions

.center.red[Continuous random numbers are described by probability density functions]

.grid[
.kol-1-2[
.center[1-dim standard normal distribution]

.width-100[]

$$ \int_{-\infty}^\infty dx\, P(x) = 1 $$
]
.kol-1-2[
.center[.width-80[]]

$$ \int_{\mathbb{R}^2} dx_1 dx_2\, P(x_1, x_2) = 1 $$
]
]

.center[Any non-negative function normalized to one can act as a probability distribution.]

---

# Sampling from a probability distribution

- Random draws $x$ from $P(x)$ can be written as $x \sim P(x)$

--
count: false

- Making a histogram of many draws recovers the shape of $P(x)$

.grid[
.kol-1-2[
.center[.width-100[]]
]
.kol-1-2[
Red line: $\propto P(x)$

Blue bars: histogram of many draws $x \sim P(x)$
]
]

---

class: center, middle, black-slide

# The Galton board / Bean machine

## An example of random sampling from the normal distribution
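---

# Code aside: sampling and histogramming

The sampling idea from the previous slides can be reproduced in a few lines. This is a minimal sketch, not part of the original lecture material, assuming `numpy` and `matplotlib`; the Galton board is emulated by summing many random left/right steps.

```python
import numpy as np
import matplotlib.pyplot as plt

# Direct draws from the 1-dim standard normal distribution, x ~ P(x)
x = np.random.randn(100_000)

# Galton-board emulation: each ball takes n_rows random left/right (+/-1) steps;
# the rescaled sum of many such steps is approximately standard normal
n_balls, n_rows = 100_000, 50
steps = np.random.choice([-1.0, 1.0], size=(n_balls, n_rows))
galton = steps.sum(axis=1) / np.sqrt(n_rows)

# Histogramming many draws recovers the shape of P(x)
grid = np.linspace(-4, 4, 200)
plt.hist(x, bins=60, density=True, alpha=0.5, label="x ~ N(0, 1)")
plt.hist(galton, bins=30, density=True, alpha=0.5, label="Galton board")
plt.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi), "r", label="P(x)")
plt.legend()
plt.show()
```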
---

# Frequentist vs Bayesian* interpretation

.center[.width-40[]]

**Bayesian statistics**: These distributions can also describe our **belief** (plausibility/probability) that a certain parameter has a certain value.

.footnote[Source: The Cartoon Guide to Statistics]

---

# Conditional probabilities

.grid[
.kol-1-2[
1) Probability of $x$ and $y$:
$$ P(x, y) $$

2) Probability of $y$, obtained by "marginalizing" over $x$:
$$ P(y) = \int_{-\infty}^\infty dx\, P(x, y) $$

3) .red[Conditional probability] of $x$ given $y$:
$$ P(x|y) = \frac{P(x, y)}{P(y)} $$
]
.kol-1-2[
.center[]

.width-100[]
]
]

--
count: false

.red[.center[By definition]]

$$ P(x|y) P(y) = P(y|x) P(x) $$

---

# Updating beliefs with Bayes' theorem

.center.red[Bayes' theorem provides a rule for updating beliefs in light of new data]

.grid[
.kol-1-2[
## Likelihood $P(D|H)$
How probable is the data $D$ given that our hypothesis $H$ is true?
]
.kol-1-2[
## Prior $P(H)$
How probable was our hypothesis $H$ before observing the evidence?
]
]

$$ P(H|D) = \frac{P(D|H) P(H)}{P(D)} $$

--
count: false
.grid[
.kol-1-2[
## Posterior $P(H|D)$
How probable is our hypothesis $H$ given the observed data $D$?
]
.kol-1-2[
## Marginal likelihood $P(D)$
How probable is the new data $D$ under all possible hypotheses $H$?
$$ P(D) \equiv \int dH\, P(D|H) P(H) $$
]
]

---

# Going from prior to posterior

.center.width-90[]

---

exclude: true

# Update rule in practice

.grid[
.kol-1-3[
- Prior beliefs: $P(H)$
- Evidence/likelihood: $P(D|H)$
- Posterior beliefs: $P(H|D)$
]
.kol-2-3[
.center[]

.width-100[]
]
]

---

# Marginal likelihoods

.center.red[However, in many cases the likelihood of observing $D$ given hypothesis $H$, $P(D|H)$, is not actually known, or is very hard to calculate.]

$$ P(D|H) = \int d\vec\eta \, P(D|H, \vec\eta) P(\vec\eta|H) $$

- You can think of $\vec\eta$ as, for instance, the different paths that a single ball can take in the Galton board example.
- The integral over $\vec\eta$ is often "intractable", i.e. not doable in practice because it is very high-dimensional, etc.

---

# Physics examples - gamma rays

.grid[
.kol-1-2[
.center[Gamma-ray sky]

.width-100[]
]
.kol-1-2[
.center[Observed gamma-ray sources]

.width-100[]
]
]

--
count: false

## It is difficult to determine the properties of source populations. E.g.:
- What is the spatial distribution of pulsars in the Galactic disk?

--
count: false

- What is the luminosity function of blazars at high Galactic latitudes?

---

# Physics examples - strong lensing

.grid[
.kol-1-2[
.center[Strong gravitational lensing due to dark and visible matter]

.width-100[]
]
.kol-1-2[
.center[Example: Distant galaxy lensed by a red lens galaxy along the line-of-sight]

.width-100[]
]
]

--
count: false

## Questions
- What does the unperturbed source look like, and what does the lens look like?

--
count: false

- What can we learn about the nature of dark matter? Small clumps of dark matter would lead to characteristic distortions in the image.

---

# Physics examples - collider physics

.grid[
.kol-1-2[
.center[Illustration of a collision at the ATLAS detector]

.width-100[]
]
.kol-1-2[
.center[Invariant mass of the 4-lepton channel]

.width-80[]
]
]

--
count: false

## Typical questions are:
- What processes have most likely contributed to the 4-lepton signal?

--
count: false

- How does this constrain the Higgs production cross section?

---

exclude: true

# The likelihood-to-evidence ratio

In order to evaluate the probability of any outcome, we often have to sum or integrate over a very large number of random variables, here called $z$, that we are not actually interested in.

$$ P(D|H) = \underbrace{\int dz}_\text{\color{red} intractable} P(D|H, z) P(z|H) $$

## Examples
- Positions of individual point sources on the sky
- Positions of individual dark matter halos, individual galaxy images
- Physical mechanisms that lead to a specific 4-lepton invariant mass

.red.center[There are many ways to solve the integral approximately. Here, we will use neural networks to help us out.]
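---

# Code aside: Bayes' theorem on a grid

For a one-dimensional hypothesis, the prior-to-posterior update can be carried out numerically by brute force. The following is a minimal sketch, not part of the original lecture material, assuming `numpy`; the data values and the Gaussian noise model are made up for illustration. It approximates the marginal likelihood $P(D)$ by a Riemann sum over a grid of hypotheses $H$.

```python
import numpy as np

# Hypothetical data D: three noisy measurements of an unknown mean H (noise sigma = 1)
D = np.array([1.2, 0.8, 1.5])
sigma = 1.0

# Grid of hypotheses H and a Gaussian prior P(H) with width 2, normalized on the grid
H = np.linspace(-5, 5, 1001)
dH = H[1] - H[0]
prior = np.exp(-H**2 / (2 * 2.0**2))
prior /= prior.sum() * dH

# Likelihood P(D|H): product of independent Gaussian measurement likelihoods
likelihood = np.prod(
    np.exp(-(D[None, :] - H[:, None])**2 / (2 * sigma**2))
    / np.sqrt(2 * np.pi * sigma**2),
    axis=1,
)

# Marginal likelihood P(D) = int dH P(D|H) P(H), approximated by a Riemann sum
evidence = np.sum(likelihood * prior) * dH

# Posterior P(H|D) = P(D|H) P(H) / P(D)
posterior = likelihood * prior / evidence
print("Posterior mean:", np.sum(H * posterior) * dH)
```

For a single parameter this brute-force integral is trivial; the point of the following slides is that it becomes intractable once the hypothesis (or the nuisance parameters $\vec\eta$) is high-dimensional.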
---

# Neural likelihood-free inference

.bold[Starting point]: for any pair of observation $\vec x$ and model parameters $\vec z$, the goal is to estimate the probability that this pair belongs to one of the following classes:

.grid[
.kol-1-2[
$H_0$: Data $\vec x$ corresponds to model parameters $\vec z$

$H_1$: Data $\vec x$ and model parameters $\vec z$ are unrelated
]
.kol-1-2[
$(\vec x, \vec z) \sim P(\vec x, \vec z) = P(\vec x|\vec z)P(\vec z)$

$(\vec x, \vec z) \sim P(\vec x)P(\vec z)$
]
]

.footnote[See Hermans+Louppe 2019]

---

# Joint vs marginal samples

1) Examples for $H_0$, jointly sampled from $(\vec x, z) \sim P(\vec x|z) P(z)$

.grid[
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-70[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-80[]
]
]

2) Examples for $H_1$, marginally sampled from $(\vec x, z) \sim P(\vec x) P(z)$

.grid[
.kol-1-6[
.center[Donkey]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-100[]
]
.kol-1-6[
.center[Cat]

.width-70[]
]
.kol-1-6[
.center[Donkey]

.width-100[]
]
.kol-1-6[
.center[Donkey]

.width-80[]
]
]

.center[Data: $\vec x = \text{Image}$; Label: $z \in \{\text{Cat}, \text{Donkey}\}$]

---

# Loss function

.bold[Strategy:] We train a neural network $d_\phi(\vec x, z) \in [0, 1]$ as a binary classifier to estimate the probability of hypothesis $H_0$ versus $H_1$. The network output can be interpreted, for a given input pair $\vec x$ and $z$, as the probability that $H_0$ is true.

- $H_0$ is true: $d_\phi(\vec x, z) \simeq 1$
- $H_1$ is true: $d_\phi(\vec x, z) \simeq 0$

The corresponding loss function is the so-called "binary cross-entropy"

$$ L\left[d(\vec x, z)\right] = -\int d\vec x\, dz \left[ p(\vec x, z) \ln\left(d(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d(\vec x, z)\right) \right] $$

--
count: false

Minimizing that function (see next slide) w.r.t. the network parameters $\phi$ yields

$$ d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)} $$

---

# Properties of optimized network

## Some analytical estimates

$$ \frac{\partial}{\partial\phi} L\left[d_\phi(\vec x, z)\right] = -\frac{\partial}{\partial\phi} \int d\vec x\, dz \left[ p(\vec x, z) \ln\left(d_\phi(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d_\phi(\vec x, z)\right) \right] $$

$$ = -\int d\vec x\, dz \left[ \frac{p(\vec x, z)}{d_\phi(\vec x, z)} - \frac{p(\vec x)p(z)}{1-d_\phi(\vec x, z)} \right] \frac{\partial d_\phi(\vec x, z)}{\partial \phi} $$

Setting the part in square brackets to zero shows that the network is optimized once

$$ d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)} $$

---

# Likelihood-to-evidence ratio

.red.center[Training binary classification networks yields true Bayesian posterior estimates!]

With a bit of math one can show that

$$ r(\vec x, z) \equiv \frac{d(\vec x, z)}{1 - d(\vec x, z)} \simeq \frac{p(\vec x|z)}{p(\vec x)} = \frac{p(z|\vec x)}{p(z)} $$

Once we have trained the network $d_\phi(\vec x, z)$, we can **estimate the posterior**

$$ p(z|\vec x) \simeq r(\vec x, z) p(z) $$

---

# Example network

## An example network implementing $d_\phi(\vec x, z)$

.center.width-70[]

.center[(our $\vec x$ corresponds to the noisy 1-dim data; our $z$ corresponds to $\vartheta$ here)]

---

# Example: Strong lensing image analysis

Here we trained a neural network to recognize subhalos in an image (indicated by a red dot). The contours show the most likely regions inferred by the network.

.center[.width-100[]]

The shown posteriors are effectively marginalized over thousands of source and lens parameters.
Those marginal posteriors can be evaluated in seconds once the network is trained (and training takes about 20 minutes).

.footnote[Preliminary, Coogan+ 2020 in preparation]

---

# Exercise: Neural posterior estimation

The overall goal of the exercise is to perform posterior estimation with classification networks. This is broken down into several tasks.

1. Training of a parameter regression network
  - **Point estimation** of ring radii in an image.
2. Training of a classification network
  - Train a network to predict the **probability** of an image containing different numbers of rings.
3. Training of a posterior estimation network
  - **Posterior estimation** of ring radii in an image (a minimal sketch of the underlying training loop is given on the appendix slide).

.center.red[Enjoy!]
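---

# Appendix: minimal ratio-estimation sketch

This appendix slide is not part of the original exercise material. It is a minimal, hypothetical sketch, assuming PyTorch and a toy one-dimensional simulator in place of the ring images, of how the binary cross-entropy loss and the ratio $r = d/(1-d)$ from the earlier slides translate into a training loop.

```python
import math
import torch
import torch.nn as nn

# Toy simulator (stand-in for the ring images): prior z ~ N(0, 1), data x = z + noise
def simulate(n):
    z = torch.randn(n, 1)            # z ~ p(z)
    x = z + 0.3 * torch.randn(n, 1)  # x ~ p(x|z)
    return x, z

# Binary classifier d_phi(x, z) in [0, 1]
d_phi = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(d_phi.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    x, z = simulate(256)                # joint samples (x, z) ~ p(x, z)        -> H0, label 1
    z_marg = z[torch.randperm(len(z))]  # shuffling z gives (x, z) ~ p(x) p(z)  -> H1, label 0
    inputs = torch.cat([torch.cat([x, z], dim=1), torch.cat([x, z_marg], dim=1)])
    labels = torch.cat([torch.ones(len(x), 1), torch.zeros(len(x), 1)])
    loss = bce(d_phi(inputs), labels)   # the binary cross-entropy from the lecture
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Posterior estimate for one observation x0: p(z|x0) ~ r(x0, z) p(z), with r = d / (1 - d)
x0 = torch.tensor([[0.7]])
z_grid = torch.linspace(-3, 3, 301).unsqueeze(1)
with torch.no_grad():
    d = d_phi(torch.cat([x0.expand_as(z_grid), z_grid], dim=1)).clamp(1e-6, 1 - 1e-6)
r = (d / (1 - d)).squeeze()
prior = torch.exp(-z_grid.squeeze()**2 / 2) / math.sqrt(2 * math.pi)
posterior = r * prior  # unnormalized estimate of p(z | x0)
```

The key step is the shuffling of $z$ within a batch, which turns joint samples ($H_0$) into marginal samples ($H_1$) without running the simulator twice.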