Machine learning for gamma rays

# Precision analysis of gravitational strong lensing images with nested likelihood-free inference

.center[Ongoing work with: Marco Chianese (GRAPPA),
Adam Coogan (GRAPPA), Kosio Karchev (SISSA/GRAPPA),
Ben Miller (AI4Science/GRAPPA)
]

AI4Science Colloquium]

---

# The energy budget of the Universe

## What is Dark Matter made of?

---

# Dark matter candidates

---

class: black-slide, center, bottom
background-image: url(figures/illustris_box_z0124_4panel.png)
background-size: contain

## Evolution of structure in the Universe (bottom to top)

Dark matter overdensities collapse into increasily large halos.

Large enough halos are sites of star and galaxy formation.

## Details depend on the particle nature of dark matter!

---

# Dark matter mass vs substructure

Hot (left), warm (middle), and cold (right) dark matter

at early times (top) and and late times (bottom)

---

# Dark matter mass vs substructure

Amount of substructure depends on the dark matter particle mass.

.red.center[Absence of substructure implies that the Cold Dark Matter paradigm is wrong.]

.footnote[Image credit: Chianese, Marco. 2019. “Differentiable Probabilistic Programming for Strong Gravitational Lensing.” arXiv [astro-ph.CO]. arXiv. https://doi.org/10.22323/1.358.0515]

---

class: black-slide, center
background-image: url(figures/A_Horseshoe_Einstein_Ring_from_Hubble.jpeg)
background-size: cover

# Strong gravitational lensing

## A "horseshoe Einstein Ring from Hubble"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
The gravity of a luminous red galaxy has gravitationally distorted the light from a much more distant blue galaxy.

---

# Strong lensing

.grid[
.kol-2-3[
.center[.width-100[![](figures/lensing_overview.png)]]
]
.kol-1-4[
Lensed quasar
.width-100[![](figures/lensing_multi_quasars.png)]

Lensed galaxy
.width-100[![](figures/lensing_galaxy.png)]
]
]

---

# Weak and strong lensing

---

# The effect of dark matter substructure

---

# Signature of interest is .underline[very small]

## For mock images with different subhalos

.red.center[Images look virtually identical without stats analysis]

---

# Variance between images is .underline[very large]

.center.width-80[![](figures/213372main_67_hubble_lg.jpg)]

---

# Can Machine Learning techniques be used to perform precision analysis of these images?

---

# Likelihood vs generative models

## We model the world from first principles

- Data analyses usually starts with **forward models**, i.e. we an write down
$P(x|z)$ from first principles (cp. with cancer images)
- In many cases, we have some idea about reasonable **priors** $P(z)$. 
- This defines **generative model**: 
$$x, z \sim P(x, z) \equiv P(x|z)P(z)$$

## Not only lensing, also...
.grid[
.kol-1-3[Neutron stars .width-100[![](https://public.nrao.edu/wp-content/uploads/2006/08/imagessquared2006magnetar06_squared.jpg)]]
.kol-1-3[Gravitational waves .width-100[![](https://cdn.mos.cms.futurecdn.net/fNo6JvmhpmL3qL3nnMSVzK-970-80.jpeg)]]
.kol-1-3[Particle colliders .width-100[![](https://assets.newatlas.com/dims4/default/1299ab9/2147483647/strip/true/crop/800x533+0+1/resize/1200x800!/format/webp/quality/90/?url=http%3A%2F%2Fnewatlas-brightspot.s3.amazonaws.com%2Farchive%2Fquantizer-1.jpeg)]]
]

---

# The inverse problem

Given some data $x$, what does this tell me about $z$?

.grid[
.kol-2-5[
**Bayes' theorem**
$$
P(z|x) = \frac{P(x|z) P(z)}{P(x)}
$$
]
.kol-3-5[
- Likelihood: $P(x|z)$
- Prior: $P(z)$
- Posterior: $P(z|x)$
- Evidence: $P(x) \equiv \int dz\, P(x|z) P(z)$
- (**Generative model**: $P(x, z) \equiv P(x|z)P(z)$)
]
]

--
count: false

This gives the **joined posterior** for the full latent space $z\in \mathbb{R}^n$.  Almost always, we want however the **marginal posterior**, say, for $\theta\in \mathbb{R}$, where $z\equiv(\theta, \eta)$ and $\eta \in \mathbb{R}^{n-1}$ are nuisance parameters.

$$
P(\theta|x) = \int d\eta\, P(\theta, \eta|x) 
$$

Alas, this integral is usually **intractable**.

---

# Solution: Monte Carlo Integration etc!

Likelihood ratios $P(x|\theta)/P(x|\theta')$ are central input for most sampling algorithms

.grid[
.kol-1-2[
- Markov Chain Monte Carlo
- Gibbs sampling
- Nested sampling
- Hamiltonian Monte Carlo (also needs gradients)
- ...

## Results
- samples from **joint posterior**, $z \sim P(z|x)$ (can be trivially marginalized)
- maybe value of marginal likelihood $P(x)$ (in the case of nested sampling)
]
.kol-1-2[
Typical example:
.right[.width-90[![](https://www.researchgate.net/profile/Carmelita_Carbone/publication/321719035/figure/fig1/AS:614186346352656@1523444844641/Joint-posterior-distribution-obtained-using-Planck-temperature-and-polarization-data-and.png)]]
]
]

---

# The traditional analysis

.center.width-80[![](figures/traditional_lens.svg)]

## Challenges

- Hyper-parameter optimization requires performing above analysis on a grid, to
optimize Bayesian evidence
- No source uncertainties can be reconstructed
- Very hard to incorporate additional components in the lens model (multiple subhalos), since posteriors become highly multi-modal

---

# First approaches using neural networks

Example: Brehmer+ 2019.  The authors train classification networks to estimate posteriors for subhalo parameters.

.center.width-80[![](figures/Brehmer_posteriors.png)]

Very promising in context of this toy scenario

.red.center[How can this method be successfully applied to real complex images?]

.footnote[From Brehmer, Johann, Siddharth Mishra-Sharma, Joeri Hermans, Gilles Louppe, and Kyle Cranmer. 2019. “Mining for Dark Matter Substructure: Inferring Subhalo Population Properties from Strong Lenses with Machine Learning.” arXiv [astro-ph.CO]. arXiv. http://arxiv.org/abs/1909.02005]

---

# Here: targeted inference networks

.center[Instead of training a single all-purpose inference network, we build an analysis pipeline where .red[inference networks are trained individually for each image]]

--
count: false

$$\Rightarrow$$

.large.center.red[Trading analysis time for precision]

--
count: false

## Challenges

- How to generate training data for targeted training?
- How to make the pipeline sufficiently fast to be applicable to O(100) or more images?

---

# Our strategy

## Step 1: Generate image-specific training data with variational methods

- Perform high-dimensional fits
- Component separation
- Hyper-parameter optimization
- Can be used to generate training data

## Step 2: Train inference network to anlayse dark matter structures
- Posterior estimation with targeted neural ratio estimation

---

# Step 1: Variational inference

## The goal: Interverting the lensing effect and recovering the original source image

---

# The math of strong lensing

- We have to reconstruct the original source galaxy, $I_{src}(\Omega)$ (right)
  - Sources are complex, tenthousand of parameters.
- We have to reconstruct the mass distribution in the lens plane, $\Sigma_{lens}(\Omega)$ (middle)
  - We keep the mass distribution simple (analytic model with about 10 parameters).

---

# Modeling the source as Gaussian process

- Hyper-parameter optimization through maximization of marginal likelihood
- Kernel is simple in source plane
- But: Lensing causes complicated kernal structure in image plane

$\Rightarrow$ **existing approaches/libraries were insufficient for our use-case**

---

# Factorized Gaussian process

## Covariance of image pixels from indicator functions and GP kernel
$$
K_{ij}  = k(\mathbf{\vec{p}}_i - \mathbf{\vec{p}}_j)
= \int \text{d}\vec{x}_1 \text{d}\vec{x}_2
\ f_i(\mathbf{\vec{p}}_i - \vec{x}_1)
\ k_0(\vec{x}_1 - \vec{x}_2)
\ f_j(\mathbf{\vec{p}}_j - \vec{x}_2)
$$

.grid[
.kol-2-5[
Approximate factorization
$$
K \approx T T^T
$$
allows to write
$$
\mu\_i = \sum\_j T_{ij} y_j
$$
$$
y_j \sim \mathcal{N}(0, 1)
$$
]
.kol-3-5[
.width-80[![](figures/GP_source.png)]
]
]

---

# Factorization of covariance matrix

The intergral
$$
K_{ij}  = k(\mathbf{\vec{p}}_i - \mathbf{\vec{p}}_j)
= \int \text{d}\vec{x}_1 \text{d}\vec{x}_2
\ f_i(\mathbf{\vec{p}}_i - \vec{x}_1)
\ k_0(\vec{x}_1 - \vec{x}_2)
\ f_j(\mathbf{\vec{p}}_j - \vec{x}_2)
$$
becomes easy to evaluate if we approximate pixel indicator functions with multivariate Gaussians.

- .red[Integral can be expressed in terms of multivariate Gaussian].
- By replacing $k_0$ with $\sqrt{k_0}\sqrt{k_0}$, we can factorize $K$.

---

# Exact vs approx. correlation coefficients

.center.width-60[![](figures/windowing_c_idist.png)]

---

# Reverse KL divergence

Goal: Find $q_\phi(z|x_0)$ for some observation $x_0$ by minimize the KL divergence at $x=x_0$

One can show that minimizing KL divergence is equivalent to maximizing **Evidence Lower Bound (ELBO)**

$$
\text{ELBO} \equiv \underbrace{\mathbb{E}\_{z \sim q\_\phi(z|x\_0)} \left[\ln p\_\theta(x\_0,z)\right]}\_{\rm "Negative\; energy"} - \underbrace{\mathbb{E}\_{z \sim q\_\phi(z|x\_0)} \left[\ln q\_\phi(z|x\_0)\right]}\_{\rm Entropy}
$$

ELBO maximization w.r.t. model parameters $\theta$ and variational parameters $\phi$ leads to
- Variational posterior $q\_\phi(z|x_0)$ for data $x_0$
- Large model evidence $p\_\theta(x_0)$

.footnote[See e.g. Hoffman et al. 2016, “ELBO Surgery: Yet Another Way to Carve up the Variational Evidence Lower Bound.” http://approximateinference.org/accepted/HoffmanJohnson2016.pdf]

---

# Exact vs approx. GP

.grid[
.kol-1-2[
## Properties of our approximate GP
- No log-determinant evaluation (happens on-the-fly as part of ELBO maximization)
- Can model complex kernels
- Reasonable agreement with exact Gaussian process regression
]
.kol-1-2[
.center.width-100[![](figures/gp-nonoise.png)]
]
]

---

# Image reconstruction works well

.center.width-100[![](figures/lens_toy_img-rec.png)]

- We quickly can find lensing parameters and source parameters that reproduce the image
- Hyper-parameter fits optimize variations in the source

---

# Source reconstruction

.center[
.width-35[![](figures/lens_toy_source_true.png)] .width-55[![](figures/lens_toy_gp_fit_err.png)]
]

- Besides good mean values, we obtain uncertainties

---

# Source layers differ in smoothness

## Mean values
.center.width-80[![](figures/lens_toy_gp_layers_mean.png)]

## Uncertainties
.center.width-80[![](figures/lens_toy_gp_layers_err.png)]

---

# Full posteriors (with Sersic source)

---

# Towards targeted inference networks

.center.red[Training data can now be generated by sampling from the posterior predictive distribution]

.center.width-50[![](figures/lensing_training.gif)]

---

# Step 2) Marginal posterior estimation with inference networks

.red.center[Our goal marginal posteriors about properties of substructure in the images, like:]

.large.center[How to train inference network for $p(z|\vec x)$?]

---

# Neural likelihood-free inference

.center.red[Our goal is to train an augmented CNN to estimate marginal posteriors for parameters of interest.]

.bold[Starting point]: for any pair of observation $x$ and model parameter $\theta$, the goal is to estimate the probability that this pair belongs one of the following classes:
.grid[
.kol-1-2[
$H_0$: Data $\vec x$ corresponds to model parameters $\vec z$

$H_1$: Data $\vec x$ and model $\vec z$ are unrelated
]
.kol-1-2[
$(\vec x, \vec z) \sim P(\vec x, \vec z) = P(\vec x|\vec z)P(\vec z)$

$(\vec x, \vec z) \sim P(\vec x)P(\vec z)$
]
]

---

# Joint vs marginal samples

1) Examples for $H_0$, jointly sampled from $\vec x, z \sim P(\vec x|z) P(z)$

.grid[
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-70[![](figures/donkey.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat2.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat3.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-100[![](figures/donkey2.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-80[![](figures/donkey3.jpeg)]
]
]

2) Examples for $H_1$, marginally sampled from $\vec x, z \sim P(\vec x) P(z)$

.grid[
.kol-1-6[
.center[Donkey]
.width-100[![](figures/donkey3.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/cat3.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-100[![](figures/donkey2.jpeg)]
]
.kol-1-6[
.center[Cat]
.width-70[![](figures/donkey.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-100[![](figures/cat2.jpeg)]
]
.kol-1-6[
.center[Donkey]
.width-80[![](figures/cat.jpeg)]
]
]

---

# Loss function

.bold[Strategy:] We train a neural network $d_\phi(\vec x, z) \in [0, 1]$ as binary classifier to estimate the probability of hypothesis $H_0$ or $H_1$.  The network output can be interpreted, for a given input pair $\vec x$ and $z$, as probability that $H_0$ is true.
- H0 is true: $d_\phi(\vec x, z) \simeq 1$
- H1 is true: $d_\phi(\vec x, z) \simeq 0$

The corresponding loss function is (so-called "binary cross-entroy")

$$
L\left[d(\vec x, z)\right] = -\int dx dz  \left[ p(\vec x, z) \ln\left(d(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d(\vec x, z)\right) \right]
$$

--
count: false

Minimizing that function (see next slide) w.r.t. the network parameters $\phi$ yields
$$
d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)}
$$

---

# Properties of optimized network

## Some analytical estimates

$$
\frac{\partial}{\partial\phi}L \left[d_\phi(\vec x, z)\right] = \int dx dz  \left[ p(\vec x, z) \ln\left(d(\vec x, z)\right) + p(\vec x)p(z) \ln\left(1-d(\vec x, z) \right) \right]
$$
$$
= \int dx dz  \left[ \frac{p(\vec x, z)}{d(\vec x, z)} + \frac{p(\vec x)p(z)}{1-d(\vec x, z) } \right] \frac{\partial d(\vec x, z)}{\partial \phi}
$$

Setting the part in square brackets to zero yields that the network is optimized once
$$
d(\vec x, z) \approx \frac{p(\vec x, z)}{p(\vec x, z) + p(\vec x)p(z)}
$$

---

# Likelihood-to-evidence ratio

.red.center[Training binary classification networks yield true Bayesian posterior estimates!]

With a bit of math one can show that
$$
r(\vec x, z) \equiv \frac{1}{d(\vec x, z)}-1 \simeq
\frac{p(\vec x|z)}{p(\vec x)} 
= \frac{p(z|\vec x)}{p(z)} 
$$

Once we have trained the network $d_\phi(\vec x, z)$, we can **estimate the posterior**
$$
p(z|\vec x) \simeq  r(\vec x, z) p(z)
$$

---

# Approach: Likelihood-to-evidence ratio

Assume that the goal is to distriminate parameters $z$ matching an observation $x$ from parameters that do not match an observation. We want to distriminate
- $H_0$: $x, z \sim  p(x, z)$
- $H_1$: $x, z \sim  p(x)p(z)$

We train a binary classifier to discriminate between $(x, z)$ pairs drawn from the joint distribution $p(x, z)$ and the marginal distributions $p(x)p(z)$.  This loss function is given by
$$\ell = \int dx dz \left(
p(x, z)\ln d(x, z)
+p(x)p(z)\ln [1-d(x, z)]
\right)$$

One can then show that the trained neural network $d(x, z) \in [0, 1]$ satisfies

$$r(x,z)\equiv\frac{d(x, z)}{1-d(x, z)} = \frac{p(x|z)}{p(x)}$$

which is the likelihood-to-evidence ratio.  Training only requires simulation/samples, and side-steps any likelihood marginalization.

---

# The inference network

.center.red[We use a very simple network structure (low capacity, fast to train).]

## Three conv layers
.center[$\vec x = \text{conv2d}(\vec x)$]
.center[$\vec x = \text{conv2d}(\vec x)$]

## Concatenating features from CNN, $\vec f$ with model parameters $z$
.center[$\vec f' = \text{concat}(\vec f, z)$]

## Three dense layers
.center[$r(x, z) = \text{MLP}(\vec f')$]

---

We used variational inference fit from above to generate training data for a neural network that can recognize individual subhalos.

---

# Strong lensing image analysis

.center.width-80[![](figures/lens_posterior.png)]

The shown posteriors are effectively marginalized over thousands of source and
lens parameters.  Those marginal posteriors can be evaluated in seconds once
the network is trained (and training takes maybe 20 min).

---

.center.large[It works: Similar sensitivity to full likelihood analysis, but much much more flexible.]

---

# Next steps

Fully atomatize analysis pipeline

--
count: false

Inference of large number of subhalos

--
count: false

Inference of cutoff of small-scale structure

--
count: false

Application to real images (not just mock images)

--
count: false

Extension to radio data?

---

# Spinoff I: Nested posterior estimation

The idea of targeted inference networks can be generalized.

---

# Spinoff II: Analysis of $\gamma$-ray data

---

# Conclusions

- The analysis of .red[gravitationally strongly lensed galaxy images] enables to study the substructure of dark matter.
- Extracting information from these images requires .red[sub-percent accuracy] of the analysis.
- We developed a .underline.red[two-step analysis pipeline].
- Step 1) Source and main lens reconstruction with .red[variational inference], including uncertainties. This enables the generation of .red[targetd training data].
- Step 2) Training of an .red[targeted inference network] to perform substructure analysis.
- The approach enables to study the combined effect of thousands of subhalos, which is .red[intractable with traditional techniques].
- Lots of .red[spin-off work] (dark matter searches with gamma-ray data, gravitational wave parameter inference, ...)

--
count: false

.large.center[Thank you!]