class: left, middle # Neural networks and likelihood-free inference.red.bold[*] .center.width-60[![](figures/lec2/title.png)] .right[ Christoph Weniger, GRAPPA University of Amsterdam 10 March 2020 ] .left.footnote[.red.bold[*] Based on the beautiful slide decks of [Gilles Louppe](https://glouppe.github.io/teaching.html).] --- class: middle # Overview .bold[ - Neural Networks - Gradient Descent and Automatic differentiation - [Ex 1: Logistic regression](#ex1) - Deep Neural Networks - [Ex 1b: Simple multilayer perceptron](#ex1b) - [Ex 2: General function approximation](#ex2) - Convolutional Neural Networks - [Ex 3: Convnet parameter regression](#ex3) - Inside convolutional neural networks - Neural likelihood-free inference - [Ex 4: Posterior estimation with Convnets](#ex4) ] --- class: middle # Neural Networks --- # Perceptron The perceptron model (Rosenblatt, 1957) $$f(\mathbf{x}) = \begin{cases} 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\\\ 0 &\text{otherwise} \end{cases}$$ was originally motivated by biology, with $w_i$ being synaptic weights and $x_i$ and $f$ firing rates. --- exclude: true # Threshold Logic Unit .grid[ .kol-1-2[ For boolean inputs, any Boolean function can be implemented: - $\text{or}(a,b) = 1\_{\\\{a+b - 0.5 \geq 0\\\}}$ - $\text{and}(a,b) = 1\_{\\\{a+b - 1.5 \geq 0\\\}}$ - $\text{not}(a) = 1\_{\\\{-a + 0.5 \geq 0\\\}}$ ] .kol-1-2[ .center.width-60[![](figures/lec2/tlu.png)] ] ] .footnote[Credits: McCulloch and Pitts, [A logical calculus of ideas immanent in nervous activity](http://www.cse.chalmers.se/~coquand/AUTOMATA/mcp.pdf), 1943.] --- class: middle .center.width-100[![](figures/lec2/perceptron.jpg)] .footnote[Credits: Frank Rosenblatt, [Mark I Perceptron operators' manual](https://apps.dtic.mil/dtic/tr/fulltext/u2/236965.pdf), 1960.] ??? A perceptron is a signal transmission network consisting of sensory units (S units), association units (A units), and output or response units (R units). The ‘retina’ of the perceptron is an array of sensory elements (photocells). An S-unit produces a binary output depending on whether or not it is excited. A randomly selected set of retinal cells is connected to the next level of the network, the A units. As originally proposed there were extensive connections among the A units, the R units, and feedback between the R units and the A units. In essence an association unit is also an MCP neuron which is 1 if a single specific pattern of inputs is received, and it is 0 for all other possible patterns of inputs. Each association unit will have a certain number of inputs which are selected from all the inputs to the perceptron. So the number of inputs to a particular association unit does not have to be the same as the total number of inputs to the perceptron, but clearly the number of inputs to an association unit must be less than or equal to the total number of inputs to the perceptron. Each association unit's output then becomes the input to a single MCP neuron, and the output from this single MCP neuron is the output of the perceptron. So a perceptron consists of a "layer" of MCP neurons, and all of these neurons send their output to a single MCP neuron. --- class: middle, center, black-slide .grid[ .kol-1-2[.width-100[![](figures/lec2/perceptron2.jpg)]] .kol-1-2[
.width-100[![](figures/lec2/perceptron3.jpg)]]
]

The Mark I Perceptron (Frank Rosenblatt).

---

class: middle, center, black-slide
The Perceptron --- class: middle Let us define the (non-linear) **activation** function: $$\text{sign}(x) = \begin{cases} 1 &\text{if } x \geq 0 \\\\ 0 &\text{otherwise} \end{cases}$$ .center[![](figures/lec2/activation-sign.png)] The perceptron classification rule can be rewritten as $$f(\mathbf{x}) = \text{sign}(\sum\_i w\_i x\_i + b).$$ --- class: middle ## Computational graphs .grid[ .kol-3-5[.width-90[![](figures/lec2/graphs/perceptron.svg)]] .kol-2-5[ The computation of $$f(\mathbf{x}) = \text{sign}(\sum\_i w\_i x\_i + b)$$ can be represented as a **computational graph** where - white nodes correspond to inputs and outputs; - red nodes correspond to model parameters; - blue nodes correspond to intermediate operations. ] ] ??? Draw the NN diagram. --- class: middle In terms of **tensor operations**, $f$ can be rewritten as $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b),$$ for which the corresponding computational graph of $f$ is: .center.width-70[![](figures/lec2/graphs/perceptron-neuron.svg)] --- exclude: true # Linear discriminant analysis Consider training data $(\mathbf{x}, y) \sim P(X,Y)$, with - $\mathbf{x} \in \mathbb{R}^p$, - $y \in \\\{0,1\\\}$. Assume class populations are Gaussian, with same covariance matrix $\Sigma$ (homoscedasticity): $$P(\mathbf{x}|y) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp \left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}_y)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_y) \right)$$ --- exclude: true
Using the Bayes' rule, we have: $$ \begin{aligned} P(Y=1|\mathbf{x}) &= \frac{P(\mathbf{x}|Y=1) P(Y=1)}{P(\mathbf{x})} \\\\ &= \frac{P(\mathbf{x}|Y=1) P(Y=1)}{P(\mathbf{x}|Y=0)P(Y=0) + P(\mathbf{x}|Y=1)P(Y=1)} \\\\ &= \frac{1}{1 + \frac{P(\mathbf{x}|Y=0)P(Y=0)}{P(\mathbf{x}|Y=1)P(Y=1)}}. \end{aligned} $$ -- exclude: true count: false It follows that with $$\sigma(x) = \frac{1}{1 + \exp(-x)},$$ we get $$P(Y=1|\mathbf{x}) = \sigma\left(\log \frac{P(\mathbf{x}|Y=1)}{P(\mathbf{x}|Y=0)} + \log \frac{P(Y=1)}{P(Y=0)}\right).$$ --- exclude: true class: middle Therefore, $$\begin{aligned} &P(Y=1|\mathbf{x}) \\\\ &= \sigma\left(\log \frac{P(\mathbf{x}|Y=1)}{P(\mathbf{x}|Y=0)} + \underbrace{\log \frac{P(Y=1)}{P(Y=0)}}\_{a}\right) \\\\ &= \sigma\left(\log P(\mathbf{x}|Y=1) - \log P(\mathbf{x}|Y=0) + a\right) \\\\ &= \sigma\left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}\_1)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}\_1) + \frac{1}{2}(\mathbf{x} - \mathbf{\mu}\_0)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}\_0) + a\right) \\\\ &= \sigma\left(\underbrace{(\mu\_1-\mu\_0)^T \Sigma^{-1}}\_{\mathbf{w}^T}\mathbf{x} + \underbrace{\frac{1}{2}(\mu\_0^T \Sigma^{-1} \mu\_0 - \mu\_1^T \Sigma^{-1} \mu\_1) + a}\_{b} \right) \\\\ &= \sigma\left(\mathbf{w}^T \mathbf{x} + b\right) \end{aligned}$$ --- exclude: true class: middle, center .width-100[![](figures/lec2/lda1.png)] --- exclude: true count: false class: middle, center .width-100[![](figures/lec2/lda2.png)] --- # The sigmoid function .footnote[This is also motivated by linear discriminat analysis.] .kol-1-2[ Extending binary logic with Bayesian probabilities motivates the **sigmoid** function, $$\sigma(x) = \frac{1}{1 + \exp(-x)}$$ which looks like a soft heavyside. Therefore, an overall model $f(\mathbf{x};\mathbf{w},b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$ is very similar to the perceptron. ] .kol-1-2[ .center[![](figures/lec2/activation-sigmoid.png)] ] --- class: middle, center .center.width-70[![](figures/lec2/graphs/logistic-neuron.svg)] This unit is the main **primitive** of all neural networks! --- # Example: Logistic regression .grid[ .kol-1-2[ Consider the model $$P(Y=1|\mathbf{x}) = \sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$$. - colored classes correspond to $Y=1$ and $Y=0$ - no model assumptions on class population (Gaussian class populations, homoscedasticity); - goal: instead, find $\mathbf{w}, b$ that maximizes the likelihood of the data. ] .kol-1-2[ .width-100[![](figures/lec2/lda3.png)] ]] --- name: loss1 class: middle We have, $$ \begin{aligned} &\arg \max\_{\mathbf{w},b} P(\mathbf{d}|\mathbf{w},b) \\\\ &= \arg \max\_{\mathbf{w},b} \prod\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} P(Y=y\_i|\mathbf{x}\_i, \mathbf{w},b) \\\\ &= \arg \max\_{\mathbf{w},b} \prod\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \sigma(\mathbf{w}^T \mathbf{x}\_i + b)^{y\_i} (1-\sigma(\mathbf{w}^T \mathbf{x}\_i + b))^{1-y\_i} \\\\ &= \arg \min\_{\mathbf{w},b} \underbrace{\sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} -{y\_i} \log\sigma(\mathbf{w}^T \mathbf{x}\_i + b) - {(1-y\_i)} \log (1-\sigma(\mathbf{w}^T \mathbf{x}\_i + b))}\_{\mathcal{L}(\mathbf{w}, b) = \sum\_i \ell(y\_i, \hat{y}(\mathbf{x}\_i; \mathbf{w}, b))} \end{aligned} $$ ??? This loss is an instance of the **cross-entropy** $$H(p,q) = \mathbb{E}_p[-\log q]$$ for $p=Y|\mathbf{x}\_i$ and $q=\hat{Y}|\mathbf{x}\_i$. 
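---

class: middle

In code, the negative log-likelihood derived above takes only a few lines. A minimal sketch (PyTorch, with made-up toy data), checking that the sum above coincides with the built-in binary cross-entropy:

```python
import torch
import torch.nn.functional as F

# Toy data: x has shape (N, p), y has shape (N,) with labels in {0, 1}.
x = torch.randn(8, 2)
y = torch.randint(0, 2, (8,)).float()
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

y_hat = torch.sigmoid(x @ w + b)   # sigma(w^T x_i + b) for all i
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()

# Same quantity via the library's cross-entropy loss:
print(torch.allclose(loss, F.binary_cross_entropy(y_hat, y, reduction='sum')))  # True
```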
--- exclude: true class: middle When $Y$ takes values in $\\{-1,1\\}$, a similar derivation yields the **logistic loss** $$\mathcal{L}(\mathbf{w}, b) = -\sum_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \log \sigma\left(y\_i (\mathbf{w}^T \mathbf{x}\_i + b))\right).$$ .center[![](figures/lec2/logistic_loss.png)] --- class: middle - In general, the cross-entropy and the logistic losses do not admit a minimizer that can be expressed analytically in closed form. - However, a minimizer can be found numerically, using a general minimization technique such as **gradient descent**. --- class: middle # Gradient descent --- # Gradient descent Let $\mathcal{L}(\theta)$ denote a loss function defined over model parameters $\theta$ (e.g., $\mathbf{w}$ and $b$). To minimize $\mathcal{L}(\theta)$, **gradient descent** uses local linear information to iteratively move towards a (local) minimum. For $\theta\_0 \in \mathbb{R}^d$, a first-order approximation around $\theta\_0$ can be defined as $$\hat{\mathcal{L}}(\epsilon; \theta\_0) = \mathcal{L}(\theta\_0) + \epsilon^T\nabla\_\theta \mathcal{L}(\theta\_0) + \frac{1}{2\gamma}||\epsilon||^2.$$ .center.width-60[![](figures/lec2/gd-good-0.png)] --- class: middle A minimizer of the approximation $\hat{\mathcal{L}}(\epsilon; \theta\_0)$ is given for $$\begin{aligned} \nabla\_\epsilon \hat{\mathcal{L}}(\epsilon; \theta\_0) &= 0 \\\\ &= \nabla\_\theta \mathcal{L}(\theta\_0) + \frac{1}{\gamma} \epsilon, \end{aligned}$$ which results in the best improvement for the step $\epsilon = -\gamma \nabla\_\theta \mathcal{L}(\theta\_0)$. Therefore, model parameters can be updated iteratively using the update rule $$\theta\_{t+1} = \theta\_t -\gamma \nabla\_\theta \mathcal{L}(\theta\_t),$$ where - $\theta_0$ are the initial parameters of the model; - $\gamma$ is the **learning rate**; - both are critical for the convergence of the update rule. 
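---

class: middle

A minimal sketch of this update rule on a toy one-dimensional loss (illustrative values only). The first call converges to the minimizer; the second diverges because the learning rate is too large, as the examples on the next slides illustrate:

```python
# Plain-Python sketch of theta_{t+1} = theta_t - gamma * dL/dtheta
# on the toy loss L(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3).
def gradient_descent(theta0, gamma, steps=50):
    theta = theta0
    for _ in range(steps):
        grad = 2 * (theta - 3)
        theta = theta - gamma * grad
    return theta

print(gradient_descent(theta0=0.0, gamma=0.1))   # close to the minimizer 3
print(gradient_descent(theta0=0.0, gamma=1.1))   # diverges: learning rate too large
```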
--- class: center, middle ![](figures/lec2/gd-good-0.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-1.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-2.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-3.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-4.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-5.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-6.png) Example 1: Convergence to a local minima --- count: false class: center, middle ![](figures/lec2/gd-good-7.png) Example 1: Convergence to a local minima --- class: center, middle ![](figures/lec2/gd-good-right-0.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-1.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-2.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-3.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-4.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-5.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-6.png) Example 2: Convergence to the global minima --- count: false class: center, middle ![](figures/lec2/gd-good-right-7.png) Example 2: Convergence to the global minima --- class: center, middle ![](figures/lec2/gd-bad-0.png) Example 3: Divergence due to a too large learning rate --- count: false class: center, middle ![](figures/lec2/gd-bad-1.png) Example 3: Divergence due to a too large learning rate --- count: false class: center, middle ![](figures/lec2/gd-bad-2.png) Example 3: Divergence due to a too large learning rate --- count: false class: center, middle ![](figures/lec2/gd-bad-3.png) Example 3: Divergence due to a too large learning rate --- count: false class: center, middle ![](figures/lec2/gd-bad-4.png) Example 3: Divergence due to a too large learning rate --- count: false class: center, middle ![](figures/lec2/gd-bad-5.png) Example 3: Divergence due to a too large learning rate --- # Stochastic gradient descent In the empirical risk minimization setup, $\mathcal{L}(\theta)$ and its gradient decompose as $$\begin{aligned} \mathcal{L}(\theta) &= \frac{1}{N} \sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \ell(y\_i, f(\mathbf{x}\_i; \theta)) \\\\ \nabla \mathcal{L}(\theta) &= \frac{1}{N} \sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \nabla \ell(y\_i, f(\mathbf{x}\_i; \theta)). \end{aligned}$$ Therefore, in **batch** gradient descent the complexity of an update grows linearly with the size $N$ of the dataset. This is bad! --- class: middle Since the empirical risk is already an approximation of the expected risk, it should not be necessary to carry out the minimization with great accuracy. ---
Instead, **stochastic** gradient descent uses as update rule: $$\theta\_{t+1} = \theta\_t - \gamma \nabla \ell(y\_{i(t+1)}, f(\mathbf{x}\_{i(t+1)}; \theta\_t))$$ - Iteration complexity is independent of $N$. - The stochastic process $\\\{ \theta\_t | t=1, ... \\\}$ depends on the examples $i(t)$ picked randomly at each iteration. -- .grid.center.italic[ .kol-1-2[.width-100[![](figures/lec2/bgd.png)] Batch gradient descent] .kol-1-2[.width-100[![](figures/lec2/sgd.png)] Stochastic gradient descent ] ] --- exclude: true class: middle Why is stochastic gradient descent still a good idea? - Informally, averaging the update $$\theta\_{t+1} = \theta\_t - \gamma \nabla \ell(y\_{i(t+1)}, f(\mathbf{x}\_{i(t+1)}; \theta\_t)) $$ over all choices $i(t+1)$ restores batch gradient descent. - Formally, if the gradient estimate is **unbiased**, e.g., if $$\begin{aligned} \mathbb{E}\_{i(t+1)}[\nabla \ell(y\_{i(t+1)}, f(\mathbf{x}\_{i(t+1)}; \theta\_t))] &= \frac{1}{N} \sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \nabla \ell(y\_i, f(\mathbf{x}\_i; \theta\_t)) \\\\ &= \nabla \mathcal{L}(\theta\_t) \end{aligned}$$ then the formal convergence of SGD can be proved, under appropriate assumptions (see references). - If training is limited to single pass over the data, then SGD directly minimizes the **expected** risk. --- exclude: true class: middle The excess error characterizes the expected risk discrepancy between the Bayes model and the approximate empirical risk minimizer. It can be decomposed as $$\begin{aligned} &\mathbb{E}\left[ R(\tilde{f}\_\*^\mathbf{d}) - R(f\_B) \right] \\\\ &= \mathbb{E}\left[ R(f\_\*) - R(f\_B) \right] + \mathbb{E}\left[ R(f\_\*^\mathbf{d}) - R(f\_\*) \right] + \mathbb{E}\left[ R(\tilde{f}\_\*^\mathbf{d}) - R(f\_\*^\mathbf{d}) \right] \\\\ &= \mathcal{E}\_\text{app} + \mathcal{E}\_\text{est} + \mathcal{E}\_\text{opt} \end{aligned}$$ where - $\mathcal{E}\_\text{app}$ is the approximation error due to the choice of an hypothesis space, - $\mathcal{E}\_\text{est}$ is the estimation error due to the empirical risk minimization principle, - $\mathcal{E}\_\text{opt}$ is the optimization error due to the approximate optimization algorithm. --- exclude: true class: middle A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk. --- # Automatic differentiation To minimize $\mathcal{L}(\theta)$ with stochastic gradient descent, we need the gradient $\nabla_\theta \mathcal{\ell}(\theta_t)$. Therefore, we require the evaluation of the (total) derivatives $$\frac{\text{d} \ell}{\text{d} \mathbf{W}\_k} \,\text{and}\, \frac{\text{d} \mathcal{\ell}}{\text{d} \mathbf{b}\_k}$$ of the loss $\ell$ with respect to all model parameters $\mathbf{W}\_k$, $\mathbf{b}\_k$, for $k=1, ..., L$. These derivatives can be evaluated automatically from the *computational graph* of $\ell$ using **automatic differentiation**. --- class: middle ## Chain rule .center.width-60[![](figures/lec2/graphs/ad-example.svg)] Let us consider a 1-dimensional output composition $f \circ g$, such that $$\begin{aligned} y &= f(\mathbf{u}) \\\\ \mathbf{u} &= g(x) = (g\_1(x), ..., g\_m(x)). 
\end{aligned}$$ --- class: middle The **chain rule** states that $(f \circ g)' = (f' \circ g) g'.$ For the total derivative, the chain rule generalizes to $$ \begin{aligned} \frac{\text{d} y}{\text{d} x} &= \sum\_{k=1}^m \frac{\partial y}{\partial u\_k} \underbrace{\frac{\text{d} u\_k}{\text{d} x}}\_{\text{recursive case}} \end{aligned}$$ --- class: middle ## Reverse automatic differentiation - Since a neural network is a **composition of differentiable functions**, the total derivatives of the loss can be evaluated backward, by applying the chain rule recursively over its computational graph. - The implementation of this procedure is called reverse *automatic differentiation*. --- name: ex1 # .red[Exercise 1: Logistic regression] .center[[SOLUTION](https://colab.research.google.com/drive/1WjNYyYnhurEZWhUvposJfgASZBFvFnJ1)] Your task is to train a logistic regression model to predict the gender based on height and weight information only. The three-column training data can be downloaded [here](https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/02-Exploration/data/01_heights_weights_genders.csv). .grid[ .kol-1-2[ .center.width-80[![](figures/lec2/hwg.png)] ] .kol-1-2[ ```csv "Gender","Height","Weight" "Male",73.847017017515,241.893563180437 "Male",68.7819040458903,162.310472521300 ... "Female",60.6748561538626,128.615808477141 "Female",65.3866908281118,154.239363331212 ... ``` ] ] The model should predict for abritrary $(H, W)$ input pairs the probability of Female gender. --- class: middle ## Some tips You can solve this exercise however you like, it will simplify your life later in this lecture if you use `pytorch`. Recommended to define logistic regression model as `torch` Module. ```python # Logistic model class Net(nn.Module): def __init__(self): super(Net, self).__init__() self.fc1 = nn.Linear(2, 1) def forward(self, x): return torch.sigmoid(self.fc1(x)) ``` - Minimize the loss function $\mathcal{l}$ that you can find [here](#loss1). - Starting template in Google colab can be found [here](https://colab.research.google.com/drive/1WN26BBI07_irmG243Ttl9bzVr7ssgv3u) - Use `torch.optim.SGD` for parameter optimization - Confirm that normalizing data (subtracting mean, dividing by standard deviation) accelerates learning Check out [DEEP LEARNING WITH PYTORCH: A 60 MINUTE BLITZ](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) for details. .footnote[Based on Conway and Myles, Machine Learning for Hackers book, Chapter 2] --- class: middle # Deep neural networks --- # Multi-layer perceptron So far we considered the logistic unit $h=\sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$, where $h \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{w} \in \mathbb{R}^p$ and $b \in \mathbb{R}$. These units can be composed *in parallel* to form a **layer** with $q$ outputs: $$\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})$$ where $\mathbf{h} \in \mathbb{R}^q$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{W} \in \mathbb{R}^{p\times q}$, $b \in \mathbb{R}^q$ and where $\sigma(\cdot)$ is upgraded to the element-wise sigmoid function.
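In PyTorch, one such layer is an `nn.Linear` followed by an element-wise sigmoid; a minimal sketch (the dimensions $p=3$, $q=4$ are arbitrary):

```python
import torch
import torch.nn as nn

p, q = 3, 4
layer = nn.Linear(p, q)        # weight has shape (q, p), i.e. W^T in the notation above; bias has shape (q,)
x = torch.randn(p)
h = torch.sigmoid(layer(x))    # h = sigma(W^T x + b)
print(h.shape)                 # torch.Size([4])
```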
.center.width-70[![](figures/lec2/graphs/layer.svg)] ??? Draw the NN diagram. --- class: middle Similarly, layers can be composed *in series*, such that: $$\begin{aligned} \mathbf{h}\_0 &= \mathbf{x} \\\\ \mathbf{h}\_1 &= \sigma(\mathbf{W}\_1^T \mathbf{h}\_0 + \mathbf{b}\_1) \\\\ ... \\\\ \mathbf{h}\_L &= \sigma(\mathbf{W}\_L^T \mathbf{h}\_{L-1} + \mathbf{b}\_L) \\\\ f(\mathbf{x}; \theta) = \hat{y} &= \mathbf{h}\_L \end{aligned}$$ where $\theta$ denotes the model parameters $\\{ \mathbf{W}\_k, \mathbf{b}\_k, ... | k=1, ..., L\\}$. This model is the **multi-layer perceptron**, also known as the fully connected feedforward network. ??? Draw the NN diagram. --- class: middle, center .width-100[![](figures/lec2/graphs/mlp.svg)] --- class: middle .width-100[![](figures/lec2/mlp.png)] .footnote[Credits: [PyTorch Deep Learning Minicourse](https://atcold.github.io/pytorch-Deep-Learning-Minicourse/), Alfredo Canziani, 2020.] --- # Classification - For .red[binary classification], the width $q$ of the last layer $L$ is set to $1$, which results in a single output $h\_L \in [0,1]$ that models the probability $P(Y=1|\mathbf{x})$. - For .red[multi-class classification], the sigmoid action $\sigma$ in the last layer can be generalized to produce a vector $\mathbf{h}\_L \in \bigtriangleup^C$ of probability estimates $P(Y=i|\mathbf{x})$.
This activation is the $\text{Softmax}$ function, where its $i$-th output is defined as $$\text{Softmax}(\mathbf{z})\_i = \frac{\exp(z\_i)}{\sum\_{j=1}^C \exp(z\_j)},$$ for $i=1, ..., C$. --- # Regression For .red[regression problems], one usually starts with the assumption that $$P(y|\mathbf{x}) = \mathcal{N}(y; \mu=f(\mathbf{x}; \theta), \sigma^2=1),$$ where $f$ is parameterized with a neural network which last layer does not contain any final activation. --- class: middle We have, $$\begin{aligned} &\arg \max\_{\theta} P(\mathbf{d}|\theta) \\\\ &= \arg \max\_{\theta} \prod\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} P(Y=y\_i|\mathbf{x}\_i, \theta) \\\\ &= \arg \min\_{\theta} -\sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \log P(Y=y\_i|\mathbf{x}\_i, \theta) \\\\ &= \arg \min\_{\theta} -\sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \log\left( \frac{1}{\sqrt{2\pi}} \exp\(-\frac{1}{2}(y\_i - f(\mathbf{x};\theta))^2\) \right)\\\\ &= \arg \min\_{\theta} \sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} (y\_i - f(\mathbf{x};\theta))^2, \end{aligned}$$ which recovers the common **squared error** loss $\ell(y, \hat{y}) = (y-\hat{y})^2$. --- exclude: true class: middle Let us consider a simplified 2-layer MLP and the following loss function: $$\begin{aligned} f(\mathbf{x}; \mathbf{W}\_1, \mathbf{W}\_2) &= \sigma\left( \mathbf{W}\_2^T \sigma\left( \mathbf{W}\_1^T \mathbf{x} \right)\right) \\\\ \mathcal{\ell}(y, \hat{y}; \mathbf{W}\_1, \mathbf{W}\_2) &= \text{cross\\\_ent}(y, \hat{y}) + \lambda \left( ||\mathbf{W}_1||\_2 + ||\mathbf{W}\_2||\_2 \right) \end{aligned}$$ for $\mathbf{x} \in \mathbb{R^p}$, $y \in \mathbb{R}$, $\mathbf{W}\_1 \in \mathbb{R}^{p \times q}$ and $\mathbf{W}\_2 \in \mathbb{R}^q$. --- exclude: true class: middle In the *forward pass*, intermediate values are all computed from inputs to outputs, which results in the annotated computational graph below: .width-100[![](figures/lec2/graphs/backprop.svg)] --- exclude: true class: middle The total derivative can be computed through a **backward pass**, by walking through all paths from outputs to parameters in the computational graph and accumulating the terms. For example, for $\frac{\text{d} \ell}{\text{d} \mathbf{W}\_1}$ we have: $$\begin{aligned} \frac{\text{d} \ell}{\text{d} \mathbf{W}\_1} &= \frac{\partial \ell}{\partial u\_8}\frac{\text{d} u\_8}{\text{d} \mathbf{W}\_1} + \frac{\partial \ell}{\partial u\_4}\frac{\text{d} u\_4}{\text{d} \mathbf{W}\_1} \\\\ \frac{\text{d} u\_8}{\text{d} \mathbf{W}\_1} &= ... \end{aligned}$$ .width-100[![](figures/lec2/graphs/backprop2.svg)] --- exclude: true class: middle .width-100[![](figures/lec2/graphs/backprop3.svg)] Let us zoom in on the computation of the network output $\hat{y}$ and of its derivative with respect to $\mathbf{W}\_1$. - *Forward pass*: values $u\_1$, $u\_2$, $u\_3$ and $\hat{y}$ are computed by traversing the graph from inputs to outputs given $\mathbf{x}$, $\mathbf{W}\_1$ and $\mathbf{W}\_2$. - **Backward pass**: by the chain rule we have $$\begin{aligned} \frac{\text{d} \hat{y}}{\text{d} \mathbf{W}\_1} &= \frac{\partial \hat{y}}{\partial u\_3} \frac{\partial u\_3}{\partial u\_2} \frac{\partial u\_2}{\partial u\_1} \frac{\partial u\_1}{\partial \mathbf{W}\_1} \\\\ &= \frac{\partial \sigma(u\_3)}{\partial u\_3} \frac{\partial \mathbf{W}\_2^T u\_2}{\partial u\_2} \frac{\partial \sigma(u\_1)}{\partial u\_1} \frac{\partial \mathbf{W}\_1^T \mathbf{x}}{\partial \mathbf{W}\_1} \end{aligned}$$ Note how evaluating the partial derivatives requires the intermediate values computed forward. 
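---

class: middle

The squared error derived above is the `MSELoss` of deep learning libraries (up to a constant $1/N$ factor, which does not change the minimizer). A minimal sketch (PyTorch; the small regression network and the data are made up for illustration):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # no final activation
x = torch.randn(8, 1)
y = torch.randn(8, 1)

y_hat = net(x)
loss = ((y - y_hat) ** 2).mean()                      # mean squared error
print(torch.allclose(loss, nn.MSELoss()(y_hat, y)))   # True
```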
--- exclude: true class: middle - This algorithm is also known as **backpropagation**. - An equivalent procedure can be defined to evaluate the derivatives in *forward mode*, from inputs to outputs. - Since differentiation is a linear operator, automatic differentiation can be implemented efficiently in terms of tensor operations. --- # The vanishing gradients problem Training deep MLPs with many layers has for long (pre-2011) been very difficult due to the **vanishing gradient** problem. - Small gradients slow down, and eventually block, stochastic gradient descent. - This results in a limited capacity of learning. .width-100[![](figures/lec2/vanishing-gradient.png)] .caption[Backpropagated gradients normalized histograms (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero. ] --- exclude: true class: middle Let us consider a simplified 3-layer MLP, with $x, w\_1, w\_2, w\_3 \in\mathbb{R}$, such that $$f(x; w\_1, w\_2, w\_3) = \sigma\left(w\_3\sigma\left( w\_2 \sigma\left( w\_1 x \right)\right)\right). $$ Under the hood, this would be evaluated as $$\begin{aligned} u\_1 &= w\_1 x \\\\ u\_2 &= \sigma(u\_1) \\\\ u\_3 &= w\_2 u\_2 \\\\ u\_4 &= \sigma(u\_3) \\\\ u\_5 &= w\_3 u\_4 \\\\ \hat{y} &= \sigma(u\_5) \end{aligned}$$ and its derivative $\frac{\text{d}\hat{y}}{\text{d}w\_1}$ as $$\begin{aligned}\frac{\text{d}\hat{y}}{\text{d}w\_1} &= \frac{\partial \hat{y}}{\partial u\_5} \frac{\partial u\_5}{\partial u\_4} \frac{\partial u\_4}{\partial u\_3} \frac{\partial u\_3}{\partial u\_2}\frac{\partial u\_2}{\partial u\_1}\frac{\partial u\_1}{\partial w\_1}\\\\ &= \frac{\partial \sigma(u\_5)}{\partial u\_5} w\_3 \frac{\partial \sigma(u\_3)}{\partial u\_3} w\_2 \frac{\partial \sigma(u\_1)}{\partial u\_1} x \end{aligned}$$ --- class: middle The derivative of the sigmoid activation function $\sigma$ is: .center[![](figures/lec2/activation-grad-sigmoid.png)] $$\frac{\text{d} \sigma}{\text{d} x}(x) = \sigma(x)(1-\sigma(x))$$ Notice that $0 \leq \frac{\text{d} \sigma}{\text{d} x}(x) \leq \frac{1}{4}$ for all $x$. --- exclude: true class: middle Assume that weights $w\_1, w\_2, w\_3$ are initialized randomly from a Gaussian with zero-mean and small variance, such that with high probability $-1 \leq w\_i \leq 1$. Then, $$\frac{\text{d}\hat{y}}{\text{d}w\_1} = \underbrace{\frac{\partial \sigma(u\_5)}{\partial u\_5}}\_{\leq \frac{1}{4}} \underbrace{w\_3}\_{\leq 1} \underbrace{\frac{\partial \sigma(u\_3)}{\partial u\_3}}\_{\leq \frac{1}{4}} \underbrace{w\_2}\_{\leq 1} \underbrace{\frac{\sigma(u\_1)}{\partial u\_1}}\_{\leq \frac{1}{4}} x$$ This implies that the gradient $\frac{\text{d}\hat{y}}{\text{d}w\_1}$ **exponentially** shrinks to zero as the number of layers in the network increases. Hence the vanishing gradient problem. - In general, bounded activation functions (sigmoid, tanh, etc) are prone to the vanishing gradient problem. - Note the importance of a proper initialization scheme. --- # Rectified linear units Instead of the sigmoid activation function, modern neural networks are for most based on **rectified linear units** (ReLU) (Glorot et al, 2011): $$\text{ReLU}(x) = \max(0, x)$$ .center[![](figures/lec2/activation-relu.png)] --- class: middle Note that the derivative of the ReLU function is $$\frac{\text{d}}{\text{d}x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x \leq 0 \\\\ 1 &\text{otherwise} \end{cases}$$ .center[![](figures/lec2/activation-grad-relu.png)] For $x=0$, the derivative is undefined. In practice, it is set to zero. --- exclude: true class: middle Therefore, $$\frac{\text{d}\hat{y}}{\text{d}w\_1} = \underbrace{\frac{\partial \sigma(u\_5)}{\partial u\_5}}\_{= 1} w\_3 \underbrace{\frac{\partial \sigma(u\_3)}{\partial u\_3}}\_{= 1} w\_2 \underbrace{\frac{\partial \sigma(u\_1)}{\partial u\_1}}\_{= 1} x$$ This **solves** the vanishing gradient problem, even for deep networks! (provided proper initialization) Note that: - The ReLU unit dies when its input is negative, which might block gradient descent. - This is actually a useful property to induce *sparsity*. - This issue can also be solved using **leaky** ReLUs, defined as $$\text{LeakyReLU}(x) = \max(\alpha x, x)$$ for a small $\alpha \in \mathbb{R}^+$ (e.g., $\alpha=0.1$). 
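---

class: middle

The two derivatives discussed above can be checked directly with autograd (a sketch; the probe points are arbitrary). The sigmoid gradient is at most $1/4$ everywhere, one source of vanishing gradients, while the ReLU gradient is exactly $1$ wherever the unit is active:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)

torch.sigmoid(x).sum().backward()
print(x.grad)            # all entries <= 0.25: the sigmoid gradient saturates

x.grad = None
torch.relu(x).sum().backward()
print(x.grad)            # tensor([0., 1., 1.]): gradient passes through where x > 0
```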
--- # Universal approximation (teaser) Let us consider the 1-layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate *any* smooth 1D function, provided enough hidden units. --- class: middle .center[![](figures/lec2/ua-0.png)] --- class: middle count: false .center[![](figures/lec2/ua-1.png)] --- class: middle count: false .center[![](figures/lec2/ua-2.png)] --- class: middle count: false .center[![](figures/lec2/ua-3.png)] --- class: middle count: false .center[![](figures/lec2/ua-4.png)] --- class: middle count: false .center[![](figures/lec2/ua-5.png)] --- class: middle count: false .center[![](figures/lec2/ua-6.png)] --- class: middle count: false .center[![](figures/lec2/ua-7.png)] --- class: middle count: false .center[![](figures/lec2/ua-8.png)] --- class: middle count: false .center[![](figures/lec2/ua-9.png)] --- class: middle count: false .center[![](figures/lec2/ua-10.png)] --- class: middle count: false .center[![](figures/lec2/ua-11.png)] --- class: middle count: false .center[![](figures/lec2/ua-12.png)] --- name: ex1b # .red[Exercise 1b: Simple multilayer perceptron] Your task is to improve our logistic regression model by replacing the single affine transformation with a simple hidden layer model. Start with the solution to [exercise 1](#ex1). Replace .center[ $\texttt{INPUT} \to \texttt{FC} \to \texttt{SIGMOID} \to \texttt{OUTPUT}$ $\Rightarrow$ $\texttt{INPUT} \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC} \to \texttt{SIGMOID} \to \texttt{OUTPUT}$ ] --- name: ex2 # .red[Exercise 2: Function approximators] Your task is to write a 2-layer dense neural network that approximates simple analytic functions on a grid $x_i$, $$ f_i(a, b) \equiv f(x_i,a, b) $$ for instance $$ f(x, a, b) = a\sin(x+b) $$ over the domain $x \in [-5, 5]$, and $a, b \in [0, 1]$. ## Tasks - Start with a copy of the [logistic regression example](#ex1b) - Use a neural network with one hidden layer, $\texttt{INPUT} \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC} \to \texttt{OUTPUT}$. - Plot the function predict by your neural network in direct comparison with the training function. - Investigate the quality of the result for different hidden layer dimensions. --- class: middle, center ## Example (poorly trained) ![](figures/lec2/sin.png) --- class: middle # Convolutional neural networks --- class: middle # A little history --- class: middle ## Visual perception (Hubel and Wiesel, 1959-1962) - David Hubel and Torsten Wiesel discover the neural basis of **visual perception**. - Awarded the Nobel Prize of Medicine in 1981 for their discovery. .grid.center[ .kol-4-5.center[.width-80[![](figures/lec3/cat.png)]] .kol-1-5[
.width-100.circle[![](figures/lec3/hw1.jpg)].width-100.circle[![](figures/lec3/hw2.jpg)]] ] --- class: middle, black-slide .center[
] .center[Hubel and Wiesel] --- class: middle, black-slide .center[
] .center[Hubel and Wiesel] ??? During their recordings, they noticed a few interesting things: 1. the neurons fired only when the line was in a particular place on the retina, 2. the activity of these neurons changed depending on the orientation of the line, and 3. sometimes the neurons fired only when the line was moving in a particular direction. --- class: middle .width-100.center[![](figures/lec3/hw-simple.png)] .footnote[Credits: Hubel and Wiesel, [Receptive fields, binocular interaction and functional architecture in the cat's visual cortex](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1359523/), 1962.] --- class: middle .width-100.center[![](figures/lec3/hw-complex.png)] .footnote[Credits: Hubel and Wiesel, [Receptive fields, binocular interaction and functional architecture in the cat's visual cortex](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1359523/), 1962.] --- class: middle ## The Mark-1 Perceptron (Rosenblatt, 1957-61) .center.width-80[![](figures/lec3/perceptron1.png)] - Rosenblatt builds the first implementation of a neural network. - The network is an anlogic circuit. Parameters are potentiometers. .footnote[Credits: Frank Rosenblatt, [Principle of Neurodynamics](http://www.dtic.mil/dtic/tr/fulltext/u2/256582.pdf), 1961.] ??? --- class: middle .center.width-60[![](figures/lec3/perceptron2.png)] .italic["If we show the perceptron a stimulus, say a square, and associate a response to that square, this response will immediately **generalize perfectly to all transforms** of the square under the transformation group [...]."] .footnote[Credits: Frank Rosenblatt, [Principle of Neurodynamics](http://www.dtic.mil/dtic/tr/fulltext/u2/256582.pdf), 1961.] ??? This is quite similar to Hubel and Wiesel's simple and complex cells! --- class: middle ## AI winter (Minsky and Papert, 1969+) - Minsky and Papert prove a series of impossibility results for the perceptron (or rather, a narrowly defined variant thereof). - **AI winter** follows. .center[.width-80[![](figures/lec3/minsky.png)] .width-20[![](figures/lec3/minsky-shape.png)]] .footnote[Credits: Minsky and Papert, Perceptrons: an Introduction to Computational Geometry, 1969.] --- class: middle ## Automatic differentiation (Werbos, 1974) - Werbos formulate an arbitrary function as a computational graph. - Symbolic derivatives are computed by dynamic programming. .grid[ .kol-2-5[ .center.width-100[![](figures/lec3/werbos.png)] ] .kol-3-5[
.center.width-100[![](figures/lec3/Werbos_text.png)] ] ] .footnote[Credits: [Paul Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, 1974.](https://www.researchgate.net/publication/35657389_Beyond_regression_new_tools_for_prediction_and_analysis_in_the_behavioral_sciences/link/576ac78508aef2a864d20964/download)] --- class: middle ## Neocognitron (Fukushima, 1980) .center.width-90[![](figures/lec3/neocognitron1.png)] Fukushima proposes a direct neural network implementation of the hierarchy model of the visual nervous system of Hubel and Wiesel. .footnote[Credits: Kunihiko Fukushima, [Neocognitron: A Self-organizing Neural Network Model](https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf), 1980.] --- class: middle .grid[ .kol-1-3.center[.width-100[![](figures/lec3/neocognitron2.png)] Convolutions] .kol-2-3.center[.width-100[![](figures/lec3/neocognitron3.png)] Feature hierarchy] ] .footnote[Credits: Kunihiko Fukushima, [Neocognitron: A Self-organizing Neural Network Model](https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf), 1980.] ??? - Built upon **convolutions** and enables the composition of a *feature hierarchy*. - Biologically-inspired training algorithm, which proves to be largely **inefficient**. --- class: middle ## Backpropagation (Rumelhart et al, 1986) .grid[ .kol-1-2[ - Rumelhart and Hinton introduce **backpropagation** in multi-layer networks with sigmoid non-linearities and sum of squares loss function. - They advocate for batch gradient descent in supervised learning. - Discuss online gradient descent, momentum and random initialization. - Depart from *biologically plausible* training algorithms. ] .kol-1-2[ .center.width-100[![](figures/lec3/rumelhart.png)] ] ] .footnote[Credits: Rumelhart et al, [Learning representations by back-propagating errors](http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf), 1986.] --- class: middle ## Convolutional networks (LeCun, 1990) - LeCun trains a convolutional network by backpropagation. - He advocates for end-to-end feature learning in image classification. .center.width-70[![](figures/lec3/lenet-1990.png)] .footnote[Credits: LeCun et al, [Handwritten Digit Recognition with a Back-Propagation Network](http://yann.lecun.com/exdb/publis/pdf/lecun-90c.pdf), 1990.] --- class: middle, black-slide .center[
] .center[LeNet-1 (LeCun et al, 1993)] --- class: middle ## AlexNet (Krizhevsky et al, 2012) - Krizhevsky trains a convolutional network on ImageNet with two GPUs. - 16.4% top-5 error on ILSVRC'12, outperforming all other entries by 10% or more. - This event triggers the deep learning revolution. .center.width-100[![](figures/lec3/alexnet.png)] --- class: middle # Convolutions --- class: middle If they were handled as normal "unstructured" vectors, high-dimensional signals such as sound samples or images would require models of intractable size. E.g., a linear layer taking $256\times 256$ RGB images as input and producing an image of same size would require $$(256 \times 256 \times 3)^2 \approx 3.87e+10$$ parameters, with the corresponding memory footprint (150Gb!), and excess of capacity. .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- class: middle This requirement is also inconsistent with the intuition that such large signals have some "invariance in translation". .bold[A representation meaningful at a certain location can / should be used everywhere]. A convolution layer embodies this idea. It applies the same linear transformation locally everywhere while preserving the signal structure. .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- class: middle .center[![](figures/lec3/1d-conv.gif)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- # Convolutions For one-dimensional tensors, given an input vector $\mathbf{x} \in \mathbb{R}^W$ and a convolutional kernel $\mathbf{u} \in \mathbb{R}^w$, the discrete **convolution** $\mathbf{x} \circledast \mathbf{u}$ is a vector of size $W - w + 1$ such that $$\begin{aligned} (\mathbf{x} \circledast \mathbf{u})[i] &= \sum\_{m=0}^{w-1} x\_{m+i} u\_m . \end{aligned} $$ ## Note Technically, $\circledast$ denotes the cross-correlation operator. However, most machine learning libraries call it convolution. --- class: middle Convolutions can implement differential operators: $$(0,0,0,0,1,2,3,4,4,4,4) \circledast (-1,1) = (0,0,0,1,1,1,1,0,0,0) $$ .center.width-100[![](figures/lec3/conv-op1.png)] or crude template matchers: .center.width-100[![](figures/lec3/conv-op2.png)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- class: middle Convolutions generalize to multi-dimensional tensors: - In its most usual form, a convolution takes as input a 3D tensor $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, called the **input feature map**. - A kernel $\mathbf{u} \in \mathbb{R}^{C \times h \times w}$ slides across the input feature map, along its height and width. The size $h \times w$ is the size of the *receptive field*. - At each location, the element-wise product between the kernel and the input elements it overlaps is computed and the results are summed up. --- class: middle .center[![](figures/lec3/3d-conv.gif)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] 
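---

class: middle

The finite-difference example from a few slides back can be checked directly with a library convolution (a minimal sketch; note that `conv1d`, like most deep learning libraries, computes the cross-correlation defined above):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0., 0., 0., 0., 1., 2., 3., 4., 4., 4., 4.]).view(1, 1, -1)  # (batch, channels, W)
u = torch.tensor([-1., 1.]).view(1, 1, -1)                                      # (out_ch, in_ch, w)
print(F.conv1d(x, u).flatten())
# tensor([0., 0., 0., 1., 1., 1., 1., 0., 0., 0.])
```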
---

class: middle

- The final output $\mathbf{o}$ is a 2D tensor of size $(H-h+1) \times (W-w+1)$ called the **output feature map** and such that:
$$\begin{aligned} \mathbf{o}\_{j,i} &= \mathbf{b}\_{j,i} + \sum\_{c=0}^{C-1} (\mathbf{x}\_c \circledast \mathbf{u}\_c)[j,i] = \mathbf{b}\_{j,i} + \sum\_{c=0}^{C-1} \sum\_{n=0}^{h-1} \sum\_{m=0}^{w-1} \mathbf{x}\_{c,n+j,m+i} \mathbf{u}\_{c,n,m} \end{aligned}$$
where $\mathbf{u}$ and $\mathbf{b}$ are shared parameters to learn.
- $D$ convolutions can be applied in the same way to produce a $D \times (H-h+1) \times (W-w+1)$ feature map, where $D$ is the depth.
- Sliding the kernel across channels with a 3D convolution usually makes no sense, unless the channel index has some metric meaning.

---

exclude: true
class: middle

Convolutions have three additional parameters:
- The *padding* specifies the size of a zeroed frame added around the input.
- The **stride** specifies a step size when moving the kernel across the signal.
- The *dilation* modulates the expansion of the filter without adding weights.

.footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.]

---

exclude: true
class: middle

## Padding

Padding is useful to control the spatial dimension of the feature map, for example to keep it constant across layers.

.center[
.width-45[![](figures/lec3/same_padding_no_strides.gif)]
.width-45[![](figures/lec3/full_padding_no_strides.gif)]
]

.footnote[Credits: Dumoulin and Visin, [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285), 2016.]

---

exclude: true
class: middle

## Strides

Stride is useful to reduce the spatial dimension of the feature map by a constant factor.

.center[
.width-45[![](figures/lec3/no_padding_strides.gif)]
]

.footnote[Credits: Dumoulin and Visin, [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285), 2016.]

---

exclude: true
class: middle

## Dilation

The dilation modulates the expansion of the kernel support by adding rows and columns of zeros between coefficients. A dilation coefficient greater than one increases the unit's receptive field size without increasing the number of parameters.

.center[
.width-45[![](figures/lec3/dilation.gif)]
]

.footnote[Credits: Dumoulin and Visin, [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285), 2016.]

---

# Equivariance

A function $f$ is **equivariant** to $g$ if $f(g(\mathbf{x})) = g(f(\mathbf{x}))$.
- Parameter sharing used in a convolutional layer causes the layer to be equivariant to translation.
- That is, if $g$ is any function that translates the input, the convolution function is equivariant to $g$.

.center.width-50[![](figures/lec3/atrans.gif)]

.caption[If an object moves in the input image, its representation will move the same amount in the output.]

.footnote[Credits: LeCun et al, Gradient-based learning applied to document recognition, 1998.]

---

exclude: true
class: middle

- Equivariance is useful when we know some local function is useful everywhere (e.g., edge detectors).
- Convolution is not equivariant to other operations such as change in scale or rotation.
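---

class: middle

Translation equivariance can be verified numerically. A minimal sketch with a randomly initialized kernel, using circular padding so that the translation wraps around: shifting the input and then convolving gives the same result as convolving and then shifting the output.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=5, padding=2, padding_mode='circular', bias=False)
x = torch.randn(1, 3, 32, 32)

shift = dict(shifts=(3, 5), dims=(-2, -1))
out1 = conv(torch.roll(x, **shift))      # translate, then convolve
out2 = torch.roll(conv(x), **shift)      # convolve, then translate
print(out1.shape, torch.allclose(out1, out2, atol=1e-5))  # torch.Size([1, 16, 32, 32]) True
```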
--- exclude: true # Convolutions as matrix multiplications As a guiding example, let us consider the convolution of single-channel tensors $\mathbf{x} \in \mathbb{R}^{4 \times 4}$ and $\mathbf{u} \in \mathbb{R}^{3 \times 3}$: $$ \mathbf{x} \circledast \mathbf{u} = \begin{pmatrix} 4 & 5 & 8 & 7 \\\\ 1 & 8 & 8 & 8 \\\\ 3 & 6 & 6 & 4 \\\\ 6 & 5 & 7 & 8 \end{pmatrix} \circledast \begin{pmatrix} 1 & 4 & 1 \\\\ 1 & 4 & 3 \\\\ 3 & 3 & 1 \end{pmatrix} = \begin{pmatrix} 122 & 148 \\\\ 126 & 134 \end{pmatrix}$$ --- exclude: true class: middle The convolution operation can be equivalently re-expressed as a single matrix multiplication: - the convolutional kernel $\mathbf{u}$ is rearranged as a **sparse Toeplitz circulant matrix**, called the convolution matrix: $$\mathbf{U} = \begin{pmatrix} 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\\\ 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 & 0 & 0 & 0 \\\\ 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 & 0 \\\\ 0 & 0 & 0 & 0 & 0 & 1 & 4 & 1 & 0 & 1 & 4 & 3 & 0 & 3 & 3 & 1 \end{pmatrix}$$ - the input $\mathbf{x}$ is flattened row by row, from top to bottom: $$v(\mathbf{x}) = \begin{pmatrix} 4 & 5 & 8 & 7 & 1 & 8 & 8 & 8 & 3 & 6 & 6 & 4 & 6 & 5 & 7 & 8 \end{pmatrix}^T$$ Then, $$\mathbf{U}v(\mathbf{x}) = \begin{pmatrix} 122 & 148 & 126 & 134 \end{pmatrix}^T$$ which we can reshape to a $2 \times 2$ matrix to obtain $\mathbf{x} \circledast \mathbf{u}$. --- exclude: true class: middle The same procedure generalizes to $\mathbf{x} \in \mathbb{R}^{H \times W}$ and convolutional kernel $\mathbf{u} \in \mathbb{R}^{h \times w}$, such that: - the convolutional kernel is rearranged as a sparse Toeplitz circulant matrix $\mathbf{U}$ of shape $(H-h+1)(W-w+1) \times HW$ where - each row $i$ identifies an element of the output feature map, - each column $j$ identifies an element of the input feature map, - the value $\mathbf{U}\_{i,j}$ corresponds to the kernel value the element $j$ is multiplied with in output $i$; - the input $\mathbf{x}$ is flattened into a column vector $v(\mathbf{x})$ of shape $HW \times 1$; - the output feature map $\mathbf{x} \circledast \mathbf{u}$ is obtained by reshaping the $(H-h+1)(W-w+1) \times 1$ column vector $\mathbf{U}v(\mathbf{x})$ as a $(H-h+1) \times (W-w+1)$ matrix. Therefore, a convolutional layer is a special case of a fully connected layer: $$\mathbf{h} = \mathbf{x} \circledast \mathbf{u} \Leftrightarrow v(\mathbf{h}) = \mathbf{U}v(\mathbf{x}) \Leftrightarrow v(\mathbf{h}) = \mathbf{W}^T v(\mathbf{x})$$ --- exclude: true class: middle, center ![](figures/lec3/convolution.svg) $$\Leftrightarrow$$ ![](figures/lec3/convolution-linear.svg) --- class: middle # Pooling --- class: middle When the input volume is large, **pooling layers** can be used to reduce the input dimension while preserving its global structure, in a way similar to a down-scaling operation. --- # Pooling Consider a pooling area of size $h \times w$ and a 3D input tensor $\mathbf{x} \in \mathbb{R}^{C\times(rh)\times(sw)}$. - Max-pooling produces a tensor $\mathbf{o} \in \mathbb{R}^{C \times r \times s}$ such that $$\mathbf{o}\_{c,j,i} = \max\_{n < h, m < w} \mathbf{x}_{c,rj+n,si+m}.$$ - Average pooling produces a tensor $\mathbf{o} \in \mathbb{R}^{C \times r \times s}$ such that $$\mathbf{o}\_{c,j,i} = \frac{1}{hw} \sum\_{n=0}^{h-1} \sum\_{m=0}^{w-1} \mathbf{x}_{c,rj+n,si+m}.$$ Pooling is very similar in its formulation to convolution. 
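---

class: middle

A minimal sketch of both pooling operations (PyTorch; the values are random, only the shapes matter here). A $2\times 2$ pooling area halves the spatial dimensions while keeping the number of channels:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)                 # (batch, C, r*h, s*w)
print(F.max_pool2d(x, kernel_size=2).shape)    # torch.Size([1, 16, 16, 16])
print(F.avg_pool2d(x, kernel_size=2).shape)    # torch.Size([1, 16, 16, 16])
```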
--- class: middle .center[![](figures/lec3/pooling.gif)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- # Invariance A function $f$ is **invariant** to $g$ if $f(g(\mathbf{x})) = f(\mathbf{x})$. - Pooling layers provide invariance to any permutation inside one cell. - It results in (pseudo-)invariance to local translations. - This helpful if we care more about the presence of a pattern rather than its exact position. .center.width-60[![](figures/lec3/pooling-invariance.png)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- class: middle # Convolutional networks --- class: middle A **convolutional network** is generically defined as a composition of convolutional layers ($\texttt{CONV}$), pooling layers ($\texttt{POOL}$), linear rectifiers ($\texttt{RELU}$) and fully connected layers ($\texttt{FC}$). .center.width-100[![](figures/lec3/convnet-pattern.png)] --- class: middle The most common convolutional network architecture follows the pattern: $$\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{RELU}]\texttt{\*}N \to \texttt{POOL?}]\texttt{\*}M \to [\texttt{FC} \to \texttt{RELU}]\texttt{\*}K \to \texttt{FC}$$ where: - $\texttt{\*}$ indicates repetition; - $\texttt{POOL?}$ indicates an optional pooling layer; - $N \geq 0$ (and usually $N \leq 3$), $M \geq 0$, $K \geq 0$ (and usually $K < 3$); - the last fully connected layer holds the output (e.g., the class scores). --- class: middle Some common architectures for convolutional networks following this pattern include: - $\texttt{INPUT} \to \texttt{FC}$, which implements a linear classifier ($N=M=K=0$). - $\texttt{INPUT} \to [\texttt{FC} \to \texttt{RELU}]{\*K} \to \texttt{FC}$, which implements a $K$-layer MLP. - $\texttt{INPUT} \to \texttt{CONV} \to \texttt{RELU} \to \texttt{FC}$. - $\texttt{INPUT} \to [\texttt{CONV} \to \texttt{RELU} \to \texttt{POOL}]\texttt{\*2} \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC}$. - $\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{RELU}]\texttt{\*2} \to \texttt{POOL}]\texttt{\*3} \to [\texttt{FC} \to \texttt{RELU}]\texttt{\*2} \to \texttt{FC}$. ??? Note that for the last architecture, two $\texttt{CONV}$ layers are stacked before every $\texttt{POOL}$ layer. This is generally a good idea for larger and deeper networks, because multiple stacked $\texttt{CONV}$ layers can develop more complex features of the input volume before the destructive pooling operation. --- class: center, middle, black-slide .width-100[![](figures/lec3/convnet.gif)] --- class: middle ## LeNet-5 (LeCun et al, 1998) Composition of two $\texttt{CONV}+\texttt{POOL}$ layers, followed by a block of fully-connected layers. .center.width-110[![](figures/lec3/lenet.svg)] .footnote[Credits: [Dive Into Deep Learning](https://d2l.ai/), 2020.] 
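---

class: middle

A minimal PyTorch sketch of such a network (an illustration, not the original implementation; layer sizes chosen to be consistent with the summary on the next slide):

```python
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),   # 1x28x28 -> 6x28x28
    nn.MaxPool2d(2),                                         # -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),              # -> 16x10x10
    nn.MaxPool2d(2),                                         # -> 16x5x5
    nn.Conv2d(16, 120, kernel_size=5), nn.ReLU(),            # -> 120x1x1
    nn.Flatten(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
    nn.LogSoftmax(dim=1),                                     # class log-probabilities
)
print(sum(p.numel() for p in lenet.parameters()))             # 61706
```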
--- class: middle .smaller-x.center[ ``` ---------------------------------------------------------------- Layer (type) Output Shape Param # ================================================================ Conv2d-1 [-1, 6, 28, 28] 156 ReLU-2 [-1, 6, 28, 28] 0 MaxPool2d-3 [-1, 6, 14, 14] 0 Conv2d-4 [-1, 16, 10, 10] 2,416 ReLU-5 [-1, 16, 10, 10] 0 MaxPool2d-6 [-1, 16, 5, 5] 0 Conv2d-7 [-1, 120, 1, 1] 48,120 ReLU-8 [-1, 120, 1, 1] 0 Linear-9 [-1, 84] 10,164 ReLU-10 [-1, 84] 0 Linear-11 [-1, 10] 850 LogSoftmax-12 [-1, 10] 0 ================================================================ Total params: 61,706 Trainable params: 61,706 Non-trainable params: 0 ---------------------------------------------------------------- Input size (MB): 0.00 Forward/backward pass size (MB): 0.11 Params size (MB): 0.24 Estimated Total Size (MB): 0.35 ---------------------------------------------------------------- ``` ] --- class: middle .grid[ .kol-3-5[
## AlexNet (Krizhevsky et al, 2012) Composition of a 8-layer convolutional neural network with a 3-layer MLP. The original implementation was made of two parts such that it could fit within two GPUs. ] .kol-2-5.center[.width-100[![](figures/lec3/alexnet.svg)] .caption[LeNet vs. AlexNet] ] ] .footnote[Credits: [Dive Into Deep Learning](https://d2l.ai/), 2020.] --- class: middle .smaller-x.center[ ``` ---------------------------------------------------------------- Layer (type) Output Shape Param # ================================================================ Conv2d-1 [-1, 64, 55, 55] 23,296 ReLU-2 [-1, 64, 55, 55] 0 MaxPool2d-3 [-1, 64, 27, 27] 0 Conv2d-4 [-1, 192, 27, 27] 307,392 ReLU-5 [-1, 192, 27, 27] 0 MaxPool2d-6 [-1, 192, 13, 13] 0 Conv2d-7 [-1, 384, 13, 13] 663,936 ReLU-8 [-1, 384, 13, 13] 0 Conv2d-9 [-1, 256, 13, 13] 884,992 ReLU-10 [-1, 256, 13, 13] 0 Conv2d-11 [-1, 256, 13, 13] 590,080 ReLU-12 [-1, 256, 13, 13] 0 MaxPool2d-13 [-1, 256, 6, 6] 0 Dropout-14 [-1, 9216] 0 Linear-15 [-1, 4096] 37,752,832 ReLU-16 [-1, 4096] 0 Dropout-17 [-1, 4096] 0 Linear-18 [-1, 4096] 16,781,312 ReLU-19 [-1, 4096] 0 Linear-20 [-1, 1000] 4,097,000 ================================================================ Total params: 61,100,840 Trainable params: 61,100,840 Non-trainable params: 0 ---------------------------------------------------------------- Input size (MB): 0.57 Forward/backward pass size (MB): 8.31 Params size (MB): 233.08 Estimated Total Size (MB): 241.96 ---------------------------------------------------------------- ``` ] --- exclude: true class: middle .grid[ .kol-2-5[
## VGG (Simonyan and Zisserman, 2014) Composition of 5 VGG blocks consisting of $\texttt{CONV}+\texttt{POOL}$ layers, followed by a block of fully connected layers. The network depth increased up to 19 layers, while the kernel sizes reduced to 3. ] .kol-3-5.center[.width-100[![](figures/lec3/vgg.svg)] .caption[AlexNet vs. VGG] ] ] .footnote[Credits: [Dive Into Deep Learning](https://d2l.ai/), 2020.] --- exclude: true class: middle .center.width-60[![](figures/lec3/effective-receptive-field.png)] The **effective receptive field** is the part of the visual input that affects a given unit indirectly through previous convolutional layers. It grows linearly with depth. E.g., a stack of two $3 \times 3$ kernels of stride $1$ has the same effective receptive field as a single $5 \times 5$ kernel, but fewer parameters. --- exclude: true class: middle .smaller-xx.center[ ``` ---------------------------------------------------------------- Layer (type) Output Shape Param # ================================================================ Conv2d-1 [-1, 64, 224, 224] 1,792 ReLU-2 [-1, 64, 224, 224] 0 Conv2d-3 [-1, 64, 224, 224] 36,928 ReLU-4 [-1, 64, 224, 224] 0 MaxPool2d-5 [-1, 64, 112, 112] 0 Conv2d-6 [-1, 128, 112, 112] 73,856 ReLU-7 [-1, 128, 112, 112] 0 Conv2d-8 [-1, 128, 112, 112] 147,584 ReLU-9 [-1, 128, 112, 112] 0 MaxPool2d-10 [-1, 128, 56, 56] 0 Conv2d-11 [-1, 256, 56, 56] 295,168 ReLU-12 [-1, 256, 56, 56] 0 Conv2d-13 [-1, 256, 56, 56] 590,080 ReLU-14 [-1, 256, 56, 56] 0 Conv2d-15 [-1, 256, 56, 56] 590,080 ReLU-16 [-1, 256, 56, 56] 0 MaxPool2d-17 [-1, 256, 28, 28] 0 Conv2d-18 [-1, 512, 28, 28] 1,180,160 ReLU-19 [-1, 512, 28, 28] 0 Conv2d-20 [-1, 512, 28, 28] 2,359,808 ReLU-21 [-1, 512, 28, 28] 0 Conv2d-22 [-1, 512, 28, 28] 2,359,808 ReLU-23 [-1, 512, 28, 28] 0 MaxPool2d-24 [-1, 512, 14, 14] 0 Conv2d-25 [-1, 512, 14, 14] 2,359,808 ReLU-26 [-1, 512, 14, 14] 0 Conv2d-27 [-1, 512, 14, 14] 2,359,808 ReLU-28 [-1, 512, 14, 14] 0 Conv2d-29 [-1, 512, 14, 14] 2,359,808 ReLU-30 [-1, 512, 14, 14] 0 MaxPool2d-31 [-1, 512, 7, 7] 0 Linear-32 [-1, 4096] 102,764,544 ReLU-33 [-1, 4096] 0 Dropout-34 [-1, 4096] 0 Linear-35 [-1, 4096] 16,781,312 ReLU-36 [-1, 4096] 0 Dropout-37 [-1, 4096] 0 Linear-38 [-1, 1000] 4,097,000 ================================================================ Total params: 138,357,544 Trainable params: 138,357,544 Non-trainable params: 0 ---------------------------------------------------------------- Input size (MB): 0.57 Forward/backward pass size (MB): 218.59 Params size (MB): 527.79 Estimated Total Size (MB): 746.96 ---------------------------------------------------------------- ``` ] --- exclude: true class: middle .grid[ .kol-4-5[ ## GoogLeNet (Szegedy et al, 2014) Composition of two $\texttt{CONV}+\texttt{POOL}$ layers, a stack of 9 inception blocks, and a global average pooling layer. Each inception block is itself defined as a convolutional network with 4 parallel paths. .center.width-80[![](figures/lec3/inception.svg)] .caption[Inception block] ] .kol-1-5.center[.width-100[![](figures/lec3/inception-full.svg)]] ] .footnote[Credits: [Dive Into Deep Learning](https://d2l.ai/), 2020.] --- exclude: true class: middle .grid[ .kol-4-5[ ## ResNet (He et al, 2015) Composition of first layers similar to GoogLeNet, a stack of 4 residual blocks, and a global average pooling layer. Extensions consider more residual blocks, up to a total of 152 layers (ResNet-152). .center.width-80[![](figures/lec3/resnet-block.svg)] .center.caption[Regular ResNet block vs. 
ResNet block with $1\times 1$ convolution.] ] .kol-1-5[.center.width-100[![](figures/lec3/ResNetFull.svg)]] ] .footnote[Credits: [Dive Into Deep Learning](https://d2l.ai/), 2020.] --- exclude: true class: middle Training networks of this depth is made possible because of the **skip connections** in the residual blocks. They allow the gradients to shortcut the layers and pass through without vanishing. .center.width-60[![](figures/lec3/residual-block.svg)] .footnote[Credits: [Dive Into Deep Learning](https://d2l.ai/), 2020.] --- exclude: true class: middle .grid[ .kol-1-2[ .smaller-xx.center[ ``` ---------------------------------------------------------------- Layer (type) Output Shape Param # ================================================================ Conv2d-1 [-1, 64, 112, 112] 9,408 BatchNorm2d-2 [-1, 64, 112, 112] 128 ReLU-3 [-1, 64, 112, 112] 0 MaxPool2d-4 [-1, 64, 56, 56] 0 Conv2d-5 [-1, 64, 56, 56] 4,096 BatchNorm2d-6 [-1, 64, 56, 56] 128 ReLU-7 [-1, 64, 56, 56] 0 Conv2d-8 [-1, 64, 56, 56] 36,864 BatchNorm2d-9 [-1, 64, 56, 56] 128 ReLU-10 [-1, 64, 56, 56] 0 Conv2d-11 [-1, 256, 56, 56] 16,384 BatchNorm2d-12 [-1, 256, 56, 56] 512 Conv2d-13 [-1, 256, 56, 56] 16,384 BatchNorm2d-14 [-1, 256, 56, 56] 512 ReLU-15 [-1, 256, 56, 56] 0 Bottleneck-16 [-1, 256, 56, 56] 0 Conv2d-17 [-1, 64, 56, 56] 16,384 BatchNorm2d-18 [-1, 64, 56, 56] 128 ReLU-19 [-1, 64, 56, 56] 0 Conv2d-20 [-1, 64, 56, 56] 36,864 BatchNorm2d-21 [-1, 64, 56, 56] 128 ReLU-22 [-1, 64, 56, 56] 0 Conv2d-23 [-1, 256, 56, 56] 16,384 BatchNorm2d-24 [-1, 256, 56, 56] 512 ReLU-25 [-1, 256, 56, 56] 0 Bottleneck-26 [-1, 256, 56, 56] 0 Conv2d-27 [-1, 64, 56, 56] 16,384 BatchNorm2d-28 [-1, 64, 56, 56] 128 ReLU-29 [-1, 64, 56, 56] 0 Conv2d-30 [-1, 64, 56, 56] 36,864 BatchNorm2d-31 [-1, 64, 56, 56] 128 ReLU-32 [-1, 64, 56, 56] 0 Conv2d-33 [-1, 256, 56, 56] 16,384 BatchNorm2d-34 [-1, 256, 56, 56] 512 ReLU-35 [-1, 256, 56, 56] 0 Bottleneck-36 [-1, 256, 56, 56] 0 Conv2d-37 [-1, 128, 56, 56] 32,768 BatchNorm2d-38 [-1, 128, 56, 56] 256 ReLU-39 [-1, 128, 56, 56] 0 Conv2d-40 [-1, 128, 28, 28] 147,456 BatchNorm2d-41 [-1, 128, 28, 28] 256 ReLU-42 [-1, 128, 28, 28] 0 Conv2d-43 [-1, 512, 28, 28] 65,536 BatchNorm2d-44 [-1, 512, 28, 28] 1,024 Conv2d-45 [-1, 512, 28, 28] 131,072 BatchNorm2d-46 [-1, 512, 28, 28] 1,024 ReLU-47 [-1, 512, 28, 28] 0 Bottleneck-48 [-1, 512, 28, 28] 0 Conv2d-49 [-1, 128, 28, 28] 65,536 BatchNorm2d-50 [-1, 128, 28, 28] 256 ReLU-51 [-1, 128, 28, 28] 0 Conv2d-52 [-1, 128, 28, 28] 147,456 BatchNorm2d-53 [-1, 128, 28, 28] 256 ... ``` ] ] .kol-1-2[ .smaller-xx.center[ ``` ... 
Bottleneck-130  [-1, 1024, 14, 14]  0
       Conv2d-131  [-1, 256, 14, 14]  262,144
  BatchNorm2d-132  [-1, 256, 14, 14]  512
         ReLU-133  [-1, 256, 14, 14]  0
       Conv2d-134  [-1, 256, 14, 14]  589,824
  BatchNorm2d-135  [-1, 256, 14, 14]  512
         ReLU-136  [-1, 256, 14, 14]  0
       Conv2d-137  [-1, 1024, 14, 14]  262,144
  BatchNorm2d-138  [-1, 1024, 14, 14]  2,048
         ReLU-139  [-1, 1024, 14, 14]  0
   Bottleneck-140  [-1, 1024, 14, 14]  0
       Conv2d-141  [-1, 512, 14, 14]  524,288
  BatchNorm2d-142  [-1, 512, 14, 14]  1,024
         ReLU-143  [-1, 512, 14, 14]  0
       Conv2d-144  [-1, 512, 7, 7]  2,359,296
  BatchNorm2d-145  [-1, 512, 7, 7]  1,024
         ReLU-146  [-1, 512, 7, 7]  0
       Conv2d-147  [-1, 2048, 7, 7]  1,048,576
  BatchNorm2d-148  [-1, 2048, 7, 7]  4,096
       Conv2d-149  [-1, 2048, 7, 7]  2,097,152
  BatchNorm2d-150  [-1, 2048, 7, 7]  4,096
         ReLU-151  [-1, 2048, 7, 7]  0
   Bottleneck-152  [-1, 2048, 7, 7]  0
       Conv2d-153  [-1, 512, 7, 7]  1,048,576
  BatchNorm2d-154  [-1, 512, 7, 7]  1,024
         ReLU-155  [-1, 512, 7, 7]  0
       Conv2d-156  [-1, 512, 7, 7]  2,359,296
  BatchNorm2d-157  [-1, 512, 7, 7]  1,024
         ReLU-158  [-1, 512, 7, 7]  0
       Conv2d-159  [-1, 2048, 7, 7]  1,048,576
  BatchNorm2d-160  [-1, 2048, 7, 7]  4,096
         ReLU-161  [-1, 2048, 7, 7]  0
   Bottleneck-162  [-1, 2048, 7, 7]  0
       Conv2d-163  [-1, 512, 7, 7]  1,048,576
  BatchNorm2d-164  [-1, 512, 7, 7]  1,024
         ReLU-165  [-1, 512, 7, 7]  0
       Conv2d-166  [-1, 512, 7, 7]  2,359,296
  BatchNorm2d-167  [-1, 512, 7, 7]  1,024
         ReLU-168  [-1, 512, 7, 7]  0
       Conv2d-169  [-1, 2048, 7, 7]  1,048,576
  BatchNorm2d-170  [-1, 2048, 7, 7]  4,096
         ReLU-171  [-1, 2048, 7, 7]  0
   Bottleneck-172  [-1, 2048, 7, 7]  0
    AvgPool2d-173  [-1, 2048, 1, 1]  0
       Linear-174  [-1, 1000]  2,049,000
================================================================
Total params: 25,557,032
Trainable params: 25,557,032
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 286.56
Params size (MB): 97.49
Estimated Total Size (MB): 384.62
----------------------------------------------------------------
```
]
]
]

---

class: middle

## The benefits of depth

.center.width-100[![](figures/lec3/imagenet.png)]

---

name: ex3

# .red[Exercise 3]

Your task is to code up a simple convolutional network and train it to perform parameter regression. The goal is to estimate the radius (and position?) of a partially visible ring.

.center[
![](figures/lec2/rings.png)
]

- Start with this [Google Colab](https://colab.research.google.com/drive/1uWnYd8KuizlbqemCPrSZD6hmxhF-2dSM) notebook.
- Use a structure $\texttt{INPUT} \to [\texttt{CONV} \to \texttt{RELU} \to \texttt{POOL}]\texttt{\*2} \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC}$, which is similar to the LeNet structure.

---

class: middle, center

.center[![](https://media.giphy.com/media/st83jeYy9L6Bq/giphy.gif)]

---

class: middle

# Under the hood

---

class: middle

Understanding what is happening in deep neural networks after training is complex and the tools we have are limited. In the case of convolutional neural networks, we can look at:
- the network's kernels as images
- internal activations on a single sample as images
- distributions of activations on a population of samples
- derivatives of the response with respect to the input
- maximum-response synthetic samples

.footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.]

---

# Looking at filters
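A minimal sketch of how filter images like the ones below can be produced in PyTorch (assuming a trained model whose first layer is an `nn.Conv2d` attribute named `conv1`; both names are illustrative, not part of the exercise notebooks):

```python
import matplotlib.pyplot as plt

def show_filters(conv_layer, n_cols=8):
    """Plot each kernel of a Conv2d layer as a small grayscale image."""
    # weight shape: (out_channels, in_channels, kH, kW); average over input channels
    w = conv_layer.weight.detach().cpu().mean(dim=1).numpy()
    n_rows = -(-len(w) // n_cols)  # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < len(w):
            ax.imshow(w[i], cmap="gray")
    plt.show()

# show_filters(model.conv1)  # `model` and `conv1` are assumed names
```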
LeNet's first convolutional layer, all filters. .width-100[![](figures/lec3/filters-lenet1.png)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- class: middle LeNet's second convolutional layer, first 32 filters. .center.width-70[![](figures/lec3/filters-lenet2.png)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- class: middle AlexNet's first convolutional layer, first 20 filters. .center.width-100[![](figures/lec3/filters-alexnet.png)] .footnote[Credits: Francois Fleuret, [EE559 Deep Learning](https://fleuret.org/ee559/), EPFL.] --- # Maximum response samples Convolutional networks can be inspected by looking for synthetic input images $\mathbf{x}$ that maximize the activation $\mathbf{h}\_{\ell,d}(\mathbf{x})$ of a chosen convolutional kernel $\mathbf{u}$ at layer $\ell$ and index $d$ in the layer filter bank. These samples can be found by gradient ascent on the input space: $$\begin{aligned} \mathcal{L}\_{\ell,d}(\mathbf{x}) &= ||\mathbf{h}\_{\ell,d}(\mathbf{x})||\_2\\\\ \mathbf{x}\_0 &\sim U[0,1]^{C \times H \times W } \\\\ \mathbf{x}\_{t+1} &= \mathbf{x}\_t + \gamma \nabla\_{\mathbf{x}} \mathcal{L}\_{\ell,d}(\mathbf{x}\_t) \end{aligned}$$ --- class: middle .width-100[![](figures/lec3/vgg16-conv1.jpg)] .center[VGG-16, convolutional layer 1-1, a few of the 64 filters] .footnote[Credits: Francois Chollet, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), 2016.] --- class: middle .width-100[![](figures/lec3/vgg16-conv2.jpg)] .center[VGG-16, convolutional layer 2-1, a few of the 128 filters] .footnote[Credits: Francois Chollet, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), 2016.] --- class: middle .width-100[![](figures/lec3/vgg16-conv3.jpg)] .center[VGG-16, convolutional layer 3-1, a few of the 256 filters] .footnote[Credits: Francois Chollet, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), 2016.] --- class: middle .width-100[![](figures/lec3/vgg16-conv4.jpg)] .center[VGG-16, convolutional layer 4-1, a few of the 512 filters] .footnote[Credits: Francois Chollet, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), 2016.] --- class: middle .width-100[![](figures/lec3/vgg16-conv5.jpg)] .center[VGG-16, convolutional layer 5-1, a few of the 512 filters] .footnote[Credits: Francois Chollet, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), 2016.] --- class: middle Some observations: - The first layers appear to encode direction and color. - The direction and color filters get combined into grid and spot textures. - These textures gradually get combined into increasingly complex patterns. The network appears to learn a .bold[hierarchical composition of patterns]. .width-70.center[![](figures/lec3/lecun-filters.png)] ---
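class: middle

A minimal PyTorch sketch of the gradient-ascent recipe for maximum-response samples introduced above (the forward hook and the layer/filter choice in the usage line are illustrative assumptions, not part of the original notebooks):

```python
import torch

def maximum_response_sample(model, layer, d, steps=256, gamma=0.1,
                            shape=(1, 3, 224, 224)):
    """Gradient ascent on the input to maximize ||h_{l,d}(x)||_2."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, inp, out: acts.update(h=out))
    x = torch.rand(shape, requires_grad=True)   # x_0 ~ U[0,1]^(C x H x W)
    for _ in range(steps):
        model(x)                                # forward pass fills acts["h"]
        loss = acts["h"][:, d].norm()           # L_{l,d}(x) = ||h_{l,d}(x)||_2
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x += gamma * grad                   # x_{t+1} = x_t + gamma * grad_x L
    handle.remove()
    return x.detach()

# e.g. x = maximum_response_sample(vgg16, vgg16.features[0], d=0)  # assumed model
```

---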
What if we build images that maximize the activation of a chosen class output? -- count: false The left image is predicted **with 99.9% confidence** as a magpie! .grid[ .kol-1-2.center[![](figures/lec3/magpie.jpg)] .kol-1-2.center[![](figures/lec3/magpie2.jpg)] ] .footnote[Credits: Francois Chollet, [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html), 2016.] --- class: middle, black-slide .center[
]

.bold[Deep Dream.] Start from an image $\mathbf{x}\_t$, offset by a random jitter, enhance some layer activation at multiple scales, zoom in, repeat on the produced image $\mathbf{x}\_{t+1}$.

---

# Biological plausibility

.center.width-80[![](figures/lec3/bio.png)]

.italic["Deep hierarchical neural networks are beginning to transform neuroscientists’ ability to produce quantitatively accurate computational models of the sensory systems, especially in higher cortical areas where neural response properties had previously been enigmatic."]

.footnote[Credits: Yamins et al, Using goal-driven deep learning models to understand sensory cortex, 2016.]

---

class: middle, center

# Neural likelihood-free inference

---

class: middle

# Bayes' theorem

Bayes' theorem connects the likelihood, prior, posterior, and evidence:

$$ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} $$

---

class: center, middle, black-slide
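---

class: middle

For simulator-based models, sampling pairs $(x, \theta)$ from the joint $P(x, \theta) = P(x|\theta)P(\theta)$ is easy, even when $P(x|\theta)$ itself cannot be written down. A toy sketch (the simulator and the uniform prior below are illustrative assumptions, not the ring model of the exercises):

```python
import torch

def simulate(theta, n_latent=16):
    """Toy stochastic simulator: x depends on theta and on unobserved latents z."""
    z = torch.randn(n_latent)                       # internal random "paths"
    return torch.tanh(theta * z).sum() + 0.1 * torch.randn(1)

theta = torch.rand(1)   # theta ~ P(theta) = U[0, 1]
x = simulate(theta)     # x ~ P(x|theta): one run of the simulator

# Drawing (x, theta) ~ P(x, theta) is just a prior draw plus one simulation,
# while evaluating P(x|theta) would require integrating over all latents z.
```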
---

# The likelihood-to-evidence ratio

In order to evaluate the probability of any outcome, we have to sum or integrate over all possible paths that could have led to this outcome:

$$ P(x|\theta) = \underbrace{\int dz}_\text{\color{red} intractable} \delta(x - z(t)) P(z|\theta) $$

--

count: false

Even if we could estimate the shape of $P(x|\theta)$ for paths that lead to the actually observed values of $x$, in order to get the normalization and the evidence $P(x)$ we would also have to know all paths $z$ that do __not__ lead to $x$.

--

count: false

It turns out that likelihood **ratios** are often easier to estimate, since normalizing factors drop out. One particular combination is the .bold[likelihood-to-evidence ratio]:

$$ r(x, \theta) \equiv \frac{P(x|\theta)}{P(x)} $$

.bold[Goal]: Train a neural network to approximate $r(x, \theta)$.

???

Calculating the likelihood is often .red[intractable], due to the large number of internal parameters

$$ P(x|\theta) = \underbrace{\int \int dz\, dy}_\text{\color{red} intractable} \, P(x|z, y, \theta) P(z|y) P(y) P(\theta) $$

---

# Neural likelihood-free inference

.bold[Starting point]: for any pair of observation $x$ and model parameter $\theta$, the goal is to estimate the probability that this pair belongs to one of the following classes:

.grid[
.kol-1-2[
$H_0$: Data $x$ comes from model $\theta$

$H_1$: Data $x$ and model $\theta$ are unrelated
]
.kol-1-2[
$(x, \theta) \sim P(x, \theta) = P(x|\theta)P(\theta)$

$(x, \theta) \sim P(x)P(\theta)$
]
]

.red[Note]: The likelihood ratio for these two hypotheses is our function of interest, $r(x, \theta)$.

---

# Loss function

.bold[Strategy:] We train a neural network $d_\phi(x, \theta)$ as a binary classifier to estimate the probability that a pair $(x, \theta)$ comes from $H_0$ ($y=0$) rather than $H_1$ ($y=1$). The corresponding loss function (see logistic regression example) is

$$ L\left[d(x, \theta)\right] = -\int dx\, d\theta \left[ p(x, \theta) \ln\left(d(x, \theta)\right) + p(x)p(\theta) \ln\left(1-d(x, \theta)\right) \right] $$

--

count: false

Minimizing that function yields

$$ d(x, \theta) = \frac{P(x, \theta)}{P(x, \theta) + P(x)P(\theta)} $$

--

count: false

It is easy to see that the likelihood-to-evidence ratio is then given by

$$ r(x,\theta) \equiv \frac{P(x|\theta)}{P(x)} = \frac{d(x, \theta)}{1-d(x, \theta)}. $$

---

name: ex4

# .red[Exercise 4: Neural posterior estimation]

Your task is to take the solution to [exercise 3](#ex3) and replace point estimation with posterior estimation. To this end you have to

- Adapt the convolutional neural network so that it takes an additional input, the radius $r$, and so that the output lies between zero and one. A good way to do that is to replace the last layer with
.center[$\texttt{(INPUT, r)} \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC} \to \texttt{SIGMOID}\to\texttt{OUTPUT}$]
Here, $\texttt{INPUT}$ is the output of the second-to-last $\texttt{FC}$, which is concatenated with the input variable $r$.
- Replace the loss function with the above binary cross-entropy loss function $L[d(x, \theta)]$.

Once this is done:

- Train the network and show that the posterior, $P(\theta|x) \propto r(x, \theta)\, P(\theta)$, usually peaks close to the true value.
- Explore how the posterior becomes broader when the signal model is made more complicated (e.g. by masking random regions).

---

class: end-slide, center
count: false

The end.

---

count: false

# References

- Francois Fleuret, Deep Learning Course, [4.4. Convolutions](https://fleuret.org/ee559/ee559-slides-4-4-convolutions.pdf), EPFL, 2018.
- Yannis Avrithis, Deep Learning for Vision, [Lecture 1: Introduction](https://sif-dlv.github.io/slides/intro.pdf), University of Rennes 1, 2018.
- Yannis Avrithis, Deep Learning for Vision, [Lecture 7: Convolution and network architectures](https://sif-dlv.github.io/slides/conv.pdf), University of Rennes 1, 2018.
- Olivier Grisel and Charles Ollion, Deep Learning, [Lecture 4: Convolutional Neural Networks for Image Classification](https://m2dsupsdlclass.github.io/lectures-labs/slides/04_conv_nets/index.html#1), Université Paris-Saclay, 2018.
- Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
- Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (pp. 161-168).
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533.