Christoph Weniger
Thursday, 3 March 2022
\(\newcommand{\indep}{\perp\!\!\!\perp}\)
Based on the slide decks of Gilles Louppe.
In this lecture I will discuss the role of various activation functions. We will construct dense deep neural networks, also known as multi-layer perceptrons.
Sign
Sigmoid
ReLU
The perceptron model (Rosenblatt, 1957)
\[f(\mathbf{x}) = \begin{cases} 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\ 0 &\text{otherwise} \end{cases}\]
was originally motivated by biology, with \(w_i\) being synaptic weights and \(x_i\) and \(f\) firing rates.
Credits: Frank Rosenblatt, Mark I Perceptron operators’ manual, 1960.
The Mark I Perceptron (Frank Rosenblatt).
The Perceptron
Let us define the (non-linear) activation function:
\[\text{sign}(x) = \begin{cases} 1 &\text{if } x \geq 0 \\ 0 &\text{otherwise} \end{cases}\]
The perceptron classification rule can be rewritten as \[f(\mathbf{x}) = \text{sign}(\sum_i w_i x_i + b).\]
The computation of \[f(\mathbf{x}) = \text{sign}(\sum_i w_i x_i + b)\] can be represented as a computational graph, with nodes for the inputs, the model parameters, and the intermediate operations.
In terms of tensor operations, \(f\) can be rewritten as \[f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b),\] for which the corresponding computational graph of \(f\) is:
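As a concrete illustration of this tensor form, here is a minimal NumPy sketch (the function name and the AND example are just illustrative, not part of the original model):

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron rule f(x) = sign(w^T x + b), with the convention sign(0) = 1."""
    return 1.0 if np.dot(w, x) + b >= 0 else 0.0

# Example: weights implementing a logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
print([perceptron(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0.0, 0.0, 0.0, 1.0]
```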
Extending binary logic with Bayesian probabilities motivates the sigmoid function, \[\sigma(x) = \frac{1}{1 + \exp(-x)},\] which looks like a soft version of the Heaviside step function.
Therefore, an overall model \(f(\mathbf{x};\mathbf{w},b) = \sigma(\mathbf{w}^T \mathbf{x} + b)\) is very similar to the perceptron.
This unit (or variants with other activation functions, see below) is the main primitive of all neural networks!
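In code, this unit is just the perceptron with the hard threshold replaced by a sigmoid (a minimal sketch):

```python
import numpy as np

def sigmoid(x):
    """Element-wise sigmoid, a smooth version of the step function."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_unit(x, w, b):
    """f(x; w, b) = sigma(w^T x + b), the basic building block of neural networks."""
    return sigmoid(np.dot(w, x) + b)
```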
(we discussed this already in the classification chapter)
Consider the model \[P(Y=1|\mathbf{x}) = \sigma\left(\mathbf{w}^T \mathbf{x} + b\right)\].
So far we considered the logistic unit \(h=\sigma\left(\mathbf{w}^T \mathbf{x} + b\right)\), where \(h \in \mathbb{R}\), \(\mathbf{x} \in \mathbb{R}^p\), \(\mathbf{w} \in \mathbb{R}^p\) and \(b \in \mathbb{R}\).
These units can be composed in parallel to form a layer with \(q\) outputs: \[\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})\] where \(\mathbf{h} \in \mathbb{R}^q\), \(\mathbf{x} \in \mathbb{R}^p\), \(\mathbf{W} \in \mathbb{R}^{p\times q}\), \(\mathbf{b} \in \mathbb{R}^q\) and where \(\sigma(\cdot)\) is upgraded to the element-wise sigmoid function.
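A minimal NumPy sketch of such a layer, following the shape convention \(\mathbf{W} \in \mathbb{R}^{p\times q}\) used above (all names and sizes are just illustrative):

```python
import numpy as np

def dense_layer(x, W, b):
    """One dense layer h = sigma(W^T x + b); W has shape (p, q), b has shape (q,)."""
    return 1.0 / (1.0 + np.exp(-(W.T @ x + b)))

rng = np.random.default_rng(0)
p, q = 4, 3
x = rng.normal(size=p)
W = rng.normal(size=(p, q))
b = np.zeros(q)
h = dense_layer(x, W, b)  # h has shape (q,): one sigmoid unit per output
```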
Similarly, layers can be composed in series, such that: \[\begin{aligned} \mathbf{h}_0 &= \mathbf{x} \\ \mathbf{h}_1 &= \sigma(\mathbf{W}_1^T \mathbf{h}_0 + \mathbf{b}_1) \\ ... \\ \mathbf{h}_L &= \sigma(\mathbf{W}_L^T \mathbf{h}_{L-1} + \mathbf{b}_L) \\ f(\mathbf{x}; \theta) = \hat{y} &= \mathbf{h}_L \end{aligned}\] where \(\theta\) denotes the model parameters \(\{ \mathbf{W}_k, \mathbf{b}_k \,|\, k=1, \ldots, L\}\).
This model is the multi-layer perceptron, also known as the fully connected feedforward network.
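Composing layers in series is just a loop over the layer operation above; here is a sketch of the forward pass of an MLP (function and argument names are my own, purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Forward pass h_k = sigma(W_k^T h_{k-1} + b_k), for k = 1, ..., L.

    weights[k] has shape (p_k, p_{k+1}); biases[k] has shape (p_{k+1},).
    """
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W.T @ h + b)
    return h  # = h_L = f(x; theta)
```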
Let us consider a simplified 2-layer MLP and the following loss function: \[\begin{aligned} f(\mathbf{x}; \mathbf{W}_1, \mathbf{W}_2) &= \sigma\left( \mathbf{W}_2^T \sigma\left( \mathbf{W}_1^T \mathbf{x} \right)\right) \\ \ell(y, \hat{y}; \mathbf{W}_1, \mathbf{W}_2) &= \text{cross\_ent}(y, \hat{y}) + \lambda \left( ||\mathbf{W}_1||_2 + ||\mathbf{W}_2||_2 \right) \end{aligned}\] for \(\mathbf{x} \in \mathbb{R}^p\), \(y \in \mathbb{R}\), \(\mathbf{W}_1 \in \mathbb{R}^{p \times q}\) and \(\mathbf{W}_2 \in \mathbb{R}^q\).
Here the term proportional to \(\lambda\) is some regularization that is sometimes introduced to avoid overfitting of the network weights (similar to what we have seen in linear regression earlier).
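Putting the pieces together, here is a sketch of this loss in NumPy. I interpret \(||\cdot||_2\) as the Euclidean/Frobenius norm and cross_ent as the binary cross-entropy; the value of \(\lambda\) is an arbitrary placeholder:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regularized_loss(x, y, W1, W2, lam=1e-3):
    """Cross-entropy loss of the simplified 2-layer MLP plus weight regularization."""
    h = sigmoid(W1.T @ x)      # hidden activations, shape (q,)
    y_hat = sigmoid(W2 @ h)    # scalar output, since W2 is a vector in R^q
    cross_ent = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    reg = lam * (np.linalg.norm(W1) + np.linalg.norm(W2))
    return cross_ent + reg
```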
In the forward pass, intermediate values are all computed from inputs to outputs, which results in the annotated computational graph below:
The total derivative can be computed through a backward pass, by walking through all paths from outputs to parameters in the computational graph and accumulating the terms. For example, for \(\frac{\text{d} \ell}{\text{d} \mathbf{W}_1}\) we have: \[\begin{aligned} \frac{\text{d} \ell}{\text{d} \mathbf{W}_1} &= \frac{\partial \ell}{\partial u_8}\frac{\text{d} u_8}{\text{d} \mathbf{W}_1} + \frac{\partial \ell}{\partial u_4}\frac{\text{d} u_4}{\text{d} \mathbf{W}_1} \\ \frac{\text{d} u_8}{\text{d} \mathbf{W}_1} &= ... \end{aligned}\]
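In practice one never writes out these terms by hand: automatic differentiation frameworks build the computational graph during the forward pass and accumulate the backward pass for us. A sketch using PyTorch (the choice of PyTorch and all shapes here are my own, not part of the lecture's derivation):

```python
import torch

p, q = 4, 3
x = torch.randn(p)
y = torch.tensor(1.0)
W1 = torch.randn(p, q, requires_grad=True)
W2 = torch.randn(q, requires_grad=True)
lam = 1e-3

# Forward pass of the simplified 2-layer MLP and its regularized loss
y_hat = torch.sigmoid(W2 @ torch.sigmoid(W1.T @ x))
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)) \
       + lam * (W1.norm() + W2.norm())

# Backward pass: walk the graph from the loss back to the parameters
loss.backward()
print(W1.grad.shape)  # torch.Size([4, 3]) -- the total derivative d loss / d W1
```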
Let us zoom in on the computation of the network output \(\hat{y}\) and of its derivative with respect to \(\mathbf{W}_1\).
Training deep MLPs with many layers was, for a long time (pre-2011), very difficult due to the vanishing gradient problem.
Backpropagated gradients normalized histograms (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero.
Instead of the sigmoid activation function, modern neural networks are mostly based on rectified linear units (ReLU) (Glorot et al., 2011):
\[\text{ReLU}(x) = \max(0, x)\]
Note that the derivative of the ReLU function is
\[\frac{\text{d}}{\text{d}x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x < 0 \\ 1 &\text{if } x > 0 \end{cases}\]
For \(x=0\), the derivative is undefined. In practice, it is set to zero.
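In code (a minimal sketch, using the \(x = 0\) convention just mentioned):

```python
import numpy as np

def relu(x):
    """Rectified linear unit."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the undefined point x = 0 is set to zero by convention."""
    return (x > 0).astype(float)
```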
Therefore, for a chain of scalar units \(\hat{y} = \sigma(w_3\,\sigma(w_2\,\sigma(w_1 x)))\) with \(\sigma = \text{ReLU}\), and assuming the pre-activations \(u_1, u_3, u_5\) are positive, the factors in the chain rule no longer shrink the gradient:
\[\frac{\text{d}\hat{y}}{\text{d}w_1} = \underbrace{\frac{\partial \sigma(u_5)}{\partial u_5}}_{= 1} w_3 \underbrace{\frac{\partial \sigma(u_3)}{\partial u_3}}_{= 1} w_2 \underbrace{\frac{\partial \sigma(u_1)}{\partial u_1}}_{= 1} x\]
This solves the vanishing gradient problem, even for deep networks! (provided proper initialization)
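To see this numerically, here is a small illustration (entirely my own toy setup: a chain of scalar units \(h_k = \sigma(w\,h_{k-1})\) with unit weights) comparing how the product of local derivatives behaves for sigmoid and ReLU activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def chain_gradient(f, f_grad, depth, h0=1.0, w=1.0):
    """Product of local derivatives d h_L / d h_0 along the chain h_k = f(w * h_{k-1})."""
    h, grad = h0, 1.0
    for _ in range(depth):
        pre = w * h
        grad *= w * f_grad(pre)
        h = f(pre)
    return grad

sig_grad = lambda u: sigmoid(u) * (1.0 - sigmoid(u))  # at most 1/4
relu = lambda u: max(0.0, u)
relu_grad = lambda u: float(u > 0)

print(chain_gradient(sigmoid, sig_grad, depth=20))   # essentially zero: the gradient vanishes
print(chain_gradient(relu, relu_grad, depth=20))     # 1.0: the gradient survives
```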
Note that:
shorturl.at/yBW16