Christoph Weniger
Thursday, 3 March 2022
During this chapter we will lay the groundwork for studying deep neural networks lated in the course. What we will discuss here is in fact the mathematics of a single artifical neuron.
(biological neuron)
Consider a dataset that is described by two real values and has two possible class labels (red and blue).
Image Credit: PRML: C. Bishop, Fig. 4.4
Visually, we can see that the dataset is separable, and more specifically linearly separable because we can draw a boundary, and more specifically a line, between the two classes without having misclassified data points.
Image Credit: PRML: C. Bishop, Fig. 4.4
Most datasets, however, are not so simple. How could we model this data such that given a new point, we could predict its label (red, green, or blue)?
Image Credit: PRML: C. Bishop, Fig. 1.19
One very simple approach would be to divide the input space into regularly sized cells and assign any new data point the label that a majority of the training points in the cell have.
Image Credit: PRML: C. Bishop, Fig. 1.20
A probabilistic classification method
The linear case has a nice interpretation in terms of a single neuron.
\[ y(\mathbf{x}) = f(\mathbf{w}^T \mathbf{x} + w_0) \]
(the analogy would be even better if we were to use a step function as activation function)
For classification, the most important activation function is the logistic sigmoid function (from now on simply sigmoid), which maps the whole real number range to the finite interval (0,1).
\[\sigma(x) = \frac{1}{1+\exp(-x)}\]
Sigmoid also satisfies the useful symmetry property: \(\sigma(-x) = 1 - \sigma(x)\).
Like in the case of linear regression, we can drastically increase the expressivity of the model by using basis functions \(\phi_i(\mathbf{x})\). In that case the model becomes
\[ y(\mathbf{x}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0) \]
For a dataset \({\boldsymbol{\phi}_n, t_n}\) where \(t_n = 0, 1\) and \(\boldsymbol{\phi}_n = \boldsymbol{\phi} (\mathbf{x}_n)\) with \(n = 1, \ldots , N\), we write the likelihood as \[ p(\mathtt{t} | \mathbf{w}) = \prod^N_{n=1} y_n^{t_n} \{ 1 - y_n \}^{1-t_n} \] where \(\mathtt{t} = (t_1, \ldots , t_N)^T\) and \(y_n = p(C_1 | \boldsymbol{\phi}_n) = \sigma (\mathbf{w}^T \boldsymbol{\phi}_n)\)
As per usual, we can define an error function by taking the negative log-likelihood, which gives us the cross-entropy error function in the form
\[ E(\mathbf{w}) = -\ln p(\mathtt{t}|\mathbf{w}) = -\sum^N_{n=1} \{ t_n \ln y_n + (1- t_n)\ln (1-y_n) \} \]
Although we have defined an error function, there is no closed solution for the optimal parameters \(\mathbf{w}\). We will therefore now be exploring gradient descent algorithms for optimization. For this, we will need the gradient of the error function with respect to the parameters \(\mathbf{w}\) :
\[ \nabla_{\mathbf{w}} E(\mathbf{w}) = \sum^N_{n=1} (y_n - t_n)\boldsymbol{\phi}_n \]
To minimize the error function \(E(\theta)\) defined over model parameters \(\theta\) (e.g. \(\mathbf{w}\) and \(b\)), the simplest gradient method of gradient descent uses local linear information to iteratively move towards a local minimum. The algorithm must be initialized with an initial value of parameters \(\theta_0\). A first order approximation to \(E(\theta)\) around \(\theta_0\) can be given as
\[ \hat{E}(\epsilon ; \theta_0) = E(\theta_0) + \epsilon^T \nabla E(\theta_0) + \frac{1}{2\gamma}||\epsilon||^2 \]
Image and slide credit: Gilles Louppe
We can minimize the approximation \(\hat{E}({\theta }_0)\) in the usual way:
\[ \begin{aligned} \nabla_\epsilon \hat{E}(\epsilon ; \theta_0) &= 0 \\ &= \nabla_\theta E(\theta_0) + \frac{1}{\gamma}\epsilon \end{aligned} \] which results in the best improvement when \(\epsilon = -\gamma \nabla_\theta E(\theta_0)\).
Therefore model parameters can be updated iteratively using the update rule \[ \theta_{t+1} = \theta_{t} -\gamma \nabla_\theta E(\theta_t) \]
where - \(\theta_0\) are the initial conditions of the parameters of the model - \(\gamma\) is known as the learning rate - choosing both correctly is critical for the convergence of the update rule
For logistic regression, as well as for most error functions, the error function \(E\), as well as its derivative is defined as a sum over all data points
\[ \begin{aligned} E(\mathbf{w}) &= -\ln p(\mathtt{t}|\mathbf{w}) = -\sum^N_{n=1} \{ t_n \ln y_n + (1- t_n)\ln (1-y_n) \} \\ \nabla E(\mathbf{w}) &= \sum^N_{n=1} (y_n - t_n)\boldsymbol{\phi}_n \end{aligned} \]
To find the full gradient, we need to therefore look at every data point. Because of this, regular gradient descent is often referred to as batch gradient descent. However, the complexity of an update step grows linearly (!) with the size of the dataset. This is not good.
Since we are optimizing already an approximation, it should not be necessary to carry out the minimization to great accuracy. Therefore we often use stochastic gradient descent for an update rule:
\[ \theta_{t+1} = \theta_{t} -\gamma \nabla_\theta e(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)},\theta_t)) \]
Batch gradient descent
Stochastic gradient descent
Why is stochastic gradient descent still a good idea?
End of Gilles Louppe’s material
shorturl.at/frCLX