Christoph Weniger
Monday, 10 May 2021
Hubel and Wiesel
Credits: Hubel and Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, 1962.
Fukushima proposes a direct neural network implementation of Hubel and Wiesel's hierarchy model of the visual nervous system.
Credits: Kunihiko Fukushima, Neocognitron: A Self-organizing Neural Network Model, 1980.
Convolutions
Feature hierarchy
Credits: Kunihiko Fukushima, Neocognitron: A Self-organizing Neural Network Model, 1980.
Credits: Rumelhart et al, Learning representations by back-propagating errors, 1986.
Credits: LeCun et al, Handwritten Digit Recognition with a Back-Propagation Network, 1990.
LeNet-1 (LeCun et al, 1993)
For one-dimensional tensors, given an input vector \(\mathbf{x} \in \mathbb{R}^W\) and a convolutional kernel \(\mathbf{u} \in \mathbb{R}^w\), the discrete convolution \(\mathbf{x} \circledast \mathbf{u}\) is a vector of size \(W - w + 1\) such that \[\begin{aligned} (\mathbf{x} \circledast \mathbf{u})[i] &= \sum_{m=0}^{w-1} x_{m+i} u_m . \end{aligned} \]
Note: Technically, \(\circledast\) denotes the cross-correlation operator. However, most machine learning libraries call it convolution.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
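A minimal NumPy sketch of the definition above (the function name is illustrative):

```python
import numpy as np

# Direct implementation of the formula above:
# (x ⊛ u)[i] = sum_m x[m+i] * u[m], producing W - w + 1 outputs.
def conv1d(x, u):
    W, w = len(x), len(u)
    return np.array([np.dot(x[i:i + w], u) for i in range(W - w + 1)])
```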
Convolutions can implement differential operators: \[(0,0,0,0,1,2,3,4,4,4,4) \circledast (-1,1) = (0,0,0,1,1,1,1,0,0,0) \]
or crude template matchers:
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
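Using the `conv1d` sketch above, the finite-difference example checks out numerically:

```python
import numpy as np

x = np.array([0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4])
print(conv1d(x, np.array([-1, 1])))  # [0 0 0 1 1 1 1 0 0 0]
```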
Convolutions generalize to multi-dimensional tensors:
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
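A shape-only PyTorch sketch of a multi-channel 2D convolution (sizes are illustrative):

```python
import torch
import torch.nn as nn

# 3 input channels, 16 output feature maps, 5x5 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
print(conv(x).shape)           # torch.Size([1, 16, 28, 28])
```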
Convolutions have three additional parameters: the padding, the stride, and the dilation.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Padding is useful to control the spatial dimension of the feature map, for example to keep it constant across layers.
Credits: Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.
Stride is useful to reduce the spatial dimension of the feature map by a constant factor.
Credits: Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.
The dilation modulates the expansion of the kernel support by adding rows and columns of zeros between coefficients.
Having a dilation coefficient greater than one increases the unit's receptive field size without increasing the number of parameters.
Credits: Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.
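A short PyTorch sketch of how each parameter affects the output shape (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
# Padding keeps the spatial dimension constant: 32 -> 32.
print(nn.Conv2d(1, 1, 3, padding=1)(x).shape)            # [1, 1, 32, 32]
# Stride reduces it by a constant factor: 32 -> 16.
print(nn.Conv2d(1, 1, 3, stride=2, padding=1)(x).shape)  # [1, 1, 16, 16]
# Dilation enlarges the receptive field (effective 5x5 kernel): 32 -> 28.
print(nn.Conv2d(1, 1, 3, dilation=2)(x).shape)           # [1, 1, 28, 28]
```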
A function \(f\) is equivariant to \(g\) if \(f(g(\mathbf{x})) = g(f(\mathbf{x}))\).
Convolutions are equivariant to translation: if an object moves in the input image, its representation will move by the same amount in the output.
Credits: LeCun et al, Gradient-based learning applied to document recognition, 1998.
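Translation equivariance can be checked numerically; a sketch using circular padding, so that the equality holds exactly for circular shifts:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(1, 1, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)
x = torch.randn(1, 1, 16)
shift = lambda t: torch.roll(t, 3, dims=-1)  # translate by 3 positions
# f(g(x)) == g(f(x)): convolving the shifted input equals shifting the output.
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))  # True
```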
As a guiding example, let us consider the convolution of single-channel tensors \(\mathbf{x} \in \mathbb{R}^{4 \times 4}\) and \(\mathbf{u} \in \mathbb{R}^{3 \times 3}\):
\[ \mathbf{x} \circledast \mathbf{u} = \begin{pmatrix} 4 & 5 & 8 & 7 \\ 1 & 8 & 8 & 8 \\ 3 & 6 & 6 & 4 \\ 6 & 5 & 7 & 8 \end{pmatrix} \circledast \begin{pmatrix} 1 & 4 & 1 \\ 1 & 4 & 3 \\ 3 & 3 & 1 \end{pmatrix} = \begin{pmatrix} 122 & 148 \\ 126 & 134 \end{pmatrix}\]
The convolution operation can be equivalently re-expressed as a single matrix multiplication: the kernel \(\mathbf{u}\) is rearranged into a sparse Toeplitz matrix \(\mathbf{U} \in \mathbb{R}^{4 \times 16}\), called the convolution matrix, and the input \(\mathbf{x}\) is flattened row by row into a vector \(v(\mathbf{x}) \in \mathbb{R}^{16}\).
Then, \[\mathbf{U}v(\mathbf{x}) = \begin{pmatrix} 122 & 148 & 126 & 134 \end{pmatrix}^T\] which we can reshape to a \(2 \times 2\) matrix to obtain \(\mathbf{x} \circledast \mathbf{u}\).
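A NumPy sketch of this construction, building \(\mathbf{U}\) row by row by placing the kernel at each output position:

```python
import numpy as np

x = np.array([[4, 5, 8, 7], [1, 8, 8, 8], [3, 6, 6, 4], [6, 5, 7, 8]])
u = np.array([[1, 4, 1], [1, 4, 3], [3, 3, 1]])
U = np.zeros((4, 16))
for k, (j, i) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    placed = np.zeros((4, 4))
    placed[j:j + 3, i:i + 3] = u  # kernel placed at output position (j, i)
    U[k] = placed.ravel()
print(U @ x.ravel())  # [122. 148. 126. 134.]
```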
A convolutional layer is a special case of a fully connected layer.
Convolution view
Fully connected view
When the input volume is large, pooling layers can be used to reduce the input dimension while preserving its global structure, in a way similar to a down-scaling operation.
Consider a pooling area of size \(h \times w\) and a 3D input tensor \(\mathbf{x} \in \mathbb{R}^{C\times(rh)\times(sw)}\). Max-pooling produces an output tensor \(\mathbf{o} \in \mathbb{R}^{C \times r \times s}\) such that \[\mathbf{o}_{c,j,i} = \max_{n < h,\, m < w} x_{c,hj+n,wi+m}.\]
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
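For example, max-pooling over \(2 \times 2\) areas halves each spatial dimension while preserving the number of channels:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 24, 24)               # (batch, C, rh, sw) with h = w = 2
print(F.max_pool2d(x, kernel_size=2).shape)  # torch.Size([1, 16, 12, 12])
```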
A convolutional network is generically defined as a composition of convolutional layers (\(\texttt{CONV}\)), pooling layers (\(\texttt{POOL}\)), linear rectifiers (\(\texttt{RELU}\)) and fully connected layers (\(\texttt{FC}\)).
The most common convolutional network architecture follows the pattern:
\[\texttt{INPUT} \to [[\texttt{CONV} \to \texttt{RELU}]\texttt{*}N \to \texttt{POOL?}]\texttt{*}M \to [\texttt{FC} \to \texttt{RELU}]\texttt{*}K \to \texttt{FC}\]
where:
- \(\texttt{*}\) indicates repetition;
- \(\texttt{POOL?}\) indicates an optional pooling layer;
- \(N \geq 0\) (and usually \(N \leq 3\)), \(M \geq 0\), and \(K \geq 0\) (and usually \(K < 3\)).
Some common architectures for convolutional networks following this pattern include:
- \(\texttt{INPUT} \to \texttt{FC}\), which implements a linear classifier (\(N = M = K = 0\));
- \(\texttt{INPUT} \to \texttt{CONV} \to \texttt{RELU} \to \texttt{FC}\);
- \(\texttt{INPUT} \to [\texttt{CONV} \to \texttt{RELU} \to \texttt{POOL}]\texttt{*}2 \to \texttt{FC} \to \texttt{RELU} \to \texttt{FC}\), with a single \(\texttt{CONV}\) layer between every \(\texttt{POOL}\) layer;
- \(\texttt{INPUT} \to [\texttt{CONV} \to \texttt{RELU} \to \texttt{CONV} \to \texttt{RELU} \to \texttt{POOL}]\texttt{*}3 \to [\texttt{FC} \to \texttt{RELU}]\texttt{*}2 \to \texttt{FC}\).
Note that for the last architecture, two \(\texttt{CONV}\) layers are stacked before every \(\texttt{POOL}\) layer. This is generally a good idea for larger and deeper networks, because multiple stacked \(\texttt{CONV}\) layers can develop more complex features of the input volume before the destructive pooling operation.
Composition of two \(\texttt{CONV}+\texttt{POOL}\) layers, followed by a block of fully-connected layers.
Credits: Dive Into Deep Learning, 2020.
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 6, 28, 28] 156
ReLU-2 [-1, 6, 28, 28] 0
MaxPool2d-3 [-1, 6, 14, 14] 0
Conv2d-4 [-1, 16, 10, 10] 2,416
ReLU-5 [-1, 16, 10, 10] 0
MaxPool2d-6 [-1, 16, 5, 5] 0
Conv2d-7 [-1, 120, 1, 1] 48,120
ReLU-8 [-1, 120, 1, 1] 0
Linear-9 [-1, 84] 10,164
ReLU-10 [-1, 84] 0
Linear-11 [-1, 10] 850
LogSoftmax-12 [-1, 10] 0
================================================================
Total params: 61,706
Trainable params: 61,706
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.11
Params size (MB): 0.24
Estimated Total Size (MB): 0.35
----------------------------------------------------------------
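A minimal PyTorch sketch consistent with the summary above, assuming a \(1 \times 28 \times 28\) input:

```python
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),  # -> 6 x 28 x 28
    nn.MaxPool2d(2),                                       # -> 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),            # -> 16 x 10 x 10
    nn.MaxPool2d(2),                                       # -> 16 x 5 x 5
    nn.Conv2d(16, 120, kernel_size=5), nn.ReLU(),          # -> 120 x 1 x 1
    nn.Flatten(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
    nn.LogSoftmax(dim=1),
)
```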
Composition of a 5-layer convolutional neural network with a 3-layer MLP, for 8 learned layers in total.
The original implementation was split into two parts so that the model could fit across two GPUs.
LeNet vs. AlexNet
Credits: Dive Into Deep Learning, 2020.
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 55, 55] 23,296
ReLU-2 [-1, 64, 55, 55] 0
MaxPool2d-3 [-1, 64, 27, 27] 0
Conv2d-4 [-1, 192, 27, 27] 307,392
ReLU-5 [-1, 192, 27, 27] 0
MaxPool2d-6 [-1, 192, 13, 13] 0
Conv2d-7 [-1, 384, 13, 13] 663,936
ReLU-8 [-1, 384, 13, 13] 0
Conv2d-9 [-1, 256, 13, 13] 884,992
ReLU-10 [-1, 256, 13, 13] 0
Conv2d-11 [-1, 256, 13, 13] 590,080
ReLU-12 [-1, 256, 13, 13] 0
MaxPool2d-13 [-1, 256, 6, 6] 0
Dropout-14 [-1, 9216] 0
Linear-15 [-1, 4096] 37,752,832
ReLU-16 [-1, 4096] 0
Dropout-17 [-1, 4096] 0
Linear-18 [-1, 4096] 16,781,312
ReLU-19 [-1, 4096] 0
Linear-20 [-1, 1000] 4,097,000
================================================================
Total params: 61,100,840
Trainable params: 61,100,840
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 8.31
Params size (MB): 233.08
Estimated Total Size (MB): 241.96
----------------------------------------------------------------
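This summary matches torchvision's AlexNet; assuming the torchsummary package is available, it can be reproduced with:

```python
from torchsummary import summary
from torchvision.models import alexnet

summary(alexnet(), (3, 224, 224), device="cpu")  # prints the table above
```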
Composition of 5 VGG blocks consisting of \(\texttt{CONV}+\texttt{POOL}\) layers, followed by a block of fully connected layers. The network depth increased up to 19 layers, while the kernel sizes were reduced to \(3 \times 3\).
AlexNet vs. VGG
Credits: Dive Into Deep Learning, 2020.
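A sketch of a single VGG block (the helper name is illustrative): a stack of \(3 \times 3\) convolutions with padding 1, followed by a halving max-pooling.

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(2))  # halve the spatial dimensions
    return nn.Sequential(*layers)
```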
The effective receptive field is the part of the visual input that affects a given unit indirectly through previous convolutional layers. It grows linearly with depth.
E.g., a stack of two \(3 \times 3\) kernels of stride \(1\) has the same effective receptive field as a single \(5 \times 5\) kernel, but fewer parameters.
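This can be checked directly in the single-channel case:

```python
import torch
import torch.nn as nn

stacked = nn.Sequential(nn.Conv2d(1, 1, 3), nn.Conv2d(1, 1, 3))  # 5x5 receptive field
single = nn.Conv2d(1, 1, 5)
x = torch.randn(1, 1, 7, 7)
print(stacked(x).shape, single(x).shape)             # both torch.Size([1, 1, 3, 3])
print(sum(p.numel() for p in stacked.parameters()),  # 20 parameters
      sum(p.numel() for p in single.parameters()))   # 26 parameters
```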
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 224, 224] 1,792
ReLU-2 [-1, 64, 224, 224] 0
Conv2d-3 [-1, 64, 224, 224] 36,928
ReLU-4 [-1, 64, 224, 224] 0
MaxPool2d-5 [-1, 64, 112, 112] 0
Conv2d-6 [-1, 128, 112, 112] 73,856
ReLU-7 [-1, 128, 112, 112] 0
Conv2d-8 [-1, 128, 112, 112] 147,584
ReLU-9 [-1, 128, 112, 112] 0
MaxPool2d-10 [-1, 128, 56, 56] 0
Conv2d-11 [-1, 256, 56, 56] 295,168
ReLU-12 [-1, 256, 56, 56] 0
Conv2d-13 [-1, 256, 56, 56] 590,080
ReLU-14 [-1, 256, 56, 56] 0
Conv2d-15 [-1, 256, 56, 56] 590,080
ReLU-16 [-1, 256, 56, 56] 0
MaxPool2d-17 [-1, 256, 28, 28] 0
Conv2d-18 [-1, 512, 28, 28] 1,180,160
ReLU-19 [-1, 512, 28, 28] 0
Conv2d-20 [-1, 512, 28, 28] 2,359,808
ReLU-21 [-1, 512, 28, 28] 0
Conv2d-22 [-1, 512, 28, 28] 2,359,808
ReLU-23 [-1, 512, 28, 28] 0
MaxPool2d-24 [-1, 512, 14, 14] 0
Conv2d-25 [-1, 512, 14, 14] 2,359,808
ReLU-26 [-1, 512, 14, 14] 0
Conv2d-27 [-1, 512, 14, 14] 2,359,808
ReLU-28 [-1, 512, 14, 14] 0
Conv2d-29 [-1, 512, 14, 14] 2,359,808
ReLU-30 [-1, 512, 14, 14] 0
MaxPool2d-31 [-1, 512, 7, 7] 0
Linear-32 [-1, 4096] 102,764,544
ReLU-33 [-1, 4096] 0
Dropout-34 [-1, 4096] 0
Linear-35 [-1, 4096] 16,781,312
ReLU-36 [-1, 4096] 0
Dropout-37 [-1, 4096] 0
Linear-38 [-1, 1000] 4,097,000
================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 218.59
Params size (MB): 527.79
Estimated Total Size (MB): 746.96
----------------------------------------------------------------
Composition of first layers similar to GoogLeNet's, a stack of 4 residual blocks, and a global average pooling layer. Extensions stack more residual blocks, up to a total of 152 layers (ResNet-152).
Regular ResNet block vs. ResNet block with \(1\times 1\) convolution.
Credits: Dive Into Deep Learning, 2020.
Training networks of this depth is made possible because of the skip connections in the residual blocks. They allow the gradients to shortcut the layers and pass through without vanishing.
Credits: Dive Into Deep Learning, 2020.
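A sketch of a residual block along these lines, with a \(1 \times 1\) convolution on the shortcut when shapes change (details vary across ResNet variants):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 convolution to match shapes on the skip path when needed.
        self.shortcut = (nn.Conv2d(in_channels, out_channels, 1, stride=stride)
                         if stride != 1 or in_channels != out_channels
                         else nn.Identity())

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return F.relu(h + self.shortcut(x))  # gradients flow through the skip
```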
Understanding what is happening in deep neural networks after training is complex and the tools we have are limited.
In the case of convolutional neural networks, we can look at the learned filters themselves, as well as at synthetic inputs that maximize the activation of chosen units.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
LeNet’s first convolutional layer, all filters.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
LeNet’s second convolutional layer, first 32 filters.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
AlexNet’s first convolutional layer, first 20 filters.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Convolutional networks can be inspected by looking for synthetic input images \(\mathbf{x}\) that maximize the activation \(\mathbf{h}_{\ell,d}(\mathbf{x})\) of a chosen convolutional kernel \(\mathbf{u}\) at layer \(\ell\) and index \(d\) in the layer filter bank.
These samples can be found by gradient ascent on the input space: \[\begin{aligned} \mathcal{L}_{\ell,d}(\mathbf{x}) &= ||\mathbf{h}_{\ell,d}(\mathbf{x})||_2\\ \mathbf{x}_0 &\sim U[0,1]^{C \times H \times W } \\ \mathbf{x}_{t+1} &= \mathbf{x}_t + \gamma \nabla_{\mathbf{x}} \mathcal{L}_{\ell,d}(\mathbf{x}_t) \end{aligned}\]
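A PyTorch sketch of this procedure; the truncated network `features` and the filter index `d` are hypothetical placeholders:

```python
import torch

gamma, d = 1.0, 0  # step size and a chosen filter index (illustrative)
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # x_0 ~ U[0,1]
for _ in range(20):
    h = features(x)        # hypothetical truncated model returning h_l(x)
    loss = h[0, d].norm()  # ||h_{l,d}(x)||_2
    loss.backward()
    with torch.no_grad():
        x += gamma * x.grad  # gradient ascent step on the input
        x.grad.zero_()
```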
VGG-16, convolutional layer 1-1, a few of the 64 filters
Credits: Francois Chollet, How convolutional neural networks see the world, 2016.
VGG-16, convolutional layer 2-1, a few of the 128 filters
Credits: Francois Chollet, How convolutional neural networks see the world, 2016.
VGG-16, convolutional layer 3-1, a few of the 256 filters
Credits: Francois Chollet, How convolutional neural networks see the world, 2016.
VGG-16, convolutional layer 4-1, a few of the 512 filters
Credits: Francois Chollet, How convolutional neural networks see the world, 2016.
VGG-16, convolutional layer 5-1, a few of the 512 filters
Credits: Francois Chollet, How convolutional neural networks see the world, 2016.
Some observations: the first-layer filters respond to simple structure such as oriented edges and colors, while deeper filters respond to increasingly complex motifs. The network appears to learn a hierarchical composition of patterns.
What if we build images that maximize the activation of a chosen class output?
The left image is predicted with 99.9% confidence as a magpie!
Credits: Francois Chollet, How convolutional neural networks see the world, 2016.
Deep Dream. Start from an image \(\mathbf{x}_t\), offset it by a random jitter, enhance some layer activation at multiple scales, zoom in, and repeat on the produced image \(\mathbf{x}_{t+1}\).
“Deep hierarchical neural networks are beginning to transform neuroscientists’ ability to produce quantitatively accurate computational models of the sensory systems, especially in higher cortical areas where neural response properties had previously been enigmatic.”
Credits: Yamins et al, Using goal-driven deep learning models to understand sensory cortex, 2016.