LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998
MNIST (Modified National Institute of Standards and Technology) - handwritten digits, 0-9. 70,000 28×28-pixel grayscale images (60,000 training, 10,000 test)
CIFAR-10 (Canadian Institute For Advanced Research) - 60,000 32×32-pixel color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
ImageNet - over 14 million images in more than 20,000 categories (animals, plants, sports, etc.)
Finally, an operation called "pooling" is usually applied to smooth out the result. It consists of merging the kernel outputs at successive positions, for example by taking the maximum (max-pooling) of the values at those positions
One of the main advantages of convolutional networks is their ability to reduce the number of parameters to be estimated. These networks also have sparse interactions and are equivariant to translations
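To make the parameter savings concrete, here is a back-of-the-envelope comparison (an illustrative sketch, not from the slides) between a fully connected layer and a single 3×3 convolution filter applied to a 28×28 single-channel image:

```python
# Dense layer mapping a flattened 28x28 image to a 28x28 output:
# every output unit connects to every input pixel.
dense_params = (28 * 28) * (28 * 28)  # weights only: 614,656

# A single 3x3 convolution filter reuses the same 9 weights (+1 bias)
# at every spatial position (parameter sharing).
conv_params = 3 * 3 + 1  # 10

print(dense_params, conv_params)  # 614656 10
```

The convolution achieves this reduction through parameter sharing: the same small filter slides over every location of the image.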
https://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html
Two-dimensional cross-correlation operation. The shaded portions are the first output element and the input and kernel array elements used in its computation: 0×0+1×1+3×2+4×3=19
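The arithmetic in that figure can be sketched directly in NumPy (illustrative code, not from the slides; the input and kernel values follow the d2l.ai example):

```python
import numpy as np

def corr2d(X, K):
    """Valid 2D cross-correlation of input X with kernel K."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product of the current patch and the kernel, summed
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = np.arange(9).reshape(3, 3)  # values 0..8
K = np.arange(4).reshape(2, 2)  # values 0..3
print(corr2d(X, K))
# First output element: 0*0 + 1*1 + 3*2 + 4*3 = 19
```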
https://d2l.ai/chapter_convolutional-neural-networks/channels.html
Convolutions are defined by two key parameters:
Size of the patches extracted from the inputs - typically 3×3 (most frequently used) or 5×5
Depth of the output feature map - The number of filters computed by the convolution
A convolution works by sliding these filters over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features
Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel
It’s conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (the convolution kernel), the patches are transformed via a hardcoded max tensor operation
A big difference from convolution is that max-pooling is usually done with 2×2 windows and stride 2, to downsample the feature maps by a factor of 2
The reason to use downsampling via max-pooling is to reduce the number of feature-map coefficients to process and to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows
Max is used because it’s more informative to look at the maximal presence of different features
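The 2×2, stride-2 max-pooling described above can be sketched in a few lines of NumPy (illustrative code, not from the slides):

```python
import numpy as np

def max_pool_2x2(X):
    """2x2 max pooling with stride 2: downsample each spatial dim by 2."""
    h, w = X.shape[0] // 2, X.shape[1] // 2
    # Group pixels into 2x2 blocks, then take the max within each block
    return X[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

X = np.arange(16).reshape(4, 4)
print(max_pool_2x2(X))
# [[ 5  7]
#  [13 15]]
```

Each output value is the maximum of one 2×2 window, so a 4×4 map becomes 2×2, a factor-of-2 downsampling in each dimension.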
https://deeplizard.com/learn/video/ZjM_XQa5s6s, https://d2l.ai/chapter_convolutional-neural-networks/pooling.html
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015
A friendly introduction to Convolutional Neural Networks and Image Recognition, 32m https://youtu.be/2-Ol7ZB0MmU
To get an output feature map with the same spatial dimensions as the input, you can use padding
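The standard output-size formula makes the effect of padding explicit (a sketch, not from the slides; n is the input size, k the kernel size, p the padding on each side, s the stride):

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

# 'Same' padding for a 3x3 kernel keeps a 32x32 input at 32x32:
print(conv_output_size(32, 3, p=1))  # 32
# Without padding, the output shrinks:
print(conv_output_size(32, 3, p=0))  # 30
```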
Review Figures 5.5 and 5.6 from the book
Review Figure 5.7 from the book
Implemented as layer_separable_conv_2d()
Performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution (a 1×1 convolution)
It requires significantly fewer parameters and involves fewer computations, thus resulting in smaller, speedier models
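The parameter savings can be checked with simple arithmetic (an illustrative sketch ignoring bias terms, not from the slides):

```python
def regular_conv_params(k, c_in, c_out):
    # Each of the c_out filters spans all c_in input channels.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel,
    # then a pointwise 1x1 convolution to mix channels.
    return k * k * c_in + c_in * c_out

# Example: 3x3 kernel, 64 input channels, 128 output channels
print(regular_conv_params(3, 64, 128))    # 73728
print(separable_conv_params(3, 64, 128))  # 8768
```

In this example the separable version needs roughly 8× fewer weights, which is where the smaller, faster models come from.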
https://towardsdatascience.com/review-xception-with-depthwise-separable-convolution-better-than-inception-v3-image-dc967dd42568
Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit
Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images
The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better
Common augmentation settings (arguments to Keras's image_data_generator()):
rotation_range is a value in degrees (0–180), a range within which to randomly rotate pictures
width_shift_range and height_shift_range are fractions of total width or height within which to randomly translate pictures horizontally or vertically
shear_range is for randomly applying shearing transformations
zoom_range is for randomly zooming inside pictures
horizontal_flip is for randomly flipping half the images horizontally
fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift
Visualizing intermediate convnet outputs (intermediate activations) - Useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters
Visualizing convnet filters - Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to
Visualizing heatmaps of class activation in an image - Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in images
http://www.image-net.org/index
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
VGG networks, developed by the Visual Geometry Group (University of Oxford)
VGG-11 - 8 convolutional layers (wrapped into 5 convolutional blocks) and 3 fully-connected layers
VGG-16, VGG-19 - deeper architectures
https://d2l.ai/chapter_convolutional-modern/googlenet.html
https://datascience.stackexchange.com/questions/14984/what-is-an-inception-layer
A pre-trained network is a saved network previously trained on a large dataset, typically on a large-scale image-classification task
If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pre-trained network can effectively act as a generic model of the visual world
Its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.
CNNs comprise two parts: they start with a series of convolution and pooling layers, and they end with a densely connected classifier
The first part is called the convolutional base
The second part is the densely connected layers
The representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convnet are presence maps of generic concepts over a picture
The densely connected layers utilize those representations to learn specific properties of the new input
Extend the pre-trained convolutional base (conv_base) by adding dense layers on top and running the whole thing end-to-end on the input data
Freeze the conv_base (preventing its weights from being updated during training)
Unfreezing a few top layers of the conv_base after the initial training with frozen weights allows for fine-tuning the performance (slight adjustment of representations in conv_base)
The list of image-classification models (all pre-trained on the ImageNet dataset) that are available as part of Keras:
A simple scheme of a one-dimensional (1D) convolutional operation (a). Full representation of a 1D convolutional neural network for an SNP matrix (b). The convolution outputs are represented in yellow. Pooling layers after the convolutional operations, combining the output of the previous layer at certain locations into a single neuron, are represented in green. The final output is a standard MLP
Pérez-Enciso and Zingaretti, “A Guide for Using Deep Learning for Complex Trait Genomic Prediction.”
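The 1D operation in panel (a) is the same cross-correlation as in the 2D case, just along a single axis, e.g. a row of the SNP matrix (illustrative sketch, not from the paper):

```python
import numpy as np

def corr1d(x, k):
    """Valid 1D cross-correlation of sequence x with kernel k."""
    n, m = len(x), len(k)
    # Slide the kernel along x; each output is a windowed dot product
    return np.array([(x[i:i + m] * k).sum() for i in range(n - m + 1)])

x = np.array([0., 1., 2., 3., 4., 5., 6.])
k = np.array([1., 2.])
print(corr1d(x, k))  # [ 2.  5.  8. 11. 14. 17.]
```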
Convolution layer animation and math, https://www.analyticsvidhya.com/blog/2020/02/mathematics-behind-convolutional-neural-network/