Deep Learning with R

Neural network fundamentals
Mikhail Dozmorov
Virginia Commonwealth University
2020-06-08

Deep Learning Prerequisites

Every machine learning and deep learning algorithm needs:

Input data - samples and their properties. E.g., images represented by color pixels. Proper data representation is crucial
Examples of the expected output - expected sample annotations
Performance evaluation metrics - how well the algorithm's output matches the expected output. Used as a feedback signal to adjust the algorithm - the process of learning

How deep learning learns

Creates increasingly complex representations of the input data, layer by layer, maximizing learning accuracy
Intermediate representations are learned jointly: the properties of each layer are updated depending on both the following and the preceding layers

The beginning of Deep Learning

A generic Deep Learning architecture is made up of a combination of several layers of "neurons"
The concept of a "neuron" was proposed in the 1950s with the well-known Rosenblatt "perceptron", inspired by brain function
The multilayer perceptron (MLP) is a fully-connected feedforward neural network containing at least one hidden layer

https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/

White, B.W.; Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Am. J. Psychol. 1963

Deep Learning winter and revival

A widespread belief held that gradient descent would be unable to escape poor local minima during optimization, preventing neural networks from converging to a globally acceptable solution
During the 1980s and 1990s, deep neural networks were largely abandoned
In 2006, deep belief networks revived interest in deep learning
In 2012, Krizhevsky et al. presented a convolutional neural network that significantly improved image recognition accuracy
GPU technologies enabled further development

Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006

The Perceptron: Linear input-output relationships

Input: take $x_{1} = 0$ , $x_{2} = 1$ , $x_{3} = 1$ and set $\text{threshold} = 0$
If $x_{1} + x_{2} + x_{3} > \text{threshold}$ , the output is 1, otherwise 0
Output: $0 + 1 + 1 = 2 > 0$ , so the output is 1
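The threshold rule on this slide can be sketched in a few lines of Python. This is a minimal illustration only; the function name `threshold_perceptron` is hypothetical and not part of the original slides.

```python
def threshold_perceptron(inputs, threshold=0):
    """Return 1 if the plain (unweighted) sum of inputs exceeds the threshold, else 0."""
    return 1 if sum(inputs) > threshold else 0


# Inputs from the slide: x1 = 0, x2 = 1, x3 = 1, threshold = 0
output = threshold_perceptron([0, 1, 1], threshold=0)
print(output)  # 0 + 1 + 1 = 2 > 0, so this prints 1
```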

https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/

http://neuralnetworksanddeeplearning.com/chap1.html

The Perceptron: Adding weights to inputs

$\hat{y} = g (\sum_{i = 1}^{m} x_{i} w_{i})$

$\hat{y}$ - the output
$\sum$ - the linear combination of inputs
$g$ - a non-linear activation function

Weights give importance to an input. For example, assign $w_{1} = 2$ , $w_{2} = 3$ and $w_{3} = 4$ to $x_{1}$ , $x_{2}$ and $x_{3}$ respectively. These weights assign the most importance to $x_{3}$ .
To compute the output, multiply each input by its weight and compare the sum with the threshold value: $w_{1} * x_{1} + w_{2} * x_{2} + w_{3} * x_{3} > \text{threshold}$

https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/

The Perceptron: Adding bias

$\hat{y} = g (w_{0} + \sum_{i = 1}^{m} x_{i} w_{i})$

$w_{0}$ - bias term

$\hat{y} = g (w_{0} + X^{T} W)$

Bias adds flexibility to the perceptron by globally shifting the calculations and allowing the weights to be more precise
Think about a linear function $y = a x + b$ , where $b$ is the bias. Without bias, the line will always go through the origin (0,0) and we get poorer fit
The input consists of multiple values $x_{i}$ with multiple weights $w_{i}$ , but only one bias is added. For $m = 3$ inputs, the linear representation of the input is $w_{1} * x_{1} + w_{2} * x_{2} + w_{3} * x_{3} + 1 * b$
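A small sketch of a perceptron with weights and a single bias, using the example weights 2, 3, 4 from the previous slide. The step activation with threshold 0 and the bias value of -8 are assumptions chosen for illustration, and the names `step` and `perceptron` are hypothetical.

```python
def step(z, threshold=0.0):
    """Step activation: 1 if z exceeds the threshold, else 0 (assumed choice of g)."""
    return 1 if z > threshold else 0


def perceptron(x, w, b=0.0, g=step):
    """Compute g(b + sum_i x_i * w_i) with one shared bias term b."""
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)


x = [0, 1, 1]
w = [2, 3, 4]

no_bias = perceptron(x, w)             # z = 0*2 + 1*3 + 1*4 = 7
with_bias = perceptron(x, w, b=-8.0)   # z = 7 - 8 = -1
print(no_bias, with_bias)  # prints: 1 0
```

The bias shifts the weighted sum before the activation is applied, so the same inputs and weights can produce a different output.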

https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/

Multi-layer neural network

Input - a layer with $n$ neurons each taking input measures
Processing information - each neuron maps input to output via nonlinear transformations that include input data $x_{i}$ , weights $w_{i}$ , and biases $b$
Output - Predicted probability of a characteristic associated with a given input

https://www.datasciencecentral.com/profiles/blogs/how-to-configure-the-number-of-layers-and-nodes-in-a-neural

Layers

Deep learning models are formed by multiple layers
The multi-layer perceptron (MLP) with more than 2 hidden layers is already a Deep Model
Most frequently used layers
- Convolution Layer
- Max/Average Pooling Layer
- Dropout Layer
- Batch Normalization Layer
- Fully Connected (Affine) Layer
- ReLU, Tanh, Sigmoid Layer (Non-Linearity Layers)
- Softmax, Cross-Entropy, SVM, Euclidean (Loss Layers)

Fitting the parameters using the training set

Parameters of the neural network (weights and biases) are first randomly initialized
- For a given layer, initialize weights using Gaussian random variables with $μ = 0$ and $σ = 1$
- Better to use standard deviation $1 / \sqrt{n_{\text{neurons}}}$
- Uniform distribution, and its modifications, are also used
Small random subsets, so-called batches, of input–target pairs of the training data set are iteratively used to make small updates on model parameters to minimize the loss function between the predicted values and the observed targets
This minimization is performed by using the gradient of the loss function computed using the backpropagation algorithm
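The initialization and batching steps above might look like the following sketch. The helper names `init_weights` and `minibatches`, the toy data, and the fixed random seed are all hypothetical; the scaling uses $1 / \sqrt{n_{\text{neurons}}}$ exactly as written on the slide, although which count $n$ refers to varies between texts.

```python
import math
import random


def init_weights(n_inputs, n_neurons, rng):
    """Draw one weight per (neuron, input) pair from N(0, sigma^2), sigma = 1/sqrt(n_neurons)."""
    sigma = 1.0 / math.sqrt(n_neurons)
    return [[rng.gauss(0.0, sigma) for _ in range(n_inputs)]
            for _ in range(n_neurons)]


def minibatches(pairs, batch_size, rng):
    """Shuffle the input-target pairs and yield them in small batches."""
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = init_weights(n_inputs=3, n_neurons=4, rng=rng)

# Toy data set: 10 input-target pairs (made up for illustration)
data = [([i, i + 1, i + 2], i % 2) for i in range(10)]
batches = list(minibatches(data, batch_size=4, rng=rng))

print(len(weights), len(weights[0]))      # prints: 4 3
print([len(batch) for batch in batches])  # prints: [4, 4, 2]
```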

Overflow and underflow

Need to represent infinitely many real numbers with a finite number of bit patterns
The approximation error is always present and can accumulate across many operations
Underflow occurs when numbers near zero are rounded to zero
Overflow occurs when numbers with large magnitude are approximated as $\infty$ or $- \infty$
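Both effects are easy to reproduce with ordinary double-precision floats. The sketch below is an illustration, not part of the original slides; it also shows one common workaround, subtracting the maximum before exponentiating in softmax. The function name `softmax_stable` is hypothetical.

```python
import math

# Underflow: the true value of exp(-1000) is positive, but it is rounded to zero
tiny = math.exp(-1000.0)
print(tiny == 0.0)       # prints: True

# Overflow: the product exceeds the largest representable double and becomes inf
huge = 1e308 * 10.0
print(math.isinf(huge))  # prints: True

# Python's math.exp raises an error instead of returning inf for large arguments
try:
    math.exp(1000.0)
    naive_overflows = False
except OverflowError:
    naive_overflows = True


def softmax_stable(z):
    """Softmax computed on z - max(z), which keeps every exponent <= 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_stable([1000.0, 1001.0, 1002.0])
```

Shifting by the maximum leaves the softmax result unchanged mathematically, because the common factor cancels between numerator and denominator.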

Activation function

Activation function takes the sum of weighted inputs as an argument and returns the output of the neuron

$a = f (\sum_{i = 0}^{N} w_{i} x_{i})$

where index 0 corresponds to the bias term ( $x_{0} = b$ , $w_{0} = 1$ ).

Activation functions

Adds nonlinearity to the network calculations, allows for flexibility to capture complex nonlinear relationships
Softmax - applied over a vector $z = (z_{1}, \ldots, z_{K}) \in \mathbb{R}^{K}$ of length $K$ as $σ (z)_{i} = \frac{e^{z_{i}}}{\sum_{j = 1}^{K} e^{z_{j}}}$
Sigmoid - $f (x) = \frac{1}{1 + e^{- x}}$
Tanh - Hyperbolic tangent $\tanh (x) = 2 * \text{sigmoid} (2 x) - 1$
ReLU - Rectified Linear Unit $f (x) = \max (x, 0)$ .

Other functions: binary step function, linear (i.e., identity) activation function, exponential and scaled exponential linear unit, softplus, softsign
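The four formulas above translate directly into plain Python. This is a minimal sketch for scalar and list inputs; the function names are chosen here for illustration, and real frameworks provide vectorized versions.

```python
import math


def sigmoid(x):
    """Sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """Hyperbolic tangent written as 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """Rectified Linear Unit: max(x, 0)."""
    return max(x, 0.0)


def softmax(z):
    """Softmax over a list z: e^(z_i) / sum_j e^(z_j)."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))            # prints: 0.5
print(relu(-3.0), relu(2.0))   # prints: 0.0 2.0
probabilities = softmax([1.0, 2.0, 3.0])
```

The identity on the slide can be checked numerically by comparing `tanh_via_sigmoid(x)` with `math.tanh(x)` for a few values of `x`.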

https://keras.io/activations/

https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/

Activation functions overview

https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044

Learning rules

Optimization - update model parameters on the training data and check performance on separate validation data to find the parameters that give the best model performance

https://www.youtube.com/watch?v=5u4G23_OohI

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Loss function

Loss function (aka objective or cost function) - a metric that assesses predictive accuracy as the difference between true and predicted values. Needs to be minimized (or maximized, depending on the metric)
- Regression loss functions - mean squared error (MSE) $\text{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - \hat{Y_{i}})^{2}$
- Binary classification loss functions - Binary Cross-Entropy $- (y \log (p) + (1 - y) \log (1 - p))$
- Multi-class classification loss functions - Multi-class Cross Entropy Loss $- \sum_{c = 1}^{M} y_{o, c} \log (p_{o, c})$ ( $M$ - number of classes, $y$ - binary indicator if class label $c$ is the correct classification for observation $o$ , $p$ - predicted probability observation $o$ is of class $c$ ), Kullback-Leibler Divergence Loss $\sum y \log (\frac{y}{\hat{y}})$ ( $y$ - true distribution, $\hat{y}$ - predicted distribution)
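As a rough illustration, the MSE and cross-entropy formulas above can be written in plain Python as follows. The function names and example values are made up for this sketch, and the cross-entropy functions assume predicted probabilities strictly between 0 and 1.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def binary_cross_entropy(y, p):
    """Binary cross-entropy for one observation with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multiclass_cross_entropy(y_onehot, p):
    """Multi-class cross-entropy for one observation; y_onehot marks the correct class."""
    return -sum(y * math.log(q) for y, q in zip(y_onehot, p) if y > 0)


regression_loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])              # (0 + 0 + 4) / 3
binary_loss = binary_cross_entropy(1, 0.9)                           # -log(0.9)
multi_loss = multiclass_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])    # -log(0.7)
```

In each case a more confident correct prediction gives a smaller loss, which is what makes these quantities usable as a feedback signal.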

https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html

Loss optimization

We want to find the network weights that achieve the lowest loss

$W^{*} = \underset{W}{\arg\min} \frac{1}{n} \sum_{i = 1}^{n} L (f (x^{(i)}; W), y^{(i)})$ where $W = \{W^{(0)}, W^{(1)}, \ldots\}$

Gradient descent

An optimization technique - finds a combination of weights for best model performance
Full batch gradient descent uses all the training data to update the weights
Stochastic gradient descent uses a single sample or a small random subset (batch) of the training data for each update
Gradient descent requires calculating the gradient by differentiating the cost function. First-order methods use only the gradient; second-order methods also use second derivatives

https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Richards, Blake A., Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, et al. “A Deep Learning Framework for Neuroscience.” Nature Neuroscience 2019 - Box 1, Learning and the credit assignment problem

Gradient descent algorithm

Initialize weights randomly $\sim N (0, σ^{2})$
Loop until convergence
- Compute gradient, $\frac{\partial J (W)}{\partial W}$
- Update weights, $W \leftarrow W - η \frac{\partial J (W)}{\partial W}$
Return weights

where $η$ is the learning rate. The right choice is critical - too small a rate converges slowly and may get stuck in local minima, too large a rate may overshoot minima entirely. Adaptive implementations exist
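The loop above can be sketched for a one-parameter problem. The cost function $J(w) = (w - 3)^2$ , the fixed starting point (instead of a random draw), the stopping tolerance, and the function name `gradient_descent` are all assumptions chosen to keep the example small and reproducible.

```python
def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Repeat w <- w - eta * dJ/dw until the gradient is close to zero."""
    w = w0
    for _ in range(max_iter):
        g = grad(w)
        if abs(g) < tol:   # simple convergence check (an assumed criterion)
            break
        w = w - eta * g
    return w


def grad_J(w):
    """Gradient of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_star = gradient_descent(grad_J, w0=0.0, eta=0.1)
print(round(w_star, 4))  # prints: 3.0

# With a learning rate that is too large, each step overshoots and the error grows
w_diverged = gradient_descent(grad_J, w0=0.0, eta=1.1, max_iter=50)
```

For this cost, each update multiplies the distance to the minimum by $1 - 2η$ , so $η = 0.1$ shrinks it while $η = 1.1$ makes it grow.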

Gradient descent algorithms

Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent with momentum (Very popular)
Nesterov's accelerated gradient (NAG)
Adaptive gradient (AdaGrad)
Adam (very good because it requires less careful tuning of the learning rate)
RMSprop

https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/model_optimization.html

Forward and backward propagation

Forward propagation computes the output by passing the input data through the network
The estimated output is compared with the expected output - the error (loss function) is calculated
Backpropagation propagates the loss back through the network and updates the weights to minimize the loss. It uses the chain rule to recursively calculate gradients backward from the output
Each round of forward- and backpropagation on one batch is known as one training iteration; one full pass through the training data is known as an epoch

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. “Learning Representations by Back-Propagating Errors,” 1986

Forward propagation

Assuming sigmoid activation function $σ (f)$ , at Layer L1, we have:

$a_{0}^{1} = σ ([w_{00}^{1} \cdot x_{0} + b_{00}^{1}] + [w_{01}^{1} \cdot x_{1} + b_{01}^{1}])$

$a_{1}^{1} = σ ([w_{10}^{1} \cdot x_{0} + b_{10}^{1}] + [w_{11}^{1} \cdot x_{1} + b_{11}^{1}])$

https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/

Forward propagation

At Layer L2, we have:

$\hat{y} = σ ([w_{00}^{2} \cdot a_{0}^{1} + b_{00}^{2}] + [w_{01}^{2} \cdot a_{1}^{1} + b_{01}^{2}])$
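The two layers can be chained in a short sketch. The weight and input values below are made up, and the code uses one bias per neuron, which is assumed here to stand for the sum of the bracketed bias terms in the formulas above. The name `dense_forward` is hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(inputs, weights, biases):
    """One fully connected layer: sigmoid(sum_j w_ij * x_j + b_i) for every neuron i."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical values for a network with 2 inputs, 2 hidden neurons, 1 output
x = [0.5, -1.0]
W1 = [[0.1, 0.2],   # weights into hidden neuron 0
      [0.3, 0.4]]   # weights into hidden neuron 1
b1 = [0.0, 0.0]
W2 = [[0.5, -0.5]]  # weights into the output neuron
b2 = [0.0]

a1 = dense_forward(x, W1, b1)          # Layer L1 activations
y_hat = dense_forward(a1, W2, b2)[0]   # Layer L2 output
print(0.0 < y_hat < 1.0)               # prints: True
```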

https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/

Backpropagation

Backpropagation - a common method to train neural networks by updating their parameters (i.e., weights) using the derivative of the network’s performance with respect to the parameters. A technique to calculate the gradient through a chain of functions

$\frac{\partial J (W)}{\partial w_{1}} = \frac{\partial J (W)}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z_{1}} * \frac{\partial z_{1}}{\partial w_{1}}$
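To make the chain rule concrete, the sketch below evaluates the three factors for a single sigmoid neuron and checks the product against a finite-difference estimate. The setup is an assumption for illustration: $z_{1} = w_{1} x$ , $\hat{y} = σ (z_{1})$ , and a squared-error cost $J = (\hat{y} - y)^{2}$ ; the slides do not fix these choices.

```python
import math


def sigmoid(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def loss(w1, x, y):
    """Assumed cost J = (y_hat - y)^2 with y_hat = sigmoid(w1 * x)."""
    y_hat = sigmoid(w1 * x)
    return (y_hat - y) ** 2


def grad_chain_rule(w1, x, y):
    """dJ/dw1 as the product dJ/dy_hat * dy_hat/dz1 * dz1/dw1."""
    z1 = w1 * x
    y_hat = sigmoid(z1)
    dJ_dyhat = 2.0 * (y_hat - y)        # derivative of the squared error
    dyhat_dz1 = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
    dz1_dw1 = x                         # derivative of w1 * x with respect to w1
    return dJ_dyhat * dyhat_dz1 * dz1_dw1


def grad_numeric(w1, x, y, eps=1e-6):
    """Central finite-difference estimate of dJ/dw1, used only as a check."""
    return (loss(w1 + eps, x, y) - loss(w1 - eps, x, y)) / (2.0 * eps)


w1, x, y = 0.5, 1.5, 1.0
analytic = grad_chain_rule(w1, x, y)
numeric = grad_numeric(w1, x, y)
print(abs(analytic - numeric) < 1e-6)  # prints: True
```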

Review https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. “Learning Representations by Back-Propagating Errors,” 1986

Backpropagation

https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/

Backpropagation Explained

A series of 10-15 min videos by deeplizard

Part 1 - The Intuition
Part 2 - The Mathematical Notation
Part 3 - Mathematical Observations and the chain rule
Part 4 - Calculating The Gradient, derivative of the loss function with respect to the weights
Part 5 - What Puts The "Back" In Backprop?

Analytics Vidhya tutorial: Step-by-step forward and backpropagation, implemented in R and Python: https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/

Vanishing gradient

Typical deep NNs suffer from the problem of vanishing or exploding gradients
- The gradient descent tries to minimize the error by taking small steps towards the minimum value. These steps are used to update the weights and biases in a neural network
- In the course of backpropagation, the steps may become too small, resulting in negligible updates to weights and bias terms. Thus, a network will be trained with nearly unchanging weights. This is the vanishing gradient problem
- Weights of early layers (the last to be updated during backpropagation) suffer the most
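A back-of-the-envelope sketch shows why depth matters. It assumes a chain of sigmoid layers with one neuron each, where backpropagation multiplies the gradient by (weight × sigmoid derivative) once per layer; the function name `gradient_scale` and the chosen values are hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid; its largest value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(n_layers, z=0.0, weight=1.0):
    """Factor by which the gradient is multiplied after passing back through n_layers."""
    return (weight * sigmoid_derivative(z)) ** n_layers


vanishing = gradient_scale(10)              # 0.25 ** 10, below one millionth
exploding = gradient_scale(10, weight=8.0)  # (8 * 0.25) ** 10 = 1024
print(vanishing < 1e-6, exploding)          # prints: True 1024.0
```

Even in the best case for the sigmoid (derivative 0.25), ten layers with unit weights shrink the gradient by about a factor of a million, while large weights push the same product in the opposite direction.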

https://en.wikipedia.org/wiki/Vanishing_gradient_problem

Vanishing & Exploding Gradient Explained | A Problem Resulting From Backpropagation

https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/

Exploding gradient

Typical deep NNs suffer from the problem of vanishing or exploding gradients
- The gradient descent tries to minimize the error by taking small steps towards the minimum value. These steps are used to update the weights and biases in a neural network
- The steps may become too large, resulting in large updates to weights and bias terms and potential numerical overflow. This is the exploding gradient problem
- Various solutions exist, typically by propagating a feedback signal from previous layers (residual connections)

https://en.wikipedia.org/wiki/Vanishing_gradient_problem

Vanishing & Exploding Gradient Explained | A Problem Resulting From Backpropagation

https://www.analyticsvidhya.com/blog/2020/04/comprehensive-popular-deep-learning-interview-questions-answers/

Neural Network summary

Angermueller et al., “Deep Learning for Computational Biology.”

The Neural Network Zoo

Review the complete infographics at https://www.asimovinstitute.org/neural-network-zoo/
