6 Neural Networks

You’ve probably been hearing a lot about “neural networks.” Now that we have several useful machine-learning concepts (hypothesis classes, classification, regression, gradient descent, regularization, etc.), we are well equipped to understand neural networks in detail.

This is, in some sense, the “third wave” of neural nets. The basic idea is founded on the 1943 model of neurons of McCulloch and Pitts and the learning ideas of Hebb. There was a great deal of excitement, but not a lot of practical success: there were good training methods (e.g., perceptron) for linear functions, and interesting examples of non-linear functions, but no good way to train non-linear functions from data. Interest died out for a while, but was re-kindled in the 1980s when several people came up with a way to train neural networks with “back-propagation,” which is a particular style of implementing gradient descent, that we will study here.

As with many good ideas in science, the basic idea for how to train non-linear neural networks with gradient descent was independently developed by more than one researcher.

By the mid-90s, the enthusiasm waned again, because although we could train non-linear networks, the training tended to be slow and was plagued by a problem of getting stuck in local optima. Support vector machines (SVMs) that use regularization of high-dimensional hypotheses by seeking to maximize the margin, alongside kernel methods that provide an efficient and beautiful way of using feature transformations to non-linearly transform data into a higher-dimensional space, provided reliable learning methods with guaranteed convergence and no local optima.

However, during the SVM enthusiasm, several groups kept working on neural networks, and their work, in combination with an increase in available data and computation, has made neural networks rise again. They have become much more reliable and capable, and are now the method of choice in many applications. There are many, many variations of neural networks, which we can’t even begin to survey. We will study the core “feed-forward” networks with “back-propagation” training, and then, in later chapters, address some of the major advances beyond this core.

The number of neural network variants increases daily, as may be seen on arxiv.org.

We can view neural networks from several different perspectives:

View 1: An application of stochastic gradient descent for classification and regression with a potentially very rich hypothesis class.

View 2: A brain-inspired network of neuron-like computing elements that learn distributed representations.

View 3: A method for building applications that make predictions based on huge amounts of data in very complex domains.

We will mostly take view 1, with the understanding that the techniques we develop will enable the applications in view 3. View 2 was a major motivation for the early development of neural networks, but the techniques we will study do not seem to actually account for the biological learning processes in brains.

Some prominent researchers are, in fact, working hard to find analogues of these methods in the brain.

6.1 Basic element

The basic element of a neural network is a “neuron,” pictured schematically below. We will also sometimes refer to a neuron as a “unit” or “node.”

$\begin{tikzpicture}[ main node/.style={thick,circle,font=\Large}, background rectangle/.style={fill=white}, show background rectangle ] \node[main node,draw](sum) at (0,0) {$\sum$}; \node[main node](x1) at (-2,1) {$x_1$}; \node[main node](dots) at (-2,0) {$\vdots$}; \node[main node](xm) at (-2,-1) {$x_m$}; \coordinate (w0) at (0,-1.5); \node[main node, draw] (f) at (2,0) {$f(\cdot)$}; \node[main node] (y) at (4,0) {$a$}; \draw[->, above right] (x1) -- node {$w_1$} (sum); \draw (dots) -- node {} (sum); \draw[->, below right] (xm) -- node {$w_m$} (sum); \draw[above right] (w0) node {$w_0$} -- (sum) ; \draw[->, above] (sum) -- node (z) {$z$} (f); \draw[->, above] (f) -- (y); \node[black!70] (inp) at ($(xm) + (0,-1.5)$) {input}; \node[black!70] (pre) at ($(z) + (0,1)$) {pre-activation}; \node[black!70] (out) at ($(y) + (0,1)$) {output}; \node[black!70] (act) at ($(f) - (0,1.5)$) {activation function}; \draw[->,black!70] (inp) -- (xm); \draw[->,black!70] (pre) -- (z); \draw[->,black!70] (out) -- (y); \draw[->,black!70] (act) -- (f); \end{tikzpicture}$

It is a (generally non-linear) function of an input vector $x \in \mathbb{R}^m$ to a single output value $a \in \mathbb{R}$.

Sorry for changing our notation here. We were using $d$ as the dimension of the input, but we are trying to be consistent here with many other accounts of neural networks. It is impossible to be consistent with all of them though—there are many different ways of telling this story.

It is parameterized by a vector of weights $(w_1, \ldots, w_m) \in \mathbb{R}^m$ and an offset or threshold $w_0 \in \mathbb{R}$.

This should remind you of our $\theta$ and $\theta_0$ for linear models.

We also specify an activation function $f : \mathbb{R}\rightarrow \mathbb{R}$. In general, this is chosen to be a non-linear function, which means the neuron is non-linear. In the case that the activation function is the identity ($f(x) = x$) or another linear function, then the neuron is a linear function of $x$). The activation can theoretically be any function, though we will only be able to work with it if it is differentiable.

The function represented by the neuron is expressed as: \[a = f(z) = f\left(\left(\sum_{j=1}^m x_jw_j\right) + w_0\right) = f(w^Tx + w_0)\;\;.\]

Before thinking about a whole network, we can consider how to train a single unit. Given a loss function $\mathcal{L}(\text{guess}, \text{actual})$ and a dataset $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)},y^{(n)})\}$, we can do (stochastic) gradient descent, adjusting the weights $w, w_0$ to minimize \[J(w, w_0) = \sum_{i} \mathcal{L}\left(\text{NN}(x^{(i)}; w, w_0), y^{(i)}\right)\;,\] where $\text{NN}$ is the output of our single-unit neural net for a given input.

We have already studied two special cases of the neuron: linear logistic classifiers (LLCs) with NLL loss and regressors with quadratic loss! The activation function for the LLC is $f(x) = \sigma(x)$ and for linear regression it is simply $f(x) = x$.

Just for a single neuron, imagine for some reason, that we decide to use activation function $f(z) = e^z$ and loss function $\mathcal{L}(\text{guess}, \text{actual}) = (\text{guess} - \text{actual})^2$. Derive a gradient descent update for $w$ and $w_0$.

6.2 Networks

Now, we’ll put multiple neurons together into a network. A neural network in general takes in an input $x \in \mathbb{R}^m$ and generates an output $a \in \mathbb{R}^n$. It is constructed out of multiple neurons; the inputs of each neuron might be elements of $x$ and/or outputs of other neurons. The outputs of the neural network are generated by $n$ output units.

In this chapter, we will only consider feed-forward networks. In a feed-forward network, you can think of the network as defining a function-call graph that is acyclic: that is, the input to a neuron can never depend on that neuron’s output. Data flows one way, from the inputs to the outputs, and the function computed by the network is just a composition of the functions computed by the individual neurons.

Although the graph structure of a feed-forward neural network can really be anything (as long as it satisfies the feed-forward constraint), for simplicity in software and analysis, we usually organize them into layers. A layer is a group of neurons that are essentially “in parallel”: their inputs are the outputs of neurons in the previous layer, and their outputs are the inputs to the neurons in the next layer. We’ll start by describing a single layer, and then go on to the case of multiple layers.

6.2.1 Single layer

A layer is a set of units that, as we have just described, are not connected to each other. The layer is called fully connected if, as in the diagram below, all of the inputs (i.e., $x_1, x_2, \ldots x_m$ in this case) are connected to every unit in the layer. A layer has input $x \in \mathbb{R}^m$ and output (also known as activation) $a \in \mathbb{R}^n$.

$\begin{tikzpicture}[ main node/.style={circle}, background rectangle/.style={fill=white}, show background rectangle ] \coordinate (soff) at (0,1.4); \node[main node,draw](s1) at ($2*(soff)$) {$\sum$}; \node[main node,draw](s2) at (soff) {$\sum$}; \node[main node,draw](s3) at (0,0) {$\sum$}; \node[main node](sdots) at ($(0,0)-(soff)$) {$\vdots$}; \node[main node,draw](sn) at ($(0,0)-2*(soff)$) {$\sum$}; % inputs \coordinate (x) at (-2,0); \coordinate (xoff) at (0,1); \node[main node](x1) at ($(x) + 1.5*(xoff)$) {$x_1$}; \node[main node](x2) at ($(x) + .5*(xoff)$) {$x_2$}; \node[main node](xdots) at ($(x) - .5*(xoff)$) {$\vdots$}; \node[main node](xm) at ($(x) - 1.5*(xoff)$) {$x_m$}; % activations \coordinate (foff) at (1.5,0); \node[main node, draw](f1) at ($(s1) + (foff)$) {$f$}; \node[main node, draw](f2) at ($(s2) + (foff)$) {$f$}; \node[main node, draw](f3) at ($(s3) + (foff)$) {$f$}; \node[main node] at ($(sdots) + (foff)$) {$\vdots$}; \node[main node, draw](fn) at ($(sn) + (foff)$) {$f$}; % outputs \node[main node](y1) at ($(s1) + 2*(foff)$) {$a_1$}; \node[main node](y2) at ($(s2) + 2*(foff)$) {$a_2$}; \node[main node](y3) at ($(s3) + 2*(foff)$) {$a_3$}; \node[main node] at ($(sdots) + 2*(foff)$) {$\vdots$}; \node[main node](yn) at ($(sn) + 2*(foff)$) {$a_n$}; % arrows \foreach \b in {(s1), (s2), (s3), (sn)} \foreach \a in {(x1), (x2), (xm)} \draw[->] \a -- \b; \foreach \x/\y/\z in {(s1)/(f1)/(y1), (s2)/(f2)/(y2), (s3)/(f3)/(y3), (sn)/(fn)/(yn)}{ \draw[->] \x -- \y; \draw[->] \y -- \z; } %\foreach \x in {(s1), (s2), (s3), (sn)} % \draw \x -- +(0,-.65); \node[main node, ,below left] (weights) at ($(xm)!0.5!(sn)$) {$W,W_0$}; \end{tikzpicture}$

Since each unit has a vector of weights and a single offset, we can think of the weights of the whole layer as a matrix, $W$, and the collection of all the offsets as a vector $W_0$. If we have $m$ inputs, $n$ units, and $n$ outputs, then

$W$ is an $m\times n$ matrix,
$W_0$ is an $n \times 1$ column vector,
$X$, the input, is an $m \times 1$ column vector,
$Z = W^T X + W_0$, the pre-activation, is an $n \times 1$ column vector,
$A$, the activation, is an $n \times 1$ column vector,

and the output vector is \[A = f(Z) = f(W^TX + W_0)\;\;.\] The activation function $f$ is applied element-wise to the pre-activation values $Z$.

6.2.2 Many layers

A single neural network generally combines multiple layers, most typically by feeding the outputs of one layer into the inputs of another layer.

We have to start by establishing some nomenclature. We will use $l$ to name a layer, and let $m^l$ be the number of inputs to the layer and $n^l$ be the number of outputs from the layer. Then, $W^l$ and $W^l_0$ are of shape $m^l \times n^l$ and $n^l \times 1$, respectively. Note that the input to layer $l$ is the output from layer $l-1$, so we have $m^l= n^{l-1}$, and as a result $A^{l-1}$ is of shape $m^l \times 1$, or equivalently $n^{l-1} \times 1$. Let $f^l$ be the activation function of layer $l$. Then, the pre-activation outputs are the $n^l \times 1$ vector \[Z^l = {W^l}^TA^{l-1} + W^l_0\] and the activation outputs are simply the $n^l \times 1$ vector \[ A^l = f^l(Z^l)\;\;. \]

It is technically possible to have different activation functions within the same layer, but, again, for convenience in specification and implementation, we generally have the same activation function within a layer.

Here’s a diagram of a many-layered network, with two blocks for each layer, one representing the linear part of the operation and one representing the non-linear activation function. We will use this structural decomposition to organize our algorithmic thinking and implementation.

$\begin{tikzpicture}[ background rectangle/.style={fill=white}, show background rectangle ] \coordinate (x) at (0,0); \node[inner sep=0em] (w1) at (1.7,0) {\begin{tabular}{c} $W^1$ \\ $W^1_0$\end{tabular}}; \node[inner sep=1em] (f1) at ($2*(w1)$) {$f^1$}; \node[inner sep=0em] (w2) at ($3*(w1)$) {\begin{tabular}{c} $W^2$ \\ $W^2_0$\end{tabular}}; \node[inner sep=1em] (f2) at ($4*(w1)$) {$f^2$}; \node (dots) at ($5*(w1)$) {$\cdots$}; \node[inner sep=0em] (wL) at ($6*(w1)$) {\begin{tabular}{c} $W^L$ \\ $W^L_0$\end{tabular}}; \node[inner sep=1em] (fL) at ($7*(w1)$) {$f^L$}; \coordinate (y) at ($8*(w1) - (.3,0)$); \draw[->] (x) -- node[above] {$X = A^0$} (w1); \draw[->] (w1) -- node[above] {$Z^1$} (f1); \draw[->] (f1) -- node[above] {$A^1$} (w2); \draw[->] (w2) -- node[above] {$Z^2$} (f2); \draw[->] (f2) -- node[above] {$A^2$} (dots); \draw[->] (dots) -- node[above] {$A^{L-1}$} (wL); \draw[->] (wL) -- node[above] {$Z^L$} (fL); \draw[->] (fL) -- node[above] {$A^L$} (y); \draw [decorate,decoration={brace,mirror,amplitude=10pt}] ($(w1) + (-.4, -.7)$) -- node[yshift=-1.8em] {layer 1} ($(f1) + (.4,-.7)$); \draw [decorate,decoration={brace,mirror,amplitude=10pt}] ($(w2) + (-.4, -.7)$) -- node[yshift=-1.8em] {layer 2} ($(f2) + (.4,-.7)$); \draw [decorate,decoration={brace,mirror,amplitude=10pt}] ($(wL) + (-.4, -.7)$) -- node[yshift=-1.8em, align=center] {layer $L$} ($(fL) + (.4,-.7)$); \foreach \point in {w1, f1, w2, f2, wL, fL}{ \draw ($(\point) + (-.3,-.5)$) rectangle ($(\point) + (.3,.5)$); } \end{tikzpicture}$

6.3 Choices of activation function

There are many possible choices for the activation function. We will start by thinking about whether it’s really necessary to have an $f$ at all.

What happens if we let $f$ be the identity? Then, in a network with $L$ layers (we’ll leave out $W_0$ for simplicity, but keeping it wouldn’t change the form of this argument), \[ A^L = {W^L}^T A^{L-1} = {W^L}^T {W^{L-1}}^T \cdots {W^1}^T X\;\;. \]

So, multiplying out the weight matrices, we find that \[A^L = W^\text{total}X\;\;,\] which is a linear function of $X$! Having all those layers did not change the representational capacity of the network: the non-linearity of the activation function is crucial.

Convince yourself that any function representable by any number of linear layers (where $f$ is the identity function) can be represented by a single layer.

Now that we are convinced we need a non-linear activation, let’s examine a few common choices. These are shown mathematically below, followed by plots of these functions.

Step function: \[\text{step}(z) = \begin{cases} 0 & \text{if $z<0$} \\ 1 & \text{otherwise} \end{cases}\]

Rectified linear unit (ReLU): \[\text{ReLU}(z) = \begin{cases} 0 & \text{if $z<0$} \\ z & \text{otherwise} \end{cases} = \max(0,z)\]

Sigmoid function: Also known as a logistic function. This can sometimes be interpreted as probability, because for any value of $z$ the output is in $(0, 1)$: \[\sigma(z) = \frac{1}{1+e^{-z}}\]

Hyperbolic tangent: Always in the range $(-1, 1)$: \[\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\]

Softmax function: Takes a whole vector $Z \in \mathbb{R}^n$ and generates as output a vector $A \in (0, 1)^n$ with the property that $\sum_{i = 1}^n A_i = 1$, which means we can interpret it as a probability distribution over $n$ items:

\[ \text{softmax}(z) = \begin{bmatrix} \exp(z_1) / \sum_{i} \exp(z_i) \\ \vdots \\ \exp(z_n) / \sum_{i} \exp(z_i) \end{bmatrix} \]

$\begin{tikzpicture}[ background rectangle/.style={fill=white}, show background rectangle ] \begin{axis}[ name=topleft, axis lines=middle, axis equal image, xmin=-2, xmax=2, ymin=-.5, ymax=1.5, xlabel={$z$}, ylabel={$\text{step}(z)$}, ] \addplot [domain=-2:0, samples=2, ultra thick] {0}; \addplot [domain=0:2, samples=2, ultra thick] {1}; \end{axis} \begin{axis}[ name=topright, axis lines=middle, axis equal image, xmin=-2, xmax=2, ymin=-.5, ymax=1.5, xlabel={$z$}, ylabel={$\text{ReLU}(z)$}, at = (topleft.east), anchor=west, xshift=.7cm ] \addplot [domain=-2:0, samples=2, ultra thick] {0}; \addplot [domain=0:2, samples=2, ultra thick] {x}; \end{axis} \begin{axis}[ name=bottomleft, axis lines=middle, %axis equal image, xmin=-4, xmax=4, ymin=-1.25, ymax=1.25, xlabel={$z$}, ylabel={$\sigma(z)$}, at = (topleft.south), anchor=north, yshift=-.7cm ] \addplot [domain=-4:4, samples=100, ultra thick] {1/(1+exp(-x))}; \addplot [domain=-4:4, samples=2, dashed] {1}; \addplot [domain=-4:4, samples=2, dashed] {0}; \end{axis} \begin{axis}[ name=bottomright, axis lines=middle, %axis equal image, xmin=-4, xmax=4, ymin=-1.25, ymax=1.25, xlabel={$z$}, ylabel={$\tanh(z)$}, at = (bottomleft.east), anchor=west, xshift=.7cm ] \addplot [domain=-4:4, samples=100, ultra thick] {(exp(x) - exp(-x))/(exp(x) + exp(-x))}; \addplot [domain=-4:4, samples=2, dashed] {1}; \addplot [domain=-4:4, samples=2, dashed] {-1}; \end{axis} \end{tikzpicture}$

The original idea for neural networks involved using the step function as an activation, but because the derivative of the step function is zero everywhere except at the discontinuity (and there it is undefined), gradient-descent methods won’t be useful in finding a good setting of the weights, and so we won’t consider the step function further. Step functions have been replaced, in a sense, by the sigmoid, ReLU, and tanh activation functions.

Consider sigmoid, ReLU, and tanh activations. Which one is most like a step function? Is there an additional parameter you could add to a sigmoid that would make it be more like a step function?

What is the derivative of the ReLU function? Are there some values of the input for which the derivative vanishes?

ReLUs are especially common in internal (“hidden”) layers, sigmoid activations are common for the output for binary classification, and softmax activations are common for the output for multi-class classification (see Section 4.3.3 for an explanation).

6.4 Loss functions and activation functions

At layer $L,$ which is the output layer, we need to specify a loss function, and possibly an activation function as well. Different loss functions make different assumptions about the range of values they will get as input and, as we have seen, different activation functions will produce output values in different ranges. When you are designing a neural network, it’s important to make these things fit together well. In particular, we will think about matching loss functions with the activation function in the last layer, $f^L$. Here is a table of loss functions and activations that make sense for them:

Loss	$f^L$	task
squared	linear	regression
nll	sigmoid	binary classification
nllm	softmax	multi-class classification

We explored squared loss in Chapter 2 and (nll and nllm) in Chapter 4.

6.5 Error back-propagation

We will train neural networks using gradient descent methods. It’s possible to use batch gradient descent, in which we sum up the gradient over all the points (as in Section 3.2 of Chapter 3) or stochastic gradient descent (SGD), in which we take a small step with respect to the gradient considering a single point at a time (as in Section 3.4 of Chapter 3).

Our notation is going to get pretty hairy pretty quickly. To keep it as simple as we can, we’ll focus on computing the contribution of one data point $x^{(i)}$ to the gradient of the loss with respect to the weights, for SGD; you can simply sum up these gradients over all the data points if you wish to do batch descent.

So, to do SGD for a training example $(x, y)$, we need to compute $\nabla_W \mathcal{L}(\text{NN}(x;W),y)$, where $W$ represents all weights $W^l, W_0^l$ in all the layers $l = (1, \ldots, L)$. This seems terrifying, but is actually quite easy to do using the chain rule.

Remember the chain rule! If $a = f(b)$ and $b = g(c)$, so that $a = f(g(c))$, then \[ \begin{aligned} \frac{d a}{d c} &= \frac{d a}{d b} \cdot \frac{d b}{d c} \\ &= f'(b) g'(c) \\ &= f'(g(c)) g'(c) \end{aligned} \]

Remember that we are always computing the gradient of the loss function with respect to the weights for a particular value of $(x, y)$. That tells us how much we want to change the weights, in order to reduce the loss incurred on this particular training example.

6.5.1 Notation and Setup

We consider supervised learning with training data \[ \mathcal{D}_n = \{(x^{(i)}, y^{(i)})\}_{i=1}^n. \]

For a single example $(x,y)$, the forward pass through an $L$-layer network is \[ Z^l = (W^l)^\top A^{\,l-1} + W_0^l, \qquad A^l = f^l(Z^l), \qquad l=1,\dots,L, \] with $A^0 = x$ as the input and $g = A^L$ as the network’s output.

If layer $l$ has $d_l$ units, then:

$A^l, Z^l \in \mathbb{R}^{d_l \times 1}$,
$W^l \in \mathbb{R}^{d_{l-1} \times d_l}$,
$W_0^l \in \mathbb{R}^{d_l \times 1}$.

Each column of $W^l$ contains the incoming weights for one neuron in layer $l$.

Shape check: $(W^l)^\top \in \mathbb{R}^{d_l \times d_{l-1}}$, so $(W^l)^\top A^{l-1}$ is $d_l \times 1$, consistent with $Z^l$.

The per-example loss is $\mathcal{L}(g,y)$, and the overall training objective is \[ J = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(g^{(i)}, y^{(i)}). \]

6.5.2 A Single Neuron

A single neuron computes \[ z = w^\top x + w_0, \qquad g = f(z). \]

Here $x \in \mathbb{R}^d$ is the input vector, $w \in \mathbb{R}^d$ is the weight vector, $w_0$ is the offset, $z$ is a scalar and $f$ is an activation function. The loss is $\mathcal{L}(g,y)$.

$\begin{tikzpicture}[ main node/.style={thick,circle,font=\Large}, background rectangle/.style={fill=white}, show background rectangle ] \node[main node,draw](sum) at (0,0) {$\sum$}; \node[main node](x1) at (-2,1) {$x_1$}; \node[main node](dots) at (-2,0) {$\vdots$}; \node[main node](xm) at (-2,-1) {$x_m$}; \coordinate (w0) at (0,-1.5); \node[main node, draw] (f) at (2,0) {$f(\cdot)$}; \node[main node] (y) at (4,0) {$a$}; \draw[->, above right] (x1) -- node {$w_1$} (sum); \draw (dots) -- node {} (sum); \draw[->, below right] (xm) -- node {$w_m$} (sum); \draw[above right] (w0) node {$w_0$} -- (sum) ; \draw[->, above] (sum) -- node (z) {$z$} (f); \draw[->, above] (f) -- (y); \node[black!70] (pre) at ($(z) + (0,1)$) {pre-activation}; \node[black!70] (out) at ($(y) + (0,1)$) {output}; \draw[->,black!70] (pre) -- (z); \draw[->,black!70] (out) -- (y); \end{tikzpicture}$

Before computing the derivatives, it helps to build some intuition:

A change in the weights $w$ changes the pre-activation $z$ in proportion to the input $x$.
A change in $z$ changes the output $g$ according to the slope $f'(z)$.
A change in $g$ changes the loss according to $\tfrac{\partial \mathcal{L}}{\partial g}$.

By the chain rule: \[ \begin{align*} \frac{\partial \mathcal{L}}{\partial w} &= \frac{\partial z}{\partial w} \frac{\partial g}{\partial z} \frac{\partial \mathcal{L}}{\partial g} \\ &=\; x \; f'(z)\; \frac{\partial \mathcal{L}}{\partial g}, \end{align*} \]

\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial w_0} &= \frac{\partial z}{\partial w_0} \frac{\partial g}{\partial z} \; \frac{\partial \mathcal{L}}{\partial g} \\ &= f'(z) \; \frac{\partial \mathcal{L}}{\partial g}. \end{align*} \]

Shape check:

Since the loss and $g$ are scalars, $\tfrac{\partial \mathcal{L}}{\partial g}$ and $f'(z)$ are scalars.
Because $z$ is a scalar and $w \in \mathbb{R}^{d_0 \times 1}$, $\tfrac{\partial z}{\partial w}$ is a vector with the same shape as $w$, which is consistent here.

Also note the arrangement of the terms. Once we move to neural networks with vector outputs and multiple layers, the matrix form of the chain rule (under the denominator-layout convention we are using) requires working from right to left. Conveniently, this order matches how computations are drawn in network block diagrams.

As a final connection to earlier models:

If $f(z) = z$ and $\mathcal{L}(g,y) = \tfrac{1}{2}(g-y)^2$, this is linear regression.
If $f(z) = \sigma(z)$ (the sigmoid) and $\mathcal{L}$ is the negative log-likelihood loss, this is logistic regression.

So linear regressors and logistic classifiers are simply single-neuron networks!

Suppose you choose squared loss. What is $\partial \text{loss} / \partial a^L$?

Check the derivations above yourself. You should use the chain rule and also solve for the individual derivatives that arise in the chain rule.

6.5.3 A One-Hidden-Layer, Single Output Neuron Network

Building on the previous case, consider a network with one hidden layer: \[ Z^1 = (W^1)^\top A^0 + W_0^1, \qquad A^1 = f^1(Z^1), \] \[ z^2 = (w^2)^\top A^1 + w_0^2, \qquad g = f^2(z^2). \]

We want gradients of $\mathcal{L}(g,y)$ with respect to $W^1, W_0^1, w^2, w_0^2$. Here $w^2$ is a vector and $w_0^2$ is a scalar, written this way to emphasize continuity with the single-neuron case. In general, of course, a network can have multiple output neurons.

$\begin{tikzpicture}[ main node/.style={circle}, background rectangle/.style={fill=white}, show background rectangle ] % Input nodes \coordinate (x) at (-2,0); \coordinate (xoff) at (0,1); \node[main node](x1) at ($(x) + 1.5*(xoff)$) {$x_1$}; \node[main node](x2) at ($(x) + .5*(xoff)$) {$x_2$}; \node[main node](xdots) at ($(x) - .5*(xoff)$) {$\vdots$}; \node[main node](xm) at ($(x) - 1.5*(xoff)$) {$x_m$}; % Hidden layer (single-layer network) \coordinate (soff) at (0,1.4); \node[main node,draw](s1) at ($2*(soff)$) {$\sum$}; \node[main node,draw](s2) at (soff) {$\sum$}; \node[main node,draw](s3) at (0,0) {$\sum$}; \node[main node](sdots) at ($(0,0)-(soff)$) {$\vdots$}; \node[main node,draw](sn) at ($(0,0)-2*(soff)$) {$\sum$}; % Activation layer \coordinate (foff) at (1.5,0); \node[main node, draw](f1) at ($(s1) + (foff)$) {$f^1$}; \node[main node, draw](f2) at ($(s2) + (foff)$) {$f^1$}; \node[main node, draw](f3) at ($(s3) + (foff)$) {$f^1$}; \node[main node] at ($(sdots) + (foff)$) {$\vdots$}; \node[main node, draw](fn) at ($(sn) + (foff)$) {$f^1$}; % Single output neuron \node[main node, draw](sout) at ($(s3) + 3*(foff)$) {$\sum$}; \node[main node, draw](fout) at ($(sout) + (foff)$) {$f^2$}; \node[main node](y) at ($(fout) + (foff)$) {$g$}; % Connections from inputs to hidden layer \foreach \b in {(s1), (s2), (s3), (sn)} \foreach \a in {(x1), (x2), (xm)} \draw[->] \a -- \b; % Hidden layer to activation \foreach \x/\y in {(s1)/(f1), (s2)/(f2), (s3)/(f3), (sn)/(fn)} \draw[->] \x -- \y; % Activations to single output neuron \foreach \a in {(f1), (f2), (f3), (fn)} \draw[->] \a -- (sout); % Output connections \draw[->] (sout) -- (fout); \draw[->] (fout) -- (y); % Weight label \node[main node, below left] (weights1) at ($(xm)!0.5!(sn)$) {$W^1,W^1_0$}; \node[main node, below right] (weights2) at ($(fn)!0.5!(sout)$) {$w^2,w^2_0$}; % pre- and post- activation labels \node[main node, below] (Z1) at ($(sn)!0.5!(fn)$) {$Z^1$}; \node[main node, right=1cm of Z1] (A1) {$A^1$}; \node[main node, below] (z2) at ($(sout)!0.5!(fout)$) {$z^2$}; \end{tikzpicture}$

We have already worked through the gradients of the loss with respect to $w^2$ and $w_0^2$:

\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial w^2} &= \frac{\partial z^2}{\partial w^2} \frac{\partial g}{\partial z^2} \frac{\partial \mathcal{L}}{\partial g} \\ &=\; A^1 \; {f^2}'(z^2)\; \frac{\partial \mathcal{L}}{\partial g}, \end{align*} \]

\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial w^2_0} &= \frac{\partial z^2}{\partial w^2_0} \frac{\partial g}{\partial z^2} \; \frac{\partial \mathcal{L}}{\partial g} \\ &= {f^2}'(z^2) \; \frac{\partial \mathcal{L}}{\partial g}. \end{align*} \]

Shape check:

$A^1 \in \mathbb{R}^{d_1 \times 1}$,
the scalar ${f^2}'(z^2)\tfrac{\partial \mathcal{L}}{\partial g}$ multiplies it,
so $\tfrac{\partial \mathcal{L}}{\partial w^2} \in \mathbb{R}^{d_1 \times 1}$, consistent with $w^2$.

Now, we need the gradients with respect to $W^1$ and $W^1_0$. Writing out the chain rule shows that

\[ \frac{\partial \mathcal{L}}{\partial W^1} = \frac{\partial Z^1}{\partial W^1} \frac{\partial A^1}{\partial Z^1} \frac{\partial z^2}{\partial A^1} \underbrace{\frac{\partial g}{\partial z^2} \frac{\partial \mathcal{L}}{\partial g}}_{\text{shared}}, \]

\[ \frac{\partial \mathcal{L}}{\partial W^1_0} = \frac{\partial Z^1}{\partial W^1_0} \frac{\partial A^1}{\partial Z^1} \frac{\partial z^2}{\partial A^1} \underbrace{\frac{\partial g}{\partial z^2} \frac{\partial \mathcal{L}}{\partial g}}_{\text{shared}}. \]

The final two terms are exactly the same as in the gradients for $w^2$ and $w^2_0$. This illustrates the backward reuse that makes back-propagation efficient.

Carrying out the derivatives gives:

\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial W^1} &= A^0 \Big( \underbrace{{f^1}'(Z^1) \big(w^2 {f^2}'(z^2) \tfrac{\partial \mathcal L}{\partial g}\big)}_{\tfrac{\partial \mathcal L}{\partial Z^1}}\Big)^T \end{align*} \]

\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial W^1_0} &= \underbrace{{f^1}'(Z^1) \big(w^2 {f^2}'(z^2) \tfrac{\partial \mathcal L}{\partial g}\big)}_{\tfrac{\partial \mathcal L}{\partial Z^1}} \end{align*} \]

Shape check:

$A^0 \in \mathbb{R}^{d_0 \times 1}$,
$(\tfrac{\partial \mathcal{L}}{\partial Z^1})^\top \in \mathbb{R}^{1 \times d_1}$,

so the product is $d_0 \times d_1$, consistent with $W^1$.

Where did the transpose come from? A note on the shape of things.

Note that $Z^1$ is a vector and $W^1$ is a matrix; strictly speaking, $\tfrac{\partial Z^1}{\partial W^1}$ is a vector-by-matrix derivative, which produces a tensor. Why, then, do we always end up with something as simple as an outer product?

To see why, recall \[ Z^1 = (W^1)^\top A^0 + W_0^1, \] so for the $j$-th coordinate, \[ z^1_j = \sum_{k=1}^{d_0} w^1_{k,j}\,a^0_k + w^1_{0,j}. \]

Now consider $\tfrac{\partial z^1_j}{\partial w^1_{k,j'}}$:

If $j \neq j'$, then $z^1_j$ does not depend on $w^1_{k,j'}$, so the derivative is $0$.
If $j = j'$, then \[ \frac{\partial z^1_j}{\partial w^1_{k,j}} = a^0_k. \]

Thus, every entry of the derivative tensor is either zero or one of the inputs $a^0_k$. The tensor has a very simple structure: for each $j$, the slice corresponding to $z^1_j$ is just the input vector $A^0$ in the $j$-th column, with zeros elsewhere.

When we apply the chain rule to compute the gradient of the loss, \[ \frac{\partial \mathcal{L}}{\partial W^1} = \sum_j \frac{\partial \mathcal{L}}{\partial z^1_j} \, \frac{\partial z^1_j}{\partial W^1}, \] each $\frac{\partial z^1_j}{\partial W^1}$ contributes a column equal to $A^0$ scaled by $\tfrac{\partial \mathcal{L}}{\partial z^1_j}$.

Stacking these columns gives exactly the outer product: \[ \frac{\partial \mathcal{L}}{\partial W^1} = A^0 \Big(\frac{\partial \mathcal{L}}{\partial Z^1}\Big)^\top. \]

So although $\tfrac{\partial Z^1}{\partial W^1}$ is formally a tensor, its sparse structure ensures that, once contracted with $\tfrac{\partial \mathcal{L}}{\partial Z^1}$, it collapses neatly into a matrix. This is why in practice we never need to form the tensor explicitly.

6.5.4 General $L$-Layer Network

Putting this together, we arrive at the general back-propagation recursion:

\[ \frac{\partial \mathcal{L}}{\partial Z^l} = {f^l}'(Z^l) \Big(W^{l+1}\frac{\partial \mathcal{L}}{\partial Z^{l+1}}\Big). \]

This rule is the heart of back-propagation: once the sensitivity $\tfrac{\partial \mathcal{L}}{\partial Z^{l+1}}$ is known for layer $l{+}1$, we can compute $\tfrac{\partial \mathcal{L}}{\partial Z^l}$ for the previous layer by (1) multiplying by the next layer’s weights, and (2) applying the derivative of the activation function.

Shape check:

$W^{l+1}\in\mathbb{R}^{d_l \times d_{l+1}}$,
$\tfrac{\partial \mathcal{L}}{\partial Z^{l+1}}\in\mathbb{R}^{d_{l+1}\times 1}$,
so $W^{l+1}\tfrac{\partial \mathcal{L}}{\partial Z^{l+1}} \in\mathbb{R}^{d_l\times 1}$.
${f^l}'(Z^l)\in\mathbb{R}^{d_l\times d_l}$ keeps the shape consistent.

For a general neural network with $L$ layers, the same reasoning applies.

Forward pass: \[ Z^l = (W^l)^\top A^{l-1} + W_0^l, \qquad A^l = f^l(Z^l), \qquad l=1,\dots,L. \]

Start at the output: \[ \frac{\partial \mathcal{L}}{\partial Z^L} = {f^L}'(Z^L) \frac{\partial \mathcal{L}}{\partial g}, \] \[ \frac{\partial \mathcal{L}}{\partial W^L} = A^{L-1}\Big(\frac{\partial \mathcal{L}}{\partial Z^L}\Big)^\top, \qquad \frac{\partial \mathcal{L}}{\partial W_0^L} = \frac{\partial \mathcal{L}}{\partial Z^L}. \]

Then for each hidden layer $l=L-1,\dots,1$:

Pass sensitivity back through the weights: \[ \frac{\partial \mathcal{L}}{\partial A^l} = W^{l+1}\frac{\partial \mathcal{L}}{\partial Z^{l+1}}. \]
Apply the elementwise activation derivative: \[ \frac{\partial \mathcal{L}}{\partial Z^l} = {f^l}'(Z^l) \frac{\partial \mathcal{L}}{\partial A^l}. \]
Compute parameter gradients: \[ \frac{\partial \mathcal{L}}{\partial W^l} = A^{l-1}\Big(\frac{\partial \mathcal{L}}{\partial Z^l}\Big)^\top, \qquad \frac{\partial \mathcal{L}}{\partial W_0^l} = \frac{\partial \mathcal{L}}{\partial Z^l}. \]

Shape check:

$A^{l-1}\in\mathbb{R}^{d_{l-1}\times 1}$,
$(\tfrac{\partial \mathcal{L}}{\partial Z^l})^\top \in \mathbb{R}^{1\times d_l}$, so their product is $d_{l-1}\times d_l$, consistent with $W^l$.

Check that the final layer ($l=L$) case is a special case of the general layer $l$ case above.

6.5.5 From per-example loss to the objective

All of the derivations above apply to a single training example $(x,y)$. For the full objective \[ J = \frac{1}{n} \sum_{i=1}^n \mathcal{L}\big(g^{(i)},y^{(i)}\big), \] the gradients are the averages: \[ \frac{\partial J}{\partial W^l} = \frac{1}{n}\sum_{i=1}^n \frac{\partial \mathcal{L}^{(i)}}{\partial W^l}, \qquad \frac{\partial J}{\partial W_0^l} = \frac{1}{n}\sum_{i=1}^n \frac{\partial \mathcal{L}^{(i)}}{\partial W_0^l}. \]

Shape check: Each $\tfrac{\partial \mathcal{L}^{(i)}}{\partial W^l}$ has the same dimensions as $W^l$ ($d_{l-1}\times d_l$), so averaging preserves the shape.

6.5.6 Reflecting on backpropagation

This general process of computing the gradients of the loss with respect to the weights is called error back-propagation.

The idea is that we first do a forward pass to compute all the $a$ and $z$ values at all the layers, and finally the actual loss. Then, we can work backward and compute the gradient of the loss with respect to the weights in each layer, starting at layer $L$ and going back to layer 1.

$\begin{tikzpicture}[ scale=.98, background rectangle/.style={fill=white}, show background rectangle ] \coordinate (x) at (0,0); \node[inner sep=0em] (w1) at (1.7,0) {\begin{tabular}{c} $W^1$ \\ $W^1_0$\end{tabular}}; \node[inner sep=1em] (f1) at ($2*(w1)$) {$f^1$}; \node[inner sep=0em] (w2) at ($3*(w1)$) {\begin{tabular}{c} $W^2$ \\ $W^2_0$\end{tabular}}; \node[inner sep=1em] (f2) at ($4*(w1)$) {$f^2$}; \node (dots) at ($5*(w1)$) {$\cdots$}; \node[inner sep=0em] (wL) at ($6*(w1)$) {\begin{tabular}{c} $W^L$ \\ $W^L_0$\end{tabular}}; \node[inner sep=1em] (fL) at ($7*(w1)$) {$f^L$}; \node (loss) at ($8*(w1)$) {Loss}; \coordinate (y) at ($8*(w1) + (0,1.5)$); \draw[->] (x) -- node[above] {$X = A^0$} (w1); \draw[->] (w1) -- node[above] {$Z^1$} (f1); \draw[->] (f1) -- node[above] {$A^1$} (w2); \draw[->] (w2) -- node[above] {$Z^2$} (f2); \draw[->] (f2) -- node[above] {$A^2$} (dots); \draw[->] (dots) -- node[above] {$A^{L-1}$} (wL); \draw[->] (wL) -- node[above] {$Z^L$} (fL); \draw[->] (fL) -- node[above] {$A^L$} (loss); \draw[->] (y) -- node[right] {$y$} ($(loss)+(0,.65)$); %\draw[->,yshift=0.5cm] (loss.west) to [out=150,in=30] (fL.east); \foreach \s/\e/\t in {loss/fL/A^L, fL/wL/Z^L, wL/dots/A^{L-1}, dots/f2/A^2, f2/w2/Z^2, w2/f1/A^1, f1/w1/Z^1}{ \path[->] ($(\s.west) - (0,.5)$) edge[out=210,in=-30] node[below] {$\frac{\partial \text{loss}}{\partial \t}$} ($(\e.east) - (0,.5)$); } \foreach \point in {w1, f1, w2, f2, wL, fL}{ \draw ($(\point) + (-.3,-.5)$) rectangle ($(\point) + (.3,.5)$); } \draw ($(loss) + (-.4,-.5)$) rectangle ($(loss) + (.4,.5)$); \end{tikzpicture}$

If we view our neural network as a sequential composition of modules (in our work so far, it has been an alternation between a linear transformation with a weight matrix, and a component-wise application of a non-linear activation function), then we can define a simple API for a module that will let us compute the forward and backward passes, as well as do the necessary weight updates for gradient descent. Each module has to provide the following “methods.” We are already using letters $a, x, y, z$ with particular meanings, so here we will use $u$ as the vector input to the module and $v$ as the vector output:

forward: $u \rightarrow v$
backward: $u, v, \partial L / \partial v \rightarrow \partial L / \partial u$

Notice that the backward pass does not output $\partial v / \partial u$, even though the forward pass maps from $u$ to $v$. In the backward pass, we are always directly computing and ``passing around’’ gradients of the loss.

weight grad: $u, \partial L / \partial v \rightarrow \partial L / \partial W$ only needed for modules that have weights $W$

In homework we will ask you to implement these modules for neural network components, and then use them to construct a network and train it as described in the next section.

6.6 Training

Here we go! Here’s how to do stochastic gradient descent training on a feed-forward neural network. After this pseudo-code, we motivate the choice of initialization in lines 2 and 3. The actual computation of the gradient values (e.g., $\partial\text{loss}/\partial A^L$) is not directly defined in this code, because we want to make the structure of the computation clear.

What is $\partial Z^l / \partial W^l$?

Which terms in the code below depend on $f^L$?

\begin{algorithm}\begin{algorithmic} \Procedure{SGD-NEURAL-NET}{$\mathcal{D}_n, T, L,\left(m^1, \ldots, m^L\right),\left(f^1, \ldots, f^{L}\right), \text{Loss}$} \For{$l \gets 1$ to $L$} \State $W_{ij}^l \sim \text{Gaussian}(0, 1/m^l)$ \State $W_{0j}^l \sim \text{Gaussian}(0, 1)$ \EndFor \For{$t \gets 1$ to $T$} \State $i \gets \text{random sample from } \{1,\ldots,n\}$ \State $A^0 \gets x^{(i)}$ \Comment{forward pass to compute $A^L$} \For{$l \gets 1$ to $L$} \State $Z^l \gets W^{l^T} A^{l-1} + W_0^l$ \State $A^l \gets f^l(Z^l)$ \EndFor \State $\text{loss} \gets \text{Loss}(A^L,\ y^{(i)})$ \For{$l \gets L$ down to $1$} \Comment{error back-propagation} \State $\displaystyle \frac{\partial\text{loss}}{\partial A^l} \gets \begin{cases} \frac{\partial Z^{l+1}}{\partial A^l}\cdot\frac{\partial \text{loss}}{\partial Z^{l+1}} & \text{if } l < L,\\ \frac{\partial \text{loss}}{\partial A^L} & \text{otherwise} \end{cases}$ \State $\displaystyle \frac{\partial\text{loss}}{\partial Z^l} \gets \frac{\partial A^l}{\partial Z^l} \cdot \frac{\partial \text{loss}}{\partial A^l}$ \State $\displaystyle \frac{\partial\text{loss}}{\partial W^l} \gets A^{l-1}\,\bigl(\frac{\partial \text{loss}}{\partial Z^l}\bigr)^\top$ \State $\displaystyle \frac{\partial\text{loss}}{\partial W_0^l} \gets \frac{\partial \text{loss}}{\partial Z^l}$ \Comment{SGD update} \State $W^l \gets W^l - \eta(t)\,\frac{\partial \text{loss}}{\partial W^l}$ \State $W_0^l \gets W_0^l - \eta(t)\,\frac{\partial \text{loss}}{\partial W_0^l}$ \EndFor \EndFor \EndProcedure \end{algorithmic} \end{algorithm}

Initializing $W$ is important; if you do it badly there is a good chance the neural network training won’t work well. First, it is important to initialize the weights to random values. We want different parts of the network to tend to “address” different aspects of the problem; if they all start at the same weights, the symmetry will often keep the values from moving in useful directions. Second, many of our activation functions have (near) zero slope when the pre-activation $z$ values have large magnitude, so we generally want to keep the initial weights small so we will be in a situation where the gradients are non-zero, so that gradient descent will have some useful signal about which way to go.

One good general-purpose strategy is to choose each weight at random from a Gaussian (normal) distribution with mean 0 and standard deviation $(1/m)$ where $m$ is the number of inputs to the unit.

If the input $x$ to this unit is a vector of 1’s, what would the expected pre-activation $z$ value be with these initial weights?

We write this choice (where $\sim$ means “is drawn randomly from the distribution”) as $W^l_{ij} \sim \text{Gaussian}\left(0, \frac{1}{m^l}\right).$

It will often turn out (especially for fancier activations and loss functions) that computing $\frac{\partial \text{loss}}{\partial Z^L}$ is easier than computing $\frac{\partial \text{loss}}{\partial A^L}$ and $\frac{\partial A^L}{\partial Z^L}.$ So, we may instead ask for an implementation of a loss function to provide a backward method that computes $\partial \text{loss}/\partial Z^L$ directly.

6.7 Optimizing neural network parameters

Because neural networks are just parametric functions, we can optimize loss with respect to the parameters using standard gradient-descent software, but we can take advantage of the structure of the loss function and the hypothesis class to improve optimization. As we have seen, the modular function-composition structure of a neural network hypothesis makes it easy to organize the computation of the gradient. As we have also seen earlier, the structure of the loss function as a sum over terms, one per training data point, allows us to consider stochastic gradient methods. In this section we’ll consider some alternative strategies for organizing training, and also for making it easier to handle the step-size parameter.

6.7.1 Batches

Assume that we have an objective of the form \[J(W) = \frac{1}{n}\sum_{i = 1}^n \mathcal{L}(h(x^{(i)}; W), y^{(i)})\;\;,\] where $h$ is the function computed by a neural network, and $W$ stands for all the weight matrices and vectors in the network.

Recall that, when we perform batch (or the vanilla) gradient descent, we use the update rule \[ W_t = W_{t-1} - \eta\nabla_W J(W_{t-1})\;\;,\] which is equivalent to \[W_t = W_{t-1} - \eta\sum_{i=1}^n \nabla_W \mathcal{L}(h(x^{(i)}; W_{t-1}), y^{(i)})\;\;. \]

So, we sum up the gradient of loss at each training point, with respect to $W$, and then take a step in the negative direction of the gradient.

In stochastic gradient descent, we repeatedly pick a point $(x^{(i)}, y^{(i)})$ at random from the data set, and execute a weight update on that point alone: \[ W_t = W_{t-1} - \eta \nabla_W \mathcal{L}(h(x^{(i)}; W_{t-1}), y^{(i)})\;\;. \]

As long as we pick points uniformly at random from the data set, and decrease $\eta$ at an appropriate rate, we are guaranteed, with high probability, to converge to at least a local optimum.

These two methods have offsetting virtues. The batch method takes steps in the exact gradient direction but requires a lot of computation before even a single step can be taken, especially if the data set is large. The stochastic method begins moving right away, and can sometimes make very good progress before looking at even a substantial fraction of the whole data set, but if there is a lot of variability in the data, it might require a very small $\eta$ to effectively average over the individual steps moving in “competing” directions.

An effective strategy is to “average” between batch and stochastic gradient descent by using mini-batches. For a mini-batch of size $K$, we select $K$ distinct data points uniformly at random from the data set and do the update based just on their contributions to the gradient \[ W_t = W_{t-1} - \frac{\eta}{K}\sum_{i=1}^K \nabla_W \mathcal{L}(h(x^{(i)}; W_{t-1}), y^{(i)})\;\;. \]

Most neural network software packages are set up to do mini-batches.

For what value of $K$ is mini-batch gradient descent equivalent to stochastic gradient descent? To batch gradient descent?

Picking $K$ unique data points at random from a large data-set is potentially computationally difficult. An alternative strategy, if you have an efficient procedure for randomly shuffling the data set (or randomly shuffling a list of indices into the data set) is to operate in a loop, roughly as follows:

\begin{algorithm}\begin{algorithmic} \Procedure{Mini-Batch-SGD}{\texttt{NN}, \texttt{data}, \texttt{K}} \State $n \gets \text{length}(\texttt{data})$ \While{\texttt{not done}} \State \Call{Random-Shuffle}{\texttt{data}} \For{$i \gets 1$ to $\left\lceil \frac{n}{K} \right\rceil$} \State \Call{Batch-Gradient-Update}{\texttt{NN}, \texttt{data}[(i-1)K : iK]} \EndFor \EndWhile \EndProcedure \end{algorithmic} \end{algorithm}

In line 4 of the algorithm above, $\lceil \cdot \rceil$ is known as the ceiling function; it returns the smallest integer greater than or equal to its input. E.g., $\lceil 2.5 \rceil = 3$ and $\lceil 3 \rceil = 3$.

6.7.2 Adaptive step-size

Picking a value for $\eta$ is difficult and time-consuming. If it’s too small, then convergence is slow and if it’s too large, then we risk divergence or slow convergence due to oscillation. This problem is even more pronounced in stochastic or mini-batch mode, because we know we need to decrease the step size for the formal guarantees to hold.

It’s also true that, within a single neural network, we may well want to have different step sizes. As our networks become deep (with increasing numbers of layers) we can find that magnitude of the gradient of the loss with respect the weights in the last layer, $\partial \text{loss} / \partial W_L$, may be substantially different from the gradient of the loss with respect to the weights in the first layer $\partial \text{loss} / \partial W_1$. If you look carefully at ?eq-gradz, you can see that the output gradient is multiplied by all the weight matrices of the network and is “fed back” through all the derivatives of all the activation functions. This can lead to a problem of exploding or vanishing gradients, in which the back-propagated gradient is much too big or small to be used in an update rule with the same step size.

So, we can consider having an independent step-size parameter for each weight, and updating it based on a local view of how the gradient updates have been going.

This section is very strongly influenced by Sebastian Ruder’s excellent blog posts on the topic:{ruder.io/optimizing-gradient-descent}

Some common strategies for this include momentum (“averaging” recent gradient updates), Adadelta (take larger steps in parts of the space where $J(W)$ is nearly flat), and Adam (which combines these two previous ideas). Details of these approaches are described in Section B.1.

6.8 Regularization

So far, we have only considered optimizing loss on the training data as our objective for neural network training. But, as we have discussed before, there is a risk of overfitting if we do this. The pragmatic fact is that, in current deep neural networks, which tend to be very large and to be trained with a large amount of data, overfitting is not a huge problem. This runs counter to our current theoretical understanding and the study of this question is a hot area of research. Nonetheless, there are several strategies for regularizing a neural network, and they can sometimes be important.

6.8.1 Methods related to ridge regression

One group of strategies can, interestingly, be shown to have similar effects to each other: early stopping, weight decay, and adding noise to the training data.

Result is due to Bishop, described in his textbook and here.

Early stopping is the easiest to implement and is in fairly common use. The idea is to train on your training set, but at every epoch (a pass through the whole training set, or possibly more frequently), evaluate the loss of the current $W$ on a validation set. It will generally be the case that the loss on the training set goes down fairly consistently with each iteration, the loss on the validation set will initially decrease, but then begin to increase again. Once you see that the validation loss is systematically increasing, you can stop training and return the weights that had the lowest validation error.

Warning: If you use your validation set in this way – i.e., to set the number of epochs (or any other hyperparameter associated with your learning algorithm) – then error on the validation set no longer provides a “pure” estimate of error on the test set (i.e., generalization error). This is because information about the validation set has “leaked” into the design of your algorithm. See also the discussion on Validation and Cross-Validation in Chapter 2.

Another common strategy is to simply penalize the norm of all the weights, as we did in ridge regression. This method is known as weight decay, because when we take the gradient of the objective \[ J(W) = \sum_{i = 1}^{n}\mathcal{L}(\text{NN}(x^{(i)}), y^{(i)}; W) + \lambda\|W\|^2 \] we end up with an update of the form

\[ \begin{aligned} W_t & = W_{t-1} - \eta\left(\left(\nabla_{W}\mathcal{L}(\text{NN}(x^{(i)}), y^{(i)}; W_{t-1})\right) + 2\lambda W_{t-1}\right) \\ & = W_{t-1}(1 - 2\lambda\eta) - \eta\left(\nabla_{W}\mathcal{L}(\text{NN}(x^{(i)}), y^{(i)}; W_{t-1})\right) \;\;. \end{aligned} \]

This rule has the form of first “decaying” $W_{t-1}$ by a factor of $(1 - 2 \lambda \eta)$ and then taking a gradient step.

Finally, the same effect can be achieved by perturbing the $x^{(i)}$ values of the training data by adding a small amount of zero-mean normally distributed noise before each gradient computation. It makes intuitive sense that it would be more difficult for the network to overfit to particular training data if they are changed slightly on each training step.

6.8.2 Dropout

Dropout is a regularization method that was designed to work with deep neural networks. The idea behind it is, rather than perturbing the data every time we train, we’ll perturb the network! We’ll do this by randomly, on each training step, selecting a set of units in each layer and prohibiting them from participating. Thus, all of the units will have to take a kind of “collective” responsibility for getting the answer right, and will not be able to rely on any small subset of the weights to do all the necessary computation. This tends also to make the network more robust to data perturbations.

During the training phase, for each training example, for each unit, randomly with probability $p$ temporarily set $a^{\ell}_j = 0$. There will be no contribution to the output and no gradient update for the associated unit.

When we are done training and want to use the network to make predictions, we multiply all weights by $p$ to achieve the same average activation levels.

Implementing dropout is easy! In the forward pass during training, we let \[ a^{\ell} = f(z^{\ell}) * d^{\ell} \] where $*$ denotes component-wise product and $d^{\ell}$ is a vector of $0$’s and $1$’s drawn randomly with probability $p$. The backwards pass depends on $a^{\ell}$, so we do not need to make any further changes to the algorithm.

It is common to set $p$ to $0.5$, but this is something one might experiment with to get good results on your problem and data.

6.8.3 Batch normalization

Another strategy that seems to help with regularization and robustness in training is batch normalization.

For more details see arxiv.org/abs/1502.03167.

It was originally developed to address a problem of covariate shift: that is, if you consider the second layer of a two-layer neural network, the distribution of its input values is changing over time as the first layer’s weights change. Learning when the input distribution is changing is extra difficult: you have to change your weights to improve your predictions, but also just to compensate for a change in your inputs (imagine, for instance, that the magnitude of the inputs to your layer is increasing over time—then your weights will have to decrease, just to keep your predictions the same).

So, when training with mini-batches, the idea is to standardize the input values for each mini-batch, just in the way that we did it in Section 5.3.3 of Chapter 5, subtracting off the mean and dividing by the standard deviation of each input dimension. This means that the scale of the inputs to each layer remains the same, no matter how the weights in previous layers change. However, this somewhat complicates matters, because the computation of the weight updates will need to take into account that we are performing this transformation. In the modular view, batch normalization can be seen as a module that is applied to $z^l$, interposed after the product with $W^l$ and before input to $f^l$.

We follow here the suggestion from the original paper of applying batch normalization before the activation function. Since then it has been shown that, in some cases, applying it after works a bit better. But there aren’t any definite findings on which works better and when.

Although batch-norm was originally justified based on the problem of covariate shift, it’s not clear that that is actually why it seems to improve performance. Batch normalization can also end up having a regularizing effect for similar reasons that adding noise and dropout do: each mini-batch of data ends up being mildly perturbed, which prevents the network from exploiting very particular values of the data points. For those interested, the equations for batch normalization, including a derivation of the forward pass and backward pass, are described in Section B.2.

6.1 Basic element

6.2 Networks

6.2.1 Single layer

6.2.2 Many layers

6.3 Choices of activation function

6.4 Loss functions and activation functions

6.5 Error back-propagation

6.5.1 Notation and Setup

6.5.2 A Single Neuron

6.5.3 A One-Hidden-Layer, Single Output Neuron Network

6.5.4 General \(L\)-Layer Network

6.5.5 From per-example loss to the objective

6.5.6 Reflecting on backpropagation

6.6 Training

6.7 Optimizing neural network parameters

6.7.1 Batches

6.7.2 Adaptive step-size

6.8 Regularization

6.8.1 Methods related to ridge regression

6.8.2 Dropout

6.8.3 Batch normalization