Appendix A — Matrix Calculus

A.1 Single-variable Calculus

We all know how to calculate a derivative in single-variable calculus. If we have an input variable \(x \in \mathbb{R}\) and a function \(f(x): \mathbb{R} \rightarrow \mathbb{R}\), then:

\[ f'(x) = \frac{f(x + dx) - f(x)}{dx} \]

We generally define the numerator as \(df = f(x + dx) - f(x)\), which gives \(\frac{df}{dx}\) as our standard derivative. In other words, we can approximate some small change in the function as the derivative times a small change in our input variable:

\[ df = \frac{df}{dx} \cdot dx \]

We can even, in a very hand-waving way, think of the \(dx\)’s canceling out nicely.

A.2 Multi-variable Calculus

A.2.1 Vector-by-Vector Derivatives

In multi-variable calculus, we have \(x \in \mathbb{R}^n\) and \(f(x): \mathbb{R}^n \rightarrow \mathbb{R}^m\). We will use the following notation, which may seem a bit counter-intuitive at first:

\[ df = \frac{df}{dx}^T dx \]

We are intentionally adding a transpose to the \(\frac{df}{dx}\) term. This means that \(\frac{df}{dx}\) will be of size \((n \times m)\). The input (\(x\)) determines the rows in the derivative, and the output (\(f\)) determines the columns in the derivative. In other words:

\[ \left(\frac{df}{dx}\right)_{i,j} = \frac{df_j}{dx_i} \]

This is called the denominator layout convention, and it has been widely adopted by the machine learning community because it ensures that the shape of the gradient is the same as the shape of the respective parameter. This simplifies calculations considerably, as we will see.

Numerator Layout

Some textbooks instead use the numerator layout, where \(df = \frac{df}{dx} \cdot dx\) without the transpose. In that convention, \(\frac{df}{dx}\) is of size \((m \times n)\)—the transpose of what we define above. Although it may seem more intuitive at first, it makes chain rule computations more complicated. We will not use numerator layout in this course.

A.2.2 Scalar-by-Matrix Derivatives

What if our inputs and outputs are not just vectors, but matrices? In this case, things certainly become more complicated. For instance, note that the number of dimensions needed for a derivative object is the product of the number of dimensions in the input and output of the function. That’s why vector-by-vector derivatives look like matrices. This would mean that vector-by-matrix derivatives would require us to have a three-dimensional derivative object (often referred to as a tensor). These types of high-order objects are beyond the scope of this course.

However, we will deal with one specific case of matrix derivatives; namely, when we have a function \(f\) that maps matrix inputs of size \(m \times n\) to scalar outputs. This derivative will require only a matrix of values to capture its information. But how do we formally define this?

In the vector-by-vector case, we defined a \(df\) by summing up all of the products between \(\frac{df_i}{dx_j}\) and \(dx_j\). We can extend the notation above of accumulating partial derivatives quite naturally to the case of scalar-by-matrix derivatives. Specifically:

\[df = \sum_{i,j} \frac{df}{dA_{i,j}} \cdot dA_{i,j}\]

In other words, we element-wise multiply the elements of \(\frac{df}{dA}\) and \(dA\) and sum up the results! This is essentially a component-wise dot-product for matrices, and immediatly tells us that the size of \(\frac{df}{dA}\) must match the size of \(dA\).

In practice, however, we want to avoid all of these indices, so there is actually a cleaner way to write this expression. First, recall from Linear Algebra that the trace of a matrix is denoted \(tr(A)\) and it sums up the elements along the main diagonal of the matrix. Using this function, we can rewrite the above as:

\[df = tr(\frac{df}{dA}^T \cdot dA)\]

This can be verified by expanding out the indices of the multiplication.

Proof

Consider the matrix \(A\) of size \(m \times n\), and denote \(A_j\) as the \(j\)th column of \(A\). We have: \[ \begin{aligned} df &= tr(\frac{df}{dA}^T \cdot dA) \\ &= \sum_{j=1}^{n} \frac{df}{dA}_j^T \cdot dA_j \\ &= \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{df}{dA}_{i, j} \cdot dA_{i,j} \\ &= \sum_{i,j} \frac{df}{dA}_{i, j} \cdot dA_{i,j} \end{aligned} \] which was the original expression above

Nevertheless, this cleaner notation that avoids indexing directly will be helpful when we take derivatives with respect to weight matrices in Neural Nets, as we will cover later.

A.3 The Shapes of Things

Here are the important special cases of the denominator layout convention:

Scalar-by-scalar: For \(x\) of size \(1\times 1\) and \(y\) of size \(1\times 1\), \(\partial y/\partial x\) is the (scalar) partial derivative of \(y\) with respect to \(x\).
Scalar-by-vector: For \({\bf x}\) of size \(n\times 1\) and \(y\) of size \(1\times 1\), \(\partial y/\partial {\bf x}\) (also written \(\nabla_{\bf x}y\), the gradient of \(y\) with respect to \({\bf x}\)) is a column vector of size \(n\times 1\) with the \(i^{\rm th}\) entry \(\partial y/\partial x_i\):

\[\begin{aligned} \partial y/\partial {\bf x}= \begin{bmatrix} \partial y/\partial x_1 \\ \partial y/\partial x_2 \\ \vdots \\ \partial y/\partial x_n \end{bmatrix}. \end{aligned}\]

Vector-by-scalar: For \(x\) of size \(1\times 1\) and \({\bf y}\) of size \(m\times 1\), \(\partial {\bf y}/\partial x\) is a row vector of size \(1 \times m\) with the \(j^{\rm th}\) entry \(\partial {\bf y}_j/\partial x\):

\[\begin{aligned} \partial {\bf y}/\partial x= \begin{bmatrix} \partial y_1/\partial x & \partial y_2/\partial x & \cdots & \partial y_m/\partial x \end{bmatrix}. \end{aligned}\]

Vector-by-vector: For \({\bf x}\) of size \(n\times 1\) and \({\bf y}\) of size \(m\times 1\), \(\partial {\bf y}/\partial {\bf x}\) is a matrix of size \(n\times m\) with the \((i, j)\) entry \(\partial y_j/\partial x_i\):

\[\begin{aligned} \partial {\bf y}/\partial {\bf x}= \begin{bmatrix} \partial y_1/\partial x_1 & \partial y_2/\partial x_1 & \cdots & \partial y_m/\partial x_1 \\ \partial y_1/\partial x_2 & \partial y_2/\partial x_2 & \cdots & \partial y_m/\partial x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial y_1/\partial x_n & \partial y_2/\partial x_n & \cdots & \partial y_m/\partial x_n \\ \end{bmatrix}. \end{aligned}\]

Scalar-by-matrix: For \({\bf X}\) of size \(n\times m\) and \(y\) of size \(1\times 1\), \(\partial y/\partial {\bf X}\) (also written \(\nabla_{\bf X}y\), the gradient of \(y\) with respect to \({\bf X}\)) is a matrix of size \(n\times m\) with the \((i, j)\) entry \(\partial y/\partial X_{i, j}\):

\[\begin{aligned} \partial y/\partial {\bf X}= \begin{bmatrix} \partial y/\partial X_{1,1} & \cdots & \partial y/\partial X_{1,m} \\ \vdots & \ddots & \vdots \\ \partial y/\partial X_{n,1} & \cdots & \partial y/\partial X_{n,m} \\ \end{bmatrix}. \end{aligned}\]

You may notice that in this list, we have not included matrix-by-matrix, matrix-by-vector, or vector-by-matrix derivatives. This is because, generally, they cannot be expressed nicely in matrix form and require higher order objects (e.g., tensors) to represent their derivatives. These cases are beyond the scope of this course.

Additionally, notice that for all cases, you can explicitly compute each element of the derivative object using (scalar) partial derivatives. You may find it useful to work through some of these by hand as you are reviewing matrix derivatives.

A.4 Examples

A.4.1 Example 1 - Simple Numerical Example

Let us calculate the derivative of \(f(\theta) = \theta_1^2 - 3\theta_2 + 5\). The input is a vector of size 2, and the output is a scalar. Therefore, our derivative \(\frac{df}{d\theta}\) will be of size \((2 \times 1)\). More specifically, we know that:

\[ \frac{df}{d\theta} = \begin{bmatrix} \frac{df}{d\theta_1} \\ \frac{df}{d\theta_2} \end{bmatrix} = \begin{bmatrix} 2\theta_1 \\ -3 \end{bmatrix} \]

A.4.2 Example 2: \(f(\theta) = X\theta\)

We begin with a very simple function: \(f(\theta) = X\theta\). \(f\) takes in a vector \(\theta\) of size \((n \times 1)\), multiplies by a matrix \(X\) of size \((m \times n)\), and thus returns a vector of size \((m \times 1)\). Let us calculate \(\frac{df}{d\theta}\):

\[ \begin{align*} df &= f(\theta + d\theta) - f(\theta) \\ &= X(\theta + d\theta) - X\theta \\ &= X\theta + Xd\theta - X\theta \\ &= Xd\theta \end{align*} \]

This exactly matches our standard denominator layout form:

\[ df = \frac{df}{d\theta}^T d\theta \]

where:

\[ \frac{df}{d\theta} = X^T \]

Done!

A.4.3 Example 3: \(f(v) = ||v||^2\)

Note that \(f\) maps a column vector of size \(n\) to a scalar. Therefore, the derivative \(\frac{df}{dv}\) will be of size \((n \times 1)\). There are two ways to calculate this:

A.4.3.1 Old Approach

Take the partial derivatives at each index.

\[ \frac{df}{dv} = \begin{bmatrix} \frac{df}{dv_1} \\ \vdots \\ \frac{df}{dv_n} \end{bmatrix} = \begin{bmatrix} 2v_1 \\ \vdots \\ 2v_n \end{bmatrix} = 2v \]

A.4.3.2 New Approach

Let us use matrix calculus and the differential notation.

\[ \begin{align*} df &= f(v + dv) - f(v) \\ &= ||(v + dv)||^2 - ||v||^2 \\ &= (v + dv)^T(v + dv) - ||v||^2 \\ &= ||v||^2 + v^T dv + (dv)^T v + ||dv||^2 - ||v||^2 \\ &= (2v)^T dv + ||dv||^2 \end{align*} \]

Just like in single-variable calculus, we can drop the squared \(dv\) term since it grows smaller much faster than our linear \(dv\) term.

Thus we are left with

\[ df = (2v)^T dv \]

Comparing this to our standard form:

\[ df = \frac{df}{dv}^T dv \]

we see clearly that

\[ \frac{df}{dv} = 2v \]

A.4.4 Example 4 - Chain Rule

This is where the advantage of matrix calculus—and more specifically the denominator layout—really shines.

Let us assume we have a composition of functions \(f(g(x))\). We want to calculate the derivative of \(f\) with respect to \(x\). As a first step, let’s just look at how \(f\) changes with respect to \(g\):

\[ df = \frac{df}{dg}^T \cdot dg \]

Now we also know how \(dg\) changes with respect to \(dx\), so we can substitute that in:

\[ df = \frac{df}{dg}^T \cdot \left(\frac{dg}{dx}^T \cdot dx\right) \]

Simplifying gives:

\[ \begin{align*} df &= \frac{df}{dg}^T \cdot \left(\frac{dg}{dx}^T \cdot dx\right) \\ &= \left(\frac{df}{dg}^T \cdot \frac{dg}{dx}^T\right) \cdot dx \\ &= \left(\frac{dg}{dx} \cdot \frac{df}{dg}\right)^T \cdot dx \end{align*} \]

Comparing this to our standard form gives:

\[ \frac{df}{dx} = \frac{dg}{dx} \cdot \frac{df}{dg} \]

Important: Notice how the products go inner → outer as we move from left → right. This will be the general pattern as we compose more and more functions. The innermost derivatives will be on the left, and the outermost will be on the right. This is super important. In general, matrix multiplication is not commutative and the dimensions won’t even work out if you try it any other way.

For example, if we had \(f(g(h(x)))\), then:

\[ \frac{df}{dx} = \frac{dh}{dx} \cdot \frac{dg}{dh} \cdot \frac{df}{dg} \]

Check for yourself that the dimensions will work out if we multiply in this way!

A.5 Vector-by-vector Identities

In this section, we collect useful identities for \(\partial {\bf y}/ \partial {\bf x}\). In each case, assume \({\bf x}\) is \(n \times 1\), \({\bf y}\) is \(m \times 1\), \(a\) is a scalar constant, \({\bf a}\) is a vector that does not depend on \({\bf x}\) and \({\bf A}\) is a matrix that does not depend on \({\bf x}\), \(u\) and \(v\) are scalars that do depend on \({\bf x}\), and \({\bf u}\) and \({\bf v}\) are vectors that do depend on \({\bf x}\). We also have vector-valued functions \({\bf f}\) and \({\bf g}\).

A.5.1 Some Fundamental Cases

First, we will cover a couple of fundamental cases: suppose that \({\bf a}\) is an \(m \times 1\) vector which is not a function of \({\bf x}\), an \(n\times1\) vector. Then,

\[ \frac{\partial {\bf a}}{\partial {\bf x}} = {\bf 0}, \tag{A.1}\]

is an \(n \times m\) matrix of 0s. This is similar to the scalar case of differentiating a constant. Next, we can consider the case of differentiating a vector with respect to itself:

\[ \frac{\partial {\bf x}}{\partial {\bf x}} = {\bf I} \]

This is the \(n \times n\) identity matrix, with 1’s along the diagonal and 0’s elsewhere. It makes sense, because \(\partial {\bf x}_j / {\bf x}_i\) is 1 for \(i = j\) and 0 otherwise. This identity is also similar to the scalar case.

A.5.2 Derivatives Involving a Constant Matrix

Let the dimensions of \({\bf A}\) be \(m \times n\). Then the object \({\bf A}{\bf x}\) is an \(m\times 1\) vector. We can then compute the derivative of \({\bf A}{\bf x}\) with respect to \({\bf x}\) as:

\[\frac{\partial {\bf A}{\bf x}}{\partial {\bf x}} = \begin{bmatrix} \partial ({\bf A}{\bf x})_1/\partial x_1 & \partial ({\bf A}{\bf x})_2/\partial x_1 & \cdots & \partial ({\bf A}{\bf x})_m/\partial x_1 \\ \partial ({\bf A}{\bf x})_1/\partial x_2 & \partial ({\bf A}{\bf x})_2/\partial x_2 & \cdots & \partial ({\bf A}{\bf x})_m/\partial x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial ({\bf A}{\bf x})_1/\partial x_n & \partial ({\bf A}{\bf x})_2/\partial x_n & \cdots & \partial ({\bf A}{\bf x})_m/\partial x_n \\ \end{bmatrix}\]

Note that any element of the column vector \({\bf A}{\bf x}\) can be written as, for \(j = 1, \ldots, m\):

\[({\bf A}{\bf x})_j = \sum_{k=1}^n A_{j,k} x_k.\]

Thus, computing the \((i,j)\) entry of \(\frac{\partial {\bf A}{\bf x}}{\partial {\bf x}}\) requires computing the partial derivative \(\partial ({\bf A}{\bf x})_j/\partial x_i:\)

\[\begin{aligned} \partial ({\bf A}{\bf x})_j/\partial x_i = \partial \left( \sum_{k=1}^n A_{j,k} x_k \right)/ \partial x_i = A_{j,i} \end{aligned}\]

Therefore, the \((i,j)\) entry of \(\frac{\partial {\bf A}{\bf x}}{\partial {\bf x}}\) is the \((j,i)\) entry of \({\bf A}\):

\[\frac{\partial {\bf A}{\bf x}}{\partial {\bf x}} = {\bf A}^T \tag{A.2}\]

Similarly, for objects \({\bf x}, {\bf A}\) of the same shape, one can obtain,

\[\frac{\partial {\bf x}^T {\bf A}}{\partial {\bf x}} = {\bf A} \tag{A.3}\]

A.5.3 Linearity of Derivatives

Suppose that \({\bf u}, {\bf v}\) are both vectors of size \(m \times 1\). Then,

\[\frac{\partial ({\bf u}+ {\bf v})}{\partial {\bf x}} = \frac{\partial {\bf u}}{\partial {\bf x}} + \frac{\partial {\bf v}}{\partial {\bf x}} \tag{A.4}\]

Suppose that \(a\) is a scalar constant and \({\bf u}\) is an \(m\times 1\) vector that is a function of \({\bf x}\). Then,

\[ \frac{\partial a {\bf u}}{\partial {\bf x}} = a \frac{\partial {\bf u}}{ \partial {\bf x}} \]

One can extend the previous identity to vector- and matrix-valued constants. Suppose that \({\bf a}\) is a vector with shape \(m \times 1\) and \(v\) is a scalar which depends on \({\bf x}\). Then,

\[\frac{\partial v {\bf a}}{\partial {\bf x}} = \frac{\partial v}{\partial {\bf x}}{\bf a}^T \]

First, checking dimensions, \(\partial v/\partial {\bf x}\) is \(n \times 1\) and \({\bf a}\) is \(m \times 1\) so \({\bf a}^T\) is \(1 \times m\) and our answer is \(n \times m\) as it should be. Now, checking a value, element \((i,j)\) of the answer is \(\partial v {\bf a}_j / \partial x_i\) = \((\partial v / \partial x_i) {\bf a}_j\) which corresponds to element \((i,j)\) of \((\partial v / \partial {\bf x}){\bf a}^T\).

Similarly, suppose that \({\bf A}\) is a matrix which does not depend on \({\bf x}\) and \({\bf u}\) is a column vector which does depend on \({\bf x}\). Then,

\[ \frac{\partial {\bf A}{\bf u}}{\partial {\bf x}} = \frac{\partial {\bf u}}{\partial {\bf x}}{\bf A}^T \]

A.5.4 Product Rule (Vector-valued Numerator)

Suppose that \(v\) is a scalar which depends on \({\bf x}\), while \({\bf u}\) is a column vector of shape \(m\times 1\) and \({\bf x}\) is a column vector of shape \(n \times 1\). Then,

\[ \frac{\partial v {\bf u}}{\partial {\bf x}} = v\frac{\partial {\bf u}}{\partial {\bf x}} + \frac{\partial v}{\partial {\bf x}} {\bf u}^T \]

One can see this relationship by expanding the derivative as follows:

\[\begin{aligned} \frac{\partial v {\bf u}}{\partial {\bf x}} = \begin{bmatrix} \partial (v u_1)/\partial x_1 & \partial (v u_2)/\partial x_1 & \cdots & \partial (v u_m)/\partial x_1 \\ \partial (v u_1)/\partial x_2 & \partial (v u_2)/\partial x_2 & \cdots & \partial (v u_m)/\partial x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial (v u_1)/\partial x_n & \partial (v u_2)/\partial x_n & \cdots & \partial (v u_m)/\partial x_n \\ \end{bmatrix}. \end{aligned} \]

Then, one can use the product rule for scalar-valued functions, \[ \begin{aligned} \partial (v u_j)/\partial x_i = v (\partial u_j / \partial x_i) + (\partial v / \partial x_i) u_j, \end{aligned} \] to obtain the desired result.

A.5.5 Chain Rule Identity

Suppose that \({\bf g}\) is a vector-valued function with output vector of shape \(m \times 1\), and the argument to \({\bf g}\) is a column vector \({\bf u}\) of shape \(d\times 1\) which depends on \({\bf x}\). Then, one can obtain the chain rule as,

\[\frac{\partial {\bf g}({\bf u})}{\partial {\bf x}} = \frac{\partial {\bf u}}{\partial {\bf x}}\frac{\partial {\bf g}({\bf u})}{\partial {\bf u}} \]

Following “the shapes of things,” \(\partial {\bf u}/ \partial {\bf x}\) is \(n \times d\) and \(\partial {\bf g}({\bf u}) / \partial {\bf u}\) is \(d \times m\), where element \((i,j)\) is \(\partial {\bf g}({\bf u})_j / \partial {\bf u}_i\). The same chain rule applies for further compositions of functions:

\[ \frac{\partial {\bf f}({\bf g}({\bf u}))}{\partial {\bf x}} = \frac{\partial {\bf u}}{\partial {\bf x}}\frac{\partial {\bf g}({\bf u})}{\partial {\bf u}} \frac{\partial {\bf f}({\bf g})}{\partial {\bf g}} \]

A.5.6 Some Other Identities

You can get many scalar-by-vector and vector-by-scalar cases as special cases of the rules above, making one of the relevant vectors just be 1 x 1. Here are some other ones that are handy. For more, see the Wikipedia article on Matrix derivatives (for consistency, only use the ones in denominator layout).

\[ \frac{\partial {\bf u}^T {\bf v}}{\partial {\bf x}} = \frac{\partial {\bf u}}{\partial {\bf x}} {\bf v}+ \frac{\partial {\bf v}}{\partial {\bf x}} {\bf u} \tag{A.5}\]

\[\frac{\partial {\bf u}^T}{\partial x} = \big(\frac{\partial {\bf u}}{\partial x}\big)^T \tag{A.6}\]

A.6 MSE for Linear Regression

From lecture/recitation, we know that the Mean Squared Error for Linear Regression can be written as:

\[ J(\theta) = \frac{1}{n} ||X\theta - Y||^2 \]

where \(X\) and \(\theta\) are the augmented forms to allow for an intercept. Let us show how to optimize this function by setting its gradient equal to 0. We combine all of the facts that we learned. We know that the derivative of \(||v||^2\) with respect to \(v\) is \(2v\), the derivative of \(X\theta\) with respect to \(\theta\) is \(X^T\), and we know the chain rule. Combining all of that gives:

\[ \frac{dJ}{d\theta} = \frac{2}{n} X^T \cdot (X\theta - Y) \]

Now, to optimize, we set the derivative equal to 0:

\[ \frac{dJ}{d\theta} = \frac{2}{n} X^T \cdot (X\theta - Y) = 0 \implies \theta^* = (X^TX)^{-1}X^TY \]

That’s it! You can now derive the optimal solution for any linear regression problem.

A.7 MSE for Regularized Linear Regression

In lecture, we also encountered the regularized MSE expression for Linear Regression:

\[ J(\theta) = \frac{1}{n} ||X\theta - Y||^2 + \lambda || \theta ||^2 \]

where \(\lambda\) is some scalar hyperparameter we can choose.

Calculating this derivative is basically the same as last time, except there’s an additional squared term at the end that we can differentiate separately. Setting the derivative with respect to \(\theta\) to 0, gives:

\[ \frac{dJ}{d\theta} = \frac{2}{n} X^T \cdot (X\theta^* - Y) + 2\lambda \theta^* = 0 \]

which simplifies to:

\[ 0 = X^T X\theta^* - X^T Y + n\lambda \theta^* \]

\[ \theta^* = (X^T X + n\lambda I)^{-1} X^T Y \]

A.8 Matrix Derivatives Using Einstein Summation

You do not have to read or learn this! But you might find it interesting or helpful.

Consider the objective function for linear regression, written out as products of matrices:

\[\begin{aligned} J(\theta) = \frac{1}{n} (X\theta - Y)^T (X\theta-Y) \,, \end{aligned} \]

where \(X\) is \(n\times d\), \(Y\) is \(n\times 1\), and \(\theta\) is \(d\times 1\). How does one show, with no shortcuts, that

\[\begin{aligned} \nabla_{\theta}J = \frac{2}{n} {X^T} {(X\theta - Y)} \;\;? \end{aligned} \]

One neat way, which is very explicit, is to simply write all the matrices as variables with row and column indices, e.g., \(X_{ab}\) is the row \(a\), column \(b\) entry of the matrix \(X\). Furthermore, let us use the convention that in any product, all indices which appear more than once get summed over; this is a popular convention in theoretical physics, and lets us suppress all the summation symbols which would otherwise clutter the following expresssions. For example, \(X_{ab} \theta_b\) would be the implicit summation notation giving the element at the \(a^{\rm th}\) row of the matrix-vector product \(X\theta\).

Using implicit summation notation with explicit indices, we can rewrite \(J(\theta)\) as

\[\begin{aligned} J(\theta) = \frac{1}{n} \left( X_{ab} \theta_b - Y_a \right) \left( X_{ac}\theta_c - Y_a \right) \,. \end{aligned}\]

Note that we no longer need the transpose on the first term, because all that transpose accomplished was to take a dot product between the vector given by the left term, and the vector given by the right term. With implicit summation, this is accomplished by the two terms sharing the repeated index \(a\).

Taking the derivative of \(J\) with respect to the \(d^{\rm th}\) element of \(\theta\) thus gives, using the chain rule for (ordinary scalar) multiplication:

\[\begin{aligned} \frac{dJ}{d\theta_d} &=& \frac{1}{n} \left[ X_{ab} \delta_{bd} \left( X_{ac}\theta_c-Y_a \right) + \left(X_{ab}\theta_b - Y_a \right) X_{ac} \delta_{cd} \right] \\ &=& \frac{1}{n} \left[ X_{ad} \left( X_{ac}\theta_c-Y_a \right) + \left(X_{ab}\theta_b - Y_a \right) X_{ad} \right] \\ &=& \frac{2}{n} X_{ad} \left( X_{ab}\theta_b-Y_a \right) \,, \end{aligned}\] where the second line follows from the first, with the definition that \(\delta_{bd} = 1\) only when \(b=d\) (and similarly for \(\delta_{cd}\)). And the third line follows from the second by recognizing that the two terms in the second line are identical. Now note that in this implicit summation notation, the \(a, b\) element of the matrix product of \(A\) and \(B\) is \((AB)_{ac} = A_{ab}B_{bc}\). That is, ordinary matrix multiplication sums over indices which are adjacent to each other, because a row of \(A\) times a column of \(B\) becomes a scalar number. So the term in the above equation with \(X_{ad} X_{ab}\) is not a matrix product of \(X\) with \(X\). However, taking the transpose \(X^T\) switches row and column indices, so \(X_{ad} = X^T_{da}\). And \(X^T_{da} X_{ab}\) is a matrix product of \(X^T\) with \(X\)! Thus, we have that

\[\begin{aligned} \frac{dJ}{d\theta_d} =& \frac{2}{n} X^T_{da} \left( X_{ab}\theta_b-Y_a \right) \\ =& \frac{2}{n} \left[ X^T \left( X\theta-Y\right) \right]_{d} \,, \end{aligned} \] which is the desired result.

A.9 Matrix Calculus for Neural Networks

Let us try applying everything that we’ve learned about Matrix Calculus to the case of Neural Networks. This boils to three main principles: 1. Chain Rule 2. Element-wise operations 3. Derivatives with respect to matrices

Let’s cover / remind ourselves of each of these points

A.9.1 1. Chain Rule

Let’s say we are given a composition of functions of the form \(f(g(h(x)))\), then we proved in Section A.4.4 that:

\[\frac{df}{dx} = \frac{dh}{dx} \cdot \frac{dg}{dh} \cdot \frac{df}{dg}\]

where as we move from left to right, we move from the inner most function to the outermost.

A.9.2 2. Element-Wise Operations

Let’s say we are given a vector \(v \in \mathbb{R}^n\) and a function \(f: \mathbb{R} \rightarrow \mathbb{R}\). Let’s define \(f.: \mathbb{R}^n \rightarrow \mathbb{R}^n\) as the function that applies \(f\) element-wise.

This shows up very often in Neural Nets, when we apply an element-wise activation to the neurons in a layer of the network.

What is the derivative of this operation? Note that \(f.\) maps vectors to vectors of the same size, so we expect to get a square matrix as the derivative.

Then: \[ \begin{aligned} dF &= f.(v + dv) - f.(v) \\ &= [f(v_1 + dv_1), \ldots, f(v_n + dv_n)]^T - [f(v_1), \ldots, f(v_n)]^T \\ &= [f(v_1 + dv_1) - f(v_1), \ldots, f(v_n + dv_n) - f(v_n)]^T \\ &= [f'(v_1) dv_1, \ldots, f'(v_n) dv_n]^T \\ &= \begin{bmatrix} f'(v_1) & 0 & \cdots & 0 \\ 0 & f'(v_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f'(v_n) \end{bmatrix} dv \end{aligned} \]

where \(f'(x)\) represents the standard one-dimensional derivative of the function with respect to \(x\). We thus have that:

\[\frac{dF}{dv} = \begin{bmatrix} f'(v_1) & 0 & \cdots & 0 \\ 0 & f'(v_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f'(v_n) \end{bmatrix}\]

We often abbreviate this as \(\frac{dF}{dv} = \text{diag}(f'.(v))\), where the diagonal matrix contains the derivatives evaluated at each element of \(v\).

A.9.3 Example

Before we get to the third rule, let’s do an example using what we’ve learned so far.

Consider the following case:

We begin with an input \(x \in \mathbb{R}^n\) and associated output \(y \in \mathbb{R}^m\)
We then apply a linear layer of the form \(g = W^Tx\), where \(W \in \mathbb{R}^{n \times m}\) and \(g \in \mathbb{R}^m\)
We then apply an element-wise ReLU: \(a = \text{ReLU}.(g)\)
Finally we calculate some loss \(\mathcal{L}(a, y) = ||a - y||^2\)

How can we calculate the derivative with respect to the input \(x\)? Note that we normally want to calculate the derivative with respect to the weight matrix, but let’s start with this example first.

We apply the chain rule:

\[ \frac{dL}{dx} = \frac{dg}{dx} \cdot \frac{da}{dg} \cdot \frac{dL}{da} \]

Now let’s consider each term separately. First, \(\frac{dg}{dx}\) is simply \(W\). Second, we know that the derivative of the ReLU function applied to a scalar \(x\) is simply an indicator variable that is \(1\) if \(x > 0\) and \(0\) else. We use the notation \(\mathbb{1}_{g > 0}\) to denote the element-wise indicator function, so the derivative is:

\[ \frac{da}{dg} = \begin{bmatrix} \mathbb{1}_{g_1 > 0} & 0 & \cdots & 0 \\ 0 & \mathbb{1}_{g_2 > 0} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbb{1}_{g_m > 0} \end{bmatrix} \]

Third, we know from earlier that the derivative of \(\mathcal{L}(a, y) = ||a - y||^2\) with respect to \(a\) is:

\[ \frac{d\mathcal{L}}{da} = 2(a - y) \]

Now we can chain them all together:

\[ \frac{d\mathcal{L}}{dx} = W \cdot \text{diag}(\mathbb{1}_{g > 0}) \cdot 2(a - y) \]

This tells us exactly how the loss changes with respect to the input \(x\).

A.9.4 Derivative with respect to Weight Matrices

What if we wanted – as we do in training Neural nets – to take the derivative of the loss with respect to the weight matrix \(W\). Something with this approach goes wrong. Writing out the chain rule, we get:

\[ \frac{dL}{dW} = \frac{dg}{dW} \cdot \frac{da}{dg} \cdot \frac{dL}{da} \]

What type of object is \(\frac{dg}{dW}\)? It is a vector-by-matrix derivative, which means it actually must be a tensor! However, we know that \(\frac{dL}{dW}\) must be able to be repsented by just a matrix! So how can we reconcile these facts? The answer lies in the fact that \(\frac{dg}{dW}\) is a sparse tensor (many of its entires are \(0\)) and using some clever matrix calculus tricks we can simplify our chain rule into something more managable.

Remember from Section A.2.2 that \(\frac{dL}{dW}\) must take the form of:

\[dL = tr(\frac{dL}{dW}^T \cdot dW) \tag{A.7}\]

How can we massage our chain rule expression above into this form?

First, note that by definition, we have:

\[ dL = \frac{dL}{dg}^T \cdot dg \]

and we know

\[ dg = (W + dW)^Tx - W^Tx = (dW)^Tx \]

This gives:

\[ dL = \frac{dL}{dg}^T (dW)^Tx \]

Not that the expression on the RHS is actually a scalar (since \(dL\) is a scalar), and so we can simply take its trace, since the trace of a scalar is just that same scalar. Also recall that the trace is cyclic and invariant to transpose, so \(tr(ABC) = tr(CAB)\) and \(tr(A) = tr(A^T)\).

\[ \begin{aligned} dL &= \frac{dL}{dg}^T (dW)^Tx \\ &= tr( \frac{dL}{dg}^T (dW)^Tx) \\ &= tr( x^T dW \frac{dL}{dg}) \\ &= tr( \frac{dL}{dg} x^T dW) \\ &= tr( (x \cdot \frac{dL}{dg}^T)^T dW) \end{aligned} \]

where the second line uses the fact that the RHS is a scalar, the third uses the fact that the trace is invariant to transposing its input, the fourth uses the cyclic property, and the fifth simply combines the terms to match the form of Equation A.7.

Comparing to Equation A.7, we can identify:

\[ \frac{dL}{dW} = x \cdot \frac{dL}{dg}^T \]

And there you have it! To take the derivative of the loss with respect to a weight matrix, you simply do

\[ \frac{dL}{dW} = x \cdot \frac{dL}{dg}^T \]

where \(x\) is the input into that weight matrix and \(g\) is the output from it. Substituting in from the example above, we get:

\[ \frac{dL}{dW} = x \cdot (\text{diag}(\mathbb{1}_{g > 0}) \cdot 2(a - y))^T = 2x(a - y)^T \cdot \text{diag}(\mathbb{1}_{g > 0}) \]

Notice that this has the correct shape! We have \(x \in \mathbb{R}^{n \times 1}\), \((a - y)^T \in \mathbb{R}^{1 \times m}\), and the diagonal matrix \(\text{diag}(\mathbb{1}_{g > 0})\) has shape \(m \times m\), so their product is \(\mathbb{R}^{n \times m}\), which matches the shape of \(W\).

This is the key insight for backpropagation in neural networks. This allows us to efficiently compute how the loss changes with respect to each weight, which is exactly what we need for training neural networks via gradient descent.

A.9.5 Key Takeaways

Finish up here!