3 Gradient Descent
In the previous chapter, we showed how to describe an interesting objective function for machine learning, but we need a way to find the optimal $\Theta^* = \arg\min_{\Theta} J(\Theta)$.
There is an enormous and fascinating literature on the mathematical and algorithmic foundations of optimization, but for this class, we will consider one of the simplest methods, called gradient descent.
You might want to consider studying optimization some day! It’s one of the fundamental tools enabling machine learning, and it’s a beautiful and deep field.
Intuitively, in one or two dimensions, we can easily think of $J(\Theta)$ as defining a surface over $\Theta$; that same idea extends to higher dimensions. Our goal is to find the $\Theta$ that gives the lowest point on that surface. One way to think about gradient descent is that you start at some arbitrary point on the surface, look to see in which direction the hill goes down most steeply, take a small step in that direction, determine the direction of steepest descent from where you now are, take another small step, and so on.
Below, we explicitly give gradient descent algorithms for one and multidimensional objective functions (Section 3.1 and Section 3.2). We then illustrate the application of gradient descent to a loss function which is not merely mean squared loss (Section 3.3). And we present an important method known as stochastic gradient descent (Section 3.4), which is especially useful when datasets are too large for descent in a single batch, and has some important behaviors of its own.
3.1 Gradient descent in one dimension
We start by considering gradient descent in one dimension. Assume $\Theta \in \mathbb{R}$, and that we know both the objective $f(\Theta)$ and its first derivative $f'(\Theta)$. Below is pseudocode for gradient descent on an arbitrary function $f$. Along with $f$ and $f'$, we have to specify an initial value for the parameter, $\Theta_{\text{init}}$; a step-size hyper-parameter $\eta$; and an accuracy hyper-parameter $\epsilon$.
The hyper-parameter $\eta$ is often called the learning rate when gradient descent is applied in machine learning.
1: procedure 1D-Gradient-Descent($\Theta_{\text{init}}$, $\eta$, $f$, $f'$, $\epsilon$)
2:   $\Theta^{(0)} = \Theta_{\text{init}}$
3:   $t = 0$
4:   repeat
5:     $t = t + 1$
6:     $\Theta^{(t)} = \Theta^{(t-1)} - \eta\, f'(\Theta^{(t-1)})$
7:   until $\left|f'(\Theta^{(t)})\right| < \epsilon$
8:   return $\Theta^{(t)}$
9: end procedure
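To make this concrete, here is a direct Python transcription of the pseudocode; the function names and the example objective $f(x) = (x-2)^2$ are ours, for illustration only.

```python
def gd_1d(theta_init, eta, df, epsilon, max_iter=10000):
    """1D gradient descent: step against the derivative until it is small.

    theta_init: initial parameter value
    eta:        step size (learning rate)
    df:         the derivative f' of the objective
    epsilon:    terminate when |f'(theta)| < epsilon
    max_iter:   safety cap on the number of iterations
    """
    theta = theta_init
    for _ in range(max_iter):
        theta = theta - eta * df(theta)  # the update from line 6
        if abs(df(theta)) < epsilon:     # the until-test from line 7
            break
    return theta

# Example: f(x) = (x - 2)^2 has derivative f'(x) = 2 * (x - 2),
# so gradient descent should approach x = 2.
print(gd_1d(4.0, 0.1, lambda x: 2 * (x - 2), 1e-6))  # ~2.0
```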
Note that this algorithm terminates when the derivative of the function $f$ is sufficiently small. There are many other reasonable ways to decide to terminate, including:
- Stop after a fixed number of iterations $T$, i.e., when $t = T$. Practically, this is the most common choice.
- Stop when the change in the value of the parameter $\Theta$ is sufficiently small, i.e., when $\left|\Theta^{(t)} - \Theta^{(t-1)}\right| < \epsilon$ (sketched below).
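Here is that sketch, with the same invented names as the earlier example:

```python
def gd_1d_param_change(theta_init, eta, df, epsilon, max_iter=10000):
    """Variant of gd_1d that stops when the parameter stops moving."""
    theta = theta_init
    for _ in range(max_iter):                # criterion: fixed iteration cap
        new_theta = theta - eta * df(theta)
        if abs(new_theta - theta) < epsilon: # criterion: small parameter change
            return new_theta
        theta = new_theta
    return theta
```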
Consider all of the potential stopping criteria for 1D-Gradient-Descent, both the one in the algorithm as it appears and those listed above. Can you think of ways that any two of the criteria relate to each other?
Theorem 3.1 Choose any small distance $\tilde{\epsilon} > 0$. If we assume that $f$ has a minimum, is sufficiently “smooth” and convex, and if the step size $\eta$ is sufficiently small, gradient descent will reach a point within $\tilde{\epsilon}$ of a global minimum point $\Theta$.
However, we must be careful when choosing the learning rate to prevent slow convergence, non-converging oscillation around the minimum, or divergence.
The following plot illustrates a convex function $f(x) = (x - 2)^2$, with gradient descent started at $x_{\text{init}} = 4.0$ using a step size of $1/2$. It is very well-behaved!
What happens in this example if $\eta$ is very small? What if it is very big?
If $f$ is non-convex, where gradient descent converges depends on $\Theta_{\text{init}}$. With a small enough step size, we expect it to head toward a point where the gradient is zero; when it reaches a $\Theta$ with $f'(\Theta) = 0$ and $f''(\Theta) > 0$ that is not a global minimum point, we call that point a local minimum point or local optimum point.
There are two notable exceptions to this common-sense expectation: First, gradient descent can stagnate while approaching a point that is not a local minimum or maximum but still has zero derivative; for $f(x) = x^3$, for example, descent started at $x_{\text{init}} = 1$ crawls toward $x = 0$ ever more slowly and never gets past it. Second, a function may have no minimum at all, in which case the iterates can grow without bound rather than converge.
The plot below shows two different $\Theta_{\text{init}}$ values, and how gradient descent started from each point heads toward two different local optimum points.
3.2 Multiple dimensions
The extension to the case of multi-dimensional $\Theta$ is straightforward. Let's assume $\Theta \in \mathbb{R}^m$, so $f : \mathbb{R}^m \rightarrow \mathbb{R}$.
The gradient of $f$ with respect to $\Theta$ is
$$\nabla_\Theta f = \begin{bmatrix} \partial f / \partial \Theta_1 \\ \vdots \\ \partial f / \partial \Theta_m \end{bmatrix}.$$
The algorithm remains the same, except that the update step in line 6 becomes
$$\Theta^{(t)} = \Theta^{(t-1)} - \eta\, \nabla_\Theta f(\Theta^{(t-1)}),$$
and any termination criterion that depended on $\Theta$ being a scalar must change; for example, we can terminate when the magnitude of the gradient, $\left\|\nabla_\Theta f(\Theta^{(t)})\right\|$, is less than $\epsilon$.
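A minimal numpy sketch of the multi-dimensional algorithm, using the gradient-magnitude stopping test; the names and the example objective are ours:

```python
import numpy as np

def gd(theta_init, eta, grad, epsilon, max_iter=10000):
    """Multi-dimensional gradient descent, given a gradient function."""
    theta = np.array(theta_init, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < epsilon:  # terminate when the gradient is small
            break
        theta = theta - eta * g
    return theta

# Example: f(theta) = ||theta - c||^2 has gradient 2 * (theta - c),
# so gradient descent should approach c.
c = np.array([1.0, -2.0, 3.0])
print(gd(np.zeros(3), 0.1, lambda th: 2 * (th - c), 1e-8))
```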
Which termination criteria from the 1D case were defined in a way that assumes $\Theta$ is a scalar?
3.3 Application to regression
Recall from the previous chapter that choosing a loss function is the first step in formulating a machine-learning problem as an optimization problem, and for regression we studied the mean squared loss, which captures loss as $(\text{guess} - \text{actual})^2$. Applying this to the linear hypothesis $h(x; \theta, \theta_0) = \theta^T x + \theta_0$ gives the objective
$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2.$$
We use the gradient of the objective with respect to the parameters,
$$\nabla_\theta J = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right) x^{(i)},$$
to make our updates in the gradient descent algorithm.
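As a concrete check, here is how this gradient computation might look in numpy; the shape conventions (one data point per column of $X$) and the function name are our choices for the sketch:

```python
import numpy as np

# Shape conventions for this sketch (our choice, not fixed by the text):
# X is d x n with one data point per column, y is 1 x n,
# th is d x 1, and th0 is a scalar.
def mse_gradient(th, th0, X, y):
    """Gradient of the mean squared error objective for a linear hypothesis."""
    n = X.shape[1]
    err = th.T @ X + th0 - y                   # 1 x n row of residuals
    grad_th = (2.0 / n) * (X @ err.T)          # d x 1, matches the formula above
    grad_th0 = (2.0 / n) * float(np.sum(err))  # scalar, for the offset
    return grad_th, grad_th0
```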
3.3.1 Ridge regression
Now, let's add in the regularization term, to get the ridge-regression objective:
$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2 + \lambda \|\theta\|^2.$$
Recall that in ordinary least squares, we finessed handling $\theta_0$ by adding an extra dimension of all 1's. In ridge regression, we really do need to separate the parameter vector $\theta$ from the offset $\theta_0$, and so, from the perspective of our general-purpose gradient descent method, our whole parameter set $\Theta$ is defined to be $\Theta = (\theta, \theta_0)$. We will go ahead and find the gradient with respect to each part separately:
$$\nabla_\theta J_{\text{ridge}} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right) x^{(i)} + 2\lambda\theta$$
$$\frac{\partial J_{\text{ridge}}}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right).$$
Note that $\nabla_\theta J_{\text{ridge}}$ will be of shape $d \times 1$ and $\partial J_{\text{ridge}} / \partial \theta_0$ will be a scalar, since we have separated $\theta_0$ from $\theta$ here.
Convince yourself that the dimensions of all these quantities are correct, under the assumption that $x^{(i)}$ is $d \times 1$ and $\theta$ is $d \times 1$.
Compute $\nabla_\theta \|\theta\|^2$ by finding the vector of partial derivatives $\left( \partial \|\theta\|^2 / \partial \theta_1, \ldots, \partial \|\theta\|^2 / \partial \theta_d \right)$. What is the shape of $\nabla_\theta \|\theta\|^2$?
Compute $\nabla_\theta \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2$ by finding the vector of partial derivatives with respect to the components of $\theta$.
Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for ridge regression becomes
1: procedure RR-Gradient-Descent($\theta_{\text{init}}$, $\theta_{0\text{init}}$, $\eta$, $\epsilon$)
2:   $\theta^{(0)} = \theta_{\text{init}}$
3:   $\theta_0^{(0)} = \theta_{0\text{init}}$
4:   $t = 0$
5:   repeat
6:     $t = t + 1$
7:     $\theta^{(t)} = \theta^{(t-1)} - \eta \left( \frac{1}{n} \sum_{i=1}^{n} \left( \left[\theta^{(t-1)}\right]^T x^{(i)} + \theta_0^{(t-1)} - y^{(i)} \right) x^{(i)} + \lambda \theta^{(t-1)} \right)$
8:     $\theta_0^{(t)} = \theta_0^{(t-1)} - \eta \left( \frac{1}{n} \sum_{i=1}^{n} \left( \left[\theta^{(t-1)}\right]^T x^{(i)} + \theta_0^{(t-1)} - y^{(i)} \right) \right)$
9:   until $\left| J_{\text{ridge}}(\theta^{(t)}, \theta_0^{(t)}) - J_{\text{ridge}}(\theta^{(t-1)}, \theta_0^{(t-1)}) \right| < \epsilon$
10:  return $\theta^{(t)}, \theta_0^{(t)}$
11: end procedure
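A numpy sketch of the whole procedure, with the same shape conventions as before ($X$ is $d \times n$, $y$ is $1 \times n$; the names are ours):

```python
import numpy as np

def ridge_objective(th, th0, X, y, lam):
    """J_ridge for X of shape d x n, y of shape 1 x n, th of shape d x 1."""
    err = th.T @ X + th0 - y
    return float(np.mean(err ** 2)) + lam * float(np.sum(th ** 2))

def ridge_gd(X, y, lam, eta, epsilon, max_iter=100000):
    """A sketch of RR-Gradient-Descent. As in the pseudocode, the 2's from
    the gradient formulas are absorbed into the step size eta."""
    d, n = X.shape
    th, th0 = np.zeros((d, 1)), 0.0
    prev = ridge_objective(th, th0, X, y, lam)
    for _ in range(max_iter):
        err = th.T @ X + th0 - y                      # residuals at (t-1)
        th = th - eta * ((X @ err.T) / n + lam * th)  # line 7
        th0 = th0 - eta * float(np.mean(err))         # line 8
        cur = ridge_objective(th, th0, X, y, lam)
        if abs(cur - prev) < epsilon:                 # stopping test, line 9
            return th, th0
        prev = cur
    return th, th0
```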
Beware double superscripts! Here, $\left[\theta^{(t-1)}\right]^T$ denotes the transpose of the vector $\theta^{(t-1)}$.
Is it okay that $\lambda$ does not appear in line 8?
Is it okay that the 2’s from the gradient definitions don’t appear in the algorithm?
3.4 Stochastic gradient descent
When the form of the gradient is a sum, rather than take one big(ish) step in the direction of the gradient, we can, instead, randomly select one term of the sum, and take a very small step in that direction. This seems sort of crazy, but remember that all the little steps would average out to the same direction as the big step if you were to stay in one place. Of course, you’re not staying in that place, so you move, in expectation, in the direction of the gradient.
Most objective functions in machine learning can end up being written as an average over data points, in which case stochastic gradient descent (sgd) is implemented by picking a data point randomly out of the data set, computing the gradient as if there were only that one point in the data set, and taking a small step in the direction of the negative gradient.
Let's assume our objective has the form
$$f(\Theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\Theta),$$
where $n$ is the number of data points used to define the objective.
Sometimes you will see the objective written as a sum instead of an average. In the “sum” convention, the gradient is $n$ times larger, so the same behavior requires a step size that is $n$ times smaller.
Here is pseudocode for applying sgd to such an objective $f$; it assumes we know the form of $\nabla_\Theta f_i$ for all $i$ in $1 \ldots n$:
1: procedure Stochastic-Gradient-Descent($\Theta_{\text{init}}$, $\eta$, $f$, $\nabla_\Theta f_1, \ldots, \nabla_\Theta f_n$, $T$)
2:   $\Theta^{(0)} = \Theta_{\text{init}}$
3:   for $t = 1$ to $T$ do
4:     randomly select $i \in \{1, 2, \ldots, n\}$
5:     $\Theta^{(t)} = \Theta^{(t-1)} - \eta(t)\, \nabla_\Theta f_i(\Theta^{(t-1)})$
6:   end for
7:   return $\Theta^{(T)}$
8: end procedure
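A numpy sketch of sgd; passing the step-size schedule in as a function of $t$ is our design choice, anticipating the discussion below:

```python
import numpy as np

def sgd(theta_init, eta, grad_i, n, T, seed=0):
    """Stochastic gradient descent sketch.

    eta:     function mapping iteration t (starting at 1) to a step size
    grad_i:  function (theta, i) -> gradient of the single term f_i at theta
    n:       number of terms in the average
    T:       fixed number of iterations
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta_init, dtype=float)
    for t in range(1, T + 1):
        i = rng.integers(n)                       # randomly select a term
        theta = theta - eta(t) * grad_i(theta, i)
    return theta

# Example: f_i(theta) = (theta - x_i)^2, so f is minimized at the mean of x.
# The schedule eta(t) = 1/t satisfies the conditions of Theorem 3.2 below.
x = np.array([1.0, 2.0, 6.0])
print(sgd(np.array([0.0]), lambda t: 1 / t,
          lambda th, i: 2 * (th - x[i]), len(x), 10000))  # ~3.0, the mean of x
```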
Note that now, instead of a fixed step size, the step size $\eta(t)$ is indexed by the iteration of the algorithm, $t$. Choosing a good stopping criterion can be a little harder for sgd than for gd, so here we have simply chosen to stop after a fixed number of iterations $T$.
For sgd to converge to a local optimum point as $t$ increases, the step size has to decrease as a function of time. The next result shows one step-size sequence that works.
Theorem 3.2 If $J$ is convex, and the sequence $\eta(t)$ satisfies
$$\sum_{t=1}^{\infty} \eta(t) = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta(t)^2 < \infty,$$
then SGD converges with probability one to the optimal $\Theta$.
Why these two conditions? The intuition is that the first condition, on $\sum \eta(t)$, is needed to allow for the possibility of an unbounded potential range of exploration, while the second condition, on $\sum \eta(t)^2$, ensures that the step sizes get smaller and smaller as $t$ increases.
One “legal” way of setting the step sizes is to make $\eta(t) = 1/t$, but people often use rules that decrease more slowly, and so don't strictly satisfy the criteria for convergence.
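This choice is legal because the harmonic series diverges while the sum of inverse squares converges:
$$\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty.$$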
If you start a long way from the optimum, would making $\eta(t)$ decrease more slowly tend to make you move more quickly or more slowly toward the optimum?
There are multiple intuitions for why sgd might be a better choice algorithmically than regular gd (which is sometimes called batch gd (bgd)):
- bgd typically requires computing some quantity over every data point in a data set. sgd may perform well after visiting only some of the data. This behavior can be useful for very large data sets, in both runtime and memory savings.
- If your $f$ is actually non-convex, but has many shallow local optimum points that might trap bgd, then taking samples from the gradient at some point $\Theta$ might “bounce” you around the landscape and away from the local optimum points.
- Sometimes, optimizing $f$ really well is not what we want to do, because it might overfit the training set; so, in fact, although sgd might not get lower training error than bgd, it might result in lower test error.