12  Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike other learning paradigms, RL has several distinctive characteristics: the agent gathers its own data by acting in the world, the feedback it receives comes as rewards rather than labeled examples, and the reward for an action may arrive only after a long sequence of further decisions.

In a reinforcement learning problem, the interaction between the agent and the environment follows a repeating cycle:

  1. Agent observes the current state \(s^{(i)}\)
  2. Agent selects and executes an action \(a^{(i)}\)
  3. Agent receives a reward \(r^{(i)}\) from the environment
  4. Agent observes the new state \(s^{(i+1)}\)
  5. Agent selects and executes a new action \(a^{(i+1)}\)
  6. Agent receives a new reward \(r^{(i+1)}\)
  7. This cycle continues…
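
In code, this cycle is just a loop. Below is a minimal Python sketch of it; the `env` object (with `reset` and `step` methods) and the `policy` function are hypothetical stand-ins, not part of any particular library.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# `env.reset()` is assumed to return an initial state; `env.step(a)` is assumed
# to return (reward, next_state, done). Both are hypothetical interfaces.

def run_episode(env, policy, max_steps=1000):
    s = env.reset()                     # observe the initial state s^(0)
    total_reward = 0.0
    for i in range(max_steps):
        a = policy(s)                   # agent selects and executes action a^(i)
        r, s_next, done = env.step(a)   # environment returns reward r^(i) and state s^(i+1)
        total_reward += r
        s = s_next                      # the new state becomes the current state
        if done:
            break
    return total_reward
```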

As in the MDP setting of Chapter 11, the agent’s goal in an RL problem is to learn a policy (a mapping from states to actions) that maximizes its expected cumulative reward over time. This policy guides the agent’s decision-making process, helping it choose actions that lead to the most favorable outcomes.

12.1 Reinforcement learning algorithms overview

Approaches to reinforcement learning differ significantly according to what kind of hypothesis or model is being learned. Roughly speaking, RL methods can be categorized into model-free methods and model-based methods. The main distinction is that model-based methods explicitly learn the transition and reward models to assist the end goal of learning a policy; model-free methods do not. We will start our discussion with the model-free methods, and introduce two of the arguably most popular types of algorithms, Q-learning (Section 12.1.2) and policy gradient (Section 12.3). We then describe model-based methods (Section 12.4). Finally, we briefly consider “bandit” problems (Section 12.5), which can be viewed as a simplified RL setting with no state transitions (or a trivial one-state MDP), where the main issue is exploration versus exploitation across actions.

12.1.1 Model-free methods

Model-free methods are methods that do not explicitly learn transition and reward models. Depending on what is explicitly being learned, model-free methods are sometimes further categorized into value-based methods (where the goal is to learn/estimate a value function) and policy-based methods (where the goal is to directly learn an optimal policy). It’s important to note that such categorization is approximate and the boundaries are blurry. In fact, current RL research tends to combine the learning of value functions, policies, and transition and reward models all into a complex learning algorithm, in an attempt to combine the strengths of each approach.

12.1.2 Q-learning

Q-learning is a frequently used class of RL algorithms that concentrates on learning (estimating) the state-action value function, i.e., the \(\mathrm{Q}\) function. Specifically, recall the MDP value-iteration update:

\[ \begin{equation} \mathrm{Q}(s,a) = \mathrm{R}(s,a) + \gamma \sum_{s'} \mathrm{T}(s,a,s')\max_{a'}\mathrm{Q}(s',a') \end{equation} \]

A point that often confuses students is when we do value iteration and when we do Q-learning. Value iteration assumes you know \(\mathrm{T}\) and \(\mathrm{R}\) and just need to compute \(\mathrm{Q}\). In Q-learning, we don’t know or even directly estimate \(\mathrm{T}\) and \(\mathrm{R}\): we estimate \(\mathrm{Q}\) directly from experience!

The Q-learning algorithm below adapts this value-iteration idea to the RL scenario, where we do not know the transition function \(\mathrm{T}\) or reward function \(\mathrm{R}\), and instead rely on samples to perform the updates.

\begin{algorithm}\begin{algorithmic} \Procedure{Q-Learning}{$\mathcal{S}, \mathcal{A}, \gamma, \alpha, s_0, \text{max\_iter}$} \State $i \gets 0$ \ForAll{$s \in \mathcal{S},\; a \in \mathcal{A}$} \State $\mathrm{Q}_{\text{old}}(s, a) \gets 0$ \EndFor \State $s \gets s_0$ \While{$i < \text{max\_iter}$} \State $a \gets \text{select\_action}\bigl(s,\,\mathrm{Q}_{\text{old}}\bigr)$ \State $(r, s') \gets \text{execute}(a)$ \State $\mathrm{Q}_{\text{new}}(s, a) \gets (1-\alpha)\,\mathrm{Q}_{\text{old}}(s,a)\;+\;\alpha\bigl(r + \gamma \max_{a'}\mathrm{Q}_{\text{old}}(s',a')\bigr)$ \State $s \gets s'$ \State $i \gets i + 1$ \State $\mathrm{Q}_{\text{old}} \gets \mathrm{Q}_{\text{new}}$ \EndWhile \Return $\mathrm{Q}_{\text{new}}$ \EndProcedure \end{algorithmic} \end{algorithm}

With the pseudo‑code provided for Q‑Learning, there are a few key things to note. First, we must determine which state to initialize the learning from. In the context of a game, this initial state may be well defined. In the context of a robot navigating an environment, one may consider sampling the initial state at random. In any case, the initial state is necessary to determine the trajectory the agent will experience as it navigates the environment.

Second, different contexts will influence how we choose when to stop iterating through the while loop. Again, in some games there may be a clear terminating state based on the rules of how the game is played. On the other hand, a robot may be allowed to explore an environment ad infinitum. In such a case, one may consider either setting a fixed number of transitions to take (as done explicitly in the pseudo‑code), or stopping once the values in the Q‑table are no longer changing appreciably after the algorithm has been running for a while.

Finally, a single trajectory through the environment may not be sufficient to adequately explore all state‑action pairs. In these instances, it becomes necessary to run the Q‑Learning algorithm a number of times, potentially with different choices of initial state \(s_0\). This notion of running a number of instances of Q‑Learning is often referred to as experiencing multiple episodes. Of course, we would then want to modify Q‑Learning so that the Q table is not reset with each call.

Now, let’s dig into what is happening in Q‑Learning. Here, \(\alpha \in (0,1]\) represents the learning rate, which needs to decay for convergence purposes, but in practice is often set to a constant. It’s also worth mentioning that Q-learning as presented here assumes discrete state and action spaces, where states and actions take on discrete values like \(1, 2, 3, \dots\) In contrast, a continuous state space would allow the state to take values from, say, a continuous range of numbers; for example, the state could be any real number in the interval \([1,3]\). Similarly, a continuous action space would allow the action to be drawn from, e.g., a continuous range of numbers. There are many Q-learning extensions that handle large or continuous state spaces (we’ll look at one soon); continuous action spaces usually require additional ideas, often from policy-gradient or actor-critic methods that are beyond our scope. Therefore, the algorithm above is sometimes referred to more specifically as tabular Q-learning to differentiate it from these extensions.
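
To make the tabular algorithm concrete, here is a minimal Python sketch of it. The `execute` and `select_action` functions are assumed to be supplied by the caller (an \(\epsilon\)-greedy version of `select_action` is discussed below); this is an illustration of the pseudo-code above, not a production implementation.

```python
import numpy as np

def q_learning(num_states, num_actions, execute, select_action,
               gamma=0.9, alpha=0.1, s0=0, max_iter=10000):
    """Tabular Q-learning sketch. `execute(s, a) -> (r, s_next)` samples the
    environment; `select_action(s, Q) -> a` chooses an action (e.g., epsilon-greedy)."""
    Q = np.zeros((num_states, num_actions))
    s = s0
    for _ in range(max_iter):
        a = select_action(s, Q)
        r, s_next = execute(s, a)
        target = r + gamma * np.max(Q[s_next])            # one-step look-ahead target
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # running-average update
        s = s_next
    return Q
```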

In the Q-learning update rule

\[ \begin{equation}\label{q_avg} \mathrm{Q}[s, a] \leftarrow (1-\alpha)\mathrm{Q}[s, a] + \alpha(r + \gamma \max_{a'}\mathrm{Q}[s',a']) \end{equation} \tag{12.1}\]

the term \((r + \gamma \max_{a'}\mathrm{Q}[s',a'])\) is often referred to as the one-step look-ahead target. The update can be viewed as a combination of two iterative processes that we have already seen: the dynamic-programming update of a Q value from value iteration, here applied to a sampled one-step look-ahead target, and a running average that blends the old estimate with that target using the learning rate \(\alpha.\)

Equation 12.1 can also be equivalently rewritten as

\[ \begin{equation}\label{td:q} \mathrm{Q}[s, a] \leftarrow \mathrm{Q}[s, a] + \alpha\bigl((r + \gamma \max_{a'} \mathrm{Q}[s',a'])-\mathrm{Q}[s,a]\bigr), \end{equation} \tag{12.2}\]

which allows us to interpret Q-learning in yet another way: we make an update (or correction) based on the temporal difference between the target and the current estimated value \(\mathrm{Q}[s, a].\)
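
To see that the two forms are indeed the same update, expand Equation 12.1:

\[ (1-\alpha)\,\mathrm{Q}[s,a] + \alpha\bigl(r + \gamma \max_{a'}\mathrm{Q}[s',a']\bigr) = \mathrm{Q}[s,a] + \alpha\bigl((r + \gamma \max_{a'}\mathrm{Q}[s',a']) - \mathrm{Q}[s,a]\bigr)\,. \]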

The Q-learning algorithm above includes a procedure called select_action, that, given the current state \(s\) and current \(\mathrm{Q}\) function, has to decide which action to take. If the \(\mathrm{Q}\) value is estimated very accurately and the agent is deployed to “behave” in the world (as opposed to “learn” in the world), then generally we would want to choose the apparently optimal action \(\arg\max_{a \in \mathcal{A}} \mathrm{Q}(s,a)\).

But, during learning, the \(\mathrm{Q}\) value estimates won’t be very good and exploration is important. However, exploring completely at random is also usually not the best strategy while learning, because it is good to focus your attention on the parts of the state space that are likely to be visited when executing a good policy (not a bad or random one).

A typical action-selection strategy that attempts to address this exploration versus exploitation dilemma is the so-called \(\epsilon\)-greedy strategy:

  • with probability \(1-\epsilon\), choose \(\arg\max_{a \in \mathcal{A}} \mathrm{Q}(s,a)\);

  • with probability \(\epsilon\), choose the action \(a \in \mathcal{A}\) uniformly at random.

The \(\epsilon\) probability of choosing a random action helps the agent to explore and try out actions that might not seem so desirable at the moment.
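
A minimal sketch of \(\epsilon\)-greedy action selection in Python, assuming the Q values are stored in a numpy array indexed as `Q[s, a]`:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(s, Q, epsilon=0.1):
    """With probability 1 - epsilon exploit; otherwise explore uniformly at random."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(np.argmax(Q[s]))                # exploit: argmax_a Q(s, a)
```

This function can be passed as `select_action` to the tabular Q-learning sketch above.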

Q-learning has the surprising property that, under some standard assumptions, it is guaranteed to converge to the actual optimal \(\mathrm{Q}\) function! The conditions it needs are a finite state-action space, visiting every state-action pair infinitely often, and a learning rate \(\alpha\) that follows an appropriate decay schedule. This implies that, as far as the exploration strategy is concerned, any strategy is okay as long as it tries every state-action pair infinitely often on an infinite run (so that the algorithm doesn’t converge prematurely to a bad action choice).

Q-learning can be very inefficient. Imagine a robot that has a choice between moving to the left and getting a reward of 1, then returning to its initial state, or moving to the right and walking down a 10-step hallway in order to get a reward of 1000, then returning to its initial state.

The first time the robot moves to the right and goes down the hallway, it will update the \(\mathrm{Q}\) value just for state 9 on the hallway and action “right” to have a high value, but it won’t yet understand that moving to the right in the earlier steps was a good choice. The next time it moves down the hallway it updates the value of the state before the last one, and so on. After 10 trips down the hallway, it can finally see that it is better to move to the right than to the left.

More concretely, consider the vector of Q values \(\mathrm{Q}(i = 0, \ldots, 9; \text{right})\), representing the Q values for moving right at each of the positions \(i = 0, \ldots, 9\). Position index 0 is the robot’s starting position.

Then, for \(\alpha = 1\) and \(\gamma = 0.9\), Equation 12.2 becomes

\[ \mathrm{Q}(i, \text{right}) \;=\; \mathrm{R}(i, \text{right}) \;+\; 0.9 \,\max_a \mathrm{Q}(i+1, a)\,. \]

Starting with Q values of 0,

\[ \mathrm{Q}^{(0)}(i = 0, \ldots, 9;\,\text{right}) =\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}\,. \]

We are violating our usual notational conventions here, and writing \(\mathrm{Q}^{(i)}\) to mean the Q value function that results after the robot has run all the way to the end of the hallway for the \(i\)th time, when executing the policy that always moves to the right.

Since the only nonzero reward from moving right is \(\mathrm{R}(9,\text{right}) = 1000\), after our robot makes it down the hallway once, our new Q vector is

\[ \mathrm{Q}^{(1)}(i = 0, \ldots, 9;\,\text{right}) =\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1000 \end{bmatrix}\,. \]

After making its way down the hallway again,

\[ \mathrm{Q}(8,\text{right}) = 0 + 0.9\,\mathrm{Q}(9,\text{right}) = 900 \]

updates:

\[ \mathrm{Q}^{(2)}(i = 0, \ldots, 9;\,\text{right}) =\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 900 & 1000 \end{bmatrix}\,. \]

Similarly,

\[ \mathrm{Q}^{(3)}(i = 0, \ldots, 9;\,\text{right}) =\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 810 & 900 & 1000 \end{bmatrix}\,, \]

\[ \mathrm{Q}^{(4)}(i = 0, \ldots, 9;\,\text{right}) =\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 729 & 810 & 900 & 1000 \end{bmatrix}\,, \]

\[ \dots \]

\[ \mathrm{Q}^{(10)}(i = 0, \ldots, 9;\,\text{right}) =\begin{bmatrix} 387.4 & 430.5 & 478.3 & 531.4 & 590.5 & 656.1 & 729 & 810 & 900 & 1000 \end{bmatrix}\,, \]

and the robot finally sees the value of moving right from position 0.
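
This propagation is easy to reproduce numerically. The following short Python sketch re-creates the hallway example above (10 positions, deterministic rightward moves, \(\alpha = 1\), \(\gamma = 0.9\), and the only nonzero reward for moving right being 1000 at position 9); it tracks only the “right” action, as in the Q vectors above.

```python
import numpy as np

gamma = 0.9
Q_right = np.zeros(10)      # Q(i, right) for positions i = 0, ..., 9
R_right = np.zeros(10)
R_right[9] = 1000           # the only nonzero reward for moving right

for trip in range(1, 11):   # ten trips down the hallway
    for i in range(10):
        # next position is i + 1; at position 9 the robot returns to the start,
        # so no bootstrapped value is used beyond it in this simplified sketch
        next_value = Q_right[i + 1] if i < 9 else 0.0
        Q_right[i] = R_right[i] + gamma * next_value   # alpha = 1 update
    print(f"after trip {trip}: {np.round(Q_right, 1)}")
```

After the tenth trip the printed vector matches \(\mathrm{Q}^{(10)}\) above, with the value finally reaching position 0.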

Here, we can see the exploration/exploitation dilemma in action: from the perspective of \(s_0=0\), it will seem that getting the immediate reward of \(1\) is a better strategy without exploring the long hallway.

Determine the Q value functions that result from always executing the “move left” policy.

12.2 Q-learning with function approximation

In our Q-learning algorithm above, we essentially keep track of each \(\mathrm{Q}\) value in a table, indexed by \(s\) and \(a\). What do we do if \(\mathcal{S}\) and/or \(\mathcal{A}\) are large (or continuous)?

We can use a function approximator like a neural network to store Q values. For example, we could design a neural network with parameters \(\theta\) that takes inputs \(s\) and \(a\), and outputs \(Q_\theta(s,a)\). We can treat this as a regression problem, optimizing this loss:

\[ \mathcal{L}(\theta) \;=\; \bigl(Q_\theta(s,a) \;-\; y\bigr)^2, \qquad y \;=\; r + \gamma \max_{a'} Q_\theta(s', a') \]

This is the so-called squared Bellman error; as the name suggests, it’s closely related to the Bellman equation we saw in MDPs in Chapter 11. Roughly speaking, this error measures how much the Bellman equality is violated.

An important subtlety: the target \(y\) also depends on \(\theta\) (through \(Q_\theta(s', a')\)), but we treat it as a constant when computing the gradient. The fitted-Q algorithm below makes this explicit by freezing the target as a regression label.
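
As an illustration of treating the target as a constant, here is a sketch of one gradient step on the squared Bellman error, using a simple linear approximator \(Q_\theta(s,a) = \theta^\top \phi(s,a)\) in place of a neural network; the feature map `phi` is a hypothetical stand-in.

```python
import numpy as np

def bellman_error_step(theta, phi, s, a, r, s_next, actions, gamma=0.9, lr=0.01):
    """One SGD step on (Q_theta(s, a) - y)^2, with the target y held fixed.
    Q_theta(s, a) = theta @ phi(s, a); `phi` is a hypothetical feature map."""
    # Freeze the one-step look-ahead target: no gradient flows through it.
    y = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    q = theta @ phi(s, a)
    grad = 2.0 * (q - y) * phi(s, a)   # gradient of the squared error w.r.t. theta
    return theta - lr * grad
```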

There are several different architectural choices for using a neural network to approximate \(\mathrm{Q}\) values:

  • One network for each action \(a\), that takes \(s\) as input and produces \(Q_\theta(s, a)\) as output;
  • One single network that takes \(s\) as input and produces a vector consisting of the \(\mathrm{Q}\) values for each action (where the \(i\)th component is \(Q_\theta(s,a_i)\)); or
  • One single network that takes \(s, a\) concatenated into a vector (if \(a\) is discrete, we would probably use a one-hot encoding, unless it had some useful internal structure) and produces \(Q_\theta(s, a)\) as output.

The first two choices are only suitable for discrete (and not too big) action sets. The last choice can be applied to continuous actions, but then it is difficult to find \(\arg\max_{a \in \mathcal{A}} Q_\theta(s, a)\).

For continuous action spaces, it is popular to use a class of methods called actor-critic methods, which combine policy and value-function learning. We won’t get into them in detail here, though.

There are not many theoretical guarantees about Q-learning with function approximation and, indeed, it can sometimes be fairly unstable (learning to perform well for a while, and then suddenly getting worse, for example). But neural network Q-learning has also had some significant successes.

One form of instability that we do know how to guard against is catastrophic forgetting. In standard supervised learning, we expect that the training \(x\) values were drawn independently from some distribution (and, in fact, we routinely shuffle their order in the data file anyway). But when a learning agent, such as a robot, is moving through an environment, the sequence of states it encounters will be temporally correlated. For example, the robot might spend 12 hours in a dark environment and then 12 in a light one. This can mean that while it is in the dark, the neural-network weight updates will make \(Q_\theta\) “forget” the value function for when it’s light.

One way to handle this is to use experience replay, where we save our \((s,a,s',r)\) experiences in a replay buffer. Whenever we take a step in the world, we add the \((s,a,s',r)\) to the replay buffer and use it to do a Q-learning update. Then we also randomly select some number of tuples from the replay buffer, and do Q-learning updates based on them as well. In general, it may help to keep a sliding window of just the 1000 most recent experiences in the replay buffer. (A larger buffer will be necessary for situations when the optimal policy might visit a large part of the state space, but we like to keep the buffer size small for memory reasons and also so that we don’t focus on parts of the state space that are irrelevant for the optimal policy.) The idea is that it will help us propagate reward values through our state space more efficiently if we do these updates. We can see it as doing something like value iteration, but using samples of experience rather than a known model.
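
A minimal replay buffer can be just a bounded queue plus uniform sampling; the capacity and batch size below are illustrative choices, not prescriptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Sliding window of recent (s, a, s', r) experience tuples."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off the end

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        # Uniformly sample stored tuples for additional Q-learning updates.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```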

Combining a neural-network \(\mathrm{Q}\) approximator, experience replay, and a target network (a copy of \(Q_\theta\) whose parameters are held fixed for many SGD steps and only periodically refreshed) is the recipe known as Deep Q-Networks (DQN). The target network keeps the bootstrapped target stable across many updates.

12.2.1 Fitted Q-learning

An alternative strategy for learning the \(\mathrm{Q}\) function that is somewhat more robust than the standard Q-learning algorithm is a method called fitted Q. It uses the same Bellman-error idea as above, but cleanly separates the two phases: first freeze the bootstrapped targets \(y = r + \gamma \max_{a'} Q_\theta(s', a')\) into a static dataset, then run ordinary supervised regression to fit \(Q_\theta\) to those targets.

\begin{algorithm}\begin{algorithmic} \Procedure{Fitted-Q-Learning}{$\mathcal{A}, s_0, \gamma, \epsilon, m$} \State $s \gets s_0$ \Comment{e.g., $s_0$ can be drawn randomly from $\mathcal{S}$} \State $\mathcal{D} \gets \emptyset$ \State initialize neural-network representation of $Q$ \While{True} \State $\mathcal{D}_\text{new} \gets$ experience from executing $\epsilon$-greedy policy based on $Q$ for $m$ steps \State $\mathcal{D} \gets \mathcal{D} \cup \mathcal{D}_\text{new}$ represented as tuples $(s, a, s', r)$ \State $\mathcal{D}_\text{supervised} \gets \emptyset$ \For{each tuple $(s, a, s', r) \in \mathcal{D}$} \State $x \gets (s,a)$ \State $y \gets r + \gamma \max_{a' \in \mathcal{A}} Q(s', a')$ \State $\mathcal{D}_\text{supervised} \gets \mathcal{D}_\text{supervised} \cup \{(x,y)\}$ \EndFor \State $Q \gets \text{supervised-NN-regression}(\mathcal{D}_\text{supervised})$ \EndWhile \EndProcedure \end{algorithmic} \end{algorithm}

Here, we alternate between using the policy induced by the current \(Q_\theta\) to gather a batch of data \(\mathcal{D}_\text{new}\), adding it to our overall data set \(\mathcal{D}\), and then using supervised neural-network training to learn a representation of the \(\mathrm{Q}\) value function on the whole data set. This method does not mix the dynamic-programming phase (computing new \(\mathrm{Q}\) values based on old ones) with the function approximation phase (supervised training of the neural network) and avoids catastrophic forgetting. The regression training typically uses squared error as a loss function and would be trained until the fit is good (possibly measured on held-out data).
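
In Python-like form, the outer loop of fitted Q might look as follows. All helpers here are hypothetical stand-ins: `collect_experience` gathers \(\epsilon\)-greedy rollouts, `q_values` evaluates the current approximator, and `regress` runs the supervised neural-network regression.

```python
def fitted_q(init_q, collect_experience, regress, q_values, actions,
             gamma=0.9, num_rounds=10, m=1000):
    """Fitted Q-learning sketch: freeze bootstrapped targets, then fit by regression.
    `collect_experience(Q, m)` returns a list of (s, a, s', r) tuples, `q_values(Q, s, a)`
    evaluates the approximator, and `regress(X, Y)` returns a newly fitted approximator."""
    Q = init_q()
    data = []
    for _ in range(num_rounds):
        data += collect_experience(Q, m)     # epsilon-greedy experience for m steps
        X, Y = [], []
        for (s, a, s_next, r) in data:
            target = r + gamma * max(q_values(Q, s_next, a2) for a2 in actions)
            X.append((s, a))
            Y.append(target)                 # target is frozen: a plain regression label
        Q = regress(X, Y)                    # supervised regression on the whole dataset
    return Q
```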

Note that fitted-Q and the experience-replay version of Q-learning above are not so much rival algorithms as two points on a spectrum: both store past transitions and both fit a network to bootstrapped Bellman targets. They differ in how often the targets get refreshed (every gradient step vs. once per outer iteration) and in how many gradient steps are taken between refreshes.

12.4 Model-based RL

The conceptually simplest approach to RL is to estimate \(\mathrm{T}\) and \(\mathrm{R}\) from interaction data, and then solve the resulting MDP \((\mathcal{S}, \mathcal{A}, \hat{\mathrm{T}}, \hat{\mathrm{R}})\) with value iteration (or another MDP-solving algorithm).

Suppose we have collected experience as tuples \((s^{(t)}, a^{(t)}, s^{(t+1)}, r^{(t)})\). Since \(\mathrm{T}(s,a,s')\) is a probability, we estimate it by counting: \[ \hat{\mathrm{T}}(s,a,s') = \frac{\#(s,a,s')}{\#(s,a)}, \] where \(\#(s,a,s')\) counts transitions matching \((s,a,s')\) in the data and \(\#(s,a)\) counts visits to \((s,a)\).

A common refinement is Laplace smoothing, which uses \(\hat{\mathrm{T}}(s,a,s') = \frac{\#(s,a,s') + 1}{\#(s,a) + |\mathcal{S}|}\) to prevent zero-probability estimates and division by zero. Conceptually, this is similar to having “initialized” our estimate with uniform random probabilities; the effect fades as data accumulates.

Since rewards are deterministic in \((s,a)\), a single observation suffices: \(\hat{\mathrm{R}}(s,a) = r\).
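
A minimal sketch of these count-based estimates in Python, assuming experience is stored as a list of \((s, a, s', r)\) tuples with integer-indexed states and actions:

```python
import numpy as np

def estimate_model(experience, num_states, num_actions):
    """Estimate T_hat (with Laplace smoothing) and R_hat from (s, a, s', r) tuples."""
    counts = np.zeros((num_states, num_actions, num_states))
    R_hat = np.zeros((num_states, num_actions))
    for (s, a, s_next, r) in experience:
        counts[s, a, s_next] += 1
        R_hat[s, a] = r                 # rewards are assumed deterministic in (s, a)
    # Laplace smoothing: add 1 to each count; the denominator becomes #(s, a) + |S|.
    T_hat = (counts + 1) / (counts.sum(axis=2, keepdims=True) + num_states)
    return T_hat, R_hat
```

The estimated MDP \((\mathcal{S}, \mathcal{A}, \hat{\mathrm{T}}, \hat{\mathrm{R}})\) can then be handed to value iteration as in Chapter 11.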

This approach is effective for small discrete state and action spaces, but generalizing to continuous or very large state spaces is an active research area.

12.5 Bandit problems

Bandit problems are a subset of reinforcement learning problems. A basic bandit problem is given by:

  • A set of actions \(\mathcal{A}\);
  • A set of reward values \(\mathcal{R}\); and
  • A probabilistic reward function \(R_p: \mathcal{A} \times \mathcal{R} \rightarrow \mathbb{R}\), i.e., \(R_p\) is a function that takes an action and a reward and returns the probability of getting that reward conditioned on that action being taken, \(R_p(a, r) = Pr(\text{reward} = r \mid \text{action} = a)\). Each time the agent takes an action, a new value is drawn from this distribution.

Notice that this probabilistic reward setup in bandits differs from the “rewards are deterministic” assumption we have made so far.

The most typical bandit problem has \(\mathcal{R} = \{0, 1\}\) and \(|\mathcal{A}| = k\). This is called a \(k\)-armed bandit problem, where the decision is which “arm” (action \(a\)) to select, and the reward is either getting a payoff (\(1\)) or not (\(0\)).

Why “bandit”? In English slang, “one-armed bandit” refers to a slot machine because it has one arm and takes your money! Here, we have a similar machine but with \(k\) arms.

The important question is usually one of exploration versus exploitation. Imagine you have tried each action 10 times, and now you have estimates \(\hat{R}_p(a, r)\) for the probabilities \(R_p(a, r)\). Which arm should you pick next? You could:

  • exploit your knowledge, choosing the arm with the highest value of expected reward; or
  • explore further, trying some or all actions more times to get better estimates of the \(R_p(a, r)\) values.

The theory ultimately tells us that the longer our horizon \(h\) (or, analogously, the closer our discount factor is to 1), the more time we should spend exploring, so that we don’t converge prematurely on a bad choice of action.
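
As a small illustration of this trade-off, here is a sketch of an \(\epsilon\)-greedy agent on a \(k\)-armed Bernoulli bandit; the payoff probabilities `p_true` are made up for the example, and the agent only sees the sampled rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.2, 0.5, 0.7])   # hypothetical Pr(reward = 1 | action), unknown to the agent
k = len(p_true)

counts = np.zeros(k)                 # pulls per arm
means = np.zeros(k)                  # running estimate of expected reward per arm
epsilon = 0.1

for t in range(10000):
    if rng.random() < epsilon:
        a = int(rng.integers(k))     # explore: try a random arm
    else:
        a = int(np.argmax(means))    # exploit: pull the arm that looks best so far
    r = float(rng.random() < p_true[a])      # Bernoulli reward draw
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]   # incremental mean update

print(np.round(means, 2))            # estimates should approach p_true
```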

Bandit problems are reinforcement learning problems (and very different from batch supervised learning) in that:

  • The agent gets to influence what data it obtains (selecting \(a\) gives it another sample from \(R_p(a, r)\)), and
  • The agent is penalized for mistakes it makes while it is learning (trying to maximize the expected reward it gets while behaving).

In a contextual bandit problem, you have multiple possible states from some set \(\mathcal{S}\), and a separate bandit problem associated with each one.

Bandit problems are an essential subset of reinforcement learning. It’s important to be aware of the issues, but we will not study solutions to them in this class.