The policy \pi defines the action probabilities, P(A_t | S_t).
The horizon H, i.e. the number of steps (either finite or infinite).
The discount factor \gamma (with 0 < \gamma \le 1).
Horizon and discount factor:
For finite horizon, we usually take \gamma = 1.
For infinite horizon, we usually take \gamma < 1, which:
Ensures that the objective is not infinite, assuming that the rewards are bounded (see the bound after this list);
Favors immediate rewards rather than delayed rewards;
Can be interpreted as a (1 - \gamma) probability that the execution stops after each time step.
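For instance, assuming the rewards are bounded by some R_\mathrm{max}, the infinite-horizon discounted return is bounded by a geometric series:
|G_t| \le \sum_{u=t}^{\infty} \gamma^{u-t} \, |R_u| \le \sum_{k=0}^{\infty} \gamma^k \, R_\mathrm{max} = \frac{R_\mathrm{max}}{1 - \gamma}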
Note: continuous state and action spaces
For simplicity, I am assuming finite state and action spaces for the formulas in most places. This makes the formulas more approachable for people who are not very comfortable with integrals, probability densities, general distributions, etc. The formulas can be generalized to continuous state and action spaces in most cases.
Policy
For finite horizon problems, the action selection usually depends on the current time t:
i.e. \pi = (\pi_0, \ldots, \pi_{H-1})
For deterministic policies: A_t = \pi_t(S_t)
For stochastic policies: (A_t | S_t = s) \sim \pi_t(s)
For infinite horizon problems, the action selection does not need to depend on the current time t:
For deterministic policies: A_t = \pi(S_t)
For stochastic policies: (A_t | S_t = s) \sim \pi(s)
Additional definitions
The return (or reward-to-go) G_t is the (discounted) sum of the rewards from time t onward:
G_t \triangleq \sum_{u=t}^{H-1} \gamma^{u-t} R_u
We define R(s,a) as the expected reward given S_t = s and A_t = a:
R(s,a) \triangleq \mathbb{E} \left[ R_t | S_t = s, A_t = a \right] = \underset{r \sim \mathcal{R}(s,a)}{\mathbb{E}} \left[ r \right]
A trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots) (or an episode) is a sequence of states, actions and rewards.
We write R(\tau) for the return of a given trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots):
R(\tau) \triangleq \sum_{t=0}^{H-1} \gamma^t r_t
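As a concrete illustration, here is a minimal Python helper (the name discounted_return is mine) that computes R(\tau) from a list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Horner-style accumulation of the discounted sum
    return g

# Example: rewards [1, 0, 2] with gamma = 0.5 give 1 + 0.5*0 + 0.25*2 = 1.5
assert abs(discounted_return([1.0, 0.0, 2.0], gamma=0.5) - 1.5) < 1e-9
```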
We write J(\pi) for the utility function, i.e. the expected return under policy \pi:
A deterministic optimal policy always exists, so we can assume that the optimal policy is deterministic:
A_t = \pi(S_t)
The optimal value/action-value/advantage functions, V^\star(s), Q^\star(s,a) and A^\star(s,a), are the value/action-value/advantage functions of any optimal policy:
Instead of using a tabular value function, we can use a parametric function Q_\phi(s,a). For example, this can be a neural network taking the state as input and with one output neuron per action value.
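For example, a minimal PyTorch sketch of such a network (layer sizes are arbitrary, for illustration only):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q_phi: takes the state as input, outputs one Q value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)

# q_values[b, a] approximates Q_phi(s_b, a)
q_values = QNetwork(state_dim=4, n_actions=2)(torch.randn(8, 4))
```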
In Q-learning, we learn the optimal action-value function Q^\star, or an approximation of it (\hat{Q}^\star \approx Q^\star), from (off-policy) trajectories. We can then use this learned function to choose the best action for a given state:
This assumes that the state space and the action space are finite, and is only practical if they are not too big. One drawback is that there is no generalization across states and actions.
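For reference, a minimal sketch of the core tabular Q-learning update (learning rate, exploration and the loop over episodes are omitted or simplified):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])  # off-policy: max over actions
    Q[s, a] += alpha * (target - Q[s, a])

# Q is a |S| x |A| table, e.g. initialized to zeros
Q = np.zeros((10, 4))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```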
Tabular Double Q-learning (which mitigates the maximization bias) learns two functions Q and Q':
Q-networks: using Q-learning with function approximation (e.g. neural networks). When the Q-networks are deep neural networks, we have Deep Q-learning.
We can minimize a mean squared error loss of the form:
In order to make the policy differentiable, we often consider stochastic policies. When using a continuous action space, the policy can be deterministic as well.
We use the following notations:
\begin{align*} (A_t|S_t = s) & \sim\pi_\theta(s) \\ \pi_\theta(a|s) &= P(A_t = a | S_t = s) & \text{ discrete action space } \\ \pi_\theta(a|s) &= p(A_t = a | S_t = s) & \text{ continuous action space } \end{align*}
Note: finite horizon
When using a finite horizon, the action selection usually depends on the current time t: (A_t|S_t = s) \sim \pi_{\theta,t}(s), etc. When using a parameterized function (e.g. based on a neural network), t can be an input of the neural network.
We simplify the notations by introducing:
J(\theta) \triangleq J(\pi_\theta)
Finite action space
For a finite action space, the policy \pi_\theta is a categorical distribution, typically defined by a neural network of the form \pi_\theta(s) = \mathrm{softmax}(f_\theta(s)) (i.e. taking the state as input and with one output neuron per action):
where \mu_\theta(s) gives the means and \Sigma_\theta(s) is a diagonal covariance matrix.
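A minimal PyTorch sketch of both parameterizations (network sizes are illustrative, and the log standard deviation is kept state-independent here, a common simplification of \Sigma_\theta(s)):

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """pi_theta(.|s) = softmax(f_theta(s)) over a finite action set."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.f(state))

class GaussianPolicy(nn.Module):
    """pi_theta(.|s) = Normal(mu_theta(s), diag(sigma^2)) over a continuous action space."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        return torch.distributions.Normal(self.mu(state), self.log_std.exp())

# Sampling an action and its log-probability:
dist = CategoricalPolicy(state_dim=4, n_actions=2)(torch.randn(1, 4))
a = dist.sample()
log_prob = dist.log_prob(a)
```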
Note: policy gradient for deterministic policies
For a policy gradient algorithm which works with deterministic policies, a = \pi_\theta(s) (on continuous action spaces), see Deep Deterministic Policy Gradient (DDPG), in Continuous control with deep reinforcement learning, Lillicrap et al. (2015).
Actor-critic methods are policy gradient methods, but they use TD learning to estimate G_t.
A2C
Advantage actor-critic (A2C) uses G_t \approx g_{t:t+1} (one-step) and b(s) = V_\phi(s), and learns \pi_\theta and an approximation V_\phi of its value function at the same time:
Kakade's Natural Policy Gradient uses natural gradient ascent with a constraint on the KL divergence between the current policy \pi_\theta and the new policy \pi_{\theta'}:
Morimura's Natural Policy Gradient (aka natural state-action gradient) uses a constraint based on the KL divergence of the joint distribution over states and actions, P_\theta(s,a), instead of the per-state action distribution \pi_\theta(a|s):
with q(a|s) a sampling distribution, typically \pi_\theta(a|s), and \pi_{\theta'}(a|s)/q(a|s) the likelihood ratio introduced because we are sampling a using q(\cdot|s) instead of \pi_{\theta'}(\cdot|s) (importance sampling).
\begin{aligned} s & \leftarrow \hat{H}_\theta^{-1} \, \hat{g}_\theta && \text{(search direction)} \\ \beta & \leftarrow \sqrt{\frac{2 \epsilon}{s^T \, \hat{H}_\theta \, s}} && \text{(maximal step size)} \\ \beta & \leftarrow \eta^k \beta && \text{where } k \in \mathbb{N} \text{ is the smallest natural integer such that } \hat{D}(\theta, \theta + \eta^k \, \beta \, s) \le \epsilon \text{ and } \tilde{J}_\theta(\theta + \eta^k \, \beta \, s) \gt \tilde{J}_\theta(\theta) \\ \theta & \leftarrow \theta + \beta \, s \end{aligned}
with 0 < \eta < 1.
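A minimal Python sketch of this update; the helpers H_inv_dot (computing \hat{H}_\theta^{-1} v, e.g. via conjugate gradient), D_hat and J_tilde are assumed to be provided:

```python
import numpy as np

def natural_gradient_step(theta, g_hat, H_inv_dot, D_hat, J_tilde,
                          epsilon=0.01, eta=0.5, max_backtracks=10):
    """Natural gradient step with backtracking line search on the step size."""
    s = H_inv_dot(g_hat)                       # search direction s = H^{-1} g
    beta = np.sqrt(2 * epsilon / (s @ g_hat))  # s^T H s = s^T g since s = H^{-1} g
    for k in range(max_backtracks):
        theta_new = theta + (eta ** k) * beta * s
        if D_hat(theta, theta_new) <= epsilon and J_tilde(theta_new) > J_tilde(theta):
            return theta_new  # constraint satisfied and surrogate objective improved
    return theta  # no acceptable step found: keep the current parameters
```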
The Q values Q^{\pi_\theta}(s,a) can for example be estimated by the reward-to-go g_t of the sampled trajectories.
Note: TRPO (single-path) algorithm
Sample trajectories from \pi_\theta
Compute \hat{g}_\theta = \nabla_\theta\left[ \frac{1}{n} \sum_{s,a,g} \frac{\pi_{\theta+\delta}(a|s)}{\pi_\theta(a|s)} \, g \right] where g is the reward-to-go after (s,a) of the sampled trajectory
Find the search direction s \leftarrow \hat{H}_\theta^{-1} \, \hat{g}_\theta (using the conjugate gradient algorithm)
And then we choose the largest \beta = \eta^k \, \beta^\star (i.e. the smallest k \in \mathbb{N}) such that:
the constraint is verified, D(\theta, \theta + \beta \, s) \le \epsilon;
the objective is improved, \tilde{J}_\theta(\theta + \beta \, s ) \gt \tilde{J}_\theta(\theta) .
PPO
Proximal Policy Optimization (PPO) uses the same idea as TRPO (avoiding moving too far away from the current policy) but without using second-order derivatives (the Hessian).
where \argmax_\delta is solved using gradient ascent.
The \epsilon parameter is typically around \epsilon = 0.2.
Note: explanation of the surrogate objective
This surrogate objective L_\theta(\delta) removes the incentive to move too far away from \pi_\theta, based on the \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } ratio (a code sketch follows the two cases below):
For \hat{A}_{\pi_\theta}(s,a) > 0, \nabla_\delta L_\theta(\delta,s,a) = 0 when \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } > 1 + \epsilon.
For \hat{A}_{\pi_\theta}(s,a) < 0, \nabla_\delta L_\theta(\delta,s,a) = 0 when \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } < 1 - \epsilon.
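A minimal PyTorch sketch of the clipped surrogate loss (written as a loss to minimize, hence the sign flip; the tensors are per-sample values gathered from the sampled trajectories):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Negative clipped surrogate objective, averaged over sampled (s, a) pairs."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_{theta+delta}(a|s) / pi_theta(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```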
Note: PPO-clip algorithm
Sample trajectories from \pi_\theta
Estimate advantages \hat{A}_{\pi_\theta}(s,a)
Optimize \max_\delta L_\theta(\delta) using gradient ascent
If the expectation of the KL divergence was too small, \beta is reduced, which leads to larger policy steps. On the other hand, if it was too large, \beta is increased, which leads to smaller policy steps.
Note: PPO-penalty algorithm
Sample trajectories from \pi_\theta;
Estimate advantages \hat{A}_{\pi_\theta}(s,a);
Optimize \max_\delta L_\theta(\delta) using gradient ascent;
Adaptive update of the parameter \beta based on \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ D_\mathrm{KL}( \pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right]
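A minimal sketch of the adaptive \beta update in the last step; the factor-of-1.5 trigger and factor-of-2 adjustment follow the heuristic suggested in the PPO paper, with d_target playing the role of the KL target:

```python
def update_beta(beta, kl, d_target):
    """Adapt the KL penalty coefficient from the measured expected KL divergence."""
    if kl < d_target / 1.5:
        beta /= 2.0  # KL too small: penalize less, allow larger policy steps
    elif kl > d_target * 1.5:
        beta *= 2.0  # KL too large: penalize more, force smaller policy steps
    return beta
```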
Model-based reinforcement learning (MBRL) methods learn the world model. Once the world model is learned, we can use planning algorithms.
Not covered here.
Exploration/exploitation tradeoff
In order to make the RL algorithm work, we need to find a balance between exploration (of new states, trajectories, actions) and exploitation (of known good states and actions).
ε-soft policies
An \epsilon-soft policy is a policy such that:
\forall s, \forall a, \pi(a|s) \ge \frac{\epsilon}{|\mathbb{A}|}
One example is the \epsilon-greedy policy. An \epsilon-greedy policy takes the best action (according to some Q function) with probability (1 - \epsilon) and a random action with probability \epsilon.
This is often used with Q-learning in order to have some amount of exploration.
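For example, a minimal numpy sketch of \epsilon-greedy action selection from a vector of Q values:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Best action with probability 1 - epsilon, uniform random action otherwise."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```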
Entropy regularization
We can add some regularization in order to prevent the policy from being too deterministic and keep some amount of exploration.
A max-entropy term can be added to the objective function in order to prevent the policy from being too deterministic:
is an estimator of \underset{X \sim P}{\mathbb{E}} \left[ f(X) \right] ,
with w(x) = P(x) / Q(x), or more generally w(x) = \lambda \, P(x) / Q(x), the likelihood ratio;
assuming that Q(x) > 0 for all x such that P(x) > 0 (which ensures that w(X_i) is finite and therefore that the denominator of the estimator does not become infinite).
Benefits:
This can be used if we can only know P(x) and/or Q(x) up to a scale factor.
Avoids the inconsistent estimates problem mentioned in the previous section.
Drawbacks:
In contrast to the previous estimator, it is biased.
Stronger requirements on Q.
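A minimal numpy sketch of this weighted (self-normalized) importance sampling estimator; the arrays hold f(X_i), P(X_i) and Q(X_i) for samples X_i \sim Q, possibly known only up to a common scale factor:

```python
import numpy as np

def self_normalized_is(f_values, p_values, q_values):
    """Estimate E_{X~P}[f(X)] from samples of Q using normalized likelihood ratios."""
    w = p_values / q_values          # w(X_i), any common scale factor cancels below
    return np.sum(w * f_values) / np.sum(w)
```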
Appendix: Kullback–Leibler divergence
Kullback–Leibler divergence
The Kullback–Leibler divergence (KL divergence) D_\mathrm{KL}(P \| Q) between two distributions P and Q measures the difference between a (true) distribution P and a distribution Q (often an approximation of P):
The Hessian of the KL divergence between P_\theta and P_{\theta'}, evaluated at \theta' \coloneqq \theta, is the Fisher information matrix \mathcal{I}_X(\theta) of X \sim P_\theta:
for some \lambda \ge 0. We can then search for a local optimum by solving \nabla_\delta L(\delta) = 0.
Equivalently, we can form the Lagrangian, L(\delta, \lambda) = f(\theta + \delta) + \lambda (\|\delta\|_2^2 - \epsilon), and find a local optimum by solving \nabla_\delta L(\delta, \lambda) = 0 and \nabla_\lambda L(\delta, \lambda) = 0 for the unknowns \delta and \lambda. The former equation is the gradient condition of the relaxed problem and the latter is equivalent to the constraint ( \|\delta\|_2^2 = \epsilon ).
More generally, if we consider a constraint of the form D(\theta,\theta+\delta) \le \epsilon where D(\theta,\theta+\delta) is a distance or divergence:
This assumes that \left[ \nabla_\delta D(\theta, \theta+\delta) \right]_{\delta \coloneqq 0} = 0, which is true if \delta = 0 is a local minimum of D(\theta, \theta+\delta) and \theta is not on a boundary.
We assume that D(\theta,\theta') > 0 for \theta \ne \theta'. This implies that H_\theta is positive definite and therefore invertible.
Note: derivation
We can approximate D(\theta,\theta+\delta) using a second order Taylor expansion:
We don't want to explicitly form \hat{H}(\theta) (which is huge for large deep network models), and we don't want to invert it using basic matrix inversion methods.
Because \hat{H}(\theta) is positive semi-definite, we can use the conjugate gradient algorithm to efficiently find an approximate solution. This algorithm only needs to evaluate products \hat{H}(\theta) \, x (Hessian-vector products, HVP).
The Hessian-Vector product can be computed using the following equation (using some autodiff library):
(\nabla^2_\theta h(\theta)) \, x = \nabla_\theta\left[ (\nabla_\theta h(\theta))^T \, x \right]
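A minimal PyTorch sketch of this Hessian-vector product, for a scalar-valued function h of a flat parameter vector theta:

```python
import torch

def hessian_vector_product(h, theta, x):
    """Compute (nabla^2 h(theta)) x as nabla_theta [ (nabla_theta h(theta))^T x ]."""
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(h(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(torch.dot(grad, x), theta)[0]
```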
When trying to learn a probability distribution P_\theta (e.g. using maximum likelihood or maximum a posteriori estimation), we can limit the size of the steps by adding a constraint in the probability distribution space:
In many cases, adding a constraint in the distribution space, D_\mathrm{KL}(P_\theta \| P_{\theta+\delta}) \le \epsilon, makes a lot more sense than a constraint in the parameter space, \|\delta\|_2^2 \le \epsilon.