The policy \pi defines the action probabilities, P(A_t | S_t).
The horizon H, i.e. the number of steps (either finite or infinite).
The discount factor \gamma (with 0 < \gamma \le 1).
Horizon and discount factor:
For finite horizon, we usually take \gamma = 1.
For infinite horizon, we usually take \gamma < 1, which:
Ensures that the objective is not infinite, assuming that the rewards are bounded (see the bound after this list);
Favors immediate rewards rather than delayed rewards;
Can be interpreted as a (1 - \gamma) probability that the execution stops after each time step.
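For instance, assuming the rewards are bounded by some R_\mathrm{max}, the infinite-horizon discounted return is bounded by a geometric series:
|G_t| \le \sum_{u=t}^{\infty} \gamma^{u-t} \, |R_u| \le \sum_{k=0}^{\infty} \gamma^k \, R_\mathrm{max} = \frac{R_\mathrm{max}}{1 - \gamma}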
Note: continuous state and action spaces
For simplicity, I am assuming finite state and action spaces for the formulas in most places. This makes the formulas more approachable for people who are not very comfortable with integrals, probability densities, general distributions, etc. The formulas can be generalized to continuous state and action spaces in most cases.
Policy
For finite horizon problems, the action selection usually depends on the current time t:
i.e. \pi = (\pi_0, \ldots, \pi_{H-1})
For deterministic policies: A_t = \pi_t(S_t)
For stochastic policies: (A_t | S_t = s) \sim \pi_t(s)
For infinite horizon problems, the action selection does not need to depend on the current time t:
For deterministic policies: A_t = \pi(S_t)
For stochastic policies: (A_t | S_t = s) \sim \pi(s)
Additional definitions
The return (or reward-to-go) G_t is the (discounted) sum of the rewards from time t onward:
G_t \triangleq \sum_{u=t}^{H-1} \gamma^{u-t} R_u
We define R(s,a) as the expected reward given S_t = s and A_t = a:
R(s,a) \triangleq \mathbb{E} \left[ R_t | S_t = s, A_t = a \right] = \underset{r \sim \mathcal{R}(s,a)}{\mathbb{E}} \left[ r \right]
A trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots) (or an episode) is a sequence of states, actions and rewards.
We write R(\tau) for the return of a given trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots):
R(\tau) \triangleq \sum_{t=0}^{H-1} \gamma^t r_t
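As a concrete illustration, here is a minimal Python helper (the name discounted_return is mine) that computes R(\tau) from a list of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Horner-style accumulation of the discounted sum
    return g

# Example: rewards [1, 0, 2] with gamma = 0.5 give 1 + 0.5*0 + 0.25*2 = 1.5
assert abs(discounted_return([1.0, 0.0, 2.0], gamma=0.5) - 1.5) < 1e-9
```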
We write J(\pi) for the utility function, i.e. the expected return under policy \pi:
A deterministic optimal policy always exists, so we can assume that the optimal policy is deterministic:
A_t = \pi(S_t)
The optimal value/action-value/advantage functions, V^\star(s), Q^\star(s,a) and A^\star(s,a), are the value/action-value/advantage functions of any optimal policy:
Instead of using a tabular value function, we can use a parametric function Q_\phi(s,a). For example, this can be a neural network taking the state as input and with one output neuron per action value.
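For example, a minimal PyTorch sketch of such a network (layer sizes are arbitrary, for illustration only):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q_phi: takes the state as input, outputs one Q value per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)

# q_values[b, a] approximates Q_phi(s_b, a)
q_values = QNetwork(state_dim=4, n_actions=2)(torch.randn(8, 4))
```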
In Q-learning, we learn the optimal action-value function Q^\star, or an approximation of it (\hat{Q}^\star \approx Q^\star), from (off-policy) trajectories. We can then use this learned function to choose the best action for a given state:
This assumes that the state space and the action space are finite, and is only practical if they are not too big. One drawback is that there is no generalization across states and actions.
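For reference, a minimal sketch of the core tabular Q-learning update (learning rate, exploration and the loop over episodes are omitted or simplified):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step: Q(s,a) += alpha * (TD target - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])  # off-policy: max over actions
    Q[s, a] += alpha * (target - Q[s, a])

# Q is a |S| x |A| table, e.g. initialized to zeros
Q = np.zeros((10, 4))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```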
Tabular Double Q-learning (which mitigates the maximization bias) learns two functions Q and Q':
Q-networks: using Q-learning with function approximation (e.g. neural networks). When the Q-networks are deep neural networks, we have Deep Q-learning.
We can minimize a mean squared error loss of the form:
In order to make the policy differentiable, we often consider stochastic policies. When using a continuous action space, the policy can be deterministic as well.
We use the following notations:
\begin{align*} (A_t|S_t = s) & \sim\pi_\theta(s) \\ \pi_\theta(a|s) &= P(A_t = a | S_t = s) & \text{ discrete action space } \\ \pi_\theta(a|s) &= p(A_t = a | S_t = s) & \text{ continuous action space } \end{align*}
Note: finite horizon
When using a finite horizon, the action selection usually depends on the current time t: (A_t|S_t = s) \sim \pi_{\theta,t}(s), etc. When using a parameterized function (e.g. based on a neural network), t can be an input of the neural network.
We simplify the notations by introducing:
J(\theta) \triangleq J(\pi_\theta)
Finite action space
For a finite action space, the policy \pi_\theta is a categorical distribution, typically defined by a neural network of the form \pi_\theta(s) = \mathrm{softmax}(f_\theta(s)) (i.e. taking the state as input and with one output neuron per action):
where \mu_\theta(s) gives the means and \Sigma_\theta(s) is a diagonal covariance matrix.
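A minimal PyTorch sketch of both parameterizations (network sizes are illustrative, and the log standard deviation is kept state-independent here, a common simplification of \Sigma_\theta(s)):

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """pi_theta(.|s) = softmax(f_theta(s)) over a finite action set."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.f(state))

class GaussianPolicy(nn.Module):
    """pi_theta(.|s) = Normal(mu_theta(s), diag(sigma^2)) over a continuous action space."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        return torch.distributions.Normal(self.mu(state), self.log_std.exp())

# Sampling an action and its log-probability:
dist = CategoricalPolicy(state_dim=4, n_actions=2)(torch.randn(1, 4))
a = dist.sample()
log_prob = dist.log_prob(a)
```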
Note: policy gradient for deterministic policies
For a policy gradient algorithm which works with deterministic policies, a = \pi_\theta(s) (on continuous action spaces), see Deep Deterministic Policy Gradient (DDPG), in Continuous control with deep reinforcement learning, Lillicrap et al. (2015).
Actor-critic methods are policy gradient methods, but they use TD learning to estimate G_t.
A2C
Advantage actor-critic (A2C) uses G_t \approx g_{t:t+1} (one-step) and b(s) = V_\phi(s), and learns \pi_\theta and an approximation V_\phi of its value function at the same time:
Kakade's Natural Policy Gradient uses natural gradient ascent with a constraint on the KL divergence between the current policy \pi_\theta and the new policy \pi_{\theta'}:
Morimura's Natural Policy Gradient (aka natural state-action gradient) uses a constraint based on the KL divergence of the joint distribution over states and actions, P_\theta(s,a), instead of the per-state action distribution \pi_\theta(a|s):
with q(a|s) a sampling distribution, typically \pi_\theta(a|s), and \pi_{\theta'}(a|s)/q(a|s) the likelihood ratio introduced because we are sampling a using q(\cdot|s) instead of \pi_{\theta'}(\cdot|s) (importance sampling).
\begin{aligned} s & \leftarrow \hat{H}_\theta^{-1} \, \hat{g}_\theta && \text{(search direction)} \\ \beta & \leftarrow \sqrt{\frac{2 \epsilon}{s^T \, \hat{H}_\theta \, s}} && \text{(maximal step size)} \\ \beta & \leftarrow \eta^k \beta && \text{where } k \in \mathbb{N} \text{ is the smallest natural integer such that } \hat{D}(\theta, \theta + \eta^k \, \beta \, s) \le \epsilon \text{ and } \tilde{J}_\theta(\theta + \eta^k \, \beta \, s) \gt \tilde{J}_\theta(\theta) \\ \theta & \leftarrow \theta + \beta \, s \end{aligned}
with 0 < \eta < 1.
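A minimal Python sketch of this update; the helpers H_inv_dot (computing \hat{H}_\theta^{-1} v, e.g. via conjugate gradient), D_hat and J_tilde are assumed to be provided:

```python
import numpy as np

def natural_gradient_step(theta, g_hat, H_inv_dot, D_hat, J_tilde,
                          epsilon=0.01, eta=0.5, max_backtracks=10):
    """Natural gradient step with backtracking line search on the step size."""
    s = H_inv_dot(g_hat)                       # search direction s = H^{-1} g
    beta = np.sqrt(2 * epsilon / (s @ g_hat))  # s^T H s = s^T g since s = H^{-1} g
    for k in range(max_backtracks):
        theta_new = theta + (eta ** k) * beta * s
        if D_hat(theta, theta_new) <= epsilon and J_tilde(theta_new) > J_tilde(theta):
            return theta_new  # constraint satisfied and surrogate objective improved
    return theta  # no acceptable step found: keep the current parameters
```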
The Q values Q^{\pi_\theta}(s,a) can for example be estimated by the reward-to-go g_t of the sampled trajectories.
Note: TRPO (single-path) algorithm
Sample trajectories from \pi_\theta
Compute \hat{g}_\theta = \nabla_\theta\left[ \frac{1}{n} \sum_{s,a,g} \frac{\pi_{\theta+\delta}(a|s)}{\pi_\theta(a|s)} \, g \right] where g is the reward-to-go after (s,a) of the sampled trajectory
Find the search direction s \leftarrow \hat{H}_\theta^{-1} \, \hat{g}_\theta (using the conjugate gradient algorithm)
And then we choose the largest \beta = \eta^k \, \beta^\star (i.e. the smallest k \in \mathbb{N}) such that:
the constraint is verified, D(\theta, \theta + \beta \, s) \le \epsilon;
the objective is improved, \tilde{J}_\theta(\theta + \beta \, s ) \gt \tilde{J}_\theta(\theta) .
PPO
Proximal Policy Optimization (PPO) uses the same idea as TRPO (avoiding moving too far away from the current policy) but without using second-order derivatives (the Hessian).
where \argmax_\delta is solved using gradient ascent.
The \epsilon parameter is typically around \epsilon = 0.2.
Note: explanation of the surrogate objective
This surrogate objective L_\theta(\delta) removes the incentive to move too far away from \pi_\theta, based on the \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } ratio (a code sketch follows the two cases below):
For \hat{A}_{\pi_\theta}(s,a) > 0, \nabla_\delta L_\theta(\delta,s,a) = 0 when \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } > 1 + \epsilon.
For \hat{A}_{\pi_\theta}(s,a) < 0, \nabla_\delta L_\theta(\delta,s,a) = 0 when \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } < 1 - \epsilon.
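A minimal PyTorch sketch of the clipped surrogate loss (written as a loss to minimize, hence the sign flip; the tensors are per-sample values gathered from the sampled trajectories):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Negative clipped surrogate objective, averaged over sampled (s, a) pairs."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_{theta+delta}(a|s) / pi_theta(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```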
Note: PPO-clip algorithm
Sample trajectories from \pi_\theta
Estimate advantages \hat{A}_{\pi_\theta}(s,a)
Optimize \max_\delta L_\theta(\delta) using gradient ascent
If the expectation of the KL divergence was too small, \beta is reduced, which leads to larger policy steps. On the other hand, if it was too large, \beta is increased, which leads to smaller policy steps.
Note: PPO-penalty algorithm
Sample trajectories from \pi_\theta;
Estimate advantages \hat{A}_{\pi_\theta}(s,a);
Optimize \max_\delta L_\theta(\delta) using gradient ascent;
Adaptive update of the parameter \beta based on \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ D_\mathrm{KL}( \pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right]
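A minimal sketch of the adaptive \beta update in the last step; the factor-of-1.5 trigger and factor-of-2 adjustment follow the heuristic suggested in the PPO paper, with d_target playing the role of the KL target:

```python
def update_beta(beta, kl, d_target):
    """Adapt the KL penalty coefficient from the measured expected KL divergence."""
    if kl < d_target / 1.5:
        beta /= 2.0  # KL too small: penalize less, allow larger policy steps
    elif kl > d_target * 1.5:
        beta *= 2.0  # KL too large: penalize more, force smaller policy steps
    return beta
```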
Model-based reinforcement learning (MBRL) methods learn the world model. Once the world model is learned, we can use planning algorithms.
Not covered here.
Exploration/exploitation tradeoff
In order to make the RL algorithm work, we need to find a balance between exploration (of new states, trajectories, actions) and exploitation (of known good states and actions).
ε-soft policies
An \epsilon-soft policy is a policy such that:
\forall s, \forall a, \pi(a|s) \ge \frac{\epsilon}{|\mathbb{A}|}
One example is the \epsilon-greedy policy. An \epsilon-greedy policy takes the best action (according to some Q function) with probability (1 - \epsilon) and a random action with probability \epsilon.
This is often used with Q-learning in order to have some amount of exploration.
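For example, a minimal numpy sketch of \epsilon-greedy action selection from a vector of Q values:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Best action with probability 1 - epsilon, uniform random action otherwise."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```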
Entropy regularization
We can add some regularization in order to prevent the policy from being too deterministic and keep some amount of exploration.
A max-entropy term can be added to the objective function in order to prevent the policy from being too deterministic:
is an estimator of \underset{X \sim P}{\mathbb{E}} \left[ f(X) \right] ,
with w(x) = P(x) / Q(x), or more generally w(x) = \lambda \, P(x) / Q(x), the likelihood ratio;
assuming that Q(x) > 0 for all x such that P(x) > 0 (which ensures that w(X_i) is finite and therefore that the denominator of the estimator does not become infinite).
Benefits:
This can be used if we can only know P(x) and/or Q(x) up to a scale factor.
Avoids the inconsistent estimates problem mentioned in the previous section.
Drawbacks:
In contrast to the previous estimator, it is biased.
Stronger requirements on Q.
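A minimal numpy sketch of this weighted (self-normalized) importance sampling estimator; the arrays hold f(X_i), P(X_i) and Q(X_i) for samples X_i \sim Q, possibly known only up to a common scale factor:

```python
import numpy as np

def self_normalized_is(f_values, p_values, q_values):
    """Estimate E_{X~P}[f(X)] from samples of Q using normalized likelihood ratios."""
    w = p_values / q_values          # w(X_i), any common scale factor cancels below
    return np.sum(w * f_values) / np.sum(w)
```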
Appendix: Kullback–Leibler divergence
Kullback–Leibler divergence
The Kullback–Leibler divergence (KL divergence) D_\mathrm{KL}(P \| Q) between two distributions P and Q measures the difference between a (true) distribution P and a distribution Q (often an approximation of P):
The Hessian of the KL divergence between P_\theta and P_{\theta'}, evaluated at \theta' \coloneqq \theta, is the Fisher information matrix \mathcal{I}_X(\theta) of X \sim P_\theta:
for some \lambda \ge 0. We can then search for a local optimum by solving \nabla_\delta L(\delta) = 0.
Equivalently, we can form the Lagrangian, L(\delta, \lambda) = f(\theta + \delta) + \lambda (\|\delta\|_2^2 - \epsilon), and find a local optimum by solving \nabla_\delta L(\delta, \lambda) = 0 and \nabla_\lambda L(\delta, \lambda) = 0 for the unknowns \delta and \lambda. The former equation is the gradient condition of the relaxed problem and the latter is equivalent to the constraint ( \|\delta\|_2^2 = \epsilon ).
More generally, if we consider a constraint of the form D(\theta,\theta+\delta) \le \epsilon where D(\theta,\theta+\delta) is a distance or divergence:
This assumes that \left[ \nabla_\delta D(\theta, \theta+\delta) \right]_{\delta \coloneqq 0} = 0, which is true if \delta = 0 is a local minimum of D(\theta, \theta+\delta) and \theta is not on a boundary.
We assume that D(\theta,\theta') > 0 for \theta \ne \theta'. This implies that H_\theta is positive definite and therefore invertible.
Note: derivation
We can approximate D(\theta,\theta+\delta) using a second order Taylor expansion:
We don't want to explicitly form \hat{H}(\theta) (which is huge for large deep network models), and we don't want to invert it using basic matrix inversion methods.
Because \hat{H}(\theta) is positive semi-definite, we can use the conjugate gradient algorithm to efficiently find an approximate solution. This algorithm only needs to evaluate products \hat{H}(\theta) \, x (Hessian-vector products, HVP).
The Hessian-Vector product can be computed using the following equation (using some autodiff library):
(\nabla^2_\theta h(\theta)) \, x = \nabla_\theta\left[ (\nabla_\theta h(\theta))^T \, x \right]
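A minimal PyTorch sketch of this Hessian-vector product, for a scalar-valued function h of a flat parameter vector theta:

```python
import torch

def hessian_vector_product(h, theta, x):
    """Compute (nabla^2 h(theta)) x as nabla_theta [ (nabla_theta h(theta))^T x ]."""
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(h(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(torch.dot(grad, x), theta)[0]
```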
When trying to learn a probability distribution P_\theta (e.g. using maximum likelihood or maximum a posteriori estimation), we can limit the size of the steps by adding a constraint in the probability distribution space:
In many cases, adding a constraint in the distribution space, D_\mathrm{KL}(P_\theta \| P_{\theta+\delta}) \le \epsilon, makes a lot more sense than a constraint in the parameter space, \|\delta\|_2^2 \le \epsilon.