
Reinforcement Learning formulas cheat sheet

Cheat sheet for (some) reinforcement learning mathematical formulas and algorithms.

MDP model

Markov Decision Process (MDP) represented as a (dynamic) Bayesian network

When the world model is known → planning.

When the world model is not known (or not used) → reinforcement learning.

3 types of approaches for reinforcement learning: value-based methods (learn V or Q, e.g. Q-learning), policy-based methods (directly optimize the policy, e.g. policy gradients) and model-based methods (learn the world model, then plan).

Objective

Goal: (find a policy in order to) maximize the expected sum of (discounted) rewards.

Objective:

\max_\pi \mathbb{E} \left[ \sum_{t=0}^{H-1} \gamma^{t} R_t | S_0=s, \pi \right]

with H the horizon, \gamma the discount factor, R_t the reward at time t, S_0 = s the initial state and \pi the policy.

Horizon and discount factor: the horizon H may be finite or infinite, and the discount factor satisfies \gamma \in (0,1] (with \gamma < 1 in the infinite horizon case so that the sum is well defined).

Note: continuous state and action spaces

For simplicity, I am assuming finite state and action spaces for the formulas in most places. This makes the formulas more approachable for people who are not very comfortable with integrals, probability densities, general distributions, etc. The formulas can be generalized to continuous state and action spaces in most cases.

Policy

For finite horizon problems, the action selection usually depends on the current time t:

A_t \sim \pi_t(S_t)

For infinite horizon problems, the action selection does not need to depend on the current time t:

A_t \sim \pi(S_t)

Additional definitions

The return (or reward-to-go) G_t is the (discounted) sum of the rewards R_t:

G_t \triangleq \sum_{u=t}^{H-1} \gamma^{u-t} R_u

We define R(s,a) as the expected reward given S_t = s and A_t = a:

R(s,a) \triangleq \mathbb{E} \left[ R_t | S_t = s, A_t = a \right] = \underset{r \sim \mathcal{R}(s,a)}{\mathbb{E}} \left[ r \right]

A trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots) (or an episode) is a sequence of states, actions and rewards.

We note R(\tau) the return for a given trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots):

R(\tau) \triangleq \sum_{t=0}^{H-1} \gamma^t r_t

We note J(\pi) the utility function, i.e. the expected return for policy \pi:

J(\pi) \triangleq \mathbb{E}\left[G_0 | \pi\right] = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t R_t | \pi\right] = \mathbb{E}\left[ R(\tau) | \pi\right] = \mathbb{E}\left[ V^\pi(S_0) \right]

Value of a policy

The value function of a policy \pi gives the expected return when following policy \pi starting from state S_0 = s:

V^\pi(s) \triangleq \mathbb{E} \left[ G_0 | S_0 = s, \pi \right] = \mathbb{E} \left[ \sum_{t=0}^{H-1} \gamma^t R_t | S_0 = s, \pi \right]

The action-value function (or Q-function) of a policy \pi gives the expected return when taking action a in state s and then following policy \pi:

Q^\pi(s,a) \triangleq \mathbb{E} \left[ G_0 | S_0 = s, A_0 = a, \pi \right] = \mathbb{E} \left[ \sum_{t=0}^{H-1} \gamma^t R_t | S_0 = s, A_0 = a, \pi \right]

The advantage function gives the advantage of taking action a in state s (and then following \pi) over directly following \pi from state s:

A^\pi(s,a) \triangleq Q^\pi(s,a) - V^\pi(s)

In finite horizon, the functions depend on the current time step t:

\begin{aligned} V^\pi_t(s) & \triangleq \mathbb{E} \left[ G_t | S_t = s, \pi \right] && = \mathbb{E} \left[ \sum_{u=t}^{H-1} \gamma^{u-t} R_u | S_t = s, \pi \right] \\ Q^\pi_t(s,a) & \triangleq \mathbb{E} \left[ G_t | S_t = s, A_t = a, \pi \right] && = \mathbb{E} \left[ \sum_{u=t}^{H-1} \gamma^{u-t} R_u | S_t = s, A_t = a, \pi \right] \\ A^\pi_t(s,a) & \triangleq Q^\pi_t(s,a) - V^\pi_t(s) \end{aligned}

Bellman equation

When using infinite horizon, the Bellman equations characterize the value function (and action-value function) of a policy \pi:

\begin{aligned} V^\pi(s) & = \underset{a \sim \pi(s), s' \sim \mathcal{T}(s,a) }{\mathbb{E}} \left[ R(s, a) + \gamma V^\pi(s') \right] \\ Q^\pi(s, a) & = R(s, a) + \gamma \underset{s' \sim \mathcal{T}(s,a), a' \sim \pi(s')}{\mathbb{E}} \left[ Q^\pi(s', a') \right] \end{aligned}

For discrete state and action spaces:

\begin{aligned} V^\pi(s) & = \sum_{a} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} \mathcal{T}(s'|s,a) \, V^\pi(s') \right] \\ Q^\pi(s, a) & = R(s, a) + \gamma \sum_{s',a'} \mathcal{T}(s'|s,a) \, \pi(a'|s') \, Q^\pi(s', a') \end{aligned}

For discrete state space and deterministic policies:

\begin{aligned} V^\pi(s) & = R(s, \pi(s)) + \gamma \sum_{s'} \mathcal{T}(s'|s,\pi(s)) \, V^\pi(s') \\ Q^\pi(s, a) & = R(s, a) + \gamma \sum_{s'} \mathcal{T}(s'|s,a) \, Q^\pi(s', \pi(s')) \end{aligned}

Policy Evaluation

The policy evaluation algorithm computes the value function of a given policy \pi by iteratively refining approximations of V^\pi.

In infinite horizon, the following algorithm estimates V^\pi:

\begin{aligned} V_0(s) & \leftarrow 0 \text{ (or any other value)} \\ V_{k+1}(s) & \leftarrow \sum_a \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} \mathcal{T}(s'|s,a) V_k(s') \right] \end{aligned}

This converges to V^\pi: V_k(s) \overset{k \rightarrow \infty}\longrightarrow V^\pi(s).

In finite horizon, we can compute the value function at each time step V^\pi_t:

\begin{aligned} V^\pi_H(s) & \leftarrow 0 \\ V^\pi_{t-1}(s) & \leftarrow \sum_a \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} \mathcal{T}(s'|s,a) V^\pi_t(s') \right] \end{aligned}
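
A minimal NumPy sketch of the infinite-horizon iterative scheme above. The tabular arrays T[s, a, s'], R[s, a] and pi[s, a] are hypothetical placeholders for \mathcal{T}(s'|s,a), R(s,a) and \pi(a|s) (not from the original post):

import numpy as np

def policy_evaluation(T, R, pi, gamma=0.9, tol=1e-8):
    """Iteratively compute V^pi for a tabular MDP.

    T:  (S, A, S) array, T[s, a, s2] = P(s2 | s, a)
    R:  (S, A) array, expected reward R(s, a)
    pi: (S, A) array, pi[s, a] = pi(a | s)
    """
    V = np.zeros(T.shape[0])                    # V_0(s) <- 0
    while True:
        Q = R + gamma * T @ V                   # R(s,a) + gamma * sum_s' T(s'|s,a) V_k(s')
        V_new = np.sum(pi * Q, axis=1)          # average over pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new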

Planning

Computing the optimal policy when the world model is known.

Optimal policy

A policy \pi^\star is optimal if its value function is at least as high as the value function of any other policy (for all states):

\forall \pi, \forall s, V^{\pi^\star}(s) \ge V^\pi(s)

There can be more than one optimal policy.

A deterministic optimal policy always exists, so we can assume that the optimal policy is deterministic:

A_t = \pi(S_t)

The optimal value/action-value/advantage functions, V^\star(s), Q^\star(s,a) and A^\star(s,a), are the value/action-value/advantage functions of any optimal policy:

\begin{aligned} V^\star(s) & = V^{\pi^\star}(s) && = \max_\pi V^\pi(s) \\ Q^\star(s,a) & = Q^{\pi^\star}(s,a) &&= \max_\pi Q^\pi(s,a) \\ A^\star(s,a) & = A^{\pi^\star}(s,a) &&= Q^\star(s,a) - V^\star(s) \end{aligned}

Note: using the optimal value functions

We can derive an optimal policy \pi^\star from V^\star(s) if we know the world model (R(s,a) and \mathcal{T}(s'|s,a)):

\pi^\star(s) = \argmax_a \left[ R(s,a) + \gamma \sum_{s'} \mathcal{T}(s' | s,a) V^\star(s') \right]

We can derive the optimal policy from Q^\star(s,a) without the world model:

\pi^\star(s) = \argmax_a Q^\star(s,a)

If we learn Q^\star (Q-learning), we can directly use it to derive the optimal policy.

Bellman optimality equations

Bellman optimality equation for V^\star:

V^\star(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} \mathcal{T}(s' | s,a) V^\star(s') \right]

Bellman optimality equation for Q^\star:

Q^\star(s,a) = R(s,a) + \gamma \sum_{s'} \mathcal{T}(s' | s,a) \max_{a'} Q^\star(s',a')

Value Iteration

The value iteration algorithm computes the optimal value function V^\star (when the world model is known).

Infinite horizon:

\begin{aligned} V_0(s) & \leftarrow 0 \\ V_{k+1}(s) & \leftarrow \max_a \left[ R(s,a) + \gamma \sum_{s'} \mathcal{T}(s'|s,a) V_k(s') \right] \end{aligned}

This converges to V^\star: V_k(s) \overset{k \rightarrow \infty}\longrightarrow V^\star(s).
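
A matching sketch of value iteration, with the same hypothetical T and R arrays as in the policy evaluation sketch above; it also returns a greedy (optimal) policy:

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Compute V* (and a greedy policy) for a tabular MDP with a known model."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * T @ V                   # Bellman optimality backup
        V_new = Q.max(axis=1)                   # V_{k+1}(s) = max_a [...]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)      # optimal values and a greedy policy
        V = V_new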

Policy Iteration

The policy iteration algorithm computes a sequence of policies \pi_k (converging to \pi^\star) by alternating two steps (a code sketch follows the list):

  1. Policy evaluation: compute V^{\pi_k}
  2. Policy improvement: use V^{\pi_k} to derive a policy \pi_{k+1} at least as good as \pi_k
\pi_{k+1}(s) \leftarrow \argmax_{a}\left[ R(s,a) + \gamma \sum_{s'} \mathcal{T}(s' | s,a) V^{\pi_k}(s') \right]
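
A sketch of policy iteration with the same hypothetical T and R arrays as above; for simplicity the evaluation step is done exactly with a linear solve rather than with the iterative scheme of the previous section:

import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Tabular policy iteration with exact policy evaluation (linear solve)."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: solve V = r_pi + gamma * P_pi V exactly
        P_pi = T[np.arange(n_states), pi]           # (S, S) transition matrix under pi
        r_pi = R[np.arange(n_states), pi]           # (S,) expected rewards under pi
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy w.r.t. R(s,a) + gamma sum_s' T(s'|s,a) V(s')
        pi_new = (R + gamma * T @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V                            # converged to an optimal policy
        pi = pi_new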

Learning the value of a policy

Learning V (tabular)

General form:

V(s_t) \leftarrow V(s_t) + \alpha \left[ \overbrace{\text{target} - V(s_t)}^{\text{TD error}} \right] = (1 - \alpha) V(s_t) + \alpha \; \text{target}

with \alpha, the learning rate ( 0 < \alpha < 1 ).

Monte Carlo estimation (where g_t is the value of the return G_t):

V(s_t) \leftarrow V(s_t) + \alpha \left( g_t - V(s_t) \right) = (1 - \alpha) \, V(s_t) + \alpha \, g_t

One-step Temporal Difference, TD(0):

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]

n-step Temporal Difference:

V(s_t) \leftarrow V(s_t) + \alpha \left[ \underbrace{r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma ^{n} V(s_{t+n})}_{g_{t:t+n} \triangleq} - V(s_t) \right]

TD(λ) uses a combination of each n-step temporal difference:

V(s_t) \leftarrow V(s_t) + \alpha \left[ \underbrace{(1 - \lambda) \sum_{n=1}^{H-t-1} \lambda^{n-1} g_{t:t+n} + \lambda^{H-t-1} g_t}_{g^\lambda_t \triangleq} - V(s_t) \right]
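
A sketch of the tabular one-step TD(0) loop (the Monte Carlo, n-step and TD(λ) variants above only change the target line). The env object with reset() and step(a) returning (s_next, r, done), and the behaviour policy(s), are assumed interfaces, not part of the original post:

import numpy as np

def td0_evaluation(env, policy, n_states, gamma=0.99, alpha=0.1, episodes=1000):
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])   # TD(0) target
            V[s] += alpha * (target - V[s])                     # V(s) <- V(s) + alpha * TD error
            s = s_next
    return V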

Learning Q (tabular)

SARSA:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right]

n-step SARSA:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_t + \gamma \, r_{t+1} + \ldots + \gamma^{n-1} \, r_{t+n-1} + \gamma ^{n} \, Q(s_{t+n},a_{t+n}) - Q(s_t,a_t) \right]

SARSA(λ):

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ g^\lambda_t - Q(s_t,a_t) \right]
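
A one-step SARSA sketch under the same assumed env interface as above, with an ε-greedy behaviour policy (see the exploration section below):

import numpy as np

def sarsa(env, n_states, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, episodes=1000):
    Q = np.zeros((n_states, n_actions))
    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(Q[s].argmax())
    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            target = r + (0.0 if done else gamma * Q[s2, a2])   # on-policy (SARSA) target
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q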

Learning V (parametric)

Instead of using a tabular value function, we can use a parametric function V_\phi(s).

We can minimize the squared-error loss:

\mathcal{L}(\phi) \triangleq \frac{1}{|\mathcal{D}|} \sum_{s,\text{target}} \left( \text{target} - V_\phi(s) \right)^2

This gives the following gradient:

\nabla_\phi \mathcal{L}(\phi) = - \frac{2}{|\mathcal{D}|} \sum_{s,\text{target}} \left( \text{target} - V_\phi(s) \right) \nabla_\phi V_\phi(s)

Monte Carlo estimation uses the reward-to-go g_t of a single trajectory as target:

\phi \leftarrow \phi + \alpha \left( g_t - V_\phi(s_t) \right) \nabla_\phi V_\phi(s_t)

One-step TD Learning:

\phi \leftarrow \phi + \alpha \left[ r_t + \gamma V_{\bar{\phi}}(s_{t+1}) - V_\phi(s_t) \right] \nabla_\phi V_\phi(s_t)
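
A sketch of the one-step TD update with a linear approximator V_\phi(s) = \phi^T x(s), for which \nabla_\phi V_\phi(s) = x(s). The features(s) map and the env/policy objects are the same assumed interfaces as in the earlier sketches (no frozen copy \bar{\phi} is kept in this simple version):

import numpy as np

def linear_td0(env, policy, features, n_features, gamma=0.99, alpha=0.01, episodes=500):
    phi = np.zeros(n_features)                       # parameters of V_phi
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))
            v_s, v_s2 = features(s) @ phi, features(s2) @ phi
            td_error = r + (0.0 if done else gamma * v_s2) - v_s
            phi += alpha * td_error * features(s)    # grad_phi V_phi(s) = x(s)
            s = s2
    return phi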

Learning Q (parametric)

Instead of using a tabular value function, we can use a parametric function Q_\phi(s,a). For example, this can be a neural network taking the state as input and with one output neuron per action value.

SARSA with function approximation:

\phi \leftarrow \phi + \alpha \left[ r + \gamma Q_{\bar{\phi}}(s',a') - Q_\phi(s,a) \right] \nabla_\phi Q_\phi(s,a)

Q-learning

In Q-learning, we learn the optimal action-value function Q^\star, or an approximation of it (\hat{Q}^\star \approx Q^\star), from (off-policy) trajectories. We can then use this learned function to choose the best action for a given state:

\begin{aligned} \pi^\star(s) = \argmax_{a} Q^\star(s,a) \\ \hat\pi^\star(s) = \argmax_{a} \hat{Q}^\star(s,a) \end{aligned}

Tabular Q-learning

In tabular Q-learning, the value function Q(s,a) is represented as a table:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

This assumes that the state and action spaces are finite, and it is not practical if they are too big. One drawback is that there is no generalization across states and actions.
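
A tabular Q-learning sketch with an ε-greedy behaviour policy, under the same assumed env interface as in the previous sections:

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, episodes=1000):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy (Q-learning is off-policy)
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * Q[s2].max())   # max_a' Q(s',a')
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q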

Tabular Double Q-learning (which mitigates the maximization bias) learns two functions Q and Q':

\begin{aligned} Q(s,a) & \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q'(s', \argmax_{a'} Q(s',a')) - Q(s,a) \right] \\ Q'(s,a) & \leftarrow Q'(s,a) + \alpha \left[ r + \gamma Q(s', \argmax_{a'} Q'(s',a')) - Q'(s,a) \right] \end{aligned}

Q-Networks

Q-networks: using Q-learning with function approximation (i.e. neural networks). When the Q-networks are deep neural networks, we have Deep Q-learning.

We can minimize a mean squared error loss of the form:

\mathcal{L}(\phi) \triangleq \frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} \left( r + \gamma \max_{a'} Q_{\bar{\phi}}(s',a') - Q_\phi(s,a) \right)^2

where \bar{\phi}=\phi but is considered frozen and therefore does not contribute to the gradient (stop-gradient).

This gives the following gradient:

\nabla_\phi \mathcal{L}(\phi) = - \frac{2}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} \left( r + \gamma \max_{a'} Q_{\bar{\phi}}(s',a') - Q_\phi(s,a) \right) \nabla_\phi Q_\phi(s, a)

And the update:

\phi \leftarrow \phi + \alpha \frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} \left( r + \gamma \max_{a'} Q_{\bar{\phi}}(s',a') - Q_\phi(s,a) \right) \nabla_\phi Q_\phi(s, a)
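
A sketch of this batched update using, for concreteness, a linear approximator Q_\phi(s,a) = \phi_a^T x(s) instead of a neural network (so that the gradient can be written by hand). The batch of (s, a, r, s', done) transitions, the features(s) map and the frozen copy phi_target are assumptions made for the example; a real (deep) Q-network would minimize the same loss with an autodiff library and an optimizer:

import numpy as np

def q_update(phi, phi_target, batch, features, gamma=0.99, alpha=1e-3):
    """One gradient step on the mean squared TD error; phi has shape (n_actions, n_features)."""
    grad = np.zeros_like(phi)
    for (s, a, r, s2, done) in batch:
        x, x2 = features(s), features(s2)
        target = r + (0.0 if done else gamma * (phi_target @ x2).max())   # frozen target
        td_error = target - phi[a] @ x
        grad[a] += td_error * x                  # grad_phi Q_phi(s,a) = x(s), on row a only
    return phi + alpha * grad / len(batch)

The frozen copy phi_target is refreshed periodically (phi_target ← phi), which is the usual way of implementing the stop-gradient \bar{\phi} above.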

Double DQN

Double Deep Q Networks (Double DQN) is double Q-learning with Q-networks:

\begin{aligned} \phi & \leftarrow \phi + \alpha \frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} \left( r + \gamma Q_{\phi'}(s', \argmax_{a'} Q_{\bar{\phi}}(s',a')) - Q_\phi(s,a) \right) \nabla_\phi Q_\phi(s, a) \\ \phi' & \leftarrow \phi' + \alpha \frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} \left( r + \gamma Q_{\phi}(s', \argmax_{a'} Q_{\bar{\phi}'}(s',a')) - Q_{\phi'}(s,a) \right) \nabla_{\phi'} Q_{\phi'}(s, a) \end{aligned}

Policy optimization

In policy optimization methods, we directly learn/optimize the policy. The goal is to find a policy \pi in order to optimize the utility J(\pi):

J(\pi) \triangleq \mathbb{E}\left[G_0 | \pi\right] = \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^t R_t | \pi\right] = \mathbb{E}\left[ R(\tau) | \pi\right]

Note: policy optimization objective and state visitation

In infinite-horizon discounted settings we have:

J(\pi) = \sum_s \rho_\pi(s) \sum_a \pi(a|s) \, R(s,a)

with \rho_\pi(s) the (non-normalized) state visitation weights given by,

\rho_\pi(s) = \sum_{t=0}^{H-1} \gamma^t P(S_t = s | \pi)

In infinite horizon without discount, we can maximize the long-term average reward instead,

\max_\pi \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[ \sum_{t=0}^n R_t \right]

which gives:

J(\pi) = \sum_s \rho_\pi(s) \sum_a \pi(a|s) R(s,a)

with \rho_\pi(s) the (normalized) state visitation probabilities.

In both cases, we have the policy gradient:

\nabla_\theta J(\theta) = \sum_s \rho_\pi(s) \sum_a Q^\pi(s,a) \nabla_\theta \pi_\theta(a|s)

Form of the policy

In order to make the policy differentiable, we often consider stochastic policies. When using a continuous action space, the policy can be deterministic as well.

We use the following notations:

\begin{align*} (A_t|S_t = s) & \sim\pi_\theta(s) \\ \pi_\theta(a|s) &= P(A_t = a | S_t = s) & \text{ discrete action space } \\ \pi_\theta(a|s) &= p(A_t = a | S_t = s) & \text{ continuous action space } \end{align*}

Note: finite horizon

When using finite horizon, the action selection usually depends on the current time t: (A_t|S_t = s) \sim\pi_{\theta,t}(s), etc. When using a parameterized function (e.g. based on a neural network), t can be an input of the neural network.

We simplify the notations by introducing:

J(\theta) \triangleq J(\pi_\theta)

Finite action space

For a finite action space, the policy \pi_\theta is typically a categorical distribution defined by a neural network of the form \pi_\theta(s) = \mathrm{softmax}(f_\theta(s)) (i.e. taking the state as input, with one output neuron per action):

\begin{aligned} (A_t | S_t = s) & \sim \mathrm{Cat}(\mathrm{softmax}(f_\theta(s))) \\ \pi_\theta(s) & = \begin{pmatrix} P(A_t = \mathfrak{a}_1 | S_t = s, \pi_\theta) \\ P(A_t = \mathfrak{a}_2 | S_t = s, \pi_\theta) \\ \ldots \end{pmatrix} = \mathrm{softmax} (f_\theta(s)) \end{aligned}

where \mathfrak{a}_i are the different possible action values.

Continuous action space

For a continuous action space, we can often use a diagonal normal policy given by a neural network of the form:

(A_t | S_t = s) \sim \mathcal{N}( \mu_\theta(s), \Sigma_\theta(s))

where \mu_\theta(s) gives the means and \Sigma_\theta(s) is a diagonal covariance matrix.

Note: policy gradient for deterministic policies

For a policy gradient algorithm which works with deterministic policies, a = \pi_\theta(s) (on continuous action spaces), see Deep Deterministic Policy Gradient (DDPG), in Continuous control with deep reinforcement learning, Lillicrap et al. (2015).

Policy Gradient methods

Policy gradient methods are of the form:

\theta \leftarrow \theta + \alpha \, \underbrace{\nabla_\theta J(\theta)}_\text{policy gradient}

We can derive the policy gradient:

\begin{aligned} \nabla_\theta J(\theta) & = \nabla_\theta \left( \sum_\tau R(\tau) \, P_\theta(\tau) \right) \\ & = \sum_\tau R(\tau) \, \nabla_\theta P_\theta(\tau) \\ & = \sum_\tau R(\tau) \, P_\theta(\tau) \, \nabla_\theta \log P_\theta(\tau) & \text{(log-derivative trick)} \\ & = \underset{\tau \sim P_\theta}{\mathbb{E}} \left[R(\tau) \,\nabla_\theta \log P_\theta(\tau) \right] \\ & = \underset{\tau \sim P_\theta}{\mathbb{E}} \left[R(\tau) \,\nabla_\theta \log \left( P(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t|s_t) \, \mathcal{T}(s_{t+1}|s_t, a_t) \right) \right] \\ & = \underset{\tau \sim P_\theta}{\mathbb{E}} \left[R(\tau) \sum_{t=0}^{H-1} \,\nabla_\theta \log \pi_\theta(a_t | s_t) \right] \\ & = \underset{\tau \sim P_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{H-1} \gamma^t \, g_t \,\nabla_\theta \log \pi_\theta(a_t | s_t) \right] \end{aligned}

This gives the following sample estimate:

\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{D}|} \sum_{(s,a,g) \in \mathcal{D}} \sum_{t=0}^{H-1} \gamma^t \, g_t \,\nabla_\theta \log \pi_\theta(a_t | s_t)

where \mathcal{D} is an on-policy dataset (of trajectories sampled with \pi_\theta). This is in contrast to Q-learning, which does not require on-policy data.

Basic policy gradient algorithm:

\theta \leftarrow \theta + \alpha \frac{1}{|\mathcal{D}|} \sum_{(s,a,g) \in \mathcal{D}} \sum_{t=0}^{H-1} \gamma^t \, g_t \, \underbrace{ \nabla_\theta \log \pi_{\theta}(a_t | s_t) }_{ = \frac{ \nabla_\theta \pi_{\theta}(a_t | s_t) }{\pi_{\theta}(a_t | s_t)} }

The REINFORCE algorithm (|\mathcal{D}| = 1):

\theta \leftarrow \theta + \alpha \, \gamma^t \, g_t \, \nabla_\theta \log \pi_{\theta}(a_t | s_t)
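
A REINFORCE sketch with a linear softmax policy \pi_\theta(a|s) = \mathrm{softmax}(\theta^T x(s)), for which \nabla_\theta \log \pi_\theta(a|s) = x(s) (e_a - \pi_\theta(\cdot|s))^T. The env interface and the features(s) map are the same assumptions as in the earlier sketches; the baseline variant below would only subtract b(s_t) from g_t:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce(env, features, n_features, n_actions, gamma=0.99, alpha=0.01, episodes=2000):
    theta = np.zeros((n_features, n_actions))
    for _ in range(episodes):
        # 1. Sample one trajectory with pi_theta
        traj, s, done = [], env.reset(), False
        while not done:
            a = np.random.choice(n_actions, p=softmax(features(s) @ theta))
            s2, r, done = env.step(a)
            traj.append((s, a, r))
            s = s2
        # 2. Walk the trajectory backwards to get g_t, then apply the REINFORCE update
        g = 0.0
        for t, (s, a, r) in reversed(list(enumerate(traj))):
            g = r + gamma * g                                  # reward-to-go g_t
            probs = softmax(features(s) @ theta)
            grad_log_pi = np.outer(features(s), np.eye(n_actions)[a] - probs)
            theta += alpha * (gamma ** t) * g * grad_log_pi
    return theta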

REINFORCE with baseline:

\theta \leftarrow \theta + \alpha \, \gamma^t \, (g_t - b(s_t)) \, \nabla_\theta \log \pi_{\theta}(a_t | s_t)

where b(s) is the baseline (it must not depend on a_t or \theta).

Note: baseline

For any function b(s), which does not depend on a, we have:

\begin{aligned} \mathbb{E}_{a \sim \pi_\theta(s)}\left[b(s) \nabla_\theta \log \pi(a|s) \right] &= b(s) \, \mathbb{E}_{a \sim \pi_\theta(s)}\left[ \nabla_\theta \log \pi(a|s) \right] \\ &= b(s) \, \mathbb{E}_{a \sim \pi_\theta(s)}\left[ \frac{\nabla_\theta \pi(a|s)}{ \pi(a|s)} \right] \\ &= b(s) \, \sum_a \frac{\nabla_\theta \pi(a|s)}{ \pi(a|s)} \, \pi(a|s) \\ &= b(s) \, \sum_a \nabla_\theta \pi(a|s) \\ &= b(s) \, \nabla_\theta \sum_a \pi(a|s) \\ &= b(s) \, \nabla_\theta 1 \\ &= 0 \end{aligned}

Therefore, adding/subtracting such a baseline does not change the expectation of the gradient:

\begin{aligned} \nabla_\theta J(\theta) &= \underset{\tau \sim p_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{H-1} \gamma^t \, g_t \nabla_\theta \log \pi_\theta(a_t | s_t) \right] \\ & = \underset{\tau \sim p_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{H-1} \gamma^t \, \left(g_t - b(s_t)\right)\,\nabla_\theta \log \pi_\theta(a_t | s_t) \right] \end{aligned}

A good choice for variance reduction is b(s) = V^\pi(s).

In practice, we often use an approximation V_\phi(s) \approx V^\pi(s).

More generally:

\begin{aligned} \nabla_\theta J(\theta) & = \underset{\tau \sim p_\theta}{\mathbb{E}} \left[ \sum_{t=0}^{H-1} \psi_t \,\nabla_\theta \log \pi_\theta(a_t | s_t) \right] \\ \nabla_\theta J(\theta) & \approx \frac{1}{|\mathcal{D}|} \sum_{(s,a,\psi) \in \mathcal{D}} \sum_{t=0}^{H-1} \psi_t \,\nabla_\theta \log \pi_\theta(a_t | s_t) & \text{(sample estimate)} \end{aligned}

with either of:

\begin{aligned} \psi_t & = G_0 = \sum_{u=0}^{H-1} \, \gamma^u r_u \\ \psi_t &= \gamma^t \, \left( g_t - b(s_t) \right) = \gamma^t \, \left( \sum_{u=t}^{H-1} \gamma^{u-t} r_u - b(s_t) \right) \\ \psi_t &= \gamma^t \, \left( Q^{\pi_\theta}(s_t, a_t) - b(s_t) \right) \\ \psi_t &= \gamma^t \, \left( A^{\pi_\theta}(s_t, a_t) - b(s_t) \right) \\ \psi_t &= \gamma^t \left( r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t) \right) \end{aligned}

Leading to:

\theta \leftarrow \theta + \alpha \frac{1}{|\mathcal{D}|} \sum_{(s,a,\psi) \in \mathcal{D}} \sum_{t=0}^{H-1} \psi_t \, \nabla_\theta \log \pi_{\theta}(a_t | s_t)

Actor-critic methods

Actor-critic methods are policy gradient methods that use TD learning to estimate the return G_t.

A2C

Advantage actor-critic (A2C) uses the one-step return G_t \approx g_{t:t+1} and the baseline b(s) = V_\phi(s), and learns \pi_\theta and an approximation V_\phi of its value function at the same time:

\begin{aligned} \delta_t & = r_t + \gamma \, V_\phi(s_{t+1}) - V_\phi(s_t) &\text{(TD error)} \\ \phi & \leftarrow \phi + \eta \, \delta_t \, \nabla_\phi V_\phi(s_t) &\text{(critic)} \\ \theta & \leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t | s_t) &\text{(actor)} \end{aligned}

When the policy and the value function have shared parameters \varphi (multi-headed neural network), we can use a single objective:

\begin{aligned} & \min_\varphi \mathcal{L}(\varphi) \\ \mathcal{L}(\varphi) & = \lambda_1 \mathcal{L}_1(\varphi) - \lambda_2 \mathcal{L}_2(\varphi) - \lambda_3 \mathcal{L}_3(\varphi) \\ \mathcal{L}_1(\varphi) &= \frac{1}{H} \sum_{t=0}^{H-1} (r_t + \gamma \, V_{\bar{\varphi}}(s_{t+1}) - V_\varphi(s_t))^2 & \text{(critic)} \\ \mathcal{L}_2(\varphi) &= \frac{1}{H} \sum_{t=0}^{H-1} (r_t + \gamma \, V_{\bar{\varphi}}(s_{t+1}) - V_\varphi(s_t)) \log \pi_\varphi(a_t | s_t) & \text{(actor)} \\ \mathcal{L}_3(\varphi) &= - \frac{1}{H} \sum_{t=0}^{H-1} \sum_a \pi_\varphi(a | s_t) \log \pi_\varphi(a | s_t) & \text{(entropy)} \end{aligned}

The (optional) entropy term acts as a regularizer.
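
A sketch of the per-step A2C updates above, with a linear critic V_\phi and a linear softmax actor (separate parameter vectors; the shared-parameter variant would minimize the combined loss instead). All names are illustrative assumptions:

import numpy as np

def a2c_step(theta, phi, features, n_actions, s, a, r, s2, done,
             gamma=0.99, alpha=0.01, eta=0.05):
    """One actor-critic update for a single transition (s, a, r, s2)."""
    x, x2 = features(s), features(s2)
    v_next = 0.0 if done else x2 @ phi
    delta = r + gamma * v_next - x @ phi              # TD error
    phi = phi + eta * delta * x                       # critic: grad_phi V_phi(s) = x(s)
    logits = x @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # pi_theta(.|s), linear softmax policy
    grad_log_pi = np.outer(x, np.eye(n_actions)[a] - probs)
    theta = theta + alpha * delta * grad_log_pi       # actor
    return theta, phi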

A3C

In Asynchronous Advantage Actor Critic (A3C), several worker threads accumulate gradients independently:

\begin{aligned} \delta_t & = r_t + \gamma \, V_\phi(s_{t+1}) - V_\phi(s_t) &\text{(TD error)} \\ \delta\phi & \leftarrow \delta\phi + \eta \, \delta_t \, \nabla_\phi V_\phi(s_t) &\text{(critic)} \\ \delta\theta & \leftarrow \delta\theta + \alpha \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t | s_t) &\text{(actor)} \end{aligned}

and apply them asynchronously to shared parameters \theta and \phi:

\begin{aligned} \phi &\leftarrow \phi + \delta\phi & \delta\phi &\leftarrow 0 \\ \theta &\leftarrow \theta + \delta\theta & \delta\theta &\leftarrow 0 \end{aligned}

Natural policy gradient

Kakade's Natural Policy Gradient

Kakade's Natural Policy Gradient uses natural gradient ascent with a constraint on the KL divergence between the current policy \pi_\theta and the new policy \pi_{\theta+\delta}:

\theta \leftarrow \theta + \alpha \, \argmax_{\delta} J(\theta+\delta) \text{ s.t. } \underbrace{ \underset{s \sim \rho_\theta}{\mathbb{E}}\left[ D_\mathrm{KL}(\pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right] }_{D(\theta\|\theta+\delta) \triangleq} \le \epsilon

This gives the update:

\theta \leftarrow \theta + \alpha \, \underbrace{ \hat{H}_\theta^{-1} \, \nabla_\theta \hat{J}(\theta)}_{\text{natural policy gradient}}

with

\begin{aligned} H_\theta & = \left[ \nabla^2_\delta \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ D_\mathrm{KL}(\pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right] \right]_{\delta \coloneqq 0} \\ & = \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ \left[ \nabla^2_\delta D_\mathrm{KL}(\pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right]_{\delta \coloneqq 0} \right] \\ & = \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ \underset{a \sim \pi_\theta(s)}{\mathbb{E}} \left[ (\nabla_{\theta} \log \pi_{\theta}(a|s)) \, (\nabla_{\theta} \log \pi_{\theta}(a|s))^T \right] \right] \end{aligned}

Note: justification

We can replace the constraint of the optimization problem by a penalization (soft-constraint):

\theta \leftarrow \theta + \alpha \, \argmax_\delta \underbrace{ \left( J(\theta+\delta) - \beta \underset{s \sim \rho_\theta}{\mathbb{E}}\left[ D_\mathrm{KL}(\pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right] \right) }_{L_\theta(\delta)}

Approximation (Taylor expansion):

\begin{aligned} J(\theta+\delta) & \approx J(\theta) + \delta^T \, \nabla_\theta J(\theta) \\ D(\theta \| \theta+\delta) & \approx \frac{1}{2} \, \delta^T \, H_\theta \, \delta \\ L_\theta(\delta) & \approx J(\theta) + \delta^T \, \nabla_\theta J(\theta) - \frac{1}{2} \, \beta \, \delta^T \, H_\theta \, \delta \end{aligned}

We search \delta as a solution of \nabla_\delta L_\theta(\delta) = 0:

0 + \nabla_\theta J(\theta) - \beta \, H_\theta \, \delta = 0

Which gives:

\begin{aligned} \delta & = \frac{1}{\beta} \, H_\theta^{-1} \nabla_\theta J(\theta) \\ \theta' &= \theta + \underbrace{\frac{1}{\beta}}_\alpha \, H_\theta^{-1} \nabla_\theta J(\theta) \end{aligned}

Morimura's Natural Policy Gradient

Morimura's Natural Policy Gradient (aka natural state-action gradient) uses a constraint based on the KL divergence between the joint distributions over states and actions, P_\theta(s,a), instead of the conditional action distributions \pi_\theta(a|s):

\theta \leftarrow \theta + \alpha \, \argmax_\delta J(\theta+\delta) \text{ subject to } D_\mathrm{KL}(P_\theta \| P_{\theta + \delta}) \le \epsilon

TRPO

Trust Region Policy Optimization (TRPO) is closely related to Kakade's Natural Policy Gradient, but it works on a surrogate objective (with an importance sampling ratio) and enforces the KL constraint with a backtracking line search instead of a fixed step size.

We start with the objective:

\theta \leftarrow \argmax_{\theta'} \underbrace{ \underset{s \sim \rho_\theta, a \sim q(s)}{\mathbb{E}} \left[ \frac{\pi_{\theta'}(a|s)}{q(a|s)} Q^{\pi_\theta}(s,a) \right] }_{ \tilde{J}_\theta(\theta') \triangleq } \text{ s.t. } \underbrace{ \overbrace{ \underset{s \sim \rho_\theta}{\mathbb{E}}\left[ D_\mathrm{KL}(\pi_\theta(s) \| \pi_{\theta'}(s)) \right] }^{ D(\theta \| \theta') \triangleq } \le \epsilon}_\text{trust region}

with q(a|s) a sampling distribution, typically \pi_\theta(a|s); the likelihood ratio \pi_{\theta'}(a|s)/q(a|s) is introduced because we are sampling a using q(s) instead of \pi_{\theta'}(s) (importance sampling).

Penalized version:

\theta \leftarrow \argmax_{\theta'} \underset{s \sim \rho_{\theta}, a \sim q(s)}{\mathbb{E}}\left[ \frac{\pi_{\theta'}(a|s)}{q(a|s)} Q^{\pi_{\theta}}(s,a) \right] - \beta \, D(\theta \| \theta')

This is approximated using a Taylor expansion (with \delta = \theta' - \theta):

\begin{aligned} \tilde{J}_{\theta}(\theta+\delta) & \approx \tilde{J}_{\theta}(\theta) + g_\theta^T \, \delta \\ D(\theta \| \theta+\delta) & \approx \frac{1}{2} \, \delta^T \, H_\theta \, \delta \\ g_\theta &= \left[ \nabla_{\delta} \tilde{J}_{\theta}(\theta+\delta) \right]_{\delta \coloneqq 0} \\ H_\theta &= \left[ \nabla^2_\delta D(\theta \| \theta+\delta) \right]_{\delta \coloneqq 0} \end{aligned}

Leading to the update:

\begin{aligned} s & \leftarrow \hat{H}_\theta^{-1} \, \hat{g}_\theta & \text{(search direction)} \\ \beta & \leftarrow \sqrt{\frac{2 \epsilon}{s^T \, \hat{H}_\theta \, s}} & \text{(maximum step size)} \\ \beta & \leftarrow \eta^k \beta & \text{where } k \in \mathbb{N} \text{ is the smallest natural integer such that} \\ & & \hat{D}(\theta \| \theta + \eta^k \, \beta \, s) \le \epsilon \text{ and } \tilde{J}_\theta(\theta + \eta^k \, \beta \, s ) \gt \tilde{J}_\theta(\theta) \\ \theta & \leftarrow \theta + \beta \, s \end{aligned}

with 0 < \eta < 1.

The Q values Q^{\pi_\theta}(s,a) can for example be estimated by the reward to go g_t of the sampled trajectories.

Note: TRPO (single-path) algorithm

  1. Sample trajectories from \pi_\theta
  2. Compute \hat{g}_\theta = \left[ \nabla_\delta \left( \frac{1}{n} \sum_{s,a,g} \frac{\pi_{\theta+\delta}(a|s)}{\pi_\theta(a|s)} \, g \right) \right]_{\delta \coloneqq 0} where g is the reward-to-go after (s,a) in the sampled trajectory
  3. Find the search direction s \leftarrow \hat{H}_\theta^{-1} \, \hat{g}_\theta (using the conjugate gradient algorithm)
  4. Find the maximum step size \beta \leftarrow \sqrt{\frac{2 \epsilon}{s^T \, \hat{H}_\theta \, s}}
  5. Find the step size \beta \leftarrow \eta^k \beta where k is the smallest natural integer such that
    • the constraint is verified: \hat{D}(\theta \| \theta + \eta^k \, \beta \, s) \le \epsilon
    • the objective increases: \tilde{J}_\theta(\theta + \eta^k \, \beta \, s ) \gt \tilde{J}_\theta(\theta)
  6. Update parameters \theta \leftarrow \theta + \beta s
  7. Repeat

Note: derivation

Once the search direction s has been found, we find the step size (i.e. \beta) so that the KL constraint is verified.

We start with the largest \beta such that:

\begin{aligned} \frac{1}{2} \, (\beta s)^T \, \hat{H}_\theta \, (\beta s) & \le \epsilon \\ \frac{1}{2} \, \beta^2 \, s^T \, \hat{H}_\theta \, s & \le \epsilon \end{aligned}

Which gives:

\beta^\star = \sqrt{\frac{2 \epsilon}{s^T \, \hat{H}_\theta \, s}}

And then we choose the largest \beta = \eta^k \, \beta^\star (i.e. the smallest k \in \mathbb{N}) such that:

  • the constraint is verified, D(\theta, \theta + \beta \, s) \le \epsilon;
  • the objective is improved, \tilde{J}_\theta(\theta + \beta \, s ) \gt \tilde{J}_\theta(\theta) .

PPO

Proximal Policy Optimization (PPO) uses the same idea as TRPO (avoiding moving too far away from the current policy) but without using second-order derivatives (the hessian).

PPO-Clip

PPO-Clip uses an update of the form:

\begin{aligned} \theta & \leftarrow \theta + \argmax_\delta \underbrace{ \underset{s \sim \rho_\theta, a \sim \pi_\theta(s)}{\mathbb{E}} \left[ \overbrace{ \min \left( \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } \hat{A}_{\pi_\theta}(s,a), g_\epsilon( \hat{A}_{\pi_\theta}(s,a) ) \right) }^{L_\theta(\delta,s,a)} \right] }_{L_\theta(\delta)} \\ g_\epsilon(x) & = \begin{cases} (1 + \epsilon) \, x & \text{ if } x \ge 0 \\ (1 - \epsilon) \, x & \text{ if } x \lt 0 \end{cases} \end{aligned}

where \argmax_\delta is solved using gradient ascent.

The \epsilon parameter is typically around \epsilon = 0.2.
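
A sketch of the clipped surrogate objective evaluated on a batch of samples. It is written with np.clip, which is equivalent to the min(·, g_\epsilon(·)) form above; logp_new, logp_old and adv are hypothetical arrays of log-probabilities and advantage estimates gathered from rollouts, and in practice the maximization over \delta is done with an autodiff library:

import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, epsilon=0.2):
    """Average clipped surrogate L_theta(delta), to be maximized by gradient ascent."""
    ratio = np.exp(logp_new - logp_old)                          # pi_{theta+delta}(a|s) / pi_theta(a|s)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    return float(np.mean(np.minimum(ratio * adv, clipped)))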

Note: explanation of the surrogate objective

This surrogate objective L_\theta(\delta) removes the incentive to move too far away from \pi_\theta, based on the \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } ratio:

  • For \hat{A}_{\pi_\theta}(s,a) > 0, \nabla_\delta L_\theta(\delta,s,a) = 0 when \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } > 1 + \epsilon.
  • For \hat{A}_{\pi_\theta}(s,a) < 0, \nabla_\delta L_\theta(\delta,s,a) = 0 when \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } < 1 - \epsilon.

Note: PPO-clip algorithm

  1. Sample trajectories from \pi_\theta
  2. Estimate advantages \hat{A}_{\pi_\theta}(s,a)
  3. Optimize \max_\delta L_\theta(\delta) using gradient ascent
  4. Update parameters \theta \leftarrow \theta + \delta
  5. Repeat

PPO-Penalty

PPO-Penalty uses the following penalized objective:

\delta \leftarrow \argmax_\delta \underset{s \sim \rho_\theta, a \sim \pi_\theta(s)}{\mathbb{E}} \left[ \frac{ \pi_{\theta+\delta}(a|s) }{ \pi_{\theta}(a|s) } \hat{A}_{\pi_\theta}(s,a) - \beta D_\mathrm{KL}( \pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right]

where \argmax_\delta is solved using gradient ascent.

The KL penalty is adaptive. After each update of \theta, the \beta parameter is updated in order to track a target KL divergence d_\text{target}:

\begin{aligned} d & \leftarrow \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ D_\mathrm{KL}( \pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right] \\ \beta & \leftarrow \begin{cases} \beta/2 & \text{if } d < d_\text{target}/1.5 \\ \beta & \text{if } d \in [d_\text{target}/1.5, 1.5 \, d_\text{target}] \\ 2 \, \beta & \text{if } d > 1.5 \, d_\text{target} \end{cases} \\ \theta & \leftarrow \theta + \delta \end{aligned}

If the expectation of the KL divergence was too small, \beta is reduced, which leads to larger policy steps. On the other hand, if it was too large, \beta is increased, which leads to smaller policy steps.

Note: PPO-penalty algorithm

  1. Sample trajectories from \pi_\theta;
  2. Estimate advantages \hat{A}_{\pi_\theta}(s,a);
  3. Optimize \max_\delta L_\theta(\delta) using gradient ascent
  4. Adaptive update of the parameter \beta based on \underset{s \sim \rho_\theta}{\mathbb{E}} \left[ D_\mathrm{KL}( \pi_\theta(s) \| \pi_{\theta+\delta}(s)) \right]
  5. Update parameters \theta \leftarrow \theta + \delta
  6. Repeat

Model-based reinforcement learning

Model-based reinforcement learning (MBRL) methods learn the world model. Once the world model is learned, we can use planning algorithms.

Not covered here.

Exploration/exploitation tradeoff

In order to make the RL algorithm work, we need to find a balance between exploration (of new states, trajectories, actions) and exploitation (of known good states and actions).

ε-soft policies

An \epsilon-soft policy is a policy such that:

\forall s, \forall a, \pi(a|s) \ge \frac{\epsilon}{|\mathbb{A}|}

One example is the \epsilon-greedy policy: it takes the best action (according to some Q function) with probability (1 - \epsilon) and a uniformly random action with probability \epsilon.

This is often used with Q-learning in order to have some amount of exploration.
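
A minimal ε-greedy action selection helper (the same rule is used inline in the tabular SARSA and Q-learning sketches above); Q is a hypothetical (n_states, n_actions) table:

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """Best action w.r.t. Q with probability 1 - epsilon, uniformly random action otherwise."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))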

Entropy regularization

We can add some regularization in order to prevent the policy from being too deterministic and keep some amount of exploration.

A max-entropy term can be added to the objective function in order to prevent the policies from being too deterministic:

\max_\pi \mathbb{E} \left[ \sum_{t=0}^T \gamma^t \left[ R_t + \beta \, \mathcal{H}(\pi(S_t)) \right] \middle| \pi \right]

Appendix, Log-derivative trick

The log-derivative trick is:

\nabla_x f(x) = f(x) \nabla_x \log f(x)

Note: proof

It follows from the chain rule of differentiation:

\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}

It is useful when trying to estimate the gradient of an expectation:

\begin{aligned} \nabla_\theta \underset{X \sim P_\theta}{\mathbb{E}} \left[ f(X) \right] &= \nabla_\theta \sum_x f(x) \, P_\theta(x) \\ &= \sum_x f(x) \, \nabla_\theta P_\theta(x) \\ &= \sum_x f(x) \, P_\theta(x) \, \nabla_\theta \log P_\theta(x) & \text{(log-derivative trick)} \\ &= \underset{X \sim P_\theta}{\mathbb{E}} \left[ f(X) \nabla_\theta \log P_\theta(X) \right] \end{aligned}

Which gives the following sample estimate,

\nabla_\theta \underset{X \sim P_\theta}{\mathbb{E}} \left[ f(X) \right] \approx \frac{1}{N} \sum_{i=1}^N f(x_i) \nabla_\theta \log P_\theta(x_i)

with samples x_i \sim P_\theta.
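
A small numerical check of this estimator for a categorical distribution P_\theta = \mathrm{Cat}(\mathrm{softmax}(\theta)) and an arbitrary f; the numbers are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])
f = np.array([1.0, 3.0, -2.0])                       # f(x) for the 3 possible outcomes

p = np.exp(theta - theta.max())
p /= p.sum()                                         # P_theta = softmax(theta)

# Exact gradient of E[f(X)] = sum_x f(x) p_x (Jacobian of softmax applied to f)
exact = (np.diag(p) - np.outer(p, p)) @ f

# Score-function estimate: (1/N) sum_i f(x_i) grad_theta log P_theta(x_i)
N = 200_000
xs = rng.choice(3, size=N, p=p)
grad_log_p = np.eye(3)[xs] - p                       # grad_theta log softmax(theta)_x
estimate = (f[xs][:, None] * grad_log_p).mean(axis=0)

print(exact, estimate)                               # the two should be close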

See Log-derivative trick.

Appendix, Importance sampling

Basic importance sampling

Estimating an expectation over distribution P using samples from another distribution Q.

For discrete distributions:

\begin{aligned} \underset{X \sim P}{\mathbb{E}} \left[ f(X) \right] & \approx \frac{1}{N} \sum_i \left[ \frac{P(x_i)}{Q(x_i)} f(x_i) \right] & \text{ with } x_i &\sim Q \text{ (i.i.d.)} \end{aligned}

with P the target distribution, Q the sampling distribution and N the number of (i.i.d.) samples drawn from Q.

More precisely,

\frac{1}{N} \sum_i \left[ \frac{P(X_i)}{Q(X_i)} f(X_i) \right] \text{ with } X_i \sim Q \text{ (i.i.d.)}

is an unbiased estimator of \underset{X \sim P}{\mathbb{E}} \left[ f(X) \right] assuming that Q(x) > 0 for all x such that P(x) f(x) > 0 .

Proof: importance sampling

\begin{aligned} \underset{X \sim P}{\mathbb{E}} \left[ f(X) \right] &= \sum_x P(x) \, f(x) \\ &= \sum_x Q(x) \frac{P(x)}{Q(x)} f(x) \\ &= \underset{X \sim Q}{\mathbb{E}} \left[ \frac{P(X)}{Q(X)} f(X) \right] \\ &\approx \frac{1}{N} \sum_{i=1}^N \left[ \frac{P(x_i)}{Q(x_i)} f(x_i) \right] & \text{ with } x_i &\sim Q \end{aligned}
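
A quick numerical illustration with two small categorical distributions (toy numbers, chosen only so that Q covers the support of P):

import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.1, 0.6, 0.3])                        # target distribution
Q = np.array([0.4, 0.4, 0.2])                        # sampling distribution
f = np.array([1.0, 5.0, -2.0])

exact = P @ f                                        # E_{X~P}[f(X)]

xs = rng.choice(3, size=100_000, p=Q)                # samples from Q
w = P[xs] / Q[xs]                                    # importance weights
basic = np.mean(w * f[xs])                           # basic importance sampling estimate

print(exact, basic)                                  # the two should be close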

Note: inconsistent estimates

This estimator can produce estimates which are inconsistent (especially when \sum_i P(x_i)/Q(x_i) is not near 1).

For example:

  • for a constant function f(x) = c, it yields estimates \left( \frac{1}{N} \sum_i P(x_i)/Q(x_i) \right) c, which are usually different from c;
  • for a function f giving values in [a,b], it might yield estimates outside of [a,b].

If this is a problem, self-normalizing importance sampling can be used.

Self-normalizing importance sampling

\frac{\sum_i w(X_i) f(X_i)}{\sum_i w(X_i)} \text{ with } w(x) = \frac{P(x)}{Q(x)} \text{ and } X_i \sim Q \text { i.i.d.}

is an estimator of \underset{X \sim P}{\mathbb{E}} \left[ f(X) \right].

Benefits: the estimates are consistent in the sense above (a constant function yields the constant itself, and the estimates stay within the range of f).

Drawbacks: the estimator is biased (the bias vanishes as the number of samples grows).

Appendix, Kullback–Leibler divergence

Kullback–Leibler divergence

The Kullback–Leibler divergence (KL divergence) D_\mathrm{KL}(P \| Q) between two distributions P and Q measures the difference between a (reference) distribution P and a distribution Q (often an approximation of P):

For discrete distributions:

D_\mathrm{KL}(P \| Q) =\sum_x P(x) \, \log \frac{P(x)}{Q(x)}

Note: KL divergence and Fisher Information Matrix

The Hessian of the KL divergence between P_\theta and P_{\theta'}, evaluated at \theta' \coloneqq \theta, is the Fisher information matrix \mathcal{I}_X(\theta) of X \sim P_\theta:

\begin{aligned} \left[ \nabla^2_{\theta'} D_\mathrm{KL}(P_\theta \| P_{\theta'}) \right]_{\theta' \coloneqq \theta} & = \mathcal{I}_X(\theta) \end{aligned}

with

\begin{aligned} \mathcal{I}_X(\theta) & \triangleq \underset{X \sim P_\theta}{\mathbb{E}} \left[ \left( \nabla_\theta \log P_\theta(X) \right) \, \left( \nabla_\theta \log P_\theta(X) \right)^T \right] \\ &= - \underset{X \sim P_\theta}{\mathbb{E}} \left[ \nabla^2_{\theta} \log P_\theta(X) \right] \end{aligned}

See Appendix A.3 of Natural Gradient Methods: Perspectives, Efficient-Scalable Approximations, and Analysis.

Entropy

The entropy of a distribution P is:

\mathcal{H}(P) = - \sum_x P(x) \log P(x)

The cross-entropy of Q relative to P is:

\mathcal{H}(P,Q) = - \sum_x P(x) \log Q(x)

Relation between the (cross-)entropy and the KL divergence:

\mathcal{H}(P,Q) = \mathcal{H}(P) + D_\mathrm{KL}(P \| Q)
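
A small numerical check of these definitions and of the relation above, for two discrete distributions (toy numbers):

import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(P * np.log(P))                     # H(P)
cross_entropy = -np.sum(P * np.log(Q))               # H(P, Q)
kl = np.sum(P * np.log(P / Q))                       # D_KL(P || Q)

assert np.isclose(cross_entropy, entropy + kl)       # H(P,Q) = H(P) + D_KL(P||Q)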

Appendix, gradient descent

Ordinary gradient descent

Ordinary gradient descent for \min_\theta f(\theta) can be seen as an approximation of:

\theta ← \theta + \argmin_\delta f(\theta + \delta) \text{ subject to } \|\delta\|_2^2 \le \epsilon

i.e. we are searching for \theta' = \theta + \delta for a small \delta.

Leading to:

\theta ← \theta - \alpha \, \nabla_\theta f(\theta)

Note: derivation

We can use a relaxation of the constrained problem:

\min_\delta \underbrace{ f(\theta + \delta) + \lambda \|\delta\|_2^2 }_{L(\delta)}

for some \lambda \ge 0. We can then search a local optimum by solving \nabla_\delta L(\delta) = 0.

Equivalently, we can form the Lagrangian, L(\delta, \lambda) = f(\theta + \delta) + \lambda (\|\delta\|_2^2 - \epsilon) and find a local optimum by solving \nabla_\delta L(\delta, \lambda) = 0 and \nabla_\lambda L(\delta, \lambda) = 0 for unknown \delta and \lambda. The former equation is equivalent to the gradient of the relaxed problem and the latter equation is equivalent to the constraint ( \|\delta\|_2^2 = \epsilon ).

We use a linear approximation of f around \theta:

f(\theta + \delta) = f(\theta) + \delta^T \, \nabla_\theta f(\theta) + O(\|\delta\|^2)

Which gives:

\min_\delta \underbrace{ f(\theta) + \delta^T \, \nabla_\theta f(\theta) + \lambda \|\delta\|_2^2 }_{\tilde{L}(\delta)}

We search a local optimum by solving \nabla_\delta \tilde{L}(\delta) = 0 for the unknown \delta:

0 + \nabla_\theta f(\theta) + 2 \, \lambda \, \delta = 0

Which gives:

\begin{aligned} \delta & = - \frac{1}{2 \lambda} \nabla_\theta f(\theta) \\ \theta' & = \theta - \underbrace{\frac{1}{2 \lambda}}_\alpha \nabla_\theta f(\theta) \end{aligned}

Natural gradient descent

More generally, we can introduce a constraint using another metric \|\delta\|_G^2 defined as:

\|\delta\|_G^2 \triangleq \delta^T \, G(\theta) \, \delta = \sum_{i,j} \delta_i \, \delta_j \, G_{i,j}(\theta)

where G(\theta) is a positive-definite matrix.

Which gives the constrained optimization problem:

\theta ← \theta + \argmin_\delta f(\theta + \delta) \text{ subject to } \|\delta\|_G^2 \le \epsilon

Leading to the natural gradient descent update:

\theta ← \theta - \alpha \, G(\theta)^{-1} \, \nabla_\theta f(\theta)

This generalizes the previous section, when using G(\theta)=I.

Note: derivation

Relaxed problem after Taylor approximation of f:

\min_\delta \underbrace{ f(\theta) + \delta^T \, \nabla_\theta f(\theta) + \lambda \, \delta^T \, G(\theta) \, \delta }_{\tilde{L}(\delta)}

We find a local minimum by solving \nabla_\delta \tilde{L}(\delta) = 0:

\begin{aligned} \nabla_\theta f(\theta) + 2 \, \lambda \, G(\theta)\, \delta = 0 \end{aligned}

Which gives:

\begin{aligned} \delta &= - \frac{1}{2 \, \lambda} G(\theta)^{-1} \nabla_\theta f(\theta) \\ \theta' &= \theta - \underbrace{\frac{1}{2 \, \lambda}}_\alpha \underbrace{G(\theta)^{-1} \nabla_\theta f(\theta)}_\text{natural gradient} \end{aligned}

Natural gradient descent and hessian

More generally, if we consider a constraint of the form D(\theta,\theta+\delta) \le \epsilon where D(\theta,\theta+\delta) is a distance or divergence:

\min_\delta f(\theta + \delta) \text{ subject to } D(\theta,\theta+\delta) \le \epsilon

We get the natural gradient update:

\theta ← \theta - \alpha \, H_\theta^{-1} \, \nabla_\theta f(\theta)

where H_\theta is the hessian of the distance function D(\theta,\theta') evaluated at \theta'=\theta:

H_\theta = \left[ \nabla^2_\delta D(\theta, \theta+\delta) \right]_{\delta \coloneqq 0}

Some assumptions: D(\theta,\theta) = 0, the gradient \left[ \nabla_\delta D(\theta, \theta+\delta) \right]_{\delta \coloneqq 0} is zero (D is minimal at \theta' = \theta), and H_\theta is positive-definite.

Note: derivation

We can approximate D(\theta,\theta+\delta) using a second order Taylor expansion:

\begin{aligned} D(\theta,\theta+\delta) & = \underbrace{ D(\theta,\theta) }_{=0} + \delta^T \, \underbrace{ \left[ \nabla_\delta D(\theta, \theta+\delta) \right]_{\delta \coloneqq 0} }_{ =0 } + \frac{1}{2} \, \delta^T \, H_\theta \, \delta + O(\|\delta\|^3) \\ & \approx \frac{1}{2} \delta^T \, H_\theta \, \delta \end{aligned}

Relaxed problem after Taylor approximation of f:

\min_\delta \underbrace{ f(\theta) + \delta^T \, \nabla_\theta f(\theta) + \frac{1}{2} \lambda \, \delta^T \, H_\theta \, \delta }_{\tilde{L}(\delta)}

Which gives:

\begin{aligned} \delta &= - \frac{1}{\lambda} H_\theta^{-1} \nabla_\theta f(\theta) \\ \theta' &= \theta - \underbrace{\frac{1}{\lambda}}_\alpha \underbrace{H_\theta^{-1} \nabla_\theta f(\theta)}_\text{natural gradient} \end{aligned}

Note: computation and inversion of the Hessian

We don't want to explicitly form \hat{H}(\theta) (which is huge for large deep network models), and we don't want to invert it using standard matrix inversion methods.

Because \hat{H}(\theta) is positive semi-definite, we can use the conjugate gradient algorithm to efficiently find an approximate solution. This algorithm only requires evaluating products \hat{H}(\theta) \, x (Hessian-vector product, HVP).

The Hessian-Vector product can be computed using the following equation (using some autodiff library):

(\nabla^2_\theta h(\theta)) \, x = \nabla_\theta\left[ (\nabla_\theta h(\theta))^T \, x \right]
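
A sketch of the conjugate gradient solver driven only by Hessian-vector products; hvp can be any function x ↦ \hat{H} x (with an autodiff library it would be implemented via the identity above). The explicit matrix at the end is just a toy check, not part of the original post:

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(x) = H x."""
    x = np.zeros_like(g)
    r = g.copy()                                     # residual g - H x (x = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

H = np.array([[2.0, 0.3], [0.3, 1.0]])               # toy positive-definite "Hessian"
g = np.array([1.0, -1.0])
print(conjugate_gradient(lambda v: H @ v, g), np.linalg.solve(H, g))   # should match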

Natural gradient descent and KL divergence

When trying to learn a probability distribution P_\theta (e.g. using maximum likelihood or maximum a posteriori estimation), we can limit the size of the steps by adding a constraint in the probability distribution space:

\theta \leftarrow \theta + \argmax_{\delta} f(\theta + \delta) \text{ subject to } D_\mathrm{KL}(P_\theta \| P_{\theta + \delta}) \le \epsilon

Note: constraint in the distribution space

In many cases, adding a constraint in the distribution space, D_\mathrm{KL}(P_\theta \| P_{\theta+\delta}) \le \epsilon, makes a lot more sense than a constraint in the parameter space, \|\delta\|_2^2 \le \epsilon.

This gives the natural gradient descent update:

\theta \leftarrow \theta + \alpha \, H_\theta^{-1} \, \nabla_\theta f(\theta)

with

H_\theta = \left[ \nabla^2_{\delta} D_\mathrm{KL}(P_\theta \| P_{\theta+\delta}) \right]_{\delta \coloneqq 0}

Appendix, notations

Notation Meaning
t \in \{0, \ldots, H\} Time
H Horizon (either finite or infinite)
s \in \mathbb{S} State value
a \in \mathbb{A} Action value
r \in \mathbb{R} Reward value
\tau = (s_0, a_0, r_0, \ldots) Trajectory value
\mathcal{D} Dataset/batch (of trajectories, etc.)
R(\tau) Reward-to-go for a given trajectory
S_t State (random variable)
A_t Action (random variable)
R_t Reward (random variable)
G_t Reward-to-go (random variable)
\Tau = (S_0, A_0, R_0, \ldots) Trajectory (random variable)
\gamma Discount factor, \in (0,1]
\alpha, \eta Learning rates, \in (0,1]
\lambda Parameter of TD(\lambda), \in [0,1)
\beta Soft-constraint penalty, Lagrange multiplier, step-size
a = \pi(s) Deterministic policy
a \sim \pi(s) Stochastic policy (distribution)
\pi(a|s) Stochastic policy (probability or density)
r \sim \mathcal{R}(s,a,s') Reward distribution
\mathcal{R}(r|s,a,s') Reward distribution (probability or density)
R(s,a) Expected reward
R(\tau) Reward-to-go of a trajectory
s' \sim \mathcal{T}(s,a) State transition distribution
\mathcal{T}(s'|s,a) State transition probability or density
V^\pi(s) Value function
Q^\pi(s,a) Action-value function
A^\pi(s,a) Advantage function
a = \pi^\star(s) Optimal policy
V^\star(s) Optimal value function
Q^\star(s,a) Optimal action-value function
A^\star(s,a) Optimal advantage function
V_\phi(s) Parameterized value function
Q_\phi(s,a) Parameterized action-value function
A_\phi(s,a) Parameterized advantage function
b(s) Baseline function
a \sim \pi_\theta(s), \pi_\theta(a|s) Parameterized policy (stochastic)
a = \pi_\theta(s) Parameterized policy (deterministic)
J(\pi), J(\theta) Utility of a policy
\frac{\partial}{\partial x_i} f(x) or \partial_{x_i} f(x) Partial derivative wrt x_i
\nabla_x f(x) Gradient of f wrt x, i.e. column vector of \partial_{x_i} f(x)
\nabla^T_x f(x) Transposed gradient of f wrt x, i.e. row vector of \partial_{x_j} f(x)
\nabla^2_x f(x) Hessian matrix of f wrt x , i.e. matrix of \partial_{x_i} \partial_{x_j} f(x)
X \sim \mathcal{X} Random variable X follows distribution \mathcal{X}
(X | Y=y) \sim \mathcal{X}(y) Random variable X follows distribution \mathcal{X} when Y=y
\mathbb{E}[X] Expectation
\mathbb{E}[X | Y=y] Conditional expectation
\mathrm{softmax}(x) Softmax function, i.e. vector of e^{x_i} / \sum_{i'} e^{x_{i'}}
\mathrm{Cat}(p) Categorical distribution with probabilities p = (p_1, \ldots, p_n)
\mathcal{N}(\mu, \Sigma) Multivariate normal distribution
D_\mathrm{KL}(P \| Q) Kullback–Leibler divergence
\mathcal{H}(P) Entropy
\mathcal{H}(P,Q) Cross-entropy
\bar\phi, \bar{\varphi} "Frozen" copies of \phi, \varphi (stop-gradient)
\phi Parameters for value functions V_\phi, Q_\phi, etc.
\theta Parameters for policy \pi_\theta
\varphi Parameters shared (partially) between value function and policy eg. \pi_\varphi and Q_\varphi
