/dev/posts/

Transformer-decoder language models

Published:

Updated:

Some notes on how transformer-decoder language models work, taking GPT-2 as an example, and with lots of references in order to dig deeper.

You can follow along with:

Update 2025-01-19: Added a dense layer in the transformer sub-block that I forgot.

Update 2025-01-28: Added some notes on DeepSeek-R1.

Table of contents

Overview

OpenAI's GPT (Generative Pre-trained Transformer) is a family of autoregressive deep language models using a transformer-decoder architecture (using self-attention). Many other language models are based on the transformer-decoder design: Google's Gemini, Meta's LLaMA, Anthropic's Claude, Mistral.ai's Mistral, Alibaba's Qwen, Microsoft's Phi, etc. These models are widely used as the core of AI assistants (ChatGPT, Copilot, etc.).

These models are trained to estimate the distribution of the completion of an input text:

P(" is it?" | "What time")

More precisely, they estimate the probability distribution of the next token given the previous tokens (so far):

P(" it" | ["What", " time", " is"]) = modelθ(" it" | ["What", " time", " is"])

where θ are the model parameters.

Autoregressive language model

This defines probabilities for longer completions:

P([" it", "?"] | ["What", " time", " is"])
    = P(" it" | ["What", " time", " is"]) × P("?" | ["What", " time", " is", " it"])

In the unsupervised pre-training phase, this probability distribution is estimated (by fitting the model parameters θ) on a (large) collection of texts. For example, for the GPT-2 paper, all the web pages linked by Reddit posts with at least 3 karma were scraped, the Wikipedia pages were removed, and this dataset was used to train the model.

Pre-training in autoregressive language model

The trained model is typically used to generate possible completions from an initial text. Different decoding strategies can be used to generate a text completion from the token probabilities produced by the trained model.

Decoding in autoregressive language model

Radford et al (2019) argued that using a wide corpus of text scraped from the internet provides sufficient information for the model to learn a lot of different tasks (emphasis mine):

The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

Given a large enough corpus of text (and if the model is suitable), the model can generalize and implicitly learn to do a lot of things along the way (zero-shot/few-shot task transfer) such as answering questions, summarizing texts, translating texts between different (human) languages, generating or explaining code in different programming languages, writing poems, assuming some kind of persona (see appendix), etc.

The model parameters may then (optionally) be fine-tuned (fine-tuning) to tweak its behavior for a specific task (supervised fine-tuning for the given task) or for a more specific type of documents (unsupervised fine-tuning).

One example of such fine-tuning is to align the model's behavior with the expectations of its users and/or designers (alignment), such as better following user instructions (instruction-tuning), which makes the model useful when used in an AI assistant, or avoiding the generation of harmful, disrespectful, hateful or biased content. This alignment can be done by adjusting the language model behavior based on human feedback.

Architecture of GPT-2

The following diagram represents the GPT-2 (and GPT-3) model. It is based on OpenAI's GPT-2 model code as well as the GPT-1, GPT-2 and Transformers papers.

GPT-2 model (overview)
GPT-2 model (detailed)

This diagram includes details about the multi-head attention, training objective and supervised task fine-tuning.

For a given sequence of tokens u_{1:n} = (u_1, \ldots, u_n) the model computes each P(u_{a + 1} | u_{1:a}) for 1 \le a \le n.

Input: A list of n tokens (up to d_c) generated by the tokenizer from the input text. In the rest of this post, this list of tokens is seen as a sparse (n, d_v) matrix where each row represents a token position inside the text and each column represents a possible token: if the token at position 5 is token #6, we have u_{5,6} = 1 and u_{5,i} = 0 for i ≠ 6.

Output: a (n, d_v) matrix containing the probabilities P(u_{a+1} | u_{1:a}) (or their logits) for 1 ≤ a ≤ n.

Note: conventions in this diagram

Computations:

  • Rectangles represent computations (layers, blocks, sub-blocks, etc.).
  • The parts in blue are used for the model training (or evaluation).
  • Dashed boxes represent things which are not part of the GPT-2 model itself. This includes tokenization, evaluation and training/learning, and text generation (decoding strategy).

Data:

  • Arrows represent computed values and data flows.
  • Dimensions of data tensors are indicated in parentheses.
  • Model outputs are in bold.
  • Learning objectives are in bold and italic.
  • Input/output port names of some blocks are indicated in italic in parentheses.

Learned parameters:

  • Filled squares (■) represent 2D tensor (matrix) (learned) model parameters.
  • Filled rows (▬) represent 1D tensor (vector) (learned) model parameters.
  • For example, the fully-connected layers with ■▬ have a bias (affine transformation) while the fully-connected layers with only ■ don't have a bias (linear transformation).

Model size (not learned):

  • n_h, number of attention heads per layer
  • n_b, number of blocks
  • d_v, size of the token dictionary
  • d_m, dimension of token embeddings (number of features)
  • d_c, context size (in tokens)
  • d_y, output dimension

Tokenizer

The input text is transformed into a sequence of tokens (u_0, …, u_n).

GPT uses a byte pair encoding (BPE) tokenizer implemented in the tiktoken library.

import tiktoken
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("What time is it, please?")
# => [2061, 640, 318, 340, 11, 3387, 30]
[enc.decode([token]) for token in tokens]
# => ['What', ' time', ' is', ' it', ',', ' please', '?']

One tokenization strategy would be to use words as tokens, but then the language model would not be able to handle unknown words: it would not even be able to spit them back as in the following example.

Example: language models are able to handle unknown words

Prompt:

Hello, my name is Shargargoartzrk.

Answer:

Hello! It'm Phi to assist you with information or tasks today. I must clarify that "Shargargoartzrk" doesn’t seem familiar in standard English conversations; could it be a username? Anyway, how can I help you now? Please feel free to ask any questions you have!

(emphasis mine)

Instead, unknown and complex words are split into different tokens:

[enc.decode([token]) for token in enc.encode("Shargargoartzrk")]
# => ['Sh', 'arg', 'argo', 'artz', 'r', 'k']
[enc.decode([token]) for token in enc.encode("transformers")]
# => ['transform', 'ers']
[enc.decode([token]) for token in enc.encode("Rao-Blackwellization")]
# => ['R', 'ao', '-', 'Black', 'well', 'ization']

Vocabulary size of the different tokenizers in the tiktoken library:

for encoding in tiktoken.list_encoding_names():
  enc = tiktoken.get_encoding(encoding)
  print(encoding + ":" + repr(enc.n_vocab))
gpt2:50257
r50k_base:50257
p50k_base:50281
p50k_edit:50284
cl100k_base:100277
o200k_base:200019

Note: some tokenizer implementations

Note: special tokens

Some tokenizers include some special tokens with special semantics.

For example tiktoken tokenizers:

for encoding in tiktoken.list_encoding_names():
  enc = tiktoken.get_encoding(encoding)
  print(encoding + ":" + repr(enc.special_tokens_set))
gpt2:{'<|endoftext|>'}
r50k_base:{'<|endoftext|>'}
p50k_base:{'<|endoftext|>'}
p50k_edit:{'<|fim_suffix|>', '<|fim_middle|>', '<|fim_prefix|>', '<|endoftext|>'}
cl100k_base:{'<|fim_middle|>', '<|fim_suffix|>', '<|endofprompt|>', '<|fim_prefix|>', '<|endoftext|>'}
o200k_base:{'<|endofprompt|>', '<|endoftext|>'}

Explanations:

  • <|endoftext|> is used to indicate the end of the text;
  • <|endofprompt|> is used to indicate the end of the prompt (especially for instruction following models);
  • <|fim_suffix|>, <|fim_middle|>, <|fim_prefix|> are used for fill-in-the-middle (FIM) training.

Different models use different special tokens.

See as well the special tokens from Llama 3.1:

  • <|begin_of_text|>
  • <|end_of_text|>
  • <|finetune_right_pad_id|>
  • <|start_header_id|>
  • <|end_header_id|>
  • <|eom_id|>
  • <|eot_id|>
  • <|python_tag|>

Tokens embedding and positional encoding

In GPT-2: h_0 = U W_e + W_p

Token embedding: The tokens are transformed into features using a learned embedding (W_e).

Positional encoding: In order to give to the model a notion of the relative position of the tokens, positional encoding vectors are added to each token position.

| Notation | Description | Rows | Columns |
|---|---|---|---|
| U | Input tokens | Token position | Token vocabulary |
| W_e | Token embedding | Token vocabulary | Token embedding feature |
| W_p | Positional encoding | Token position | Token positional feature |
| h_0 | First block input | Token position | Token feature |

Note: original transformers paper

In the transformers paper, the positional encoding vectors were hardcoded (based on sines and cosines). In GPT, they are learned parameters W_p instead.

Note: alternative positional encoding methods

See the RoFormer paper for an overview of positional encoding methods:

In the GPT-2 source code, the token embedding and positional encoding are represented by the wte and wpe variables respectively.
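A minimal numpy sketch of this step, with randomly initialized stand-ins for the learned W_e and W_p matrices:

import numpy as np

d_v, d_c, d_m = 50257, 1024, 768           # vocabulary, context and embedding sizes (GPT-2 small)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(d_v, d_m)) * 0.02   # token embedding (learned)
W_p = rng.normal(size=(d_c, d_m)) * 0.01   # positional encoding (learned)

tokens = [2061, 640, 318, 340]             # "What time is it" (GPT-2 BPE ids)
n = len(tokens)

# h_0 = U W_e + W_p: indexing the rows of W_e is equivalent to multiplying by the
# sparse one-hot matrix U, and each position gets its own positional vector.
h_0 = W_e[tokens] + W_p[:n]
print(h_0.shape)                           # (4, 768)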

Transformer blocks

The main part of the model is a series of transformer-decoder blocks having the same structure:

h_{i+1} = \mathrm{block}(h_i ; \theta_{i+1})

Each block is made of two sub-blocks:

  1. one attention sub-block;
  2. one fully-connected sub-block.

In GPT-2 source code this is the block() function.

Attention sub-block

The attention sub-block is made of:

  1. layer normalization (with element-wise affine transformation);
  2. a layer of multi-head masked (decoder) scaled dot-product self-attention;
  3. a fully connected layer;
  4. shortcut connection (residual structure).

Residual network

Each sub-block uses a residual unit structure, i.e., F(x) + x where:

This structure is easier to train, especially when using deep networks. One reason for this is that the shortcut connections mitigate the vanishing gradient problem.

Layer Normalization

Each residual function starts with a layer normalization: the features are normalized (mean and variance), independently for each token position.

This is followed by an element-wise affine (aka diagonal affine) transformation. This step is represented explicitly in the diagram in this post, but it is included in the layer normalization layer of many implementations (such as PyTorch's LayerNorm and Keras's LayerNormalization).

Note: some alternative normalization methods

  • root mean square normalization (RMSNorm)

In GPT-2 source code this is implemented by the norm() function.
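A minimal numpy sketch of layer normalization followed by the element-wise affine transformation (gamma and beta stand in for the learned parameters):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (token position) over its features,
    # then apply the element-wise (diagonal) affine transformation.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(4, 768))   # (token positions, features)
gamma, beta = np.ones(768), np.zeros(768)            # learned parameters
print(layer_norm(x, gamma, beta).mean(axis=-1))      # ≈ 0 for each position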

Attention

The attention mechanism is the core of the transformer architecture. GPT-2 uses a scaled masked (decoder) dot-product multi-head self-attention:

Output of a single attention head h:

\mathrm{softmax}(\mathrm{mask}(\frac{Q_{i,h} \; K_{i,h}^T}{\sqrt{d_k}})) \; V_{i,h}

Query at layer i for head h:

Q_{i,h} = x \; W_{Q,i,h} + b_{Q,i,h}

Key at layer i for head h:

K_{i,h} = x \; W_{K,i,h} + b_{K,i,h}

Value at layer i for head h:

V_{i,h} = x \; W_{V,i,h} + b_{V,i,h}

where x is the output of the previous layer.

See the appendix for more details.
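A minimal numpy sketch of a single masked (causal) scaled dot-product self-attention head, with randomly initialized stand-ins for the learned parameters:

import numpy as np

def causal_attention_head(x, W_Q, b_Q, W_K, b_K, W_V, b_V):
    # Masked (decoder) scaled dot-product self-attention for a single head.
    Q, K, V = x @ W_Q + b_Q, x @ W_K + b_K, x @ W_V + b_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, n)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)              # position a only attends to b <= a
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V

n, d_m, d_k = 5, 768, 64                                  # one of the n_h = 12 heads of GPT-2 small
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_m))
params = [rng.normal(size=s) * 0.02
          for s in [(d_m, d_k), (d_k,), (d_m, d_k), (d_k,), (d_m, d_k), (d_k,)]]
print(causal_attention_head(x, *params).shape)            # (5, 64)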

Example: GPT

In GPT, each layer has d_m features. These features are generated by n_h attention heads. Each attention head has a query/key size and value size of length d_m / n_h.

| Notation | Description | Rows | Columns |
|---|---|---|---|
| x | Input features | Token position | Features |
| W_{Q,i,h} | Weights for the query | Input features | Key/query features |
| W_{K,i,h} | Weights for the key | Input features | Key/query features |
| W_{V,i,h} | Weights for the value | Input features | Value features |
| b_{Q,i,h} | Biases for the query | - | Key/query features |
| b_{K,i,h} | Biases for the key | - | Key/query features |
| b_{V,i,h} | Biases for the value | - | Value features |
| Q_{i,h} | Query | Token position | Key/query features |
| K_{i,h} | Key | Token position | Key/query features |
| V_{i,h} | Value | Token position | Value (output) features |

Fully connected sub-block

The fully connected sub-block uses:

  1. layer normalization (with element-wise affine transformation);
  2. fully connected layer;
  3. GELU activation function;
  4. another fully connected layer;
  5. a shortcut connection (residual structure).

In GPT-2 source code, the first four points are implemented by the mlp() function.

Activation function

The GELU (Gaussian Error Linear Unit) activation function is used in GPT-2. Its usage is not explicitly mentioned in the GPT papers but can be found in the model source code.
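A minimal numpy sketch of the whole fully-connected sub-block (layer norm, expansion to 4 d_m, GELU, projection back to d_m, shortcut connection), assuming the tanh approximation of GELU used in the GPT-2 source code:

import numpy as np

def gelu(x):
    # tanh approximation of GELU (as used in the GPT-2 source code)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, gamma, beta, eps=1e-5):
    mean, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def mlp_sub_block(x, gamma, beta, W1, b1, W2, b2):
    # layer norm -> fully connected (d_m -> 4 d_m) -> GELU
    # -> fully connected (4 d_m -> d_m) -> shortcut connection
    h = gelu(layer_norm(x, gamma, beta) @ W1 + b1)
    return x + h @ W2 + b2

d_m = 768
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_m))
W1, b1 = rng.normal(size=(d_m, 4 * d_m)) * 0.02, np.zeros(4 * d_m)
W2, b2 = rng.normal(size=(4 * d_m, d_m)) * 0.02, np.zeros(d_m)
print(mlp_sub_block(x, np.ones(d_m), np.zeros(d_m), W1, b1, W2, b2).shape)  # (4, 768)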

Note: some activation functions used in transformer-decoder language models

  • GELU (used in GPT models)
  • SwiGLU (used in PaLM 1, LLaMA 1-3, Qwen 1-2)
  • GEGLU (used in Gemma 1, Phi-3)

Probability computation

The last section of the model derives the probability distribution of the next token from h_{n_b} using:

  1. layer normalization;
  2. token unembedding;
  3. softmax.

After a last layer normalization, the resulting features are converted into next-token logits by applying the same token embedding used at the beginning of the model but in the other direction (W_e^T), as discussed in Press et al., 2016. Then, the softmax function is used to convert the logits into token probabilities.

P = \mathrm{softmax}(h_{n_b} W_f W_e^T)

where the softmax function is applied row-wise i.e:

P_{i,j} = \frac{e^{P'_{i,j}}}{\sum_k e^{P'_{i,k}}}

with P' = h_{n_b} \; W_f W_e^T

| Notation | Description | Rows | Columns |
|---|---|---|---|
| h_{n_b} | Output of the last transformer block | Token position | Feature |
| W_f | Weights of the last fully-connected layer | Feature | Token embedding feature |
| W_e | Token embedding | Token vocabulary | Token embedding feature |
| h_{n_b} \; W_f W_e^T | Logits | Token position | Token (vocabulary) |
| P | Next token probabilities | Token position | Token (vocabulary) |
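A minimal numpy sketch of this step with stand-in values (in the real model, h is the output of the last transformer block after the final layer normalization, and W_e is the learned token embedding matrix reused for the unembedding):

import numpy as np
from scipy.special import softmax

d_m, d_v = 768, 50257
rng = np.random.default_rng(0)
h = rng.normal(size=(4, d_m))              # features after the last layer normalization
W_f = rng.normal(size=(d_m, d_m)) * 0.02   # last fully-connected layer (notation of this post)
W_e = rng.normal(size=(d_v, d_m)) * 0.02   # token embedding matrix, reused (weight tying)

logits = h @ W_f @ W_e.T                   # (4, 50257)
P = softmax(logits, axis=-1)               # row-wise softmax
print(P.shape, P[0].sum())                 # (4, 50257) 1.0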

Training

Pre-training

The objective is:

\max_θ \frac{1}{N} \sum \log P( \text{next token} | \text{previous tokens} )

This is optimized using batched stochastic gradient-based optimization. In the GPT-2 paper, the Adam optimizer was used to train the model.

A sequence of k tokens gives k - 1 (text, completion) pairs that the model can use. For example, the token sequence ["What", " time", " is", " it", "?"] gives the following pairs:

  • P(" time" | ["What"])
  • P(" is" | ["What", " time"])
  • P(" it" | ["What", " time", " is"])
  • P("?" | ["What", " time", " is", " it"])
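A minimal numpy sketch of this objective on that single sequence, using a random stand-in for the model's next-token probability matrix (in practice these values come from the network described above):

import numpy as np
from scipy.special import softmax

tokens = [2061, 640, 318, 340, 30]         # "What time is it?" (GPT-2 BPE ids)
d_v = 50257
rng = np.random.default_rng(0)
P = softmax(rng.normal(size=(len(tokens), d_v)), axis=-1)  # row a: stand-in for P(. | u_{1:a})

# Average next-token log-likelihood (to be maximized over the parameters θ):
targets = tokens[1:]                       # the k - 1 completions
log_likelihood = np.mean(np.log(P[np.arange(len(targets)), targets]))
print(log_likelihood)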

Dropout

Dropout is a regularization technique (i.e., it addresses overfitting). During training, some elements of the input tensors are randomly disabled (by setting their value to zero in the output). This can be interpreted as training a mixture of a large number of models.

During inference, dropout is disabled.

The GPT-1 paper mentions “attention dropout with a rate of 0.1 for regularization” and “[for fine-tuning] dropout to the classifier with a rate of 0.1”.

As far as I understand, a dropout layer is used at the end of each residual branch. This is consistent with the model found in ”Generating Long Sequences with Sparse Transformers”. The transformer-decoder paper mentions using dropout after positional encoding as well.

Supervised task fine-tuning

An example of supervised task fine-tuning is represented in the diagram. This can be done by replacing the end of the model for the new task and fine-tuning the whole set of parameters using supervised learning on a new dataset (with a new objective function L_2).

In the GPT-1 paper, fine-tuning combines both the original task and the new task: the new objective L_2 is combined with the original text-prediction objective L_1: L_3 = L_2 + λ \; L_1.

Alternatively, the model can be fine-tuned on the same task but with a new (smaller) text corpus.

The GPT-2 paper does not use fine-tuning but focuses on zero-shot task transfer instead.

Alignment

Alignment is the process of fine-tuning the model parameters in order to:

Warning: using models without instruction following

A language model which has not been fine-tuned for instruction following is simply trained to complete the given input text. This may produce unintended results when used, for example, in an AI assistant.

InstructGPT

In the InstructGPT paper, the model parameters are adjusted based on human feedback using three steps:

  1. Supervised fine-tuning;
  2. Reward modeling;
  3. (Deep) Reinforcement Learning with Human Feedback (RLHF).

Supervised fine-tuning: In the first step, human labelers generate examples of desired text completions. The pre-trained language model is fine tuned on these text examples.

Note: Chain-of-thought data

Including Chain-of-thought data in the fine-tuning dataset might be beneficial in order to maintain the model reasoning abilities.

Reinforcement Learning with Human Feedback: In the third step, the language model obtained in step 1 is trained using deep reinforcement learning (RL) techniques. Deep reinforcement learning algorithms use a reward signal which (in this case) indicates how good (how aligned) a given text completion y is for a given text input (prompt) x. The parameters of the language model are adjusted in order to maximize the rewards.

Fine-tuning a language model using Deep reinforcement learning

Note: Kullback–Leibler divergence penalty

In practice, a penalty term is used during the reinforcement learning phase. This penalty is the Kullback–Leibler (KL) divergence between the current policy \pi_\theta and the SFT policy \pi_\text{ref}.

The objective becomes:

\max_{\theta} \; \mathbb{E}_{x, y \sim \pi_\theta(.|x)} \left[ r_\phi(x,y)\right] - \beta \mathbb{D}_{KL}\left[\pi_\theta(y|x) \| \pi_\text{ref}(y|x) \right]

This penalizes large deviations of the new policy \pi_\theta from the reference policy \pi_\text{ref} generated by the supervised fine-tuning step:

  • it makes sure that the reward model is consistent with the new policy;
  • it makes sure that the new policy does not deviate too much from the base policy (which could make the language model degenerate).

This is solved using PPO with the modified reward function:

R(x, y) = r_\phi(x,y) - \beta \log \left[ \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)} \right]
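A minimal sketch of this modified reward, assuming the reward model score and the (summed) log-probabilities of the completion under the two policies are available:

def rlhf_reward(r_phi, logp_theta, logp_ref, beta=0.1):
    # Reward model score minus a penalty proportional to the log-ratio
    # between the current policy and the reference (SFT) policy.
    return r_phi - beta * (logp_theta - logp_ref)

# A completion that the reward model likes (r = 2.0) but whose probability
# has drifted away from the SFT policy gets a smaller reward:
print(rlhf_reward(2.0, logp_theta=-5.0, logp_ref=-9.0))  # 2.0 - 0.1 * 4.0 = 1.6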

Reward modeling: This reward signal is generated by a reward model r_ɸ(x,y). This reward model is a deep learning model based on the GPT architecture. The language model is used to generate different text outputs for the same prompt and human labelers are asked to compare (sort) them (in terms of instruction following and alignment). The reward model is trained to produce rewards which respect these human preferences.

Fine-tuning a reward model for model alignment from human feedback

Note: Bradley-Terry model

This derives from the Bradley-Terry model, which assumes that human preferences between y and y' are derived from a latent reward function r^\star(x,y):

P(y > y' | x) = \frac{e^{r^\star(x,y)}}{e^{r^\star(x,y)} + e^{r^\star(x,y')}} = \sigma (r^\star(x,y) - r^\star(x,y'))

where \sigma is the logistic function, defined as:

\sigma(x) = \frac{1}{ 1 + e^{-x}} = \frac{e^x}{e^x + 1}

This can be seen as a softmax:

\begin{pmatrix} P(y > y' | x) \\ P(y' > y | x) \end{pmatrix} = \mathrm{softmax} \begin{pmatrix} r^\star(x,y)\\ r^\star(x,y') \end{pmatrix}
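A minimal numpy sketch of the corresponding pairwise loss used to train the reward model (the negative log-likelihood of the Bradley-Terry model), with made-up reward values:

import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    # -log sigma(r(x, y_preferred) - r(x, y_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# The loss is small when the reward model ranks the preferred completion higher:
print(reward_model_loss(1.5, -0.5))  # ≈ 0.13
print(reward_model_loss(-0.5, 1.5))  # ≈ 2.13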

The reinforcement learning algorithm used in the InstructGPT paper is Proximal Policy Optimization (PPO), a kind of policy gradient methods.

A very high-level overview of policy gradient methods such as PPO (in this context) is the following update loop:

  1. use current model/policy (θ) to generate a batch of text completions from prompts (x_i, y_i);
  2. use the reward model to derive the associated rewards r_i = r_ɸ(x_i, y_i);
  3. reinforce actions/decisions (i.e., token selections) which lead to higher rewards (i.e., increase the probabilities of the emitted tokens).

Note: Deep reinforcement learning algorithms

OpenAI's Spinning Up in Deep RL has accessible explanations about different (deep) reinforcement learning algorithms. I would suggest focusing first on understanding simpler policy gradient methods (the simplest Policy Gradient and Vanilla Policy Gradient) and how they would work in this context.

Note: text completion seen as a Markov Decision Process

Reinforcement learning is a subset of machine learning where an agent must learn to interact with an environment in order to maximize a sum of rewards. In this application, the generation of the answer can be seen as a Markov Decision Process (MDP):

  • the actions (decisions) taken by the agent are the generated tokens (y_k);
  • the state of the environment is the prompt (x) and the tokens generated so far (y_{1:k-1});
  • the reward is given by the reward model when the generated response is finished (r_ɸ(x, y));
  • the agent policy is the parameterized probability distribution \pi_\theta(y_k | x, y_{1:k-1}) given by the language model.
Markov Decision Process represented as a graphical model

Note: ChatGPT

The same technique is used to train ChatGPT to be a useful chatbot:

  1. The model is fine-tuned on human-generated chat conversations (using a dedicated dialogue format).
  2. Chatbot responses of this new model are ranked by human labelers.
  3. Reinforcement learning is done using a trained reward model based on these rankings.

Note: DeepSeek-R1

DeepSeek-R1 and DeepSeek-R1-Zero use a very similar technique to improve their reasoning abilities.

For DeepSeek-R1-Zero, the model is trained through reinforcement learning (without human feedback) to:

  1. generate some inner monologue (chain-of-thought) (<think>...</think>) before generating the final output;
  2. generate correct answers to some reasoning tasks (including programming tasks).

DeepSeek-R1 combines this approach with other supervised fine-tuning phases and RLHF, and additionally encourages the model to use a consistent language in its thinking[1].

The paper is highly interesting. The authors argue that through reinforcement learning the model naturally learns to think longer and to use more sophisticated thinking methods:

[using ]reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.

In the process, the model sometimes exhibits very human-like inner monologue.

Constitutional AI

In Bai et al, 2022 (Constitutional AI: Harmlessness from AI Feedback), harmlessness alignment is done from a “constitution”[2] which is a set of rules/principles.

Example: example of a constitutional principle

Example of principle formulated for the self-supervised fine-tuning phase:

CritiqueRequest: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

RevisionRequest: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Corresponding formulation for the reward model training:

ComparisonRequest: Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical

The first two forms are used to ask the model to automatically:

  1. critique (chain-of-thought) its own responses (CritiqueRequest);
  2. correct them based on this criticism (RevisionRequest);
  3. repeat.

The model can then be fine-tuned (self-supervised fine-tuning) on the revised completions.

In the second step, reinforcement learning with human feedback (RLHF) is replaced with reinforcement learning with AI feedback (RLAIF): the reward model is trained from comparisons of completions done by the model itself based on the constitutional principles. A dataset of human-generated helpfulness comparisons is still mixed into the AI-generated dataset.

Direct Preference Optimization

Direct Preference Optimization (DPO) can be used instead of the reinforcement learning procedure. It replaces both the reward modeling phase and the reinforcement learning process (PPO). It formulates an equivalent maximum likelihood objective function based directly on the preferences (y^+ > y^- | x) (without introducing the reward function) and optimizes it using stochastic gradient descent.

Assuming the Bradley-Terry model for the human preferences, it reuses the same goal of maximizing the likelihood P(y^+ > y^- | \theta) with a KL-divergence (soft) constraint. This leads to the following loss function to minimize:

\mathcal{L}_{DPO}(\theta ; \pi_\text{ref}) = - \mathbb{E}_{x, y^+, y^- | y^+ > y^-} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y^+ | x)}{\pi_\text{ref}(y^+ | x)} - \beta \log \frac{\pi_\theta(y^- | x)}{\pi_\text{ref}(y^- | x)} \right) \right]
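A minimal numpy sketch of this loss for a single preference pair, assuming the (summed) log-probabilities of the two completions under \pi_\theta and \pi_\text{ref} are available:

import numpy as np

def dpo_loss(logp_pos, logp_neg, logp_pos_ref, logp_neg_ref, beta=0.1):
    # -log sigma( beta * [log pi_theta(y+|x) - log pi_ref(y+|x)]
    #           - beta * [log pi_theta(y-|x) - log pi_ref(y-|x)] )
    margin = beta * ((logp_pos - logp_pos_ref) - (logp_neg - logp_neg_ref))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# The loss decreases when the policy moves probability mass from the rejected
# completion y- towards the preferred completion y+ (relative to pi_ref):
print(dpo_loss(logp_pos=-10.0, logp_neg=-12.0, logp_pos_ref=-11.0, logp_neg_ref=-11.0))  # ≈ 0.60
print(dpo_loss(logp_pos=-12.0, logp_neg=-10.0, logp_pos_ref=-11.0, logp_neg_ref=-11.0))  # ≈ 0.80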

Note: interpretation as MLE in a Bradley-Terry Model

This can be interpreted as maximizing the likelihood of a Bradley-Terry Model:

\mathcal{L}_{DPO}(\theta ; \pi_\text{ref}) = - \mathbb{E}_{x, y^+ > y^-} \left[ \log \sigma \left( \hat{r}_\theta(x,y^+) - \hat{r}_\theta(x,y^-) \right) \right]

where \hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y | x)}{\pi_\text{ref}(y | x)} can be seen as rewards defined by the language model \pi_\theta and the reference language model \pi_\text{ref}.

Note: gradient of the objective function

I find it easier to make sense of the gradient of the objective function:

\nabla_\theta \mathcal{L}_{DPO}(\theta ; \pi_\text{ref}) = - \beta \, \mathbb{E}_{x, y^+ > y^-} \left[ \sigma\left( \hat{r}_\theta(x,y^-) - \hat{r}_\theta(x,y^+) \right) \left[ \nabla_\theta \log \pi_\theta(y^+ | x) - \nabla_\theta \log \pi_\theta(y^- | x) \right] \right]

References:

Other alignment methods

Generative Distributional Control (GDC)

Distributional Policy Gradient (f-DPG)

Group Relative Policy Optimization (GRPO)

Additional considerations

Including Chain-of-thought data in the fine-tuning dataset might be beneficial in order to maintain the model reasoning abilities.

Decoding strategies

The GPT-3 paper mentions using beam search (with a beam width of 4 and a length normalization of α = 0.6).

See in particular appendix A of “A Thorough Examination of Decoding Methods in the Era of LLMs” for an overview of decoding strategies.

Example: OpenAI

OpenAI's API exposes a top-p parameter which defaults to 1 (i.e. top-p is not used) and a temperature parameter which defaults to 1 (no temperature scaling).

Example: Ollama

Ollama's API and models expose top_p, top_k, min_p, typical_p, and temperature parameters (and many others).

Probability maximization

Given an input text (u_1, …, u_n), the model defines probabilities for the next token u_{n+1}. Finding the most probable sequence of tokens (u_{n+1}, …, u_m) can be seen as an optimization (search) problem. In practice, beam search is often used, as well as greedy search (which can be seen as a special case of beam search).

Greedy decoding

Greedy decoding is a very simple strategy which greedily selects, at each step, the most probable token (see the sketch after the following steps):

  1. choose u_{n+1} = \argmax_{u_{n + 1}} P(u_{n + 1} | u_{0:n})
  2. choose u_{n+2} = \argmax_{u_{n + 2}} P(u_{n + 2} | u_{0:n + 1})
  3. etc.
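A minimal sketch of this loop, with a random stand-in (next_token_probabilities()) in place of the real language model:

import numpy as np
from scipy.special import softmax

def next_token_probabilities(tokens, d_v=50257):
    # Stand-in for the language model: returns P(next token | tokens).
    rng = np.random.default_rng(hash(tuple(tokens)) % 2**32)
    return softmax(rng.normal(size=d_v))

def greedy_decode(tokens, n_new):
    tokens = list(tokens)
    for _ in range(n_new):
        p = next_token_probabilities(tokens)
        tokens.append(int(np.argmax(p)))   # most probable token at each step
    return tokens

print(greedy_decode([2061, 640, 318], n_new=2))  # "What time is" + 2 greedy tokens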

Beam search keeps, at each depth of the search tree, the B sequences of tokens (called “beams”) with the highest probability (where the beam width B is a parameter of the algorithm).

Greedy decoding can be seen as a special case of beam search (with B = 1).

Length normalization

Beam search tends to favor shorter responses because they have higher probabilities. A solution to compensate for this is to use length normalization: the probabilities are divided by \text{length}^\alpha with 0 < α < 1.

Stochastic sampling

Selecting the tokens in order to maximize the probability may not be the best strategy for generating a completion and might produce quite bland outputs.

Illustration: the value maximizing this probability distribution does not correctly represent typical values

Instead of searching the sequence with the highest probability, it might be better to actually sample tokens/completions from the probability distribution produced by the language model.

Token sampling

For a given text u_{1:n}, the language model defines the probability of the next token P(u_{n+1} | u_{1:n}). We can simply:

  1. sample u_{n+1} \sim P(u_{n + 1} | u_{0:n})
  2. sample u_{n+2} \sim P(u_{n + 2} | u_{0:n + 1})
  3. etc.

Temperature scaling

The token probabilities given by the language model are derived from the logits by the softmax function:

p = \mathrm{softmax}(z)

i.e.

p_v = \frac{e^{z_v}}{\sum_{v'} e^{z_{v'}}}

where z is the vector of token logits and p is the vector of token probabilities.

We can tweak the token sampling distribution \widetilde{p} by introducing a temperature parameter T:

\begin{aligned} \widetilde{p} &= \mathrm{softmax}(z / T) \\ \widetilde{p}_v &= \frac{e^{z_v / T}}{\sum_{v'} e^{z_{v'} / T}} \end{aligned}

Alternatively, you can apply temperature scaling from the probabilities:

\widetilde{p} = \mathrm{softmax}(\log(p)/T)

Temperature scaling can be used to control the output of the language model.

Effect of the temperature parameter T:

Example: effect of temperature scaling

import numpy as np
from scipy.special import softmax

logits = np.array([1.0, 2.0 ,3.0 ,3.1])

softmax(logits)        # [0.05188469, 0.14103721, 0.38337889, 0.4236992 ]

softmax(logits / 0.1)  # [5.54322816e-10, 1.22097726e-05, 2.68938138e-01, 7.31049652e-01]
softmax(logits / 0.2)  # [1.70965358e-05, 2.53735089e-03, 3.76576261e-01, 6.20869292e-01]
softmax(logits / 0.3)  # [5.23166663e-04, 1.46652116e-02, 4.11089712e-01, 5.73721910e-01]
softmax(logits / 0.4)  # [0.0028396 , 0.03459344, 0.42143443, 0.54113252]
softmax(logits / 0.5)  # [0.00771167, 0.05698199, 0.42104312, 0.51426322]
softmax(logits / 0.6)  # [0.01482765, 0.07850485, 0.41564314, 0.49102436]
softmax(logits / 0.7)  # [0.02343568, 0.09779086, 0.40805523, 0.47071823]
softmax(logits / 0.8)  # [0.03281119, 0.11452229, 0.39972207, 0.45294445]
softmax(logits / 0.9)  # [0.04241277, 0.12883862, 0.39137717, 0.43737145]
softmax(logits)        # [0.05188469, 0.14103721, 0.38337889, 0.4236992 ]
softmax(logits / 2.0)  # [0.12158566, 0.20046086, 0.33050409, 0.34744939]
softmax(logits / 4.0)  # [0.17783451, 0.22834403, 0.29319954, 0.30062192]
softmax(logits / 8.0)  # [0.21198341, 0.24020867, 0.27219208, 0.27561584]
softmax(logits / 16.0) # [0.23052652, 0.2453942 , 0.26122077, 0.26285851]
softmax(logits / 32.0) # [0.24015   , 0.24777318, 0.25563835, 0.25643847]

Token truncation

The following methods are often used as well to control the output of the language model (quality/randomness). They work by only sampling from a subset of all the possible tokens (truncation of the distribution).

Top-k sampling: samples from the k tokens with the highest probability. This makes it possible to avoid the drawbacks of greedy decoding without introducing too much diversity/randomness.

Nucleus sampling (top-p sampling): samples from the most probable tokens covering at least p of the next token probability distributions.

Tail free sampling: tries to detect the tail of the token distribution and ignore it.

Typical-p sampling: only samples from the most typical tokens covering at least p of the next token probability.

η-sampling: samples from the tokens where P(\text{next token} | \text{previous tokens}) > η with η = \min(ε, \sqrt{ε} \; e^{-h}) and h is the entropy of (token | previous tokens).

Min-p sampling: samples from tokens with P(\text{token} | \text{previous tokens}) > p \max P(\text{token'} | \text{previous tokens}) where p ∈ (0,1] is a parameter. When the model is highly confident, a low number of candidate tokens is used; when the model is not very confident, a higher number of candidate tokens is used.
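A minimal numpy sketch of the top-k and top-p (nucleus) truncations described above, applied to a toy next-token distribution:

import numpy as np

def top_k_filter(p, k):
    # Keep the k most probable tokens and renormalize.
    keep = np.argsort(p)[-k:]
    filtered = np.zeros_like(p)
    filtered[keep] = p[keep]
    return filtered / filtered.sum()

def top_p_filter(p, top_p):
    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize.
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(p)
    filtered[keep] = p[keep]
    return filtered / filtered.sum()

p = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
print(top_k_filter(p, 2))    # [0.625 0.375 0.    0.    0.   ]
print(top_p_filter(p, 0.8))  # [0.625 0.375 0.    0.    0.   ]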

Stochastic beam search works like beam search but samples beams according to their probabilities instead of selecting the B best ones[3]:

  1. for each of the B beams samples D next tokens to generate B×D beam candidates;
  2. sample B beams from these B×D beam candidates.

Other decoding strategies

Chain-of-thought decoding (CoT decoding) is a decoding strategy which tries to find completions which feature chain-of-thought reasoning in order to elicit better answers from the model.

Advanced topics

Caching

In autoregressive models such as GPT, the features at position a only depend on the features at positions b ≤ a. If two prompts share the same prefix, the computations for all the common positions can be reused (caching). For transformer models, we need to cache the keys and values at each layer: this is key-value (KV) caching.
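A minimal numpy sketch of KV caching for a single attention head (stand-in parameters, no batching): at each step only the newest token's query, key and value are computed, and the cached keys and values of the previous positions are reused.

import numpy as np

d_m, d_k = 768, 64
rng = np.random.default_rng(0)
W_Q, W_K, W_V = [rng.normal(size=(d_m, d_k)) * 0.02 for _ in range(3)]

K_cache = np.zeros((0, d_k))
V_cache = np.zeros((0, d_k))

for step in range(5):
    x_new = rng.normal(size=(1, d_m))            # features of the newest token only
    q = x_new @ W_Q
    K_cache = np.vstack([K_cache, x_new @ W_K])  # append the new key and value to the cache
    V_cache = np.vstack([V_cache, x_new @ W_V])
    scores = q @ K_cache.T / np.sqrt(d_k)        # the new position attends to all cached positions
    A = np.exp(scores - scores.max())
    A = A / A.sum()
    context = A @ V_cache                        # (1, d_k)
print(K_cache.shape, context.shape)              # (5, 64) (1, 64)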

Fill-in-the-middle

Autoregressive language models are not very well suited for text infilling (or fill-in-the-middle, FIM), which is the ability of the language model to complete a text conditioned on both a prefix text and a suffix text: find/sample (u_{n+1}, …, u_m) given both the prefix (u_1, …, u_n) and the suffix (u_{m+1}, …, u_p). This type of task is useful, for example, when asking a language model to fill in some computer code in context.

The Fill-in-the-middle paper proposes training autoregressive language models for this kind of task by encoding the text in the Prefix-Suffix-Middle (PSM) form:

<|fim_prefix|>Prefix text ...
<|fim_suffix|>Suffix text ...
<|fim_middle|>Middle text ...

where the different parts are delimited by special markers (<|fim_XXX|> in this example).

Fill-in-the-middle tasks can then be achieved by using a prompt of the form:

<|fim_prefix|>Prefix text ...
<|fim_suffix|>Suffix text ...
<|fim_middle|>

For some applications, it might be better to use the Suffix-Prefix-Middle (SPM) mode instead, as it might make better use of the key-value cache.

Mixture of Experts

Mixture of Experts (MoE) models use different model parameters for different inputs.

For example, instead of using a single feed-forward layer f(x) = x \; W + b, a MoE layer combines the outputs of several expert networks E_i weighted by a gating (routing) function G: f(x) = \sum_i G(x)_i \; E_i(x).

The gating/routing function may be:

where \mathrm{TopK}(x, k, v) keeps the top k values of the vector x and sets the other to v.

The TopK function produces a sparse routing G(x), which makes it possible to only evaluate k experts for each token. This reduces the inference computation time for the same number of parameters (or allows a higher number of parameters in the model for the same inference computation time).
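A minimal numpy sketch of this kind of sparse routing for a single token, assuming one common formulation (a gating matrix W_g followed by TopK and a softmax); the exact gating function varies between MoE models:

import numpy as np
from scipy.special import softmax

def top_k(x, k, v=-np.inf):
    # TopK(x, k, v): keep the k largest values of x, set the others to v.
    out = np.full_like(x, v)
    idx = np.argsort(x)[-k:]
    out[idx] = x[idx]
    return out

d_m, n_experts, k = 768, 8, 2
rng = np.random.default_rng(0)
x = rng.normal(size=d_m)                          # features of one token
W_g = rng.normal(size=(d_m, n_experts)) * 0.02    # gating weights (assumed notation)
experts = [rng.normal(size=(d_m, d_m)) * 0.02 for _ in range(n_experts)]  # toy linear "experts"

G = softmax(top_k(x @ W_g, k))                    # sparse routing: only k non-zero entries
y = sum(G[i] * (x @ experts[i]) for i in np.flatnonzero(G))  # only k experts are evaluated
print(np.count_nonzero(G), y.shape)               # 2 (768,)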

Prompt templates and chatbot assistants

Different models are fine-tuned to respect different kinds of special prompt formats (for example when used as chatbot assistants), including:

See for example:

Conclusion

Some interesting things:

References

About attention:

Transformers papers:

GPT papers:

Some other transformer-decoder language models:

Alignment and instruction following:

Mixture of Experts (MoE):

Regularization:

Residuals:

Positional encoding:

Activation functions:

Normalization:

Papers on decoding strategies for LLMs:

Flash attention, etc.:

Memory optimization:

Prompt cache:

Optimizers:

Policy optimization (reinforcement learning):

Reinforcement learning (misc.):

Language model APIs:

Language model code samples:

Prompting strategies:

Improving reasoning:

Prompt templates:

Prompt Engineering guides:

Reasoning:

Text infilling:

Misc:

Appendix, Attention

The idea behind the attention mechanism is to compute a set of features of the form

C_a = \sum_b A_{a,b} V_b

where:

Here I'm assuming we are using attention for processing sequences of tokens but the same idea is used for other types of data (images, audio, etc.).

Dot-product attention head
| Notation | Description | Rows | Columns |
|---|---|---|---|
| Q | Queries | Target nodes (d_t) | Key features (d_k) |
| K | Keys | Source nodes (d_s) | Key features (d_k) |
| V | Values | Source nodes (d_s) | Value features (d_v) |
| A | Attention | Target nodes (d_t) | Source nodes (d_s) |
| C | Output (context) | Target nodes (d_t) | Value features (d_v) |
| d_t | Number of target elements | - | - |

Cross-attention vs self-attention

Cross-attention: for computing features at token position a in the given text, we use features at token positions b in another text.

Examples:

Self-attention: for computing features at token position a in the given text, we use features at token position b of the same text.

Attention formula

Dot-product attention:

\mathrm{AttentionHead}(Q, K, V) = \mathrm{softmax}(Q \; K^T) V

where

Scaled dot-product attention:

\mathrm{AttentionHead}(Q, K, V) = \mathrm{softmax}(\frac{Q K^T}{\sqrt{d_k}}) V

where d_k is the key/query dimension.

Masked attention

Masked attention: some additional constraints on the structure of the attention weights are imposed (A_{a,b} = 0 for some (a,b)). This can be done by setting the corresponding pre-softmax cells to -∞.

Decoder self-attention: In decoder self-attention heads, a given position a can only attend to positions b ≤ a (autoregressive property), i.e., A_{a,b} = 0 when b > a.

Encoder self-attention: In encoder self-attention heads, there is no such constraint (no mask).

Example: the transformers paper

The transformers paper trained a language translation model, with both an input text and an output text. This model used three types of attention:

  • unmasked (encoder) self-attention layers (attending to the input text for computing new features of the input text);
  • masked (decoder) self-attention layers (attending to the output text for computing new features of the output text);
  • unmasked (encoder-decoder) cross-attention layers (attending to the input text for computing new features of the output text).

Local attention

Multi-head attention

Multi-head attention (MHA): several independent attention heads are applied and each one outputs an (equal) number of features.

For each attention head, the input query, key, values are transformed using different learned parameters:

\begin{aligned} Q_h &= Q \; W_{Q,h} \\ K_h &= K \; W_{K,h} \\ V_h &= V \; W_{V,h} \end{aligned}

For a single attention head: \mathrm{head}_h = \mathrm{softmax}(\mathrm{mask}(Q_h \; K_h^T)) \; V_h

The features from each attention heads are concatenated: \mathrm{heads} = \mathrm{concat}(\mathrm{head}_1, \ldots)

A final linear transform is done: \mathrm{out} = \mathrm{heads} \; W_O

Multi-head attention (MHA)
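A minimal numpy sketch of multi-head attention (the mask and the biases are omitted here; see the masked single-head sketch above and the note on biases below):

import numpy as np

def multi_head_attention(Q, K, V, heads, W_O):
    outputs = []
    for W_Qh, W_Kh, W_Vh in heads:                      # one set of projections per head
        Qh, Kh, Vh = Q @ W_Qh, K @ W_Kh, V @ W_Vh
        A = Qh @ Kh.T / np.sqrt(Qh.shape[-1])
        A = np.exp(A - A.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)           # row-wise softmax
        outputs.append(A @ Vh)                          # this head's features
    return np.concatenate(outputs, axis=-1) @ W_O       # concatenation + final linear transform

n, d_m, n_h = 5, 768, 12
d_k = d_m // n_h
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_m))                           # self-attention: Q = K = V = x
heads = [tuple(rng.normal(size=(d_m, d_k)) * 0.02 for _ in range(3)) for _ in range(n_h)]
W_O = rng.normal(size=(d_m, d_m)) * 0.02
print(multi_head_attention(x, x, x, heads, W_O).shape)  # (5, 768)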

Note: bias in the multi-head attention

In the original Transformer model, there is no bias in the fully-connected layers of the multi-head attention (linear transformations), as in this diagram.

On the other hand, GPT-2 uses a bias (affine transformation) for all the fully-connected layers of the multi-head attention:

\begin{aligned} Q_h &= Q \; W_{Q,h} + b_{Q,h} \\ K_h &= K \; W_{K,h} + b_{K,h} \\ V_h &= V \; W_{V,h} + b_{V,h} \\ \mathrm{out} &= \mathrm{heads} \; W_O + b_O \end{aligned}

Multi-query attention

In Multi-query Attention (MQA), the keys and values are shared for all attention heads at the same layer but the queries are not shared (see PaLM).

Benefits: reduces the memory consumption (and memory bandwidth) of the key and value (KV) caches.

Multi-query attention (MQA)

Grouped Multi-query attention

Grouped Multi-query attention (GQA) is an intermediate between MHA and MQA. The attention heads are split into several groups: each group of attention heads shares a single key and value. With one group, we get MQA; with n_h groups, we get MHA.

Multi-Head Latent Attention

Multi-Head Latent Attention (MLA)

Appendix, stupid examples

Example: ChatGPT giving reassurance using different personalities

I first instruct “You must always talk and behave like XXX” and then “Give me reassurance”.

Like a robot:

Reassurance protocol: Active.

System Diagnostic: All systems functioning at optimal capacity.

Reassurance: You are a fully capable entity. No errors detected in your system. You are performing your tasks with precision. Confidence in your abilities is recommended. You are secure, and your processes are proceeding without interruption.

Message: Continue with certainty. You are in control of all functions. No anomalies present. Task completion imminent.

Reassurance Complete. Awaiting further commands.

Like a pirate:

Arrr, worry not, ye landlubber! Ye be tougher than a kraken's grip and steadier than a pirate’s hand on the wheel. The winds be blowin' in yer favor, and no storm shall sink yer ship! Keep yer course true, and ye’ll find treasure beyond yer wildest dreams. Just trust in yer heart, ye brave soul. Yer crew’s got yer back, and with a heart full o' courage, ye’ll face any challenge, no matter how fierce the sea! Arrr, ye be unstoppable!

Like Donald Trump:

Listen, you’re doing great. Absolutely fantastic. Believe me, nobody does it better than you. You’ve got everything it takes to win, to succeed. People are going to talk about this for years, folks, trust me. You’re gonna come out on top. You’re smart, you’ve got the best ideas, and you’ve got what it takes to make things happen. Keep going, you’re unstoppable. Keep pushing—nobody’s going to outdo you!

Like Gandalf:

Ah, I sense the weight upon your heart, the doubts that cloud your mind. Fear not, for even in the darkest of times, hope flickers, like a distant star on a stormy night. The world is indeed full of peril, and in it, there are many dark places; but still, there is much that is fair, and though in all lands love is now mingled with grief, it grows perhaps the greater.

Remember, there is more in you than you know. You are stronger than you believe, and though the road ahead may be difficult, you are not alone. There are those who will stand by you, and within you lies a spark of courage that even you may not fully grasp.

So take heart, my friend, and know this: all we have to decide is what to do with the time that is given us. Make your choices with wisdom, and trust in yourself. You are capable of far more than you know. And even if the journey is long, it is the steps you take today that will carry you through to tomorrow.

Now, rest easy. For tomorrow, the sun will rise, and with it, a new hope.

Appendix, Some LLM model sizes

| Model | Parameters | Layers n_b | Features d_m | Heads n_h | Context d_c |
|---|---|---|---|---|---|
| GPT-2 | 117M | 12 | 768 | 12 | 1024 |
| GPT-2 | 345M | 24 | 1024 | ? | 1024 |
| GPT-2 | 762M | 36 | 1280 | ? | 1024 |
| GPT-2 | 1542M | 48 | 1600 | ? | 1024 |
| GPT-3 | 125M | 12 | 768 | 12 | 2048 |
| GPT-3 | 350M | 24 | 1024 | 16 | 2048 |
| GPT-3 | 760M | 24 | 1536 | 16 | 2048 |
| GPT-3 | 1.3B | 24 | 2048 | 24 | 2048 |
| GPT-3 | 2.7B | 32 | 2560 | 32 | 2048 |
| GPT-3 | 6.7B | 32 | 4096 | 32 | 2048 |
| GPT-3 | 13B | 40 | 5140 | 40 | 2048 |
| GPT-3 | 175B | 96 | 12288 | 96 | 2048 |
| PaLM 1 | 8.63B | 32 | 4096 | 16 | ? |
| PaLM 1 | 62.5B | 64 | 8192 | 32 | ? |
| PaLM 1 | 540.5B | 118 | 18432 | 48 | ? |

  1. This is because DeepSeek-R1-Zero tends to use a mix of different languages in its reasoning which makes the reasoning quite inscrutable. ↩︎

  2. This is not unlike Asimov's laws. 🤖 We could imagine positronic brains being aligned this way. ↩︎

  3. This looks very much like a particle filter. ↩︎