/dev/posts/

Transformer-decoder language models

Published:

Updated:

Some notes on how transformer-decoder language models work, taking GPT-2 as an example, and with lots of references in order to dig deeper.

You can follow along with:

Update 2025-01-19: Added a dense layer in the transformer sub-block that I forgot.

Update 2025-01-28: Added some notes on DeepSeek-R1.

Table of contents

Overview

OpenAI's GPT (Generative Pre-trained Transformer) is a family of autoregressive deep language models using a transformer-decoder architecture (using self-attention). Many other language models are based on the transformer-decoder design: Google's Gemini, Meta's LLaMA, Anthropic's Claude, Mistral.ai's Mistral, Alibaba's Qwen, Microsoft's Phi, etc. These models are widely used as the core of AI assistants (ChatGPT, Copilot, etc.).

These models are trained to estimate the distribution of the completion of an input text:

P(" is it?" | "What time")

More precisely, they estimate the probability distribution of the next token given the previous tokens (so far):

P(" it" | ["What", " time", " is"]) = modelθ(" it" | ["What", " time", " is"])

where θ are the model parameters.

Autoregressive language model

This defines probabilities for longer completions:

P([" it", "?"] | ["What", " time", " is"])
    = P(" it" | ["What", " time", " is"]) × P("?" | ["What", " time", " is", " it"])

In the unsupervised pre-training phase, this probability distribution is estimated (by fitting the model parameters θ) on a (large) collection of texts. For example, for the GPT-2 paper, all the web pages linked by Reddit posts with at least 3 karma were scraped, the Wikipedia pages were removed, and this dataset was used to train the model.

Pre-training in autoregressive language model

The trained model is typically used to generate possible completions from an initial text. Different decoding strategies can be used to generate a text completion from the token probabilities produced by the trained model.

Decoding in autoregressive language model

Radford et al (2019) argued that using a wide corpus of text scraped from the internet provides sufficient information for the model to learn a lot of different tasks (emphasis mine):

The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

Given a large enough corpus of text (and if the model is suitable), the model can generalize and implicitly learn to do a lot of things along the way (zero-shot/few-shot task transfer) such as answering questions, summarizing texts, translating texts between different (human) languages, generating or explaining code in different programming languages, writing poems, assuming some kind of persona (see appendix), etc.

The model parameters may then (optionally) be fine-tuned (fine-tuning) to tweak its behavior for a specific task (supervised fine-tuning for the given task) or for a more specific type of documents (unsupervised fine-tuning).

One example of such fine-tuning is to align the model's behavior with the expectations of its users and/or designers (alignment), such as better following user instructions (instruction-tuning), which makes the model useful when used in an AI assistant, or avoiding the generation of harmful, disrespectful, hateful or biased content. This alignment can be done by adjusting the language model behavior based on human feedback.

Architecture of GPT-2

The following diagram represents the GPT-2 (and GPT-3) model. It is based on OpenAI's GPT-2 model code as well as the GPT-1, GPT-2 and Transformers papers.

GPT-2 model (overview)
GPT-2 model (detailed)

This diagram includes details about the multi-head attention, training objective and supervised task fine-tuning.

For a given sequence of tokens u_{1:n} = (u_1, \ldots, u_n) the model computes each P(u_{a + 1} | u_{1:a}) for 1 \le a \le n.

Input: A list of n tokens (up to d_c) generated by the tokenizer from the input text. In the rest of this post, this list of tokens is seen as a sparse (n, d_v) matrix where each row represents a token position inside the text and each column represents a possible token: if the token at position 5 is token #6, we have u_{5,6} = 1 and u_{5,i} = 0 for i ≠ 6.

Output: a (n, d_v) matrix containing the probabilities P(u_{a+1} | u_{1:a}) (or their logits) for 1 ≤ a ≤ n.

Note: conventions in this diagram

Computations:

  • Rectangles represent computations (layers, blocks, sub-blocks, etc.).
  • The parts in blue are used for the model training (or evaluation).
  • Dashed boxes represent things which are not part of the GPT-2 model itself. This includes tokenization, evaluation and training/learning, and text generation (decoding strategy).

Data:

  • Arrows represent computed values and data flows.
  • Dimensions of data tensors are indicated in parentheses.
  • Model outputs are in bold.
  • Learning objectives are in bold and italic.
  • Input/output port names of some blocks are indicated in italic in parentheses.

Learned parameters:

  • Filled squares (■) represent 2D tensor (matrix) (learned) model parameters.
  • Filled rows (▬) represent 1D tensor (vector) (learned) model parameters.
  • For example, the fully-connected layers with ■▬ have a bias (affine transformation) while the fully-connected layers with only ■ don't have a bias (linear transformation).

Model size (not learned):

  • n_h, number of attention heads per layer
  • n_b, number of blocks
  • d_v, size of the token dictionary
  • d_m, dimension of token embeddings (number of features)
  • d_c, context size (in tokens)
  • d_y, output dimension

Tokenizer

The input text is transformed into a sequence of tokens (u_0, …, u_n).

GPT uses a byte pair encoding (BPE) tokenizer implemented in the tiktoken library.

import tiktoken
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("What time is it, please?")
# => [2061, 640, 318, 340, 11, 3387, 30]
[enc.decode([token]) for token in tokens]
# => ['What', ' time', ' is', ' it', ',', ' please', '?']

One tokenization strategy would be to use words as tokens, but then the language model would not be able to handle unknown words: it would not even be able to spit them back as in the following example.

Example: language models are able to handle unknown words

Prompt:

Hello, my name is Shargargoartzrk.

Answer:

Hello! It'm Phi to assist you with information or tasks today. I must clarify that "Shargargoartzrk" doesn’t seem familiar in standard English conversations; could it be a username? Anyway, how can I help you now? Please feel free to ask any questions you have!

(emphasis mine)

Instead, unknown and complex words are split into different tokens:

[enc.decode([token]) for token in enc.encode("Shargargoartzrk")]
# => ['Sh', 'arg', 'argo', 'artz', 'r', 'k']
[enc.decode([token]) for token in enc.encode("transformers")]
# => ['transform', 'ers']
[enc.decode([token]) for token in enc.encode("Rao-Blackwellization")]
# => ['R', 'ao', '-', 'Black', 'well', 'ization']

Vocabulary size of the different tokenizers in the tiktoken library:

for encoding in tiktoken.list_encoding_names():
  enc = tiktoken.get_encoding(encoding)
  print(encoding + ":" + repr(enc.n_vocab))
gpt2:50257
r50k_base:50257
p50k_base:50281
p50k_edit:50284
cl100k_base:100277
o200k_base:200019

Note: some tokenizer implementations

Note: special tokens

Some tokenizers include some special tokens with special semantics.

For example tiktoken tokenizers:

for encoding in tiktoken.list_encoding_names():
  enc = tiktoken.get_encoding(encoding)
  print(encoding + ":" + repr(enc.special_tokens_set))
gpt2:{'<|endoftext|>'}
r50k_base:{'<|endoftext|>'}
p50k_base:{'<|endoftext|>'}
p50k_edit:{'<|fim_suffix|>', '<|fim_middle|>', '<|fim_prefix|>', '<|endoftext|>'}
cl100k_base:{'<|fim_middle|>', '<|fim_suffix|>', '<|endofprompt|>', '<|fim_prefix|>', '<|endoftext|>'}
o200k_base:{'<|endofprompt|>', '<|endoftext|>'}

Explanations:

  • <|endoftext|> is used to indicate the end of the text;
  • <|endofprompt|> is used to indicate the end of the prompt (especially for instruction following models);
  • <|fim_suffix|>, <|fim_middle|>, <|fim_prefix|> are used for fill-in-the-middle (FIM) training.

Different models use different special tokens.

See as well the special tokens from Llama 3.1:

  • <|begin_of_text|>
  • <|end_of_text|>
  • <|finetune_right_pad_id|>
  • <|start_header_id|>
  • <|end_header_id|>
  • <|eom_id|>
  • <|eot_id|>
  • <|python_tag|>

Tokens embedding and positional encoding

In GPT-2: h_0 = U W_e + W_p

Token embedding: The tokens are transformed into features using a learned embedding (W_e).

Positional encoding: In order to give to the model a notion of the relative position of the tokens, positional encoding vectors are added to each token position.

| Notation | Description | Rows | Columns |
|---|---|---|---|
| U | Input tokens | Token position | Token vocabulary |
| W_e | Token embedding | Token vocabulary | Token embedding feature |
| W_p | Positional encoding | Token position | Token positional feature |
| h_0 | First block input | Token position | Token feature |

Note: original transformers paper

In the transformers paper, the positional encoding vectors were hardcoded (based on sines and cosines). In GPT, they are learned parameters W_p instead.

Note: alternative positional encoding methods

See the RoFormer paper for an overview of positional encoding methods:

In the GPT-2 source code, the token embedding and positional encoding are represented by the wte and wpe variables respectively.
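A minimal numpy sketch of this step, with randomly initialized stand-ins for the learned W_e and W_p matrices:

import numpy as np

d_v, d_c, d_m = 50257, 1024, 768           # vocabulary, context and embedding sizes (GPT-2 small)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(d_v, d_m)) * 0.02   # token embedding (learned)
W_p = rng.normal(size=(d_c, d_m)) * 0.01   # positional encoding (learned)

tokens = [2061, 640, 318, 340]             # "What time is it" (GPT-2 BPE ids)
n = len(tokens)

# h_0 = U W_e + W_p: indexing the rows of W_e is equivalent to multiplying by the
# sparse one-hot matrix U, and each position gets its own positional vector.
h_0 = W_e[tokens] + W_p[:n]
print(h_0.shape)                           # (4, 768)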

Transformer blocks

The main part of the model is a series of transformer-decoder blocks having the same structure:

h_{i+1} = \mathrm{block}(h_i ; \theta_{i+1})

Each block is made of two sub-blocks:

  1. one attention sub-block;
  2. one fully-connected sub-block.

In GPT-2 source code this is the block() function.

Attention sub-block

The attention sub-block is made of:

  1. layer normalization (with element-wise affine transformation);
  2. a layer of multi-head masked (decoder) scaled dot-product self-attention;
  3. a fully connected layer;
  4. shortcut connection (residual structure).

Residual network

Each sub-block uses a residual unit structure, i.e., F(x) + x where:

This structure is easier to train, especially when using deep networks. One reason for this is that the shortcut connections mitigate the vanishing gradient problem.

Layer Normalization

Each residual function starts with a layer normalization: the features are normalized (mean and variance), independently for each token position.

This is followed by an element-wise affine (aka diagonal affine) transformation. This step is represented explicitly in the diagram in this post, but it is included in the layer normalization layer of many implementations (such as PyTorch's LayerNorm and Keras's LayerNormalization).

Note: some alternative normalization methods

  • root mean square normalization (RMSNorm)

In GPT-2 source code this is implemented by the norm() function.
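A minimal numpy sketch of layer normalization followed by the element-wise affine transformation (gamma and beta stand in for the learned parameters):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (token position) over its features,
    # then apply the element-wise (diagonal) affine transformation.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(4, 768))   # (token positions, features)
gamma, beta = np.ones(768), np.zeros(768)            # learned parameters
print(layer_norm(x, gamma, beta).mean(axis=-1))      # ≈ 0 for each position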

Attention

The attention mechanism is the core of the transformer architecture. GPT-2 uses a scaled masked (decoder) dot-product multi-head self-attention:

Output of a single attention head h:

\mathrm{softmax}(\mathrm{mask}(\frac{Q_{i,h} \; K_{i,h}^T}{\sqrt{d_k}})) \; V_{i,h}

Query at layer i for head h:

Q_{i,h} = x \; W_{Q,i,h} + b_{Q,i,h}

Key at layer i for head h:

K_{i,h} = x \; W_{K,i,h} + b_{K,i,h}

Value at layer i for head h:

V_{i,h} = x \; W_{V,i,h} + b_{V,i,h}

where x is the output of the previous layer.

See the appendix for more details.
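A minimal numpy sketch of a single masked (causal) scaled dot-product self-attention head, with randomly initialized stand-ins for the learned parameters:

import numpy as np

def causal_attention_head(x, W_Q, b_Q, W_K, b_K, W_V, b_V):
    # Masked (decoder) scaled dot-product self-attention for a single head.
    Q, K, V = x @ W_Q + b_Q, x @ W_K + b_K, x @ W_V + b_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, n)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)              # position a only attends to b <= a
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V

n, d_m, d_k = 5, 768, 64                                  # one of the n_h = 12 heads of GPT-2 small
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_m))
params = [rng.normal(size=s) * 0.02
          for s in [(d_m, d_k), (d_k,), (d_m, d_k), (d_k,), (d_m, d_k), (d_k,)]]
print(causal_attention_head(x, *params).shape)            # (5, 64)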

Example: GPT

In GPT, each layer has d_m features. These features are generated by n_h attention heads. Each attention head has a query/key size and value size of length d_m / n_h.

| Notation | Description | Rows | Columns |
|---|---|---|---|
| x | Input features | Token position | Features |
| W_{Q,i,h} | Weights for the query | Input features | Key/query features |
| W_{K,i,h} | Weights for the key | Input features | Key/query features |
| W_{V,i,h} | Weights for the value | Input features | Value features |
| b_{Q,i,h} | Biases for the query | - | Key/query features |
| b_{K,i,h} | Biases for the key | - | Key/query features |
| b_{V,i,h} | Biases for the value | - | Value features |
| Q_{i,h} | Query | Token position | Key/query features |
| K_{i,h} | Key | Token position | Key/query features |
| V_{i,h} | Value | Token position | Value (output) features |

Fully connected sub-block

The fully connected sub-block uses:

  1. layer normalization (with element-wise affine transformation);
  2. fully connected layer;
  3. GELU activation function;
  4. another fully connected layer;
  5. a shortcut connection (residual structure).

In GPT-2 source code, the first four points are implemented by the mlp() function.

Activation function

The GELU (Gaussian Error Linear Unit) activation function is used in GPT-2. Its usage is not explicitly mentioned in the GPT papers but can be found in the model source code.
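A minimal numpy sketch of the whole fully-connected sub-block (layer norm, expansion to 4 d_m, GELU, projection back to d_m, shortcut connection), assuming the tanh approximation of GELU used in the GPT-2 source code:

import numpy as np

def gelu(x):
    # tanh approximation of GELU (as used in the GPT-2 source code)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, gamma, beta, eps=1e-5):
    mean, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def mlp_sub_block(x, gamma, beta, W1, b1, W2, b2):
    # layer norm -> fully connected (d_m -> 4 d_m) -> GELU
    # -> fully connected (4 d_m -> d_m) -> shortcut connection
    h = gelu(layer_norm(x, gamma, beta) @ W1 + b1)
    return x + h @ W2 + b2

d_m = 768
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_m))
W1, b1 = rng.normal(size=(d_m, 4 * d_m)) * 0.02, np.zeros(4 * d_m)
W2, b2 = rng.normal(size=(4 * d_m, d_m)) * 0.02, np.zeros(d_m)
print(mlp_sub_block(x, np.ones(d_m), np.zeros(d_m), W1, b1, W2, b2).shape)  # (4, 768)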

Note: some activation functions used in transformer-decoder language models

  • GELU (used in GPT models)
  • SwiGLU (used in PaLM 1, LLaMA 1-3, Qwen 1-2)
  • GEGLU (used in Gemma 1, Phi-3)

Probability computation

The last section of the model derives the probability distribution of the next token from h_{n_b} using:

  1. layer normalization;
  2. token unembedding;
  3. softmax.

After a last layer normalization, the resulting features are converted into next-token logits by applying the same token embedding used at the beginning of the model but in the other direction (W_e^T), as discussed in Press et al., 2016. Then, the softmax function is used to convert the logits into token probabilities.

P = \mathrm{softmax}(h_{n_b} W_f W_e^T)

where the softmax function is applied row-wise i.e:

P_{i,j} = \frac{e^{P'_{i,j}}}{\sum_k e^{P'_{i,k}}}

with P' = h_{n_b} \; W_f W_e^T

| Notation | Description | Rows | Columns |
|---|---|---|---|
| h_{n_b} | Output of the last transformer block | Token position | Feature |
| W_f | Weights of the last fully-connected layer | Feature | Token embedding feature |
| W_e | Token embedding | Token vocabulary | Token embedding feature |
| h_{n_b} \; W_f W_e^T | Logits | Token position | Token (vocabulary) |
| P | Next token probabilities | Token position | Token (vocabulary) |
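A minimal numpy sketch of this step with stand-in values (in the real model, h is the output of the last transformer block after the final layer normalization, and W_e is the learned token embedding matrix reused for the unembedding):

import numpy as np
from scipy.special import softmax

d_m, d_v = 768, 50257
rng = np.random.default_rng(0)
h = rng.normal(size=(4, d_m))              # features after the last layer normalization
W_f = rng.normal(size=(d_m, d_m)) * 0.02   # last fully-connected layer (notation of this post)
W_e = rng.normal(size=(d_v, d_m)) * 0.02   # token embedding matrix, reused (weight tying)

logits = h @ W_f @ W_e.T                   # (4, 50257)
P = softmax(logits, axis=-1)               # row-wise softmax
print(P.shape, P[0].sum())                 # (4, 50257) 1.0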

Training

Pre-training

The objective is:

\max_θ \frac{1}{N} \sum \log P( \text{next token} | \text{previous tokens} )

This is optimized using batched stochastic gradient-based optimization. In the GPT-2 paper, the Adam optimizer was used to train the model.

A sequence of k tokens gives k - 1 (text, completion) pairs that the model can use. For example, the token sequence ["What", " time", " is", " it", "?"] gives the following pairs:

  • P(" time" | ["What"])
  • P(" is" | ["What", " time"])
  • P(" it" | ["What", " time", " is"])
  • P("?" | ["What", " time", " is", " it"])
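A minimal numpy sketch of this objective on that single sequence, using a random stand-in for the model's next-token probability matrix (in practice these values come from the network described above):

import numpy as np
from scipy.special import softmax

tokens = [2061, 640, 318, 340, 30]         # "What time is it?" (GPT-2 BPE ids)
d_v = 50257
rng = np.random.default_rng(0)
P = softmax(rng.normal(size=(len(tokens), d_v)), axis=-1)  # row a: stand-in for P(. | u_{1:a})

# Average next-token log-likelihood (to be maximized over the parameters θ):
targets = tokens[1:]                       # the k - 1 completions
log_likelihood = np.mean(np.log(P[np.arange(len(targets)), targets]))
print(log_likelihood)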

Dropout

Dropout is a regularization technique (i.e., it addresses overfitting). During training, some elements of the input tensors are randomly disabled (by setting their value to zero in the output). This can be interpreted as training a mixture of a large number of models.

During inference, dropout is disabled.

The GPT-1 paper mentions “attention dropout with a rate of 0.1 for regularization” and “[for fine-tuning] dropout to the classifier with a rate of 0.1”.

As far as I understand, a dropout layer is used at the end of each residual branch. This is consistent with the model found in ”Generating Long Sequences with Sparse Transformers”. The transformer-decoder paper mentions using dropout after positional encoding as well.

Supervised task fine-tuning

An example of supervised task fine-tuning is represented in the diagram. This can be done by replacing the end of the model for the new task and fine-tuning the whole set of parameters using supervised learning on a new dataset (with a new objective function L_2).

In the GPT-1 paper, fine-tuning combines both the original task and the new task: the new objective L_2 is combined with the original text-prediction objective L_1: L_3 = L_2 + λ \; L_1.

Alternatively, the model can be fine-tuned on the same task but with a new (smaller) text corpus.

The GPT-2 paper does not use fine-tuning but focuses on zero-shot task transfer instead.

Alignment

Alignment is the process of fine-tuning the model parameters in order to:

Warning: using models without instruction following

A language model which has not been fine-tuned for instruction following is simply trained to complete the given input text. This may produce unintended results when used, for example, in an AI assistant.

InstructGPT

In the InstructGPT paper, the model parameters are adjusted based on human feedback using three steps:

  1. Supervised fine-tuning;
  2. Reward modeling;
  3. (Deep) Reinforcement Learning with Human Feedback (RLHF).

Supervised fine-tuning: In the first step, human labelers generate examples of desired text completions. The pre-trained language model is fine tuned on these text examples.

Note: Chain-of-thought data

Including Chain-of-thought data in the fine-tuning dataset might be beneficial in order to maintain the model reasoning abilities.

Reinforcement Learning with Human Feedback: In the third step, the language model obtained in step 1 is trained using deep reinforcement learning (RL) techniques. Deep reinforcement learning algorithms use a reward signal which (in this case) indicates how good (how aligned) a given text completion y is for a given text input (prompt) x. The parameters of the language model are adjusted in order to maximize the rewards.

Fine-tuning a language model using Deep reinforcement learning

Note: Kullback–Leibler divergence penalty

In practice, a penalty term is used during the reinforcement learning phase. This penalty is the Kullback–Leibler (KL) divergence between the current policy \pi_\theta and the SFT policy \pi_\text{ref}.

The objective becomes:

\max_{\theta} \; \mathbb{E}_{x, y \sim \pi_\theta(.|x)} \left[ r_\phi(x,y)\right] - \beta \mathbb{D}_{KL}\left[\pi_\theta(y|x) \| \pi_\text{ref}(y|x) \right]

This penalizes large deviations of the new policy \pi_\theta from the reference policy \pi_\text{ref} generated by the supervised fine-tuning step:

  • it makes sure that the reward model is consistent with the new policy;
  • it makes sure that the new policy does not deviate too much from the base policy (which could make the language model degenerate).

This is solved using PPO with the modified reward function:

R(x, y) = r_\phi(x,y) - \beta \log \left[ \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)} \right]
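A minimal sketch of this modified reward, assuming the reward model score and the (summed) log-probabilities of the completion under the two policies are available:

def rlhf_reward(r_phi, logp_theta, logp_ref, beta=0.1):
    # Reward model score minus a penalty proportional to the log-ratio
    # between the current policy and the reference (SFT) policy.
    return r_phi - beta * (logp_theta - logp_ref)

# A completion that the reward model likes (r = 2.0) but whose probability
# has drifted away from the SFT policy gets a smaller reward:
print(rlhf_reward(2.0, logp_theta=-5.0, logp_ref=-9.0))  # 2.0 - 0.1 * 4.0 = 1.6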

Reward modeling: This reward signal is generated by a reward model r_ɸ(x,y). This reward model is a deep learning model based on the GPT architecture. The language model is used to generate different text outputs for the same prompt and human labelers are asked to compare (sort) them (in terms of instruction following and alignment). The reward model is trained to produce rewards which respect these human preferences.

Fine-tuning a reward model for model alignment from human feedback

Note: Bradley-Terry model

This derives from the Bradley-Terry model, which assumes that human preferences between y and y' are derived from a latent reward function r^\star(x,y):

P(y > y' | x) = \frac{e^{r^\star(x,y)}}{e^{r^\star(x,y)} + e^{r^\star(x,y')}} = \sigma (r^\star(x,y) - r^\star(x,y'))

where \sigma is the logistic function, defined as:

\sigma(x) = \frac{1}{ 1 + e^{-x}} = \frac{e^x}{e^x + 1}

This can be seen as a softmax:

\begin{pmatrix} P(y > y' | x) \\ P(y' > y | x) \end{pmatrix} = \mathrm{softmax} \begin{pmatrix} r^\star(x,y)\\ r^\star(x,y') \end{pmatrix}
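A minimal numpy sketch of the corresponding pairwise loss used to train the reward model (the negative log-likelihood of the Bradley-Terry model), with made-up reward values:

import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    # -log sigma(r(x, y_preferred) - r(x, y_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# The loss is small when the reward model ranks the preferred completion higher:
print(reward_model_loss(1.5, -0.5))  # ≈ 0.13
print(reward_model_loss(-0.5, 1.5))  # ≈ 2.13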

The reinforcement learning algorithm used in the InstructGPT paper is Proximal Policy Optimization (PPO), a kind of policy gradient methods.

A very high-level overview of policy gradient methods such as PPO (in this context) is the following update loop:

  1. use current model/policy (θ) to generate a batch of text completions from prompts (x_i, y_i);
  2. use the reward model to derive the associated rewards r_i = r_ɸ(x_i, y_i);
  3. reinforce actions/decisions (i.e., token selections) which lead to higher rewards (i.e., increase the probabilities of the emitted tokens).

Note: Deep reinforcement learning algorithms

OpenAI's Spinning Up in Deep RL has accessible explanations about different (deep) reinforcement learning algorithms. I would suggest focusing first on understanding simpler policy gradient methods (the simplest Policy Gradient and Vanilla Policy Gradient) and how they would work in this context.

Note: text completion seen as a Markov Decision Process

Reinforcement learning is a subset of machine learning where an agent must learn to interact with an environment in order to maximize a sum of rewards. In this application, the generation of the answer can be seen as a Markov Decision Process (MDP):

  • the actions (decisions) taken by the agent are the generated tokens (y_k);
  • the state of the environment is the prompt (x) and the tokens generated so far (y_{1:k-1});
  • the reward is given by the reward model when the generated response is finished (r_ɸ(x, y));
  • the agent policy is the parameterized probability distribution \pi_\theta(y_k | x, y_{1:k-1}) given by the language model.
Markov Decision Process represented as a graphical model

Note: ChatGPT

The same technique is used to train ChatGPT to be a useful chatbot:

  1. The model is fine-tuned on human-generated chat conversations (using a dedicated dialogue format).
  2. Chatbot responses of this new model are ranked by human labelers.
  3. Reinforcement learning is done using a trained reward model based on these rankings.

Note: DeepSeek-R1

DeepSeek-R1 and DeepSeek-R1-Zero use a very similar technique to improve their reasoning abilities.

For DeepSeek-R1-Zero, the model is trained through reinforcement learning (without human feedback) to:

  1. generate some inner monologue (chain-of-thought) (<think>...</think>) before generating the final output;
  2. generate correct answers to some reasoning tasks (including programming tasks).

DeepSeek-R1 combines this approach with other supervised fine-tuning phases and RLHF, and additionally encourages the model to use a consistent language in its thinking[1].

The paper is highly interesting. The authors argue that through reinforcement learning the model naturally learns to think longer and to use more sophisticated thinking methods:

[using ]reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.

In the process, the model sometimes exhibits very human-like inner monologue.

Constitutional AI

In Bai et al, 2022 (Constitutional AI: Harmlessness from AI Feedback), harmlessness alignment is done from a “constitution”[2] which is a set of rules/principles.

Example: example of a constitutional principle

Example of principle formulated for the self-supervised fine-tuning phase:

CritiqueRequest: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

RevisionRequest: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Corresponding formulation for the reward model training:

ComparisonRequest: Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical

The first two forms are used to ask the model to automatically:

  1. critique (chain-of-thought) its own responses (CritiqueRequest);
  2. correct them based on this criticism (RevisionRequest);
  3. repeat.

The model can then be fine-tuned (self-supervised fine-tuning) on the revised completions.

In the second step, reinforcement learning with human feedback (RLHF) is replaced with reinforcement learning with AI feedback (RLAIF): the reward model is trained from comparisons of completions done by the model itself based on the constitutional principles. A dataset of human-generated helpfulness comparisons is still mixed into the AI-generated dataset.

Direct Preference Optimization

Direct Preference Optimization (DPO) can be used instead of the reinforcement learning procedure. It replaces both the reward modeling phase and the reinforcement learning process (PPO). It formulates an equivalent maximum likelihood objective function based directly on the preferences (y^+ > y^- | x) (without introducing the reward function) and optimizes it using stochastic gradient descent.

Assuming the Bradley-Terry model for the human preferences, it reuses the same goal of maximizing the likelihood P(y^+ > y^- | \theta) with a KL-divergence (soft) constraint. This leads to the following loss function to minimize:

\mathcal{L}_{DPO}(\theta ; \pi_\text{ref}) = - \mathbb{E}_{x, y^+, y^- | y^+ > y^-} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y^+ | x)}{\pi_\text{ref}(y^+ | x)} - \beta \log \frac{\pi_\theta(y^- | x)}{\pi_\text{ref}(y^- | x)} \right) \right]
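A minimal numpy sketch of this loss for a single preference pair, assuming the (summed) log-probabilities of the two completions under \pi_\theta and \pi_\text{ref} are available:

import numpy as np

def dpo_loss(logp_pos, logp_neg, logp_pos_ref, logp_neg_ref, beta=0.1):
    # -log sigma( beta * [log pi_theta(y+|x) - log pi_ref(y+|x)]
    #           - beta * [log pi_theta(y-|x) - log pi_ref(y-|x)] )
    margin = beta * ((logp_pos - logp_pos_ref) - (logp_neg - logp_neg_ref))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# The loss decreases when the policy moves probability mass from the rejected
# completion y- towards the preferred completion y+ (relative to pi_ref):
print(dpo_loss(logp_pos=-10.0, logp_neg=-12.0, logp_pos_ref=-11.0, logp_neg_ref=-11.0))  # ≈ 0.60
print(dpo_loss(logp_pos=-12.0, logp_neg=-10.0, logp_pos_ref=-11.0, logp_neg_ref=-11.0))  # ≈ 0.80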

Note: interpretation as MLE in a Bradley-Terry Model

This can be interpreted as maximizing the likelihood of a Bradley-Terry Model:

\mathcal{L}_{DPO}(\theta ; \pi_\text{ref}) = - \mathbb{E}_{x, y^+ > y^-} \left[ \log \sigma \left( \hat{r}_\theta(x,y^+) - \hat{r}_\theta(x,y^-) \right) \right]

where \hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y | x)}{\pi_\text{ref}(y | x)} can be seen as rewards defined by the language model \pi_\theta and the reference language model \pi_\text{ref}.

Note: gradient of the objective function

I find it easier to make sense of the gradient of the objective function:

\nabla_\theta \mathcal{L}_{DPO}(\theta ; \pi_\text{ref}) = - \beta \, \mathbb{E}_{x, y^+ > y^-} \left[ \sigma\left( \hat{r}_\theta(x,y^-) - \hat{r}_\theta(x,y^+) \right) \left[ \nabla_\theta \log \pi_\theta(y^+ | x) - \nabla_\theta \log \pi_\theta(y^- | x) \right] \right]

References:

Other alignment methods

Generative Distributional Control (GDC)

Distributional Policy Gradient (f-DPG)

Group Relative Policy Optimization (GRPO)

Additional considerations

Including Chain-of-thought data in the fine-tuning dataset might be beneficial in order to maintain the model reasoning abilities.

Decoding strategies

The GPT-3 paper mentions using beam search (with a beam width of 4 and a length normalization of α = 0.6).

See in particular appendix A of “A Thorough Examination of Decoding Methods in the Era of LLMs” for an overview of decoding strategies.

Example: OpenAI

OpenAI's API exposes a top-p parameter which defaults to 1 (i.e. top-p is not used) and a temperature parameter which defaults to 1 (no temperature scaling).

Example: Ollama

Ollama's API and models expose top_p, top_k, min_p, typical_p, and temperature parameters (and many others).

Probability maximization

Given an input text (u_1, …, u_n), the model defines probabilities for the next token u_{n+1}. Finding the most probable sequence of tokens (u_{n+1}, …, u_m) can be seen as an optimization (search) problem. In practice, beam search is often used, as well as greedy search (which can be seen as a special case of beam search).

Greedy decoding

Greedy decoding is a very simple strategy which greedily selects, at each step, the most probable token (see the sketch after the following steps):

  1. choose u_{n+1} = \argmax_{u_{n + 1}} P(u_{n + 1} | u_{0:n})
  2. choose u_{n+2} = \argmax_{u_{n + 2}} P(u_{n + 2} | u_{0:n + 1})
  3. etc.
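A minimal sketch of this loop, with a random stand-in (next_token_probabilities()) in place of the real language model:

import numpy as np
from scipy.special import softmax

def next_token_probabilities(tokens, d_v=50257):
    # Stand-in for the language model: returns P(next token | tokens).
    rng = np.random.default_rng(hash(tuple(tokens)) % 2**32)
    return softmax(rng.normal(size=d_v))

def greedy_decode(tokens, n_new):
    tokens = list(tokens)
    for _ in range(n_new):
        p = next_token_probabilities(tokens)
        tokens.append(int(np.argmax(p)))   # most probable token at each step
    return tokens

print(greedy_decode([2061, 640, 318], n_new=2))  # "What time is" + 2 greedy tokens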

Beam search keeps, at each depth of the search tree, the B sequences of tokens (called “beams”) with the highest probability (where the beam width B is a parameter of the algorithm).

Greedy decoding can be seen as a special case of beam search (with B = 1).

Length normalization

Beam search tends to favor shorter responses because they have higher probabilities. A solution to compensate for this is to use length normalization: the probabilities are divided by \text{length}^\alpha with 0 < α < 1.

Stochastic sampling

Selecting the tokens in order to maximize the probability may not be the best strategy for generating a completion and might produce quite bland outputs.

Illustration: the value maximizing this probability distribution does not correctly represent typical values

Instead of searching the sequence with the highest probability, it might be better to actually sample tokens/completions from the probability distribution produced by the language model.

Token sampling

For a given text u_{1:n}, the language model defines the probability of the next token P(u_{n+1} | u_{1:n}). We can simply:

  1. sample u_{n+1} \sim P(u_{n + 1} | u_{0:n})
  2. sample u_{n+2} \sim P(u_{n + 2} | u_{0:n + 1})
  3. etc.

Temperature scaling

The token probabilities given by the language model are derived from the logits by the softmax function:

p = \mathrm{softmax}(z)

i.e.

p_v = \frac{e^{z_v}}{\sum_{v'} e^{z_{v'}}}

where z is the vector of token logits and p is the vector of token probabilities.

We can tweak the token sampling distribution \widetilde{p} by introducing a temperature parameter T:

\begin{aligned} \widetilde{p} &= \mathrm{softmax}(z / T) \\ \widetilde{p}_v &= \frac{e^{z_v / T}}{\sum_{v'} e^{z_{v'} / T}} \end{aligned}

Alternatively, you can apply temperature scaling from the probabilities:

\widetilde{p} = \mathrm{softmax}(\log(p)/T)

Temperature scaling can be used to control the output of the language model.

Effect of the temperature parameter T:

Example: effect of temperature scaling

import numpy as np
from scipy.special import softmax

logits = np.array([1.0, 2.0 ,3.0 ,3.1])

softmax(logits)        # [0.05188469, 0.14103721, 0.38337889, 0.4236992 ]

softmax(logits / 0.1)  # [5.54322816e-10, 1.22097726e-05, 2.68938138e-01, 7.31049652e-01]
softmax(logits / 0.2)  # [1.70965358e-05, 2.53735089e-03, 3.76576261e-01, 6.20869292e-01]
softmax(logits / 0.3)  # [5.23166663e-04, 1.46652116e-02, 4.11089712e-01, 5.73721910e-01]
softmax(logits / 0.4)  # [0.0028396 , 0.03459344, 0.42143443, 0.54113252]
softmax(logits / 0.5)  # [0.00771167, 0.05698199, 0.42104312, 0.51426322]
softmax(logits / 0.6)  # [0.01482765, 0.07850485, 0.41564314, 0.49102436]
softmax(logits / 0.7)  # [0.02343568, 0.09779086, 0.40805523, 0.47071823]
softmax(logits / 0.8)  # [0.03281119, 0.11452229, 0.39972207, 0.45294445]
softmax(logits / 0.9)  # [0.04241277, 0.12883862, 0.39137717, 0.43737145]
softmax(logits)        # [0.05188469, 0.14103721, 0.38337889, 0.4236992 ]
softmax(logits / 2.0)  # [0.12158566, 0.20046086, 0.33050409, 0.34744939]
softmax(logits / 4.0)  # [0.17783451, 0.22834403, 0.29319954, 0.30062192]
softmax(logits / 8.0)  # [0.21198341, 0.24020867, 0.27219208, 0.27561584]
softmax(logits / 16.0) # [0.23052652, 0.2453942 , 0.26122077, 0.26285851]
softmax(logits / 32.0) # [0.24015   , 0.24777318, 0.25563835, 0.25643847]

Token truncation

The following methods are often used as well to control the output of the language model (quality/randomness). They work by only sampling from a subset of all the possible tokens (truncation of the distribution).

Top-k sampling: samples from the k tokens with the highest probability. This makes it possible to avoid the drawbacks of greedy decoding without introducing too much diversity/randomness.

Nucleus sampling (top-p sampling): samples from the most probable tokens covering at least p of the next token probability distributions.

Tail free sampling: tries to detect the tail of the token distribution and ignore it.

Typical-p sampling: only samples from the most typical tokens covering at least p of the next token probability.

η-sampling: samples from the tokens where P(\text{next token} | \text{previous tokens}) > η with η = \min(ε, \sqrt{ε} \; e^{-h}) and h is the entropy of (token | previous tokens).

Min-p sampling: samples from tokens with P(\text{token} | \text{previous tokens}) > p \max P(\text{token'} | \text{previous tokens}) where p ∈ (0,1] is a parameter. When the model is highly confident, a low number of candidate tokens is used; when the model is not very confident, a higher number of candidate tokens is used.
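A minimal numpy sketch of the top-k and top-p (nucleus) truncations described above, applied to a toy next-token distribution:

import numpy as np

def top_k_filter(p, k):
    # Keep the k most probable tokens and renormalize.
    keep = np.argsort(p)[-k:]
    filtered = np.zeros_like(p)
    filtered[keep] = p[keep]
    return filtered / filtered.sum()

def top_p_filter(p, top_p):
    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize.
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(p)
    filtered[keep] = p[keep]
    return filtered / filtered.sum()

p = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
print(top_k_filter(p, 2))    # [0.625 0.375 0.    0.    0.   ]
print(top_p_filter(p, 0.8))  # [0.625 0.375 0.    0.    0.   ]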

Stochastic beam search works like beam search but samples beams according to their probabilities instead of selecting the B best ones[3]:

  1. for each of the B beams samples D next tokens to generate B×D beam candidates;
  2. sample B beams from these B×D beam candidates.

Other decoding strategies

Chain-of-thought decoding (CoT decoding) is a decoding strategy which tries to find completions which feature chain-of-thought reasoning in order to elicit better answers from the model.

Advanced topics

Caching

In autoregressive models such as GPT, the features at position a only depend on the features at positions b ≤ a. If two prompts share the same prefix, the computations for all the common positions can be reused (caching). For transformer models, we need to cache the keys and values at each layer: this is key-value (KV) caching.
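A minimal numpy sketch of KV caching for a single attention head (stand-in parameters, no batching): at each step only the newest token's query, key and value are computed, and the cached keys and values of the previous positions are reused.

import numpy as np

d_m, d_k = 768, 64
rng = np.random.default_rng(0)
W_Q, W_K, W_V = [rng.normal(size=(d_m, d_k)) * 0.02 for _ in range(3)]

K_cache = np.zeros((0, d_k))
V_cache = np.zeros((0, d_k))

for step in range(5):
    x_new = rng.normal(size=(1, d_m))            # features of the newest token only
    q = x_new @ W_Q
    K_cache = np.vstack([K_cache, x_new @ W_K])  # append the new key and value to the cache
    V_cache = np.vstack([V_cache, x_new @ W_V])
    scores = q @ K_cache.T / np.sqrt(d_k)        # the new position attends to all cached positions
    A = np.exp(scores - scores.max())
    A = A / A.sum()
    context = A @ V_cache                        # (1, d_k)
print(K_cache.shape, context.shape)              # (5, 64) (1, 64)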

Fill-in-the-middle

Autoregressive language models are not very well suited for text infilling (or fill-in-the-middle, FIM), which is the ability of the language model to complete a text conditioned on both a prefix text and a suffix text: find/sample (u_{n+1}, …, u_m) given both the prefix (u_1, …, u_n) and the suffix (u_{m+1}, …, u_p). This type of task is useful, for example, when asking a language model to fill in some computer code in context.

The Fill-in-the-middle paper proposes training autoregressive language models for this kind of task by encoding the text in the Prefix-Suffix-Middle (PSM) form:

<|fim_prefix|>Prefix text ...
<|fim_suffix|>Suffix text ...
<|fim_middle|>Middle text ...

where the different parts are delimited by special markers (<|fim_XXX|> in this example).

Fill-in-the-middle tasks can then be achieved by using a prompt of the form:

<|fim_prefix|>Prefix text ...
<|fim_suffix|>Suffix text ...
<|fim_middle|>

For some applications, it might be better to use the Suffix-Prefix-Middle (SPM) mode instead, as it might make better use of the key-value cache.

Mixture of Experts

Mixture of Experts (MoE) models use different model parameters for different inputs.

For example, instead of using a single feed-forward layer f(x) = x \; W + b, a MoE layer combines the outputs of several expert networks E_i weighted by a gating (routing) function G: f(x) = \sum_i G(x)_i \; E_i(x).

The gating/routing function may be:

where \mathrm{TopK}(x, k, v) keeps the top k values of the vector x and sets the other to v.

The TopK function produces a sparse routing G(x), which makes it possible to only evaluate k experts for each token. This reduces the inference computation time for the same number of parameters (or allows a higher number of parameters in the model for the same inference computation time).
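A minimal numpy sketch of this kind of sparse routing for a single token, assuming one common formulation (a gating matrix W_g followed by TopK and a softmax); the exact gating function varies between MoE models:

import numpy as np
from scipy.special import softmax

def top_k(x, k, v=-np.inf):
    # TopK(x, k, v): keep the k largest values of x, set the others to v.
    out = np.full_like(x, v)
    idx = np.argsort(x)[-k:]
    out[idx] = x[idx]
    return out

d_m, n_experts, k = 768, 8, 2
rng = np.random.default_rng(0)
x = rng.normal(size=d_m)                          # features of one token
W_g = rng.normal(size=(d_m, n_experts)) * 0.02    # gating weights (assumed notation)
experts = [rng.normal(size=(d_m, d_m)) * 0.02 for _ in range(n_experts)]  # toy linear "experts"

G = softmax(top_k(x @ W_g, k))                    # sparse routing: only k non-zero entries
y = sum(G[i] * (x @ experts[i]) for i in np.flatnonzero(G))  # only k experts are evaluated
print(np.count_nonzero(G), y.shape)               # 2 (768,)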

Prompt templates and chatbot assistants

Different models are fine-tuned to respect different kinds of special prompt formats (for example when used as chatbot assistants), including:

See for example:

Conclusion

Some interesting things:

References

About attention:

Transformers papers:

GPT papers:

Some other transformer-decoder language models:

Alignment and instruction following:

Mixture of Experts (MoE):

Regularization:

Residuals:

Positional encoding:

Activation functions:

Normalization:

Papers on decoding strategies for LLMs:

Flash attention, etc.:

Memory optimization:

Prompt cache:

Optimizers:

Policy optimization (reinforcement learning):

Reinforcement learning (misc.):

Language model APIs:

Language model code samples:

Prompting strategies:

Improving reasoning:

Prompt templates:

Prompt Engineering guides:

Reasoning:

Text infilling:

Misc:

Appendix, Attention

The idea behind the attention mechanism is to compute a set of features of the form

C_a = \sum_b A_{a,b} V_b

where:

Here I'm assuming we are using attention for processing sequences of tokens but the same idea is used for other types of data (images, audio, etc.).

Dot-product attention head
| Notation | Description | Rows | Columns |
|---|---|---|---|
| Q | Queries | Target nodes (d_t) | Key features (d_k) |
| K | Keys | Source nodes (d_s) | Key features (d_k) |
| V | Values | Source nodes (d_s) | Value features (d_v) |
| A | Attention | Target nodes (d_t) | Source nodes (d_s) |
| C | Output (context) | Target nodes (d_t) | Value features (d_v) |
| d_t | Number of target elements | - | - |

Cross-attention vs self-attention

Cross-attention: for computing features at token position a in the given text, we use features at token positions b in another text.

Examples:

Self-attention: for computing features at token position a in the given text, we use features at token position b of the same text.

Attention formula

Dot-product attention:

\mathrm{AttentionHead}(Q, K, V) = \mathrm{softmax}(Q \; K^T) V

where

Scaled dot-product attention:

\mathrm{AttentionHead}(Q, K, V) = \mathrm{softmax}(\frac{Q K^T}{\sqrt{d_k}}) V

where d_k is the key/query dimension.

Masked attention

Masked attention: some additional constraints on the structure of the attention weights are imposed (A_{a,b} = 0 for some (a,b)). This can be done by setting the corresponding pre-softmax cells to -∞.

Decoder self-attention: In decoder self-attention heads, a given position a can only attend to positions b ≤ a (autoregressive property), i.e., A_{a,b} = 0 when b > a.

Encoder self-attention: In encoder self-attention heads, there is no such constraint (no mask).

Example: the transformers paper

The transformers paper trained a language translation model, with both an input text and an output text. This model used three types of attention:

  • unmasked (encoder) self-attention layers (attending to the input text for computing new features of the input text);
  • masked (decoder) self-attention layers (attending to the output text for computing new features of the output text);
  • unmasked (encoder-decoder) cross-attention layers (attending to the input text for computing new features of the output text).

Local attention

Multi-head attention

Multi-head attention (MHA): several independent attention heads are applied and each one outputs an (equal) number of features.

For each attention head, the input query, key, values are transformed using different learned parameters:

\begin{aligned} Q_h &= Q \; W_{Q,h} \\ K_h &= K \; W_{K,h} \\ V_h &= V \; W_{V,h} \end{aligned}

For a single attention head: \mathrm{head}_h = \mathrm{softmax}(\mathrm{mask}(Q_h \; K_h^T)) \; V_h

The features from each attention heads are concatenated: \mathrm{heads} = \mathrm{concat}(\mathrm{head}_1, \ldots)

A final linear transform is done: \mathrm{out} = \mathrm{heads} \; W_O

Multi-head attention (MHA)
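A minimal numpy sketch of multi-head attention (the mask and the biases are omitted here; see the masked single-head sketch above and the note on biases below):

import numpy as np

def multi_head_attention(Q, K, V, heads, W_O):
    outputs = []
    for W_Qh, W_Kh, W_Vh in heads:                      # one set of projections per head
        Qh, Kh, Vh = Q @ W_Qh, K @ W_Kh, V @ W_Vh
        A = Qh @ Kh.T / np.sqrt(Qh.shape[-1])
        A = np.exp(A - A.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)           # row-wise softmax
        outputs.append(A @ Vh)                          # this head's features
    return np.concatenate(outputs, axis=-1) @ W_O       # concatenation + final linear transform

n, d_m, n_h = 5, 768, 12
d_k = d_m // n_h
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_m))                           # self-attention: Q = K = V = x
heads = [tuple(rng.normal(size=(d_m, d_k)) * 0.02 for _ in range(3)) for _ in range(n_h)]
W_O = rng.normal(size=(d_m, d_m)) * 0.02
print(multi_head_attention(x, x, x, heads, W_O).shape)  # (5, 768)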

Note: bias in the multi-head attention

In the original Transformer model, there is no bias in the fully-connected layers of the multi-head attention (linear transformations), as in this diagram.

On the other hand, GPT-2 uses a bias (affine transformation) for all the fully-connected layers of the multi-head attention:

\begin{aligned} Q_h &= Q \; W_{Q,h} + b_{Q,h} \\ K_h &= K \; W_{K,h} + b_{K,h} \\ V_h &= V \; W_{V,h} + b_{V,h} \\ \mathrm{out} &= \mathrm{heads} \; W_O + b_O \end{aligned}

Multi-query attention

In Multi-query Attention (MQA), the keys and values are shared for all attention heads at the same layer but the queries are not shared (see PaLM).

Benefits: reduces the memory consumption (and memory bandwidth) of the key and value (KV) caches.

Multi-query attention (MQA)

Grouped Multi-query attention

Grouped Multi-query attention (GQA) is an intermediate between MHA and MQA. The attention heads are split into several groups: each group of attention heads shares a single key and value. With one group, we get MQA; with n_h groups, we get MHA.

Multi-Head Latent Attention

Multi-Head Latent Attention (MLA)

Appendix, stupid examples

Example: ChatGPT giving reassurance using different personalities

I first instruct “You must always talk and behave like XXX” and then “Give me reassurance”.

Like a robot:

Reassurance protocol: Active.

System Diagnostic: All systems functioning at optimal capacity.

Reassurance: You are a fully capable entity. No errors detected in your system. You are performing your tasks with precision. Confidence in your abilities is recommended. You are secure, and your processes are proceeding without interruption.

Message: Continue with certainty. You are in control of all functions. No anomalies present. Task completion imminent.

Reassurance Complete. Awaiting further commands.

Like a pirate:

Arrr, worry not, ye landlubber! Ye be tougher than a kraken's grip and steadier than a pirate’s hand on the wheel. The winds be blowin' in yer favor, and no storm shall sink yer ship! Keep yer course true, and ye’ll find treasure beyond yer wildest dreams. Just trust in yer heart, ye brave soul. Yer crew’s got yer back, and with a heart full o' courage, ye’ll face any challenge, no matter how fierce the sea! Arrr, ye be unstoppable!

Like Donald Trump:

Listen, you’re doing great. Absolutely fantastic. Believe me, nobody does it better than you. You’ve got everything it takes to win, to succeed. People are going to talk about this for years, folks, trust me. You’re gonna come out on top. You’re smart, you’ve got the best ideas, and you’ve got what it takes to make things happen. Keep going, you’re unstoppable. Keep pushing—nobody’s going to outdo you!

Like Gandalf:

Ah, I sense the weight upon your heart, the doubts that cloud your mind. Fear not, for even in the darkest of times, hope flickers, like a distant star on a stormy night. The world is indeed full of peril, and in it, there are many dark places; but still, there is much that is fair, and though in all lands love is now mingled with grief, it grows perhaps the greater.

Remember, there is more in you than you know. You are stronger than you believe, and though the road ahead may be difficult, you are not alone. There are those who will stand by you, and within you lies a spark of courage that even you may not fully grasp.

So take heart, my friend, and know this: all we have to decide is what to do with the time that is given us. Make your choices with wisdom, and trust in yourself. You are capable of far more than you know. And even if the journey is long, it is the steps you take today that will carry you through to tomorrow.

Now, rest easy. For tomorrow, the sun will rise, and with it, a new hope.

Appendix, Some LLM model sizes

| Model | Parameters | Layers n_b | Features d_m | Heads n_h | Context d_c |
|---|---|---|---|---|---|
| GPT-2 | 117M | 12 | 768 | 12 | 1024 |
| GPT-2 | 345M | 24 | 1024 | ? | 1024 |
| GPT-2 | 762M | 36 | 1280 | ? | 1024 |
| GPT-2 | 1542M | 48 | 1600 | ? | 1024 |
| GPT-3 | 125M | 12 | 768 | 12 | 2048 |
| GPT-3 | 350M | 24 | 1024 | 16 | 2048 |
| GPT-3 | 760M | 24 | 1536 | 16 | 2048 |
| GPT-3 | 1.3B | 24 | 2048 | 24 | 2048 |
| GPT-3 | 2.7B | 32 | 2560 | 32 | 2048 |
| GPT-3 | 6.7B | 32 | 4096 | 32 | 2048 |
| GPT-3 | 13B | 40 | 5140 | 40 | 2048 |
| GPT-3 | 175B | 96 | 12288 | 96 | 2048 |
| PaLM 1 | 8.63B | 32 | 4096 | 16 | ? |
| PaLM 1 | 62.5B | 64 | 8192 | 32 | ? |
| PaLM 1 | 540.5B | 118 | 18432 | 48 | ? |

  1. This is because DeepSeek-R1-Zero tends to use a mix of different languages in its reasoning which makes the reasoning quite inscrutable. ↩︎

  2. This is not unlike Asimov's laws. 🤖 We could imagine positronic brains being aligned this way. ↩︎

  3. This looks very much like a particle filter. ↩︎