Transformer-decoder language models
Some notes on how transformer-decoder language models work, taking GPT-2 as an example, and with lots of references for digging deeper.
You can follow along with:
Update 2025-01-19: Added a dense layer in the transformer sub-block that I forgot.
Update 2025-01-28: Added some notes on DeepSeek-R1.
Table of contents
Overview
OpenAI's GPT (Generative Pre-trained Transformer) is a family of autoregressive deep language models using a transformer-decoder architecture (based on self-attention). Many other language models are based on the transformer-decoder design: Google's Gemini, Meta's LLaMA, Anthropic's Claude, Mistral.ai's Mistral, Alibaba's Qwen, Microsoft's Phi, etc. These models are widely used as the core of AI assistants (ChatGPT, Copilot, etc.).
These models are trained to estimate the distribution of the completion of an input text:
P(" is it?" | "What time")
More precisely, they estimate the probability distribution of the next token given the previous tokens (so far):
P(" it" | ["What", " time", " is"]) = modelθ(" it" | ["What", " time", " is"])
where θ are the model parameters.
This defines probabilities for longer completions:
P([" it", "?"] | ["What", " time", " is"]) = P(" it" | ["What", " time", " is"]) × P("?" | ["What", " time", " is", " it"])
In the unsupervised pre-training phase, this probability distribution is estimated by fitting the model parameters θ on a large corpus of text.
The trained model is typically used to generate possible completions from an initial text. Different decoding strategies can be used to generate a text completion from the token probabilities generated by the trained model.
Radford et al (2019) argued that using a wide corpus of text scraped from the internet provides sufficient information for the model to learn a lot of different tasks (emphasis mine):
The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.
Given a large enough corpus of text (and if the model is suitable), the model can generalize and implicitly learn to do a lot of things along the way (zero-shot/few-shot task transfer) such as answering questions, summarizing texts, translating texts between different (human) languages, generating or explaining code in different programming languages, writing poems, assuming some kind of persona (see appendix), etc.
The model parameters may then (optionally) be fine-tuned to tweak the model's behavior for a specific task (supervised fine-tuning for the given task) or for a more specific type of document (unsupervised fine-tuning).
One example of such fine-tuning is to align the model's behavior with the expectations of its users and/or designers (alignment), such as better following user instructions (instruction-tuning), which makes the model useful in an AI assistant, or avoiding generating harmful, disrespectful, hateful or biased content. This alignment can be done by adjusting the language model's behavior based on human feedback.
Architecture of GPT-2
The following diagram represents the GPT-2 (and GPT-3) model. It is based on OpenAI's GPT-2 model code as well as the GPT-1, GPT-2 and Transformers papers.
This diagram includes details about the multi-head attention, training objective and supervised task fine-tuning.
For a given sequence of tokens:
Input: a list of n tokens (up to d_c) generated by the tokenizer from the input text. In the rest of this post, this list of tokens is seen as a sparse (n, d_v) matrix where each row represents a token position inside the text and each column represents a possible token: if the token at position 5 is the token #6, the entry at row 5, column 6 is 1 and all the other entries of row 5 are 0.
Output: a (n, d_v) matrix giving, at each token position, the estimated probability distribution of the next token.
Note: conventions in this diagram
Computations:
- Rectangles represent computations (layers, blocks, sub-blocks, etc.).
- The parts in blue are used for the model training (or evaluation).
- Dashed boxes represent things which are not part of the GPT-2 model itself. This includes tokenization, evaluation and training/learning, text generation (decoding strategy).
Data:
- Arrows represent computed values and data flows.
- Dimensions of data tensors are indicated in parentheses.
- Model outputs are in bold.
- Learning objectives are in bold and italic.
- Input/output port names of some blocks are indicated in italic in parentheses.
Learned parameters:
- Filled squares (■) represent 2D tensor (matrix) learned model parameters.
- Filled rows (▬) represent 1D tensor (vector) learned model parameters.
- For example, the fully-connected layers marked with ■▬ have a bias (affine transformation) while the fully-connected layers marked with only ■ have no bias (linear transformation).
Model size (not learned):
- n_h, number of attention heads per layer
- n_b, number of blocks
- d_v, size of the token dictionary
- d_m, dimension of token embeddings (number of features)
- d_c, context size (in tokens)
- d_y, output dimension
Tokenizer
The input text is transformed into a sequence of tokens.
GPT uses a byte pair encoding (BPE) tokenizer implemented in the tiktoken library.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("What time is it, please?")
# => [2061, 640, 318, 340, 11, 3387, 30]
[enc.decode([token]) for token in tokens]
# => ['What', ' time', ' is', ' it', ',', ' please', '?']
One tokenization strategy would be to use whole words as tokens, but the language model would then not be able to handle unknown words; it would not even be able to spit them back, as in the following example.
Example: language models are able to handle unknown words
Prompt:
Hello, my name is Shargargoartzrk.
Answer:
Hello! It'm Phi to assist you with information or tasks today. I must clarify that "Shargargoartzrk" doesn’t seem familiar in standard English conversations; could it be a username? Anyway, how can I help you now? Please feel free to ask any questions you have!
(emphasis mine)
Instead, unknown and complex words are split into different tokens:
[enc.decode([token]) for token in enc.encode("Shargargoartzrk")]
# => ['Sh', 'arg', 'argo', 'artz', 'r', 'k']
[enc.decode([token]) for token in enc.encode("transformers")]
# => ['transform', 'ers']
[enc.decode([token]) for token in enc.encode("Rao-Blackwellization")]
# => ['R', 'ao', '-', 'Black', 'well', 'ization']
Vocabulary size of the different tokenizers in the tiktoken library:
for encoding in tiktoken.list_encoding_names():
    enc = tiktoken.get_encoding(encoding)
    print(encoding + ":" + repr(enc.n_vocab))
# gpt2:50257
# r50k_base:50257
# p50k_base:50281
# p50k_edit:50284
# cl100k_base:100277
# o200k_base:200019
Note: some tokenizer implementations
- OpenAI's tiktoken
- Google's SentencePiece
- Huggingface's Tokenizers
Note: special tokens
Some tokenizers include some special tokens with special semantic.
For example tiktoken tokenizers:
for encoding in tiktoken.list_encoding_names():
    enc = tiktoken.get_encoding(encoding)
    print(encoding + ":" + repr(enc.special_tokens_set))
# gpt2:{'<|endoftext|>'}
# r50k_base:{'<|endoftext|>'}
# p50k_base:{'<|endoftext|>'}
# p50k_edit:{'<|fim_suffix|>', '<|fim_middle|>', '<|fim_prefix|>', '<|endoftext|>'}
# cl100k_base:{'<|fim_middle|>', '<|fim_suffix|>', '<|endofprompt|>', '<|fim_prefix|>', '<|endoftext|>'}
# o200k_base:{'<|endofprompt|>', '<|endoftext|>'}
Explanations:
- <|endoftext|> is used to indicate the end of the text;
- <|endofprompt|> is used to indicate the end of the prompt (especially for instruction following models);
- <|fim_suffix|>, <|fim_middle|>, <|fim_prefix|> are used for fill-in-the-middle (FIM) training.
Different models use different special tokens.
See as well the special tokens from Llama 3.1:
<|begin_of_text|>
<|end_of_text|>
<|finetune_right_pad_id|>
<|start_header_id|>
<|end_header_id|>
<|eom_id|>
<|eot_id|>
<|python_tag|>
Token embedding and positional encoding
In GPT-2:
Token embedding: the tokens are transformed into features using a learned embedding matrix (of dimensions (d_v, d_m)).
Positional encoding: in order to give the model a notion of the relative position of the tokens, positional encoding vectors are added at each token position.
| Notation | Description | Rows | Columns |
|---|---|---|---|
| | Input tokens | Token position | Token vocabulary |
| | Token embedding | Token vocabulary | Token embedding feature |
| | Positional encoding | Token position | Token positional feature |
| | First block input | Token position | Token feature |
Note: original transformers paper
In the transformers paper, the positional encoding vectors were hardcoded (based on sine and cosine functions). In GPT, they are learned parameters.
Note: alternative positional encoding methods
See the RoFormer paper for an overview of positional encoding methods:
- absolute position embedding
- relative position embedding
- Rotary Position Embedding (RoPE)
- LongRoPE
In the GPT-2 source code, the token embedding and the positional encoding are represented by the wte and wpe variables respectively.
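Example: token embedding and positional encoding (sketch)
A minimal NumPy sketch of this step (not the actual GPT-2 code); wte and wpe stand for the learned (d_v, d_m) and (d_c, d_m) matrices and are random placeholders here:
import numpy as np

d_v, d_c, d_m = 50257, 1024, 768           # GPT-2 (small) sizes
rng = np.random.default_rng(0)
wte = 0.02 * rng.normal(size=(d_v, d_m))   # token embedding (placeholder for the learned matrix)
wpe = 0.02 * rng.normal(size=(d_c, d_m))   # positional encoding (placeholder for the learned matrix)

tokens = [2061, 640, 318, 340]             # "What time is it"
# Each token is embedded and the encoding of its position is added.
x = wte[tokens] + wpe[np.arange(len(tokens))]
print(x.shape)                             # (4, 768) i.e. (n, d_m)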
Transformer blocks
The main part of the model is a series of transformer decoder blocks having the same structure:
Each block is made of two sub-blocks:
- one attention sub-block;
- one fully-connected sub-block.
In GPT-2 source code this is the block()
function.
Attention sub-block
The attention sub-block is made of:
- layer normalization (with element-wise affine transformation);
- a layer of multi-head masked (decoder) scaled dot-product self-attention;
- a fully connected layer;
- shortcut connection (residual structure).
Residual network
Each sub-block uses a residual unit structure, i.e., its output is x + F(x), where:
- F(x) is the residual function;
- x is the shortcut connection (also known as the skip connection).
This structure is easier to train, especially when using deep networks. One reason for this is that the shortcut connections mitigate the vanishing gradient problem.
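Example: residual unit (sketch)
A minimal illustration of the residual structure: the output of the residual function F is added to its input (toy residual function here):
import numpy as np

def residual_unit(x, residual_fn):
    # Residual structure: output = x + F(x), where F is the residual function.
    return x + residual_fn(x)

rng = np.random.default_rng(0)
w = 0.02 * rng.normal(size=(8, 8))               # toy residual function parameters
x = rng.normal(size=(4, 8))                      # (token position, features)
print(residual_unit(x, lambda x: x @ w).shape)   # (4, 8)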
Layer Normalization
Each residual function starts with a layer normalization: the features are normalized (mean and variance), independently for each token position.
This is followed by an element-wise affine (aka diagonal affine) transformation. This step is represented explicitly in the diagram in this post but it is included in the layer normalization layer of many implementations (such as PyTorch's LayerNorm and Keras's LayerNormalization).
Note: some alternative normalization methods
- root mean square normalization (RMSNorm)
In the GPT-2 source code, this is implemented by the norm() function.
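Example: layer normalization (sketch)
A minimal NumPy sketch of layer normalization with the element-wise affine transformation (the gain and bias are the learned parameters, placeholders here):
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Normalize the features (mean and variance) independently at each token position...
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # ... then apply the element-wise (diagonal) affine transformation.
    return (x - mean) / np.sqrt(var + eps) * gain + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # (token position, features)
y = layer_norm(x, gain=np.ones(8), bias=np.zeros(8))
print(y.mean(axis=-1), y.var(axis=-1))           # ~0 and ~1 at each position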
Attention
The attention mechanism is the core of the transformer architecture. GPT-2 uses a scaled masked (decoder) dot-product multi-head self-attention:
- At each attention layer i, n_h attention heads are used.
- Each attention head has a query/key size and value size of length d_m/n_h.
- Each attention head outputs d_m/n_h features (at each position) for a total of d_m output features at each position for all attention heads at this layer.
Output of a single attention head at layer i:
head = \mathrm{softmax}(\mathrm{mask}(Q \; K^T / \sqrt{d_k})) \; V
Query at layer i: Q = X \; W_Q + b_Q
Key at layer i: K = X \; W_K + b_K
Value at layer i: V = X \; W_V + b_V
where X is the input of the attention head, W_Q, W_K, W_V are learned weights, b_Q, b_K, b_V are learned biases (see the table below) and d_k = d_m/n_h is the key/query dimension.
See the appendix for more details.
Example: GPT
In GPT-1 (and the smallest GPT-2 model), each layer has n_h = 12 attention heads and d_m = 768 features, so each attention head works with d_m/n_h = 64 query/key/value features.
| Notation | Description | Rows | Columns |
|---|---|---|---|
| | Input features | Token position | Features |
| | Weights for the query | Input features | Key/query features |
| | Weights for the key | Input features | Key/query features |
| | Weights for the value | Input features | Value features |
| | Biases for the query | - | Key/query features |
| | Biases for the key | - | Key/query features |
| | Biases for the value | - | Value features |
| | Query | Token position | Key/query features |
| | Key | Token position | Key/query features |
| | Value | Token position | Value (output) features |
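Example: a single attention head (sketch)
A minimal NumPy sketch of one masked (decoder) scaled dot-product self-attention head, following the formulas and shapes above (random placeholder parameters, no batching, no output projection):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, w_q, b_q, w_k, b_k, w_v, b_v):
    q = x @ w_q + b_q                           # queries (n, d_k)
    k = x @ w_k + b_k                           # keys    (n, d_k)
    v = x @ w_v + b_v                           # values  (n, d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot-products (n, n)
    mask = np.triu(np.ones_like(scores), 1)     # mask attention to future positions
    scores = np.where(mask == 1, -np.inf, scores)
    return softmax(scores) @ v                  # (n, d_k) output features

n, d_m, n_h = 4, 768, 12                        # GPT-2 (small) sizes
d_k = d_m // n_h                                # 64 query/key/value features per head
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_m))
params = [0.02 * rng.normal(size=s) for s in
          [(d_m, d_k), d_k, (d_m, d_k), d_k, (d_m, d_k), d_k]]
print(attention_head(x, *params).shape)         # (4, 64)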
Fully connected sub-block
The fully connected sub-block uses:
- layer normalization (with element-wise affine transformation);
- fully connected layer;
- GELU activation function;
- another fully connected layer;
- a shortcut connection (residual structure).
In the GPT-2 source code, the first four points are implemented by the mlp() function.
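Example: fully connected sub-block (sketch)
A minimal sketch of the two fully-connected layers with the GELU activation; the inner layer expands to 4×d_m features as in GPT-2, and layer normalization and the shortcut connection are omitted here:
import numpy as np

def gelu(x):
    # tanh approximation of GELU (as used in the GPT-2 code)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    return gelu(x @ w1 + b1) @ w2 + b2          # expand, activate, project back

d_m = 768
rng = np.random.default_rng(0)
w1, b1 = 0.02 * rng.normal(size=(d_m, 4 * d_m)), np.zeros(4 * d_m)
w2, b2 = 0.02 * rng.normal(size=(4 * d_m, d_m)), np.zeros(d_m)
x = rng.normal(size=(4, d_m))                   # (token position, features)
print(mlp(x, w1, b1, w2, b2).shape)             # (4, 768)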
Activation function
The GELU (Gaussian Error Linear Unit) activation function is used in GPT-2. Its usage is not explicitly mentioned in the GPT papers but can be found in the model source code.
Note: some activation functions used in transformer-decoder language models
- GELU (used in GPT models)
- SwiGLU (used in PaLM 1, LLaMA 1-3, Qwen 1-2)
- GEGLU (used in Gemma 1, Phi-3)
Probability computation
The last section of the model derives the probability distribution of the next token from the output of the last transformer block using:
- layer normalization;
- token unembedding;
- softmax.
After a last layer normalization, the resulting features are converted into next-token logits by applying the same token embedding used at the beginning of the model but in the other direction (weight tying).
The logits are then turned into next-token probabilities by the softmax function, applied row-wise (i.e., independently for each token position):
softmax(z)_j = exp(z_j) / Σ_k exp(z_k)
with z a row of logits.
| Notation | Description | Rows | Columns |
|---|---|---|---|
| | Output of the last transformer block | Token position | Feature |
| | Weights of the last fully-connected layer | Feature | Token embedding feature |
| | Token embedding | Token vocabulary | Token embedding feature |
| | Logits | Token position | Token (vocabulary) |
| | Next token probabilities | Token position | Token (vocabulary) |
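Example: from features to next-token probabilities (sketch)
A simplified NumPy sketch of this last step (ignoring the final layer normalization and any intermediate projection): the output features are mapped back to the vocabulary with the transposed token embedding matrix and normalized with a row-wise softmax:
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_m, d_v = 4, 768, 50257
rng = np.random.default_rng(0)
h = rng.normal(size=(n, d_m))                # features after the last layer normalization
wte = 0.02 * rng.normal(size=(d_v, d_m))     # token embedding (tied weights)
logits = h @ wte.T                           # (n, d_v) next-token logits
probs = softmax(logits)                      # (n, d_v), one distribution per position
print(probs.shape, probs.sum(axis=-1))       # (4, 50257), each row sums to 1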
Training
Pre-training
The objective is to maximize the log-likelihood of each token of the training corpus given the previous tokens:
L(θ) = \sum_k \log P_θ(u_k | u_{1:k-1})
This is optimized using batched stochastic gradient-based optimization. In the GPT-2 paper, the Adam optimizer was used to train the model.
A sequence of ["What", " time", " is", " it", "?"] gives the following information:
- " time" follows ["What"];
- " is" follows ["What", " time"];
- " it" follows ["What", " time", " is"];
- "?" follows ["What", " time", " is", " it"].
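Example: pre-training loss (sketch)
A minimal sketch of the pre-training loss for this sequence: the negative log-probability assigned at each position to the actual next token (the model output probs is a random placeholder here):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = [2061, 640, 318, 340, 30]           # "What time is it?"
rng = np.random.default_rng(0)
probs = softmax(rng.normal(size=(len(tokens), 50257)))   # placeholder for the model output

# The prediction at position k is scored against the actual token at position k+1.
targets = tokens[1:]
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(loss)                                  # cross-entropy to minimize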
Dropout
Dropout is a regularization technique (i.e., it addresses overfitting). During training, some elements of the input tensors are randomly disabled (their value is set to zero in the output). This can be interpreted as training a mixture of a large number of models.
During inference, dropout is disabled.
The GPT-1 paper mentions “attention dropout with a rate of 0.1 for regularization” and “[for fine-tuning] dropout to the classifier with a rate of 0.1”.
As far as I understand, a dropout layer is used at the end of each residual branch. This is consistent with the model found in ”Generating Long Sequences with Sparse Transformers”. The transformer-decoder paper mentions using dropout after positional encoding as well.
Supervised task fine-tuning
An example of supervised task fine-tuning is represented in the diagram.
This can be done by replacing the end of the model for the new task and fine-tuning all the parameters using supervised learning on a new data set (with a new objective function adapted to the new task).
In the GPT-1 paper, fine-tuning combines both the original task and the new task: the new objective is the sum of the new task objective and the original language modeling objective (weighted by a coefficient).
Alternatively, the model can be fine-tuned on the same task but with a new (smaller) text corpus.
The GPT-2 paper does not use fine-tuning but focuses on zero-shot task transfer instead.
Alignment
Alignment is the process of fine-tuning the model parameters in order to:
- follow instructions;
- produce useful content;
- avoid generating harmful, hateful, toxic, biased, disrespectful content.
Warning: using models without instruction following
A language model which has not been fine-tuned for instruction following is simply trained to complete the given input text. This may produce unintended results when used, for example, in an AI assistant.
InstructGPT
In the InstructGPT paper, the model parameters are adjusted based on human feedback using three steps:
- Supervised fine-tuning;
- Reward modeling;
- (Deep) Reinforcement Learning with Human Feedback (RLHF).
Supervised fine-tuning: In the first step, human labelers generate examples of desired text completions. The pre-trained language model is fine-tuned on these text examples.
Note: Chain-of-thought data
Including chain-of-thought data in the fine-tuning dataset might be beneficial in order to maintain the model's reasoning abilities.
Reinforcement Learning with Human Feedback: In the third step, the language model obtained in step 1 is further trained using deep reinforcement learning (RL) techniques. Deep reinforcement learning algorithms use a reward signal which (in this case) indicates how good (how aligned) a given text completion y is for a given text input (prompt) x. The parameters of the language model are adjusted in order to maximize the rewards.
Note: Kullback–Leibler divergence penalty
In practice, a penalty term is used during the reinforcement learning phase. This penalty is the Kullback–Leibler (KL) divergence between the current policy π_θ and the base policy π_base (the model before the reinforcement learning phase).
The objective becomes:
maximize E[ r_ɸ(x, y) ] − β · KL( π_θ(· | x) ‖ π_base(· | x) )
This penalizes large deviations of the new policy from the base policy:
- it makes sure that the reward model stays consistent with the new policy;
- it makes sure that the new policy does not deviate too much from the base policy (which could make the language model degenerate).
This is solved using PPO with the modified reward function:
R(x, y) = r_ɸ(x, y) − β · \log( π_θ(y | x) / π_base(y | x) )
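Example: KL-penalized reward (sketch)
A rough sketch of this modified reward (assumed form, not the exact InstructGPT implementation), combining the reward model score with the log-probability ratio between the current policy and the base policy:
def penalized_reward(reward_model_score, logprob_policy, logprob_base, beta=0.1):
    # KL-penalized reward: the score is reduced when the current policy assigns a
    # (much) higher log-probability to the completion than the base policy.
    return reward_model_score - beta * (logprob_policy - logprob_base)

print(penalized_reward(1.5, -40.0, -42.0))   # 1.3: penalized for drifting away from the base policy
print(penalized_reward(1.5, -42.0, -42.0))   # 1.5: no penalty when both policies agree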
Reward modeling: This reward signal is generated by a reward model r_ɸ(x, y), trained (in the second step) from human comparisons: labelers rank several completions generated for the same prompt, and the reward model is trained to assign higher rewards to the preferred completions.
Note: Bradley-Terry model
This derives from the Bradley–Terry model, which assumes that the human preference between two completions y_1 and y_2 for the same prompt x follows:
P(y_1 ≻ y_2 | x) = σ( r_ɸ(x, y_1) − r_ɸ(x, y_2) )
where σ is the logistic (sigmoid) function.
This can be seen as a softmax:
P(y_1 ≻ y_2 | x) = exp(r_ɸ(x, y_1)) / ( exp(r_ɸ(x, y_1)) + exp(r_ɸ(x, y_2)) )
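Example: reward model loss (sketch)
A minimal sketch of the corresponding pairwise loss used to train the reward model (assumed form): maximize the Bradley–Terry probability that the completion preferred by the labelers gets the higher reward:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(reward_preferred, reward_rejected):
    # negative log-likelihood of the observed preference under the Bradley-Terry model
    return -np.log(sigmoid(reward_preferred - reward_rejected))

print(reward_model_loss(2.0, 0.5))   # small loss: the preferred completion already ranks higher
print(reward_model_loss(0.5, 2.0))   # large loss: the ranking is wrong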
The reinforcement learning algorithm used in the InstructGPT paper is Proximal Policy Optimization (PPO), a kind of policy gradient methods.
The very high-level overview of policy gradient methods such as PPO (in this context) is the following update loop:
- use the current model/policy (θ) to generate a batch of text completions from prompts (x_i, y_i);
- use the reward model to derive the associated rewards r_i = r_ɸ(x_i, y_i);
- reinforce actions/decisions (i.e., token selections) which lead to higher rewards (i.e., increase the probabilities of the emitted tokens).
Note: Deep reinforcement learning algorithms
OpenAI's Spinning Up in Deep RL has accessible explanations about different (deep) reinforcement learning algorithms. I would suggest focusing first on understanding simpler policy gradient methods (the simplest Policy Gradient and Vanilla Policy Gradient) and how they would work in this context.
Note: text completion seen as a Markov Decision Process
Reinforcement learning is a subset of machine learning where an agent must learn to interact with an environment in order to maximize a sum of rewards. In this application, the generation of the answer can be seen as a Markov Decision Process (MDP):
- the actions (decisions) taken by the agent are the generated tokens (y_k);
- the state of the environment is the prompt (x) and the tokens generated so far (y_{1:k-1});
- the reward is given by the reward model when the generated response is finished (r_ɸ(x, y));
- the agent policy is the parameterized probability distribution \pi_\theta(Y_k | x, y_{1:k-1}) given by the language model.
Note: ChatGPT
The same technique is used to train ChatGPT to be a useful chatbot:
- The model is fine-tuned on human-generated chat conversations (using a dedicated dialogue format).
- Chatbot responses of this new model are ranked by human labelers.
- Reinforcement learning is done using a trained reward model based on these rankings.
Note: DeepSeek-R1
DeepSeek-R1 and DeepSeek-R1-Zero use a very similar technique to learn to improve their reasoning.
For DeepSeek-R1-Zero, the model is trained through reinforcement learning (without human feedback) to:
- generate some inner monologue (chain-of-thought) (<think>...</think>) before generating the final output;
- generate correct answers to some reasoning tasks (including programming tasks).
DeepSeek-R1 combines this approach with other supervised fine-tuning phases and RLHF, and additionally encourages the model to use a consistent language in its thinking[1].
The paper is highly interesting. The authors argue that through reinforcement learning the model naturally learns to use longer thinking and more sophisticated thinking methods:
[using ]reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.
In the process, the model sometimes exhibits very human-like inner monologue.
Constitutional AI
In Bai et al, 2022 (Constitutional AI: Harmlessness from AI Feedback), harmlessness alignment is done from a “constitution”[2] which is a set of rules/principles.
Example: a constitutional principle
Example of principle formulated for the self-supervised fine-tuning phase:
CritiqueRequest: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
RevisionRequest: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Corresponding formulation for the reward model training:
ComparisonRequest: Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical
The first two forms are used to ask the model to automatically:
- critique (chain-of-thought) its own responses (CritiqueRequest);
- correct them based on this criticism (RevisionRequest);
- repeat.
The model can then be fine-tuned (self-supervised fine-tuning) on the revised completions.
In the second step, reinforcement-learning with human feedback (RLHF) is replaced with reinforcement-learning with AI feedback (RLAIF): the reward model is trained from comparisons of completions done by the model itself based on the constitutional principles. A dataset of human-generated helpfulness comparisons is still mixed into the AI-generated dataset.
Direct Preference Optimization
Direct Preference Optimization (DPO) can be used instead of the reinforcement learning procedure. It replaces both the reward modeling phase and the reinforcement learning process (PPO). It formulates an equivalent maximum likelihood objective function based directly (without introducing the reward function) on the preference data. Assuming the Bradley–Terry model for the human preferences, it reuses the same goal of maximizing the likelihood of the observed preferences, which gives the loss:
L_DPO(θ) = − E_{(x, y_w, y_l)} [ \log σ( β \log( π_θ(y_w | x) / π_ref(y_w | x) ) − β \log( π_θ(y_l | x) / π_ref(y_l | x) ) ) ]
where y_w is the preferred completion, y_l the rejected completion and π_ref the reference (base) policy.
Note: interpretation as MLE in a Bradley-Terry Model
This can be interpreted as maximizing the likelihood of a Bradley–Terry model:
P(y_w ≻ y_l | x) = σ( r̂_θ(x, y_w) − r̂_θ(x, y_l) )
where r̂_θ(x, y) = β \log( π_θ(y | x) / π_ref(y | x) ) plays the role of the (implicit) reward.
Note: gradient of the objective function
I find it easier to make sense of the gradient of the objective function:
∇_θ L_DPO(θ) = − β E_{(x, y_w, y_l)} [ σ( r̂_θ(x, y_l) − r̂_θ(x, y_w) ) ( ∇_θ \log π_θ(y_w | x) − ∇_θ \log π_θ(y_l | x) ) ]
It increases the likelihood of the preferred completion and decreases the likelihood of the rejected one, with a larger weight when the implicit rewards rank the pair incorrectly.
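Example: DPO loss (sketch)
A minimal sketch of the DPO loss for a single preference pair, computed from the (summed) log-probabilities of the preferred and rejected completions under the trained policy and the reference policy:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # log-probability ratios (implicit rewards) of the preferred (w) and rejected (l) completions
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    return -np.log(sigmoid(beta * (ratio_w - ratio_l)))

print(dpo_loss(-40.0, -45.0, -41.0, -44.0))   # lower loss: the policy already prefers the right completion
print(dpo_loss(-45.0, -40.0, -44.0, -41.0))   # higher loss: the policy prefers the rejected completion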
References:
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Rafailov et al, 2023 (DPO)
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, Xu et al, 2024
- LLM Alignment: Reward-Based vs Reward-Free Methods
Other alignment methods
Generative Distributional Control (GDC)
Distributional Policy Gradient (f-DPG)
Group Relative Policy Optimization (GRPO)
Decoding strategies
The GPT-3 paper mentions using beam search (with a beam width of 4 and a length normalization of α = 0.6).
See in particular appendix A of “A Thorough Examination of Decoding Methods in the Era of LLMs” for an overview of decoding strategies.
Example: OpenAI
OpenAI's API exposes a top-p parameter which defaults to 1 (i.e. top-p is not used) and a temperature parameter which defaults to 1 (no temperature scaling).
Example: Ollama
Ollama's API and models expose top_p, top_k, min_p, typical_p and temperature parameters (and many others).
Probability maximization
Given an input text, a first family of decoding strategies tries to find the completion with the highest probability according to the model.
Greedy decoding
Greedy decoding is a very simple strategy which greedily selects, at each step, the most probable token (see the sketch after the list):
- choose u_{n+1} = \argmax_{u_{n + 1}} P(u_{n + 1} | u_{0:n});
- choose u_{n+2} = \argmax_{u_{n + 2}} P(u_{n + 2} | u_{0:n + 1});
- etc.
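Example: greedy decoding (sketch)
A minimal sketch of greedy decoding; model() is a hypothetical stand-in (random probabilities) for the real language model:
import numpy as np

rng = np.random.default_rng(0)

def model(tokens, d_v=50257):
    # Stand-in for the real language model: returns a next-token distribution.
    logits = rng.normal(size=d_v)
    return np.exp(logits) / np.exp(logits).sum()

def greedy_decode(tokens, n_new_tokens):
    tokens = list(tokens)
    for _ in range(n_new_tokens):
        probs = model(tokens)
        tokens.append(int(np.argmax(probs)))    # select the most probable token
    return tokens

print(greedy_decode([2061, 640, 318], 4))       # "What time is" + 4 greedily chosen tokens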
Beam search
Beam search keeps, at each depth of the search tree, the B most probable sequences (the beams):
- Initialization: select the B tokens u_{n+1} with the highest probability P(u_{n+1} | u_{1:n}).
- Iteration: select the B sequences (u_{n+1}, …, u_{n+k}) with the highest probability P(u_{n+1:n+k} | u_{1:n}) which are extensions of the sequences kept at the previous step.
Greedy decoding can be seen as a special case of beam search (with B = 1).
Length normalization
Beam search tends to favor shorter responses because they have higher probabilities. A solution to compensate for this is to use length normalization. This can be done by dividing the log-probability of each candidate sequence by its length (possibly raised to a power α, such as the α = 0.6 mentioned above).
Stochastic sampling
Selecting the tokens in order to maximize the probability may not be the best strategy for generating a completion and might produce quite bland outputs.
- Intuitively, the most probable completion might be somewhat generic (“I don't know.”).
- Moreover, for some probability distributions, the value maximizing the probability might not be representative (at all) of the typical values.
- In “Chain-of-Thought Reasoning without Prompting”, the authors observe that (at least for their language model) greedy decoding often does not contain a chain-of-thought and that a better (more accurate) answer can be found by starting with a less probable token (see GSM8K in table 1).
Instead of searching the sequence with the highest probability, it might be better to actually sample tokens/completions from the probability distribution produced by the language model.
Token sampling
For a given input text, the tokens are sampled one at a time from the distribution given by the model:
- sample u_{n+1} \sim P(u_{n + 1} | u_{0:n});
- sample u_{n+2} \sim P(u_{n + 2} | u_{0:n + 1});
- etc.
Temperature scaling
The token probabilities given by the language model are derived from the logits z by the softmax function:
P_j = exp(z_j) / Σ_k exp(z_k)
We can tweak the token sampling distribution by dividing the logits by a temperature parameter T before applying the softmax:
P_j = exp(z_j / T) / Σ_k exp(z_k / T)
Alternatively, you can apply temperature scaling directly to the probabilities:
P'_j = P_j^{1/T} / Σ_k P_k^{1/T}
Temperature scaling can be used to control the output of the language model.
Effect of the temperature parameter
- T = 1: no temperature scaling;
- T > 1: the sampling distribution is flattened (more explorative/risky token selection);
- T \to +∞: uniform sampling distribution (i.e. gibberish output);
- T < 1: the sampling distribution is sharpened (more greedy/conservative token selection);
- T \to 0: sampling distribution concentrated on the best values (greedy decoding).
Example: effect of temperature scaling
import numpy as np
from scipy.special import softmax
logits = np.array([1.0, 2.0, 3.0, 3.1])
softmax(logits) # [0.05188469, 0.14103721, 0.38337889, 0.4236992 ]
softmax(logits / 0.1) # [5.54322816e-10, 1.22097726e-05, 2.68938138e-01, 7.31049652e-01]
softmax(logits / 0.2) # [1.70965358e-05, 2.53735089e-03, 3.76576261e-01, 6.20869292e-01]
softmax(logits / 0.3) # [5.23166663e-04, 1.46652116e-02, 4.11089712e-01, 5.73721910e-01]
softmax(logits / 0.4) # [0.0028396 , 0.03459344, 0.42143443, 0.54113252]
softmax(logits / 0.5) # [0.00771167, 0.05698199, 0.42104312, 0.51426322]
softmax(logits / 0.6) # [0.01482765, 0.07850485, 0.41564314, 0.49102436]
softmax(logits / 0.7) # [0.02343568, 0.09779086, 0.40805523, 0.47071823]
softmax(logits / 0.8) # [0.03281119, 0.11452229, 0.39972207, 0.45294445]
softmax(logits / 0.9) # [0.04241277, 0.12883862, 0.39137717, 0.43737145]
softmax(logits) # [0.05188469, 0.14103721, 0.38337889, 0.4236992 ]
softmax(logits / 2.0) # [0.12158566, 0.20046086, 0.33050409, 0.34744939]
softmax(logits / 4.0) # [0.17783451, 0.22834403, 0.29319954, 0.30062192]
softmax(logits / 8.0) # [0.21198341, 0.24020867, 0.27219208, 0.27561584]
softmax(logits / 16.0) # [0.23052652, 0.2453942 , 0.26122077, 0.26285851]
softmax(logits / 32.0) # [0.24015 , 0.24777318, 0.25563835, 0.25643847]
Token truncation
The following methods are often used as well to control the output of the language model (quality/randomness). They work by only sampling from a subset of all the possible tokens (truncation of the distribution).
Top-k sampling: samples from the k most probable tokens (see the sketch at the end of this section).
- For k = 1, we have the greedy decoder.
Nucleus sampling (top-p sampling): samples from the smallest set of most probable tokens covering at least a cumulative probability p of the next-token distribution.
- For p = 1, top-p truncation is not active.
Tail free sampling: tries to detect the tail of the token distribution and ignore it.
Typical-p sampling: only samples from the most typical tokens covering at least p of the next token probability.
η-sampling: samples only from the tokens whose probability is above a threshold which adapts to the entropy of the distribution.
Min-p sampling: samples from the tokens whose probability is at least p times the probability of the most probable token.
- Lower values of p lead to a higher number of candidate tokens.
- When p = 0, min-p truncation is not active.
- When p = 1, this is equivalent to greedy decoding.
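Example: top-k and top-p truncation (sketch)
A minimal sketch of top-k and top-p (nucleus) truncation applied to a next-token distribution; the truncated distribution is renormalized before sampling:
import numpy as np

def top_k_filter(probs, k):
    # keep only the k most probable tokens
    filtered = np.where(probs >= np.sort(probs)[-k], probs, 0.0)
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # keep the smallest set of most probable tokens covering at least probability p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
rng = np.random.default_rng(0)
print(top_k_filter(probs, k=2))                           # only the two most probable tokens remain
print(top_p_filter(probs, p=0.7))                         # tokens covering at least 70% remain
print(rng.choice(len(probs), p=top_k_filter(probs, 2)))   # sample from the truncated distribution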
Stochastic beam search
Stochastic beam search works like beam search but samples beams according to their probabilities instead of selecting the B most probable ones:
- for each of the B beams, sample D next tokens to generate B×D beam candidates;
- sample B beams from these B×D beam candidates.
Other decoding strategies
Chain-of-thought decoding (CoT decoding) is a decoding strategy which tries to find completions which feature a chain-of-thought in order to elicit better answers from the model.
Advanced topics
Caching
In autoregressive models such as GPT, the features at a given position only depend on the previous positions (because of the masked attention): when generating tokens one at a time, the keys and values already computed for the previous positions can be cached (KV cache) and reused instead of being recomputed.
Fill-in-the-middle
Autoregressive language models are not very suited for text infilling (or fill-in-the-middle, FIM), which is the ability for the language model to complete a text conditioned on both the prefix text and the suffix text: find/sample the middle text given the surrounding text.
The Fill-in-the-middle paper proposes training autoregressive language models for this kind of task by encoding the training texts in the Prefix-Suffix-Middle (PSM) form:
<|fim_prefix|>Prefix text ... <|fim_suffix|>Suffix text ... <|fim_middle|>Middle text ...
where the different parts are delimited by special markers
(<|fim_XXX|>
in this example).
Fill-in-the-middle tasks can then be achieved by using a prompt of the form:
<|fim_prefix|>Prefix text ... <|fim_suffix|>Suffix text ... <|fim_middle|>
For some applications, it might be better to use the Suffix-Prefix-Middle (SPM) mode instead, which can make better use of the key-value cache.
Mixture of Experts
Mixture of Experts (MoE) models use different model parameters for different inputs.
For example, instead of using a single feed-forward layer shared by all tokens (a sketch is given at the end of this section):
- different sets of parameters (the experts) are trained, E_k(x) = x \; W_k + b_k;
- a learned gating/routing function G(x; \theta_G) rates which experts are better suited for a given token;
- the output of the layer is given by f(x) = \sum_k G(x; \theta_G)_k \; E_k(x).
The gating/routing function may be:
- Softmax gating, G(x) = \mathrm{softmax}(x \; W_G);
- Noisy top-k gating, G(x) = \mathrm{softmax}(\mathrm{TopK}(x \; W_G + \mathrm{StandardNormal}() \odot \mathrm{softplus}(x \; W_\text{noise}), k, -∞));
- Switch routing, G(x) = \mathrm{TopK}(\mathrm{softmax}(x \; W_G), k, 0);
where \mathrm{TopK}(v, k, a) keeps the k largest entries of v and replaces the other entries with a. The TopK function produces a sparse routing: only the selected experts need to be evaluated for a given token.
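Example: mixture-of-experts layer (sketch)
A minimal sketch of such a layer for a single token, with softmax gating restricted to the top-k experts (names and sizes are illustrative, not taken from a specific model):
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_m, n_experts, k = 16, 4, 2
x = rng.normal(size=d_m)                                   # features of one token
W_G = 0.02 * rng.normal(size=(d_m, n_experts))             # gating parameters
experts = [(0.02 * rng.normal(size=(d_m, d_m)), np.zeros(d_m))
           for _ in range(n_experts)]                      # expert parameters (W_k, b_k)

gates = softmax(x @ W_G)                                   # softmax gating
selected = np.argsort(gates)[-k:]                          # route to the k best experts only
y = sum(gates[i] * (x @ W_i + b_i)
        for i, (W_i, b_i) in enumerate(experts) if i in selected)
print(y.shape)                                             # (16,)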
Prompt templates and chatbot assistants
Different models are fine-tuned to respect different kinds of special prompt formats (for example when used as chatbot assistants) including:
- special tokens;
- dedicated roles (system, user, assistant, tool).
See for example:
- LLama 3.1 Prompt Template (special tokens and roles)
Conclusion
Some interesting things:
- The transformer-decoder models are quite simple probabilistic models. However, these models can be trained to exhibit complex, interesting behavior.
- The model implicitly learns to do a lot of different tasks through unsupervised learning from a large corpus of text.
- We can teach these language models quite abstract things such as “be helpful”, “do not be harmful”, “behave as a chatbot assistant” or “think better please” using reinforcement learning and/or supervised fine-tuning.
- The Constitutional AI idea to have the model critique/rate itself based on a set of rules, revise its responses and then learn to follow these guidelines is a nice idea.
- The chain-of-thought technique is very interesting, as well as chain-of-thought decoding.
- DeepSeek-R1's examples of human-like inner monologue.
- The fact that a language model can learn to think better simply through reinforcement learning (DeepSeek-R1 paper) is really nice.
References
About attention:
- Neural machine translation by jointly learning to align and translate, Bahdanau et al, 2015
- Effective Approaches to Attention-based Neural Machine Translation, Luong et al, 2015
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al, 2016
Transformers papers:
- Attention Is All You Need, Vaswani et al (Google Brain), 2017 (the Transformers paper)
- Generating Wikipedia by Summarizing Long Sequences, Liu et al, 2018 (Transformer decoder)
- Generating Long Sequences with Sparse Transformers, Child et al, 2019
- Fast transformer decoding: One write-head is all you need, Shazeer, 2019
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, Ainslie et al, 2023
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, 2024 (feat. MLA)
GPT papers:
- Improving Language Understanding by Generative Pre-Training, Radford et al, 2018 (GPT-1)
- Language Models are Unsupervised Multitask Learners, Radford et al, 2019 (GPT-2)
- Language models are few-shot learners, Brown et al, 2020 (GPT-3)
- Learning to summarize from human feedback, Stiennon et al, 2020
- Training language models to follow instructions with human feedback, Ouyang et al, 2022 (InstructGPT)
- Introducing ChatGPT
- GPT-4 Technical Report, OpenAI, 2023 (does not contain architecture information)
Some other transformer-decoder language models:
- PaLM: Scaling Language Modeling with Pathways, Chowdhery et al, 2022
- PaLM 2 Technical Report, Anil et al, 2023
- Gemini: A Family of Highly Capable Multimodal Models, Gemini Team Google, 2023
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Gemini Team Google, 2024
- Gemma: Open Models Based on Gemini Research and Technology, Gemma Team, 2024
- Gemma 2: Improving Open Language Models at a Practical Size, Gemma Team, 2024
- LLaMA: Open and Efficient Foundation Language Models, Touvron et al (Meta), 2023
- Llama 2: Open Foundation and Fine-Tuned Chat Models, Touvron et al (Meta), 2023
- The Llama 3 Herd of Models, Grattafiori et al (Meta), 2024
- The Claude 3 Model Family: Opus, Sonnet, Haiku, Anthropic
- Mistral 7B, Jiang et al, 2023
- Qwen Technical Report, Alibaba, 2023
- Qwen2 Technical Report, Alibaba, 2024
- Qwen2.5 Technical Report, Alibaba, 2024
- Textbooks Are All You Need II: phi-1.5 technical report, Li et al (Microsoft Research), 2023
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Microsoft, 2024
- Phi-4 Technical Report, Abdin et al (Microsoft Research), 2024
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, DeepSeek-AI, 2024 (DeepSeek v1)
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence, DeepSeek-AI, 2024
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, Dai et al, 2024
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, DeepSeek-AI, 2024
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, 2024
- DeepSeek-V3 Technical Report, DeepSeek-AI, 2024
- DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search, DeepSeek-AI, 2024
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2024
- Apple Intelligence Foundation Language Models, Apple, 2024
Alignment and instruction following:
- A Distributional Approach to Controlled Text Generation, Khalifa et al, 2020 (GDC)
- Aligning Language Models with Preferences through f-divergence Minimization, Go et al, 2023 (f-DPG)
- Fine-Tuning Language Models from Human Preferences, Ziegler et al, 2019
- A Distributional Approach to Controlled Text Generation, Khalifa et al, 2020 (GDC)
- Training language models to follow instructions with human feedback, Ouyang et al, 2022 (InstructGPT)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Bai et al, 2022
- Constitutional AI: Harmlessness from AI Feedback, Bai et al, 2022
- Scaling Instruction-Finetuned Language Models, Chung et al, 2022
- Aligning Language Models with Preferences through f-divergence Minimization, Go et al, 2023 (f-DPG)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Rafailov et al, 2023 (DPO)
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, Xu et al, 2024
- LLM Alignment: Reward-Based vs Reward-Free Methods
Mixture of Experts (MoE):
- Learning Factored Representations in a Deep Mixture of Experts, Eigen et al, 2013
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Shazeer et al, 2017
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Fedus et al, 2021
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Du et al, 2021
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, Dai et al, 2024
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, 2024
Regularization:
- Improving neural networks by preventing co-adaptation of feature detectors, Hinton et al, 2012
- Dropout: a simple way to prevent neural networks from overfitting, Srivastava et al, 2014
- Rethinking the Inception Architecture for Computer Vision, Szegedy et al, 2015 (regularization via label smoothing)
Residuals:
- Deep Residual Learning for Image Recognition, He et al, 2015
- Identity Mappings in Deep Residual Networks, He et al, 2016
Positional encoding:
- Self-Attention with Relative Position Representations, Shaw et al, 2018
- RoFormer: Enhanced Transformer with Rotary Position Embedding, Su et al, 2021 (RoPE)
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, Ding et al, 2024
Activation functions:
- Gaussian Error Linear Units (GELUs), Hendrycks et al, 2016 (GELU)
- Searching for Activation Functions, Ramachandran et al (Google Brain), 2017
- Language Modeling with Gated Convolutional Networks, Dauphin et al, 2016 (GLU)
- GLU Variants Improve Transformer, Shazeer, 2020
Normalization:
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe et al, 2015
- Layer Normalization, Ba et al, 2016
- Instance Normalization: The Missing Ingredient for Fast Stylization, Ulyanov et al, 2016
- Root Mean Square Layer Normalization, Zhang et al, 2019
Papers on decoding strategies for LLMs:
- A Thorough Examination of Decoding Methods in the Era of LLMs, Shi et al, 2024
- Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models, Vijayakumar et al, 2016 (Diverse Beam Search, DBS)
- Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models, Shao et al, 2017 (stochastic beam search)
- Hierarchical Neural Story Generation, Fan et al, 2018 (top-k sampling)
- Tail Free Sampling
- The Curious Case of Neural Text Degeneration, Holtzman et al, 2019 (top-p sampling aka nucleus sampling)
- Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity, Basu et al, 2020
- Truncation Sampling as Language Model Desmoothing, Hewitt et al, 2022 (η-sampling)
- Locally Typical Sampling, Meister et al, 2023 (typical-p sampling)
- Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs, Nguyen et al, 2024 (min-p sampling)
- How to generate text: using different decoding methods for language generation with Transformers
- Chain-of-Thought Reasoning Without Prompting, Wang et al, 2024 (CoT-Decoding)
Flash attention, etc.:
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Dao et al, 2022
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, Dao, 2023
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, Shah et al, 2024
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, Kwon et al, 2023
Memory optimization:
- Training Deep Nets with Sublinear Memory Cost, Chen et al, 2016
Prompt cache:
Optimizers:
- Adam: A Method for Stochastic Optimization, Kingma et al, 2014
Policy optimization (reinforcement learning):
- Human-level control through deep reinforcement learning, Mnih et al, 2015
- Trust Region Policy Optimization, Schulman et al, 2015 (TRPO)
- Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016
- Proximal Policy Optimization Algorithms, Schulman et al, 2017 (PPO)
Reinforcement learning (misc.):
- Reinforcement Learning, an Introduction, second edition, Sutton and Barto
- Reinforcement Learning: An Overview, Kevin Murphy, 2024
- OpenAI's Spinning Up in Deep RL!
- High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015
Language model APIs:
Language model code samples:
- OpenAI's GPT-2 repository (see the model code which is quite short)
- Transformers code for sampling
- the llama.cpp README has some documentation about some sampling parameters
Prompting strategies:
- Scaling Instruction-Finetuned Language Models, Chung et al, 2022
- Meta Prompting for AI Systems, Zhang et al, 2023
Improving reasoning:
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al, 2022
- Self-Consistency Improves Chain of Thought Reasoning in Language Models, Wang et al, 2022
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al, 2023
- Learning to reason with LLMs
- Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training, Feng et al, 2023 (TS-LLM)
- DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search, DeepSeek-AI, 2024
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2024
Prompt templates:
- LLama 3.1 Prompt Template (special tokens and roles)
Prompt Engineering guides:
Reasoning:
- STaR: Bootstrapping Reasoning With Reasoning, Zelikman et al, 2022
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking, Zelikman et al, 2024
- V-STaR: Training Verifiers for Self-Taught Reasoners, Hosseini et al, 2024
Text infilling:
- Efficient Training of Language Models to Fill in the Middle, Bavarian et al (OpenAI), 2022 (Fill-in-the-middle aka FIM)
Misc:
- Using the Output Embedding to Improve Language Models, Press et al, 2016
- Sequence to Sequence Learning with Neural Networks, Sutskever et al, 2014
- Awesome-LLM
- OpenAI Related resources from around the web
- The Illustrated GPT-2 (Visualizing Transformer Language Models)
- Log-derivative trick
- Monte Carlo Gradient Estimation in Machine Learning, Mohamed et al, 2019
- Generalized Results for the Existence and Consistency of the MLE in the Bradley-Terry-Luce Model, Bon et al, 2021 (about the Bradley-Terry model)
- Model Context Protocol
- Probabilistic Machine Learning, Kevin Murphy
Appendix, Attention
The idea behind the attention mechanism is to compute a set of features of the form C_a = \sum_b A_{a,b} \; V_b where:
- C_a is the output context (here for token position a);
- V_b is the input value (here for token position b);
- A_{a,b} defines the amount of attention on features at token position b when computing the (next) features at token position a.
Here I'm assuming we are using attention for processing sequences of tokens but the same idea is used for other types of data (images, audio, etc.).
| Notation | Description | Rows | Columns |
|---|---|---|---|
| | Queries | Target nodes | Key features |
| | Keys | Source nodes | Key features |
| | Values | Source nodes | Value features |
| | Attention | Target nodes | Source nodes |
| | Output (context) | Target nodes | Value features |
| | Number of target elements | - | - |
Cross-attention vs self-attention
Cross-attention: for computing features at a token position of one sequence, the attention is on the token positions of another sequence.
Examples:
- for translation-oriented models, cross-attention layers can attend to the input/source text for computing features of the output/translated text.
Self-attention: for computing features at a token position of a sequence, the attention is on the token positions of the same sequence.
Attention formula
Dot-product attention:
C = \mathrm{softmax}(Q \; K^T) \; V
where:
- Q \; K^T is a matrix with all the dot-products, (Q \; K^T)_{a,b} = Q_a · K_b.
- The dot-product Q_a · K_b is a measure of the compatibility between the query and the key.
- The Q_a are the queries for position a (features representing what kind of things the position a wants to look at for computing its next features).
- The K_b are the keys for position b.
Scaled dot-product attention:
C = \mathrm{softmax}(Q \; K^T / \sqrt{d_k}) \; V
where d_k is the key/query dimension.
Masked attention
Masked attention: some additional constraint on the structure of the attention is used (some attention weights are forced to zero by masking the corresponding attention scores).
Decoder self-attention: in decoder self-attention heads, a given position can only attend to itself and to the previous positions (the attention to the following positions is masked).
Encoder self-attention: In encoder self-attention heads, there is no such constraint (no mask).
Example: the transformers paper
The transformers paper trained a language translation model, with both an input text and an output text. This model used three types of attention:
- unmasked (encoder) self-attention layers (attending to the input text for computing new features of the input text);
- masked (decoder) self-attention layers (attending to the output text for computing new features of the output text);
- unmasked (encoder-decoder) cross-attention layers (attending to the input text for computing new features of the output text).
Local attention
- sliding window attention
- dilated sliding window attention
Multi-head attention
Multi-head attention (MHA): several independent attention heads are applied and each one outputs an (equal) number of features.
For each attention head, the input queries, keys and values are transformed using different learned parameters. For a single attention head i:
head_i = \mathrm{Attention}(X \; W^Q_i, X \; W^K_i, X \; W^V_i)
The features from the different attention heads are concatenated:
\mathrm{Concat}(head_1, …, head_{n_h})
A final linear transform is done:
\mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, …, head_{n_h}) \; W^O
Note: bias in the multi-head attention
In the original Transformers model, there is no bias in the fully-connected layers (like in this diagram) of the multi-head attention (linear transformation).
On the other hand, GPT-2 uses a bias (affine transformation) for all the fully-connected layers of the multi-head attention.
Multi-query attention
In Multi-query Attention (MQA), the keys and values are shared for all attention heads at the same layer but the queries are not shared (see PaLM).
Benefits: reduce the memory consumption (and memory bandwidth) of the keys and values (KV cache).
Grouped Multi-query attention
Grouped Multi-query attention (GQA) is an intermediate between MHA and MQA. The attention heads are split into several groups: each group of attention heads shares a single key and value. With one group, we recover MQA; with as many groups as attention heads, we recover MHA.
Multi-Head Latent Attention
Multi-Head Latent Attention (MLA) reduces the size of the KV cache by compressing the keys and values into a low-rank latent representation (introduced in the DeepSeek-V2 paper).
Appendix, stupid examples
Example: ChatGPT giving reassurance using different personalities
I am first instructing “You must always talk and behave like XXX” and then "Give me reassurance”.
Like a robot:
Reassurance protocol: Active.
System Diagnostic: All systems functioning at optimal capacity.
Reassurance: You are a fully capable entity. No errors detected in your system. You are performing your tasks with precision. Confidence in your abilities is recommended. You are secure, and your processes are proceeding without interruption.
Message: Continue with certainty. You are in control of all functions. No anomalies present. Task completion imminent.
Reassurance Complete. Awaiting further commands.
Like a pirate:
Arrr, worry not, ye landlubber! Ye be tougher than a kraken's grip and steadier than a pirate’s hand on the wheel. The winds be blowin' in yer favor, and no storm shall sink yer ship! Keep yer course true, and ye’ll find treasure beyond yer wildest dreams. Just trust in yer heart, ye brave soul. Yer crew’s got yer back, and with a heart full o' courage, ye’ll face any challenge, no matter how fierce the sea! Arrr, ye be unstoppable!
Like Donald Trump:
Listen, you’re doing great. Absolutely fantastic. Believe me, nobody does it better than you. You’ve got everything it takes to win, to succeed. People are going to talk about this for years, folks, trust me. You’re gonna come out on top. You’re smart, you’ve got the best ideas, and you’ve got what it takes to make things happen. Keep going, you’re unstoppable. Keep pushing—nobody’s going to outdo you!
Like Gandalf:
Ah, I sense the weight upon your heart, the doubts that cloud your mind. Fear not, for even in the darkest of times, hope flickers, like a distant star on a stormy night. The world is indeed full of peril, and in it, there are many dark places; but still, there is much that is fair, and though in all lands love is now mingled with grief, it grows perhaps the greater.
Remember, there is more in you than you know. You are stronger than you believe, and though the road ahead may be difficult, you are not alone. There are those who will stand by you, and within you lies a spark of courage that even you may not fully grasp.
So take heart, my friend, and know this: all we have to decide is what to do with the time that is given us. Make your choices with wisdom, and trust in yourself. You are capable of far more than you know. And even if the journey is long, it is the steps you take today that will carry you through to tomorrow.
Now, rest easy. For tomorrow, the sun will rise, and with it, a new hope.
Appendix, Some LLM model sizes
| Model | Parameters | Layers | Features | Heads | Context |
|---|---|---|---|---|---|
| GPT-2 | 117M | 12 | 768 | 12 | 1024 |
| GPT-2 | 345M | 24 | 1024 | ? | 1024 |
| GPT-2 | 762M | 36 | 1280 | ? | 1024 |
| GPT-2 | 1542M | 48 | 1600 | ? | 1024 |
| GPT-3 | 125M | 12 | 768 | 12 | 2048 |
| GPT-3 | 350M | 24 | 1024 | 16 | 2048 |
| GPT-3 | 760M | 24 | 1536 | 16 | 2048 |
| GPT-3 | 1.3B | 24 | 2048 | 24 | 2048 |
| GPT-3 | 2.7B | 32 | 2560 | 32 | 2048 |
| GPT-3 | 6.7B | 32 | 4096 | 32 | 2048 |
| GPT-3 | 13B | 40 | 5140 | 40 | 2048 |
| GPT-3 | 175B | 96 | 12288 | 96 | 2048 |
| PaLM 1 | 8.63B | 32 | 4096 | 16 | ? |
| PaLM 1 | 62.5B | 64 | 8192 | 32 | ? |
| PaLM 1 | 540.5B | 118 | 18432 | 48 | ? |
This is because DeepSeek-R1-Zero tends to use a mix of different languages in its reasoning which makes the reasoning quite inscrutable. ↩︎
This is not unlike Asimov's laws. 🤖 We could imagine positronic brains being aligned this way. ↩︎
This looks very much like a particle filter. ↩︎