Neural Network Distillation
Overview of neural network distillation as done in “Distilling the Knowledge in a Neural Network” (Hinton et al, 2014).
Overview
What?
Transferring knowledge from one classifier neural network (the teacher) to another (the student).
Teacher model
- already trained;
- usually larger than the student;
- may be an ensemble of models.
We don't need access to the architecture or parameters of the teacher model. We only need to be able to query it for logits or probabilities (e.g. through an API[1]).
Student model
- usually a smaller model.
Both models generate discrete probability distributions (classification tasks, stochastic agent actions):
- Logits vector: $z = f(x)$
- Probability vector: $p = \mathrm{softmax}(z) = \mathrm{softmax}(f(x))$
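As a minimal sketch (plain NumPy; the function and variable names are mine), a logits vector can be turned into a probability vector, optionally with a temperature:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Convert a logits vector z into a probability vector at temperature T."""
    z = (z - z.max()) / T        # subtracting the max is only for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, -1.0])   # z = f(x), e.g. a 3-class classifier
print(softmax(logits))                 # ≈ [0.950, 0.047, 0.002]: peaked on class 0
```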
Primary objective
We want to train the student model to match the probability distributions of the teacher model.
Distillation is obtained by fitting the student model to minimize the cross-entropy loss between the soft targets and the student predictions, both computed with temperature scaling (at high temperature $T > 1$):
$$\mathcal{L}_{\mathrm{distill}} = H\big(q^{(T)}, p^{(T)}\big) = -\sum_i q_i^{(T)} \log p_i^{(T)},$$
where
- $q^{(T)} = \mathrm{softmax}(z_{\mathrm{teacher}} / T)$ are the soft targets produced by the teacher,
- $p^{(T)} = \mathrm{softmax}(z_{\mathrm{student}} / T)$ are the student predictions.
The cross-entropy between two discrete distributions $q$ and $p$ is $H(q, p) = -\sum_i q_i \log p_i$.
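A minimal PyTorch sketch of this loss (the function and variable names are mine, not from the paper): both sets of logits are softened with the same temperature, then the cross-entropy between the resulting distributions is averaged over the batch.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """Cross-entropy between teacher soft targets and student predictions at temperature T."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)       # q^(T), teacher is frozen
    log_student  = F.log_softmax(student_logits / T, dim=-1)   # log p^(T)
    # H(q, p) = -sum_i q_i log p_i, averaged over the batch
    return -(soft_targets * log_student).sum(dim=-1).mean()
```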
Secondary objective
We can at the same time train the student model to predict the ground-truth labels (hard targets), using the standard cross-entropy at temperature $T = 1$:
$$\mathcal{L}_{\mathrm{hard}} = H\big(y, p^{(1)}\big) = -\sum_i y_i \log p_i^{(1)}.$$
Which yields the combined objective function:
$$\mathcal{L} = \alpha \, T^2 \, \mathcal{L}_{\mathrm{distill}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{hard}},$$
with a weight $\alpha$ close to 1 (the paper recommends a considerably lower weight on the hard-target term) and a $T^2$ factor that keeps the soft-target gradients at a comparable magnitude when the temperature changes.
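A hedged PyTorch sketch of one training step with this combined objective (the `alpha` and `T` values are illustrative, and all names are mine):

```python
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.9):
    """alpha * T^2 * soft cross-entropy + (1 - alpha) * hard cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = -(soft_targets * log_student).sum(dim=-1).mean()
    hard_loss = F.cross_entropy(student_logits, labels)   # standard loss at T = 1
    # T^2 keeps the soft-term gradients comparable in magnitude when T changes
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss

# Usage sketch inside a training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = combined_loss(student(x), teacher_logits, y)
# loss.backward(); optimizer.step()
```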
Explanations
Soft targets
The student model is trained to match the softened predictions of the teacher model (the soft targets): $q^{(T)} = \mathrm{softmax}(z_{\mathrm{teacher}} / T)$.
In particular, the low probability values of the soft targets contain valuable information about the function learned by the teacher model. For example (from the paper), an image of a BMW has only a tiny probability of being mistaken for a garbage truck, but that probability is still many times larger than the probability of mistaking it for a carrot.
Temperature scaling
Why use temperature scaling?
The low probability values of the soft targets contain important information. However, they tend to be disregarded by the cross-entropy loss (because values close to zero contribute very little to the loss and its gradients). Raising the temperature produces a softer distribution in which these small values carry more weight.
Which temperature are we talking about?
The different experiments in the paper mention temperatures roughly between 1 and 20, with the best value depending on the size of the student (smaller students tend to favour lower temperatures).
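A quick numeric illustration (NumPy, with made-up logits): raising the temperature lifts the near-zero probabilities so they actually influence the loss.

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

z = np.array([9.0, 5.0, 1.0])   # teacher logits for 3 classes
print(softmax_T(z, 1.0))         # ≈ [0.982, 0.018, 0.0003]: small values barely register
print(softmax_T(z, 4.0))         # ≈ [0.665, 0.245, 0.090]: much softer distribution
```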
Objective mixing
The paper mentions using a weighted average of the two objectives, with a considerably lower weight on the hard-target (ground-truth) term, and multiplying the soft-target term by $T^2$ so that its gradients keep a comparable magnitude when the temperature changes.
Other distillation methods
See the references for other distillation approaches.
For example, the DeepSeek-R1 paper distils by fine-tuning existing transformer-decoder language models (Qwen, Llama) on outputs generated by the DeepSeek-R1 model, i.e. distillation through generated samples rather than through logits.
References
General:
- Distilling the Knowledge in a Neural Network, Hinton et al (Google), 2014
- FitNets: Hints for Thin Deep Nets, Romero et al, 2014
For text:
- Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding, Liu et al, 2019
- Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System, Yang et al, 2020 (TMKD)
- Learning by Distilling Context, Snell et al, 2022
- DeepSeek-V3 Technical Report, DeepSeek-AI, 2024
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek-AI, 2025
Image classification:
- Self-training with Noisy Student improves ImageNet classification, Xie et al, 2020
In Diffusion models:
- Progressive Distillation for Fast Sampling of Diffusion Models, Salimans et al (Google), 2022
- On Distillation of Guided Diffusion Models, Meng et al, 2022
- BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion, Kim et al, 2023
- Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny, Yatharth Gupta, 2023
You wouldn't steal a car, a handbag, a television, a baby or a helmet. You wouldn't steal logits, right? ↩︎