vLLM quickstart
How to quickly use vLLM for LLM inference on a CPU.
vLLM is a framework for LLM serving (inference). Some highlights:
- Python API;
- OpenAI-compatible API server;
- automatically downloads models from HuggingFace;
- support for hardware acceleration using several accelerators;
- can expose the output token probabilities (logprobs);
- LoRA support;
- tool calling;
- structured output using XGrammar or llguidance, based on a list of choices, a regex, a JSON schema or a Pydantic model (see the sketch after this list);
- beam-search;
- decoding options (temperature, top_k, top_p, min_p, etc.);
- support for multi-modal models;
- basic authentication support (single API key);
- only one model per server instance at the moment.
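To make a couple of these items concrete, here is a sketch of a request combining decoding options with choice-based structured output, aimed at the OpenAI-compatible server set up later in this post. As far as I know, guided_choice is the vLLM-specific extension used for this; treat the exact field name as an assumption:
# guided_choice (assumed vLLM extension) restricts the completion to one of the listed strings:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "Is the sky blue? Answer yes or no:",
"max_tokens": 5,
"temperature": 0.0,
"top_p": 1.0,
"guided_choice": ["yes", "no"]
}' | jq -r .choices[0].text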
Some problems I found while trying to do a quick test on CPU:
- the pre-built Python wheels only work with CUDA (NVIDIA);
- the pre-built containers images only work with CUDA (NVIDIA) and ROCm (AMD);
- there is a dedicated repository which contains (Intel) CPU container images;
- compiling the project directly on the host system failed because of a cmake version mismatch.
Table of contents
- Build the container image
- Using a pre-trained model
- Exploiting the output token probabilities
- Using an instruction following model
Build the container image
The project currently does not provide pre-built container images for CPU; you have to build one yourself:
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.8.5
podman build -f docker/Dockerfile.cpu --tag vllm-cpu --target vllm-openai .
Pre-built images are only provided for CUDA (NVIDIA) and ROCm (AMD).
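If you want a quick sanity check, the freshly built image should now show up locally under the tag passed to podman build:
# The vllm-cpu tag matches the --tag argument used above
podman image ls localhost/vllm-cpu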
Using a pre-trained model
vLLM can automatically download models from Hugging Face. Many models require you to accept a license; in that case, you must:
- create or have a Hugging Face account;
- accept the license of the model on Hugging Face;
- create an API token on the Hugging Face administration console;
- provide this API token (e.g. through the HF_TOKEN environment variable).
HF_TOKEN=...
export HF_TOKEN
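Before starting the server, you can check that the token is picked up by asking the Hugging Face CLI who you are; as far as I know, recent versions of huggingface_hub read the HF_TOKEN environment variable automatically (this assumes the CLI is installed on the host):
# Should print the account associated with HF_TOKEN
huggingface-cli whoami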
Note: provision the Hugging Face API token through a file
If you do not want to pass the Hugging Face token through environment variables, you can store it in ~/.cache/huggingface/token. When using containers, you could use something like:
-v $(pwd)/token:/root/.cache/huggingface/token:ro
Note: download the model from the host system
If you do not want to expose your API tokens to the server, you can download the models from the host system:
huggingface-cli download mistralai/Mistral-7B-v0.1
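You can then inspect what ended up in the local cache; the scan-cache subcommand should be available in recent versions of huggingface_hub:
# Lists the cached repositories and their size on disk
huggingface-cli scan-cache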
You can now start an OpenAI-compatible server serving a given model:
# Hugging Face models are downloaded here:
mkdir -p ~/.cache/huggingface
podman run --rm --shm-size=4g -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface -e HF_TOKEN \
localhost/vllm-cpu \
--model=mistralai/Mistral-7B-v0.1 --dtype=bfloat16
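Before sending completions, you can check that the server is up and see which model it serves; /v1/models is part of the OpenAI-compatible API, and /health is, as far as I know, a vLLM addition:
# Returns HTTP 200 once the server is ready
curl http://127.0.0.1:8000/health
# Lists the model(s) served by this instance
curl http://127.0.0.1:8000/v1/models | jq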
Once the server has downloaded and loaded the model, we can use the completion API:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "Hello, how",
"max_tokens": 1,
"logprobs": 20
}' | jq
{
"id": "cmpl-e3d3413caa534d6f8b4a751a79a180c6",
"object": "text_completion",
"created": 1747088026,
"model": "mistralai/Mistral-7B-v0.1",
"choices": [
{
"index": 0,
"text": " are",
"logprobs": {
"text_offset": [
0
],
"token_logprobs": [
-0.3279009759426117
],
"tokens": [
" are"
],
"top_logprobs": [
{
" are": -0.3279009759426117,
"’": -2.7654008865356445,
" is": -3.0154008865356445,
" can": -3.3279008865356445,
" was": -3.7029008865356445,
" have": -3.9529008865356445,
" do": -4.3904008865356445,
" has": -4.8279008865356445,
"'": -4.8279008865356445,
" to": -5.2029008865356445,
" you": -5.2654008865356445,
" did": -5.3904008865356445,
" many": -5.5779008865356445,
" about": -5.7029008865356445,
" long": -5.9529008865356445,
" nice": -6.0154008865356445,
" goes": -6.0154008865356445,
" much": -6.1404008865356445,
" the": -6.1404008865356445,
"dy": -6.1404008865356445
}
]
},
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 4,
"total_tokens": 5,
"completion_tokens": 1,
"prompt_tokens_details": null
}
}
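The interesting fields can be extracted with jq; for example, the generated token and its probability (exp of its logprob), using the same curl command piped through:
# Prints the first generated token and its probability
jq -r '.choices[0].logprobs | "\(.tokens[0]) \(.token_logprobs[0] | exp)"'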
Warning: pre-trained model
This is a pre-trained (base) model: it is not trained to behave as a nice and helpful chatbot 😊 and might behave unexpectedly if you try to use it as one:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '
{
"prompt": "What is the color of the sky?\n\nAgent:",
"max_tokens": 25,
"temperature": 1.2
}' | jq .choices[0].text
" Fuck! The wife's screwing Ryan Gosling! I hate good-neighbor-looking mothers!)* &"
🫢
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "User: What is the color of the sky?\n\nAssistant:",
"max_tokens": 50
}' | jq .choices[0].text
" Related, but unconditional, to the black nothingness that is hue.\n\n
User: Which drug do you like best?\n
User: LSD remains my favorite as it allows one to directly experience the altered state. Side effects"
🤯
Exploiting the output token probabilities
The logprobs parameter can be used to estimate the confidence of the model:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "Question: Which of these is NOT a Pixar film?\nA: Wall-E\nB: Cloudy With a Chance of Meatballs\nC: Coco\nD: a Bug s Life\n\nAnswer:",
"max_tokens": 1,
"logprobs": 10
}' | jq .choices[0].logprobs.top_logprobs[0]
{
" B": -1.1635770797729492,
" Cloud": -1.5385770797729492,
" C": -1.9760770797729492,
"\n": -2.101077079772949,
" Wall": -3.413577079772949,
" A": -3.663577079772949,
" D": -3.851077079772949,
" The": -4.101077079772949,
" a": -4.976077079772949,
" (": -5.038577079772949
}
We can compute the probabilities (exp(logprob)) with:
jq -r 'to_entries|sort_by(.value)|reverse|map((.value|exp|tostring) + " " + (.key|tojson))|.[]'
0.3123668190227707 " B"
0.2146863657643902 " Cloud"
0.138611935699938 " C"
0.12232460391645038 "\n"
0.032923220503856244 " Wall"
0.025640629909635775 " A"
0.021256828803575347 " D"
0.016554834917839274 " The"
0.006901081919294773 " a"
0.0064829665025314025 " ("
If we only consider the acceptable answers (A, B, C and D) and rescale, this gives:
62.73% " B"
27.84% " C"
05.15% " A"
04.26% " D"
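This rescaling can be reproduced with jq, keeping only the answer tokens, converting the logprobs to probabilities and normalizing them (applied to the same top_logprobs object as above; the percentages are printed unrounded):
# Keep only the A/B/C/D tokens, convert logprobs to probabilities, normalize to percentages
jq -r '[to_entries[] | select(.key | IN(" A", " B", " C", " D")) | {key, p: (.value | exp)}] as $answers
| ($answers | map(.p) | add) as $total
| $answers | sort_by(-.p)[]
| "\(100 * .p / $total) \(.key | tojson)"'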
The confidence is much higher when using the instruction-following variant.
Using an instruction following model
For chat completion, we can serve an instruction-following model instead:
podman run --rm --shm-size=4g -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
localhost/vllm-cpu \
--model=mistralai/Mistral-7B-Instruct-v0.3 --dtype=bfloat16
We can now use the Chat completion API:
curl http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
"messages":[
{"role":"system","content":"You are a helpful assistant"},
{"role":"user","content":"What is the color of the sky?"}
]
}' | jq -r .choices[0].message.content
The color of the sky can vary depending on the time of day, weather, and location, but under a clear blue sky, it typically appears blue due to a process called Rayleigh scattering. During sunrise and sunset, the sky may take on shades of red, pink, and orange. At night, the sky is generally a dark expanse of black, dotted with stars.
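To check the earlier claim about confidence, the same multiple-choice question can be sent to the instruct model with log probabilities enabled; in the chat completion API this goes through the logprobs and top_logprobs fields, standard OpenAI parameters that vLLM implements as far as I can tell. A sketch, output not shown:
# Ask for the top 10 alternatives of the single generated token
curl http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
"messages":[
{"role":"user","content":"Which of these is NOT a Pixar film?\nA: Wall-E\nB: Cloudy With a Chance of Meatballs\nC: Coco\nD: a Bug s Life\n\nAnswer with a single letter."}
],
"max_tokens": 1,
"logprobs": true,
"top_logprobs": 10
}' | jq .choices[0].logprobs.content[0].top_logprobs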