vLLM quickstart
How to quickly use vLLM for LLM inference on a CPU.
vLLM is a framework for LLM serving (inference). Some highlights:
- Python API;
- OpenAI-compatible API server;
- automatically downloads models from HuggingFace;
- support for hardware acceleration using several accelerators;
- can expose the output token probabilities (logprobs);
- LoRA support;
- tool calling;
- structured output using XGrammar or llguidance, based on a list of choices, a regex, a JSON schema or a Pydantic model (see the sketch after this list);
- beam-search;
- decoding options (temperature, top_k, top_p, min_p, etc.);
- support for multi-modal models;
- basic authentication support (single API key);
- only one model per server instance at the moment.
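To make a couple of these items concrete, here is a sketch of a request combining decoding options with choice-based structured output, aimed at the OpenAI-compatible server set up later in this post. As far as I know, guided_choice is the vLLM-specific extension used for this; treat the exact field name as an assumption:
# guided_choice (assumed vLLM extension) restricts the completion to one of the listed strings:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "Is the sky blue? Answer yes or no:",
"max_tokens": 5,
"temperature": 0.0,
"top_p": 1.0,
"guided_choice": ["yes", "no"]
}' | jq -r .choices[0].text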
Some problems I found while trying to do a quick test on CPU:
- the pre-built Python wheels only work with CUDA (NVIDIA);
- the pre-built containers images only work with CUDA (NVIDIA) and ROCm (AMD);
- there is a dedicated repository which contains (Intel) CPU container images;
- compiling the project directly on the host system failed because of a cmake version mismatch.
Table of contents
- Build the container image
- Using a pre-trained model
- Exploiting the output token probabilities
- Using an instruction following model
Build the container image
The project currently does not provide pre-built container images for CPU; you have to build one yourself:
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.8.5
podman build -f docker/Dockerfile.cpu --tag vllm-cpu --target vllm-openai .
Pre-built images are only provided for CUDA (NVIDIA) and ROCm (AMD).
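If you want a quick sanity check, the freshly built image should now show up locally under the tag passed to podman build:
# The vllm-cpu tag matches the --tag argument used above
podman image ls localhost/vllm-cpu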
Using a pre-trained model
vLLM can automatically download models from Hugging Face. Many models require you to accept a license; in that case, you must:
- create or have a Hugging Face account;
- accept the license of the model on Hugging Face;
- create an API token on the Hugging Face administration console;
- provide this API token (e.g. through the HF_TOKEN environment variable).
HF_TOKEN=...
export HF_TOKEN
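Before starting the server, you can check that the token is picked up by asking the Hugging Face CLI who you are; as far as I know, recent versions of huggingface_hub read the HF_TOKEN environment variable automatically (this assumes the CLI is installed on the host):
# Should print the account associated with HF_TOKEN
huggingface-cli whoami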
Note: provision the Hugging Face API token through a file
If you do not want to pass the Hugging Face token through environment variables, you can store it in ~/.cache/huggingface/token. When using containers, you could use something like:
-v $(pwd)/token:/root/.cache/huggingface/token:ro
Note: download the model from the host system
If you do not want to expose your API tokens to the server, you can download the models from the host system:
huggingface-cli download mistralai/Mistral-7B-v0.1
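You can then inspect what ended up in the local cache; the scan-cache subcommand should be available in recent versions of huggingface_hub:
# Lists the cached repositories and their size on disk
huggingface-cli scan-cache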
You can now start an OpenAI-compatible server serving a given model:
# Hugging Face models are downloaded here:
mkdir -p ~/.cache/huggingface
podman run --rm --shm-size=4g -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface -e HF_TOKEN \
localhost/vllm-cpu \
--model=mistralai/Mistral-7B-v0.1 --dtype=bfloat16
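Before sending completions, you can check that the server is up and see which model it serves; /v1/models is part of the OpenAI-compatible API, and /health is, as far as I know, a vLLM addition:
# Returns HTTP 200 once the server is ready
curl http://127.0.0.1:8000/health
# Lists the model(s) served by this instance
curl http://127.0.0.1:8000/v1/models | jq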
Once the server has downloaded and loaded the model, we can use the completion API:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "Hello, how",
"max_tokens": 1,
"logprobs": 20
}' | jq
{
"id": "cmpl-e3d3413caa534d6f8b4a751a79a180c6",
"object": "text_completion",
"created": 1747088026,
"model": "mistralai/Mistral-7B-v0.1",
"choices": [
{
"index": 0,
"text": " are",
"logprobs": {
"text_offset": [
0
],
"token_logprobs": [
-0.3279009759426117
],
"tokens": [
" are"
],
"top_logprobs": [
{
" are": -0.3279009759426117,
"’": -2.7654008865356445,
" is": -3.0154008865356445,
" can": -3.3279008865356445,
" was": -3.7029008865356445,
" have": -3.9529008865356445,
" do": -4.3904008865356445,
" has": -4.8279008865356445,
"'": -4.8279008865356445,
" to": -5.2029008865356445,
" you": -5.2654008865356445,
" did": -5.3904008865356445,
" many": -5.5779008865356445,
" about": -5.7029008865356445,
" long": -5.9529008865356445,
" nice": -6.0154008865356445,
" goes": -6.0154008865356445,
" much": -6.1404008865356445,
" the": -6.1404008865356445,
"dy": -6.1404008865356445
}
]
},
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 4,
"total_tokens": 5,
"completion_tokens": 1,
"prompt_tokens_details": null
}
}
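The interesting fields can be extracted with jq; for example, the generated token and its probability (exp of its logprob), using the same curl command piped through:
# Prints the first generated token and its probability
jq -r '.choices[0].logprobs | "\(.tokens[0]) \(.token_logprobs[0] | exp)"'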
Warning: pre-trained model
This is a pre-trained (base) model: it is not trained to behave as a nice and helpful chatbot 😊 and might behave unexpectedly if you try to use it as one:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '
{
"prompt": "What is the color of the sky?\n\nAgent:",
"max_tokens": 25,
"temperature": 1.2
}' | jq .choices[0].text
" Fuck! The wife's screwing Ryan Gosling! I hate good-neighbor-looking mothers!)* &"
🫢
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "User: What is the color of the sky?\n\nAssistant:",
"max_tokens": 50
}' | jq .choices[0].text
" Related, but unconditional, to the black nothingness that is hue.\n\n
User: Which drug do you like best?\n
User: LSD remains my favorite as it allows one to directly experience the altered state. Side effects"
🤯
Exploiting the output token probabilities
The logprobs parameter can be used to estimate the confidence of the model:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt": "Question: Which of these is NOT a Pixar film?\nA: Wall-E\nB: Cloudy With a Chance of Meatballs\nC: Coco\nD: a Bug s Life\n\nAnswer:",
"max_tokens": 1,
"logprobs": 10
}' | jq .choices[0].logprobs.top_logprobs[0]
{
" B": -1.1635770797729492,
" Cloud": -1.5385770797729492,
" C": -1.9760770797729492,
"\n": -2.101077079772949,
" Wall": -3.413577079772949,
" A": -3.663577079772949,
" D": -3.851077079772949,
" The": -4.101077079772949,
" a": -4.976077079772949,
" (": -5.038577079772949
}
We can compute the probabilities (exp(logprob)) with:
jq -r 'to_entries|sort_by(.value)|reverse|map((.value|exp|tostring) + " " + (.key|tojson))|.[]'
0.3123668190227707 " B"
0.2146863657643902 " Cloud"
0.138611935699938 " C"
0.12232460391645038 "\n"
0.032923220503856244 " Wall"
0.025640629909635775 " A"
0.021256828803575347 " D"
0.016554834917839274 " The"
0.006901081919294773 " a"
0.0064829665025314025 " ("
If we only consider the acceptable answers (A, B, C and D) and rescale, this gives:
62.73% " B"
27.84% " C"
05.15% " A"
04.26% " D"
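This rescaling can be reproduced with jq, keeping only the answer tokens, converting the logprobs to probabilities and normalizing them (applied to the same top_logprobs object as above; the percentages are printed unrounded):
# Keep only the A/B/C/D tokens, convert logprobs to probabilities, normalize to percentages
jq -r '[to_entries[] | select(.key | IN(" A", " B", " C", " D")) | {key, p: (.value | exp)}] as $answers
| ($answers | map(.p) | add) as $total
| $answers | sort_by(-.p)[]
| "\(100 * .p / $total) \(.key | tojson)"'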
The confidence is much higher when using the instruction-following variant.
Using an instruction following model
For chat completion, we can serve an instruction-following model instead:
podman run --rm --shm-size=4g -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
localhost/vllm-cpu \
--model=mistralai/Mistral-7B-Instruct-v0.3 --dtype=bfloat16
We can now use the Chat completion API:
curl http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
"messages":[
{"role":"system","content":"You are a helpful assistant"},
{"role":"user","content":"What is the color of the sky?"}
]
}' | jq -r .choices[0].message.content
The color of the sky can vary depending on the time of day, weather, and location, but under a clear blue sky, it typically appears blue due to a process called Rayleigh scattering. During sunrise and sunset, the sky may take on shades of red, pink, and orange. At night, the sky is generally a dark expanse of black, dotted with stars.
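To check the earlier claim about confidence, the same multiple-choice question can be sent to the instruct model with log probabilities enabled; in the chat completion API this goes through the logprobs and top_logprobs fields, standard OpenAI parameters that vLLM implements as far as I can tell. A sketch, output not shown:
# Ask for the top 10 alternatives of the single generated token
curl http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
"messages":[
{"role":"user","content":"Which of these is NOT a Pixar film?\nA: Wall-E\nB: Cloudy With a Chance of Meatballs\nC: Coco\nD: a Bug s Life\n\nAnswer with a single letter."}
],
"max_tokens": 1,
"logprobs": true,
"top_logprobs": 10
}' | jq .choices[0].logprobs.content[0].top_logprobs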