llama.cpp quickstart
How to quickly use llama.cpp for LLM inference (no GPU needed).
llama.cpp is a framework for LLM serving (inference). Some highlights:
- OpenAI-compatible API server;
- CLI interface;
- tool call support;
- FIM (fill-in-the-middle) completions;
- hardware acceleration support on several backends;
- can expose the output token probabilities (logprobs);
- structured output using llguidance;
- C++ API, with third-party bindings for many languages;
- multimodal support;
- web UI built into the server;
- decoding options (temperature, top_p, top_k, min_p, typical_p, mirostat, etc.);
- authentication support (a list of API keys; see the example after this list);
- only one model per server instance at the moment (the "model" REST API parameter is ignored).
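As an example of the authentication feature: the server can be started with one or more API keys, and clients must then send one of them as a bearer token. A minimal sketch, assuming a recent server build with the --api-key flag (the model path is just a placeholder):
podman run -it -v $(pwd)/models:/models:ro -p 8000:8000 --rm \
    ghcr.io/ggml-org/llama.cpp:server \
    -m /models/your-model.gguf \
    --port 8000 --host 0.0.0.0 \
    --api-key my-secret-key
curl http://127.0.0.1:8000/v1/models \
    -H "Authorization: Bearer my-secret-key"
Requests without a valid key are rejected with an authentication error.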
Using a pre-trained model
llama.cpp expects the model in GGUF format. Many models are not available in this format. The convert_hf_to_gguf.py script can be used to download a model from Hugging Face (into ~/.cache/huggingface) and convert it to the GGUF format.
HF_TOKEN=...
export HF_TOKEN
podman run -e HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $(pwd)/models:/models -it --rm \
--entrypoint /usr/bin/python3 \
ghcr.io/ggml-org/llama.cpp:full \
./convert_hf_to_gguf.py \
--remote mistralai/Mistral-7B-v0.3 \
--outfile /models/Mistral-7B-v0.3-BF16.gguf \
--outtype bf16
See the GGUF Naming Convention for a recommended naming scheme for GGUF files.
If the model is already available in the correct format, you do not have to do this conversion.
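Many repositories on Hugging Face already publish ready-made GGUF files; in that case a plain download into the models directory is enough. A sketch (the repository and file names are placeholders, not a real URL):
curl -L -o models/some-model.gguf \
    "https://huggingface.co/<user>/<repo>/resolve/main/<file>.gguf"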
We can now start the API server:
podman run -it -v $(pwd)/models:/models:ro -p 8000:8000 --rm \
ghcr.io/ggml-org/llama.cpp:server \
-m /models/Mistral-7B-v0.3-BF16.gguf \
--port 8000 --host 0.0.0.0 -n 512
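You can check that the model has finished loading by querying the server's health endpoint:
curl http://127.0.0.1:8000/health
It returns a small JSON status (for example {"status":"ok"}) once the model is ready; the exact fields vary between llama.cpp versions.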
Once the server has loaded the model, we can use the completion API:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt":"Hello, how",
"max_tokens":1,
"logprobs": 10,
}' | jq
{
"choices": [
{
"text": " are",
"index": 0,
"logprobs": {
"content": [
{
"id": 1228,
"token": " are",
"bytes": [
32,
97,
114,
101
],
"logprob": -0.3882438838481903,
"top_logprobs": [
{
"id": 1228,
"token": " are",
"bytes": [
32,
97,
114,
101
],
"logprob": -0.3882438838481903
},
{
"id": 29577,
"token": "’",
"bytes": [
226,
128,
153
],
"logprob": -2.766828775405884
},
{
"id": 1117,
"token": " is",
"bytes": [
32,
105,
115
],
"logprob": -2.887641191482544
},
{
"id": 1309,
"token": " can",
"bytes": [
32,
99,
97,
110
],
"logprob": -3.1200249195098877
},
// ...
]
}
]
},
"finish_reason": "length"
}
],
"created": 1747092732,
"model": "gpt-3.5-turbo",
"system_fingerprint": "b5350-c1040239",
"object": "text_completion",
"usage": {
"completion_tokens": 1,
"prompt_tokens": 4,
"total_tokens": 5
},
"id": "chatcmpl-Y0xtQ9oU5dv065QEeAbK36UN2bQ13LHC",
"timings": {
"prompt_n": 1,
"prompt_ms": 389.861,
"prompt_per_token_ms": 389.861,
"prompt_per_second": 2.565016762384542,
"predicted_n": 1,
"predicted_ms": 1.837,
"predicted_per_token_ms": 1.837,
"predicted_per_second": 544.3658138268917
}
}
Warning: pre-trained model
This model is a pre-trained model. It is not trained to behave as a nice and helpful chatbot 😊 and might behave unexpectedly if you try to use it as a chatbot agent.
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt":"User: What is the color of the sky?\n\nAgent:",
"max_tokens":25
}' | jq .choices[0].text
" I’m sorry, I can’t tell you that.\nUser: Why not?\nAgent: That is classified"
Exploiting the output token probabilities
The logprobs parameter can be used to estimate the confidence of the model:
curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
"prompt":"Question: Which of these is NOT a Pixar film?\nA: Wall-E\nB: Cloudy With a Chance of Meatballs\nC: Coco\nD: a Bug s Life\n\nAnswer:",
"max_tokens":1,
"logprobs":20
}' | jq .choices[0].logprobs.content[0].top_logprobs
[
{
"id": 1133,
"token": " B",
"bytes": [
32,
66
],
"logprob": -1.1377761363983154
},
{
"id": 781,
"token": "\n",
"bytes": [
10
],
"logprob": -1.5422114133834839
},
{
"id": 14600,
"token": " Cloud",
"bytes": [
32,
67,
108,
111,
117,
100
],
"logprob": -1.8006733655929565
},
{
"id": 1102,
"token": " C",
"bytes": [
32,
67
],
"logprob": -2.33872389793396
},
// ...
]
We can compute the probabilities (exp(logprob)) by piping the top_logprobs array from the previous call through:
jq -r 'sort_by(.logprob)|reverse|map( (.logprob|exp|tostring) + " " + (.token|tojson))|.[]'
0.3205310471281873 " B"
0.21390754000314166 "\n"
0.16518761910430604 " Cloud"
0.09645064059374886 " C"
0.038972517620035334 " Wall"
0.03235804166035752 " D"
0.02645796331919115 " A"
0.010132176689094765 " The"
0.00866332853478939 " ("
0.008645014555171639 " a"
0.004494377052626721 " "
0.003879985929267207 " None"
0.0033994580684077996 " Option"
0.003273198216128902 " It"
0.0030570521287553555 " This"
0.0026370859252820896 "B"
0.0019528912185460014 " “"
0.0018986810075143374 " b"
0.0018722240854105563 " There"
0.0015863105377299503 " You"
If we only consider the acceptable answers, this gives after rescaling:
67.36% " B"
20.27% " C"
06.80% " D"
05.56% " A"
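The rescaling itself is straightforward: keep only the four answer tokens, convert each logprob with exp, and renormalise. A jq sketch, with the answer token strings hard-coded (feed it the same top_logprobs array as above):
jq -r 'map(select(.token == " A" or .token == " B" or .token == " C" or .token == " D"))
  | (map(.logprob | exp) | add) as $total
  | sort_by(-.logprob)
  | map((100 * (.logprob | exp) / $total | tostring) + "% " + .token)
  | .[]'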
The confidence is much higher when using the instruction-following variant.
Chat API
In order to use the Chat completion API, we are going to use an instruction-following model:
podman run -e HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $(pwd)/models:/models -it --rm \
--entrypoint /usr/bin/python3 \
ghcr.io/ggml-org/llama.cpp:full \
./convert_hf_to_gguf.py \
--remote mistralai/Mistral-7B-Instruct-v0.3 \
--outfile /models/Mistral-7B-Instruct-v0.3-BF16.gguf \
--outtype bf16
podman run -it -v $(pwd)/models:/models:ro -p 8000:8000 --rm \
ghcr.io/ggml-org/llama.cpp:server \
-m /models/Mistral-7B-Instruct-v0.3-BF16.gguf \
--port 8000 --host 0.0.0.0 -n 512
We can now use the Chat completion API:
curl http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
"messages":[
{"role":"system","content":"You are a helpful assistant"},
{"role":"user","content":"What is the color of the sky?"}
]
}' | jq
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": " The color of the sky can vary depending on factors such as time of day, weather, and location, but it is typically blue during a clear day. This is because the molecules in the Earth's atmosphere scatter sunlight in all directions, and blue light is scattered more than other colors because it travels in shorter, smaller waves. This phenomenon is known as Rayleigh scattering."
}
}
],
"created": 1747213402,
"model": "gpt-3.5-turbo",
"system_fingerprint": "b5350-c1040239",
"object": "chat.completion",
"usage": {
"completion_tokens": 78,
"prompt_tokens": 19,
"total_tokens": 97
},
"id": "chatcmpl-D4iW0Vg5ZkkLitXSzlwEW7LwFuAniNC6",
"timings": {
"prompt_n": 19,
"prompt_ms": 982.405,
"prompt_per_token_ms": 51.70552631578947,
"prompt_per_second": 19.34029244557998,
"predicted_n": 78,
"predicted_ms": 32527.196,
"predicted_per_token_ms": 417.01533333333333,
"predicted_per_second": 2.3979933591570575
}
}
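The chat completion endpoint also supports the usual OpenAI-style streaming mode: with "stream": true the answer comes back as server-sent events, which curl can display with -N (no buffering). A quick sketch:
curl -N http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
  "messages":[
    {"role":"user","content":"What is the color of the sky?"}
  ],
  "stream": true
}'
The reply arrives as a sequence of data: lines, each containing a small JSON delta with the next piece of the answer, and ends with data: [DONE].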
When using an instruction-following model, you can use the web interface as well. It is available at http://localhost:8000/. (This web interface may not work correctly when using a pre-trained model.)
CLI
llama.cpp has a CLI interface as well. For example, to start a conversation from the console:
podman run -it -v $(pwd)/models:/models:ro --rm \
ghcr.io/ggml-org/llama.cpp:light -m /models/Mistral-7B-Instruct-v0.3-BF16.gguf
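For a single, non-interactive generation you can pass the prompt directly with -p and limit the number of generated tokens with -n. A sketch (depending on the llama.cpp version, you may need an extra flag to disable the interactive conversation mode):
podman run -it -v $(pwd)/models:/models:ro --rm \
    ghcr.io/ggml-org/llama.cpp:light \
    -m /models/Mistral-7B-Instruct-v0.3-BF16.gguf \
    -p "What is the color of the sky?" -n 64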