
llama.cpp quickstart

How to quickly use llama.cpp for LLM inference (no GPU needed).

llama.cpp is a framework for LLM serving (inference). Some highlights: it is a plain C/C++ implementation with no heavy dependencies, it runs well on CPUs (no GPU needed), it supports quantized models through the GGUF format, and it provides both an OpenAI-compatible API server and a CLI.

Using a pre-trained model

llama.cpp expects models in the GGUF format, but many models are not distributed in this format. The convert_hf_to_gguf.py script, included in the full container image, can be used to download a model from Hugging Face (into ~/.cache/huggingface) and convert it to GGUF:

HF_TOKEN=...
export HF_TOKEN

podman run -e HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/models:/models -it --rm \
  --entrypoint /usr/bin/python3 \
  ghcr.io/ggml-org/llama.cpp:full \
  ./convert_hf_to_gguf.py \
  --remote mistralai/Mistral-7B-v0.3 \
  --outfile /models/Mistral-7B-v0.3-BF16.gguf \
  --outtype bf16

See the GGUF Naming Convention for a recommended naming scheme of GGUF files.

If the model is already available in the correct format, you do not have to do this conversion.
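
The conversion above produces an unquantized BF16 file (on the order of 14 GB for a 7B model). If this is too large, the file can optionally be quantized with the llama-quantize tool from the same image. This is only a sketch: the path of the binary inside the container is an assumption, adjust it to your setup.

podman run -v $(pwd)/models:/models -it --rm \
  --entrypoint /app/llama-quantize \
  ghcr.io/ggml-org/llama.cpp:full \
  /models/Mistral-7B-v0.3-BF16.gguf \
  /models/Mistral-7B-v0.3-Q4_K_M.gguf \
  Q4_K_M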

We can now start the API server:

podman run -it -v $(pwd)/models:/models:ro -p 8000:8000 --rm \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/Mistral-7B-v0.3-BF16.gguf \
  --port 8000 --host 0.0.0.0 -n 512
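
The model can take a while to load. The server exposes a /health endpoint that can be polled to check whether it is ready (it should report an "ok" status once the model is loaded):

curl http://127.0.0.1:8000/health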

Once the server has loaded the model, we can use the completion API:

curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
  "prompt":"Hello, how",
  "max_tokens":1,
  "logprobs": 10,
}' | jq
{
  "choices": [
    {
      "text": " are",
      "index": 0,
      "logprobs": {
        "content": [
          {
            "id": 1228,
            "token": " are",
            "bytes": [
              32,
              97,
              114,
              101
            ],
            "logprob": -0.3882438838481903,
            "top_logprobs": [
              {
                "id": 1228,
                "token": " are",
                "bytes": [
                  32,
                  97,
                  114,
                  101
                ],
                "logprob": -0.3882438838481903
              },
              {
                "id": 29577,
                "token": "’",
                "bytes": [
                  226,
                  128,
                  153
                ],
                "logprob": -2.766828775405884
              },
              {
                "id": 1117,
                "token": " is",
                "bytes": [
                  32,
                  105,
                  115
                ],
                "logprob": -2.887641191482544
              },
              {
                "id": 1309,
                "token": " can",
                "bytes": [
                  32,
                  99,
                  97,
                  110
                ],
                "logprob": -3.1200249195098877
              },
              // ...
            ]
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "created": 1747092732,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b5350-c1040239",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 4,
    "total_tokens": 5
  },
  "id": "chatcmpl-Y0xtQ9oU5dv065QEeAbK36UN2bQ13LHC",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 389.861,
    "prompt_per_token_ms": 389.861,
    "prompt_per_second": 2.565016762384542,
    "predicted_n": 1,
    "predicted_ms": 1.837,
    "predicted_per_token_ms": 1.837,
    "predicted_per_second": 544.3658138268917
  }
}
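
Besides the OpenAI-compatible routes, the server also exposes llama.cpp's native /completion endpoint, which takes n_predict instead of max_tokens and returns the generated text in a content field. A minimal sketch:

curl http://127.0.0.1:8000/completion -XPOST -H"Content-Type: application/json" -d '{
  "prompt": "Hello, how",
  "n_predict": 8
}' | jq .content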

Warning: pre-trained model

This model is a pre-trained model. It is not trained to behave as a nice and helpful chatbot 😊 and might behave unexpectedly if you try to use it as a chatbot agent.

curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d  '{
  "prompt":"User: What is the color of the sky?\n\nAgent:",
  "max_tokens":25
}' | jq .choices[0].text
" I’m sorry, I can’t tell you that.\nUser: Why not?\nAgent: That is classified"

Exploiting the output token probabilities

The logprobs parameter can be used to estimate the confidence of the model:

curl http://127.0.0.1:8000/v1/completions -XPOST -H"Content-Type: application/json" -d '{
  "prompt":"Question: Which of these is NOT a Pixar film?\nA: Wall-E\nB: Cloudy With a Chance of Meatballs\nC: Coco\nD: a Bug s Life\n\nAnswer:",
  "max_tokens":1,
  "logprobs":20
}' | jq .choices[0].logprobs.content[0].top_logprobs
[
  {
    "id": 1133,
    "token": " B",
    "bytes": [
      32,
      66
    ],
    "logprob": -1.1377761363983154
  },
  {
    "id": 781,
    "token": "\n",
    "bytes": [
      10
    ],
    "logprob": -1.5422114133834839
  },
  {
    "id": 14600,
    "token": " Cloud",
    "bytes": [
      32,
      67,
      108,
      111,
      117,
      100
    ],
    "logprob": -1.8006733655929565
  },
  {
    "id": 1102,
    "token": " C",
    "bytes": [
      32,
      67
    ],
    "logprob": -2.33872389793396
  },
  // ...
]

We can compute the corresponding probabilities (exp(logprob)) by piping the previous top_logprobs output through:

jq -r 'sort_by(.logprob)|reverse|map( (.logprob|exp|tostring) + " " + (.token|tojson))|.[]'
0.3205310471281873 " B"
0.21390754000314166 "\n"
0.16518761910430604 " Cloud"
0.09645064059374886 " C"
0.038972517620035334 " Wall"
0.03235804166035752 " D"
0.02645796331919115 " A"
0.010132176689094765 " The"
0.00866332853478939 " ("
0.008645014555171639 " a"
0.004494377052626721 " "
0.003879985929267207 " None"
0.0033994580684077996 " Option"
0.003273198216128902 " It"
0.0030570521287553555 " This"
0.0026370859252820896 "B"
0.0019528912185460014 " “"
0.0018986810075143374 " b"
0.0018722240854105563 " There"
0.0015863105377299503 " You"

If we keep only the acceptable answers (A, B, C and D) and renormalize, we get (see the jq sketch below):

67.36% " B"
20.27% " C"
06.80% " D"
05.56% " A"
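
These renormalized percentages can be recomputed from the top_logprobs array with jq. This sketch assumes the array from the previous response has been saved to top_logprobs.json:

jq -r 'map(select(.token == " A" or .token == " B" or .token == " C" or .token == " D"))
  | map({token, p: (.logprob | exp)})
  | (map(.p) | add) as $total
  | map("\(100 * .p / $total)% \(.token | tojson)")
  | .[]' top_logprobs.json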

The confidence is much higher when using the instruction-following variant of the model.

Chat API

In order to use the Chat completion API, we are going to use an instruction-following model:

podman run -e HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/models:/models -it --rm \
  --entrypoint /usr/bin/python3 \
  ghcr.io/ggml-org/llama.cpp:full \
  ./convert_hf_to_gguf.py \
  --remote mistralai/Mistral-7B-Instruct-v0.3 \
  --outfile /models/Mistral-7B-Instruct-v0.3-BF16.gguf \
  --outtype bf16

podman run -it -v $(pwd)/models:/models:ro -p 8000:8000 --rm \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/Mistral-7B-Instruct-v0.3-BF16.gguf \
  --port 8000 --host 0.0.0.0 -n 512

We can now use the Chat completion API:

curl http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d  '{
  "messages":[
    {"role":"system","content":"You are a helpful assistant"},
    {"role":"user","content":"What is the color of the sky?"}
  ]
}' | jq
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " The color of the sky can vary depending on factors such as time of day, weather, and location, but it is typically blue during a clear day. This is because the molecules in the Earth's atmosphere scatter sunlight in all directions, and blue light is scattered more than other colors because it travels in shorter, smaller waves. This phenomenon is known as Rayleigh scattering."
      }
    }
  ],
  "created": 1747213402,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b5350-c1040239",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 78,
    "prompt_tokens": 19,
    "total_tokens": 97
  },
  "id": "chatcmpl-D4iW0Vg5ZkkLitXSzlwEW7LwFuAniNC6",
  "timings": {
    "prompt_n": 19,
    "prompt_ms": 982.405,
    "prompt_per_token_ms": 51.70552631578947,
    "prompt_per_second": 19.34029244557998,
    "predicted_n": 78,
    "predicted_ms": 32527.196,
    "predicted_per_token_ms": 417.01533333333333,
    "predicted_per_second": 2.3979933591570575
  }
}
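
The Chat completion API also accepts the OpenAI-style "stream" parameter: with "stream": true, the answer is sent incrementally as server-sent events instead of a single JSON document. For example (-N disables curl's output buffering):

curl -N http://127.0.0.1:8000/v1/chat/completions -XPOST -H"Content-Type: application/json" -d '{
  "messages":[
    {"role":"user","content":"What is the color of the sky?"}
  ],
  "stream": true
}'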

When using an instruction-following model, you can also use the web interface, available at http://localhost:8000/. (This web interface may not work correctly with a pre-trained model.)

CLI

llama.cpp has a CLI interface as well.

For example, to start a conversation from the console:

podman run -it -v $(pwd)/models:/models:ro --rm \
  ghcr.io/ggml-org/llama.cpp:light -m /models/Mistral-7B-Instruct-v0.3-BF16.gguf
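
The same image can also be used non-interactively, for example to complete a single prompt (-p) with a bounded number of tokens (-n), here with the base model converted earlier:

podman run -it -v $(pwd)/models:/models:ro --rm \
  ghcr.io/ggml-org/llama.cpp:light \
  -m /models/Mistral-7B-v0.3-BF16.gguf \
  -p "The capital of France is" -n 16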

References