Infinite chatbot with llama.cpp

Date: 2023-04-04 (original trials)

Update: 2024-07-22 (keeping the setup section updated)

Code: rendezllama - A project I created to reliably get around some issues discussed in this article.


We walk through some attempts to use llama.cpp as a chatbot.

Jump to the Setup section if you don’t have llama.cpp set up already or if you’re trying to copy/paste commands.

Guiding the Autocomplete

The first thing you’ll notice when running llama.cpp’s main example is that the text after your prompt just keeps going as if completing a story. To use it as a chatbot, we have to guide the glorified autocomplete tool to expect & fill in the correct structure by providing it a good prompt that consists of 2 parts:

  1. Priming prompt.
    • Describes the rest of the document.
      • “Transcript of a dialogue between User and a fancy AI named Bot.”
      • “Every line of dialogue begins with a name followed by a colon.”
    • Describes the characters chatting.
      • Bad: “Bot definitely doesn’t want to destroy humanity.”
      • Good: “Bot is friendly and it tries to help User with their concerns.”
  2. Few-shot example prompt.
    • A handful of example dialogue lines.

For a chat, llama.cpp’s main example is run with flags that indicate that you’re requesting an interactive session (-i) and a reverse prompt (-r "User:") that lets you fill in the user dialogue.

For an infinite chat, llama.cpp’s main example can reinitialize its prompt with your original primer prompt followed by only the newest half(ish) of chat dialogue. You need to use flags that tell it to run forever (-n -1), have no exit token (--ignore-eos), keep as many tokens as exist in the primer prompt (--keep 123), and you probably want max context size (-c 2048).

Constraining Structure

Infinite chat has some issues:

A hacky but effective workaround for the last 3 issues is to let the user decide whether they want to speak whenever the Bot writes a punctuation or newline. And, in the case of a newline, force it to start with User: or Bot: depending on either user input or alternating pattern (because sometimes it’s fun to let the LLM write the user’s dialogue).

A better general solution is to regenerate the last line if it looks bad (user-driven) or goes on for too long (automatic). This idea is tracked by Issue #604 for llama.cpp and is already implemented in a fork named koboldcpp. My rendezllama project now does the same via llama.cpp’s API.

Allen Roush et al. explore some other more direct ways of enforcing structure. See their Constrained-Text-Generation-Studio project on GitHub for an implementation and the list of constraints. I like this idea, and it could definitely complement the ability to regenerate lines, but it’s impossible to guard against all bad text patterns.

Quality vs Structure

When the structure LLM-generated text is unconstrained, it’s difficult to use some common techniques that improve chatbot quality.

I observed the trade-offs for the 7B and 13B LLaMA checkpoints. The larger one usually behaves better, so I suspect that the 30B and 65B checkpoints continue that trend. It would explain why some people are able to use these techniques.

Inner Monologue

Some people give the chatbot a /think command to reason about the world without affecting conversation. The Google robotics team shows that inner monologue can improve LLM quality by providing space to narrate and re-contextualize information before committing to an idea (e.g., by “speaking” it).

Structural problems:

The last problem can be mitigated with a different encoding that makes the chatbot’s speech lines look less similar to its inner monologue lines. A narrator (below) works well for this, but introduces different structural problems.


If you’re using the chatbot for a “choose your own adventure” experience where actions can take place, a narrator can help. The narrator can also reiterate important medium-term contextual information that would otherwise be lost in the normal flow of conversation. Just add lines starting with Narrator: or > .

Structural problems:

A narrator gives great results in terms of writing quality but needs to be actively corrected when it makes the wrong choice for the user.

Actions in square brackets

Characters can perform actions on their own dialogue lines fairly naturally without hinting much at an unwanted document structure. However, there’s still a trade-off because denoting actions with brackets like [points here] seems to produce better quality than asterisks like *points here* do.

Structural problems:

I got the idea to use brackets after reading Simon Wilson’s rant about how brackets are more proper. This syntax seems to behave really well as long as it only appears in the speech lines.

Starting Indicator

Some people add a line like ### Conversation Start to separate the priming prompt from the chat dialogue.

Structural problems:

A blank line seems to work well enough as a way to separate the priming prompt from the start of the dialogue.



The instructions assume you’re running as the correct user (e.g., via doas) and have the following environment variables set.

doas -u gendeux bash -l

Get llama.cpp

mkdir -p $(dirname "${llama_cpp_dir}")
cd $(dirname "${llama_cpp_dir}")
git clone -b master $(basename "${llama_cpp_dir}")
cd llama.cpp
# Prepare pipenv for later.
pipenv install -r requirements/requirements-convert_hf_to_gguf.txt

Download LLaMA checkpoints

Request the checkpoints from Facebook directly here or see this comment. The command to get everything but the 65B checkpoint might look like:

cd $(dirname "${models_dir})
# Omit 65B model, only port 49184, and name the result directory how our commands expect.
aria2c --select-file 1-4,21-23,25,26  --listen-port=49184 --dht-listen-port=49184 "magnet:?xt=urn:btih:...&dn=$(basename ${models_dir})"

Updating and Quantizing

The llama.cpp project moves fast and sometimes breaks stuff. Check its commits to see if there’s something new.

cd "${llama_cpp_dir}"
# Remember where we were in case results are bad.
# If it is, use `git checkout THE_HASH`.
git rev-parse HEAD | tee "/tmp/${delegate_user}_llama_version.txt"
# No-op if you were at the latest commit before.
git checkout master
# Update and build.
git pull origin master
pipenv install -r requirements/requirements-convert_hf_to_gguf.txt

# Convert from checkpoints and quantize. You don't always have to do this.
pipenv run python "${models_dir}/${model_subdir}/"
./llama-quantize "${models_dir}/${model_subdir}"/.*F16.gguf "${models_dir}/${model_subdir}/ggml-model.Q5_K_M.gguf" q5_k_m