Everyone runs local LLMs differently. Llama.cpp, Ollama, and LM Studio are a few of the more popular options, and they run on pretty much any hardware. However, Llama.cpp has just released a new web UI that might change the equation for some of you.
In this article, I’ll show you how to build it from source and demonstrate a behavior that might make you rethink your entire stack.
There’s a discussion worth noting, initiated by Georgi Gerganov, the creator of Llama.cpp. While it’s likely to be merged into the main documentation eventually, it’s still in its early stages. He outlines the benefits and provides setup instructions. For context, I’m working on an M4 Mac Mini with 16GB of RAM, so we’ll be using smaller models. The process, however, is nearly identical for running larger models on more powerful hardware.
The Easy Way vs. The Developer’s Way
The simple way to install it is with Homebrew.
brew install llama.cpp
And you’re done. Pretty easy. But, since this is a developer-focused publication, we’re going to do it the hard way. This method is also crucial if you’re building on systems that don’t have a pre-compiled binary available, which includes quite a few platforms.
Building Llama.cpp From Source
First, we need to clone the repository. You can find the URL on the main repo page under the “Code” button, but the documentation provides a clear set of steps.
I’ve already created a code directory, so let’s get started.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Now, we just need to build it. The documentation mentions a CPU build, but we don’t want our models running on the CPU, right? If you’re on an NVIDIA or AMD system, you’ll follow different steps (for NVIDIA, for example, the configure step takes the -DGGML_CUDA=ON CMake flag). But on a Mac, the Metal build is enabled by default. You don’t have to do anything extra.
First, run CMake to set up the build directory.
cmake -B build
Boom. No extra flags are needed because we’re on Apple Silicon, which is supported right out of the box.
Next, we’ll use this command to create the release build. I’m adding one more flag to it: the -j flag. This allows you to run multiple jobs in parallel for faster compilation. Trust me, you’ll notice the difference.
cmake --build build --config Release -j 8
There we go. It’s building. Done. That was remarkably fast.
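If you’re not sure what to pass to -j, a reasonable default is your machine’s logical core count. A quick way to check it (this uses Python’s standard library, nothing from llama.cpp):

```python
import os

# Number of logical CPU cores available for parallel compile jobs.
cores = os.cpu_count()
print(f"cmake --build build --config Release -j {cores}")
```

On the M4 Mac Mini that resolves to 10, which is why the build finishes so quickly.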
Now that it’s built, your binaries are located in a different directory.
cd build/bin
If you look inside, you’ll find all the tools you can run, like llama-cli and llama-bench.
Starting the Llama Server
Now for step two: starting the llama-server tool. The command points to a Hugging Face repository: the organization name and the model name, sometimes with the quantization level appended. The default example, meta-llama/Meta-Llama-3-8B-Instruct-GGUF, might be too large for a 16GB machine, so I’m going to head over to Hugging Face to pick a different model.
I’ll search for the ggml-org organization. On their organization page, you can click on “Models” and filter by name to find something small enough to fit on this machine.
Let’s search for a Qwen2 model. We have several options, including Q8 (quantized to 8 bits) and others. The GGUF in the name signifies the format that Llama.cpp supports.
A Note on Model Formats:
Typically, models don’t come in GGUF format. They are often released in safetensors format. To run them with Llama.cpp, they must undergo a conversion process. Fortunately, this is usually done for you. The ggml-org organization, being directly related to Llama.cpp, provides most of its models in GGUF format already.
However, you can’t always trust the name. It’s up to the creator to name it properly using standards that are still evolving.
Looking at the “Files and versions” tab, we see a few options:
- F16: The original 16-bit floating-point model, converted to GGUF. (8GB)
- Q8_0: Quantized to 8-bit, half the size.
- Q4_K_M: Quantized to 4-bit, even smaller. (2.5GB)
I’ll use the smallest one, the Q4_K_M version. It’s a quarter of the size of the full F16 model.
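As a sanity check, you can ballpark a GGUF file’s size from the parameter count and the bits per weight. Q8_0 works out to roughly 8.5 bits per weight and Q4_K_M to roughly 4.5 once you include the quantization scales; real files add overhead for metadata and embeddings, so treat this as an estimate, not a spec:

```python
def approx_gguf_gb(params_billions: float, bits_per_weight: float) -> float:
    """Ballpark GGUF size in GB: parameters * bits-per-weight / 8 bytes."""
    return params_billions * bits_per_weight / 8

# Plugging in 4 billion parameters roughly reproduces the file sizes above.
for label, bits in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{label}: ~{approx_gguf_gb(4, bits):.2f} GB")
```

The same arithmetic tells you ahead of time whether a quant will fit in your RAM before you download it.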
Hugging Face provides a command to run the model using Llama.cpp. I’ll select the Q4_K_M version and copy the command.
Back in the terminal, I’ll paste it in.
./llama-server -hf ggml-org/Qwen2-7B-Instruct-GGUF:Q4_K_M -c 4000
A few things to note here:
- I’m using ./llama-server to be explicit about running the binary from our local build folder. If you installed Llama.cpp globally, you could just run llama-server. I prefer this isolated approach to manage different versions.
- The -c 4000 flag sets the context window to 4,000 tokens to make it quicker and use less memory. The default (-c 0) uses the model’s maximum context.
Boom. The server is listening on http://localhost:8080. Let’s open that up.
Exploring the New Llama.cpp Web UI
There it is. The new web UI for Llama.cpp. It tells you the model and the context size. The UI is simple, but it offers a lot of flexibility.
Let’s try a common prompt: “Write a story.”
As it works, you can see the context filling up, which is incredibly useful information. It also shows the number of tokens being output. The process is divided into a “thinking” stage and a “generation” stage, but both count against your total context.
The settings panel offers a bunch of controls:
- Temperature
- Option to show the “thought” process automatically
- Import/export for all conversations
- A developer section for sending custom JSON to the API
At the end, it shows you the statistics: 29 tokens per second, total tokens created, and how long it took.
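That developer section is handy because llama-server also exposes an OpenAI-compatible HTTP API, so anything you can do in the UI you can do programmatically. Here’s a minimal sketch of the JSON you’d send to the chat completions endpoint; the field names follow the OpenAI-compatible convention, and I haven’t wired up the actual HTTP call:

```python
import json

# OpenAI-style chat payload; llama-server accepts this at
# http://localhost:8080/v1/chat/completions (assuming the default port above).
payload = {
    "messages": [{"role": "user", "content": "Write a story."}],
    "temperature": 0.7,   # same knob as the UI's settings panel
    "max_tokens": 256,    # cap on the generation stage
}
print(json.dumps(payload, indent=2))
```

Pasting that body into the developer section should behave the same as typing the prompt into the chat box.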
How Does It Compare to Ollama?
Ollama is incredibly easy to install. However, it seems to be changing direction: many of its models are now available in the cloud, which suggests a potential shift toward cloud-only offerings, or at least cloud-enhanced hybrid options. This is just speculation, of course.
If we compare the UIs, Ollama’s is quite basic. The settings are limited to model location, network exposure, and context length. You don’t get real-time statistics like tokens per second in the UI.
To get those stats, you have to run Ollama through the terminal with the --verbose flag.
ollama run qwen2:7b --verbose
After running a prompt, you’ll get an evaluation rate, for example, 36 tokens per second.
The Real Difference: Parallel Processing
Besides the functional UI, Llama.cpp has another major advantage: the ability to run things in parallel. This is incredibly useful for agents or any programmatic use of the API. For simple chat, it doesn’t matter much unless you want to have two conversations at once.
Ollama’s Sequential Processing

When you have Ollama running, it only handles one message at a time, no matter where it’s coming from. If you start a request in the UI and another in the terminal, one will block the other. The server is tied up, answering only one instance at a time. Imagine a process, or multiple users, having to wait around like that.
Even if you run two ollama run commands in separate terminals, the result is the same. One process is busy while the other sits and waits. It’s simply not what it’s designed for.
Llama.cpp’s Parallel Power

Now, let’s go back to Llama.cpp. I’ll start up two chats in the browser.
- Chat 1: “Write a story about Alex.”
- Chat 2: “Write a story about Bob.”
Look at that. They’re both reasoning simultaneously. You can see the context counter going up on both of them. The server is responding to multiple requests.
In fact, let’s open two more.
- Chat 3: “Write a story about Tracy.”
- Chat 4: “Write a story about Alice.”
All four are now generating at the same time. A quick look at the activity monitor shows the GPU is fully engaged, confirming that Llama.cpp is taking advantage of Apple Silicon’s GPU, not just the CPU.
Now, notice the tokens per second. We’re seeing rates like 16, 17, and 13 tokens/second. The individual speed has taken a dive because we’re splitting the GPU’s processing power. But here’s the key: we can now work with multiple processes concurrently.
Let’s try just two parallel requests.
- Request 1: 24.3 tokens/second
- Request 2: 25.92 tokens/second
That’s just over 50 tokens per second in aggregate. When you’re using this with agents or programmatically, you’re going to get much better throughput.
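To see why this matters for agents, here’s a sketch of fanning several prompts out concurrently with a thread pool. The generate function is a stand-in (in real use it would POST each prompt to the llama-server endpoint); the point is that the requests are in flight at the same time, so aggregate throughput goes up even though each individual stream slows down:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def generate(prompt: str) -> str:
    # Stand-in for an HTTP call to llama-server; the sleep simulates
    # generation time on the server.
    time.sleep(0.1)
    return f"story about {prompt}"

prompts = ["Alex", "Bob", "Tracy", "Alice"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(generate, prompts))
elapsed = time.perf_counter() - start

# All four "generations" overlap, so this takes ~0.1s rather than ~0.4s.
print(results, f"{elapsed:.2f}s")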
For even more flexibility, you can run a second, separate instance of the server on a different port.
./llama-server -m [model_name] -c 4000 --port 8081
Now you have two completely independent server processes running.
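With two independent servers, a simple client-side trick is to round-robin requests between the ports. A sketch of the distribution logic (the URLs assume the default port plus the second instance started above; no HTTP call is made here):

```python
from itertools import cycle

# Alternate requests between the two independent llama-server instances.
servers = cycle(["http://localhost:8080", "http://localhost:8081"])

prompts = ["Alex", "Bob", "Tracy", "Alice"]
assignments = [(prompt, next(servers)) for prompt in prompts]
for prompt, url in assignments:
    print(f"{prompt} -> {url}/v1/chat/completions")
```

Each instance keeps its own model copy and KV cache in memory, so this costs RAM, but it gives you full isolation between the two workloads.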
Conclusion
While Ollama offers simplicity, the new Llama.cpp web UI combined with its inherent parallel processing capabilities makes it a formidable tool. For developers building applications that require concurrent requests, or for anyone who wants more control and higher aggregate throughput, Llama.cpp presents a compelling case for rethinking your local LLM stack.
Llama.cpp works across multiple systems, but for those on high-end NVIDIA or AMD clusters, another popular tool to consider is vLLM, which can yield some pretty impressive numbers.