Robert Važan

Local LLMs on Linux with Ollama

I finally got around to setting up a local LLM, almost a year after I declared that AGI is here. I have low-cost hardware and I didn't want to tinker too much, so after messing around for a while, I settled on CPU-only Ollama and Open WebUI, both of which can be installed easily and securely in a container. Ollama has a big model library while Open WebUI is rich in convenient features. Ollama is built on top of the highly optimized llama.cpp.

Setup

Setup is super simple. No GPU is needed. Both projects have instructions for running in Docker containers. See the relevant Ollama blog post, the Open WebUI README, and the podman section in the Open WebUI setup guide. I have tweaked the instructions a bit to use Podman instead of Docker (I am using Fedora) and to restart the containers automatically after reboot:

podman run -d \
    --name ollama \
    --replace \
    --pull=always \
    --restart=always \
    -p 127.0.0.1:11434:11434 \
    -v ollama:/root/.ollama \
    --stop-signal=SIGKILL \
    docker.io/ollama/ollama
podman run -d \
    --name open-webui \
    --replace \
    --pull=always \
    --restart=always \
    -p 127.0.0.1:3000:8080 \
    --network=pasta:-T,11434 \
    -v open-webui:/app/backend/data \
    ghcr.io/open-webui/open-webui:main
systemctl --user enable podman-restart

You can now access Open WebUI at http://localhost:3000. The --network=pasta:-T,11434 option lets the Open WebUI container reach Ollama on the host's localhost:11434. To update your installation, just run the above commands again.
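As a quick sanity check (my own habit, not part of the official instructions), you can list the running containers and ping Ollama's API:

podman ps
curl http://localhost:11434/api/version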

I have also created some aliases/scripts to make it very convenient to invoke Ollama from the command line, because without aliases, the containerized CLI interface gets a bit verbose:

podman exec -it ollama ollama run llama3.2:1b

Or alternatively run the CLI interface in a separate container:

podman run -it --rm \
    --network=pasta:-T,11434 \
    docker.io/ollama/ollama run llama3.2:1b
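The alias itself is nothing fancy; a minimal bash sketch (the alias name is just my choice):

# in ~/.bashrc: forward the ollama command to the running container
alias ollama='podman exec -it ollama ollama'

# afterwards, plain ollama commands work as usual
ollama list
ollama run llama3.2:1b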

Why run LLMs locally?

I used to have a GPT-4 subscription, but it was barely paying for itself. It saved me less than 10% of my time and I wasted a lot of time tinkering with it. Local LLMs are free and increasingly good. Then there are all the issues with the cloud. A cloud LLM can change, disappear, or get more expensive at any moment. It keeps asking for my feedback and other data, which only serves the operator while I get my data locked up. I am quite sensitive about privacy and freedom and although I don't run into guardrails often, it's annoying when I do. Even though ChatGPT is smart, it's often unnecessarily creative when I just want it to follow instructions. Local LLMs offer more control over output. The API gives more control too, but it can get crazy expensive if some script gets stuck in a loop.

Choosing models

My current favorite models are llama3.1 8B for general topics, qwen2.5-coder 7B for programming, and dolphin-llama3 8B for overcoming refusals. If you don't have enough memory for those, try the smaller llama3.2 3B, llama3.2 1B, or qwen2.5 0.5B.

The default 4-bit quantization makes models smaller and faster with negligible loss of accuracy. 3-bit quantization cuts into accuracy perceptibly, but it's still better than resorting to a smaller model. There's no point in running models with more than 4 bits per parameter. If you have powerful hardware, just run a larger model instead.
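Quantization is selected through the model tag when pulling from the Ollama library. The default tag resolves to a 4-bit build; other quantizations have explicit tags. The tag names below are from memory, so check each model's page in the library for the exact spelling:

# default tag, 4-bit quantization
podman exec -it ollama ollama pull llama3.1:8b
# explicit 3-bit variant for tight memory budgets
podman exec -it ollama ollama pull llama3.1:8b-instruct-q3_K_M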

Model settings

Open WebUI provides a convenient UI for tweaking Ollama's parameters. There's also an older UI for creating custom modelfiles, but that's rarely useful now that there are several other ways to adjust parameters. Parameters can be set globally (Settings / General) and tweaked separately for every model (Workspace / Models). Temporary parameter changes can be applied to the current conversation after opening Chat Controls. The default values of the temperature, top_k, and top_p parameters narrow the output probability distribution, which helps smaller models stay on track, so I leave them alone. I sometimes use greedy sampling (top_k = 1) when I want predictable, robotic output without any creativity. Other than that, I configure only the context window for every model. You can also tweak the system prompt, but that usually damages model performance unless the model is trained for it.
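Greedy sampling can also be switched on from the Ollama CLI for a single session; a quick illustration using the interactive /set command (the model is just an example):

podman exec -it ollama ollama run llama3.2:3b
>>> /set parameter top_k 1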

Models use the context window (stored in the KV cache) to remember what was already said. Context requires a lot of memory, which is why Ollama defaults to just a 2048-token context. If you have enough memory, you probably want to adjust the num_ctx parameter, because Ollama does not handle context-exceeding conversations well. The newest models support impressive context lengths: llama3.1 8B up to 128K tokens at 8K tokens per GB, qwen2.5-coder 7B 128K @ 18K/GB, dolphin-llama3 8B 256K @ 8K/GB, llama3.2 3B 128K @ 9.3K/GB, llama3.2 1B 128K @ 32K/GB, and qwen2.5 0.5B 128K @ 85K/GB. Although the RULER test shows that effective context size is often much smaller than declared, it also shows that 32K+ effective context is common in newer models and that larger context still helps, just less so.
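For use outside Open WebUI, num_ctx can also be passed per request through Ollama's API; a minimal sketch (the model, prompt, and 16384 value are arbitrary examples):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "options": { "num_ctx": 16384 }
}'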

What performance can you expect?

In general, paid frontier models perform better than free cloud models (not counting rate-limited versions of the paid models), which in turn perform better than local models. You can see this in the Chatbot Arena leaderboard. This hierarchy is upset in two ways. Firstly, if you have high-end hardware, you can match the performance of the free cloud models. Secondly, there's a growing selection of specialized models that can approach (but not quite match) the performance of frontier models in their area of specialization. For coding models, this can be seen in LiveCodeBench, BigCodeBench, and the Aider leaderboard.

In my experience, local models handle about 50% of my questions before I resort to cloud models. They can usually answer all easy questions and serve as a tutor for any popular topic. You can ask them to rewrite text to improve style and catch grammatical errors. Local models can be a reliable natural language scripting engine if the task is simple enough and the LLM is properly instructed and provided with examples. I believe more application opportunities will be unlocked with better hardware and software.
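As an example of the scripting use case, a one-liner along these lines works well (the file name and prompt are made up for illustration):

# non-interactive use: instructions as the prompt, input piped via stdin
cat draft.txt | podman exec -i ollama ollama run llama3.2:3b \
    "Fix grammar and spelling in the following text. Output only the corrected text."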

Speeding things up

Hardware is a big problem, BTW. Like most people shopping before the advent of local LLMs, I bought hardware that is woefully inadequate for running them. Models become barely usable at speeds around 10 tokens/second, which is approximately what you can expect from 7-9B models running on a CPU with 2-channel DDR4-3200. Long-context applications like coding require fast prompt processing, something no CPU can deliver (mine maxes out at 20 t/s). I am not an LLM nerd like the guys hanging out at /r/LocalLLaMA who build multi-GPU rigs just to run the largest LLMs, but I am certainly going for at least a 16GB GPU in my next computer and so should you.

Aside from hardware upgrades, you can speed things up in a number of ways, though not every tweak is worth the effort:

I wouldn't waste time tinkering with thread count (the num_thread parameter). Ollama automatically allocates one thread per physical core, which is optimal, probably because instruction-level parallelism already fully utilizes all cores and additional threads just introduce coordination issues.
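To check whether any tweak actually helps, the CLI can report throughput directly; the --verbose flag prints prompt evaluation and generation rates after each response:

# prints timing stats (prompt eval rate, eval rate) after each reply
podman exec -it ollama ollama run llama3.2:1b --verbose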

What to expect in the future

Hardware will certainly get better. Local AI, including LLMs, has changed the workload composition on personal computers, and hardware is just beginning to adapt. The fastest change will come from consumers simply buying suitable hardware, specifically GPUs with plenty of fast VRAM. I am not particularly knowledgeable about the hardware market, but my guess is that vendors will first scale up existing functionality that favors LLMs and other local AI, then introduce new primitives designed for quantized local models, and eventually get around to architectural changes like on-chip memory.

There are also plenty of opportunities for software and model optimizations, which is where I hope to get a significant performance boost in the next year or two. Code and text completion is an obvious application for local LLMs, but editor support is still scarce and often cumbersome. Domain-specific models could crush much larger generalists, but there are hardly any specialist models at the moment. Numerous architectural improvements are in the pipeline: ternary networks, Diff Transformer, YOCO, multi-token prediction, Mamba, RWKV. Letting LLMs access resources (documents and source code, Internet search, APIs, code execution) can help overcome the size limitations of local LLMs, but the current implementation in Open WebUI and Ollama is limited and unwieldy. Speculative decoding could help with speed, but no popular inference engine uses it yet. iGPUs and AMD/Intel dGPUs could help with multimodal models, long prompts, and energy efficiency, but most of them sit idle for lack of software support.

I am confident there will be steady and fairly fast progress in local LLMs, but cloud LLMs will not go away. With high sparsity and other optimizations, cloud LLMs will eventually grow to be as big as search engines. Instead of replacing cloud LLMs, local LLMs will evolve to support different use cases, especially fine-tuning and continuous training on local data.