Robert Važan

Local LLMs on Linux with Ollama

I finally got around to setting up a local LLM, almost a year after I declared that AGI is here. I have low-cost hardware and I didn't want to tinker too much, so after messing around for a while, I settled on CPU-only Ollama and Open WebUI, both of which can be installed easily and securely in a container. Ollama has a big model library while Open WebUI is rich in convenient features. Ollama is built on top of the highly optimized llama.cpp.

Setup

Setup is super simple. No GPU is needed. Both projects have instructions for running in Docker containers. See the relevant Ollama blog post, the Open WebUI README, and the podman section in the Open WebUI setup guide. I have tweaked the instructions a bit to use Podman instead of Docker (I am using Fedora) and to restart the containers automatically after reboot:

podman run -d \
    --name ollama \
    --replace \
    --pull=always \
    --restart=always \
    -p 127.0.0.1:11434:11434 \
    -v ollama:/root/.ollama \
    --stop-signal=SIGKILL \
    docker.io/ollama/ollama
podman run -d \
    --name open-webui \
    --replace \
    --pull=always \
    --restart=always \
    -p 127.0.0.1:3000:8080 \
    --network=pasta:-T,11434 \
    -v open-webui:/app/backend/data \
    ghcr.io/open-webui/open-webui:main
systemctl --user enable podman-restart

You can now access Open WebUI at http://localhost:3000. The --network=pasta:-T,11434 option lets the Open WebUI container reach Ollama on the host's localhost:11434. To update your installation, just run the above commands again.
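As a quick sanity check (my own habit, not part of the official instructions), you can list the running containers and ping Ollama's API:

podman ps
curl http://localhost:11434/api/version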

I have also created some aliases/scripts to make it very convenient to invoke Ollama from the command line, because without aliases, the containerized CLI interface gets a bit verbose:

podman exec -it ollama ollama run llama3.2:1b

Or alternatively run the CLI interface in a separate container:

podman run -it --rm \
    --network=pasta:-T,11434 \
    docker.io/ollama/ollama run llama3.2:1b
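The alias itself is nothing fancy; a minimal bash sketch (the alias name is just my choice):

# in ~/.bashrc: forward the ollama command to the running container
alias ollama='podman exec -it ollama ollama'

# afterwards, plain ollama commands work as usual
ollama list
ollama run llama3.2:1b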

Why run LLMs locally?

I used to have a GPT-4 subscription, but it was barely paying for itself. It saved me less than 10% of my time and I wasted a lot of time tinkering with it. Local LLMs are free and increasingly good. Then there are all the issues with the cloud. A cloud LLM can change, disappear, or get more expensive at any moment. It keeps asking for my feedback and other data, which only serves the operator while I get my data locked up. I am quite sensitive about privacy and freedom and although I don't run into guardrails often, it's annoying when I do. Even though ChatGPT is smart, it's often unnecessarily creative when I just want it to follow instructions. Local LLMs offer more control over output. The API gives more control too, but it can get crazy expensive if some script gets stuck in a loop.

Choosing models

My current favorite models are llama3.1 8B for general topics, qwen2.5-coder 7B for programming, and dolphin-llama3 8B for overcoming refusals. If you don't have enough memory for those, try the smaller llama3.2 3B, llama3.2 1B, or qwen2.5 0.5B.

The default 4-bit quantization makes models smaller and faster with negligible loss of accuracy. 3-bit quantization cuts into accuracy perceptibly, but it's still better than resorting to a smaller model. There's no point in running models with more than 4 bits per parameter. If you have powerful hardware, just run a larger model instead.
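Quantization is selected through the model tag when pulling from the Ollama library. The default tag resolves to a 4-bit build; other quantizations have explicit tags. The tag names below are from memory, so check each model's page in the library for the exact spelling:

# default tag, 4-bit quantization
podman exec -it ollama ollama pull llama3.1:8b
# explicit 3-bit variant for tight memory budgets
podman exec -it ollama ollama pull llama3.1:8b-instruct-q3_K_M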

Model settings

Open WebUI provides a convenient UI for tweaking Ollama's parameters. There's also an older UI for creating custom modelfiles, but that's rarely useful now that there are several other ways to adjust parameters. Parameters can be set globally (Settings / General) and tweaked separately for every model (Workspace / Models). Temporary parameter changes can be applied to the current conversation after opening Chat Controls. The default values of the temperature, top_k, and top_p parameters narrow the output probability distribution, which helps smaller models stay on track, so I leave them alone. I sometimes use greedy sampling (top_k = 1) when I want predictable, robotic output without any creativity. Other than that, I configure only the context window for every model. You can also tweak the system prompt, but that usually damages model performance unless the model is trained for it.
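Greedy sampling can also be switched on from the Ollama CLI for a single session; a quick illustration using the interactive /set command (the model is just an example):

podman exec -it ollama ollama run llama3.2:3b
>>> /set parameter top_k 1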

Models use the context window (stored in the KV cache) to remember what was already said. Context requires a lot of memory, which is why Ollama defaults to just a 2048-token context. If you have enough memory, you probably want to adjust the num_ctx parameter, because Ollama does not handle context-exceeding conversations well. The newest models support impressive context lengths: llama3.1 8B up to 128K tokens at 8K tokens per GB, qwen2.5-coder 7B 128K @ 18K/GB, dolphin-llama3 8B 256K @ 8K/GB, llama3.2 3B 128K @ 9.3K/GB, llama3.2 1B 128K @ 32K/GB, and qwen2.5 0.5B 128K @ 85K/GB. Although the RULER test shows that effective context size is often much smaller than declared, it also shows that 32K+ effective context is common in newer models and that larger context still helps, just less so.
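For use outside Open WebUI, num_ctx can also be passed per request through Ollama's API; a minimal sketch (the model, prompt, and 16384 value are arbitrary examples):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello",
  "options": { "num_ctx": 16384 }
}'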

What performance can you expect?

In general, paid frontier models perform better than free cloud models (not counting rate-limited versions of the paid models), which in turn perform better than local models. You can see this in the Chatbot Arena leaderboard. This hierarchy is upset in two ways. Firstly, if you have high-end hardware, you can match the performance of the free cloud models. Secondly, there's a growing selection of specialized models that can approach (but not quite match) the performance of frontier models in their area of specialization. For coding models, this can be seen in LiveCodeBench, BigCodeBench, and the Aider leaderboard.

In my experience, local models handle about 50% of my questions before I resort to cloud models. They can usually answer all easy questions and serve as a tutor for any popular topic. You can ask them to rewrite text to improve style and catch grammatical errors. Local models can be a reliable natural language scripting engine if the task is simple enough and the LLM is properly instructed and provided with examples. I believe more application opportunities will be unlocked with better hardware and software.
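As an example of the scripting use case, a one-liner along these lines works well (the file name and prompt are made up for illustration):

# non-interactive use: instructions as the prompt, input piped via stdin
cat draft.txt | podman exec -i ollama ollama run llama3.2:3b \
    "Fix grammar and spelling in the following text. Output only the corrected text."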

Speeding things up

Hardware is a big problem, BTW. Like most people shopping before the advent of local LLMs, I bought hardware that is woefully inadequate for running them. Models become barely usable at speeds around 10 tokens/second, which is approximately what you can expect from 7-9B models running on a CPU with 2-channel DDR4-3200. Long-context applications like coding require fast prompt processing, something no CPU can deliver (mine maxes out at 20 t/s). I am not an LLM nerd like the guys hanging out at /r/LocalLLaMA who build multi-GPU rigs just to run the largest LLMs, but I am certainly going for at least a 16GB GPU in my next computer and so should you.

Aside from hardware upgrades, you can speed things up in a number of ways, though not every tweak is worth the effort:

I wouldn't waste time tinkering with thread count (the num_thread parameter). Ollama automatically allocates one thread per physical core, which is optimal, probably because instruction-level parallelism already fully utilizes all cores and additional threads just introduce coordination issues.
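To check whether any tweak actually helps, the CLI can report throughput directly; the --verbose flag prints prompt evaluation and generation rates after each response:

# prints timing stats (prompt eval rate, eval rate) after each reply
podman exec -it ollama ollama run llama3.2:1b --verbose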

What to expect in the future

Hardware will certainly get better. Local AI, including LLMs, has changed the workload composition on personal computers, and hardware is just beginning to adapt. The fastest change will come from consumers simply buying suitable hardware, specifically GPUs with plenty of fast VRAM. I am not particularly knowledgeable about the hardware market, but my guess is that vendors will first scale up existing functionality that favors LLMs and other local AI, then introduce new primitives designed for quantized local models, and eventually get around to architectural changes like on-chip memory.

There are also plenty of opportunities for software and model optimizations, which is where I hope to get a significant performance boost in the next year or two. Code and text completion is an obvious application for local LLMs, but editor support is still scarce and often cumbersome. Domain-specific models could crush much larger generalists, but there are hardly any specialist models at the moment. Numerous architectural improvements are in the pipeline: ternary networks, Diff Transformer, YOCO, multi-token prediction, Mamba, RWKV. Letting LLMs access resources (documents and source code, Internet search, APIs, code execution) can help overcome the size limitations of local LLMs, but the current implementation in Open WebUI and Ollama is limited and unwieldy. Speculative decoding could help with speed, but no popular inference engine uses it yet. iGPUs and AMD/Intel dGPUs could help with multimodal models, long prompts, and energy efficiency, but most of them sit idle for lack of software support.

I am confident there will be steady and fairly fast progress in local LLMs, but cloud LLMs will not go away. With high sparsity and other optimizations, cloud LLMs will eventually grow to be as big as search engines. Instead of replacing cloud LLMs, local LLMs will evolve to support different use cases, especially fine-tuning and continuous training on local data.