Local LLMs on Linux with Ollama
I finally got around to setting up a local LLM, almost a year after I declared that AGI is here. I have low-cost hardware and I didn't want to tinker too much, so after messing around for a while, I settled on CPU-only Ollama and Open WebUI, both of which can be installed easily and securely in a container. Ollama has a big model library while Open WebUI is rich in convenient features. Ollama is built on top of the highly optimized llama.cpp.
Setup
Setup is super simple. No GPU is needed. Both projects have instructions for running in Docker containers. See the relevant Ollama blog post, the Open WebUI README, and the Podman section in the Open WebUI setup guide. I have tweaked the instructions a bit to use Podman instead of Docker (I am using Fedora) and to restart automatically after reboot:
podman run -d --name ollama --replace --pull=always --restart=always \
    -p 127.0.0.1:11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    docker.io/ollama/ollama
podman run -d --name open-webui --replace --pull=always --restart=always \
    -p 127.0.0.1:3000:8080 --network=pasta:-T,11434 \
    --add-host=ollama.local:127.0.0.1 \
    -e OLLAMA_BASE_URL=http://ollama.local:11434 \
    -v open-webui:/app/backend/data \
    ghcr.io/open-webui/open-webui:main
systemctl --user enable podman-restart
You can now access Open WebUI at http://localhost:3000.
To update your installation, just run the above commands again.
I have also created some aliases/scripts to make it convenient to invoke Ollama from the command line, because without aliases, the containerized CLI gets a bit verbose:
podman exec -it ollama ollama run tinyllama
Or alternatively run the CLI in a separate container:
podman run -it --rm \
    --network=pasta:-T,11434 --add-host=ollama.local:127.0.0.1 \
    -e OLLAMA_HOST=http://ollama.local:11434 \
    docker.io/ollama/ollama run tinyllama
Why run LLMs locally?
I used to have a GPT-4 subscription, but it was barely paying for itself. It saved less than 10% of my time and I wasted a lot of time tinkering with it. Local LLMs are free and increasingly good. Then there are all the issues with the cloud. A cloud LLM can change, disappear, or get more expensive at any moment. It keeps asking for my feedback and other data, which only serves the operator while I get my data locked up. I am quite sensitive about privacy and freedom, and although I don't run into guardrails often, it's annoying when I do. Even though ChatGPT is smart, it's often unnecessarily creative when I just want it to follow instructions. Local LLMs offer more control over output. The API gives more control too, but it can get crazy expensive if some script gets stuck in a loop.
Choosing models
My current favorite models are llama3 8B for general topics, codeqwen 7B for programming, and dolphin-mistral 7B for overcoming refusals. If you don't have enough memory for those, try phi3 4B. Uncensored models are less capable, but they are useful when other models refuse to answer.
The default 4-bit quantization makes models smaller and faster with negligible loss of accuracy. 3-bit quantization cuts into accuracy perceptibly, but it's still better than resorting to a smaller model. There's no point in running models with more than 4 bits per parameter. If you have powerful hardware, just run a larger model instead.
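To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It assumes weight size is simply parameter count times bits per parameter. Real quantized files come out somewhat larger because quantization formats store extra scaling data, and the context needs memory on top. The function name and the chosen bit widths are illustrative, and the parameter counts are the nominal ones quoted above.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of model weights in GB (1 GB = 10^9 bytes)."""
    # (params_billion * 1e9 parameters) * bits / 8 bits per byte / 1e9 bytes per GB
    return params_billion * bits_per_param / 8


# Nominal model sizes as quoted in the text; bit widths are illustrative.
for name, params in [("phi3", 4), ("codeqwen", 7), ("llama3", 8)]:
    sizes = ", ".join(
        f"{bits}-bit: {weights_gb(params, bits):.1f} GB" for bits in (3, 4, 8, 16)
    )
    print(f"{name} ({params}B): {sizes}")
```

The estimate shows why a larger model at 4 bits can fit in the same memory as a smaller one at 8 bits.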
Model settings
Open WebUI provides a convenient UI for tweaking Ollama's parameters as well as for creating custom modelfiles. Default values for parameters temperature, top_k, and top_p narrow the output probability distribution, which helps smaller models stay on track, so I leave them alone. I just have one modelfile with greedy sampling (top_k = 1) for when I want predictable, robotic output without any creativity. You can also tweak the system prompt, but that usually damages model performance unless the model is trained for it.
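To illustrate what "narrowing the distribution" means, here is a minimal pure-Python sketch of temperature, top-k, and top-p filtering applied to a toy set of token scores. It shows the general technique only. It is not Ollama's or llama.cpp's actual sampler, and the function name, the toy logits, and the parameter values in the example call are made up for illustration.

```python
import math


def narrow(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the peak for numerical stability
    weights = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top-k: keep only the k most likely tokens (0 means no limit here).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1 again.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"the": 2.0, "a": 1.0, "zebra": -1.0}  # made-up scores
print(narrow(toy_logits))                            # unfiltered softmax
print(narrow(toy_logits, temperature=0.8, top_k=40, top_p=0.9))
print(narrow(toy_logits, top_k=1))                   # greedy: one candidate left
```

With top_k = 1 only the most likely token survives, which is why the output becomes fully predictable.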
Models use a context window (held in memory as the KV cache) to remember what was already said. Context requires a lot of memory, which is why Ollama defaults to just a 2048-token context. If you have enough memory, you probably want to adjust the num_ctx parameter, because Ollama does not handle context-exceeding conversations well. Llama3 supports up to 8K context tokens (1GB RAM), Codeqwen up to 64K (4GB total, 16K per 1GB), Mistral up to 32K (4GB total, 8K per 1GB), and Phi3 up to 4K (1.5GB total, 2.7K per 1GB).
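The figures above can be turned into a quick estimate of how large a num_ctx you can afford. The sketch below takes the tokens-per-GB and maximum-context numbers straight from the previous paragraph and assumes memory use grows linearly with context length. The table and function names are hypothetical, and the result is an estimate, not something Ollama reports.

```python
# model: (context tokens per 1 GB of RAM, maximum supported context),
# both copied from the figures quoted in the text.
CONTEXT_FIGURES = {
    "llama3": (8 * 1024, 8 * 1024),
    "codeqwen": (16 * 1024, 64 * 1024),
    "mistral": (8 * 1024, 32 * 1024),
    "phi3": (2.7 * 1024, 4 * 1024),
}


def affordable_num_ctx(model: str, spare_gb: float) -> int:
    """Largest context (in tokens) that fits in spare_gb, capped at the model maximum."""
    tokens_per_gb, max_ctx = CONTEXT_FIGURES[model]
    return int(min(max_ctx, tokens_per_gb * spare_gb))


for model in CONTEXT_FIGURES:
    print(model, affordable_num_ctx(model, spare_gb=2))
```

The spare memory here is whatever is left after the model weights and the rest of the system have taken their share.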
What performance can you expect?
In general, paid cloud models perform better than free ones, which perform better than local models. You can see it in the Chatbot Arena leaderboard. This hierarchy is upset in two ways. Firstly, if you have powerful hardware, you can match the performance of free cloud models. Secondly, there's a growing selection of specialized models that can match the largest generalist cloud models in their area of specialization. For programming tasks, this can be seen in the EvalPlus leaderboard.
In my experience, local models handle about 50% of my questions before I resort to cloud models. They can usually handle all easy questions and serve as a tutor for any popular topic. You can ask them to rewrite text to improve style and catch grammatical errors. Local models can be a reliable natural language scripting engine if the task is simple enough and the LLM is properly instructed and provided with examples. I believe more application opportunities will be unlocked with better hardware and software.
Speeding things up
Hardware is a big problem, BTW. My computer is only a few months old, but it is a low-cost one. I am not an LLM nerd like the guys hanging out at /r/LocalLLaMA who build multi-GPU rigs just to run the largest LLMs. GPUs are bloody expensive these days and they do not have anywhere near enough RAM. I therefore opted for a cheap box with an iGPU and lots of system RAM. The downside is that inference is slow. Token rate is about 20-35% lower than what you would guess from model size and memory bandwidth, probably because of context access, but also because some parts of inference are not bandwidth-limited. Models become barely comfortable to use at speeds above 10 tokens/second, which is approximately what you can expect from a 7B model like Mistral on 2-channel DDR4-3200.
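The guess from model size and memory bandwidth can be written down as a small calculation: each generated token has to read all the weights once, so the ideal rate is bandwidth divided by model size. The sketch below assumes a 64-bit (8-byte) bus per memory channel and a 4.1 GB file for a 4-bit 7B model. That file size is my assumption, so plug in the size of the model you actually run. The function names are illustrative.

```python
def dram_bandwidth_gbs(mega_transfers_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical memory bandwidth in GB/s, e.g. DDR4-3200 means 3200 MT/s."""
    return mega_transfers_per_s * bus_bytes * channels / 1000


def token_rate_range(model_gb: float, bandwidth_gbs: float, penalty=(0.20, 0.35)):
    """Expected (low, high) tokens/second after the 20-35% shortfall from the text."""
    ideal = bandwidth_gbs / model_gb  # every token reads all weights once
    small, large = penalty
    return ideal * (1 - large), ideal * (1 - small)


bandwidth = dram_bandwidth_gbs(3200, channels=2)  # 2-channel DDR4-3200
low, high = token_rate_range(4.1, bandwidth)      # 4.1 GB is an assumed model size
print(f"{bandwidth:.1f} GB/s, about {low:.1f} to {high:.1f} tokens/second")
```

Under these assumptions the estimate lands at roughly 8 to 10 tokens/second, in line with the 10 tokens/second figure above.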
You can speed things up in a number of ways:
- Open WebUI can use the configured LLM to generate titles in your chat history, but it's such a performance killer that I have disabled the feature.
- When Ollama exhausts its context window, it discards the earliest turns of the conversation and reprocesses the rest, which is slow on CPU. I always set the context window as big as I can afford to avoid the reprocessing lag. This also makes models smarter.
- Ollama unloads the model and clears the context cache when it is unused for 5 minutes. If you return to the chat later, it takes 10 seconds to reload the model and then more time to reprocess the context. You can set the "Keep Alive" setting in Open WebUI to 24 hours to effectively disable this behavior.
- If you have background processes competing for the processor, it might be worth running Ollama with real-time priority.
- You can run Ollama on an AMD iGPU for faster prompt processing, lower energy use, and lower load on CPU cores. Intel iGPUs might work too, but I haven't tested that.
- If the model supports a custom system prompt, keep it short. Ollama can cache the system prompt, but keeping it short still helps a bit, especially with the first query.
- I wouldn't waste time tinkering with thread count (parameter num_thread). Ollama automatically allocates one thread per physical core, which is optimal, probably because instruction-level parallelism already fully utilizes all cores and additional threads just introduce thread coordination issues.
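If you call Ollama from your own scripts instead of Open WebUI, the context size and keep-alive tips can be applied per request. The sketch below builds a request for Ollama's /api/generate endpoint using only the standard library. As far as I know, keep_alive and options.num_ctx are the relevant request fields, but check them against the API documentation for your Ollama version. The URL is the port published in the Setup section, and the model name and values are just examples.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # port published by the podman command in Setup


def build_generate_request(prompt, model="llama3", num_ctx=8192, keep_alive="24h"):
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of chunks
        "keep_alive": keep_alive,         # keep the model loaded between requests
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running Ollama instance and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["response"]
```

Calling generate("Why is the sky blue?") requires the ollama container to be running with the model already pulled.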
What to expect in the future
This sets priorities for future hardware purchases. Nothing else on my computer suffers from hardware constraints as much as local LLMs. If you are willing to pay hundreds of euros per year per subscription for access to cloud models, you might as well spend a thousand euros or more on new hardware to run models locally and get local model benefits like privacy, control, and choice. Local compute also eliminates usage caps and network latency of cloud models.
High-end DDR5 doubles memory bandwidth, which makes Mixtral and 13B dense models sufficiently fast, but larger dense models will not be practical without more memory channels, which are currently rare and expensive. GPUs have a wide memory bus, but they instead constrain model size via limited VRAM. You need 2x16GB for 30B+ models and 3x16GB for 70B models. 24GB GPUs are unreasonably expensive. A smaller 8-16GB GPU setup is still useful for multimodal models like llava and for long prompts, but even some iGPUs are going to be fast enough for that. The newly announced CPUs with in-package high-speed RAM will enable iGPUs to run 30B+ models.
There are also plenty of opportunities for software and model optimizations, which is where I hope to get a significant performance boost in the next year or two. Code and text completion is an obvious application for local LLMs, but editor support is still scarce and often cumbersome. Domain models could crush much larger generalists, but there are hardly any specialist models at the moment. Context lengths have increased, mostly thanks to GQA, but we need more and there are promising techniques on the horizon (YOCO and various token-merging algorithms). Letting LLMs access resources (documents and source code, Internet search, APIs, code execution) can help overcome size limitations of local LLMs, but the current implementation in Open WebUI and Ollama is limited and unwieldy. RWKV, Mamba, and ternary networks promise faster inference and other benefits. Speculative decoding can help a lot, but no open-weights model uses it. Beam search would be essentially free for local inference. iGPUs and AMD/Intel dGPUs could help with multimodal models, long prompts, and energy efficiency, but most of them sit idle for lack of software support. MoE and sparsity are underutilized.
I am very optimistic about software improvements. The area is exciting and attracts lots of talented people. I am not going to contribute anything beyond bug reports though, because I need to tend to my own business and LLMs are a mere productivity boost for me. Money for training hardware will keep coming from governments and enterprises worried about data security. Hardware will also improve, although not as quickly, because hardware is expensive to change and also because vendors hesitate to commit to computational primitives that might be rendered obsolete by next year's software optimizations.
I am confident there will be steady and fairly fast progress in local LLMs, but cloud LLMs will not go away. With high sparsity and other optimizations, cloud LLMs will eventually grow to be as big as search engines. Instead of replacing cloud LLMs, local LLMs will evolve to support different use cases.