I finally got around to setting up a local LLM, almost a year after I declared that AGI is here. I have low-cost hardware and I didn't want to tinker too much, so after messing around for a while, I settled on CPU-only Ollama and Ollama WebUI, both of which can be installed easily and securely in a container. Ollama has a big model library, while Ollama WebUI is rich in convenient features. Ollama is built on top of the highly optimized llama.cpp.
Setup is super simple. No GPU is needed. Both projects have instructions for running in docker containers. See the relevant ollama blog post and WebUI README. I have tweaked the instructions a bit to use podman instead of docker (I am using Fedora) and to restart automatically after reboot:
podman run -d --name ollama --replace --restart=always \
    -p 11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    docker.io/ollama/ollama
podman run -d --name ollama-webui --replace --restart=always \
    -p 3000:8080 -v ollama-webui:/app/backend/data \
    --add-host=host.docker.internal:host-gateway \
    ghcr.io/ollama-webui/ollama-webui:main
systemctl --user enable podman-restart
I have also created some aliases/scripts to make it very convenient to invoke ollama from the command line, because without aliases, the containerized CLI gets a bit verbose:
podman exec -it ollama ollama run tinyllama
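A minimal sketch of what such a wrapper could look like, assuming the container is named ollama as in the commands above. The function name and the PODMAN override variable are hypothetical and only serve the illustration; the podman exec invocation is the only part taken from the text.

```shell
# Sketch of a wrapper that forwards its arguments to the ollama CLI inside the container.
# Assumption: the container is named "ollama", as in the podman run command above.
# PODMAN is a hypothetical override (e.g. PODMAN="sudo podman"); podman itself does not read it.
ollama() {
    if [ -t 0 ] && [ -t 1 ]; then
        # Interactive terminal: allocate a TTY so that the chat prompt works.
        ${PODMAN:-podman} exec -it ollama ollama "$@"
    else
        # Piped input or output: keep stdin open but do not allocate a TTY.
        ${PODMAN:-podman} exec -i ollama ollama "$@"
    fi
}
```

With this function in ~/.bashrc, commands like ollama run tinyllama or ollama list read the same as they would with a natively installed CLI.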
Why run LLMs locally?
I have a GPT-4 subscription, but it's barely paying for itself. It saves less than 10% of my time and I waste a lot of time tinkering with it. Local LLMs are free and increasingly good. Then there are all the issues with the cloud. A cloud LLM can change, disappear, or get more expensive at any moment. It keeps asking for my feedback and other data, which only serves the operator while I get my data locked up. I am quite sensitive about privacy and freedom, and although I don't run into guardrails often, it's annoying when I do. I am also hoping that a local LLM will offer more control, because even though GPT-4 is smart, it's often unnecessarily creative when I just want it to follow instructions. The API gives more control, but it can get crazy expensive if some script gets stuck in a loop.
My current favorite model is dolphin-mistral 7B, followed by vanilla mixtral 8x7B, which is slower but smarter. 3B and smaller models are really fast even on CPU, but they are a confused, hallucinating mess. If you must, orca-mini 3B is the least bad one. Dolphin variants are mostly intended as uncensored LLMs, but uncensoring also has the surprising effect of making models more amenable to tweaking with crafted prompts. 4-bit quantization makes models smaller and faster with negligible loss of accuracy. 3-bit quantization cuts into accuracy perceptibly, but it's still better than resorting to a smaller model. There's no point in running models with more bits per parameter. If you have powerful hardware, just run a larger model instead.
Ollama WebUI provides a convenient UI for Ollama's custom modelfiles, which can be used to set the system prompt and to tweak parameters. This is important for smaller models, which are often unsure. This uncertainty manifests in overly wide output probability distributions. To compensate, I tighten the available parameters (temperature, top_k, top_p) to narrow the distribution. I have even created custom modelfiles with greedy sampling (top_k = 1) for when I absolutely don't want any creativity. Beware that narrowing the output distribution is a hack that makes smaller models vulnerable to repeat loops, so use it with care. Sufficiently large models are confident in their output and they should be tweaked only via the system prompt. Ditto for smaller models used for creative output. As for other parameters, I remove the output length limit (num_predict = -1), relax repeat_penalty to allow the model to output repetitive code when I need it, and expand context size (num_ctx) above the default 2048 (Mistral/Mixtral support up to 32K), because ollama does not handle context-exceeding conversations well.
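The parameters discussed above can also be written out as a plain modelfile. The following is a sketch under stated assumptions, not a recommended configuration: the base model, file location, model name dolphin-strict, system prompt wording, and the specific values (temperature 0, repeat_penalty 1.0, num_ctx 8192) are all illustrative choices. Only the parameter names and the greedy top_k = 1 setting come from the text.

```shell
# Write a hypothetical "strict" modelfile that narrows the output distribution.
# File location, base model, and parameter values are illustrative assumptions.
modelfile="${TMPDIR:-/tmp}/Modelfile.strict"
cat > "$modelfile" <<'EOF'
FROM dolphin-mistral
SYSTEM You are a helpful assistant. Keep your answers brief.
# Greedy sampling: always pick the single most likely token.
PARAMETER top_k 1
PARAMETER temperature 0
# No output length limit.
PARAMETER num_predict -1
# 1.0 switches the repeat penalty off, so repetitive code is allowed.
PARAMETER repeat_penalty 1.0
# Larger context window than the default 2048.
PARAMETER num_ctx 8192
EOF

# To register it with the containerized Ollama (container name "ollama" assumed):
#   podman cp "$modelfile" ollama:/tmp/Modelfile.strict
#   podman exec ollama ollama create dolphin-strict -f /tmp/Modelfile.strict
```

The same content can be pasted into the modelfile editor in WebUI instead of going through the command line. Since greedy sampling invites repeat loops on small models, it is probably wise to keep a second, less strict modelfile around.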
But no matter how much you tweak them, small local models aren't of much practical use. Compared to cloud LLMs, a local 7B model is an alpha-stage technology demo. Nobody would pay for GPT-4 if the free ChatGPT, reportedly a 175B model, was good enough, so what do you expect from a 7B model? People mostly use local LLMs for entertainment, especially role-play. The more serious business use cases rely on fine-tuning, which is currently impractical for individual users and next to impossible without a high-end GPU. Smaller models are however okay for simple Q&A, overviews, and recommendations. Summarization and document indexing are feasible, but you need a GPU to process the long prompts quickly. Code and text completion might work well if the editor supports it and you either have a GPU or the editor allows you to limit context length.
Speeding things up
Hardware is a big problem, BTW. My computer is only a few months old, but it is a low-cost one. I am not an LLM nerd like the guys hanging out at /r/LocalLLaMA who build multi-GPU rigs just to run the largest LLMs. GPUs are bloody expensive these days and they do not have anywhere near enough RAM. I therefore opted for a cheap box with iGPU and lots of system RAM. The downside is that inference is slow. Token rate is about 20-35% lower than what you would guess from model size and memory bandwidth, probably because of various inefficiencies in llama.cpp, but also because some parts of inference are not memory-bound. Models become barely comfortable to use at speeds above 10 tokens/second, which is approximately what you can expect from a 7B model like Mistral on 2-channel DDR4-3200.
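The estimate from model size and memory bandwidth can be reproduced with a little arithmetic. This is a rough sketch that assumes every generated token requires reading the whole model from RAM once, that each memory channel moves 8 bytes per transfer, and that a 4-bit 7B model file is about 4.1 GB. The function names and the 4.1 GB figure are assumptions for illustration.

```shell
# Peak bandwidth in GB/s: channels x 8 bytes per transfer x megatransfers per second.
peak_bandwidth_gbs() {
    awk -v ch="$1" -v mts="$2" 'BEGIN { printf "%.1f\n", ch * 8 * mts / 1000 }'
}

# Tokens per second if each token requires one full pass over the model in RAM.
# The optional third argument is an efficiency factor (1 = theoretical bound).
token_rate() {
    awk -v bw="$1" -v gb="$2" -v eff="${3:-1}" 'BEGIN { printf "%.1f\n", bw / gb * eff }'
}

bw=$(peak_bandwidth_gbs 2 3200)   # 2-channel DDR4-3200
token_rate "$bw" 4.1              # theoretical bound for an assumed 4.1 GB model file
token_rate "$bw" 4.1 0.80         # with a 20% shortfall
token_rate "$bw" 4.1 0.65         # with a 35% shortfall
```

Under these assumptions the script prints 12.5, 10.0, and 8.1 tokens/second, which lines up with the roughly 10 tokens/second quoted above.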
To speed things up, my system prompt usually consists of only two sentences: one giving the AI a role (assistant) and one requesting brevity. Ollama can cache the system prompt, but keeping it short still helps a bit, especially with the first query. I stick to one model to avoid the cost of ollama switching models and clearing the attention cache. The attention cache (also called KV cache) is essential for performance in multi-turn conversations. If it is cleared, ollama will reconstruct it by reevaluating the whole conversation from the beginning, which is slow on CPU. Ollama WebUI can use the configured LLM to generate titles in your chat history, but it's such a performance killer that I have disabled the feature. By default, Ollama unloads the model and discards the attention cache after 5 minutes of inactivity. This can be configured as "Keep Alive" in WebUI settings and I have set it to a higher value to ensure quick response even if I come back to the conversation a bit later.
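When calling Ollama directly instead of going through WebUI, the short system prompt and the longer keep-alive can be put into the API request itself. The sketch below assumes Ollama's /api/generate endpoint with its system and keep_alive fields. The function name, the prompt wording, and the 1h default are hypothetical.

```shell
# Build a JSON body for Ollama's /api/generate endpoint.
# Usage: generate_body MODEL PROMPT [KEEP_ALIVE]
# Caveat: arguments are inserted verbatim, so they must not contain quotes or backslashes.
generate_body() {
    printf '{"model":"%s","system":"%s","prompt":"%s","keep_alive":"%s","stream":false}\n' \
        "$1" "You are a helpful assistant. Keep answers brief." "$2" "${3:-1h}"
}

# Example call (needs the ollama container from above listening on port 11434):
#   curl -s http://localhost:11434/api/generate \
#       -d "$(generate_body dolphin-mistral 'Why is the sky blue?')"
```

A real script would build the JSON with a proper tool such as jq to get escaping right; printf is used here only to keep the sketch dependency-free.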
Llama.cpp is very sensitive to competition from background processes, running as much as 2x slower even if the background process is in a cgroup with low CPU share. The most likely cause is that the background process interferes with scheduling of llama.cpp's threads, which causes some thread to fall behind and then the other threads are left idling while they wait for the affected thread to catch up. This is hard to fix purely within llama.cpp code, at least for transformer architecture, which requires the implementation to repeatedly parallelize small chunks of work, syncing threads after every chunk. To fix this on system level, we can tinker with scheduler configuration, specifically with real-time priorities. I run CPU-hogging background processes all the time, so I invested the necessary effort into granting ollama real-time priority:
sudo podman run -d --name ollama --replace --restart=always \
    -p 11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    --cap-add=SYS_NICE --entrypoint=/bin/sh \
    docker.io/ollama/ollama \
    -c "chrt 1 /bin/ollama serve"
sudo systemctl enable podman-restart
Rootless podman ignores SYS_NICE, so run with sudo, as in the command above.
I tried both the round-robin scheduler (chrt default) and the FIFO scheduler, but I don't see any difference.
Interestingly, real-time schedulers are 10-20% slower than default scheduler on an unloaded system,
probably because default scheduler is a bit smarter about evenly spreading load over all cores.
But the massive boost under load is worth it.
With real-time priority, ollama performs almost as well as it does on an unloaded system.
System remains stable, because I have CPU with hyperthreading,
which ollama does not use, so apparent CPU usage is only 50% and system can schedule other processes freely.
I nevertheless noticed significant interference with other real-time processes, notably audio playback.
Be warned that without hyperthreading, ollama with real-time priority will probably crash the system.
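To check whether the real-time policy actually took effect, the scheduling policy of the server process can be read from /proc. This is a sketch that assumes a Linux host and the field layout documented in proc(5), where field 40 of /proc/PID/stat is the real-time priority and field 41 is the policy. The function name and the pgrep invocation in the usage comment are illustrative.

```shell
# Read one /proc/PID/stat line on stdin and print the scheduling policy and rt priority.
sched_policy() {
    awk '{
        # Drop "pid (comm) ". The greedy match copes with spaces or ")" inside comm.
        sub(/^.*\) /, "")
        # Policy numbers 0..6 map to these names; awk arrays from split() start at 1.
        split("SCHED_OTHER SCHED_FIFO SCHED_RR SCHED_BATCH SCHED_ISO SCHED_IDLE SCHED_DEADLINE", name, " ")
        # After the cut, field 1 is the state (overall field 3), so overall
        # fields 40 and 41 are now fields 38 and 39.
        print name[$39 + 1], $38
    }'
}

# Example (assumes the server process is named "ollama"; -o picks the oldest match):
#   sched_policy < "/proc/$(pgrep -o -x ollama)/stat"
```

For a server started with chrt 1 and no policy option, the expected result is SCHED_RR with priority 1, while a process under the default scheduler reports SCHED_OTHER with priority 0.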
Ollama allocates one thread per physical core, but this can be configured in a custom modelfile. My experiments show that inference can work with fewer threads, because it is bottlenecked on RAM bandwidth. It even runs slightly faster with one less thread on an unloaded system. But prompt processing can definitely use all available cores. Increasing thread count beyond one thread per core actually worsens performance, probably because instruction-level parallelism already fully utilizes all cores and additional threads just introduce thread coordination issues.
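The thread count is set with the num_thread parameter in a modelfile. The sketch below derives a value of one less than the number of physical cores, which is an illustrative choice based on the observation above rather than a general recommendation. It assumes lscpu from util-linux is available on the host. Both function names are hypothetical.

```shell
# Count physical cores from "lscpu -p=Core,Socket" output supplied on stdin.
# Hyperthreads share a (core, socket) pair, so unique pairs are physical cores.
physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Print a modelfile line that uses one thread less than the given core count.
thread_param() {
    echo "PARAMETER num_thread $(( $1 - 1 ))"
}

# Example on a live system:
#   thread_param "$(lscpu -p=Core,Socket | physical_cores)"
```

The printed line can be added to a custom modelfile. Since prompt processing benefits from all cores, a lower thread count trades some prompt speed for slightly faster generation.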
What to expect in the future
This sets priorities for future hardware purchases. Nothing else on my computer suffers from hardware constraints as much as local LLMs. DDR5 will double memory bandwidth, which will make Mistral and perhaps even Mixtral sufficiently fast, but larger models will not be competitive without more memory channels, which are currently rare and expensive. GPUs have wide memory bus, but they instead constrain model size via limited VRAM unless you buy a really expensive GPU (I wouldn't). GPU might be nevertheless useful to run multimodal models, which are currently unusable on CPU (2 minutes to analyze single image with llava). Prompt processing is also limited by compute rather than by memory bandwidth. GPUs and new processors with full AVX-512 support should help with that. On-chip memory bus, whether in the form of HBM or in-memory computation, is the future, but it will take a decade to trickle down to low-cost computers.
There are however plenty of opportunities for software and model optimizations, which is where I hope to get a significant performance boost in the next year or two. Mistral shows that a well-trained 7B model can deliver impressive results. A properly trained 3B model could approximate it while delivering lightning speed. Code and text completion is an obvious application for local LLMs, but editor support is still scarce and often cumbersome. Domain models could crush much larger generalists, but there are hardly any specialist models at the moment. Lightweight local fine-tuning could fix style and conventions without excessive prompting, but it's not exactly a pushbutton experience yet. Letting LLMs access tools, the Internet, and supporting databases can help overcome their size limitations. RWKV and Mamba promise faster inference and other benefits. Speculative decoding can help a lot, but no open-weights model uses it. Beam search would be essentially free for local inference. iGPUs and AMD/Intel dGPUs could help with multimodal models, long prompts, and energy efficiency, but they sit idle for lack of software support. MoE and sparsity are underutilized. Ollama will sometimes swamp the limited zram swap space on Linux for some reason instead of allocating regular heap memory. Word on the street is that there are significant multi-threading inefficiencies in llama.cpp, which are probably behind the performance problems on busy systems.
I am very optimistic about software improvements. The area is exciting and attracts lots of talented people. I am not going to contribute anything beyond bug reports though, because I need to tend to my own business and LLMs are a mere productivity boost for me. Money for training hardware will keep coming from universities and from enterprises worried about data security. Hardware will also improve, although not as quickly, because hardware is expensive to change and also because vendors hesitate to commit to computational primitives that might be rendered obsolete by next year's software optimizations.
I am confident there will be steady and fairly fast progress in local LLMs, but cloud LLMs will not go away. With high sparsity and other optimizations, cloud LLMs will eventually grow to be as big as search engines. Instead of replacing cloud LLMs, local LLMs will evolve to support different use cases.