Robert Važan

Real-time priority for Ollama

Setting up Ollama on Linux is straightforward if you don't need GPU acceleration, but the underlying llama.cpp engine is very sensitive to competition from background processes and can run as much as 2x slower under load. I run CPU-hogging background processes all the time, so I invested the necessary effort into testing Ollama with real-time priority. I will use Podman here instead of Docker, because I am on Fedora.

Naturally, the first thing I tried was to deprioritize the background process by placing it in a cgroup with a low CPU share. This, however, does not work. The 2x slowdown of Ollama mentioned above was measured when the background process was already limited to a 10% share of CPU.

The most likely cause of the slowdown is that the background process interferes with the scheduling of llama.cpp's threads. Some thread falls behind, and the other threads are left idling while they wait for it to catch up. This is hard to fix purely within llama.cpp code, at least for the transformer architecture, which requires the implementation to repeatedly parallelize small chunks of work, syncing threads after every chunk.
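The stall can be illustrated with a simplified model. This is an assumption for illustration, not a measurement: suppose each synchronized chunk is split evenly over n threads, each needing time t, and exactly one thread is preempted for a time d while working on the chunk. The chunk finishes only when the slowest thread does:

```latex
T_{\text{chunk}} = \max_{1 \le i \le n} t_i = t + d,
\qquad
\text{slowdown} = \frac{T_{\text{chunk}}}{t} = 1 + \frac{d}{t}
```

Under this model, a 2x slowdown corresponds to an average delay d of about t per chunk. Because the chunks are small, even a brief preemption can be comparable to t, which would explain why a background process with a small CPU share can still hurt this much.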

To fix this at the system level, we can tinker with scheduler configuration, specifically with real-time priorities. Let's modify the default setup of Ollama to run with real-time priority:

sudo podman run -d \
    --name ollama \
    --replace \
    --pull=always \
    --restart=always \
    --stop-signal=SIGKILL \
    -p 127.0.0.1:11434:11434 \
    -v ollama:/root/.ollama \
    -e OLLAMA_MAX_LOADED_MODELS=1 \
    -e OLLAMA_NUM_PARALLEL=1 \
    --cap-add=SYS_NICE \
    --entrypoint=/bin/sh \
    docker.io/ollama/ollama \
    -c "chrt 1 /bin/ollama serve"
sudo systemctl enable podman-restart
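To check that the real-time priority actually took effect, the scheduling policy of a running process can be queried with chrt -p. The following is a minimal sketch: show_sched is a hypothetical helper name, and the commented usage assumes the server process is named "ollama" and is visible in the host's process table.

```shell
# Print the scheduling policy and priority of the process with the given PID.
# show_sched is a hypothetical helper, not part of Ollama or Podman.
show_sched() {
    chrt -p "$1"
}

# Example: inspect the current shell, which normally reports SCHED_OTHER.
show_sched "$$"

# For Ollama itself (assumes the process is named "ollama"):
#   for pid in $(pgrep -x ollama); do show_sched "$pid"; done
```

For a process started with chrt 1, the report should show SCHED_RR with priority 1, since round-robin is the policy chrt applies when none is specified.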

Rootless Podman ignores SYS_NICE, so run the container with sudo. I tried both the round-robin scheduler (the chrt default) and the FIFO scheduler, but I don't see any difference between them. Interestingly, the real-time schedulers are 10-20% slower than the default scheduler on an unloaded system, probably because the default scheduler is a bit smarter about spreading load evenly over all cores. But the massive boost under load is worth it. With real-time priority, Ollama performs almost as well under load as it does on an unloaded system.

The system remains stable, because I have a CPU with hyperthreading, which Ollama does not use, so apparent CPU usage is only 50% and the system can schedule other processes freely. I nevertheless noticed significant interference with other real-time processes, notably audio playback. Be warned that without hyperthreading, Ollama with real-time priority will probably crash the system.
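To make comparisons like these on your own machine, generation speed can be computed from the timing fields that Ollama's /api/generate endpoint returns in non-streaming mode. This is only a sketch and not necessarily how the numbers above were obtained: tokens_per_second is a hypothetical helper, the JSON handling is deliberately crude, and the model name in the usage comment is a placeholder.

```shell
# Compute tokens per second from an /api/generate response read on stdin.
# Relies on the eval_count and eval_duration (nanoseconds) response fields.
tokens_per_second() {
    # Crude JSON handling: put each key-value pair on its own line,
    # then pick the two fields. The leading quote in the patterns keeps
    # prompt_eval_count and prompt_eval_duration from matching.
    tr ',{}' '\n\n\n' | awk -F: '
        /"eval_count"/    { count = $2 }
        /"eval_duration"/ { duration = $2 }
        END { if (duration > 0) printf "%.1f\n", count / duration * 1e9 }'
}

# Usage (assumes Ollama listens on 127.0.0.1:11434 and that the model named
# in MODEL, a placeholder, has already been pulled):
#   MODEL=llama3
#   curl -s http://127.0.0.1:11434/api/generate \
#       -d "{\"model\": \"$MODEL\", \"prompt\": \"Why is the sky blue?\", \"stream\": false}" \
#       | tokens_per_second
```

Running the same prompt with and without the background load, and with and without real-time priority, gives directly comparable numbers.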