Robert Važan

Running Ollama on AMD iGPU

Running Ollama on CPU cores is the trouble-free solution, but CPU-only computers also have an iGPU, which happens to be faster than all the CPU cores combined despite its tiny size and low power consumption. With some tinkering and a bit of luck, you can employ the iGPU to improve performance. Here's my experience getting Ollama to run on the iGPU of an AMD Ryzen 5600G under Linux, inside a Podman container.

Even though iGPUs can dynamically allocate host RAM via UMA/GTT/GART and llama.cpp supports this via a compile-time switch, there's currently no UMA support in Ollama. The only option is to reserve some RAM as dedicated VRAM in the BIOS, if your system supports it (some notebooks don't). In my case, it defaults to a puny 512MB, but it can be configured to any power of two up to 16GB. I opted for 8GB of VRAM, which is sufficient for a quantized 7-8B model (4GB), KV cache and buffers (1GB), desktop and applications (1-2GB), and some headroom (1GB). The multimodal llava 7B is a bit larger (5GB), but it still fits. If you want to run 13-20B models, you will need to reserve 16GB of RAM as VRAM.
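To verify how much VRAM the BIOS actually carved out, you can ask the amdgpu driver directly. This is just a quick sanity check; the card index in the sysfs paths below is an assumption and may differ on your system:

cat /sys/class/drm/card0/device/mem_info_vram_total   # dedicated VRAM in bytes; card index may differ
cat /sys/class/drm/card0/device/mem_info_gtt_total    # GTT (shared system RAM) limit, for comparison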

How you run Ollama with GPU support depends on your GPU vendor. I have an AMD processor, so these instructions are AMD-only. To make Ollama use the iGPU on AMD processors, you will need the Docker image variant that bundles ROCm, AMD's GPU compute stack. It's a separate image, because ROCm adds 4GB to the image size (no kidding). You will also have to give it a few more parameters compared to the CPU-only setup:

podman run -d --name ollama --replace --pull=always --restart=always \
    -p 127.0.0.1:11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    --device /dev/dri --device /dev/kfd \
    -e HSA_OVERRIDE_GFX_VERSION=9.0.0 -e HSA_ENABLE_SDMA=0 \
    docker.io/ollama/ollama:rocm

ROCm has a very short list of supported GPUs. The environment variables trick ROCm into using the unsupported iGPU in my Ryzen 5600G. You might have to adjust the variables and Ollama/ROCm versions for other unsupported GPUs. From what I have read about GPU access in containers, the container remains properly sandboxed despite all the sharing.
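To sanity-check that the GPU is actually being picked up, two things that help (the exact log wording changes between Ollama versions, and the rocminfo path assumes ROCm tools are installed on the host):

podman logs ollama 2>&1 | grep -i -e rocm -e gfx -e vram   # GPU detection messages; wording varies by version
/opt/rocm/bin/rocminfo | grep -i gfx                       # the 5600G iGPU should report as gfx90c before the override maps it to gfx900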

If you do all this and Ollama does not error out, crash, or hang, you should get a nice performance boost. In my case, I am observing over 2x faster prompt processing, approaching 40 tokens/second for a 7B model on an unloaded system. Time to process an image with moondream dropped to 6 seconds. Generation speed is still limited by memory bandwidth at 10 tokens/second, but it is no longer impacted by background workload on the CPU, which is the killer feature of iGPU inference for me. Ollama still uses some CPU time even when the whole model runs on the iGPU, but the CPU load is negligible now.
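If you want to measure this yourself, ollama run has a --verbose flag that prints prompt and generation speed after each response. The model name below is just an example; substitute whatever you have pulled:

# --verbose prints timing stats (prompt eval rate, eval rate in tokens/s) after each response
podman exec -it ollama ollama run llama2 --verbose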

The whole thing is a bit shaky though. Since the GPU is unsupported, any future update can break it. Inference on the iGPU sometimes goes off the rails and produces garbage until Ollama is restarted. Even when it works, the output is a tiny bit different from what the CPU produced (with top_k = 1), and the first run on the iGPU produces slightly different output from the second and subsequent runs. Ollama sometimes fails to offload all layers to the iGPU when switching models, reporting low VRAM as if parts of the previous model were still in VRAM. This hurts performance and it gets worse over time, but restarting Ollama fixes the problem for a while. Offloading of Mixtral layers to the iGPU is broken; the model just hangs.
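When performance drops, it helps to check how much of the model actually landed on the iGPU, to tell the stale-VRAM problem apart from other slowdowns. A few ways to check (ollama ps exists only in newer Ollama versions, and the exact log wording varies):

podman exec -it ollama ollama ps                   # newer versions show the CPU/GPU split of loaded models
podman logs ollama 2>&1 | grep -i offload          # how many layers were offloaded; wording varies by version
cat /sys/class/drm/card0/device/mem_info_vram_used # VRAM in use per the amdgpu driver; card index may differ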

I am nevertheless happy with the solution. The increased speed and independence from system load make Ollama more practical. I am going to keep this setup until I have a dedicated GPU.