Robert Važan

Running Ollama on AMD iGPU

Running Ollama on CPU cores is the trouble-free solution, but most computers without a discrete GPU still have an iGPU, which happens to be faster than all CPU cores combined despite its tiny size and low power consumption. With some tinkering and a bit of luck, you can employ the iGPU to improve performance. Here's my experience getting Ollama to run on an AMD Ryzen 5 5600G (Radeon Vega 7, GCN 5.0) under Linux and inside a Podman container.

Discrete GPU setup

Although this article is about integrated GPUs, I will first describe the simpler setup process for a discrete GPU, partly to serve as a basis for the iGPU setup and partly to demonstrate what iGPU setup should look like in the future once iGPU support in ROCm and Ollama improves.

We will make two changes to the CPU-only setup:

- Pass the GPU device nodes /dev/dri and /dev/kfd into the container, so that ROCm inside the container can access the GPU.
- Use the rocm image tag, which pulls an image with ROCm libraries included.

This is what the complete command looks like:

podman run -d \
    --name ollama \
    --replace \
    --pull=always \
    --restart=always \
    --stop-signal=SIGKILL \
    -p 127.0.0.1:11434:11434 \
    -v ollama:/root/.ollama \
    -e OLLAMA_MAX_LOADED_MODELS=1 \
    -e OLLAMA_NUM_PARALLEL=1 \
    --device /dev/dri \
    --device /dev/kfd \
    docker.io/ollama/ollama:rocm

This should work flawlessly with any recent AMD dGPU. If you have older hardware, you might have to set the HSA_OVERRIDE_GFX_VERSION variable to fool ROCm into using a GPU that is not on the extremely short list of GPUs supported by ROCm.
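For example, an unsupported RDNA 2 card can usually be spoofed as gfx1030 by adding one more option to the podman run command above. The exact version to spoof depends on your GPU family, so treat this as an illustration only:

    -e HSA_OVERRIDE_GFX_VERSION=10.3.0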

Integrated GPU setup

Integrated GPUs use the same RAM as the CPU. System RAM is assigned to the iGPU in two ways:

- Reserved VRAM: a fixed carve-out configured in BIOS (often called UMA frame buffer size), which is invisible to the OS and permanently dedicated to the iGPU.
- GTT (Graphics Translation Table): ordinary system RAM that the amdgpu driver hands to the iGPU dynamically, on demand.

Why are we discussing these technical details? Because this is where iGPU support is badly broken. Before kernel 6.10, ROCm allocated only from reserved VRAM. Since kernel 6.10, it allocates only from GTT. So far so good. GTT is better than reserved VRAM, because we no longer have to fiddle with BIOS settings. The catch is that Ollama determines whether a suitable GPU is present by looking at the size of reserved VRAM. If you have the default 512MB VRAM configured in BIOS, Ollama will refuse to use the iGPU. If you increase the VRAM carve-out in BIOS, Ollama will use the iGPU, but all memory allocations will go to GTT and the reserved VRAM will sit idle, which is of course extremely wasteful.
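You can check how much reserved VRAM and GTT your system actually has, because the amdgpu driver exposes both via sysfs. A quick sketch (the card0 index is an assumption; on some systems the iGPU shows up as card1):

# Both values are in bytes.
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total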

To fix this mess, you have to use the unmerged pull request that adds GTT support for AMD iGPUs. Let's start by building the pull request from scratch. To prevent issues with a stale build cache, the script below aggressively prunes all caches.

cd ~
mkdir ollama-gtt
cd ollama-gtt
git clone \
    -b AMD_APU_GTT_memory \
    --recurse-submodules \
    https://github.com/Maciej-Mogilany/ollama.git \
    .
podman image prune -f
# Buildah keeps a per-user build cache; 1000 is the UID, adjust to yours.
rm -rf /var/tmp/buildah-cache-1000
podman build \
    -f Dockerfile \
    --no-cache \
    --platform=linux/amd64 \
    --target runtime-rocm \
    --build-arg=OLLAMA_SKIP_CUDA_GENERATE=1 \
    -t ollama-gtt \
    .
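You can check that the build succeeded by listing the freshly built image:

podman images ollama-gtt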

The build produces a local ollama-gtt container image, which we can now use to launch an iGPU-compatible version of Ollama. The Podman command is almost identical to the one above for dGPUs. We need to make only two changes:

- Replace the docker.io/ollama/ollama:rocm image with the locally built ollama-gtt image (and switch --pull=always to --pull=never, because the local image does not exist in any registry).
- Add the HSA_OVERRIDE_GFX_VERSION environment variable, because no iGPU is on ROCm's list of supported GPUs.

podman run -d \
    --name ollama \
    --replace \
    --pull=never \
    --restart=always \
    --stop-signal=SIGKILL \
    -p 127.0.0.1:11434:11434 \
    -v ollama:/root/.ollama \
    -e OLLAMA_MAX_LOADED_MODELS=1 \
    -e OLLAMA_NUM_PARALLEL=1 \
    --device /dev/dri \
    --device /dev/kfd \
    -e HSA_OVERRIDE_GFX_VERSION=9.0.0 \
    ollama-gtt

If you do all this and Ollama does not error out, crash, or hang, you should see models running on the iGPU. You can use the radeontop tool to watch GPU memory and compute usage.
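For example (the model name below is just an illustration; use whatever model you have pulled):

# In one terminal, watch VRAM/GTT usage and GPU load:
radeontop
# In another, load a model and send it a prompt:
podman exec -it ollama ollama run llama3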

Performance

Unfortunately, the Ryzen 5600G's iGPU performs worse than the CPU. It wasn't always so. Before kernel 6.10, with reserved VRAM configured in BIOS, the iGPU was significantly faster than the CPU. Others with newer hardware seem to have more luck and get better performance than with the CPU alone. Note that the performance boost is concentrated in context processing. Generation speed is still limited by memory bandwidth, just like on the CPU. Ollama still uses some CPU time even when the whole model runs on the iGPU, but CPU load should be negligible.
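To compare CPU and iGPU performance on your own hardware, Ollama's --verbose flag prints timing statistics after every response. A rough benchmark sketch (the model name is again just an example):

podman exec -it ollama ollama run llama3 --verbose
# After each response, compare "prompt eval rate" (context processing)
# and "eval rate" (generation) between the CPU-only and iGPU setups.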

Beware that people have reported desktop environment crashes when running models on AMD iGPUs. You will have to test it on your hardware with your preferred models to be sure. Good luck.