Running Ollama on AMD iGPU
Running Ollama on CPU cores is the trouble-free solution, but most CPU-only computers also have an iGPU, which, despite its tiny size and low power consumption, is often faster than all CPU cores combined. With some tinkering and a bit of luck, you can employ the iGPU to improve performance. Here's my experience getting Ollama to run on the AMD Ryzen 5600G (RX Vega 7, GCN 5.0) under Linux and inside a Podman container.
Discrete GPU setup
Although this article is about integrated GPUs, I will first describe the simpler setup process for a discrete GPU, partly to serve as a basis for the iGPU setup and partly to demonstrate what iGPU setup should look like in the future, once iGPU support in ROCm and Ollama improves.
We will make two changes to the CPU-only setup:
- Use the `:rocm` container image tag instead of `:latest`. This is an official Ollama image that bundles ROCm, AMD's GPU compute stack. It's a separate image, because ROCm adds 4GB to the image size (no kidding).
- Share the `/dev/dri` and `/dev/kfd` devices with the container, so that it can access the GPU. According to what I have read on the topic of GPU access in containers, the container remains properly sandboxed despite all the sharing.
This is what the complete command looks like:
```
podman run -d \
  --name ollama \
  --replace \
  --pull=always \
  --restart=always \
  --stop-signal=SIGKILL \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  --device /dev/dri \
  --device /dev/kfd \
  docker.io/ollama/ollama:rocm
```
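Once the container is up, a quick smoke test shows whether the model actually lands on the GPU. The model name below is just an arbitrary small example; use whatever you normally run:

```
# Pull and run a small model inside the container (model choice is arbitrary)
podman exec -it ollama ollama run llama3.2 "Say hello"

# "ollama ps" lists loaded models and whether they run on GPU, CPU, or a mix
podman exec -it ollama ollama ps
```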
This should work flawlessly with any recent AMD dGPU. If you have older hardware, you might have to set the `HSA_OVERRIDE_GFX_VERSION` variable to fool ROCm into using a GPU that is not on the extremely short list of GPUs supported by ROCm.
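If you are not sure what value to use, Ollama's startup log usually reveals the gfx target it detected (the exact wording varies between Ollama versions), which you can then map to the nearest supported version. For example, RDNA2 cards reporting gfx1031 or gfx1032 are commonly run with `HSA_OVERRIDE_GFX_VERSION=10.3.0`, i.e. pretending to be the officially supported gfx1030:

```
# Look for the detected AMD GPU and its gfx target in Ollama's startup log
# (exact log wording varies between Ollama versions)
podman logs ollama 2>&1 | grep -iE 'amdgpu|gfx|rocm'
```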
Integrated GPU setup
Integrated GPUs use the same RAM as the CPU. System RAM is assigned to iGPU in two ways:
- Reserved iGPU "VRAM": This is a portion of system RAM carved out for iGPU at boot time. It is only used by iGPU and cannot be used by applications. Size of iGPU carve-out is configured in BIOS. Default is usually tiny 512MB, but many BIOSes allow increasing it up to 16GB.
- GTT: Graphics Translation Table, also called Unified Memory Architecture (UMA) or GART, allows dynamic allocation of system RAM to iGPU while the system is running. Linux by default allows up to half of system RAM to be used as GTT memory. If the iGPU does not use it, this memory can be used by applications instead.
Why are we discussing these technical details? Because this is where iGPU support is badly broken. Before kernel 6.10, ROCm allocated only in reserved VRAM. After kernel 6.10, it allocates only in GTT. So far so good. GTT is better than reserved VRAM, because we no longer have to fiddle with BIOS settings. The catch is that Ollama determines whether a suitable GPU is present by looking at the size of reserved VRAM. If you have the default 512MB VRAM configured in the BIOS, Ollama will refuse to use the iGPU. If you increase the VRAM carve-out in the BIOS, Ollama will use the iGPU, but all memory allocations will go to GTT and the reserved VRAM will sit idle, which is of course extremely wasteful.
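You can inspect both memory pools on your own machine through the amdgpu driver's sysfs interface. A minimal check, assuming the iGPU is `card0` (pick the right card on multi-GPU systems):

```
# Reserved VRAM carve-out (configured in BIOS) and GTT limit, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total
```

If half of system RAM is not enough, the GTT limit can usually be raised with the `amdgpu.gttsize` kernel parameter (size in MiB).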
To fix this mess, you have to use the unmerged pull request for AMD iGPUs. Let's start by building the pull request from scratch. To prevent issues with stale build cache, the script below aggressively prunes all caches.
```
cd ~
mkdir ollama-gtt
cd ollama-gtt
git clone \
  -b AMD_APU_GTT_memory \
  --recurse-submodules \
  https://github.com/Maciej-Mogilany/ollama.git \
  .
podman image prune -f
rm -rf /var/tmp/buildah-cache-1000
podman build \
  -f Dockerfile \
  --no-cache \
  --platform=linux/amd64 \
  --target runtime-rocm \
  --build-arg=OLLAMA_SKIP_CUDA_GENERATE=1 \
  -t ollama-gtt
```
This will create a local `ollama-gtt` container image, which we can now use to launch an iGPU-compatible version of Ollama.
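Before launching anything, you can confirm the build actually produced the image:

```
# The freshly built image is stored locally under localhost/
podman images localhost/ollama-gtt
```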
The Podman command is almost identical to the one above for dGPUs. We need to make three small changes:
- Replace the public `ollama:rocm` image with our newly built `ollama-gtt` image.
- Switch `--pull=always` to `--pull=never`, because the image exists only locally and a forced pull from a registry would fail.
- Set `HSA_OVERRIDE_GFX_VERSION`, because ROCm has absolutely zero support for iGPUs. Value `9.0.0` works well for the Ryzen 5600G. You will likely have to adjust it for your iGPU.
```
podman run -d \
  --name ollama \
  --replace \
  --pull=never \
  --restart=always \
  --stop-signal=SIGKILL \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  --device /dev/dri \
  --device /dev/kfd \
  -e HSA_OVERRIDE_GFX_VERSION=9.0.0 \
  ollama-gtt
```
If you do all this and Ollama does not error out, crash, or hang, you should see models running on the iGPU.
You can use the `radeontop` tool to see GPU memory and compute usage.
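For example, generate something in one terminal and watch the iGPU in another (the model name is again just a placeholder):

```
# Terminal 1: load a model and generate some text
podman exec -it ollama ollama run llama3.2 "Explain GTT in one sentence"

# Terminal 2: watch VRAM, GTT, and shader usage on the iGPU
radeontop
```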
Performance
Unfortunately, the Ryzen 5600G's iGPU performs worse than the CPU. It wasn't always so. Before kernel 6.10, with reserved VRAM configured in the BIOS, the iGPU was significantly faster than the CPU. Others with newer hardware seem to have more luck and get better performance than with the CPU alone. Note that the performance boost is concentrated in context processing. Generation speed is still limited by memory bandwidth, just like on the CPU. Ollama still uses some CPU time even if the whole model runs on the iGPU, but CPU load should be negligible.
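If you want to measure this on your own hardware, `ollama run --verbose` prints the numbers that matter here, namely the prompt evaluation rate (context processing) and the evaluation rate (generation), so you can compare the iGPU container against a CPU-only one. The model name is again just an example:

```
# --verbose makes Ollama print timing statistics after the response,
# including "prompt eval rate" and "eval rate" in tokens/s
podman exec -it ollama ollama run --verbose llama3.2 "Summarize GTT vs reserved VRAM in two sentences."
```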
Beware that people have reported desktop environment crashes when running models on AMD iGPUs. You will have to test it on your hardware with your preferred models to be sure. Good luck.