Robert Važan

Running Ollama on AMD iGPU

Running Ollama on CPU cores is the trouble-free solution, but CPU-only computers also have an iGPU, which happens to be faster than all the CPU cores combined despite its tiny size and low power consumption. With some tinkering and a bit of luck, you can employ the iGPU to improve performance. Here's my experience getting Ollama to run on the iGPU of an AMD Ryzen 5600G under Linux, inside a Podman container.

Even though iGPUs can dynamically allocate host RAM via UMA/GTT/GART and llama.cpp supports this via a compile-time switch, there's currently no UMA support in Ollama. The only option is to reserve some RAM as dedicated VRAM in the BIOS, if your system supports it (some notebooks don't). In my case, it defaults to a puny 512MB, but it can be configured to any power of two up to 16GB. I opted for 8GB of VRAM, which is sufficient for a quantized 7-8B model (4GB), KV cache and buffers (1GB), desktop and applications (1-2GB), and some headroom (1GB). The multimodal llava 7B is a bit larger (5GB), but it still fits. If you want to run 13-20B models, you will need to reserve 16GB of RAM as VRAM.
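To verify how much VRAM the BIOS actually carved out, you can ask the amdgpu driver directly. This is just a quick sanity check; the card index in the sysfs paths below is an assumption and may differ on your system:

cat /sys/class/drm/card0/device/mem_info_vram_total   # dedicated VRAM in bytes; card index may differ
cat /sys/class/drm/card0/device/mem_info_gtt_total    # GTT (shared system RAM) limit, for comparison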

How you run Ollama with GPU support depends on your GPU vendor. I have an AMD processor, so these instructions are AMD-only. To make Ollama use the iGPU on AMD processors, you will need the Docker image variant that bundles ROCm, AMD's GPU compute stack. It's a separate image, because ROCm adds 4GB to the image size (no kidding). You will also have to give it a few more parameters compared to the CPU-only setup:

podman run -d --name ollama --replace --pull=always --restart=always \
    -p 127.0.0.1:11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    --device /dev/dri --device /dev/kfd \
    -e HSA_OVERRIDE_GFX_VERSION=9.0.0 -e HSA_ENABLE_SDMA=0 \
    docker.io/ollama/ollama:rocm

ROCm has a very short list of supported GPUs. The environment variables trick ROCm into using the unsupported iGPU in my Ryzen 5600G. You might have to adjust the variables and Ollama/ROCm versions for other unsupported GPUs. From what I have read about GPU access in containers, the container remains properly sandboxed despite all the sharing.
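To sanity-check that the GPU is actually being picked up, two things that help (the exact log wording changes between Ollama versions, and the rocminfo path assumes ROCm tools are installed on the host):

podman logs ollama 2>&1 | grep -i -e rocm -e gfx -e vram   # GPU detection messages; wording varies by version
/opt/rocm/bin/rocminfo | grep -i gfx                       # the 5600G iGPU should report as gfx90c before the override maps it to gfx900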

If you do all this and Ollama does not error out, crash, or hang, you should get a nice performance boost. In my case, I am observing over 2x faster prompt processing, approaching 40 tokens/second for a 7B model on an unloaded system. Time to process an image with moondream dropped to 6 seconds. Generation speed is still limited by memory bandwidth at 10 tokens/second, but it is no longer impacted by background workload on the CPU, which is the killer feature of iGPU inference for me. Ollama still uses some CPU time even when the whole model runs on the iGPU, but the CPU load is negligible now.
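If you want to measure this yourself, ollama run has a --verbose flag that prints prompt and generation speed after each response. The model name below is just an example; substitute whatever you have pulled:

# --verbose prints timing stats (prompt eval rate, eval rate in tokens/s) after each response
podman exec -it ollama ollama run llama2 --verbose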

The whole thing is a bit shaky though. Since the GPU is unsupported, any future update can break it. Inference on the iGPU sometimes goes off the rails and produces garbage until Ollama is restarted. Even when it works, the output is a tiny bit different from what the CPU produced (with top_k = 1), and the first run on the iGPU produces slightly different output from the second and subsequent runs. Ollama sometimes fails to offload all layers to the iGPU when switching models, reporting low VRAM as if parts of the previous model were still in VRAM. This hurts performance and it gets worse over time, but restarting Ollama fixes the problem for a while. Offloading of Mixtral layers to the iGPU is broken; the model just hangs.
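When performance drops, it helps to check how much of the model actually landed on the iGPU, to tell the stale-VRAM problem apart from other slowdowns. A few ways to check (ollama ps exists only in newer Ollama versions, and the exact log wording varies):

podman exec -it ollama ollama ps                   # newer versions show the CPU/GPU split of loaded models
podman logs ollama 2>&1 | grep -i offload          # how many layers were offloaded; wording varies by version
cat /sys/class/drm/card0/device/mem_info_vram_used # VRAM in use per the amdgpu driver; card index may differ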

I am nevertheless happy with the solution. The increased speed and independence from system load make Ollama more practical. I am going to keep this setup until I have a dedicated GPU.