Ollama on Intel Arc A380 using IPEX-LLM
I bought an Intel Arc A380 GPU last year to feed my AI pet, and I have been testing LLM inference on it for a few months now. In this post, I will detail my experience getting small LLMs running on the A380 using Intel's IPEX-LLM fork of Ollama/llama.cpp, all inside a Podman container on Linux.
TLDR (spoiler): It sort of works, but under sustained load you will run into a slowdown bug in IPEX-LLM that renders the whole setup impractical for serious applications. I am now looking for alternatives.
Why Intel Arc A380?
The Intel Arc A380 (spec, more data) sells for under 150€ and is an upgrade from my previous CPU-only inference setup. With only a 75W TDP, it fits within the power and cooling limits of my current computer, which was originally intended to be a quiet CPU-only configuration. A larger card would perform better, but it would require building an entirely new system. 6GB of VRAM is enough to run several popular small models with reasonable context windows, especially when the integrated GPU handles desktop duties and the discrete GPU can be dedicated to LLM inference.
IPEX-LLM
IPEX-LLM is Intel's project that ports popular LLMs and inference engines to Intel hardware. It includes a modified llama.cpp and offers a container image bundling both the modified llama.cpp and Ollama. I chose this solution because it provides containerization, direct Intel maintenance, and Ollama API compatibility with my existing setup.
Setting up IPEX-LLM required working around several issues:
- Enable ReBAR (or "above 4G decoding") first in your BIOS.
- Kernel compatibility initially prevented the container from seeing the GPU. After I reported the bug, Intel employees claimed support was limited to Ubuntu 22.04 with kernel 6.2 or 6.5. That seemed ridiculous for a containerized solution, but fortunately they fixed the issue after some nagging, and it now works on Fedora with the default kernel.
- No versioning of the IPEX-LLM container images makes the setup fragile: Intel just updates the "latest" tag daily. I've reported this issue, but I haven't seen improvements since. My workaround is pulling the image and tagging it locally with the current date (the exact commands are shown after this list).
- Usability of the default container image is pretty poor: it requires you to start the container, log in, and run commands manually. I've customized the container so it behaves like a standard Ollama container, inspired by Matt Curfman's Dockerfile.
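For the record, the date-tagging workaround from the list above boils down to two commands (the local tag name is just my convention):

```
# pin today's image locally so an upstream update to "latest" can't silently change my setup
podman pull docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest
podman tag docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest \
  ipex-llm-inference-cpp-xpu:$(date +%Y-%m-%d)
```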
Here's my Dockerfile that makes the container behave like a regular Ollama container:
```
FROM docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest

ENV ZES_ENABLE_SYSMAN=1
ENV USE_XETLA=OFF
ENV OLLAMA_HOST=0.0.0.0:11434

RUN mkdir -p /llm/ollama && \
    cd /llm/ollama && \
    init-ollama

WORKDIR /llm/ollama
ENTRYPOINT ["./ollama", "serve"]
```
Build the container and tag it ollama-ipex.
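If the Dockerfile sits in the current directory, a plain build is enough (the tag name just has to match what the run command below uses):

```
podman build -t ollama-ipex .
```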
You can run it similarly to the CPU-only container, but share /dev/dri so the container can access the GPU while remaining properly sandboxed:
```
podman run -d --rm \
  --name ollama --replace \
  --stop-signal=SIGKILL \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  --device /dev/dri \
  localhost/ollama-ipex
```
You should now be able to run LLMs via the Ollama API, fully accelerated on the Intel GPU.
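As a quick smoke test, you can pull a small model and fire a single non-streaming request at the API (llama3.2:1b is just an example; jq only extracts the answer):

```
# pull a small model inside the container and test a single generation
podman exec ollama /llm/ollama/ollama pull llama3.2:1b
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r .response
```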
Supported models
6GB of VRAM doesn't sound like much, but it's enough for decent small models with a reasonable context window, especially if your iGPU handles the desktop and the dGPU is dedicated to LLMs. Here are the most useful models I've tested:
- llama3.1:8b with 10K tokens of context — good for article summarization
- qwen2.5-coder:7b with 24K context — surprisingly effective for repetitive coding tasks
Some models are too large: Gemma2 fits only with a tiny 0.5K context, and vision models either run slowly or fail entirely.
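The context sizes above are not Ollama's defaults; they have to be set per model. One way to pin them (num_ctx is the standard Ollama parameter; the 10K value matches the llama3.1 entry above, and the derived model name is arbitrary) is a Modelfile created inside the container:

```
# derive a llama3.1 variant with a 10K context window
podman exec ollama sh -c 'cat > /tmp/Modelfile <<EOF
FROM llama3.1:8b
PARAMETER num_ctx 10240
EOF
/llm/ollama/ollama create llama3.1-10k -f /tmp/Modelfile'
```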
Performance
After using this setup for a while, I noticed that prompt processing slows down over time, especially with long contexts. I reported the bug with a script to reproduce it, but as of May 2025, it's still unresolved. My workaround is to have my scripts restart the inference engine after a certain number of tokens (sketched below). For fair benchmarks, the data below was collected with a fresh instance.
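The restart workaround is nothing sophisticated. A minimal sketch of the idea, assuming a hypothetical helper script start-ollama.sh that holds the podman run command shown earlier (the token budget is illustrative, not a tuned value):

```
#!/usr/bin/env bash
# Sketch: track tokens reported by the Ollama API and recreate the
# container once a budget is exceeded, to dodge the slowdown bug.
BUDGET=200000   # illustrative threshold
total=0

ask() {  # ask MODEL PROMPT  (naive JSON quoting; fine for a sketch)
  local resp
  resp=$(curl -s http://127.0.0.1:11434/api/generate \
    -d "{\"model\": \"$1\", \"prompt\": \"$2\", \"stream\": false}")
  echo "$resp" | jq -r '.response'
  # prompt_eval_count and eval_count are standard fields in the response
  total=$(( total + $(echo "$resp" | jq '(.prompt_eval_count // 0) + (.eval_count // 0)') ))
  if (( total >= BUDGET )); then
    ./start-ollama.sh   # hypothetical helper: re-runs the podman run command (--replace)
    total=0
  fi
}
```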
As a simple performance test, I measured speed in tokens/second for both prompt processing (PP) and text generation (TG), given an approximately 1.5K-token prompt that produces roughly 1K tokens of output. All models were configured with the maximum context the GPU can handle.
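Rates like these can be derived from the timing fields of a non-streaming /api/generate response (durations are reported in nanoseconds), roughly like this:

```
# compute prompt-processing and generation speed in tokens/second
jq '{pp: (.prompt_eval_count / .prompt_eval_duration * 1000000000),
     tg: (.eval_count / .eval_duration * 1000000000)}' response.json
```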
Model | Context | PP | TG |
---|---|---|---|
llama3.1 | 10 K | 329 t/s | 18.7 t/s |
qwen2.5 | 24 K | 327 t/s | 14.4 t/s |
qwen2.5-coder | 24 K | 364 t/s | 14.4 t/s |
llama3.2:3b | 32 K | 314 t/s | 25 t/s |
llama3.2:1b | 128 K | 625 t/s | 44.8 t/s |
Observations:
- The A380 significantly outperforms both my CPU-only and previous AMD iGPU setups for models that fit in VRAM.
- Even 6GB VRAM supports long contexts — qwen2.5-coder handles 24K tokens while smaller models support up to 128K context.
- Prompt processing speed is generally over 300 t/s (over 18K tokens per minute), so even the A380 handles long contexts without unreasonable waiting.
- Text generation speed is over 14 t/s, which is faster than most people can read.
- Prompt processing and text generation speeds don't scale linearly with model size. For example, qwen2.5 is smaller than llama3.1 yet generates text more slowly.
- Small models (if you don't max out the context) can fit alongside the desktop on the same GPU.
I then tested how performance scales with progressively larger contexts:
Context | qwen2.5-coder (PP / TG, t/s) | llama3.1 (PP / TG, t/s) |
---|---|---|
1 K | 247 / 14.5 | 223 / 19.2 |
2 K | 328 / 14.4 | 541 / 18.7 |
4 K | 497 / 13.9 | 283 / 17.6 |
8 K | 295 / 13 | 258 / 15.5 |
16 K | 254 / 11.7 | - |
Prompt processing peaks above 500 tokens/second at the optimal context length but drops to ~250 t/s for long contexts. Text generation remains practical even at maximum context, though it falls far below the theoretical ~30 tokens/second that the A380's 186GB/s memory bandwidth should support (generation is memory-bound, so the ceiling is roughly memory bandwidth divided by the bytes of weights read per generated token).
Alternatives
The slowdown bug, underutilization of memory bandwidth, and not-so-fast prompt processing discourage me from buying a bigger Intel card for my next build. Maybe Intel cards work better with other inference engines? Here are some options I'm considering:
- There are Ollama pull requests introducing Vulkan support and SYCL support. Both are works in progress but usable. The author of the Vulkan PR maintains a fork of Ollama with Vulkan support.
- I could just use llama.cpp directly. It supports both Vulkan and SYCL. While its OpenAI-compatible server can't switch models like Ollama, wrappers like llama-swap can handle this (see the sketch after this list).
- There's also vLLM. It has limited built-in Intel GPU support, which is however unsuitable for consumer use because it lacks 4-bit quantization. IPEX-LLM offers a patched vLLM in a container with 4-bit quantization support, though it's a single-model server without multi-model wrappers.
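For illustration, the llama.cpp route from the list could look roughly like this (the model path, context size, and the GGML_VULKAN cmake flag for the Vulkan backend are my assumptions about a current llama.cpp checkout):

```
# build llama.cpp with the Vulkan backend and serve a single model
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
./build/bin/llama-server -m models/qwen2.5-coder-7b-q4_k_m.gguf \
  -c 16384 -ngl 99 --host 127.0.0.1 --port 8080
```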
Or maybe I'll just buy an AMD card next time and hope for better performance with less hassle.