Robert Važan

Ollama on Intel Arc A380 using IPEX-LLM

I bought an Intel Arc A380 GPU last year to feed my AI pet. I have been testing LLM inference on it for a few months now. In this post, I will detail my experience getting small LLMs running on the A380 using Intel's IPEX-LLM fork of Ollama/llama.cpp, all inside a Podman container on Linux.

TLDR (spoiler): It sort of works, but under sustained load you will run into IPEX-LLM's slowdown bug, which renders the whole setup impractical for serious applications. I am now looking for alternatives.

Why Intel Arc A380?

Intel Arc A380 (spec, more data) sells for under 150€. It's an upgrade from my previous CPU-only inference setup. With only 75W TDP, it fits within the power and cooling limits of my current computer, which was originally intended to be a quiet CPU-only configuration. A larger card would perform better, but it would require building an entirely new system. 6GB of VRAM is enough to run several popular small models with reasonable context windows, especially when the integrated GPU handles desktop duties and the discrete GPU is dedicated to LLM inference.

IPEX-LLM

IPEX-LLM is Intel's project that ports popular LLMs and inference engines to Intel hardware. Among other things, it ships a container image with patched builds of llama.cpp and Ollama. I chose this solution because it is containerized, maintained directly by Intel, and compatible with the Ollama API my existing setup already uses.

Setting up IPEX-LLM required working around several issues. Here's the Dockerfile that makes the container behave like a regular Ollama container:

FROM docker.io/intelanalytics/ipex-llm-inference-cpp-xpu:latest
# GPU-related settings recommended by IPEX-LLM documentation
ENV ZES_ENABLE_SYSMAN=1
ENV USE_XETLA=OFF
# Listen on all interfaces inside the container
ENV OLLAMA_HOST=0.0.0.0:11434
# init-ollama links the bundled Ollama build into the current directory
RUN mkdir -p /llm/ollama && \
    cd /llm/ollama && \
    init-ollama
WORKDIR /llm/ollama
ENTRYPOINT ["./ollama", "serve"]
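
Build the container and tag it ollama-ipex. Assuming the Dockerfile above sits in its own directory, the build looks something like this:

podman build -t ollama-ipex .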

You can run it similarly to the CPU-only container, but share /dev/dri so the container can access the GPU while remaining properly sandboxed:

podman run -d --rm \
    --name ollama --replace \
    --stop-signal=SIGKILL \
    -p 127.0.0.1:11434:11434 \
    -v ollama:/root/.ollama \
    -e OLLAMA_MAX_LOADED_MODELS=1 \
    -e OLLAMA_NUM_PARALLEL=1 \
    --device /dev/dri \
    localhost/ollama-ipex
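
To check that inference actually runs on the GPU rather than falling back to the CPU, you can inspect the container log and watch GPU utilization on the host (intel_gpu_top comes from the intel-gpu-tools package):

podman logs ollama   # the startup log should mention the detected GPU
sudo intel_gpu_top   # live GPU utilization while a model is generating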

You should now be able to run LLMs via the Ollama API, fully accelerated on the Intel GPU.
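
As a quick smoke test (the model name here is just an example), pull a model inside the container and query it from the host through the published port:

podman exec ollama ./ollama pull llama3.2:3b
curl http://127.0.0.1:11434/api/generate \
    -d '{"model": "llama3.2:3b", "prompt": "Say hello", "stream": false}'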

Supported models

6GB of VRAM doesn't sound like much, but it's enough for decent small models with a reasonable context window, especially if your iGPU handles the desktop and the dGPU is dedicated to LLMs. The most useful models I've tested are the ones in the benchmark table below.

Some models are too large. Gemma2 fits only with a tiny 0.5K context. Vision models either run slowly or fail entirely.

Performance

After using this setup for a while, I noticed prompt processing slows down over time, especially with long contexts. I reported the bug with a script to reproduce it, but as of May 2025, it's still unresolved. My workaround is to have my scripts restart the inference engine after a certain number of tokens. For fair benchmarks, the data below was collected with a fresh instance.
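
The restart workaround is simple in principle. Here's a minimal sketch; the token budget, the jq-based counting, and the response file are illustrative assumptions, not my exact script:

#!/bin/sh
# Restart the inference engine once a token budget has been consumed.
TOKEN_BUDGET=100000   # illustrative threshold; tune to how quickly the slowdown appears
used=0

after_request() {
    # $1 = file with the final JSON response of the request just served;
    # Ollama reports prompt_eval_count and eval_count in it
    tokens=$(jq '(.prompt_eval_count // 0) + (.eval_count // 0)' "$1")
    used=$((used + tokens))
    if [ "$used" -ge "$TOKEN_BUDGET" ]; then
        podman restart ollama   # a fresh instance clears the slowdown
        used=0
    fi
}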

As a simple performance test, I measured speed in tokens/second for both prompt processing (PP) and text generation (TG), given an approximately 1.5K-token prompt that results in about 1K tokens of output. All models were configured with the maximum context the GPU can handle.
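
Numbers like these can be derived from the statistics that Ollama attaches to every response. Here's a measurement sketch; the model, context size, and prompt file are just examples:

# prompt.txt stands in for the ~1.5K-token prompt
curl -s http://127.0.0.1:11434/api/generate -d "$(jq -n \
    --rawfile prompt prompt.txt \
    '{model: "qwen2.5-coder", prompt: $prompt, stream: false,
      options: {num_ctx: 24576}}')" |
jq '{pp_tps: (.prompt_eval_count / (.prompt_eval_duration / 1e9)),
     tg_tps: (.eval_count / (.eval_duration / 1e9))}'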

Model          Context   PP        TG
llama3.1       10 K      329 t/s   18.7 t/s
qwen2.5        24 K      327 t/s   14.4 t/s
qwen2.5-coder  24 K      364 t/s   14.4 t/s
llama3.2:3b    32 K      314 t/s   25 t/s
llama3.2:1b    128 K     625 t/s   44.8 t/s
Model performance

Observations: llama3.2:1b is by far the fastest, at 625 t/s prompt processing and 44.8 t/s generation, while the other models cluster around 300-360 t/s for prompt processing and 14-25 t/s for generation.

I then tested how performance scales with progressively larger context:

Context   qwen2.5-coder   llama3.1
1 K       247 / 14.5      223 / 19.2
2 K       328 / 14.4      541 / 18.7
4 K       497 / 13.9      283 / 17.6
8 K       295 / 13        258 / 15.5
16 K      254 / 11.7      -
Large context performance (PP / TG)

Prompt processing peaks at over 500 tokens/second at the optimal context length, but drops to around 250 t/s for long contexts. Text generation remains practical even at maximum context, though it falls far below the theoretical 30 tokens/second that the A380's 186 GB/s memory bandwidth should support.
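
For what it's worth, that theoretical figure is roughly what you get by dividing memory bandwidth by the data read per generated token, assuming the weights and KV cache together occupy about 6GB:

186 GB/s / 6 GB ≈ 31 tokens/s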

Alternatives

The slowdown bug, underutilization of memory bandwidth, and not-so-fast prompt processing discourage me from buying a bigger Intel card for my next build. Maybe Intel cards work better with other inference engines? I am considering a few options.

Or maybe I'll just buy an AMD card next time and hope for better performance with less hassle.