Why are input tokens expensive?
I have recently come across an article that claims input tokens are extremely overpriced. The argument is that input processing in LLMs is compute-bound and compute is dirt cheap, so input tokens should cost a cent or so per 1M tokens even when using frontier models. I am no expert on hosting LLMs, but even I understand it's more complicated than that.
Quick recap of how LLM inference works
You can picture the work the GPU has to do to process an LLM request as a rectangle. The horizontal axis represents token position in the sequence. The vertical axis represents layers of the model, ordered bottom-up. There are many more tokens (~100K) than layers (~100), so this rectangle is wide and low. It is split into two parts: input and output.
The GPU processes the input part of this rectangle one row (model layer) at a time, starting at the bottom and quickly reaching the top. This process is highly parallel and compute-bound, so GPUs perform well at it.
But output is processed differently. The GPU processes one column (token) at a time. Within each token, it goes cell by cell (layer by layer in the model) from the bottom to the top. Only after the current token completes processing can the GPU start processing the next token.
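To make the rectangle picture concrete, here is a minimal numpy sketch. The shapes and layer count are made up, and the "layers" are plain matrix multiplies with no attention, MoE, or KV cache; it only illustrates the iteration order of prefill versus decode.

```python
import numpy as np

# Toy sketch of the prefill/decode schedule (illustrative shapes only).
d_model, n_layers, n_input, n_output = 64, 4, 16, 4
layers = [np.random.randn(d_model, d_model) * 0.1 for _ in range(n_layers)]

# Input (prefill): one pass per layer over ALL input tokens at once -- a whole
# row of the rectangle per matmul, wide and highly parallel.
x = np.random.randn(n_input, d_model)
for w in layers:
    x = np.tanh(x @ w)

# Output (decode): one token at a time, bottom to top through every layer -- a
# single cell of the rectangle per matmul, and the next token cannot start
# until the current one has gone through the whole stack.
for _ in range(n_output):
    h = np.random.randn(1, d_model)   # stand-in for the embedding of the next token
    for w in layers:
        h = np.tanh(h @ w)
```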
Cost of processing model parameters
Every layer in the model consists of several large matrices that must be loaded from memory. For input tokens, this is done once for the whole row (layer). For output tokens, this is done for every cell (combination of token and layer).
So if we have to process I input tokens and O output tokens, and the model has N total parameters and A active parameters, then input processing requires O(N) memory bandwidth to fetch model parameters while output processing requires O(OA) memory bandwidth. Compute requirements are O(IA) for input and O(OA) for output. The situation so far looks like this:
| | Input | Output |
|---|---|---|
| Bandwidth | O(N) | O(OA) |
| Compute | O(IA) | O(OA) |
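To get a feel for these bounds, here is a back-of-the-envelope calculation for a single request, ignoring batching. The model size, active parameter count, token counts, and byte width are all assumptions for a hypothetical MoE frontier model, not measurements of any real system.

```python
# Rough per-request cost of parameter traffic and compute (illustrative numbers).
N = 400e9    # total parameters (assumed MoE model)
A = 30e9     # active parameters per token (assumed)
I = 100_000  # input tokens
O = 1_000    # output tokens
BYTES = 1    # assume 8-bit weights

input_bandwidth  = N * BYTES      # every parameter is loaded once for the whole prefill
output_bandwidth = O * A * BYTES  # active parameters are reloaded for every output token
input_compute    = 2 * I * A      # ~2 FLOPs per active parameter per token
output_compute   = 2 * O * A

print(f"input : ~{input_bandwidth/1e12:.1f} TB of weights read, ~{input_compute/1e15:.1f} PFLOP")
print(f"output: ~{output_bandwidth/1e12:.1f} TB of weights read, ~{output_compute/1e15:.2f} PFLOP")
```

With these assumptions, the output side already reads far more bytes of weights than the input side, while the input side does almost all of the arithmetic.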
Since GPUs have far more compute than memory bandwidth, output tokens are bandwidth-limited while input tokens are compute-limited. Overall request cost is therefore determined by the bandwidth requirements of output tokens. Batching helps a bit, but its effectiveness is limited in modern MoE models.
Cost of processing the KV cache
The above naive calculation, apparently assumed by the author of the linked article, neglects the KV cache. That's where the model keeps information about already processed tokens. For short Q&A chats, it does not matter much, but for long-context workloads like agentic coding tools, the KV cache can easily be larger than the model's active parameters.
From the GPU's point of view, the KV cache appears to be a set of dynamically generated matrices that grow by one row/column with every token. The KV cache thus incurs compute and memory bandwidth overhead similar to that of model parameters, except it scales with sequence length.
Note that attention also involves using model parameters to compute queries, keys, and values, and to project the attention result back into the embedding, but the cost of these operations is already included in the parameter processing above. What's left unaccounted for is the innermost attention operation that matches query vectors against key vectors and combines value vectors.
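For concreteness, here is what that innermost operation looks like for a single decode step of a single layer, in a small numpy sketch with toy dimensions, a single head, and no batching.

```python
import numpy as np

# One decode step of the innermost attention for a single layer (toy sizes;
# a real model repeats this per layer and per head, reading from the KV cache).
d, seq = 64, 100_000
K = np.random.randn(seq, d)   # keys accumulated from all previously processed tokens
V = np.random.randn(seq, d)   # values accumulated from all previously processed tokens
q = np.random.randn(d)        # query of the token currently being generated

scores  = K @ q / np.sqrt(d)             # match the query against every cached key
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the whole context
out     = weights @ V                    # combine cached values
# K and V -- the entire cache for this layer -- are read from memory again
# for every single output token.
```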
If I and O are defined as above and C is the number of scalars that every token adds to the KV cache, then the KV cache occupies O((I+O)C) memory. If input is much longer than output, we can simplify this to O(IC). Processing the KV cache costs O(IC) memory bandwidth for input and O(OIC) memory bandwidth for output. Remember, this is because input is processed one layer at a time while output is processed one token at a time, so every output token has to reread the entire KV cache from memory. Compute cost is O(I²C) for input (the famous quadratic attention) and O(OIC) for output. So for the KV cache alone, we have:
| | Input | Output |
|---|---|---|
| Bandwidth | O(IC) | O(OIC) |
| Compute | O(I²C) | O(OIC) |
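Plugging the same illustrative numbers into these bounds, with an assumed C of about 200,000 scalars per token (roughly 100 layers times ~2,000 KV scalars per layer) and an 8-bit cache:

```python
# Per-request KV cache traffic, continuing the illustrative numbers from above.
C = 200_000   # scalars each token adds to the KV cache (assumption)
I = 100_000   # input tokens
O = 1_000     # output tokens
BYTES = 1     # assume an 8-bit KV cache

kv_size          = (I + O) * C * BYTES   # cache resident in memory for this one request
input_bandwidth  = I * C * BYTES         # the cache is streamed roughly once during prefill
output_bandwidth = O * I * C * BYTES     # every output token rereads the whole cache

print(f"KV cache size   ~ {kv_size/1e9:.0f} GB")
print(f"prefill traffic ~ {input_bandwidth/1e9:.0f} GB")
print(f"decode traffic  ~ {output_bandwidth/1e12:.0f} TB")
```

With these assumptions, the decode-time rereads alone come to roughly 20 TB of memory traffic for a single request, on the same order as the parameter traffic computed earlier.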
Output token bandwidth remains the bottleneck for a typical request. Let's combine this with the cost of parameter processing:
| | Input | Output |
|---|---|---|
| Bandwidth | O(N+IC) | O(OA+OIC) |
| Compute | O(IA+I²C) | O(OA+OIC) |
There are two important differences compared to the naive analysis that considered only model parameters. First, the KV cache is separate for every sequence, so batching cannot help with it at all. Second, all four complexity bounds now scale with sequence length.
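To see which of the four terms actually dominates, we can convert them into rough time estimates on a hypothetical accelerator. The compute and bandwidth figures, like the model constants, are assumptions, and the calculation ignores batching and assumes perfect utilization.

```python
# Rough per-request time estimates for the combined bounds (illustrative only).
FLOPS, BW = 2e15, 3e12            # assumed accelerator: ~2 PFLOP/s, ~3 TB/s HBM
N, A, C   = 400e9, 30e9, 200_000  # same illustrative model constants as above
I, O      = 100_000, 1_000

t_in_bw   = (N + I * C) / BW                  # O(N + IC) bytes
t_in_cmp  = (2 * I * A + I * I * C) / FLOPS   # O(IA + I^2 C) FLOPs
t_out_bw  = (O * A + O * I * C) / BW          # O(OA + OIC) bytes
t_out_cmp = (2 * O * A + O * I * C) / FLOPS   # O(OA + OIC) FLOPs

print(f"input : bandwidth {t_in_bw:5.2f} s   compute {t_in_cmp:5.2f} s")
print(f"output: bandwidth {t_out_bw:5.2f} s   compute {t_out_cmp:5.2f} s")
# With these assumptions, output-side bandwidth is the largest term,
# with input-side compute a distant second.
```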
Accounting for the cost of the KV cache
The most expensive part of handling an LLM request is the O(OA+OIC) memory bandwidth for output. The O(OA) part is a fixed cost per output token, which is covered by the output token price.
The interesting part is O(OIC). Notice how it scales with both input length and output length. Should vendors account for it in the input price, the output price, or both? And what if customers start submitting requests with very long input and very long output? That can get expensive very fast.
In practice, input is much longer and more variable than output, so it makes more sense to include this cost in the price of input tokens. It's logical if you think about it: every input token contributes some data to the KV cache, which then has to be read again for every output token, or O(O) times. That adds up to a lot of memory bandwidth. So the seemingly compute-bound input tokens actually consume lots of memory bandwidth.
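Here is the same point from the perspective of a single input token, using the assumed constants from the sketches above: the memory traffic caused by its KV cache entries being reread O times can exceed the compute spent on it during prefill.

```python
# Per-input-token comparison of KV reread bandwidth vs. prefill compute.
# Same assumed constants as in the earlier sketches.
C, O, A   = 200_000, 1_000, 30e9
FLOPS, BW = 2e15, 3e12

reread_bytes = C * O              # this token's cache entries, read once per output token
t_bandwidth  = reread_bytes / BW
t_compute    = 2 * A / FLOPS      # prefill FLOPs attributable to this one token

print(f"per input token: ~{t_bandwidth*1e6:.0f} us of bandwidth vs ~{t_compute*1e6:.0f} us of compute")
```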
Why is there a limit on output length?
If you have been browsing LLM vendor offerings, you might have wondered why they limit output length. There is no technical reason to do so, but there is an economic one. While output length varies less than input length, output can still get very long, which would get very expensive for vendors due to the O(OIC) memory bandwidth, which scales with the product of input and output length. To prevent customers from driving inference costs above the API price by submitting long inputs and asking for long outputs, vendors set an arbitrary limit on output length.
Memory usage
Memory footprint is O(IC) for a single request and O(BIC) for a batch of B concurrently processed requests. This memory must remain allocated until the last output token is generated. How much is that O(IC) in practice? Frontier models are large, and large models tend to have a larger KV cache. On the other hand, vendors have been optimizing the KV cache aggressively. I would guess that the KV cache footprint with 100K input tokens falls somewhere between 10 GB and 100 GB. That's definitely going to limit batch sizes even on spacious AMD GPUs.
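As a sanity check on that guess, here is a back-of-the-envelope estimate for an assumed architecture; the layer count, KV head count, head dimension, and cache precision are all made up.

```python
# Back-of-the-envelope KV cache footprint for a hypothetical frontier model.
layers, kv_heads, head_dim = 100, 8, 128   # assumed architecture
bytes_per_scalar = 2                       # assume a 16-bit cache; halve for fp8
tokens = 100_000

C = layers * 2 * kv_heads * head_dim       # K and V scalars added per token
footprint_gb = tokens * C * bytes_per_scalar / 1e9
print(f"C = {C:,} scalars/token -> ~{footprint_gb:.0f} GB for {tokens:,} input tokens")
```

With these assumptions it lands at roughly 40 GB per request, comfortably inside that 10-100 GB range.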
Batching is less effective in MoE models, but it still helps. Attention-related parameters and expert router parameters are shared. Some models also have shared experts that always run. Every request chooses its own experts, but if you have a larger batch, chances are there will be some overlap.
But if a single request eats all the memory on the GPU, you cannot take advantage of batching at all. Or you fit only 2-3 requests per GPU, which greatly diminishes the benefits of batching. Notice that the input length I in O(BIC) is user-controlled. That means users submitting longer inputs reduce the effectiveness of batching and increase costs. Input tokens thus not only strain memory bandwidth, they strain memory capacity as well. This is something you also have to account for when pricing input tokens.
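A quick illustration of how that plays out, with an assumed 192 GB accelerator holding an assumed 100 GB shard of model weights plus the ~40 GB per-request cache estimated above:

```python
# How the per-request KV cache limits batch size (illustrative numbers only).
GPU_MEM_GB    = 192   # assumed accelerator memory
WEIGHTS_GB    = 100   # assumed share of model weights resident on this GPU
KV_PER_REQ_GB = 40    # per-request footprint estimated above for 100K input tokens

max_batch = (GPU_MEM_GB - WEIGHTS_GB) // KV_PER_REQ_GB
print(f"room for only ~{max_batch} concurrent 100K-token requests on this GPU")
```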
Regarding optimized attention
Throughout this article, I have assumed a simple attention mechanism with a fixed per-token cost. Vendors sometimes optimize this. They only read the last 1,000 or so tokens from the KV cache in some layers. Sometimes they even deallocate that memory or move it to an off-GPU prompt cache. They might also employ some kind of sub-quadratic attention. These optimizations reduce the impact of long contexts, but they also tend to compromise model performance, so there's a limit to how far you can take them. Even with these optimizations, input tokens still impact cost, and they would never be free.
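To get a rough sense of how much such an optimization can save, assume (purely for illustration) that half of the layers switch to a 1,000-token sliding window while the rest keep full attention:

```python
# Effect of an assumed partial sliding-window scheme on decode-time KV rereads.
I, O, C, W  = 100_000, 1_000, 200_000, 1_000  # same illustrative constants; W = window size
window_frac = 0.5                             # assume half the layers use the window

full_traffic    = O * I * C
reduced_traffic = O * C * (window_frac * min(I, W) + (1 - window_frac) * I)
print(f"KV rereads: ~{full_traffic/1e12:.0f} TB -> ~{reduced_traffic/1e12:.0f} TB")
# The long-context term shrinks but does not disappear as long as any layer
# still attends over the full input.
```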
Real cost structure of LLM inference
As I have demonstrated, the super cheap input tokens from the linked article are an illusion you get when you forget to account for the cost of the KV cache. The real cost of LLM inference is O(IO): a rectangle whose sides are the input length and the output length. Vendors charge for input tokens and set a limit on output length to prevent this O(IO) cost from getting out of control.