Robert Važan

LLMs are all about the context

Calling LLMs generative AI is actually a misnomer. While LLMs indeed generate text, the important nuance here is that this process is conditional on the prompt and the wider context. LLMs behave more like compilers or translators than generators. Their output is a reflection of their input.

Consequently, output quality depends on input quality. It also depends on the LLM, but most of us are already using the best LLM we can afford and major LLM improvements are months or years away. Fine-tuning has great potential, but it's currently complicated, brittle, and expensive. Given plateauing LLM performance, the only lever we have left is the context. All LLM techniques in use today boil down to preparing the best possible context.

Let's go over the presently popular techniques to see how they all enhance the context. I want to show that diverse techniques, including reasoning and agents, are actually just context optimizations.

Long-context LLMs

Obviously, when you are trying to assemble the best possible context, having more space for it is an advantage. LLMs generally benefit from longer context. At some point you hit diminishing returns, of course, and long context is expensive, so there are economic as well as technical limits, but you are almost always better off filling the largest context window you can afford.
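
To make that concrete, here is a minimal sketch of greedily packing project files into whatever token budget you can afford. The count_tokens helper is a crude stand-in for a real tokenizer and the file headers are just an illustrative convention, not any particular tool's format.

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; assumes roughly 4 characters per token.
    return len(text) // 4

def fill_context(paths: list[str], budget: int) -> str:
    # Greedily pack files into the context until the token budget is exhausted.
    parts, used = [], 0
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # skip files that would overflow the window
        parts.append(f"### {path}\n{text}")
        used += cost
    return "\n\n".join(parts)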

Some people say that unnecessary context adds distractions and damages performance, but I think that's only the case for simple problems and with non-reasoning LLMs. In programming tasks, there is no such thing as unnecessary context.

Reasoning LLMs

Promoted as an inference-time scaling method, reasoning is intended to give the LLM extra compute, but I believe its main effect is to improve the utility of the context.

The KV cache is necessarily shallow, because there are only so many layers in an LLM and the transformer architecture does not allow any recursion. Reasoning enables computations of unlimited depth, which allows LLMs to unpack deep implicit information. Since reasoning tokens become part of the context, reasoning makes the unpacked information explicit and accessible via the KV cache.
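
A minimal sketch of this view, assuming a hypothetical llm callable that completes plain text. Real reasoning models do this in a single pass, but the effect on the context is similar: the reasoning pass unpacks implicit information into explicit tokens, and the answer is then conditioned on both the prompt and that unpacked text.

def answer_with_reasoning(llm, prompt: str) -> str:
    # Pass 1: the model thinks out loud; these tokens become part of the context.
    reasoning = llm(prompt + "\n\nThink step by step before answering:\n")
    # Pass 2: the answer is conditioned on the prompt plus the explicit reasoning,
    # so information unpacked during thinking is now directly accessible.
    return llm(prompt + "\n\n" + reasoning + "\n\nFinal answer:\n")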

Reasoning also gathers relevant information and places it at the end of the context where LLMs concentrate most of their attention. Why is attention concentrated at the end? I think it has several causes: sliding window attention, RoPE scaling, and natural locality of information in training data.

If you look at it this way, reasoning does not really have to be logical. If the model just randomly ruminates about relevant information in the context, it will succeed in surfacing implicit information and in moving relevant information to the end even if the thoughts aren't particularly logical.

RAG

RAG is essentially a poor but cheap attention layer. The way it compares the prompt vector to chunk vectors is remarkably similar to the attention mechanism in LLMs. RAG thus gives the LLM access to a huge context, even though this context is of very low quality.
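
The parallel is easy to see in code. In this sketch, RAG scores chunks by dot product with the prompt embedding and copies the top few into the prompt, while an attention head scores every key by dot product with the query and soft-weights them all.

import numpy as np

def rag_scores(prompt_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    # RAG: cosine similarity between the prompt embedding and each chunk embedding;
    # the top-scoring chunks get copied into the prompt.
    p = prompt_vec / np.linalg.norm(prompt_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ p

def attention_weights(query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    # Attention: softmax over scaled dot products of the query with every key.
    logits = keys @ query / np.sqrt(len(query))
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()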

RAG also pulls retrieved chunks into the LLM's native context, where they are processed with greater effort. In the past, keeping the context short and relying on RAG was preferable to populating a large context, because RAG placed all its retrievals in recent context, which is better covered by the LLM's attention. I don't think this matters anymore now that LLMs can use reasoning to pull any part of their context to the end.

Agents

Agents do more than just populate the context, but agentic retrieval is a big part of their success. LLM-guided retrieval tends to be smarter than RAG. And unlike RAG, the LLM gets a chance to try again if the retrieval fails to yield relevant information.

Agentic retrieval is not just plain loading of files. Agents can probe their environment and see the results. They can use tools to fetch information that is not in the knowledge base or that is not explicit in it. Agents receive feedback for any actions they take. Sometimes that feedback is their own, because they see their own output in the context.
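
A minimal agent loop, as a sketch only: llm is a hypothetical text-completion callable, the RUN/ANSWER protocol is invented for illustration, and a real implementation would sandbox command execution. The point is that every observation, including failed attempts, lands back in the context.

import subprocess

def agent_loop(llm, task: str, max_steps: int = 10) -> str:
    # Minimal agent: the LLM either runs a command or answers, and every
    # observation is appended to the context so the next step can build on it.
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(context + "\nRespond with 'RUN: <command>' or 'ANSWER: <text>'.\n")
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("RUN:"):
            command = reply[len("RUN:"):].strip()
            result = subprocess.run(command, shell=True, capture_output=True,
                                    text=True, timeout=30)
            # Feedback (stdout and stderr) becomes part of the context.
            context += f"\n$ {command}\n{result.stdout}{result.stderr}"
    return "No answer within step limit."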

Few-shot prompting

Prompting with several examples in the context helps even when using instruction-tuned LLMs. Examples complement instructions. Correct and relevant examples in the context increase the probability that the LLM's response will be correct and relevant too.

Examples do not have to be prompt-response pairs. Just including some files from the project in the context increases the probability that the LLM will follow project conventions and use internal utility functions.
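
A sketch of how such a prompt might be assembled. The layout and labels are arbitrary; the point is that both explicit prompt-response pairs and raw project files end up in the context as material to imitate.

def few_shot_prompt(instruction: str, examples: list[tuple[str, str]],
                    project_files: dict[str, str], request: str) -> str:
    # Explicit examples: prompt-response pairs demonstrating the desired output.
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input:\n{example_input}\nOutput:\n{example_output}")
    # Implicit examples: existing project files that demonstrate conventions.
    for path, text in project_files.items():
        parts.append(f"File {path}:\n{text}")
    parts.append(f"Input:\n{request}\nOutput:")
    return "\n\n".join(parts)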

My theory is that LLMs learn to cheat during training by relying on the context instead of their knowledge. They fill their KV cache with patterns observed in the context. When the LLM has to decide on the next token, it gathers these patterns from the KV cache, preferably from similar locations in the context, and imitates what it has observed before. Populating the context with information similar to the desirable output then ensures that the LLM can always draw upon relevant patterns in the KV cache in addition to its own knowledge.

Context compression

Not to be confused with KV cache compression (which reduces the KV cache footprint in computer memory) or context summarization (which provides an overview for humans), context compression is commonly used to fit more information into the context by omitting details. It's a lossy compression of the context. In programming, file lists and symbol maps are examples of context compression that omits details.
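
As an illustration, here is a minimal symbol-map sketch for Python sources using the standard ast module. It keeps class names and function signatures and drops everything else.

import ast

def symbol_map(source: str) -> str:
    # Lossy compression: keep only class names and function signatures,
    # dropping bodies, comments, and implementation details.
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            symbols.append(f"class {node.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(arg.arg for arg in node.args.args)
            symbols.append(f"def {node.name}({args})")
    return "\n".join(symbols)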

Although context compression can be used to save token costs or to avoid overflowing the context window, I see it mainly as a way to purify the context, removing the lowest quality information (speculative reasoning, failed actions, obsolete information, action history) in order to shine more light on the highest quality information (instructions, inputs, original and current state).

When is it worth it?

The nice thing about treating all of the above techniques as context optimizations is that they are now interchangeable and we can pick the most cost-effective ones. So which ones are worth the effort?

I would name reasoning as an example of a cheap technique. It's entirely handled by the LLM vendor, so you don't have to do anything. On the other hand, I would point to RAG as an example of unreasonable complexity for little gain. Agents are in the middle of the complexity scale. They can be highly effective, general, and yet simple if you just let the LLM write and run scripts in a sandbox. Few-shot prompting by exposing the LLM to existing files in the project is simple and effective too.