Robert Važan

How to manage an idiot LLM

Some people say that LLMs are like an eager intern, smart but in need of supervision. I think that grossly overstates the intelligence of contemporary LLMs. In my experience, current frontier LLMs are idiots with an IQ somewhere around 75. Kind of like Frankenstein's assistant Igor. They are useful, but they need far more oversight than an intern.

How come LLMs do so well in competitions then?

LLMs indeed do well in benchmarks. They pass university-level exams and they rank high in competitions, among the most skilled humans. But when I give an LLM a real programming task in a real software project, it struggles to complete it correctly. It makes incredibly dumb mistakes. If I prompt it naively, reviewing its output quickly becomes frustrating. The LLM is like an idiot. Maybe an idiot savant, but still an idiot. So what explains the apparent gap between IQ 150 competition performance and IQ 75 performance on real projects?

The first issue is money. LLMs score high in competitions only by spending thousands of dollars worth of tokens. That could cover a year of spending on a fairly expensive coding assistant. Spending so much money so quickly is not going to be economical on a real project. LLMs don't just have to be competent. They must be affordable. Only affordable LLMs are relevant for this discussion, and those aren't particularly smart. Yes, LLMs let me do more work, so they are worth some money, but there's a limit to that, because the goal is to make myself richer, not to make Anthropic richer. Why would I use an LLM that just earns money for Anthropic?

Then there's cheating. The gap between benchmark results and real-world performance can be partly explained by LLM vendors specifically designing training runs to maximize benchmark performance. Benchmarks are LLM marketing. They pique the interest of users and attract investor money. Who wouldn't cheat a little? I am not saying they blatantly include test answers in the training data. It's more that the training data and fine-tuning are optimized for best benchmark performance rather than for best real-world performance.

And finally, real-world projects are very different from the sort of tasks that LLMs encounter in benchmarks and competitions. Benchmark tasks are tiny, whereas real projects are so huge that only a small part of the project fits in the context. Benchmarks use well-known terminology and concepts and rely on publicly available knowledge, whereas real projects define their own concepts and vocabulary and rely on an internal knowledge base. Benchmark tasks are well defined, whereas real programming tasks contain ambiguities, gaps, and errors despite the best efforts of the developer who assigned the task to the LLM.

LLM architecture is being optimized for benchmark-like tasks. This is a side effect of optimizing the architecture for chats and other simple, short-context tasks. These tasks are mostly knowledge-limited, so huge MoE models and long training runs are effective for them. In contrast, programming requires the model to gather information from a large context full of unfamiliar concepts that must be carefully pieced together. Programming thus strains the model's attention mechanism, context size, and thinking, while the model's knowledge goes largely unused. Vendors are actually scaling down the attention mechanism and optimizing it out, because it allocates per-user memory, which is expensive in the cloud. Context sizes have grown a lot, but short-lived prompt caches in the cloud make long contexts unaffordable. Thinking does help with coding tasks, but current thinking implementations, which reportedly rely on a few thousand fine-tuning samples, aren't strong enough to let the model find its way around large code bases.

Which LLMs are least stupid?

As of mid-2025, I am most happy with Claude Sonnet 4. I assign most of the tasks to it.

Gemini Pro 2.5 is overrated. In my experience, it is very erratic. I cannot trust it to do anything right. It often fails to follow the output format and other basic instructions. I have even seen it descend into a repetition loop. And it is more expensive than Claude Sonnet in practice, because it thinks for thousands of tokens on every task, whereas Claude is fine with a few hundred thinking tokens. It feels like a model several times smaller than Claude Sonnet. I think it got Sonnet-like pricing only because Google is really proud of its thinking abilities. Gemini Pro is nevertheless useful when I need to cram a lot of files into the context or when the problem needs a lot of thinking.

If you sort models by competence, there seems to be a threshold that separates productive models from frustrating ones. Claude Sonnet is a bit over the threshold on the productive side. Gemini Pro is a bit under the threshold on the frustrating side. Everything smaller is hopelessly frustrating. I don't even try to assign anything to Gemini Flash anymore, because its responses are full of bugs and omissions. I have yet to try OpenAI's o3, but it's reportedly not very good at programming.

So how do you manage an idiot?

Since an LLM acts like an idiot with an IQ of 75 when given a real task on a real project, we need a way to make use of it despite its limited intelligence.

Some people like to give LLMs lots of tiny tasks, usually via an IDE extension to reduce per-task overhead. I am not a fan of this approach. The numerous requests are expensive. You are then under pressure to keep the context to a minimum, which limits the model's understanding of the project. Since the model is not aware of future steps, it tends to write code that will soon need changes, which increases review overhead. And since the model is not given enough autonomy, there is high communication overhead, and you end up waiting for the model several times per hour.

I instead prefer to give the LLM one bigger task at a time. It's usually one feature or one refactoring task spanning up to a dozen files. I sandwich the model between a detailed specification on one side and a line-by-line code review on the other. This gives the model enough autonomy to provide a meaningful productivity boost. And the approach scales with model intelligence, so higher gains are possible in the future: smarter models can work from a less detailed specification and need a less careful review. The model is forced to implement all of the specified requirements consistently with each other. Splitting the work into a few large tasks lets me use a relatively large model with a relatively large context while keeping API costs below 1€ per day on average.

With contemporary LLMs, it's a bad idea to neglect the specification, because that adds more work during code review. My specifications range from 5 to 20 bullet points, each containing a clear command or constraint. You cannot neglect code review either, because that would result in expensive bugfixing down the road.
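To make the workflow concrete, here is a minimal sketch of a single-task request, assuming the Anthropic Python SDK. The file paths, model id, and specification bullets are hypothetical placeholders; the point is that the relevant files and the whole specification go into one prompt, and the single response is then reviewed by hand.

```python
# Minimal sketch of the one-task-per-request workflow (Anthropic Python SDK assumed).
# File paths, model id, and specification bullets are illustrative placeholders.
from pathlib import Path
import anthropic

FILES = ["src/tokenizer.py", "src/parser.py", "tests/test_tokenizer.py"]  # hypothetical

SPECIFICATION = """
- Add support for nested comments to the tokenizer.
- Emit a clear error message, including the line number, on unterminated comments.
- Keep the public tokenizer API unchanged.
- Update unit tests to cover both cases.
"""  # in practice 5 to 20 bullet points, each a clear command or constraint

# Pack the relevant files and the specification into a single prompt.
context = "\n\n".join(f"### {path}\n{Path(path).read_text()}" for path in FILES)
prompt = f"{context}\n\n## Task\n{SPECIFICATION}"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed Sonnet model id at the time of writing
    max_tokens=8192,
    system="You are a careful programmer. Return complete updated files.",
    messages=[{"role": "user", "content": prompt}],
)

# No agentic loop: print the single response and review it line by line.
print(response.content[0].text)
```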

The best assignments to LLMs are broad and shallow. This lets the LLM write a lot of code without getting stuck on something. Over time, you will notice that the bugs are always where the code is most complicated. You will eventually learn to predict what the LLM can handle and where the issues will be. To deal with uneven task complexity, I make the specification more detailed where the task has more depth and conversely give the LLM more freedom where the task is shallow. Similarly, during code review, I pay more attention to code that is difficult and only skim over trivial changes.

This works, but only with Claude Sonnet. Other LLMs are way too dumb to handle the assigned tasks even when I write detailed specifications. A single task takes anywhere between 15 minutes and several hours depending on complexity. The productivity gain is clearly there, but it's not even 2x, definitely not 10x. I can produce over a thousand new or modified lines per day, although that's only possible because I now ask LLMs to write internal documentation and unit tests, which balloon the changesets. Productivity is now increasingly limited by complex design work, which current LLMs definitely cannot do, although there's still plenty of room for improvement in the specification-inference-review cycle.

What about agents?

You might have noticed that I talk about interactions with LLMs as if I were just making a single request and using whatever the LLM returns on the first try. You are right. I don't currently use an agentic loop. I find coding agents expensive, and I am skeptical that they are worth the cost.

In my experience, neither type checks nor unit tests nor automated reviews are likely to improve output quality much. LLMs are pretty good at producing valid code, so type checks almost always pass. If the LLM leaves a bug in the code, it will most likely leave the same bug in the tests, which means the tests pass even though the code is wrong. I currently issue followup requests with a manually written review, and the LLM is usually unable to correct the code. Automated review will likely perform worse. All of this makes me skeptical about the benefits of agentic looping.
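For concreteness, such a followup request with a manually written review can be expressed as one more user turn in the same conversation. This is only a sketch assuming the Anthropic SDK; the task prompt, the first attempt, and the review notes are placeholders.

```python
# Sketch of a followup request carrying a manually written review (Anthropic SDK assumed).
# The original prompt, the model's first attempt, and the review notes are placeholders.
import anthropic

client = anthropic.Anthropic()

prompt = "<original task: relevant files plus the specification>"
first_attempt = "<the model's first solution, as returned>"
review = (
    "- The new tokenizer state is never reset between files.\n"
    "- The error message does not include the line number required by the specification.\n"
)

followup = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed Sonnet model id
    max_tokens=8192,
    messages=[
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": first_attempt},
        {"role": "user", "content": "Please address these review findings:\n" + review},
    ],
)
print(followup.content[0].text)
```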

My focus is currently on context quality. For historical reasons, I am using my own llobot library, which is actually intended for non-coding tasks (translation, summarization, etc.), but I find it preferable even for coding tasks. I know about Aider and OpenHands, which are both intended for coding tasks, but I am not happy with the way they populate the context. What I particularly like about llobot is that it lets me keep recently completed tasks with their final accepted solutions in the context. I believe (and I have some anecdotal evidence) that it helps the model perform the next task correctly. I am also a fan of context stuffing. Standard tools will, however, eventually get better at creating valuable context, and I will switch then.
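The idea of keeping recently completed tasks in the context is easy to sketch independently of llobot. The snippet below is not llobot's API; it only illustrates prepending a few (specification, accepted solution) pairs as earlier conversation turns, so the model can see how previous tasks in the same project were resolved.

```python
# Illustration of keeping recently completed tasks in the context as prior turns.
# This is NOT llobot's API; names and structure are hypothetical.

# Each completed task is a (specification, accepted solution) pair, newest last.
completed_tasks = [
    ("- Rename Config to Settings across the project.", "### src/settings.py\n<accepted code>"),
    ("- Add a --verbose flag to the CLI.", "### src/cli.py\n<accepted code>"),
]

def build_messages(completed_tasks, new_task_prompt):
    """Turn past tasks into user/assistant turns, followed by the new task."""
    messages = []
    for specification, accepted_solution in completed_tasks:
        messages.append({"role": "user", "content": specification})
        messages.append({"role": "assistant", "content": accepted_solution})
    messages.append({"role": "user", "content": new_task_prompt})
    return messages

messages = build_messages(completed_tasks, "<current task: files plus specification>")
```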

Anthropic offers a 90% discount on cached input tokens at the cost of a 25% more expensive initial prompt. That means 1-3 agentic loop iterations would increase the total cost by only about 50%. Even though the agentic loop is not that effective, at this pricing it's likely more economical than using a larger model. I am therefore looking forward to using LLM agents in the future.
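The arithmetic behind that estimate, assuming cost is dominated by input tokens and the whole prompt gets cached: the first request pays 125% of the base input price to write the cache, and each subsequent iteration re-reads the cached prefix at 10%.

```python
# Back-of-the-envelope cost of agentic iterations with prompt caching,
# assuming input tokens dominate and the entire prompt is cached.
base = 1.00          # cost of sending the prompt once without caching
cache_write = 1.25   # initial prompt is 25% more expensive when writing the cache
cache_read = 0.10    # cached input tokens get a 90% discount

for extra_iterations in (1, 2, 3):
    total = cache_write + extra_iterations * cache_read
    print(f"{extra_iterations} extra iteration(s): {total / base:.0%} of a single uncached request")
# Prints 135%, 145%, and 155%, i.e. roughly a 50% increase for 1-3 iterations.
```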