Robert Važan

Gemini Pro for coding

Claude has recently been down more often than not. I had to fall back to Gemini Pro 2.5 so often that by now I have essentially switched to it. It's actually pretty good at coding once you get used to it.

Agent

I am using Gemini Pro via the API instead of Google's own CLI for coding, because I prefer my own llobot-based agent. I am on the paid tier of Gemini, which includes a vague promise of confidentiality. I even figured out how to check token usage and cost in Google Cloud Console.
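
For illustration, here's roughly what such a direct API call looks like with Google's google-genai Python SDK. This is a minimal sketch, not llobot code; the model name and prompt are placeholders.

    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="Rewrite foo.py to use pathlib instead of os.path.",
    )
    print(response.text)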

Real context size

Gemini Pro has cheaper input tokens than Claude and a higher maximum context length, which initially encouraged me to fill the context up to specialize the model for the currently edited project, but the model isn't really strong enough to use the 1M token context window. Performance visibly degrades with longer context and additional chat turns. The model becomes severely brain-damaged beyond 100K tokens. It starts repeating previous edits, misinterprets simple instructions, and makes other dumb mistakes. I am therefore keeping context short, far below the model's declared limit. This results in an odd cost structure dominated by output tokens.
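
In practice, that means enforcing a token budget when assembling context. A minimal sketch of the idea, assuming the SDK's count_tokens call; the budget itself is just my rough observation of where degradation gets severe:

    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")
    TOKEN_BUDGET = 100_000  # rough point where degradation becomes severe

    def within_budget(files: list[str]) -> bool:
        # count_tokens asks the API for an exact count without generating
        usage = client.models.count_tokens(
            model="gemini-2.5-pro",
            contents="\n\n".join(files),
        )
        return usage.total_tokens <= TOKEN_BUDGET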

Multi-turn chats

I don't think it's just the context length, although that certainly contributes. I believe Gemini Pro rapidly degrades in a multi-turn conversation. It has trouble implementing followup edits on top of the edits it has already made. Maybe its training somehow favors single-step workflows. Or maybe Gemini Pro's attention mechanism cannot deal with multiple versions of the same document. Or maybe its long-range attention mechanism does not encode token position at all. That could explain why it gets confused about what has been done already and what's yet to be done.

It's best to use Gemini Pro for easier tasks that it can complete in one turn. If followup edits are necessary, it's often best to start a new conversation for those.
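
The workaround is structural: instead of stacking turns onto one chat, rebuild the prompt from the current state of the files for every followup batch. A sketch of that single-turn pattern (the helper and its prompt layout are my own invention):

    def one_shot_edit(client, instructions: str, files: dict[str, str]) -> str:
        # Fresh request every time: the model sees only one version
        # of each file, never the history of its previous edits.
        context = "\n\n".join(
            f"=== {path} ===\n{content}" for path, content in files.items()
        )
        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=f"{context}\n\n{instructions}",
        )
        return response.text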

Cost

In llobot, I am still using full file rewrite as the only supported edit format, which balloons output tokens, but I believe most output tokens I see in reports are actually thinking tokens. I cannot know for sure, because Google Cloud Console does not break down output tokens by type. I do know, however, that Gemini Pro tends to think for thousands of tokens at every conversation turn. It is therefore cheaper to ask it to perform a batch of related changes in every request, at least as long as it can handle the task on the first try. I estimate my current costs at about 0.20€ per request, which is a lot compared to Copilot's $0.04 per whole chat, but Gemini Pro is doing a lot more work per request, so it works for me for now.
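
If you call the API directly, the per-response usage metadata does seem to report thinking separately, at least in the google-genai SDK; the field names below are my reading of the current SDK and worth double-checking:

    usage = response.usage_metadata
    print("input tokens:   ", usage.prompt_token_count)
    print("output tokens:  ", usage.candidates_token_count)
    print("thinking tokens:", usage.thoughts_token_count)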

Thinking

All the thinking is actually useful. I am no longer getting the dumb half-assed edits that Claude produced. Claude was under-thinking, spending only a few hundred tokens even when given a complex task. I suspect Claude's thinking is just a tweak in the system prompt rather than a result of proper reinforcement learning. Even though Gemini Pro does not make half-assed edits like Claude, it still fails to fully adhere to instructions and it butchers every nontrivial algorithm.
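
The thinking behavior is also configurable per request. A sketch, assuming the google-genai SDK's thinking config (Gemini 2.5 Pro reportedly doesn't let you disable thinking, only cap it):

    from google.genai import types

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="Refactor the parser into its own module.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                thinking_budget=8192,   # cap on thinking tokens
                include_thoughts=True,  # return thought summaries
            ),
        ),
    )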

Verdict

Gemini Pro still feels like a smaller model than Claude Sonnet, but I have learned to appreciate its reliable availability, thorough thinking, and high token limits. I especially like its ability to nail a comprehensive changeset on the first try. Even though I find it useful in everyday work, I keep looking for better alternatives that would fail less often and handle more difficult tasks.