What LLMs can and cannot do today
I have made considerable progress in employing LLMs in software development. As I get close to exhausting the automation opportunities that current LLMs can handle, it is becoming clearer what LLMs currently cannot do. These hard problems now dominate my schedule. The limits of LLM capabilities thus determine the limits of my productivity. It's therefore a good idea to review where the capability limits lie to make sure I am not missing any substantial opportunities for automation.
What LLMs do well in my experience
- Unit tests: I have almost completely stopped reviewing generated unit tests, because they are already good enough. Tests merely add redundancy. If there are flaws, they will make the project less thoroughly tested, but they cannot break anything.
- Internal documentation: Since humans and LLMs reading the internal documentation are error-tolerant, flaws in internal documentation aren't a serious problem. I mostly skip reviewing generated docs too.
- Updating call sites: Today's LLMs are big and smart enough to reliably update call sites to match changes in the signature or behavior of the called code. I still give these changes a brief review, but it's very rare to see a bug in these edits.
- Boring application code: Given clear requirements and guiding instructions, LLMs will produce application code fairly reliably. I do encounter lots of bugs here, so dutiful review is necessary. Also, see notes about algorithms and API design below.
- Repetitive code cleanups: This covers bulk cleanup operations that go beyond refactoring features in IDEs, but that are still simple enough to be done reliably by frontier models. If the task is well defined, bugs are very rare. An ambiguous task may however result in enough stylistic flaws to warrant review comments.
- Sweeping but simple changes: This is more dangerous than cleanup, because functionality is being altered, but if the requested transformation is not too complicated, LLMs will handle it without a hitch.
- Translation between languages: This covers both human and programming languages. LLMs can translate well, although they usually leave behind stylistic and semantic flaws that require quite a lot of manual intervention. It's still far easier than doing everything manually.
- Micro-optimizations: LLMs can take high-level code and transform it into micro-optimized low-level code. They are pretty good at it, but this is testing their limits, because it's borderline algorithmic programming. Resulting code requires careful review, but it's usually correct.
- Summaries: Even the smaller models are pretty good at writing summaries. In software development, this includes file and directory overviews as well as git commits.
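To make the micro-optimization item above concrete, here is a hypothetical example of the kind of transformation I mean (the code and names are my own illustration, not output from any particular model): rewriting a straightforward bit-counting loop into the classic Kernighan form, which iterates once per set bit instead of once per bit position.

```python
def count_bits_naive(x: int) -> int:
    """High-level version: inspect every bit position."""
    count = 0
    while x:
        count += x & 1  # add the lowest bit
        x >>= 1         # shift to the next bit position
    return count


def count_bits_kernighan(x: int) -> int:
    """Micro-optimized version: x & (x - 1) clears the lowest set
    bit, so the loop runs once per set bit rather than once per
    bit position."""
    count = 0
    while x:
        x &= x - 1  # clear the lowest set bit
        count += 1
    return count
```

Both functions return the same result (3 for the input `0b1011`, for example); only the inner loop structure changes. Reviewing such a rewrite means checking the equivalence argument, which is why I still read these changes carefully.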
What LLMs struggle with at the moment
- Algorithms: I have never ever seen any LLM+agent+instructions combination that would be able to reliably write algorithms. It's not just binary search and hashing. LLMs cannot write queries over application data. They will fail to implement a for loop with one state variable and two conditions. It's that bad. If you ask them to write an algorithm that you can fit in 5-10 lines of elegant code, LLMs will produce 30 lines of spaghetti code that does not work. If you point out the flaws, the LLM will expand the code to a 50-line monstrosity that is still broken.
- Consistently applying quality standards: As a rule of thumb, expect LLMs to randomly ignore about half of the system prompt. Some instructions will be ignored consistently every time. It's not just ambiguous high-level instructions either. LLMs often ignore straightforward output format instructions. Without robust instruction following and without on-the-job fine-tuning, LLMs cannot be trusted to uphold a project's quality standards in any area.
- Code review: If LLMs cannot apply quality standards, they cannot do code review either. Sure, self-review can help a bit, but you cannot avoid eventual manual review, which is what matters for productivity.
- API design: APIs produced by LLMs tend to be ugly and inconsistent even if you provide the LLM with an extensive style guide. APIs are important. They are the user interface for your code. All external and major internal APIs require human oversight.
- UI design: I don't have experience with LLM-generated UIs yet, but I expect the same problems I am seeing with API design. Others with more experience in this area have reported issues with generated UIs already.
- Architecture and project structure: This is even harder than API design, because it requires wider context and awareness of project history. Only humans can do this at the moment.
- Developing new abstractions, concepts, and terminology: Abstractions are mostly developed by performing thought experiments. I think LLMs struggle with this for two reasons. Firstly, while LLMs can in principle perform thought experiments in the reasoning phase, it's a laborious and unintuitive process for them. Secondly, they cannot internalize lessons from these thought experiments properly, because their weights are frozen. LLMs are currently good only at discovery and application of standard design patterns.
- Gathering and evaluating usability data: This is completely out of reach for LLMs. I don't expect progress in this area until LLM context handling is performant enough to maintain awareness of the current project state as well as project history at a level comparable to humans.
- Formulating new requirements: This is a much harder version of UI/API design and concept development. LLMs aren't going to be formulating useful requirements for themselves anytime soon even if given results of extensive usability testing.
- Moving application metrics: This would be the ultimate form of automation in software development. Just state the goal and metrics of success and let the LLM iterate. We are undoubtedly very far from this utopia.
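To illustrate the algorithms item above, here is the kind of small loop I have in mind (a hypothetical example of mine, not taken from any model's output): finding the length of the longest run of consecutive equal elements. It fits comfortably in under ten lines, yet it is exactly the shape of task, a loop with a little state and a couple of conditions, that in my experience balloons into broken spaghetti when delegated.

```python
def longest_run(items: list) -> int:
    """Length of the longest run of consecutive equal elements."""
    best = 0
    run = 0
    for i, item in enumerate(items):
        if i == 0 or item != items[i - 1]:
            run = 1          # a new run starts here
        else:
            run += 1         # the current run continues
        best = max(best, run)
    return best
```

For instance, `longest_run([1, 1, 2, 2, 2, 3])` returns 3 (the run of 2s), and an empty list yields 0. The subtlety is all in the edge cases (first element, empty input, run ending at the last element), and those edge cases are precisely where generated algorithmic code tends to break.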
Impact on productivity
Although the specific gain factor varies from project to project, frontier LLMs are now good enough to bring substantial productivity gains almost everywhere. My recommendation is to deploy them everywhere, extensively, like there's no tomorrow.
On the other hand, there's no technological singularity. Productivity gains do not compound and you cannot reach full automation using contemporary LLMs. At some point, additional tool and workflow improvements start yielding diminishing returns. A bigger share of the productivity gains goes into improved quality (tests, docs) than into higher velocity. Long hours of human labor are still necessary. Only the nature of the work changes. Software development is now closer to project management and further from the technical details of coding.
Future
Personally, I expect the fastest progress in instruction following and algorithmic programming. Code review should work well once LLMs can dutifully follow even ambiguous high-level instructions. There is some low-hanging fruit in API and UI design, although LLMs are likely to struggle with this for some time. The remaining tasks IMO require far more powerful LLMs than what we have today.