Believe it or not, I was on the right track 15 years ago. By that time, I had already gained several fundamental insights that explain the performance of modern language models. I also got a lot of things wrong, and I was working within hardware constraints that were too narrow, so I inevitably failed. Given that even today many people, including people working on language models, fail to understand what the models are about and why they work, I think it would be a good idea to reevaluate my experience with the benefit of hindsight.
So here's what I got right at the time:
- Imitation: Humans are imitation machines. They gain nearly all of their knowledge and skills by imitating other people. If you could just build a machine that can imitate people, you would be halfway towards building an AI with human abilities. The entire concept of modern language models is a statistical formalization of what we intuitively understand as imitation.
- Optimization: Clever reasoning can be much more efficient than blind local optimization, but all reasoning strategies have applicability and complexity limits. Any sufficiently complicated problem-solving process will inevitably degenerate into optimization (or evolution). Optimization is the only form of intelligence that is truly universal and yet still practical. Reasoning strategies merely arm the higher-level optimization process with the ability to make longer local steps. It's like cities: individual buildings are systematically designed, while the city as a whole evolves chaotically.
- Memetic evolution: Ideas can hop from person to person and persist in writing or embodied in objects. They have a life of their own, independent from any particular carrier. Aside from persisting, ideas replicate and mutate, collectively undergoing memetic evolution, essentially a large-scale optimization of ideas. While hardwired functionality in the brain can help with common tasks, it's the ability to carry memetic evolution that distinguishes humans from animals. As language models produce marginally novel content (mutation) that is then filtered by users (selection) and published on the Internet where future models pick it up (reproduction), they become a suitable substrate for memetic evolution, which is what matters regardless of any deficiencies contemporary language models might otherwise have. While user-mediated selection makes the process dependent on humans, it is conceivable that more automated selection processes will be developed in the future.
- Emergent abilities: Memetic evolution does not just produce ideas for direct application. It produces reasoning strategies that can be learned and therefore do not have to be hardwired in the brain or the AI. Only performance-sensitive computations (vision, hearing, muscle control) need some degree of direct support. The rest can be learned as long as there is sufficient support for memetic evolution. Language models are surprisingly smart despite their simple architecture precisely because they pick up reasoning strategies from the training data.
- Self-prediction: Since brains already create a predictive model of the world, the easiest way to add decision-making on top is to predict one's own actions. Although it may seem strange at first, people take action by predicting themselves performing the action. This is the familiar "auto-pilot" that takes care of unlocking the door while your mind is occupied by something else. Decision by self-prediction makes imitation straightforward once you understand you are a human like everyone else (perhaps via mirror neurons). Language models make decisions by predicting themselves one token at a time. Self-prediction is what the word "autoregressive" means in autoregressive language models: each prediction is conditioned on the model's own earlier output.
- Biased predictions: While predictions have to reflect reality to be useful, they can be biased towards favorable outcomes to incorporate motivation. In language models (and neural networks in general), gradient descent is the biasing mechanism. In foundation models, it will gradually bias the model towards observations until it aligns with observed reality. In fine-tuning, it will bias the model towards desirable output. Biased self-prediction is sufficient to implement goal-driven behavior.
- Layering: Layering is an inevitable consequence of evolution, which is more likely to experiment with new layers on top than to attempt surgical changes in already working parts of the brain. So the brain ends up looking like a barrel with many layers on top of each other. Inputs travel up the barrel from senses to high-level abstractions and outputs then travel down from high-level abstractions to muscle control. Shortcuts are everywhere, so simpler behaviors do not need to roundtrip all the way through the highest layers. It's essentially a neural version of the subsumption architecture. Transformers unfold the layers into a one-way stack with input processing on one side, abstractions in the middle, and output control on the other side. Residual connections shorten the stack where necessary.
- World model: The human mind holds a four-dimensional model of the world that extends in all directions, includes currently invisible objects, and extends into the past and the future. Whatever information is missing, in the future or in the past, is filled in with predictions. In other words, internal states correspond to physical reality. Humans have place cells and grid cells for this purpose, while transformers explicitly model the sequential world in which language models exist. Although a world model is not strictly necessary, because large neural networks can simulate it, hardwiring the correct world model into the network architecture helps with model efficiency.
- Imagination: Some people make a big deal out of consciousness, but I think it's just a form of imagination. Ditto for thinking. And I think imagination is just a backflow in the sensory processing pipeline. This view is supported by known limitations of imagination. It's only visual and acoustic, because the backflow needs dedicated neural connections, which are only in some parts of the brain. It interferes with sensory processing, because imagination reuses the sensory pipeline. And some people do not have imagination, because they were born without the backflow links. Transformer-based language models are feed-forward networks, which would seem to preclude this backflow-based imagination, but they can work around it to some extent by having many layers and by mixing layer output with the skip link signal. I am nevertheless unsure to what extent imagination exists in language models.
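
The self-prediction and biased-prediction items above can be made concrete with a toy model. The following is only a minimal sketch, not how real language models are built: it assumes a bigram counter in place of a neural network, greedy decoding in place of sampling, and an additive `bias` dictionary as a crude stand-in for what gradient descent does during fine-tuning. All names (`train_bigram`, `generate`, `bias`) are made up for this illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count which token follows which: a crude statistical form of imitation."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, steps, bias=None):
    """Greedy autoregressive decoding: each predicted token becomes the next input."""
    bias = bias or {}
    out = [start]
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # the model has never seen anything follow this token
        # Score = observed count plus an optional bias towards preferred tokens.
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(followers), key=lambda t: followers[t] + bias.get(t, 0))
        out.append(best)
    return out

corpus = "i open the door and i open the window and i close the door".split()
model = train_bigram(corpus)

plain = generate(model, "i", 3)                       # follows the observed statistics
biased = generate(model, "i", 3, bias={"close": 5})   # nudged towards "close"
print(plain)   # ['i', 'open', 'the', 'door']
print(biased)  # ['i', 'close', 'the', 'door']
```

The unbiased run simply continues the most common pattern in the corpus. The biased run still produces a plausible sequence, but a small push on one token is enough to steer the whole continuation, which is the sense in which biased self-prediction can implement goal-driven behavior.
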
What I got wrong 15 years ago:
- Statistics: Back then, I was thinking in algorithms. While interpreting AI systems as statistical models is not essential to their development, it does provide clarity, a systematic approach, and objective evaluation.
- Neural networks: Even though I knew that the brain is a connectionist machine and many of my ideas relied on connectionist architecture, I still dismissed neural networks as impractical, unwieldy, and too low-level. I was looking for a symbolic approach that would be a better fit for computers. This was a few years before neural networks started climbing to the top of virtually every machine learning benchmark and almost completely killed off all alternatives. Embeddings provided a bridge between the symbolic and vector worlds. Residual connections allowed scaling up network depth. GPU acceleration made larger networks practical.
- Scale: Neural networks learn to internally simulate whatever computation is needed to perform the task. Somewhat surprisingly, neural networks that are sufficiently large and sufficiently interconnected can simulate other neural networks in the same way computers can simulate each other using virtual machines. This virtualization phenomenon, which arises in large networks, reduces the importance of network architecture while increasing the importance of network size. As a neural network designer, you are first and foremost expected to scale up network capacity and the training dataset. Tinkering with network architecture makes sense only once you have exhausted scaling opportunities. In my time, I did the exact opposite: I kept adding more and more complexity to ridiculously tiny models. As the bitter lesson of AI tells us, in the long run scale wins over clever hardwired logic.
- Feed-forward architecture: I always assumed that human-like AI must be recurrent, because the human brain is recurrent. It turns out that transformer models do just fine with feed-forward architecture. Where recurrence is needed to compute correct results, transformers use two workarounds. First, they unroll some of the recurrent computations into feed-forward layers, of which they have several dozen. If that is not enough, they unpack the recurrent computation into the token stream by thinking aloud. This is why language models work better when told to reason step by step.
- Modeling the future: Although transformers model the sequential world they live in, they do not model the future. They also never reinterpret the past in light of new data. I think this is mostly an efficiency measure. Freezing layer outputs saves compute time. While modeling of the future as well as reevaluation of the past can be implemented within the current transformer architecture at the cost of increased compute requirements, the point here is that it is not essential. Language models work just fine without explicitly modeling the future. I overengineered this.
- Hardware: Computers are still orders of magnitude less efficient than brains. Since scaling up the model is so important, efficient use of scarce hardware resources is paramount to success. While I programmed my AI using data structures and algorithms, others were already experimenting with SIMD and densely packed vectors and matrices, and later with GPU-accelerated computations. GPUs became so important in AI that we now have GPUs specifically designed for machine learning. GPUs are growing larger and support increasingly bigger GPU clusters. At the same time, models are aggressively optimized for the hardware: quantized to FP16 and now FP8 weights, converted to sparse vectors and matrices, and even tailored to the layout of specific GPUs.
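
The claim in the "Modeling the future" item, that transformers never reinterpret the past, can be illustrated with causal attention. This is a deliberately stripped-down sketch under loud assumptions: a single attention head, scalar inputs, and no learned weights (queries, keys, and values are all the raw input). The function name `causal_attention` is made up for this example. It shows only that, under a causal mask, outputs at earlier positions stay the same when new tokens arrive, which is why they can be frozen and reused.

```python
import math

def causal_attention(xs):
    """Single-head self-attention over scalar inputs with a causal mask.

    Simplification: queries, keys, and values are the raw inputs themselves.
    """
    outputs = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]                 # causal mask: position i sees only 0..i
        scores = [q * k for k in visible]
        peak = max(scores)                    # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, visible)) / total)
    return outputs

prefix = [0.5, -1.0, 2.0]
extended = prefix + [3.0]                     # one new token arrives

before = causal_attention(prefix)
after = causal_attention(extended)

# Outputs at the old positions are identical, so they never need recomputing.
unchanged = after[: len(prefix)] == before
print(unchanged)  # True
```

Dropping the mask (letting every position see the whole sequence) would let new tokens change earlier outputs, at the price of recomputing every position on every step. That is the compute trade-off described above.
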
It is now clear why my efforts couldn't possibly succeed 15 years ago. But the world got there eventually even without my help. I think that in the end it does not matter who was first. What matters is that the technology is eventually open-sourced, optimized, democratized, and integrated into everything. And I am optimistic we are heading in this direction.