LLMs are pattern matchers
Large Language Models (LLMs) are excellent pattern matchers. I don't mean this as a dismissive reductionist "LLMs are just X" argument. I believe that seeing LLMs as pattern matchers helps us understand how they work and how model size and prompt content influence performance.
How neural networks use patterns
Reality is complicated, often involving an infinite tail of exceptions. Neural networks deal with the complexity by capturing simple patterns that approximate reality without trying to model it exactly. These patterns work across multiple levels of abstraction — from basic language syntax through text structure all the way to reasoning strategies. They are stored both in the neural network's weights and in the key-value (KV) cache built during prompt processing.
While patterns are computationally efficient and discoverable through gradient descent and attention mechanisms, they are inherently unreliable on their own. Neural networks compensate for this by collecting numerous patterns to build a more complex world model. An LLM's output is formed by combining signals from many patterns at each token generation step, similar to how music emerges from overlapping waves that change over time.
Importantly, having more patterns doesn't just add breadth of knowledge — it adds depth by describing the same phenomena from multiple angles. The layered structure of LLMs enables formation of increasingly abstract patterns built on top of simpler ones.
Statistical perspective
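From a mathematical viewpoint, each pattern can be seen as the correct rule plus random noise. When multiple patterns describe the same phenomenon, they cluster around the correct rule with noise pointing in different directions. By the law of large numbers, averaging related patterns tends to cancel out the noise and reveal the underlying rule, provided the noise is roughly independent and unbiased. For such noise with finite variance, the remaining error shrinks roughly in proportion to the inverse square root of the number of patterns, which is the scaling that appears in the central limit theorem. While neural networks can perform more sophisticated aggregation than simple averaging, this basic intuition about noise cancellation still holds.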
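The noise-cancellation intuition above can be checked with a small simulation. This is only a toy sketch, not a model of what happens inside a neural network: it assumes each "pattern" is a single number equal to the true rule plus independent Gaussian noise, and the function name `average_error` and all of its parameters are made up for illustration.

```python
import random
import statistics


def average_error(num_patterns, trials=2000, true_rule=1.0, noise_sd=1.0, seed=0):
    """Mean absolute error of the average of `num_patterns` noisy estimates.

    Assumption: every pattern is `true_rule` plus independent, zero-mean
    Gaussian noise. Real patterns are correlated and far richer than this.
    """
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    errors = []
    for _ in range(trials):
        estimates = [true_rule + rng.gauss(0.0, noise_sd) for _ in range(num_patterns)]
        errors.append(abs(statistics.fmean(estimates) - true_rule))
    return statistics.fmean(errors)


# The error should fall as more patterns are averaged,
# roughly halving each time the number of patterns quadruples.
for n in (1, 4, 16, 64):
    print(n, round(average_error(n), 3))
```

In this toy setup the benefit of extra patterns has diminishing returns: going from 1 to 4 patterns helps about as much as going from 16 to 64.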
Performance implications
Understanding LLMs as pattern matchers helps predict what affects their performance. Every LLM mistake and hallucination can be traced to the model following some simplistic pattern that leads it astray. LLM application developers have a number of knobs they can turn to minimize such errors.
Model size increases pattern diversity. Small models write with a "shaky hand", meandering around the correct output as if their hand slips all the time, because they have fewer patterns that produce noisy averages. Larger models write with a "steady hand", generating precise and focused output, because they work with more patterns that yield stable averages.
Context stuffing feeds the model task-specific patterns via in-context learning. All context stuffing techniques ultimately aim to fill the KV cache with relevant patterns. Few-shot prompting is particularly effective, especially in smaller models, because patterns can be more easily harvested from concrete examples. Conversely, irrelevant information in the context hurts performance by adding noise.
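As a concrete illustration of few-shot prompting, here is a minimal sketch of assembling a prompt from worked examples. The helper `build_few_shot_prompt` and the `Input:`/`Output:` layout are assumptions chosen for this example, not a required format; any consistent layout gives the model a pattern to pick up.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a plain-text few-shot prompt.

    `examples` is a list of (input, output) pairs. The Input/Output labels
    are an illustrative convention, not something a model requires.
    """
    parts = [instruction.strip(), ""]
    for source, target in examples:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")  # blank line separates examples
    # End with the real query and an open "Output:" for the model to complete.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Convert each date to ISO 8601 format.",
    [("March 5, 2021", "2021-03-05"), ("12 Jan 1999", "1999-01-12")],
    "July 4, 1976",
)
print(prompt)
```

Keeping the examples close to the actual task, and leaving out unrelated ones, follows the same logic as the paragraph above: relevant examples add useful patterns, irrelevant ones add noise.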
Specialization improves relevance of patterns encoded in model weights. Fine-tuned models and pretrained specialist models benefit not only from having more of the relevant patterns in their weights but also from having fewer irrelevant ones.
Context compression techniques seem attractive until you realize they limit how many patterns can be gathered per token of context. The KV cache is inevitably large, because pattern diversity is essential for model performance. Excessive KV cache compression reduces this diversity, increases noise, and ultimately undermines model performance.
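To get a feel for how large the KV cache is, the following sketch estimates its size for a standard transformer that stores one key and one value vector per layer, per KV head, per token. The configuration numbers are hypothetical, picked only to make the arithmetic easy to follow, and the function name `kv_cache_bytes` is invented for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes for an uncompressed cache.

    The factor of 2 accounts for storing both keys and values.
    `bytes_per_value=2` assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, 16-bit values.
per_token = kv_cache_bytes(32, 32, 128, num_tokens=1)
full_context = kv_cache_bytes(32, 32, 128, num_tokens=4096)

print(per_token // 1024, "KiB per token")
print(full_context / 2**30, "GiB for 4096 tokens")
```

Under these assumptions every token of context costs half a mebibyte of cache, which is why compression is tempting and also why aggressive compression discards a lot of stored information.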
When developing an application, you would first look for ways to improve pattern relevance by adding few-shot examples and task-specific information to the context, filtering out irrelevant information, and choosing or fine-tuning a specialized model. When that is not enough, the next step is to try to increase pattern diversity by using a larger model with a heavier KV cache.