LLM-ology and the "moving target" problem
How can a science of LLMs keep up with technological development?
A few months ago, I gave a talk about research on large language models (LLMs) and some of the challenges I foresee. After the talk, a researcher asked me about what I’ll call the “moving target” problem: if the thing you’re studying is constantly changing, how do you make sure your findings aren’t immediately obsolete?
To illustrate the problem, this researcher pointed out that early work on social media and psychology was all done on MySpace; early work on video games and cognition was done on games like Pac-Man. Some of what we learned from those early studies may still be relevant, but a lot, unfortunately, probably isn’t.1 I’m certainly not an expert on this area, but my intuition is that the more a given study was tailored towards the specific features of MySpace—bulletins, top friend lists, etc.—the less likely it is to contain generalizable nuggets of insight that apply to social media technologies that are widely used today. The research, in other words, was overfit to the technology of the day.
I’ve been thinking about the question ever since. I don’t think I gave a very good answer at the time, so I wanted to distill some of my thoughts here.
Broadly, I think there are three potential angles, which I describe below in order of increasing optimism about our ability to avoid theoretical obsolescence.
Option 1: Just be practical
The first approach is to eschew the notion of generalizability altogether and focus purely on very practical questions about current systems. Model developers and evaluators could develop benchmarks that measure systems in simulated environments that mimic the contexts in which they will be deployed (e.g., software engineering, legal interpretation, etc.). As others have argued, these benchmarks should be dynamic, ecologically valid, and predict actual performance “on the job”.
This is hard to do! But this view might hold that measuring abstract capabilities like Theory of Mind is even harder—and less generalizable. Therefore, we should focus on the immediate capabilities and impacts of new LLMs as they pop up.
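To make that concrete, here’s a minimal sketch (in Python) of what a single “on the job” evaluation item might look like, where the score is simply whether the model’s output does the job it would face in deployment. The ask_model stub and the slugify task are hypothetical placeholders, not any lab’s actual setup.

```python
# A toy sketch of an "on the job" evaluation: instead of scoring a model on
# abstract quiz items, score it on a simulated slice of the work it will be
# deployed to do (here, writing a small function that must pass unit tests).
# `ask_model` is a hypothetical stand-in for whatever API serves the model.

def ask_model(task_description: str) -> str:
    """Hypothetical call to the model under evaluation; should return Python source."""
    raise NotImplementedError  # placeholder: wire this up to a real model

def passes_on_the_job_check(task_description: str, tests) -> bool:
    """Run the model's code against the checks it would face in deployment."""
    try:
        source = ask_model(task_description)
        namespace: dict = {}
        exec(source, namespace)                    # run the model-written code
        return all(test(namespace) for test in tests)
    except Exception:
        return False

# One simulated work item, checked the way a reviewer would check it.
task = "Write a function slugify(s) that lowercases s and replaces spaces with '-'."
tests = [lambda ns: ns["slugify"]("Hello World") == "hello-world"]
# passes_on_the_job_check(task, tests)  # would run once ask_model is wired up
```

The point of the sketch is that the score is defined by the deployment context (does the code actually work?), not by a proxy metric.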
How close is this to what companies like OpenAI and Anthropic are already doing? A natural intuition might be that, as commercial actors, they’re necessarily focused on practical considerations like building a usable product. Yet model releases are still accompanied by a suite of benchmark metrics that may or may not reflect “on the job” performance, and there’s little attention paid to the construct validity of those benchmarks. Different models might be priced differently, but it’s not entirely clear to me how those prices track actual differences in on-the-job performance between models; if I were a consumer deciding which model to deploy, that’s something I’d want to know.
Option 2: Focus on the methods
Another approach is to focus on building generalizable methods for assessing the behaviors and mechanisms of LLMs. Like the first approach, this view accepts as given that there won’t be much in the way of generalizable principles connecting Claude, Llama 3, and other new LLMs on the horizon. We might be stuck studying each LLM as it comes out.
But unlike the approach above, this view holds that there might still be general-purpose methods we can apply to studying those LLMs. And here, I’m using “methods” pretty broadly. It could be tools for mechanistic interpretability, like sparse auto-encoders or activation patching. It could also be evaluations for rapidly and automatically assessing model capabilities (i.e., a benchmark). And it could even be conceptual frameworks that guide research: in psychology, that might include concepts like construct validity and external validity; in medicine, that might include methods like randomized controlled trials (RCTs); in economics, it might include the suite of econometric tools researchers have developed to try to tease apart empirical relationships.
Of course, it’s entirely possible that none of these methods or conceptual frameworks are generalizable. But maybe some other set of methods is. And once you have an automated method for finding specific circuits or evaluating certain behaviors, you can simply apply that method to each new LLM as it’s released.
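As a rough illustration of the “same method, new model” idea, here’s a minimal Python sketch in which the evaluation procedure is written once against an abstract interface and then re-applied to each release. The Model interface and its generate method are hypothetical placeholders rather than any real API.

```python
# A minimal sketch of the "generalizable methods" stance: the evaluation logic
# is written once, against an abstract interface, and simply re-run on every
# new model. The Model protocol and `generate` method are hypothetical
# placeholders, not any lab's real API.

from typing import Protocol

class Model(Protocol):
    name: str
    def generate(self, prompt: str) -> str: ...

def run_benchmark(model: Model, items: list[tuple[str, str]]) -> float:
    """Score a model on (prompt, expected answer) pairs; returns accuracy."""
    correct = sum(
        expected.lower() in model.generate(prompt).lower()
        for prompt, expected in items
    )
    return correct / len(items)

def evaluate_each_release(models: list[Model], items: list[tuple[str, str]]) -> dict[str, float]:
    # The findings may not transfer between models, but the procedure does:
    # we apply the identical method to each model instance as it appears.
    return {model.name: run_benchmark(model, items) for model in models}
```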
My sense is that this view is implicit in much of the current research on LLMs. And if this view is correct, it actually makes me relatively optimistic about the future of LLM-ology. It might be frustrating that none of the theoretical claims we obtain on one model can be generalized to other models—but at least we can deploy the same methods to obtain those claims.
Option 3: Find generalizable principles
The most optimistic view—at least from one perspective—is that there should be generalizable theoretical principles that apply to basically any LLM. Alternatively, even if a given claim doesn’t generalize across all LLMs, maybe it generalizes within some “family” of LLMs. The key thing here is to identify the reference class to which any given model instance belongs.
I’ve written about this challenge pretty extensively before in a post about whether and how we can generalize across model instances:2
LLM-ology and the external validity problem
If generalizable (or even “universal”) principles exist, that goes some way towards addressing the moving target problem. Future LLMs might vary in their design or benchmark performance, but maybe some of those principles will still apply to them.
Here, we need to think deeply about which findings or theoretical claims are likely candidates for generalizable principles. In my view, a good starting point would be to hew as close as possible to what LLMs actually do: operate over token sequences and make predictions about future tokens. It’s tempting to draw general conclusions about more abstract constructs like “reasoning” or “intelligence”, but these constructs are very hard to define and operationalize—and my suspicion is that the fuzzier something is, the harder it will be to generalize any conclusions about that thing across LLMs.
In an upcoming paper, I suggest that induction heads might be a good candidate for a generalizable model component. Induction heads are defined as follows:
Perhaps the most interesting finding was the induction head, a circuit whose function is to look back over the sequence for previous instances of the current token (call it A), find the token that came after it last time (call it B), and then predict that the same completion will occur again (e.g. forming the sequence [A][B] … [A] → [B]). In other words, induction heads “complete the pattern” by copying and completing sequences that have occurred before.
As this Anthropic report describes, induction heads form part of a “circuit” in transformer models. They recur across many different architectures of various sizes, they seem to emerge at roughly similar timepoints during training, and they also appear to be critical for in-context learning: the mechanism by which LLMs “learn” to complete prompts based on some analogy structure in the prompt. The idea here is that these induction circuits can perform not only “exact matches” to previous tokens but also match to a type of token:
Specifically, the thesis is that there are circuits which have the same or similar mechanism to the 2-layer induction heads and which perform a “fuzzy” or “nearest neighbor” version of pattern completion, completing [A*][B*] … [A] → [B], where A* ≈ A and B* ≈ B are similar in some space; and furthermore, that these circuits implement most in-context learning in large models.
Induction heads are a useful case study because they reveal three useful epistemological principles, which might help guide our search for new generalizable components. First, as noted above, they are extremely concrete: they can be defined in terms of tokens and token sequences—not in terms of abstract human constructs like “nouns” or “mental states”. Second, despite being so concrete, they’re also very compositional, which makes them amenable to abstraction. And third, it makes sense that something like an induction head would emerge in transformer language models given what those models are trained to do: predict upcoming tokens based on the tokens that came before. In some sense, you might say induction heads “increase the fitness” of these models—or more precisely, they decrease the loss.
Will there be other generalizable mechanisms like induction heads? It’s hard to say. But as far as I understand, we didn’t go looking for induction heads—we discovered them. And my hope is that if we follow something like the principles I’ve just outlined above, we’ll discover other generalizable principles about these models.
Which path do we take?
I’m a methodological pluralist. Perhaps it’s just my equivocal nature, but I usually think it’s premature to commit to a particular research path at the expense of alternative paths—especially when the field itself is arguably pre-paradigmatic. Individual researchers should probably specialize, but as a research community, there’s probably room for all of the above.
That said, it might be worth briefly considering some of the arguments for and against these approaches. To lay my cards on the table: the approach that excites me the most is option 3—discovering fundamental principles about how LLMs work that generalize across broad reference classes. But I fully acknowledge that this may simply not be possible.
Further, even if it is possible, it may not be very useful. Here, the argument would go something like this: sure, basic properties like induction heads are useful for explaining how LLMs work “in the abstract”, but they don’t deliver when it comes to predicting on-the-job performance. If what we want is to build safe, useful technologies, then we’re better off with option 1: focus on the technology of the day and how it’s used in practice. We might think of this as a kind of pragmatic view.
Yet there’s an equally pragmatic counterargument: namely, you can’t escape construct validity. According to this view, we might actually mischaracterize a system if we study it using methods that aren’t theoretically motivated. In fact, there’s a kind of hand-waving going on when we refer to option 1 as the “practical” view: even if you’re only interested in predicting on-the-job performance, you still have to operationalize what that means and design methods that measure it. For that, maybe you need at least generalizable methods (option 2)—and maybe even generalizable principles (option 3).
At the start of this piece, I mentioned that early research on social media systems and video games might have been “overfit” to the technology of the day. Even if our only goal is a pragmatic one, I’m not sure the choice between options 1, 2, and 3 is entirely clear. In my view, it hinges on one’s intuitions about the dangers of overfitting. From one perspective, overfitting actually isn’t so bad if you’re not interested in generalizing beyond the thing you’re studying: if you’re only interested in Claude Sonnet, then maybe it doesn’t matter that your conclusions about Claude Sonnet are based solely on Claude Sonnet. But from another perspective, this presents another risk: besides the fact that you can’t generalize beyond Claude Sonnet—which is fully acknowledged by the Sonnet aficionado—it’s possible that only studying Sonnet will limit your understanding of Sonnet itself. That is, even if you are not interested in the general, your study of the particular will be harmed by your failure to consider the general.
We can pose this dilemma in the form of two questions. First, are you interested in studying a particular tree or trees in general? And second, if you’re only interested in a particular tree: is the best way to understand that tree to study only that tree or to study that tree in the light of other trees as well?
1. In contrast, pioneering work on the literary properties of early video games—like Espen Aarseth’s Cybertext—may well hold up, perhaps with even wider applicability than before.
2. I’ve also just written a more academic paper that I hope to put up on arXiv soon, which I’ll make available here.