Generalizability in LLM-ology, revisited
What does it mean to say two circuits in two different models "do the same thing"?
Much of my research focuses on what I call LLM-ology: the scientific study of the behavioral capacities of large language models (LLMs), as well as the mechanisms underpinning those behaviors. That is, I’m interested in what LLMs can and can’t do, how they behave under different conditions, and which internal mechanisms give rise to the behaviors we observe.
But as I’ve written before, one of the central challenges in this relatively novel field is the problem of generalizability. Any given empirical study of LLMs must necessarily focus on a sample of model instances, or even a single model instance (e.g., GPT-2). It’s unclear whether and to what extent the findings obtained on that single instance generalize to other model instances we haven’t studied (say, GPT-3).1 We don’t even have a set of coherent principles we can use to guide decisions about which findings might generalize and which might not.
When it comes to mechanistic interpretability research, this challenge is compounded by another: not only is it unclear which “reference class” a given model instance belongs to, it’s also unclear what it means to assert that a given mechanism (such as a circuit) can be found in multiple models. Concretely, when can we say that two models have the “same circuit”? What theoretical parameters can we use to define circuit identity or similarity?
This is the problem I set out to characterize (and partially address) in a recent paper (now available on arXiv) that’s just been accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025. Below, I sketch out some of the main arguments in that paper.
The problem of circuit identity, briefly explained
To understand the problem, we have to understand what it means to define and identify a circuit or other putative “mechanisms” in an LLM.
Suppose you’re interested in which components of a model implement a particular function: say, tracking the modifiers (adjectives, prepositional phrases, etc.) attached to a particular noun. Take the following sentence:
The quick brown fox jumped over the lazy dog.
We might expect that certain components of a language model have learned to specialize in representing noun phrases, such that “quick” and “brown” are attached to “fox”, while “lazy” is attached to “dog”. This problem could, in principle, be solved by any combination of model components: static word embeddings, attention heads, feed-forward layers, residual stream, and more.
For the sake of simplicity, let’s assume that you (an interpretability researcher) decide to focus on attention heads. Using a range of interpretability techniques (probing, ablation, etc.), you identify two candidate attention heads in a specific model (say, Pythia-14m) that seem to be responsible for this function. The model has six layers with four heads in each layer, and the heads you’ve identified are both in the same layer (say, layer 4). You’re pretty confident that these are indeed “noun phrase heads”2, so you write up a paper concluding the following:
We identified two “noun phrase heads” in Pythia-14m, both in layer 4: [L4, H1] and [L4, H2].
The paper is sent out for review, and it comes across my desk. Here is my central question: what is the nature of the scientific claim (if any) that can be generalized to other model instances from this empirical result?
To make this concrete: if we retrained Pythia-14m with a different random seed (or on a different dataset, etc.), would you expect the same heads to implement the same function? What if we tested an entirely different model (say, Pythia-70m) or a model trained on a language other than English?
There is, in fact, basically no reason to expect that the exact same head indices should implement the same function in a different model instance. The index of a given head within a layer is arbitrary: head 2 is not “closer” to head 1 than head 4 is in any meaningful sense (except in some modified architectures). Thus, the narrow claim that heads [L4, H1] and [L4, H2] are “noun phrase heads” cannot reliably be generalized to other model instances.
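To see why head indices carry no information, here is a minimal sketch of standard causal multi-head attention in NumPy. The shapes, weight names, and random inputs are my own illustrative assumptions, not taken from any Pythia model. Reordering the heads (along with their slices of the output projection) leaves the layer's output unchanged:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Causal multi-head self-attention.

    x: (seq, d_model); w_q, w_k, w_v: (n_heads, d_model, d_head);
    w_o: (n_heads, d_head, d_model). Summing per-head outputs is equivalent
    to concatenating heads and applying a single output projection.
    """
    seq = x.shape[0]
    d_head = w_q.shape[-1]
    q = np.einsum("sd,hdk->hsk", x, w_q)
    k = np.einsum("sd,hdk->hsk", x, w_k)
    v = np.einsum("sd,hdk->hsk", x, w_v)
    scores = np.einsum("hqk,htk->hqt", q, k) / np.sqrt(d_head)
    # Causal mask: query position q may only attend to positions t <= q.
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    head_out = np.einsum("hqt,htk->hqk", attn, v)
    return np.einsum("hqk,hkd->qd", head_out, w_o)

# Hypothetical toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
n_heads, d_model, d_head, seq = 4, 16, 4, 5
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads, d_head, d_model))

perm = np.array([2, 0, 3, 1])  # an arbitrary relabeling of the four heads
original = multi_head_attention(x, w_q, w_k, w_v, w_o)
shuffled = multi_head_attention(x, w_q[perm], w_k[perm], w_v[perm], w_o[perm])
same_output = np.allclose(original, shuffled)
```

Because the layer sums over heads, any two training runs could converge on the same set of head functions in a different order, which is why a claim tied to a specific index does not transfer.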
But is there any broader claim that can be generalized? That is, in what sense might we say that two different circuits in two different models do the “same thing”?
Axes of potential correspondence
This problem is not unique to mechanistic interpretability. Neuroscientists must also contend with the fact that every brain is unique (even within the narrow set of species their research might focus on). Nonetheless, they work within a set of assumptions (informed by empirical observations) about potential axes of correspondence between the brains they study, which allow them to hopefully draw broader conclusions about “rodent brains” or “human brains” as a general category. For example, researchers might taxonomize brain cells according to their gene expression, their anatomical connectivity patterns, their developmental profile, their morphology, their firing rate, and more. Similarly, mesoscopic and macroscopic structures (such as circuits or gross brain regions) might be defined in terms of their putative function, their relative position within the brain, and their relationship to other neural components.
These axes of correspondence are not perfect, and they’re also under constant revision. But they serve as useful organizing principles for making sense of findings and guiding future research.
What might “axes of correspondence” look like for the study of circuits in LLMs? In the paper, I proposed five:
Functional. This is the simplest, most important, and also perhaps the most intuitive axis. Two circuits in two different models can be defined as “doing the same thing” to the extent that they meet the same functional criteria: for instance, knocking out each circuit results in similar behavioral changes in each model instance.
Developmental. A growing body of research is focused on “training dynamics”, i.e., the developmental trajectory of specific model behaviors and mechanisms throughout pretraining. Just as biological organisms reach certain “developmental milestones” at systematic points during development, we might expect certain behaviors or circuits to “come online” at similar points in training. Intuitively, two circuits in two different models seem more similar if they not only meet the same functional criteria, but also arise at relatively similar points during training.
Positional. Earlier, I mentioned that the index of particular attention heads within a layer is arbitrary—but the layer itself is not. We might expect some systematicity across models in terms of where certain mechanisms develop: for instance, perhaps earlier layers track more superficial relationships between tokens, while later layers represent more “abstract” relationships. In models of different sizes, we can further ask whether the absolute or relative position is what matters most.
Relational. In biological neural networks, circuits are defined not only in terms of their function, but in terms of their compositional structure and their relationship to other circuits. Similarly, we might expect the “same” circuit across model instances to exhibit similar internal structure (e.g., roughly analogous networks of attention heads occupying similar relative positions), as well as similar relationships to other model components (e.g., perhaps those “noun phrase heads” always connect to simpler “previous token heads”).
Configurational. Finally, we might expect two circuits doing the “same thing” to share more fundamental properties as well, such as their relative position in weight-space. There’s still much we don’t know about why certain attention heads end up specializing in certain ways, or why exactly a certain configuration of weights corresponds to a given function, but my intuition is that there’s some comprehensible systematicity here.
Importantly, this list is not meant to be exhaustive, nor is each axis meant to be equally important. My goal in the paper (and here) was to propose a set of plausible organizing principles that interpretability researchers can use—I expect (and hope) that researchers will build on these principles and test them more rigorously.
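To make the absolute-versus-relative question from the positional axis concrete, here is a small sketch. The helper names and the 0-based layer indexing are my own assumptions, and the 24-layer model is hypothetical. It maps a layer index to a relative depth in [0, 1] and back:

```python
def relative_depth(layer_index, n_layers):
    """Map a 0-based layer index to [0, 1]: 0 is the first layer, 1 the last."""
    if n_layers < 2:
        raise ValueError("need at least two layers to define relative depth")
    if not 0 <= layer_index < n_layers:
        raise ValueError("layer_index out of range")
    return layer_index / (n_layers - 1)

def closest_layer(depth, n_layers):
    """Return the 0-based layer index whose relative depth is nearest to `depth`."""
    return round(depth * (n_layers - 1))

# A head at 0-based layer 3 of a 6-layer model sits 60% of the way through.
depth_small = relative_depth(3, 6)
# The layer at the same relative depth in a hypothetical 24-layer model.
matched_layer = closest_layer(depth_small, 24)
```

Under the "absolute" hypothesis, the corresponding mechanism in the larger model would still sit at layer 3; under the "relative" hypothesis, it would sit near `matched_layer`. Which of these holds is an empirical question.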
One utility of these principles is that they allow us to assess various circuits that have already been identified in terms of their axes of correspondence across model instances. For example, induction circuits—which track previous instances of a token, then copy information about the token that occurred subsequently to that previous instance—appear reliably across a range of model instances, meeting the same functional criteria and also, intriguingly, emerging at roughly similar points during training at roughly similar relative positions in each model. Further, induction circuits are partially defined in terms of their own internal relational structure.3
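For readers unfamiliar with induction circuits, the input-output behavior described above can be sketched as a plain Python function. This illustrates only the functional criterion (what the circuit computes), not how a transformer implements it, and the function name is my own:

```python
def induction_prediction(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (final) token. Returns None if the
    current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions for a previous instance.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prediction = induction_prediction(["A", "B", "C", "A"])
no_match = induction_prediction(["A", "B", "C", "D"])
```

Here `prediction` is `"B"`, the token that followed the earlier `"A"`, and `no_match` is `None` because `"D"` has no earlier occurrence.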
These axes of correspondence can also be used to motivate future research. For example, a researcher interested in a particular mechanism (say, noun phrase heads) might explicitly set out to investigate that mechanism across a range of model instances with respect to these axes of correspondence. That’s what I did in the paper, focusing on a very simple kind of attention head: 1-back heads (or “previous token heads”).
Shared developmental trajectories of 1-back heads
1-back attention heads are among the simplest mechanisms one can imagine in a transformer language model: their main “job” is simply to direct attention from some target token to the token immediately preceding that token.
Here, their simplicity was actually a benefit, since my goal was to stress-test the conceptual framework I’d proposed for identifying commonalities across model instances. Because 1-back heads are so simple, I was pretty confident I could find them (or something like them) across a range of model instances, even very small models. That meant that I could compare the properties of 1-back heads across all models in which they showed up, such as when and where they appeared in each model.
There are a number of ways one could define and measure “1-back heads”. I chose a relatively straightforward approach: I presented a series of English sentences to a model instance; then, for each attention head in that model, I measured the average attention directed from each token to the token immediately preceding that token. I implemented this procedure for four models in the Pythia suite (14m, 70m, 160m, and 410m), testing all nine random seeds of each model; the Pythia suite also makes pretraining checkpoints publicly available, so I measured these heads’ behavior at select checkpoints throughout pretraining to characterize their developmental trajectory.
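The measurement itself can be sketched in a few lines. This is a minimal illustration rather than the paper's actual code: it assumes the attention weights for one sentence have already been collected into an array of shape `(n_layers, n_heads, seq, seq)`, and it uses synthetic attention patterns in place of real model outputs. In practice one would also average the scores over many sentences.

```python
import numpy as np

def one_back_scores(attn):
    """Mean attention from each token to the token immediately before it.

    attn: array of shape (n_layers, n_heads, seq, seq), where
    attn[l, h, i, j] is the attention from token i to token j.
    Returns an array of shape (n_layers, n_heads).
    """
    attn = np.asarray(attn)
    # offset=-1 selects the entries [i, i - 1], i.e. "current -> previous".
    prev_token_attn = np.diagonal(attn, offset=-1, axis1=-2, axis2=-1)
    return prev_token_attn.mean(axis=-1)

# Synthetic example (not real model output): every head spreads attention
# uniformly over the causal context, except one perfect 1-back head.
n_layers, n_heads, seq = 2, 2, 6
causal = np.tril(np.ones((seq, seq)))
uniform = causal / causal.sum(axis=-1, keepdims=True)
attn = np.tile(uniform, (n_layers, n_heads, 1, 1))

one_back = np.eye(seq, k=-1)  # token i attends only to token i - 1
one_back[0, 0] = 1.0          # the first token can only attend to itself
attn[1, 0] = one_back

scores = one_back_scores(attn)
best_layer, best_head = np.unravel_index(scores.argmax(), scores.shape)
```

The first token is excluded automatically, since it has no predecessor. A score near 1 means a head attends almost exclusively to the previous token; the uniform heads here score much lower.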
I should emphasize, here, that this is a purely behavioral assessment of “1-backness”. Technically, this procedure does not identify the function of these heads (for which we’d need a causal intervention). That said, my view is that if a head always “looks” from the current token to the immediately preceding token, it’s a reasonably strong candidate for being a 1-back head.
Once I took these measurements, I could address all sorts of interesting questions. What was the overall distribution of “1-backness” across heads in each model? When do 1-back heads start to emerge in each model? Where do they emerge? Are there any interesting points of convergence or divergence across model instances?
The paper has a number of empirical results, and I won’t go into all of them here. I want to focus on a few specific findings relating to the developmental trajectory of these 1-back heads:
First, I found that different random seeds of the same model (e.g., all nine random seeds of 14m) were highly aligned in when putative 1-back heads started to show up. That is, there was extremely high inter-seed consistency.
There was also surprisingly high inter-model consistency (e.g., 14m vs. 70m vs. 160m vs. 410m). Seeds of different models were less aligned than seeds of the same model, but the overall correlation was still quite high.
That said, there were interesting (and systematic) points of divergence. Specifically, larger models tended to show an earlier onset of 1-back attention than smaller models. The slope of 1-back attention was also steeper, i.e., 1-back attention developed not only earlier but at a faster rate in larger models. Finally, larger models had a higher peak: they had individual heads that showed a higher degree of “1-backness” than smaller models.
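As a rough sketch of how onset, slope, and peak could be operationalized, consider the following. The threshold, the slope definition (steepest rise per unit of log10 training step), and the trajectories are all my own illustrative assumptions. The numbers are invented toy values chosen to mimic the qualitative pattern described above; they are not results from the paper.

```python
import numpy as np

def summarize_trajectory(steps, scores, threshold=0.5):
    """Summarize a model's maximum 1-back score across training checkpoints.

    onset: first checkpoint step where the score reaches `threshold`
           (None if it never does)
    slope: steepest rise between adjacent checkpoints, per unit log10(step)
    peak:  highest score at any checkpoint
    """
    steps = np.asarray(steps, dtype=float)
    scores = np.asarray(scores, dtype=float)
    above = scores >= threshold
    onset = int(steps[np.argmax(above)]) if above.any() else None
    slope = float(np.max(np.diff(scores) / np.diff(np.log10(steps))))
    return {"onset": onset, "slope": slope, "peak": float(scores.max())}

# TOY NUMBERS, invented for illustration; not measurements from any model.
steps = [1, 10, 100, 1000, 10000]
small = [0.05, 0.06, 0.20, 0.55, 0.60]  # hypothetical smaller model
large = [0.05, 0.10, 0.60, 0.85, 0.90]  # hypothetical larger model

small_summary = summarize_trajectory(steps, small)
large_summary = summarize_trajectory(steps, large)

# One simple notion of temporal alignment: the Pearson correlation between
# two trajectories sampled at the same checkpoints.
alignment = float(np.corrcoef(small, large)[0, 1])
```

With these toy values, the "larger" trajectory has an earlier onset, a steeper slope, and a higher peak, while the two trajectories remain highly correlated. That is the same combination of overall consistency and systematic divergence described above.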

I found these results really exciting. The high degree of inter-seed consistency was a reassuring proof-of-concept that, at least when it came to a simple mechanism, models of the same size trained on the same data (but with different initial random weights) developed that mechanism at very similar timepoints. I was also struck by the relatively high degree of inter-model consistency: a priori, I didn’t know whether models of different sizes would show much temporal alignment. And finally, the fact that the points of divergence (onset, slope, and peak) were systematic suggests that understanding why models differ might actually be somewhat tractable.
What’s next?
One theme tying together many of my disparate research projects—which I’m planning to focus on more in the coming months and years, especially as I start my own research lab at Rutgers University—is that of trying to make headway on the epistemological challenges I’ve outlined in previous posts. When it comes to generalizability in particular, there are tons of possible extensions to the work I’ve described here, including investigating models trained on other languages and trying to determine what causes different heads to specialize.
More broadly, I remain interested in issues like construct validity and how different metaphorical construals of LLMs shape our understanding of what “kind of thing” an LLM is. These are all questions about how to figure out what we want to know, and they are by no means unique to LLM-ology. They might seem abstract or even navel-gazing, but in my view, they’re fundamental: it’s very hard to make progress—or to know whether we’ve made progress—without some kind of epistemological framework to guide us.
1. This is a problem for the study of human behavior as well, which is why it’s so important that Cognitive Science researchers account for linguistic and cultural diversity when investigating research questions.
2. Determining the functional scope of a model component is itself a very hard (and related) problem.
3. As far as I know, the only axis that has not been investigated with respect to induction heads is the configurational axis, but I might simply be unaware of that research.
