The Counterfactual

How a model got its weights

Sean Trott — Thu, 11 Jun 2026 15:08:03 GMT

While I was writing this piece, an excellent position paper came out on arXiv (Biderman et al., 2026), articulating some of these same arguments about the importance of analyzing training dynamics. I’ve tried to weave some of those arguments into the current piece, but I also highly recommend that interested readers check out the paper!

A large language model (LLM) does not appear out of nowhere. It is the result of subjecting a neural network—typically with a random initial configuration of weights—to an extensive training process. During this training process, an LLM’s weights are updated in ways that, hopefully, improve its ability to predict tokens in the kinds of contexts it’s been exposed to.

Much of the enthusiasm around LLMs comes from the observation that this simple training procedure results in systems that produce, for lack of a better term, interesting behaviors. There’s considerable debate about how to measure these behaviors and whether they actually index some “emergent capability”. What we can say, however, is that the predictions of LLMs display contextual sensitivity in ways that, in some cases, co-vary with theoretical or practical constructs of interest. For example, an LLM’s predictions might be sensitive to the grammaticality of an utterance, the discourse context in which a sentence is embedded, and even the implied mental states of characters in a story.

What we conclude about these sensitivities is, as I’ve written before, a matter of both philosophical and empirical debate. But one key limitation of this research—including much of the work I’ve done—is that it often focuses on the “final checkpoint” of a model. That is, researchers take a model that’s already been trained on (say) 100 billion words and characterize what it can and can’t do. This is informative about the behaviors produced by that particular configuration of weights, but the problem is it tells us little to nothing about how a model arrived at that configuration.

My contention—and the central thesis of a recent preprint (Biderman et al., 2026)—is that a science of LLMs needs to take these training dynamics into account. In Biology and Psychology, studying the developmental process of an organism is generally understood as producing insights that are difficult, or even impossible, to obtain from studying a “mature” organism. For instance, work in developmental psychology has tried to characterize the key “milestones” associated with language learning in early infancy, and researchers use these milestones to draw inferences about how linguistic knowledge is acquired and reorganized throughout the maturation process. More broadly, studying how any process unfolds over time (e.g., historical changes) allows researchers to identify which factors reliably covary, or whether certain types of events or changes exhibit systematic temporal ordering (e.g., “where A and B are both observed, A always precedes B”).

This “developmental” approach is by no means a panacea for the epistemological challenges I mentioned above. But I do think that the story of how a model gets its weights can play a useful role in constructing a coherent picture of how and why LLMs do what they do.

Loss curves: the simplest view

One of the simplest and most common ways of visualizing training dynamics for machine learning models is what’s called a loss curve. “Loss” refers essentially to error, so the hope is that loss decreases over the course of training.

Large language models are trained to predict tokens. Given a particular linguistic context (e.g., “I like my coffee with cream and ___”), they output a probability distribution over all possible subsequent tokens from their vocabulary. Intuitively, a better LLM should assign higher probability to the token that actually appeared next in that context (e.g., “sugar”); a bad LLM might be one that assigns very low probability to this token, relative to other possible tokens.

Thus, we can transform this output probability to a loss metric by calculating the negative log probability, or surprisal, of this token. A lower probability corresponds to a higher surprisal: that is, this token was more surprising given the language model’s probability distribution. Over the course of training, LLMs generally improve at predicting tokens in context as they observe more examples, which means their average loss tends to decrease, as depicted in the schematic below. Note that this figure was made by hand and involves no actual data:

Schematic of a loss curve. Over the course of training, “loss” (or error, etc.) generally decreases, though rarely in so smooth a fashion as depicted here.

Each “point” in a loss curve typically reflects an average across many contexts. That is, an LLM at a given training step might be presented with thousands of examples, and the loss (surprisal) is calculated for each token in each example. That means changes in performance reflect changes in an LLM’s ability to predict tokens on average.
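To make the arithmetic concrete, here is a minimal sketch of how one point on a loss curve could be computed. The probabilities are invented for illustration, and the function names (`surprisal`, `mean_loss`) are hypothetical rather than taken from any training library.

```python
import math

def surprisal(prob: float) -> float:
    """Negative log probability (in nats) of the token that actually occurred."""
    return -math.log(prob)

def mean_loss(target_probs: list[float]) -> float:
    """Average surprisal across tokens: one point on a loss curve."""
    return sum(surprisal(p) for p in target_probs) / len(target_probs)

# Made-up probabilities assigned to the correct next token at two checkpoints.
early_checkpoint = [0.01, 0.02, 0.005, 0.01]
late_checkpoint = [0.40, 0.25, 0.10, 0.60]

print(f"early loss: {mean_loss(early_checkpoint):.3f}")
print(f"late loss:  {mean_loss(late_checkpoint):.3f}")
```

Higher probabilities on the observed tokens yield lower average surprisal, which is why the later (made-up) checkpoint has the lower loss.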

That’s obviously useful for benchmarking overall training progress. But an average, of course, bundles together many potentially disparate phenomena, which makes it difficult to determine what an LLM is learning, and when. To oversimplify a bit: suppose that an LLM needs to learn things like part-of-speech, basic grammatical constructions, and which words have similar meanings. One possibility is that each of these things is learned simultaneously, but another possibility is that they are learned at different stages of training. If the latter scenario is true, we’d need to disaggregate the loss measure to tease these stages apart.
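One way to picture disaggregation is to average surprisal separately for each kind of token rather than over everything at once. The sketch below assumes each token already carries a category label; the labels, the probabilities, and the helper name `loss_by_category` are all hypothetical.

```python
import math
from collections import defaultdict

def loss_by_category(tokens):
    """Average surprisal separately for each category label.

    `tokens` is a list of (category, probability_of_correct_token) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, prob in tokens:
        totals[category] += -math.log(prob)
        counts[category] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Made-up tokens from a single checkpoint (illustrative values only).
checkpoint_tokens = [
    ("function_word", 0.50), ("function_word", 0.40),
    ("noun", 0.05), ("noun", 0.10),
    ("verb", 0.08),
]
for category, loss in loss_by_category(checkpoint_tokens).items():
    print(f"{category}: {loss:.3f}")
```

Running the same grouping at every checkpoint would give one loss curve per category, which is what would let different learning stages show up separately.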

Ultimately, on some level, we’ll still be measuring changes in the probability that a model assigns to various strings. The key difference is which strings we’re measuring the probability of, or what we’re comparing those probabilities to.

LLMs and n-grams: training “phases”?

One of my favorite articles taking a developmental approach comes from my former labmate, Tyler Chang. In this 2024 TACL article, Tyler (and co-authors) asked whether the developmental trajectories of relatively small models (~124M parameters) exhibited legible patterns. Specifically, they measured the surprisal each model assigned to input strings at each checkpoint, then asked which factors drove changes in the surprisal assigned to various strings.

Perhaps the most striking result comes from comparing the behavior of each model throughout training to the behavior of simple n-gram models (with varying n). An n-gram model is a type of language model that predates the transformer (or other “neural” architectures), and is much simpler conceptually: in an n-gram model, the probability of a given word directly follows from the number of times that word has occurred in some exact context of length n - 1.

To make this concrete, let’s consider a bigram model, i.e., one in which n = 2. Here, the probability of a word given the word immediately before it is determined by counting the number of times the two words occurred in that order in a corpus, then dividing by the number of times the preceding word (the context) occurred overall. For example, if the string is “that iguana”, a bigram model would determine p(“iguana” | “that”) by first counting the number of times “that iguana” occurred in a corpus, then dividing this by the number of times “that” occurred in the corpus. This tells us: of all the times we’ve observed some context, how frequently did we see this particular continuation?
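The counting procedure is short enough to write out directly. This is a minimal sketch over a toy corpus invented for the example; `bigram_probability` is a hypothetical helper name, and the estimate is the plain count ratio described above, with no smoothing.

```python
from collections import Counter

def bigram_probability(tokens, context, word):
    """Count-based estimate of p(word | context) for a bigram model."""
    pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    context_count = Counter(tokens)[context]        # occurrences of the context
    if context_count == 0:
        return 0.0  # context never observed, so there is nothing to estimate from
    return pair_counts[(context, word)] / context_count

# Toy corpus, made up for illustration.
corpus = "that iguana ate that leaf and that iguana slept".split()

p = bigram_probability(corpus, "that", "iguana")
print(f"p('iguana' | 'that') = {p:.3f}")
```

In this toy corpus “that” occurs three times and is followed by “iguana” twice, so the estimate is 2/3.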

A bigram model is not a very good model of language: in most cases, the probability of a word cannot be deduced solely from the word immediately preceding it. Researchers can systematically vary n (the size of the window) to account for more or less context: when n = 1, the model reduces to a unigram frequency model (i.e., the frequency of each word in isolation); as n increases, the model accounts for more and more context, which can improve prediction accuracy—though it also runs into overfitting issues, which are typically addressed using various smoothing techniques.
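The same idea generalizes to any n, and a simple form of smoothing can be added with one extra term. The sketch below uses add-k smoothing, chosen only because it is the easiest variant to show; the toy corpus and the function names are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-token window in the corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_probability(tokens, context, word, k=1.0):
    """Add-k estimate of p(word | context), where n = len(context) + 1."""
    context = tuple(context)
    counts = ngram_counts(tokens, len(context) + 1)
    vocab_size = len(set(tokens))
    joint = counts[context + (word,)]
    # Total number of n-grams that begin with this context.
    context_total = sum(c for gram, c in counts.items() if gram[:-1] == context)
    return (joint + k) / (context_total + k * vocab_size)

# Toy corpus, made up for illustration.
corpus = "that iguana ate that leaf and that iguana slept".split()

print(smoothed_probability(corpus, (), "that"))                 # unigram, n = 1
print(smoothed_probability(corpus, ("that",), "iguana"))        # bigram, n = 2
print(smoothed_probability(corpus, ("that", "iguana"), "ate"))  # trigram, n = 3
print(smoothed_probability(corpus, ("iguana",), "leaf"))        # unseen pair
```

With an empty context the function reduces to a smoothed unigram frequency, and the unseen pair still receives a small non-zero probability, which is the point of smoothing.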

In general, even n-gram models with smoothing are not particularly good models of language, in part because of their lack of representational flexibility: a key benefit of transformers and other “neural” approaches is that they represent strings in a vector space, reflecting generalizations across specific strings (e.g., part-of-speech, semantic class, etc.), which allows models to make approximate predictions for strings they haven’t observed before. But this is also why n-gram models offer a useful “anchor” for studying the training dynamics of a transformer. By comparing a transformer language model’s predictions to those of different n-gram models, we can understand the patterns of apparent generalization that the transformer is undergoing.

Specifically, for each training checkpoint, Tyler calculated the correlation between a language model’s predictions and the predictions made by n-gram models of varying n. The pattern they found was pretty clear:

Consistent with previous work (Chang and Bergen, 2022b; Karpathy et al., 2016 for LSTMs), the models overfit to unigram (token frequency) predictions then bigram predictions early in pre-training. Extending this up to 5-grams, the models reach maximal similarity to a unigram model around step 1K, before peaking in similarity to 2, 3, 4, and 5-grams, in that order.

That is, the transformer models appeared to follow different “stages” of training, in which predictions gradually reflected sensitivity to longer and longer contexts.
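The logic of this comparison can be sketched in a few lines. The version below assumes the “predictions” being compared are per-token surprisals and that similarity is a Pearson correlation; both are simplifying assumptions, and every number is invented purely to show the mechanics (none of it comes from the paper).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def peak_step(lm_by_step, ngram_surprisals):
    """Training step at which the LM is most correlated with an n-gram model."""
    return max(lm_by_step, key=lambda step: pearson(lm_by_step[step], ngram_surprisals))

# Made-up surprisals for the same five tokens (illustrative values only).
unigram_surprisals = [1.0, 2.0, 3.0, 4.0, 5.0]
bigram_surprisals = [1.0, 3.0, 2.0, 5.0, 4.0]
lm_surprisals_by_step = {
    100: [3.0, 3.1, 2.9, 3.0, 3.1],    # near-uniform predictions
    1000: [1.1, 2.0, 3.2, 3.9, 5.0],   # tracks the unigram model
    10000: [1.0, 2.9, 2.2, 4.8, 4.1],  # tracks the bigram model
}

print("unigram similarity peaks at step", peak_step(lm_surprisals_by_step, unigram_surprisals))
print("bigram similarity peaks at step", peak_step(lm_surprisals_by_step, bigram_surprisals))
```

An ordering of peaks across n (unigram first, then bigram, and so on) is the kind of pattern that the “stages” interpretation rests on.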

One way to think about this is that early on in training, the simplest and most effective way to minimize loss is to predict tokens on the basis of their individual frequency. If you don’t really know anything about language, you’re best off assigning high probability to very frequent words like “the”, and low probability to infrequent words like “zoologist”. Throughout training, however, the model learns longer and longer chunks, corresponding roughly to n-gram models of various sizes—until, eventually, it forms generalizations that allow it to make even more flexible, context-sensitive predictions that can’t easily be captured by an n-gram model.

This progressive pattern has been attested in many different models of different sizes, trained on different corpora (see, e.g., a paper by another former labmate, Michaelov et al., 2026). It seems, then, to be a relatively robust principle underlying language model training dynamics. This is crucial: identifying generalizable principles is central to the scientific endeavor, and the notion that language models undergo various “phases” during training is an important start towards this identification process.[1]

Beyond n-grams: phases and more

Of course, there are many ways one could measure potential “phases”. Beyond comparing a language model’s predictions to the predictions of an n-gram model, researchers can also compare predictions between various language models, or they can ask about changes in response to some particular stimulus contrast.

The first approach allows researchers to determine whether and when the behavior of different models converges or diverges. By also manipulating things like the initial parameters of a model, the data it’s trained on, or the model’s architecture, researchers can try to identify the model-level properties that drive patterns of convergence or divergence.

One recent paper (Fehlauer et al., 2025) adopted this approach using random “seeds” of the PolyPythia suite: these are open-source models trained on exactly the same data in the same order, but initialized with different parameters. Here, at each checkpoint of a given architecture, the authors compared the output probability distribution between random seeds, using a measure that captured the degree to which two probability distributions converge or diverge. Then they visualized these patterns of convergence across pretraining. As depicted below, they found evidence for roughly four distinct “phases”: first, a uniform phase (i.e., before models have learned much about language); second, a sharp convergence pattern (i.e., when different models start resembling each other in their predictions); third, a divergence pattern (i.e., models start looking rather different); and fourth, a slow reconvergence pattern (i.e., models of the same architecture again start to resemble each other).

Patterns of convergence and divergence between random seeds of the same model across training. Figure taken from Fehlauer et al. (2025).
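The text above does not say which measure the authors used, so the sketch below swaps in Jensen-Shannon divergence as one reasonable stand-in for “how different are two next-token distributions”. The distributions are made up, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i is 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 when p and q are identical."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Made-up next-token distributions from two seeds at two checkpoints.
seed_a_early, seed_b_early = [0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]
seed_a_later, seed_b_later = [0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]

print(f"early: {js_divergence(seed_a_early, seed_b_early):.3f}")
print(f"later: {js_divergence(seed_a_later, seed_b_later):.3f}")
```

Averaging such a value over many contexts at each checkpoint, and plotting it against training step, would trace out a convergence curve of the general kind shown in the figure.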

One of the most interesting things about this result, in my view, is the differences across model sizes. All models undergo the first three phases (i.e., uniform, sharp convergence, and divergence), but the smaller models exhibit less evidence of reconvergence, or seem to do so more slowly. What causes this gap? A speculative explanation lies in the factors causing convergence and divergence in the first place: given what we already know about correlations with various n-gram models, a reasonable hypothesis is that early convergence between transformer language models is driven by relatively simple n-gram statistics (e.g., unigram or bigram frequency), which even small models can learn. Divergence is then driven (perhaps) by subtly different patterns of generalization across models. Finally, the slow pattern of reconvergence—most pronounced in larger models—is, possibly, driven by those models all converging on the “same” generative account of language.

For what it’s worth, I’ve observed very similar patterns in my own work, examining a range of different model behaviors. That is, seeds of larger models exhibit more pairwise similarity than do seeds of smaller models, and they also converge earlier, and faster, than do seeds of smaller models. For example, here’s an analysis looking at inter-seed correlations in when and to what extent different Pythia random seeds learn particular dative constructions (e.g., “I gave him the ball” vs. “I gave the ball to him”). I won’t go into too many of the linguistic details here, but the quick version is that different ways of expressing a transfer event are preferred in different situations, according to what’s being described (e.g., “the ball” vs. “the curious object I found in the grass”) and the surrounding discourse context (e.g., what’s already been mentioned, etc.). In a recent project, I calculated the “preference” language models have for these different constructions at different checkpoints, then compared these preferences both across language models and to humans. The pattern of inter-seed convergence and divergence is depicted below:

Inter-seed convergence and divergence in preferences for various dative constructions. Figure from work in progress.

Early on, seeds exhibit little to no correlation with each other: they don’t know anything about the dative construction. After observing about 0.5B-1B tokens, seeds exhibit a sharp convergence pattern (though larger models do so earlier than smaller models). This is followed by a temporary “dip”, which is in turn followed (in larger models) by a reconvergence.

I’ve now analyzed the training dynamics of various behaviors and internal mechanisms, and I’ve found very similar patterns in each case. First, there are relatively clear “phases” of convergence and divergence between random seeds of the same model across training. And second, models show systematic variance in these patterns as a function of their size. Larger models tend to converge earlier, faster, and to a greater extent than smaller models.

Another way to visualize this latter claim is to embed each seed of each model in a two-dimensional space using something called multi-dimensional scaling (MDS). To do this, I first calculated (at each checkpoint) the correlation in output predictions between every pair of models: this tells us the extent to which one model instance is “aligned” with another in terms of its predictions. This produces a correlation matrix of size MxM (where M is the total number of model instances). Then, for each checkpoint, I ran MDS on this correlation matrix, which projects it into a two-dimensional space preserving some of the original structure. These new dimensions aren’t intrinsically meaningful or interpretable, but they reflect the original similarity structure of the correlation matrix: that is, model instances with more similar patterns of correlations will be closer together in the 2D space. By running this process at various checkpoints, we can observe how different model instances are “distributed” throughout the 2D space, and how that process evolves throughout training.
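The pipeline described above (pairwise correlations, then MDS) can be sketched without any external libraries. This version uses classical (Torgerson) MDS, which may differ from the MDS variant used for the figure, and it converts each correlation r into a distance via sqrt(2(1 - r)); both choices are assumptions. The model names and prediction values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def classical_mds(distances, dims=2, iterations=500):
    """Classical MDS via power iteration on the double-centred Gram matrix."""
    m = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_means = [sum(row) / m for row in squared]
    grand_mean = sum(row_means) / m
    # Double-centre the squared distances to recover inner products.
    gram = [[-0.5 * (squared[i][j] - row_means[i] - row_means[j] + grand_mean)
             for j in range(m)] for i in range(m)]
    coords = [[0.0] * dims for _ in range(m)]
    for axis in range(dims):
        start = [float(i + 1) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in start))
        vec = [x / norm for x in start]
        for _ in range(iterations):
            product = matvec(gram, vec)
            norm = math.sqrt(sum(x * x for x in product))
            if norm < 1e-12:  # no structure left along further axes
                break
            vec = [x / norm for x in product]
        # Rayleigh quotient gives the eigenvalue for the unit-length vector.
        eigenvalue = sum(v * p for v, p in zip(vec, matvec(gram, vec)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(m):
            coords[i][axis] = vec[i] * scale
        # Deflate so the next pass finds the next-largest eigenvector.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(m)]
                for i in range(m)]
    return coords

# Made-up output predictions for four hypothetical model instances at one checkpoint.
predictions = {
    "large-seed1": [0.9, 0.2, 0.7, 0.1, 0.8],
    "large-seed2": [0.8, 0.3, 0.6, 0.2, 0.9],
    "small-seed1": [0.5, 0.6, 0.4, 0.7, 0.3],
    "small-seed2": [0.2, 0.4, 0.9, 0.5, 0.6],
}
names = list(predictions)
correlations = [[pearson(predictions[a], predictions[b]) for b in names] for a in names]
distances = [[math.sqrt(max(0.0, 2.0 * (1.0 - r))) for r in row] for row in correlations]
embedding = classical_mds(distances)
for name, (x, y) in zip(names, embedding):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Repeating this at each checkpoint gives one 2D layout per checkpoint. In practice one would likely reach for a numerical library rather than hand-rolled power iteration, which is used here only to keep the sketch self-contained.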

The figure below depicts the MDS projection at different stages of training. Each point represents a particular seed of a particular model architecture, and the points are colored according to their size. As the figure depicts, points are roughly randomly distributed throughout the space early on in training, reflecting the fact that they haven’t really learned anything about language yet. As training progresses, however, model instances begin to converge. This is particularly true for seeds of larger models: e.g., in panels 32-64, we see that seeds of larger models cluster tightly together, while seeds of pythia-14m (the smallest model) are more sparsely distributed. Finally, in the latest stages of training, each model architecture exhibits relatively tight clustering, though notably, the cluster of pythia-14m seeds is somewhat distinct from the clusters of seeds for larger models.

Results of running multi-dimensional scaling on the correlation matrix of dative alternation preferences across pairs of language model seeds at each checkpoint. Figure taken from work in progress.

Behavior and mechanism, together

So far, I’ve discussed how particular behaviors change across training. But part of the promise of a developmental approach is that researchers can investigate changes not only in behavior, but in the internal mechanisms and representations of language models as well. Moreover, by analyzing how these changes coincide, researchers can get better conceptual traction on how exactly internal mechanisms give rise to observable behaviors. This kind of approach has been applied to a number of domains, including how and when models learn syntactic constructions, which mechanisms underlie in-context learning, and more.

This is also the approach Pam and I adopted in a recent paper. We focused on the ability of language models to disambiguate the meaning of a word in context. Specifically, we asked whether and when language model representations of ambiguous words (e.g., “lamb”) reflected human similarity judgments. The sense of “lamb” evoked in “marinated lamb” is distinct from the sense evoked in “friendly lamb”, but is more similar to the sense evoked in “roasted lamb”. As a proxy for disambiguation behavior across training, we compared the similarity of language model representations for words in different contexts to human relatedness judgments about these words. We found, first, a relatively sharp inflection point around step 1K (about 2B tokens observed), at which point each model we tested started showing evidence of disambiguation. Again, larger models continued improving well past this point, while smaller models seemed to plateau:

Performance of various language models in the Pythia suite throughout pretraining. R^2 reflects proportion of variance explained in human judgments using cosine distance from that language model’s best-performing layer. Figure taken from Rivière & Trott (2026).
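A minimal sketch of the two quantities involved: cosine distance between contextual representations, and the variance in human judgments that such distances explain. The three-dimensional vectors and the relatedness ratings are invented, and R^2 is computed here as a squared Pearson correlation, which equals the R^2 of a one-predictor linear regression; the paper's actual analysis may be more involved.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def r_squared(xs, ys):
    """Squared Pearson correlation: variance in ys explained by a linear fit on xs."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Made-up representations of "lamb" in three contexts (illustrative values only).
lamb = {
    "marinated": [0.9, 0.1, 0.2],
    "roasted": [0.8, 0.2, 0.1],
    "friendly": [0.1, 0.9, 0.3],
}
pairs = [("marinated", "roasted"), ("marinated", "friendly"), ("roasted", "friendly")]
model_distances = [cosine_distance(lamb[a], lamb[b]) for a, b in pairs]
human_relatedness = [4.6, 1.2, 1.5]  # made-up ratings, higher = more related

print(f"R^2 = {r_squared(model_distances, human_relatedness):.3f}")
```

Computing this value per layer and per checkpoint, then keeping the best layer, would yield one point on a curve of the kind the figure shows.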

We then turned to the mechanisms subserving these changes. One hypothesis we had was that specific attention heads might learn to attend to specific cues that helped disambiguate a target ambiguous word. Fortunately, our stimuli were designed such that there was always a single disambiguating cue (e.g., “She liked the marinated lamb”). For each attention head in each model, we measured the attention strength from the target ambiguous word (e.g., “lamb”) to the disambiguating cue (e.g., “marinated”), and visualized these changes over training. Our goal was to find heads that attended strongly to the disambiguating cue and, crucially, whose developmental trajectory resembled the overall disambiguation trajectory.
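The measurement itself is simple once attention scores are available: for each head, take the row of attention from the target token and read off the weight on the cue. The sketch below assumes access to pre-softmax scores from the target to each token up to and including itself; the scores and the helper name `attention_to_cue` are invented for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_to_cue(scores_from_target, cue_index):
    """Attention weight from the target token to the cue, one value per head.

    `scores_from_target[h][k]` is head h's raw score from the target token to
    token k, covering only tokens up to and including the target (causal mask).
    """
    return [softmax(head_scores)[cue_index] for head_scores in scores_from_target]

tokens = ["She", "liked", "the", "marinated", "lamb"]
cue_index = tokens.index("marinated")

# Made-up raw scores from "lamb" to each token, for two hypothetical heads.
scores_from_target = [
    [0.0, 0.0, 0.0, 3.0, 0.0],  # a head that favours the cue
    [0.0, 0.0, 0.0, 0.0, 0.0],  # a head that attends uniformly
]
for head, weight in enumerate(attention_to_cue(scores_from_target, cue_index)):
    print(f"head {head}: attention to cue = {weight:.3f}")
```

Tracking this number for each head across checkpoints gives the per-head developmental trajectories that can then be compared against the behavioral curve.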

The figure below depicts the pattern of attention to disambiguating cues for the four attention heads in layer 3 of pythia-14m, superimposed over the model-level changes in disambiguation performance (measured as R^2). Here, I’ll make two observations. First, the heads in this layer are clearly not all doing the same thing. Two heads (3 and 4) don’t really “look” at the disambiguating cue at all, whereas the other two heads (1 and 2) do show a stronger pattern of attention to the disambiguating cue (especially head 2). And second, these changes in attention are timelocked to changes in disambiguation behavior!

Attention from a target ambiguous word (e.g., “lamb”) to disambiguating cue (e.g., “marinated”) from each head in layer 3 of pythia-14m, superimposed over model-level changes in disambiguation performance. Figure adapted from Rivière & Trott (2026).

Of course, this doesn’t entail that these attention heads are therefore functionally involved in disambiguation. But it does suggest that there’s a temporal synchrony in when specific mechanisms begin reorganizing their behavior and when the overall model begins to improve in disambiguation. To test the functional role of these attention heads, we’d need to intervene on them (e.g., “knock them out”) and ask how this affects changes in behavior. Thus, we carried out a series of ablation studies in which we systematically altered the behavior of those attention heads and asked how much this “hurt” disambiguation performance, as compared to ablating random heads. We did this at each checkpoint of the model. As depicted below, we found that ablating the target heads did impair performance (compared to ablating the baseline heads) once the model had begun to disambiguate, and continued to do so for the rest of training. Conversely, these ablations didn’t affect performance early on in training, which is exactly what you’d expect given that the model hasn’t yet learned the disambiguation behavior at that point.

Effects of ablating target heads at various checkpoints, as compared to ablating random baseline heads. Figure adapted from Rivière & Trott (2026).
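The comparison logic behind an ablation study can be shown with a deliberately tiny stand-in for a model. Everything here is an assumption made for illustration: the “model” just sums per-head output vectors and reads out a score, ablation means zeroing a head’s contribution (the paper’s intervention may differ), and all numbers are invented.

```python
def model_score(head_outputs, readout, ablated=frozenset()):
    """Toy stand-in for a model: sum head outputs, then read out a scalar score."""
    combined = [0.0] * len(readout)
    for index, output in enumerate(head_outputs):
        if index in ablated:
            continue  # zero-ablation: drop this head's contribution entirely
        combined = [c + o for c, o in zip(combined, output)]
    return sum(c * r for c, r in zip(combined, readout))

def ablation_effect(head_outputs, readout, heads):
    """Drop in score when `heads` are ablated, relative to the intact model."""
    intact = model_score(head_outputs, readout)
    return intact - model_score(head_outputs, readout, frozenset(heads))

# Made-up per-head outputs at one checkpoint; head 1 plays the "target" head.
head_outputs = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.2], [0.05, 0.0]]
readout = [1.0, 0.0]
target_heads = [1]
baseline_heads = [0, 2, 3]

target_effect = ablation_effect(head_outputs, readout, target_heads)
baseline_effect = sum(
    ablation_effect(head_outputs, readout, [h]) for h in baseline_heads
) / len(baseline_heads)

print(f"target head effect:    {target_effect:.3f}")
print(f"mean baseline effect:  {baseline_effect:.3f}")
```

Repeating the comparison at every checkpoint is what turns a single ablation result into a developmental one: the gap between target and baseline effects should only open up once the behavior exists to be disrupted.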

A few caveats are in order. First, these ablations reduced performance, but they didn’t entirely knock it out. This was particularly true for larger models, in which there was considerable redundancy across attention heads: thus, it stands to reason that knocking out any individual head wouldn’t have a huge effect on performance.

Second, this doesn’t entail that these heads are therefore “disambiguation heads”. Indeed, determining the functional scope of a model component is in part a philosophical problem, which depends on carefully defining the conceptual boundaries of a behavior or mechanism. We carried out a range of “stress tests” to determine the robustness of each head’s behavior to various stimulus perturbations. These are described in more detail in the paper, but briefly: heads of smaller models were not particularly robust to these perturbations, suggesting that they perform relatively simple operations (e.g., “1-back heads”); in contrast, heads of larger models were less brittle, suggesting they might perform a more generalizable operation (e.g., “noun modification”) robust to the part-of-speech or relative position of the disambiguating cue.

Third, and perhaps most importantly, this is a relatively simple behavior, characterized in only a handful of models. As I’ve written before—and as others have argued as well—a fully developed science of LLMs will depend on identifying robust, generalizable behaviors and mechanisms across LLMs that enable us to make accurate predictions about those models.

Causal history as an epistemic criterion

I’ve spent much of the last year or so thinking about the key epistemological challenges facing LLM-ology. Those challenges include questions of generalizability (which findings generalize to which models?), construct validity (how do we know if we’re measuring what we think we are?), and ontology (what kind of thing is an LLM anyway, and how ought we study it?). During this period, I’ve been reading more history and philosophy of science, trying to contextualize these questions in the challenges other disciplines have faced. The position I’ve gradually arrived at is something like a coherentist approach to the study of LLMs: for the most part, I don’t think we’ll ever identify procedures for producing determinative answers or conclusions about LLMs—instead, I think researchers need to carefully articulate their explanatory goals and theoretical commitments, then contextualize their claims in a “justificatory web” of mutually supporting pieces of evidence, which will offer a provisional understanding of how these systems work in the service of particular theoretical or practical goals.

The reason I bring this up is that I think such a perspective necessitates consideration of what might constitute this “justificatory web”. As I’ve written before, I think the scientific study of LLMs should be rooted, in part, in an attempt to understand them “on their own terms”. I continue to believe the lens of Cognitive Science can provide a useful perspective—at minimum, deriving inspiration from careful experimental design and control—but I’m also quite open to the possibility that a theory of LLMs might well be grounded in something that looks very different from a science of human cognition.

What might that be? Well, one very important property of language models is that they are trained. This training process can be viewed as an effective procedure for producing a configuration of weights that produces the kinds of behaviors we find so interesting. If we’re to understand LLMs on their own terms, I think we need to think of their training dynamics as central to the kind of thing they are. This is akin to an argument made in Biderman et al. (2026), which describes models as time-evolving processes:

Answering the question “why did a model do X on Y input” certainly has some utility (e.g. for corporations interested in product assurance, user-satisfaction, and compliance), but on a scientific level it’s fundamentally limited. A more scientific mindset would be to ask “why did the model develop this behavior?” This involves shifting from viewing models as static objects to viewing them as snapshots of time-evolving processes and studying the entire dynamical system (Saphra, 2023; Biderman et al., 2023b; Hoogland et al., 2023). When the object of study is the training process rather than the finished model, an account of that process can be applied [to] any model it produces, not to one specific set of weights (Sellam et al., 2022). (pg. 3)

There are still, of course, considerable degrees of freedom in how behaviors are characterized, and in the kinds of explanations for this behavior we might investigate or find satisfying. I’d advocate for pluralism in this respect: I think the field is too nascent to warrant extreme confidence about the right level of analysis here. But I do think that a scientific understanding of LLMs should be informed by an understanding of their causal histories, i.e., the story of how they got their weights.

[1] Note that when researchers in this space use words like “phase” or “phase transition”, they’re usually referring to something like a sharp discontinuity in some measurable behavior over the course of training, as opposed to incremental changes. The question of whether these behaviors should be called “emergent” has received some attention, given that “emergent” has specific meanings in the study of complex systems. For the purposes of this post, I’ll be using words like “phase shifts” in the sense described above, i.e., as sudden changes in some target behavior over the course of training.

Bad habits

Mon, 01 Jun 2026 15:02:26 GMT

In July of 2022, now almost four years ago, I wrote about the possibility that large language models (LLMs) could change the way we use language. This was pre-ChatGPT, but after the release of GPT-3. It was a strange liminal period for researchers studying LLMs: it seemed like something significant was happening, but it was difficult, at the time, to predict the shape of how these new systems would impact society.[1] Much has changed since then, to put it mildly.

In general, I am not one for forecasting, mostly because I am not very good at it, though I have great respect for others who are. Of more personal interest to me is whether someone produces a conceptual framework that makes future events more legible. From this perspective, that initial post succeeded in identifying (as did others at the time, and since) that LLMs and other machine learning technologies could reshape cultural practices in significant ways. I also continue to think the two competing hypotheses for how language, specifically, might change—what I called the nova hypothesis and the homogenization hypothesis—are interesting and helpful (and reality seems to be leaning towards homogenization, at least for now), though here, too, I owe a conceptual debt to the writings of Jenny Odell and others on topics like “algorithmic entombment”.

At the same time, my initial essays on this topic missed the mark in two pretty significant ways. First, I drastically underestimated the extent to which people would use LLMs to generate large swaths of text for them wholesale. I was imagining effects “around the margins”, so to speak, driven by a kind of sophisticated autocomplete; I did not imagine that people would use an LLM to write entire Substack posts or even academic papers. In retrospect, I think this is partly because of the state of the technology (LLMs have improved since 2022), and partly because of the interface in which the technology was embedded: a chat interface using instruction-tuned models makes it much easier to use these tools for freeform text generation.

Second, and more significantly, I underestimated how polarizing LLM-generated text would be. I assumed, of course, that people might not react positively to the idea that someone has sent them an email written entirely by an LLM. But I did not anticipate the level of antipathy or even revulsion that would develop towards synthetic text; indeed, I failed to predict even my own sense of despair upon encountering an online landscape increasingly filled with synthetic writing.[2]

Much has been written on the topic of LLM-generated text at this point, including by writers much more eloquent than me. I don’t presume to add much to this discourse at this point. To be honest, much of it makes me sad, even though I’ve written about it before and even investigated statistical signatures of LLM-generated text. I feel sad when I encounter text that seems to be LLM-generated, especially in the context of academic writing or peer review—which, unfortunately, has gotten much more frequent in the last year or so. But I also feel sad when I consider how frustrating it must be for someone to be incorrectly accused of using LLM-generated text in their prose. I feel sad that the well of online discourse has been further poisoned in this particular way.

But I do want to discuss two topics that are, I think, related. Why does LLM-generated text inspire such strong negative reactions in many people? And what attitude should we have towards using LLMs, especially in our own lives? In both cases, I can, of course, speak only for myself, though perhaps some generalizable lesson can be drawn.

Disgust, horror, mechanism

There’s been considerable discussion in recent weeks about a prize-winning short story that bears many signs of having been at least partially LLM-generated. I don’t much care for the story, and I suspect I’d feel that way regardless of whether I was primed to think it might’ve been LLM-generated (e.g., if I’d read it in 2016 rather than 2026).

One answer to why people don’t like LLM-generated text, then, is that it is bad, and distinctively so. “Bad”, here, might mean many different things to different people, but in my own experience, the aesthetic weaknesses of LLM-generated text often feel like exaggerations of stylistic motifs one might encounter in human writing—including human writing we might even consider good. The infamous em-dash is, of course, an example, and one I refuse to relinquish; but so is the “it’s not just X, it’s Y” construction, along with many of the other constructions we’ve come to associate with LLM-generated text.

In moderation, some of these constructions might be effective rhetorical devices, or might at least appear in the prose of effective rhetoricians. But LLMs make use of them, and in some cases misuse them, to a degree that lays bare their status as devices, resulting in the feeling that one is reading an essay or story by someone trying very hard to sound profound, but without the patience to construct an argument or narrative to earn that profundity.

Those devices really are recognizably human, though, even if now we take them to be undeniable signatures of LLM-generated text. Perhaps the grating irritation we feel upon encountering these ersatz constructions is driven in part by this recognition and the accompanying inference (regardless of whether it is true) that much of language is, in fact, a mechanical thing, akin to the horror some people might feel upon observing the ceaseless working of organs in the human body and feeling that one is, in the end, a kind of meat machine.

Simulacra

The problem with this aesthetic account is that it relies on the assumption that LLM-generated text is necessarily distinctive. But this is by no means guaranteed. LLMs can already pass as human in the adversarial context of an online Turing Test; it stands to reason that an unsuspecting reader or interlocutor could easily mistake synthetic text for human writing “in the wild”.3 Moreover, synthetic text is even now not a singular thing: people with particular interests or expertise can elicit strange and novel outputs from these systems that don’t exhibit the traditional hallmarks of LLM writing, some of which might even be quite interesting to read (provided it’s clearly marked, as I suggest below).

Thus, as an explanation for why people dislike LLM-generated text, I think the claim that the dislike is born specifically of a distinctive aesthetic quality falls short. It’s also insufficient for mounting a principled opposition to synthetic text making its way into certain domains of human writing. Put another way: I think many people don’t like the idea of reading LLM-generated text in certain contexts regardless of how that text is written—and basing an argument on certain contingent empirical facts about LLM-generated text makes that stance difficult to maintain if (say) models are changed in ways that eliminate current aesthetic signatures.

In many situations, we’re interested in reading something because we think a human wrote it. If we later find out that it was produced by an LLM—even if it is aesthetically indistinguishable from what some human might have written—we might feel irritated or even betrayed. The explanation for this might involve embarrassment (we were fooled), offense (we assume communication involves honesty and effort from both parties), or even loneliness (we thought we were communing with another person with thoughts and experiences of their own4). This is by no means an original thought, but it’s worth disentangling from the argument that people dislike LLM-generated text purely on the basis of its aesthetic properties. Indeed, in some cases we might generate a distaste for the aesthetics because of the source, not the other way around.5

For my part: if I discover that text was LLM-generated in a context when I was expecting and hoping to read something written by a human, I am upset for the reasons I enumerated above. But I am not intrinsically opposed to reading text produced by an LLM; I use LLMs frequently, in fact! In some cases, the fact that something was generated by an LLM is actually the source of my interest in it. I can even imagine creative uses of an LLM, for instance, that play with our understanding of how language works, or that reveal the intriguing or surprising consequences of representing language as a statistical process. In other cases—the cases when I want to read something by a human—the knowledge that something is LLM-generated might well make me lose interest in reading it, which I think is (again) a strong argument that LLM-generated text should come with a tag.

One interesting question is how the discovery that something was LLM-generated compares to the discovery that something was written by someone other than the person you thought wrote it. Two examples come to mind. First, in Her, Joaquin Phoenix’s character (Theodore) makes a living writing cards (condolences, anniversaries, etc.) for other people; it’s not clear to me whether the cards are passed off as written by the sender, but let’s suppose, for the moment, that they are. Second, the central conceit of Edmond Rostand’s Cyrano de Bergerac is, of course, that a handsome but inarticulate man (Christian) collaborates with a less handsome but very eloquent man (Cyrano) to woo the woman they both love.

Opinions differ, I’m sure, on the ethical dimensions of both Theodore’s and Cyrano’s actions here. But I suspect that even those critical of either protagonist would acknowledge that these situations seem distinct, somehow, from passing off LLM-generated text as your own. An obvious difference is that the “generative process” for producing the substitute language in both Her and Cyrano de Bergerac relies on another human with the capacity for phenomenological experience.

And in Cyrano de Bergerac specifically, that other human is in love with the recipient, just like the putative sender (Christian); the feelings the putative sender wishes to convey, then, are shared by the actual writer. All of which is to say: even if one is uncomfortable with Christian’s and Cyrano’s deceit, here, it feels simply like a categorically different kind of act than using an LLM to write (say) your love letters—though I acknowledge that one’s intuition on the distinction here will depend in part on one’s beliefs about the capacity for LLMs to have phenomenological experience.

My broader point, however, is that distaste for LLM-generated text is not solely an empirical property of the words or constructions contained in that text. It is also, and perhaps more fundamentally, a property of the process we believe was responsible for the text.

Modes of use

I’ve focused, so far, on the question of why many people have such strong distaste for LLM-generated text. But I think this issue is to some extent inextricable from the debates about using an LLM in one’s own creative or cognitive process. These debates often center around writing, for reasons I’ll discuss momentarily, but in principle the contours of the debates also fit with other areas of life.

I’ll be direct: I don’t like the idea of using an LLM in my writing process, and thus far have not found LLMs to be particularly useful either for ideation or for crafting prose. For me, this is most clearly true of creative writing (for which the idea of using an LLM feels like a category error6), but it also applies to writing these Substack posts and to academic writing. I have used LLMs for quickly finding typos in a large body of text, and I think they are fairly useful in this role.7

I mention this to point out that it is difficult, in fact, for me to imagine why I would want to use an LLM to write the things I write. It’s become something of a cliche to point out, but it really is true for me that the writing process is deeply intertwined with the thinking process. Writing clarifies thought, and it also identifies where thought is unclear; how often have I started an essay, confident that my point was clear, and realized in the act of writing that there was some deep confusion—or, more interestingly, that my point was actually something else entirely? Even when I do have a good idea of what I want to say ahead of time, writing adds meat to an outline’s bones. This is to say nothing of more meandering essays (like the ones I write at the Leaky Margin) or creative fiction.8

I should acknowledge, however, that I enjoy writing. Many people don’t enjoy writing, or they have to write prose in contexts where they don’t believe (rightly or wrongly) the act of writing is central to their thinking process. Here, an instructive contrast might be with programming: lest you think me some kind of purist, this is one area where I do regularly rely on LLMs. LLMs really have gotten quite good at writing code, at least for applications that are prevalent in their training data. This makes them useful for the kinds of targeted modeling or analysis scripts I’d usually write by hand. They’re also useful for walking through code I don’t understand, line by line.

Why do I feel comfortable using LLMs to write code but not prose? The answer is not as simple as suggesting that I enjoy the latter but not the former—in part because I do, in fact, enjoy writing code in some cases. I think the more accurate, and more interesting, explanation is that there are some tasks that seem purely instrumental (a means to an outcome), and other tasks where the doing of the task seems somehow constitutive of the outcome. For me, the point of writing is not (only) to produce a chunk of text that efficiently conveys a thesis to a reader; the point is also to craft that text myself. I’m reminded, here, of something I read in an essay by Derek Thompson a few months back:

As AI gets better at automating more tasks, I suspect that students and workers will have to cultivate and sustain a new kind of wisdom. They’ll have to answer the question: What are the parts of life where I could use AI, but I shouldn’t, because I want to protect this skill or habit from atrophy?
I loved this bit of wisdom from the author Agustin Lebron. A simple way to figure out whether to use AI at work, or in life, is to think about the difference between a gym and a job. At a gym, the point isn’t for the weight to be lifted, but for you to lift the weight. At a mere job, however, “the point is for the weight to be lifted.”
Use AI for the jobs in your life. Don’t use AI for the gyms in your life.

The personal challenge that I think many people (including myself) will face is delineating “jobs” and “gyms”. I worry that there are currently too many pressures pushing people to automate their cognitive processes with an LLM: a fear of “falling behind”, for instance, which is exacerbated by the rhetoric one encounters from official and unofficial marketing for LLMs. Even in the absence of this fear, though, I think there would still be a deep temptation to reach for an LLM when one encounters a moment of struggle. It takes a great deal of intentionality to determine, first, that one wants to do something oneself; and a great deal of inhibitory control to resist, in the moment of struggle, that temptation to reach for a potentially easy solution.

It is my belief that this will largely be a question of habit.

The grooves of thought

One of my favorite essays by William James is his chapter on Habit. In it, he emphasizes the importance of routine action in inculcating the principles and virtues one wishes to uphold. Each time we take an action, we might think of it as deepening a particular set of grooves in our mind that make that action easier to take in the future—and, perhaps, an alternative action less easy to take or even contemplate. He writes (bolding mine):

Seize the Very first possible opportunity to act on every resolution you make, and on every emotional prompting you may experience in the direction of the habits you aspire to gain. It is not in the moment of their forming, but in the moment of their producing motor effects, that resolves and aspirations communicate the new 'set' to the brain…A tendency to act only becomes effectively ingrained in us in proportion to the uninterrupted frequency with which the actions actually occur, and the brain 'grows' to their use.

Moreover, a failure to act on our convictions will over time weaken those convictions and resolve (bolding again mine):

These latter cases make us aware that it is not simply particular lines of discharge, but also general forms of discharge, that seem to be grooved out by habit in the brain. Just as, if we let our emotions evaporate, they get into a way of evaporating; so there is reason to suppose that if we often flinch from making an effort, before we know it the effort-making capacity will be gone; and that, if we suffer the wandering of our attention, presently it will wander all the time.

I think this is all quite relevant to our use of LLMs and external tools or devices more generally. Anyone who’s tried to reduce their screen time likely knows that it is difficult to do so without the use of some additional commitment device, like leaving your phone at home or disabling access to (say) social media. Speaking as someone who’s taken, at various points, both actions: these commitment devices actually alert you to the strange sensations that occasionally bubble up, typically in moments of anxiety or boredom, which manifest as an impulse to “check your phone”. When your phone is in your pocket, the impulse is essentially continuous with the “motor effect” (to quote James). But when your phone is back at home, you cannot, of course, act on the impulse; the sensation thus becomes more perceptible and discontinuous from the motor effect.

What I am describing here is what, I think, others might refer to as mindfulness: a kind of attention to our interior thoughts and impulses and an ability to recognize them as, in principle, separable from our motor actions. My suggestion is that on an individual level, we would likely benefit from this kind of mindfulness regarding our use of LLMs. As I’ve tried to argue throughout this essay, I am not opposed in principle to the use of an LLM in any context (though I understand that some are); I also recognize that people will likely come to different conclusions about which processes they view as instrumental and which they view as constitutive. But I do think people should be clear-eyed about the costs involved in each case, and I also think that, having identified the things they wish to do and understand themselves, it is useful to view the use of LLMs as a kind of “habit”, which might well be a bad habit in many circumstances.

For me, that means using LLMs with a very clear goal in mind, for tasks that I don’t particularly need or want to do myself. Perhaps it is their interactive structure, or perhaps I am particularly susceptible, but I find that if I consult an LLM with a more open-ended question, it is easy to be “tugged” along in the course of an interaction in ways that feel disconcertingly out of my own control. That doesn’t mean these goals are restricted to identifying typos or writing boilerplate code: I recently used Claude to help walk through the math in this 2021 paper. But I like to know what I want from an interaction before I start it.

It is strange to have to make these decisions. Something I have not mentioned here is, of course, the more fundamental question of whether an LLM can competently perform a task in the first place; but I take that to be separable from the question of whether, if it can perform the task, it should. I have also neglected the topic of education, which is full of tasks that appear instrumental but are at least intended to be constitutive. My point here is more personal: I do not presume to prescribe or proscribe anything for anyone else, but I hope that in enumerating my own thoughts, others might be given some kind of conceptual framework for navigating these issues themselves.

1. Though, of course, many people tried (including me, in that original article).

2. Actually, upon reading this introduction, my wife kindly reminded me that I did, in fact, anticipate some of this despair, bemoaning at some point in mid-2022 the question of my own value in a universe of simulacra.

3. Here, a reader might object that people with more experience reading LLM-generated text are better at identifying it. This is true (for now), and as a practical matter it is, of course, relevant to the question of detection. But I think it’s separable from the question of why people don’t like LLM-generated text.

4. But what if, a reader might object, LLMs are conscious entities themselves? I’m not going to address this debate here, but I think the sense of betrayal still holds: even if you think Claude is conscious, you might still be frustrated by someone passing off Claude’s text as their own.

5. It’s worth noting that the social norms here are still developing, and it might be some time before different people align on their communicative preferences. Just as politeness norms vary substantially across cultures, norms about whether and when it is appropriate to use LLM-generated text will likely vary across people and contexts. Some people might object to the use of an LLM for any purpose. Others might distinguish between “cosmetic” uses (e.g., copyediting an essay) and more “generative” uses (e.g., prompting an LLM to write the entire essay). Still others might be entirely fine with reading any LLM-generated text as long as it is marked as such.

6. Again, this is not to say the outputs of an LLM can’t make for interesting art, but I suspect that this art would look quite different from (say) prompting ChatGPT to write a short story, and might (for example) center instead around pushing the model into different regions of state-space, so to speak, revealing properties of language and how the model represents linguistic structure. (I’m thinking, here, of a rough analogy to the visual art Eryk Salvaggio has made.)

7. Though they are non-deterministic in a way that makes them different from, say, a standard grammar-checker. Pasting the same essay into ChatGPT or Claude multiple times might reveal different grammatical or spelling errors.

8. This is also true, I think, of reading. I, like Daniel Muñoz, do not much care for the idea of replacing reading books with “vibe-reading”. I do not think reading a summary of a text gives you the “same information” as engaging with the text itself. Ezra Klein has described the belief that these things are equivalent as something like the “Matrix view of the mind”, in which one can simply “download” the raw, distilled “information” from a text into your brain. This is not to say reading summaries is bad! It just seems obviously different to me than what happens in your mind when you read the source.

The problem of induction (heads), pt. II

Sean Trott — Wed, 20 May 2026 14:51:35 GMT

In my last post, I discussed research on induction heads: parts of large language models (LLMs) that learn to attend to repeated sequences in the context and enable LLMs to copy strings of text.

I suggested that as scientific constructs, induction heads have several explanatory virtues. First, their definition is closely tied to the units over which LLMs operate (token sequences) and thus characterize LLMs “on their own terms” rather than imposing human psychological constructs onto the system, something I’ve previously argued should be an explanatory goal of LLM-ology. Second, they appear to be relatively generalizable: induction heads are observed across LLMs of various sizes and tend to emerge at roughly similar points during training. Third, early work suggested they may account for in-context learning (ICL) in LLMs. That is, the same mechanisms enabling exact copying appeared to also enable a kind of “fuzzy copying”, which researchers argued is what allows LLMs to quickly extrapolate the goal of a task from a few examples (i.e., “few-shot learning”). In terms of broader scientific interest, this is arguably the most important explanatory virtue: it demonstrates the viability of mechanistic interpretability in accounting for high-level, “macroscopic” behaviors of a model, providing evidence that such explanatory bridging is, in fact, possible.

But as I described in that post—and as early work acknowledged—insights about the role of induction heads in ICL were mostly derived from small models, and relied on specific ways of operationalizing ICL. It’s unclear, then, whether those insights generalize to larger models. Put another way: even if larger models have heads that copy exact sequences (which they do appear to), we can’t necessarily assume those heads play the same functional role in the emergent behaviors we care about (like ICL).

Indeed, in the last few years, evidence has accumulated suggesting that the picture is considerably more complicated than initially thought. Independent of the research on induction heads, researchers have identified a different “kind” of attention head that might also play a role in ICL; recent work suggests that these function vector (FV) heads might actually be more important for ICL than induction heads. Moreover, there may even be multiple kinds of induction heads: some heads appear to copy exact sequences, while others copy something more analogous to synonyms or paraphrases.

Finally, all of this is further complicated by the fact that delineating model components into discrete “types” is not a straightforward matter. How cleanly can FV heads and induction heads be separated, really? In fact, as I’ll discuss below, some of these functions appear to be deeply related: some heads look like induction heads early on in pretraining, then turn into FV heads—suggesting that something like exact copying might be a kind of precursor to these other, perhaps more sophisticated functions.

As a case study, then, I think induction heads illustrate some of the key philosophical challenges interpretability researchers face; they also point us towards some of the methodological and conceptual perspectives that might lead to scientific insight about LLMs. With that in mind, let’s turn to the evidence.

What are FV heads and how do we measure them?

Function vector heads (or FV heads) were first characterized in a 2024 paper led by Eric Todd.

The paper is motivated by the question of how, exactly, LLMs perform “in-context learning” (ICL). Concretely, successful ICL requires LLMs to identify and extract structured patterns in an input string and apply that same pattern to future inputs. We might think of these structured patterns as “functions”. For example, an input string might consist of translation pairs:

gato —> cat
perro —> dog
pájaro —> ___

In this case, the function is something like translate from Spanish to English. A closely related function might be translate from Russian to French. In principle, however, many different functions or tasks could be expressed this way, such as match capital to country:

Berlin —> Germany
Paris —> France
Beijing —> ___

Or match musician to canonical instrument:1

Miles Davis —> trumpet
Duke Ellington —> piano
Charles Mingus —> ___

With a bit of creativity, one could even express more complicated question/answer pairs this way, then ask whether an LLM can induce the right pattern from a few examples and complete additional cases. Traditionally, machine learning models had to be explicitly trained to learn specific functions using annotated data; the ostensible benefit of ICL is that models could perform a variety of tasks without any weight updates at all, as long as the function required to perform the task could be induced from the examples or the instructions.
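To make the prompt format concrete, here is a minimal sketch of how such few-shot prompts could be assembled from demonstration pairs. The function name `build_icl_prompt` and the plain `->` separator are my own illustrative choices, not something taken from the papers discussed here.

```python
def build_icl_prompt(demos, query, arrow="->"):
    """Format (input, output) demonstrations followed by one unanswered query."""
    lines = [f"{x} {arrow} {y}" for x, y in demos]
    lines.append(f"{query} {arrow}")  # the model is expected to fill in the answer
    return "\n".join(lines)


# The Spanish-to-English example from above: two demonstrations, one query.
prompt = build_icl_prompt([("gato", "cat"), ("perro", "dog")], "pájaro")
print(prompt)
```

Nothing in the prompt names the task; the "function" has to be induced from the pairs alone.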

How, then, do LLMs extract and represent these functions? As I wrote last time, induction heads were one promising mechanistic explanation, based on the idea that they enabled a kind of “fuzzy copying” from past input.

A parallel thread of research, however, identified FV heads as another plausible candidate. The underlying experimental logic of this work is relatively straightforward: which attention heads are most important for a model’s ICL performance, and how robust is the role of those heads across different kinds of ICL tasks (e.g., translation, matching capitals to countries, etc.)? There are a number of ways that one could determine the “importance” of a model component for a particular task, but the authors (correctly, in my view) focused on identifying heads with a putatively causal role. Specifically, they used a technique called activation patching, which I’ve written about in detail before; I’ll briefly describe the steps here.

First, the authors presented the prompts for a range of ICL tasks to a given model. Each prompt can be described in terms of a series of input/output pairs, e.g., (x_i, y_i), demonstrating the task in question (e.g., identifying antonyms, translating from Spanish to English, conjugating verbs, and so on). For each attention head in the model, the authors recorded the mean activation2 of that head for all prompts of a given task type. Roughly, this answers the question: what information does this head produce, on average, for a given ICL task?
Second, the authors created a shuffled or “corrupted” dataset, in which the input/output pairs are jumbled up, such that x_i is no longer paired with y_i. For example, “gato” might be paired with “bird”, while “perro” might be paired with “table”. This removes most reliable information about the task from the prompt (and in fact makes the prompt misleading); unsurprisingly, this impairs performance on the task.
Third, and crucially, the authors performed a selective substitution procedure, in which the mean activation of each head from step (1) is “patched” onto the corresponding head in step (2). In activation patching, this is sometimes called the corrupted-with-restoration run. The authors measured the performance of the LLM for each patched head. They also ran this patching procedure in a “zero-shot” context, i.e., on prompts that contain a query but no demonstrations at all.

The critical question is as follows: which heads, when patched from the “clean” model runs in (1) to the “corrupted” model runs in (2) (or to runs with no demonstrations at all), are most effective at restoring performance? We know the model didn’t observe the correct input/output pairs in step (2)—making it difficult or impossible to induce the structure of the task—so the difference in performance can be causally attributed to the information contained in a given head from step (1).
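The logic of the three steps can be sketched with a toy stand-in for a model. Everything below is an assumption made for illustration: the "model" is just a sum of per-head vectors passed through a fixed readout, the clean and corrupted activations are synthetic, and heads 1 and 4 are made task-relevant by construction. It is not the authors' code, and a real experiment would hook into an actual transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model, n_prompts = 6, 8, 50

# Hypothetical toy setup: each head writes a vector into a shared stream,
# and a fixed unit-length readout direction scores the correct answer.
readout = rng.normal(size=d_model)
readout /= np.linalg.norm(readout)

def p_correct(head_outputs):
    """Toy 'model': sigmoid of the readout applied to the summed head outputs."""
    logit = head_outputs.sum(axis=0) @ readout
    return 1.0 / (1.0 + np.exp(-logit))

# Step 1 (clean runs): only heads 1 and 4 carry task information, by construction.
task_heads = [1, 4]
clean = rng.normal(scale=0.1, size=(n_prompts, n_heads, d_model))
clean[:, task_heads, :] += 2.0 * readout
mean_clean = clean.mean(axis=0)  # mean activation per head: (n_heads, d_model)

# Step 2 (corrupted runs): shuffled labels, so no head carries task information.
corrupted = rng.normal(scale=0.1, size=(n_prompts, n_heads, d_model))

# Step 3: patch one head at a time into each corrupted run and measure the gain.
def average_indirect_effect(head):
    gains = []
    for run in corrupted:
        patched = run.copy()
        patched[head] = mean_clean[head]
        gains.append(p_correct(patched) - p_correct(run))
    return float(np.mean(gains))

aie = np.array([average_indirect_effect(h) for h in range(n_heads)])
top = np.argsort(aie)[::-1][:2]
print(np.round(aie, 3), sorted(top.tolist()))
```

In this toy, the two heads built to carry task information come out with the largest average gain, while patching the others changes little. The real result is only interesting because nothing guarantees such heads exist in an actual LLM.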

The reason I’m belaboring this description is to make it clear that these heads are defined in terms of their causal contribution to ICL tasks. Each head can be given a score representing its contribution to each of the tasks, i.e., the extent to which patching that head onto the “corrupted” runs made a positive difference. It’s worth noting here that it’s not trivially obvious that any individual heads should be clearly identifiable using this procedure—especially across distinct ICL tasks. But as it turns out, each of the models the authors investigated (GPT-J, GPT-NeoX, and Llama 2) had a handful of heads that received positive scores. That is, the activation from these heads reliably contained information that allowed models to complete a range of ICL tasks.

Figure 3a from Todd et al. (2024). Each cell represents a particular attention head from GPT-J. The color of the cell indicates that head’s average indirect effect (AIE) using the activation patching procedure. A higher score corresponds to heads that made a positive contribution, i.e., patching them from the clean to the corrupted run helped the model solve the task.

The authors suggest that these heads can be conceptualized as “transporting information identifying the demonstrated ICL task” (pg. 5). The average activation of these heads over examples from a given ICL task can thus be viewed as a “function vector”. We might think of this function vector as containing executable instructions for how the LLM should respond to future inputs.
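As a rough sketch of how such a vector could be assembled: average each head's output over prompts from one task, pick the heads with the highest causal scores, and combine their means. Summing across the selected heads reflects my reading of the paper's construction and should be treated as an assumption, as should the array shapes, the choice of `k`, and the random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts, n_layers, n_heads, d_model = 40, 4, 8, 16

# Hypothetical recorded head outputs for one task: (prompt, layer, head, d_model).
head_out = rng.normal(size=(n_prompts, n_layers, n_heads, d_model))
# Hypothetical causal score per head, e.g. from activation patching.
aie = rng.random(size=(n_layers, n_heads))

def function_vector(head_out, aie, k=3):
    """Sum the task-mean outputs of the k heads with the highest scores."""
    mean_out = head_out.mean(axis=0)                 # (layers, heads, d_model)
    flat = np.argsort(aie, axis=None)[::-1][:k]      # flat indices of the top-k scores
    layers, heads = np.unravel_index(flat, aie.shape)
    selected = list(zip(layers.tolist(), heads.tolist()))
    return mean_out[layers, heads].sum(axis=0), selected

fv, top_heads = function_vector(head_out, aie, k=3)
print(fv.shape, top_heads)
```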

The authors perform a range of experiments, including asking whether these function vectors can be injected into arbitrary layers of the model. The authors find that they can to some extent: specifically, injecting function vectors into early or middle layers restores performance, but injecting them into later layers does not (see below). As the authors point out, this could suggest that function vectors are responsible for “triggering” particular downstream operations in the model (i.e., in later layers); the “instructions” contained in a function vector must therefore be available sufficiently early on such that these operations can be triggered in time.

Excerpt of Figure 4 from Todd et al. (2024). Here, the authors ask about the effects of injecting function vectors into various layers of the model, e.g., for the antonym task. Chance performance (dotted line) is effectively 0%. Injecting function vectors into early and middle layers of the model nontrivially restores performance (e.g., to ~50% or more), but injecting them into later layers has little to no effect.
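Mechanically, "injecting" a vector at a layer just means adding it to the hidden state at that point and letting the remaining layers run as usual. The sketch below shows only that mechanic on a toy residual stack with random weights (all hypothetical); it says nothing about which layers help in a real model, which is the empirical finding in the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, d_model = 6, 16
# Hypothetical toy layers: random weights standing in for transformer blocks.
weights = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]

def forward(h0, inject_at=None, vector=None):
    """Run the toy stack; optionally add `vector` to the state after layer `inject_at`."""
    h, states = h0.copy(), []
    for layer, w in enumerate(weights):
        h = h + np.tanh(w @ h)       # residual update
        if layer == inject_at:
            h = h + vector           # the injection step
        states.append(h.copy())
    return states

h0 = rng.normal(size=d_model)
v = rng.normal(size=d_model)         # stand-in for a function vector
base = forward(h0)
injected = forward(h0, inject_at=3, vector=v)
changed = [not np.allclose(b, i) for b, i in zip(base, injected)]
print(changed)
```

Only the layers at or after the injection point see a different hidden state, which is why an instruction injected too late has few downstream operations left to trigger.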

Which attention heads matter for ICL?

Function vector heads, then, do seem to make a causal, positive contribution to ICL. One plausible interpretation of these heads is that they extract and encode executable “instructions”, i.e., the “function” implied by a particular pattern of relationships in the input. Presumably, such “instructions” can only be extracted because the pattern is sufficiently clear in the input and accessible to an LLM. (That is, it’s conceivable that a given LLM may not be able to recognize certain kinds of patterns or relationships.) Provided these conditions are met, FV heads seem to enable in-context learning.

Yet as I wrote in my last post, induction heads are another plausible mechanistic candidate for ICL. Do both kinds of heads matter? Are FV heads more important than induction heads, or vice versa? Or are they, in fact, the same “kind” of head—just identified using different procedures?

Yin & Steinhardt (2025) ask this question systematically, examining a range of open-weight models of various sizes (including Llama-2-7B, as well as those from the Pythia and GPT-2 model families). In each model, they identify candidate induction heads (using the prefix-matching score approach I’ve written about before) and candidate FV heads (using the approach I described above). That means each attention head in each model is now associated with two “scores”: 1) an induction score; and 2) an FV score. In turn, heads can be characterized in terms of their percentile along each distribution of scores (e.g., the heads in the top 1% of induction scores).
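Converting raw scores to percentiles is what makes the two metrics comparable, since induction scores and FV scores live on different scales. A minimal sketch, with a made-up helper name and made-up scores for five heads:

```python
import numpy as np

def percentile_rank(scores):
    """Percentile (0 to 100) of each head within its own score distribution."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort()       # 0 = lowest score
    return 100.0 * ranks / (len(scores) - 1)

# Hypothetical scores for five heads on each metric.
induction_scores = [0.02, 0.90, 0.10, 0.45, 0.30]
fv_scores = [0.001, 0.004, 0.0005, 0.009, 0.002]

induction_pct = percentile_rank(induction_scores)
fv_pct = percentile_rank(fv_scores)
for head, (i, f) in enumerate(zip(induction_pct, fv_pct)):
    print(f"head {head}: induction percentile {i:.0f}, FV percentile {f:.0f}")
```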

The authors first compare the attention heads in the top 2% of induction scores to those in the top 2% of FV scores. In principle, these could have been the same heads—i.e., if the induction heads are identical to the FV heads. In practice, however, there is very little to no overlap: the heads with the very highest induction scores are not those with the very highest FV scores. On the other hand, the scores are correlated, i.e., heads in the top 2% of induction scores are frequently in the top ~5-10% of FV scores. The authors write:

In most models, FV heads are at around the 90-95th percentile of induction scores, and vice versa. Therefore, although there is little overlap between the sets of induction and FV heads, induction and FV scores are correlated: FV heads have high induction scores relative to other attention heads, and induction heads have relatively high FV scores. (pg. 4.)

Already, this points to a somewhat nuanced picture in terms of how we interpret different “kinds” of heads. These metrics are not identical and do not pick out exactly the same model components; at the same time, they aren’t entirely unrelated.
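It may seem odd that two scores can be correlated while their top sets barely overlap, so here is a small simulation of that situation. The scores are synthetic (a shared component plus independent noise, which is purely my assumption about how such a pattern could arise) and are not the paper's data.

```python
import numpy as np

def top_fraction(scores, frac=0.02):
    """Indices of the heads in the top `frac` of a score distribution."""
    k = max(1, int(round(frac * len(scores))))
    return set(np.argsort(scores)[::-1][:k].tolist())

def spearman(a, b):
    """Rank correlation, computed as the Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(3)
n_heads = 1000
shared = rng.normal(size=n_heads)                 # component common to both scores
induction = shared + rng.normal(size=n_heads)     # hypothetical induction scores
fv = shared + rng.normal(size=n_heads)            # hypothetical FV scores

ind_top, fv_top = top_fraction(induction), top_fraction(fv)
rho = spearman(induction, fv)
print(f"overlap: {len(ind_top & fv_top)} of {len(ind_top)} heads, rank correlation {rho:.2f}")
```

The point of the exercise is that "correlated" and "same heads" are different claims: a moderate correlation across all heads is compatible with the extreme tails picking out mostly different components.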

The crucial question, then, is which of these heads actually matter for ICL. To test this, the authors systematically ablate (or “knock out”) three groups of heads: 1) those with very high FV scores; 2) those with very high induction scores; and 3) random heads, as a baseline for what to expect when ablating any given head from a model.3 They then quantify the impact of ablating these different groups of heads on model performance for a range of ICL tasks. For me, the key takeaway is as follows (bolding the authors’):

These results revealed that ablating FV heads caused greater degradation in few-shot ICL performance compared to ablating induction heads, with this disparity becoming more pronounced in larger models….When ablating induction heads while preserving the top 2% FV heads, we observe minimal impact on few-shot ICL performance – comparable to random ablations in models exceeding 1B parameters. Conversely, ablating FV heads while preserving induction heads continues to significantly impair ICL performance.
These ablations suggest that the contributions of induction heads to ICL in the top row of Figure 4 mostly come from heads that are both induction and FV heads, and that FV heads matter the most for few-shot ICL: as long as the model preserves its top 2% FV heads, it can perform ICL with reasonable accuracy even if we ablate induction heads. (pg. 6)

Based on my read of the results, the authors’ interpretation seems fair: it does seem that FV heads play a stronger functional role in ICL than induction heads, at least using these procedures for identifying each head type.
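The three ablation groups described above can be sketched as a simple set construction. This is a hedged illustration: the scores are made up, `ablation_groups` is a hypothetical helper, and drawing the random baseline only from heads outside both top sets is an assumption rather than a detail confirmed here.

```python
import random

def top_k_heads(scores, k):
    """Indices of the k highest-scoring heads."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def ablation_groups(induction_scores, fv_scores, fraction=0.02, seed=0):
    """Build three disjoint sets of heads to ablate."""
    n = len(induction_scores)
    k = max(1, round(n * fraction))
    top_ind = top_k_heads(induction_scores, k)
    top_fv = top_k_heads(fv_scores, k)
    # Drop heads that qualify under both metrics, so each group is "pure".
    fv_only = top_fv - top_ind
    induction_only = top_ind - top_fv
    # Assumption: the random baseline is drawn from heads in neither top set.
    remaining = [i for i in range(n) if i not in top_ind and i not in top_fv]
    rng = random.Random(seed)
    random_heads = set(rng.sample(remaining, min(k, len(remaining))))
    return {"fv_only": fv_only, "induction_only": induction_only, "random": random_heads}

# Made-up scores for 200 hypothetical heads (not real data).
rng = random.Random(1)
induction_scores = [rng.random() for _ in range(200)]
fv_scores = [rng.random() for _ in range(200)]
groups = ablation_groups(induction_scores, fv_scores)
for name, heads in groups.items():
    print(name, sorted(heads))
```

Each group would then be knocked out in turn, with ICL accuracy compared against the random baseline.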

One question, though, is whether the types are entirely distinct, or whether they’re related in certain ways. The authors explore this by examining the developmental trajectory of head types during training. That is, rather than looking only at the behavior of heads at the final checkpoint of a model, they characterize their behavior throughout the whole training process, allowing them to ask when certain behaviors develop and whether induction-like and FV-like behaviors are related processes. (See my previous post on generalizability in LLMs for more details on characterizing head behaviors throughout pretraining.)

The most interesting finding is depicted in Figure 6 of the paper: a number of heads that look like FV heads at the final checkpoint also look like induction heads at earlier checkpoints. Yet the inverse is not found: heads that look like induction heads at the final checkpoint do not look like FV heads earlier in training. Moreover, induction-like behavior tends to emerge earlier than FV-like behavior in all models tested. Together, these results suggest that at least certain attention heads exhibit multiple “stages” of behavior throughout training. Perhaps, for these heads, induction (literal string copying) acts as a kind of developmental precursor to their role in ICL (requiring some form of rudimentary compositional abstraction).

This is roughly what the authors suggest as well:

Our first conjecture (C1) posits that induction heads are an early version of FV heads. Under this interpretation, induction heads serve as a stepping stone for models to develop the more sophisticated FV mechanism. As FV heads emerge and prove more effective at ICL tasks, they gradually supersede the simpler induction mechanism. (pg. 8).

Figure 6 from Yin & Steinhardt (2025). Heads were identified as induction heads or FV heads based on their final checkpoint behavior. Some proportion of FV-like heads (based on their final checkpoint behavior) also exhibit induction-like behavior throughout pretraining (top), though the reverse is not found for heads identified as induction heads at the final checkpoint.

Of course, as the authors also point out, the picture is likely not so simple as “heads first become induction heads, then FV heads, and the latter are what matter for ICL”. First, not all induction heads end up looking like FV heads, so clearly the developmental process is not deterministic. This raises the question of why certain induction-like heads remain induction heads, while others appear to change their behavior.

Second, improvements in ICL first appear following the development of induction heads, and subsequent improvements are observed following the development of FV heads. This suggests that the functional split between these heads is not cut and dried, particularly from a developmental perspective: induction-like heads do seem to facilitate something like ICL, albeit to a substantially lesser degree than FV-like heads.

The puzzle of induction heads

I began this pair of posts with the observation that induction heads represent an interesting case study in the promises and challenges of mechanistic interpretability. In particular, induction heads are associated with a set of explanatory virtues, including their apparent ability to account for high-level behaviors like in-context learning. As I discussed here, however, recent work casts doubt on the claim that induction heads specifically play a causal role in ICL—in contrast, function vector (FV) heads appear to play a stronger role.

Where does this leave us, philosophically and practically?

Distinct types?

The first philosophical issue is whether it is fair to say these are distinct “types” of heads, particularly given the developmental overlap in their behavior. Often, apparent distinctions in kind can also be described as distinctions in degree. Analogous questions pervade the natural sciences, especially biology: when should we label two organisms as distinct species, or two cells as distinct types? I’m generally inclined to say that there is no “objectively correct” answer to this question: it’s a matter of practical utility. Thus, the question in such cases is whether the patterns of variance observed across the continuum exhibit differences that might be more usefully and succinctly described as differences in kind.

Moreover, as others have pointed out, distinctions are often made on the basis of multiple criteria—for example, cells in the brain might be classified on the basis of their gene expression profiles, their morphological structure, their anatomical location and patterns of connectivity with other cells, their firing rate, their apparent function in cognitive phenomena, and more. As I’ve argued before, I think taxonomies of LLM components should adopt a similar strategy and identify multiple axes of correspondence or divergence.

I’m not sure there’s enough data yet to say whether these should be considered meaningfully distinct types in the LLMs investigated by the authors. On the one hand, there’s very little overlap among the top induction heads and top FV heads, and the latter seem to be more implicated in ICL. On the other hand, the two phenomena do seem to be somewhat related developmentally—some fraction of FV heads look like induction heads earlier in training. Personally, I think that a taxonomy of attention heads—to the extent that such a goal is desirable and feasible—should take into account not only the final behavior of a model component, but the developmental process by which that component arrived at that behavior. Such a perspective would, in my view, enrich our understanding of LLMs and how they work.

A complicated account

One of the chief explanatory virtues of induction heads was that they ostensibly accounted for macroscopic behaviors like ICL. Is that no longer true? And what have we learned, at the end of the day?

Certainly, the picture we have now is much more complicated than the account initially given by Olsson et al. (2022), which suggested that induction heads are functionally involved in ICL. While this was true of the smaller, attention-only models tested in that 2022 paper, it does not appear to be straightforwardly true for larger models. To the extent that induction heads are clearly separable from FV heads, the latter are more important for ICL than the former.

Yet as I noted above, the question of “distinct types” is not itself straightforward. If induction heads are, in some cases, a kind of “precursor” to FV heads—which, to be clear, is a speculative claim—we might postulate that certain induction heads are an important developmental stage, and that they enable ICL through their eventual transformation into FV heads.

The last thing I’ll mention is that we’ve really only scratched the surface in terms of the space of possible models we’ve investigated here. I think it’s still really unclear which findings about LLMs are generalizable to a larger class of models, and even what it would mean to extrapolate a claim in the first place. It’s possible that extrapolations will be straightforward: maybe, as others have suggested, some model mechanisms will be “universal”. My baseline assumption, however, is that the space of possible solutions is vast, and we should expect some degree of heterogeneity as a default. In all likelihood, our explanations about LLMs and how they work will get increasingly complicated—and conceptually richer—as we try to account for the behavior of a larger sample of models.

I’m aware that many musicians (including those listed) play multiple instruments!

Specifically, “activation” here refers to the output vector from the attention head, i.e., what the head is writing into the residual stream.

The authors selectively ablate FV heads not identified as induction heads, and vice versa.

The problem of induction (heads), pt. I

Sean Trott — Fri, 24 Apr 2026 18:13:47 GMT

A key aim of mechanistic interpretability research is to identify which internal components of a neural network—like a large language model (LLM)—produce observable behaviors.

Explanations, here, typically reference specific model components like token representations (the vectors representing particular inputs, which are successively contextualized throughout the layers of a model) and attention heads (the parts of an LLM that guide how information from the context is integrated with these token representations). For instance: if an LLM reliably solves simple addition problems (“1 + 2 = ___”), researchers might try to explain this behavior in terms of which token representations in which layers of the model carry information about the quantities to be summed, and which attention heads are responsible for contextualizing these representations.1

The problem is that this is all very difficult. There are at least two specific challenges, some of which I’ve written about before. The first, and most obvious, is that interesting behaviors (like “reasoning”) are often very complex, which makes them difficult to operationalize behaviorally—and it’s even more difficult to account for them in terms of low-level mechanisms that are also human-interpretable.2 Second, even if one does manage to identify a candidate mechanism, it’s very unclear whether it generalizes across different LLMs or even across tasks for the same LLM.

Because this is so difficult, I’m especially excited when I encounter research that appears to address these challenges. This is partly why I’ve long been interested in “induction heads”: attention heads that help LLMs predict upcoming tokens by attending to repeated sequences in the preceding context. These heads perform a relatively simple (and legible!) function and—according to early research—could be involved in more interesting behaviors, such as in-context learning (ICL). As candidate mechanisms, they enjoy a number of “explanatory virtues”. Yet recent research has also complicated the initial conception of induction heads.

In this two-part essay, I treat induction heads as a case study for mechanistic interpretability more broadly. In part I, I describe what induction heads are and how they’re often identified, and then discuss why they’ve received so much attention—they have a number of explanatory virtues that make them useful scientific constructs. In part II, I’ll discuss some challenges of this research program: specifically, I’ll describe the results of recent work suggesting that there may actually be multiple “kinds” of induction heads and what this means for the explanatory role of these model components.

Induction heads: the basics

The basic idea of an induction head is, as I wrote above, relatively straightforward.

Recall that the job of an LLM is to predict upcoming tokens (roughly, though not necessarily, like words) based on some context, which itself constitutes a sequence of tokens. LLMs learn to make these predictions by observing many examples of sentences in training, which allow them to learn certain kinds of regular, statistical relationships in language.

One of the simplest kinds of relationships in a piece of text is exact repetition. If you observe a short sequence of tokens earlier in the context (“a, b, c”) and then the first part of that sequence later on (“a, b, ___”), a reasonable strategy for predicting the next token might be to “look back” at which token completed the initial sequence.

This is exactly what induction heads do: given a partial sequence, they look for ways that sequence has been completed in the preceding context. This information can then be used to make predictions about upcoming tokens.
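As a rough illustration of the input-output rule (not of how attention heads implement it internally), the "look back" strategy can be written as a few lines of ordinary code. The function name `induction_predict` is hypothetical, and matching on only the single most recent token is a simplifying assumption.

```python
def induction_predict(context):
    """Predict the next token by copying whatever followed the most
    recent earlier occurrence of the current (final) token."""
    if not context:
        return None
    current = context[-1]
    # Scan backwards over earlier positions, skipping the final position itself.
    for i in range(len(context) - 2, -1, -1):
        if context[i] == current:
            return context[i + 1]
    return None  # the current token has not appeared before

print(induction_predict(["a", "b", "c", "a", "b"]))  # prints c
```

If the current token has never appeared before, the rule has nothing to copy, which is why this strategy only helps when the context contains repetition.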

Indeed, one of the approaches to identifying induction heads is to present LLMs with random sequences of tokens:

t1, t2, t3, t4 ||| t1, t2, t3, ___

Researchers can then calculate a few metrics (as in this 2025 paper by Aoyama et al.).

First, they can ask whether an LLM assigns high probability to “t4” as the most likely next token; this is sometimes called “associative recall” and refers to the LLM’s overall ability to perform this string repetition task.
Second, they can ask which attention heads direct attention from the second occurrence of “t3” (bolded in the example above) to the first occurrence of “t4” (also bolded). That is, which heads look from the current token to the token that previously completed an equivalent sequence? This latter metric is called a “prefix-matching score” and is commonly used as an index of how likely each individual head is to be an induction head.
And third, they can ask whether a given head plays a functional role in helping the model complete the sequence (i.e., increasing the probability of “t4”); this is often called the “logit attribution score”. This is, again, a measure of individual attention heads in a model.

Combined, these metrics do a pretty good job of triangulating which individual attention heads in a model might be induction heads.
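The prefix-matching score in particular is easy to sketch. The version below is a simplified assumption about how such a score could be computed, not a reproduction of the method in Aoyama et al.: it takes an attention matrix for a random sequence repeated twice and averages the attention each second-copy token pays to the token that followed its earlier occurrence. The attention matrices are hand-built rather than extracted from a model.

```python
def prefix_matching_score(attn, seq_len):
    """attn[q][k] is the attention weight from query position q to key
    position k, for an input made of a length-`seq_len` random sequence
    repeated twice. Returns the mean attention from each second-copy
    token to the token that followed its first-copy occurrence."""
    total, count = 0.0, 0
    # Skip the final position so that every target lies inside the first copy.
    for q in range(seq_len, 2 * seq_len - 1):
        target = q - seq_len + 1
        total += attn[q][target]
        count += 1
    return total / count

seq_len = 4
n = 2 * seq_len

# A hypothetical "perfect" induction head: all attention goes to the target.
ideal = [[0.0] * n for _ in range(n)]
for q in range(seq_len, n - 1):
    ideal[q][q - seq_len + 1] = 1.0

# A hypothetical uninformative head: uniform causal attention.
uniform = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]

print(prefix_matching_score(ideal, seq_len))
print(prefix_matching_score(uniform, seq_len))
```

In practice the score would be averaged over many random sequences, and a head would be flagged as a candidate only if it scores well above the uniform-attention baseline.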

To illustrate what this might look like, I calculated prefix-matching scores for each of the 24 heads (6 layers x 4 heads in each layer) in the final checkpoint of Pythia-14m 3, part of the Pythia suite trained by EleutherAI. The figure below depicts these (standardized4) scores in a heatmap, with each layer on the x-axis and each head on the y-axis. As you can see, a few heads stand out as especially strong candidates: heads (5, 1) and (5, 2). Heads (4, 2) and (5, 3) are also plausible candidates, though their prefix-matching scores are weaker. Every other head, notably, is pretty weak across the board.

Mean prefix-matching scores (standardized) for each of the 24 heads in the final checkpoint of seed 1 for Pythia-14m. Calculated using the method in Aoyama et al. (2025).

Note that this is not a causal analysis—it just shows the “attentional behavior” of each head. Nonetheless, attentional behavior can be a promising start. An in-depth analysis would, presumably, further investigate those heads by ablating them and examining the impact on token prediction. Ideally, we’d also trace these scores throughout pretraining and correlate them with model-level metrics like “associative recall”: in principle, developmental changes in a model’s ability to copy exact sequences should correspond to the onsets of these candidate induction heads.

At this point, however, some readers might be wondering: what’s the point? Exact copying has its uses, but it’s hardly the primary reason people are interested in LLMs. LLMs can complete all sorts of different sequences in flexible ways, recognizing conceptual patterns in the prompt and generalizing these to a new example. Crucially, early work on induction heads suggested they might play a role in this kind of powerful “in-context learning” (or “ICL”) behavior as well—which is where the explanatory virtues come in.

The explanatory virtues of induction heads

Induction heads are a kind of scientific construct specific to the field of mechanistic interpretability. And scientific constructs can be evaluated in terms of their utility: what do they do for us in terms of explaining the things we’d like to explain? From this perspective, there are a few key “virtues” of induction heads, which I’ll list in order of increasing importance to the field.

The first is one I’m partial to, which is that induction heads are clearly tied to the units over which LLMs operate (tokens) and the thing that LLMs are primarily designed to do (predict tokens on the basis of preceding tokens). As I’ve written before, I’m drawn to explanations that try to understand LLMs “on their own terms”, rather than projecting constructs from human psychology onto the system. In some cases, of course, it’s very difficult to avoid those kinds of high-level psychological abstractions, especially when discussing LLM behavior. But when we’re trying to understand internal causal mechanisms, I think it’s advisable to hew as closely to the computational substrate as possible—in much the same way that I think we should seek to understand the brain as a biological system, not merely as a device for “implementing” psychological functions.

The second virtue is that induction heads appear to be relatively generalizable mechanisms. A few months ago, I presented a paper at the NeurIPS Mechanistic Interpretability workshop discussing what it might mean to generalize findings about model mechanisms across model instances. I suggested that researchers investigate generalizability across several potential axes of correspondence: functional (do they do the same thing?), developmental (do they emerge at similar points in training?), positional (do they occupy similar absolute or relative positions in a network’s structure?), relational (do they form similar relational “circuits”?), and configurational (do they occupy similar regions of weight-space?).

Proposed axes of correspondence from my paper on generalizability in mechanistic interpretability.

Crucially, induction heads appear to satisfy (roughly) at least four of these criteria. By definition, they perform the same function. Developmentally, they also follow relatively similar trajectories across models. Indeed, in a longer version of that 2025 paper, I investigated the training dynamics of induction heads across different seeds and architectures in the Pythia suite, and found remarkably similar trajectories: induction heads reliably emerged when models had encountered between 1B and 2B tokens. For seeds of the same size, there was very little variance in the onset; across models of different sizes, induction heads tended to emerge slightly earlier in larger models than smaller models.

Training dynamics of induction heads across models in the Pythia suite. Red line represents the grand mean across all seeds; solid black line represents the mean of seeds of a given size (e.g., 14m parameters). From an extension to my 2025 paper.

Positionally, there’s slightly more variance, but most papers on induction heads suggest they tend to pop up around the middle layers of LLMs. And perhaps most crucially, induction heads have consistent relational structures with other internal mechanisms in a model. Specifically, induction heads are part of a “circuit” consisting of a previous-token head (which helps represent, for each token, the token that occurred just before) and an induction head (the part of the circuit that “looks back” to see how a sequence was completed previously). Finally, there’s also some evidence that induction heads are implemented by similar configurations of weights across models, though (at least to my knowledge) the configurational axis has been less studied in general, and the nature of what constitutes a “configurational explanation” is still unclear.

Induction heads, then, are tied to what LLMs fundamentally do (the first virtue) and generalizable (the second virtue). Their third virtue—and the reason they’ve received so much attention—is their (purported) ability to explain powerful downstream behaviors of interest, such as in-context learning (ICL). We might think of this virtue as something like “functional bridging”: induction heads, in this capacity, link simple, legible mechanisms to higher-level behaviors. I describe this work in more detail below.

Induction heads and in-context learning

The initial work on induction heads and ICL comes from this 2022 white paper by researchers at Anthropic. The paper brings a range of evidence to bear on the argument that induction heads are responsible not only for copying exact sequences, but for completing types of sequences. The idea here is captured by this description from the paper:

Specifically, the thesis is that there are circuits which have the same or similar mechanism to the 2-layer induction heads and which perform a “fuzzy” or “nearest neighbor” version of pattern completion, completing [A*][B*] … [A] → [B], where A* ≈ A and B* ≈ B are similar in some space; and furthermore, that these circuits implement most in-context learning in large models.

That is, induction heads allow for a kind of compositional abstraction, where the sequence being matched isn’t a literal exact match but a kind of pattern. Let’s consider the case of translation as an example:

gato —> cat
perro —> dog
pájaro —> ___

If you’re fluent in both Spanish and English, you’ll recognize that pájaro means “bird” and therefore that the most likely next token is, in fact, “bird”. Presumably, you recognize this because you identify a pattern in the context: the word on the left side of each “—>” symbol is a Spanish word, and the word on the right side is its English translation. Asked to complete this sequence, then, you understand that your job is to translate pájaro (or whatever Spanish word is presented next).

LLMs trained on sufficient English and Spanish data can do this kind of task pretty well. This is called in-context learning (ICL), and it’s one of the reasons people started getting very excited about LLMs back in ~2020. In the past, language models had to be fine-tuned to perform specific tasks like machine translation or summarization. This fine-tuning process required curated datasets and further training (i.e., updating the weights of the model), both of which were (somewhat) expensive and time-consuming. It also, crucially, limited the generality of these systems.

But models like GPT-3 (and beyond) appear to be able to “induce” certain kinds of behaviors from a small set of examples in the prompt without updating the weights at all. Thus, while it’s called “in-context learning”, no weight changes occur: it’s simply that the context contains enough information for the model to be able to extrapolate to new examples. Eventually, researchers explored a range of strategies for ICL, including giving models instructions about how to complete a task. Nowadays, most models are “instruction-tuned”, meaning they’ve been explicitly trained to follow a user’s instructions rather than merely complete sequences of text sampled from the Internet.5

The reason, then, that ICL was so exciting is that it provided a glimpse of what “generality” might look like. Instead of building bespoke models for each purpose you could imagine, maybe people could rely on a single “foundation model” that could perform a range of tasks from a simple set of instructions or examples.

But how and why did ICL work? These are the questions occupying that 2022 Anthropic paper; their provisional answer is that something like induction heads enable this ability to generalize from examples. The first piece of evidence is what the researchers call “macroscopic co-occurrence”: that is, language models undergo a relatively marked phase change in their ICL performance that coincides with the onset of induction heads.

Illustration of “phase changes” in ICL performance over the course of model training (Olsson et al., 2022). ICL performance only showed marked improvements in models with at least two layers, consistent with the idea that induction heads form part of a two-layer circuit. Moreover, these developmental trajectories were closely aligned with the onset of induction heads.

This evidence is not, of course, causal: the fact that ICL and induction heads both show changes at similar points in training does not entail that induction heads are functionally involved in ICL. It’s possible, for example, that there’s some common cause of both factors, or even that the factors are simply causally unrelated.6 But it is suggestive, just as the alignment of behavioral and biological milestones in humans or other biological organisms is suggestive. Importantly, the authors also demonstrated this alignment across a range of model sizes, illustrating that convergence between the onset of induction heads and ICL is not limited to very small transformer models.

The authors then perform a series of additional analyses to establish the functional role of induction heads in ICL. (Not all of these analyses are performed on both small and large models, which is why the authors are careful to hedge some of their claims about the generality of the results.)

For example, in small models, the authors show that ablating (or “knocking out”) induction heads is associated with a decrease in ICL performance, whereas knocking out other (e.g., random) heads is not. This is strong evidence that at least in these models, induction heads are functionally involved in ICL: if a component is causally involved in some downstream process, then intervening on that component (e.g., disabling it) should result in measurable differences in the downstream process.
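The logic of an ablation test can be sketched with a toy example. This is a deliberately simplified assumption: it treats the model's output as a plain sum of per-head vectors, uses zero-ablation (the authors' exact ablation method is not specified here), and all head outputs and the `readout` direction are invented numbers. In a real transformer, later layers depend on earlier heads, so the effect of an ablation is not simply additive.

```python
def residual_sum(head_outputs, ablated=frozenset()):
    """Sum per-head output vectors, skipping any heads listed in `ablated`."""
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, vec in head_outputs.items():
        if head in ablated:
            continue  # zero-ablation: this head writes nothing
        total = [t + v for t, v in zip(total, vec)]
    return total

def task_score(residual, readout):
    """Project the summed output onto a direction standing in for task performance."""
    return sum(r * x for r, x in zip(readout, residual))

def ablation_drop(head_outputs, readout, ablated):
    """How much the task score falls when the given heads are knocked out."""
    full = task_score(residual_sum(head_outputs), readout)
    knocked_out = task_score(residual_sum(head_outputs, ablated), readout)
    return full - knocked_out

# Invented outputs for four hypothetical heads, keyed by (layer, head).
head_outputs = {
    (0, 0): [0.1, 0.0],
    (0, 1): [0.0, 0.2],
    (1, 0): [2.0, 0.1],  # the head that matters for the toy task
    (1, 1): [0.1, 0.3],
}
readout = [1.0, 0.0]

print(ablation_drop(head_outputs, readout, {(1, 0)}))
print(ablation_drop(head_outputs, readout, {(0, 1)}))
```

The comparison between the two drops is the whole argument in miniature: knocking out a causally involved component should hurt more than knocking out an arbitrary one.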

The authors also show that some heads identified as induction heads (i.e., according to their prefix-matching scores) also show some interesting other behaviors, which are plausibly linked to ICL. This includes translation (as I illustrated above) and more general “pattern-matching”, such as recognizing arbitrary templates from the context:

(month) (animal): 0
(month) (fruit): 1
(color) (animal): 2
(color) (fruit): 3

Importantly, these heads weren’t initially identified on the basis of their behavior during these other, more general tasks. As the authors stress, the heads were defined in terms of their behavior on the literal prefix-matching task—and then it simply turned out that some of them might also be involved in these other, more conceptually interesting tasks. The authors suggest that these tasks are all “spiritually similar” to the task of exact string copying:

But this still leaves the question: why do the same heads that inductively copy random text also exhibit these other behaviors? One hint is that these behaviors can be seen as “spiritually similar” to copying. Recall that where an induction head is defined as implementing a rule like [A][B] … [A] → [B], our empirically observed heads also do something like [A*][B*] … [A] → [B] where A* and B* are similar to A and B in some higher-level representation. There are several ways these similar behaviors could be connected. For example, note that the first behavior is a special case of the second, so perhaps induction heads are implementing a more general algorithm that reverts to the special case of copying when given a repeated sequence.
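One way to see why exact copying is a special case of the fuzzy rule is to swap the equality check for a similarity search. The sketch below is only an analogy under stated assumptions: the embeddings are hand-picked toy vectors, `fuzzy_induction_predict` is a hypothetical name, and it returns the token that followed the nearest match (B*) without modeling any further mapping from B* to B.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def fuzzy_induction_predict(context, embed):
    """Find the earlier token most similar to the current one and
    return the token that followed it."""
    current = context[-1]
    best_i, best_sim = None, -2.0
    for i in range(len(context) - 1):
        sim = cosine(embed[context[i]], embed[current])
        if sim > best_sim:  # ties keep the earliest match
            best_i, best_sim = i, sim
    return None if best_i is None else context[best_i + 1]

# Toy embeddings: months cluster together, as do the numeric labels.
embed = {
    "january": [1.0, 0.1, 0.0],
    "march": [0.9, 0.2, 0.0],
    "red": [0.0, 1.0, 0.1],
    "0": [0.0, 0.0, 1.0],
    "2": [0.0, 0.1, 1.0],
}

# "march" never appeared before, but it is closest to "january".
print(fuzzy_induction_predict(["january", "0", "red", "2", "march"], embed))  # prints 0
```

When the current token has appeared verbatim earlier in the context, its similarity to that occurrence is maximal, so the same procedure falls back to literal copying.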

The authors also consider several other pieces of evidence, such as how these operations are plausibly implemented in the actual attention heads’ weight matrices. Those details aren’t especially important for my point here, but the brief summary is that the specific functions of induction heads—again, at least in small, attention-only models—can be directly related to their actual parameters.

Combined, then, the authors have shown that induction heads and ICL pop up at similar points in training, that induction heads are causally involved in ICL in smaller models, and that there are plausible accounts of how induction heads might be amenable to the kind of compositional abstractions necessary for ICL. None of this entails that induction heads are the mechanistic basis for ICL in large models—which are, to be clear, the class of powerful models most people are interested in. This is relevant because it calls into question the third (and most important) explanatory virtue of induction heads: their purported ability to provide a bridge between low-level mechanisms and powerful behaviors of interest.

As the authors point out, it’s a question of extrapolation: is it reasonable and appropriate to generalize findings from small models to large models?

Across the boundaries

This question has been top of mind for me in the last year or so—not just in the context of induction heads, but in the study of LLMs more broadly and the use of LLMs as “model organisms” for the study of human cognition. When are researchers justified in extrapolating results from one sample to another, especially given the inevitability of heterogeneity within and across populations?

This is not, of course, a question restricted to the study of LLMs. Extrapolation is central to many fields of science, including biomedical research (will a drug that works in rats work in humans?) and economics (will a welfare program that worked in one state work in another?). These are hard problems, and different fields have constructed different solutions, some of which might be more successful than others. In Across the Boundaries, philosopher of science Daniel Steel considers some of the challenges inherent to extrapolation and the weaknesses of many proposed philosophical solutions. His own proposal involves something called comparative process tracing, in which components of the causal chain identified in a model (e.g., a rat) are compared to select, known components of a putatively equivalent causal chain identified in the target of inquiry (e.g., humans). It’s by no means a panacea, but the point is that it offers some kind of grounded basis on which to evaluate the degree to which a model and target align.

I mention all this because it’s central to the questions raised at the end of the Anthropic paper. Small, attention-only language models are clearly different in many ways from larger models, both in terms of their basic architecture (e.g., attention-only models lack a feedforward layer) and, presumably, in terms of their emergent behaviors. The key question, however, is whether they are different in ways that matter to the specific causal mechanisms identified in small models, i.e., that induction heads are causally implicated in in-context learning.

What might such differences be? Short of knowing more about large models, it’s really hard to answer this kind of question. You can, however, speculate, as the authors do at the end of their paper (bolding mine):

On the flip side, there are many cases where large models behave very differently than small models (see discussion of phase changes with respect to model size in Related Work). Extrapolating from small models to models many orders of magnitude larger is something one should do with caution.
The most compelling alternative possibility we see is that other composition mechanisms may also form during the phase change. Larger models have more heads, which gives them more capacity for other interesting Q-composition and K-composition mechanisms that small models can’t afford to express. If all “composition heads” form simultaneously during the phase change, then it’s possible that above some size, non-induction composition heads could together account for more of the phase change and in-context learning improvement than induction heads do.

This possibility—that different mechanisms might underlie the same behavior, and that this might pose challenges for extrapolating mechanistic results from small models to large ones—is what I discuss in part II of the essay.

Or, more relevantly for an AI safety context: if an LLM reliably says untrue things when prompted in certain ways—a behavior some might call “deception”—researchers might ask whether they can predict if an LLM will produce a deceptive statement by analyzing the representations from each layer of the model. They might even ask if they can steer those representations to make the LLM more or less truthful.

This problem is not unique, of course, to mechanistic interpretability: it’s a problem faced by any reductionist account of high-level behavior.

Technically these are from seed 1 of Pythia-14m. EleutherAI released multiple seeds per model.

The figure looks qualitatively the same whether you use the raw scores or the standardized scores.

On some level, this is, of course, still a question of predicting upcoming tokens. But the training paradigm is specific and selective enough that I think it’s reasonable to consider it a different kind of thing the models are being trained to do—at minimum, they’re being trained to predict tokens based on a very specific kind of format, and this training helps them “infer” what kind of response is expected from a particular kind of question or instruction.

The authors acknowledge both possibilities, to their credit.

Why benchmarks don't resolve disagreements

Sean Trott — Fri, 03 Apr 2026 16:53:28 GMT

There are a number of epistemological challenges involved in the study of Large Language Models (LLMs). Foremost among them is evaluation: determining what LLMs can and can’t do is hard, and discussion of their capabilities is thus rife with disagreement.

The claim that this is hard might strike some as strange. It might also seem strange that this topic can be so contentious. After all, there are numerous benchmarks—collections of tasks designed to test capabilities like “reasoning” or “general intelligence”—and more are introduced via arXiv practically every day. If we take benchmarks at face value, they should act as a reference point (literally a “benchmark”!) to ground conversation and, presumably, resolve disagreements.

But we don’t take benchmarks at face value. We bring to bear on our interpretation a host of background assumptions, some of them implicit. By and large, I think this is a good thing: as I’ve argued before, I think it’s reasonable to treat benchmarks with caution and think carefully about which inferences are licensed from a particular empirical result.

More problematic, however, is the gap between (some of) our ideal conception of benchmarks, our actual conception, and the discursive function they serve. Put another way: we might believe we take benchmarks at face value, but we don’t, and so in many cases, we may not even accurately predict how we will respond to the results of an LLM on a given benchmark.1

I want to be clear that I’m using the term “we” quite intentionally: I am speaking, in fact, primarily of myself (though I suspect my experience is not singular in this matter). Thus, you can interpret what follows as my reflections on why I struggle to interpret benchmark results and navigate the often contentious debates around their use.

A question of generalization

The root of the problem, in my view, is generalization. Given a model’s performance on a task—a set of items designed to operationalize some construct like “reading comprehension” or “mathematical reasoning”—what can we say about how that model might perform on some other set of items? The world, after all, is vast, and the set of possible contexts is effectively infinite. We cannot hope to measure them all, so what principles do we have to bound our inferences about whether and when a system’s behavior will generalize?

Defining even the notion of a “capability” is, as I say, hard. But if it’s to mean anything at all, I think its definition ought to incorporate something like a behavioral prediction: when we say a model (or a person, for that matter) has a “capability”, I think we’re making some kind of commitment to a prediction about the robustness of their behavior in a range of circumstances that likely involve that capability.

We can make this more concrete. Suppose we’re interested in assessing an LLM’s ability to answer questions about the mental states of characters in a story; this is sometimes called mental state reasoning or mentalizing. We administer a range of tasks designed to test this ability to the LLM, and it performs well on all of them. Does that mean it “has” the ability? At the very least, can we assert that the LLM is mentalizing on these tasks?

There are, as I’ve written before, a few perspectives one can take on this question: one can accept the test at face value and draw whichever inferences we’d draw about humans (the “duck test” position); one can reject the test based on some a priori principle about what LLMs can and cannot do (the “axiomatic rejection” position); or one can assert that the test means different things for humans and LLMs (the “differential construct validity” position).

What counts as generalization?

My suspicion is that much of the skepticism underlying this differential construct validity position is born of skepticism about whether a system’s behavior constitutes (and will reflect, in the future) generalization of the kind we think the test is measuring. Specifically, we lack a set of principles (formal or informal) for extrapolating from a system’s locally observable behavior on some task to the possible universe of contexts and behaviors in which the corresponding construct might be relevant.

Again, to make this concrete: an LLM trained on text from the Internet has almost certainly encountered various versions of the false belief task (FBT), a task designed to measure mental state reasoning or “Theory of Mind” in humans. In the most extreme case, if we test an LLM on the exact FBT stimuli it’s encountered during training, we should expect it to do well—not because it’s mentalizing, but because it’s memorized (effectively) its training data. It’s akin to allowing a student to read the answer key before taking a test. We call this verbatim data contamination, and it’s a major problem for evaluating LLMs, particularly closed-source models for which we don’t know all the details of their training data and procedure.

But what if we modify the stimuli only slightly (e.g., changing the names and location names), while leaving the basic structure intact? This is no longer a case of verbatim data contamination, but the structure of the stimuli is virtually identical, and it’s plausible that a sufficiently advanced LLM could in principle answer the questions using a kind of “learned template” approach, i.e., without needing to explicitly reason about mental states at all. How different does our test environment need to be such that successful behavior “counts” as generalization?

“Verbatim data contamination” is when one tests a model on exactly the same items it’s encountered during training. How different must the train and test environments be such that we intuit we’re measuring something like “generalization”?
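To make the contrast between verbatim contamination and lightly modified stimuli a bit more tangible, here is a minimal sketch in Python. It assumes a crude, purely surface-level measure (the fraction of a test item's word trigrams that also appear in the training text); the function names and the example sentences are hypothetical, and this is not how any particular lab audits its training data.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, training_text, n=3):
    """Fraction of the test item's n-grams that also occur in the training text."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(training_text, n)) / len(test_grams)


# Hypothetical "training" sentence and three hypothetical test items.
training_text = "Sally puts her marble in the basket and leaves the room"
test_items = {
    "verbatim": "Sally puts her marble in the basket and leaves the room",
    "renamed": "Maria puts her coin in the drawer and leaves the room",
    "novel": "The committee postponed its vote until the budget figures arrived",
}

scores = {name: overlap_fraction(item, training_text) for name, item in test_items.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Under this measure the renamed item shares only a small fraction of its trigrams with the training sentence, even though its structure is identical. That is one way of seeing the problem: surface overlap can flag verbatim contamination, but it says little about whether a "learned template" would suffice.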

I think many people (including myself) have the intuition that it’d be more impressive if an LLM trained only on transcripts of child-directed speech somehow passed the false belief task. The reason for this intuition is that a corpus of child-directed speech, presumably, does not contain explicit examples of the structure of the false belief task, nor does it (probably) involve explicit discussion of topics like Theory of Mind and the tasks used to measure it. I’m not sure if everyone would agree on the meaning of such a finding, but I suspect it might change more minds than, say, finding that GPT-5 passes the false belief task.

The problem, as I see it, is in articulating this intuition more rigorously. In math and machine learning, the concept of “out-of-distribution” (or OOD) examples is well-known and can even be formally defined in some cases. The topic of generalization has also been well-studied in domains like grammar, where researchers can ask how much evidence is required (direct or indirect) for a language model to learn a grammatical construction; that’s in part because grammar can be defined using relatively formal structures. But it’s less clear to me how one would formalize the notion of “in-distribution” and “out-of-distribution” for something like a false belief task. At best, we might be able to construct an ordered ranking of similarity between types of training data and types of assessments, though this doesn’t convey quite the same precision.

I’m far from the first person to broach the topic of generalization; it’s central to most debates about measuring LLM capabilities, particularly when it comes to constructs like “general intelligence”. Indeed, François Chollet’s now-famous 2019 paper on the topic makes a very compelling case that a measure of whether a machine is “intelligent” needs to somehow assess that system’s ability to build robust, generalizable abstractions that can be deployed flexibly across contexts. The challenge, of course, is still how you formalize a measure like that.2

Alternatively, one could decide that formalizing the notion of generalization is intractable and try a different approach I’ve taken to calling triangulation. I plan to write more about this in the future, but briefly: here, the construct or capability of interest is approached from multiple “angles”, including but not limited to: a priori philosophical arguments; various tasks with broad coverage of the phenomenon in question; fine-grained analysis of how the behavior develops throughout model training; and an analysis of the internal mechanisms giving rise to the behavior. If different strands of research point in similar directions, we might conclude that we’ve constructed a web of mutually supporting facts and beliefs. Less satisfying than a proof, perhaps; but maybe it’s the best we can do.

My preferred approach is broadly aligned with how I think about categories generally (as prototypes with fuzzy boundaries) and with some approaches taken in fields like comparative psychology. In a 2017 paper, the philosopher Cameron Buckner describes this strategy as trying to identify “property clusters”, rather than trying to identify necessary and sufficient conditions for the ascription of particular capacities:

Though it is good to insist that our cognitive and associative hypotheses generate clear predictions, we must give up on the idea of critical tests that can cleanly confirm or falsify such hypotheses in isolation. This simplistic philosophy of science should have died under the lash of the Quine-Duhem thesis, but it has persisted in corners of comparative psychology to this day. Some of the savviest comparative psychologists are now beginning to look instead for correlations amongst clusters of independent behavioral properties (Cheke and Clayton 2015), which provides a better methodology for assessing the kinds of psychological categories I have been discussing here.

When the test is the thing itself

It’s worth noting that many disagreements about benchmarking and evaluation involve cases where the benchmark stands in for some other construct of interest. The test, in these situations, acts as a proxy for something we’re actually interested in.

A different approach—which I’m quite sympathetic to—would be to decide which applications we want to use a system for and directly measure how the system performs on those applications. That is, we can bypass the representation of the construct and measure the thing itself (assuming this is possible).

This is, supposedly, roughly what Anthropic is doing with Claude Code: by directly using the system to generate code, employees can (hopefully) determine whether that code is helpful and reliable.

Of course, this strategy still presents some thorny philosophical challenges. For instance, it’s possible that a system is generating bad code but employees simply haven’t noticed it yet, and perhaps won’t notice it until some serious damage has been done. (This is, presumably, the situation that proxy benchmarks are designed to prevent!) Alternatively, maybe the system is generating good code now, but that codebase will, over time, drift into something “out-of-distribution” and the quality will subsequently and catastrophically decline. Still, even given these caveats, the measure of success is more clearly tied to the target of inquiry.

Perhaps an even clearer example is how we assess the safety of self-driving cars. Because self-driving cars are actually deployed in select cities around the country, we can analyze their performance on the job, i.e., calculating the number of accidents per mile driven and comparing this to measures of driving safety in humans. Dedicated journalists like Tim Lee can even read through the accident reports to determine how often self-driving cars are at fault in the accidents that do happen.
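The arithmetic behind that kind of "on the job" comparison is simple, and a short Python sketch may help. The crash counts and mileage below are entirely hypothetical placeholders chosen for easy arithmetic, not real figures for any company or for human drivers, and the function name is my own.

```python
def crashes_per_million_miles(crashes, miles):
    """Normalize a raw crash count by exposure (miles driven)."""
    if miles <= 0:
        raise ValueError("miles must be positive")
    return crashes * 1_000_000 / miles


# Hypothetical counts, for illustration only.
av_rate = crashes_per_million_miles(crashes=30, miles=20_000_000)
human_rate = crashes_per_million_miles(crashes=450, miles=100_000_000)

# A ratio below 1 would mean fewer crashes per mile for the automated fleet.
rate_ratio = av_rate / human_rate

print(f"AV: {av_rate:.2f} per million miles")
print(f"Human: {human_rate:.2f} per million miles")
print(f"Ratio: {rate_ratio:.2f}")
```

A raw ratio like this says nothing about whether the two sets of miles were driven under comparable conditions (city, weather, road type), which is where the questions about generalization come back in.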

Here, again, we don’t have the same level of concern about whether the assessment “represents” what we’re interested in (safety) because there’s a tighter conceptual link between the two. That’s not to say there are no issues with generalization: it’s quite plausible, for example, that a car’s reliable performance in San Francisco does not guarantee its reliability in Mumbai or in a snowy mountain pass. But unlike the false belief task and mentalizing, there is at least some context under which we can say something concrete and direct about the construct of interest.

This insight may seem obvious: measuring a system’s performance “on the job” raises fewer questions about construct validity and generalization than measuring its performance on some proxy test.

But the solution, unfortunately, is not necessarily obvious. After all, in most situations, we’d like to validate a system’s performance on some application before deploying it directly. An analogy might be drawn to hiring here: the best way to evaluate whether someone will be good at their job is to hire them and observe their performance—but most managers want at least something to go off when they’re making that initial hiring decision. (Job offers, for better or for worse, are not made by lottery.)

Epistemic iteration

This challenge draws one back, inexorably, into the logic of benchmarks and proxies—even when the target of inquiry is a system’s performance on some applied task.

Proxy tasks are, of course, unavoidable when one’s target of inquiry is explicitly theoretical in nature, e.g., constructs from Cognitive Science like mentalizing or grammatical knowledge. But even when one’s ultimate interest is practical application, developing theoretical abstractions may prove necessary. That is, if we assume that some proxy is necessary some of the time, then it will likely be helpful to design these proxies with the aid of some theory connecting various abstractions and aspects of benchmark performance to the desired application.

Importantly, I don’t think this process can start from a purely theoretical perspective, nor can it proceed in wholly empirical fashion. A useful framework here might be Hasok Chang’s notion of epistemic iteration. Early work in thermometry faced a problem of circularity: it was difficult to “ground” any particular measuring device in some gold standard in the absence of another standardized system for measuring temperature; this is especially difficult when one lacks a coherent theory of what “temperature” even is. Chang’s suggestion—informed by the history of thermometry over centuries—is that you don’t need to justify everything up front. Rather, justification occurs incrementally through a mutually corrective process of various measurements and theories, which, much like gradient descent, gradually establish a more and more coherent picture of the target of inquiry. Perhaps our evaluation of LLM capabilities must follow in a similar direction.

For instance, someone might assert that a benchmark is a good measure of “general intelligence”, but when an LLM performs really well on the benchmark, they realize that it doesn’t affect their mental model of the LLM’s abilities at all. This is sometimes (derisively) called “moving the goalposts”, and in principle it can go in both directions: a different person might think a test is a good measure, but when an LLM performs poorly, they dismiss the test as somehow biased. Goalposts can be moved forward or backward.

As Melanie Mitchell mentioned in this recent interview with Ben Riley, some LLMs now pass ARC, the measure proposed by Chollet:

As Chollet noted in his paper, ARC is mean to test human-like reasoning using human-like priors, where the priors are things like “core knowledge” concepts.¹ But it turns out AI models can solve these tasks by a kind of brute force.
For that reason, ARC in a way has lost its usefulness, unless we go back and embrace what it was actually meant to test. We should disallow what’s been allowed, which is training AI systems on these test tasks, or doing a huge amount of test-time compute that is essentially training on lots of these tasks. That’s been a little disappointing to me.

In a way, this reflects the point I’m trying to make in the post. First, a test is developed according to certain theoretical criteria relating to some construct of interest; at the start, LLMs don’t “pass” this test; but later, LLMs provided with sufficient “power” (be it data, context windows, “reasoning” tokens, etc.) do pass the test; and this, in turn, triggers debate about whether the test still indexes the capability it was originally designed to index.

Language models and the intentional stance

Sean Trott — Wed, 18 Mar 2026 13:44:15 GMT

Humans often grapple with new concepts by way of metaphor or analogy. These days, many people (including myself) are struggling with the question of how to properly conceive of systems like large language models (LLMs). As I’ve written before (and as others have pointed out as well), there is no shortage of metaphors for these systems: they’ve been likened to “stochastic parrots”, “blurry JPEGs of the web”, “role-players”, an “alien species”, or even “troublesome genies”1.

LLMs, of course, are not literally any of these things. The point of a metaphor is to grant some kind of conceptual purchase on a slippery topic—typically by foregrounding certain features of the thing you’re talking about and backgrounding others. In a sense, each metaphor is a kind of rudimentary model of the phenomenon. And as the oft-quoted saying goes: “All models are wrong, but some are useful”. The question, then, is not necessarily whether any given mental model of LLMs is “correct”, but what that model does for us conceptually and communicatively: which thoughts does it make more or less expressible—or even more or less thinkable in the first place?

My focus in this post is the mental model of an LLM as the kind of thing that has beliefs, desires, or even feelings. We might speak of an LLM “believing X”, “wanting Y”, or “feeling Z”. This broad mental model includes the common tendency to attribute human qualities to non-human entities (also called anthropomorphism), which many readers are likely already familiar with.2 This inclination feels particularly natural when it comes to LLMs, given our association between human cognition and the use of language, and also given the fact that user-facing systems (like ChatGPT) are often engineered3 to be conversational or even “supportive”.4

When, if ever, is this mental model useful? And when might it be misleading?

I think these are crucial questions to contend with. First, they matter for discussions of AI safety. I, like many, think it is important to build policies and systems that avoid harms relating to AI. But one’s preferred solution-space (and also conception of the harms themselves) plausibly depends on whether AI systems are construed in terms of having desires or goals. Second, they matter for the scientific study of LLMs. As I’ve argued before, the way we go about studying LLMs depends on what “kind of thing” we think an LLM is—or at least, what kind of properties we find instrumentally useful to ascribe to it in the service of understanding and predicting its behavior. And third, they relate to other ongoing debates and discussions, such as increased emotional dependence on social AI systems 5 and even the notion of “AI welfare”.

This post has the following structure. In Part I, I briefly introduce and defend the notion of the “intentional stance” as an epistemological (though not necessarily metaphysical) approach, making reference to both Daniel Dennett’s “Real Patterns” and a recent 2025 paper by Simon Goldstein and Harvey Lederman. In Part II, I discuss potential epistemological dangers associated with anthropomorphism, particularly when it comes to accurately diagnosing and addressing AI-related risks. And in Part III, I make the case for prioritizing causal-mechanistic accounts of LLM behavior, while nonetheless allowing for the provisional adoption of the intentional stance should those accounts be deemed insufficiently tractable.

Part I: In defense of the intentional stance

One can explain the same thing in multiple ways. Often these explanations vary in terms of their level of abstraction. For instance, one could describe the behavior of a gas in terms of each individual molecule or in terms of coarser, macroscopic properties (like temperature or pressure). Similarly, one can describe the motion of an object in quantum terms or using the language of classical mechanics. These levels of abstraction have trade-offs: more granular, microscopic explanations might be more accurate—and perhaps more “real”, depending on one’s metaphysics—but coarser-grained explanations might be more comprehensible, easier to work with, and nearly as accurate.6

This is the heart of the argument for the intentional stance. The philosopher Daniel Dennett suggested that we can explain phenomena from at least three distinct levels or “stances”, which vary in granularity, accuracy, and abstraction. The physical stance focuses on concepts from physics and chemistry; at this level, we are all simply objects in motion. The design (or “teleological”) stance is most akin to engineering or biological explanations, which emphasize the function served by a particular artifact, organ, or action. And the intentional stance (or “folk psychology”), in turn, is the level of “agents”, which have things like “beliefs”, “desires”, or “goals”.

Dennett’s argument is that in many cases—like explaining or predicting the behavior of other humans—something like the intentional stance is actually one of the most useful epistemic stances one can adopt. For instance, if you’re talking with a friend about a personal conflict they’re having, it’s probably most natural and useful to think about their beliefs (what do they think about the situation?), their feelings (are they upset, and why?), and their desires (what do they want to happen?).7

In “Real Patterns”, Dennett suggests that something like the intentional stance is central to the general human project of interaction and coordination; and further, that its success depends on the underlying behavior conforming to a sufficiently regular pattern:

Without its predictive power, we could have no interpersonal projects or relations at all; human activity would be just so much Brownian motion; we would be baffling ciphers to each other and to ourselves—we could not even conceptualize our own flailings. (pg. 29)

At the same time, as Dennett himself argued, the intentional stance is not always appropriate. We could explain the “behavior” of a rock falling from a cliff as having surrendered to the call of the void, but there’s an easier and more accurate alternative available: a careless hiker scuffed the rock with his boot, exerting sufficient force for it to fall. Similarly, we could describe a thermostat as having “beliefs” about the temperature of the room and “desires” to make it colder or hotter, but again, there’s a better and more accurate mechanistic alternative on offer: a thermostat works by measuring temperature through the expansion or contraction of a bimetallic strip, which in turn triggers a cooling or heating system.

When, then, is it appropriate to attribute beliefs and desires on the basis of observed behavior?

As the examples above suggest, much depends on the quality of the alternative theories available. A recent paper by philosophers Simon Goldstein and Harvey Lederman suggests that we should compare the quality of competing theories in terms of three objective criteria. First, their predictive accuracy: to what extent are their predictions true? Second, their predictive power: how many cases are there in which the theory makes concrete predictions, and how precise are those predictions? And third, their tractability: how easy is it to generate predictions from the theory?8

Explaining someone’s behavior in terms of each individual atom in their body—and, presumably, the atoms all around them—might be very accurate. But a folk psychological account of their beliefs and desires is much more tractable, and might be nearly as accurate. In this case, it’s defensible, and perhaps even advisable, to adopt the intentional stance; even if one is not committed to the reality of those beliefs and desires, it is helpful to act as if they are real.

The question is whether this is also true when explaining the behavior of large language models. Should we speak of LLMs “believing” certain things or “wanting” certain outcomes? For me, one of the most helpful parts of Goldstein and Lederman’s paper is that it proposes a concrete set of criteria for answering this question: it makes sense to attribute beliefs (and desires, etc.) when this is the best explanation on offer.

It might help to start by considering a scenario where attributing beliefs does not make sense. In a previous paper (which I also recommend!), Harvey Lederman and his co-author Kyle Mahowald used the example of ELIZA, the famous chatbot developed by Joseph Weizenbaum. ELIZA consisted of a relatively simple set of rules and templates for responding to messages—it was, essentially, a lookup table—but many human users walked away convinced they’d been speaking with a real human therapist. Some of those humans might find it natural to describe ELIZA’s behavior in terms of beliefs and desires. Does that mean we should adopt the intentional stance? Lederman and Mahowald argue that the answer is no, precisely because there’s a better theory on offer:9

Similarly, if ELIZA passed the Turing Test in an hour-long conversation with a person, the person might claim that the best explanation of their interlocutor’s behavior was that it has beliefs, desires, and intentions. But they would be wrong. An explanation of ELIZA’s behavior in terms of its being a lookup table is more accurate and powerful (predicting various mistakes and failures) than the hypothesis that ELIZA has beliefs, desires, and intentions. Interpretationists will say that, notwithstanding what the person thought, ELIZA’s behavior is not well-explained by the hypothesis that it has attitudes, and, thus, conclude that ELIZA does not have them. (pg. 1096)

In this way, ELIZA resembles the thermostat example from earlier. One could invoke beliefs and desires, but to what end? (And, one wonders, at what cost?) There’s a more accurate account that’s just as easy to understand and doesn’t break down outside a narrow range of circumstances.

What about the case of modern LLMs? Here, the authors focus primarily on state-of-the-art system architectures (or “LLM-equipped software tools”) like ChatGPT or Claude. These systems are not only trained to predict subsequent tokens from large web corpora, but also undergo extensive post-training, including reinforcement learning from human feedback and, in some cases, reinforcement learning with verifiable rewards. They’re also given access to external tools like a “scratchpad”, Google search, or even a Python shell. Finally, systems like ChatGPT or Claude are typically initialized with a lengthy “system prompt” providing detailed instructions about how the system should behave. (In the case of Claude, this prompt was called its “constitution”, and originally instructed Claude to be “helpful, harmless, and honest”.10)

I mention all this to make the point that these LLMs are a very different beast than, say, GPT-2.11 This is reflected in their behavior: under most conditions, systems like ChatGPT or Claude seem to have an orientation towards answering questions and helping their interlocutors. (Indeed, their eagerness to help can sometimes be exhausting!) As others have argued, there’s a case to be made that describing these systems as “fancy auto-complete” can be misleading—despite the fact that on a mechanical level, you could describe these systems as generating text by repeatedly sampling from probability distributions over subsequent tokens.
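For readers who want the mechanical description spelled out, here is a toy Python sketch of generating text by repeatedly sampling from a probability distribution over the next token. The tiny lookup table is a hypothetical stand-in for a trained model: it conditions only on the previous token, whereas a real LLM computes its distribution from the entire preceding context, and the vocabulary and probabilities are invented for illustration.

```python
import random

# Hypothetical next-token distributions, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "</s>": 0.3},
    "dog": {"sleeps": 0.7, "</s>": 0.3},
    "sleeps": {"</s>": 1.0},
}


def generate(rng, max_tokens=10):
    """Sample one token at a time until the end marker or the length limit."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


sample = generate(random.Random(0))
print(" ".join(sample))
```

The loop is the same whether the distribution comes from a lookup table or from a network with billions of weights. The mechanical description is true at both scales, which is why it tells us so little about what a given system will actually do.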

Thus, if we grant—provisionally, for the sake of discussion—that it might be useful to attribute “beliefs” or “goals” to these systems, which beliefs or goals should we attribute? Goldstein and Lederman consider various hypotheses here, from the word-desire hypothesis (LLMs “want” to predict the next word) to the HHH hypothesis (LLMs “want” to be helpful, harmless, and honest). In each case, we can apply the criteria outlined above—accuracy, predictive power, and tractability—to evaluate the strength of a given hypothesis.

I’ll return to these competing accounts in Part III, but first, I want to discuss what I see as the potential epistemic risks of adopting the intentional stance.

Part II: Misleading maps

Of course, just because you can attribute beliefs or desires to a system doesn’t mean you should. As Dennett argued (and as Goldstein and Lederman point out), there are many cases where the intentional stance is strictly a worse theory of a system’s behavior than, say, the physical stance. This is clearest in the thermostat example. As Goldstein and Lederman write:

Many thermostats, for example, measure temperature using the curve of a “bimetallic strip”, two pieces of connected metal that expand at different rates when heated. There are simple laws connecting the initial temperature of the room to the bend in the metal, and connecting the bend to an air-conditioner that changes the temperature in the room. These simple laws offer an alternative explanation of what the thermostat does that outperforms psychological explanations of its behavior along our three dimensions…For example, imagine that the connection between the strip angle and the air conditioner is noisy, so that the temperature does not always adjust in the direction ’desired’ by the thermostat. In that case, the psychological explanation of the thermostat will make less accurate predictions than the physical explanation about how the system behaves. (pg. 6)

You could even make a similar argument about the design stance. A mercury thermometer, for instance, is designed to measure temperature; we “read off” information about the temperature as a function of the mercury’s height in the bulb. Yet there are circumstances under which a mercury thermometer fails to accurately assess the temperature—specifically, for temperatures below the freezing point of mercury, or for temperatures above the boiling point. That obviously doesn’t mean mercury thermometers are invalid; they’re valid, and very useful, under many conditions in which we need them! But thinking only about what mercury thermometers are designed to do (measure temperature) may mislead us about their behavior in certain situations, whereas thinking about how mercury thermometers work on a physical level (the mercury expands or contracts depending on the temperature) likely helps us make more accurate predictions.

We can draw a general lesson here. More abstract levels of explanation are extremely useful, and sufficiently accurate, when the phenomena they’re characterizing bear some consistent, structural relationship to the underlying physical laws producing those phenomena. However, they break down when that relationship no longer holds. We might think of these as “in-distribution” or “out-of-distribution” with respect to our mental model.

These out-of-distribution cases are, in my view, what many (though not all!) cases of “misalignment” actually boil down to. By “misalignment” here, I mean there’s a mismatch between our mental model of a system and the “true” causal processes responsible for its behavior. This mismatch ultimately leads to behavior we don’t expect because our expectations were miscalibrated. For example, we might construct an accurate explanatory model (or “map”) of the system in some evaluation scenario, which suddenly breaks down in other scenarios because we failed to account for some crucial variable. We might also call this a limitation in our mental model’s regime of applicability.

Example of what I mean by aligned vs. misaligned mental models (here, “stances”, to avoid confusion with language models). When the bimetallic strip is functional, the intentional stance and physical stance make the same predictions, and both are correct. But when the strip is broken, the stances make different predictions (they are “misaligned”), and only the physical stance is correct. (You might object: if the bimetallic strip is broken, might we say the thermostat doesn’t want the room to be 72 degrees? Notice, however, that for this move to work, the intentional stance must somehow be “aware” of the physical properties anyway, which raises the question of why one need complete the further step of “translating” those predictions into the intentional stance.)
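To make the thermostat case concrete, here is a minimal Python sketch of the two stances as prediction functions. Everything in it is an illustrative assumption (the `Thermostat` class, the `strip_works` flag, the 72-degree setpoint as a default); it is only meant to show where the two "maps" agree and where they come apart.

```python
from dataclasses import dataclass


@dataclass
class Thermostat:
    """Toy thermostat; all names and numbers are illustrative assumptions."""
    setpoint: float = 72.0     # temperature the thermostat "wants" (degrees F)
    strip_works: bool = True   # whether the bimetallic strip still bends


def intentional_prediction(t: Thermostat, room_temp: float) -> str:
    """Intentional stance: it 'wants' the setpoint, so it cools when too warm."""
    return "cool" if room_temp > t.setpoint else "idle"


def physical_prediction(t: Thermostat, room_temp: float) -> str:
    """Physical stance: cooling starts only if the strip actually bends."""
    strip_bends = t.strip_works and room_temp > t.setpoint
    return "cool" if strip_bends else "idle"


def stances_agree(t: Thermostat, room_temp: float) -> bool:
    """True when both stances predict the same behavior."""
    return intentional_prediction(t, room_temp) == physical_prediction(t, room_temp)


print(stances_agree(Thermostat(), 80.0))                   # True
print(stances_agree(Thermostat(strip_works=False), 80.0))  # False
```

With a working strip the two functions are interchangeable, which is exactly why the intentional shorthand feels harmless. The disagreement only shows up once a variable that the intentional stance never represented (the state of the strip) starts to matter.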

How does all this relate to the intentional stance?

It boils down to whether the intentional stance provides a sufficiently accurate map to allow us to make predictions about or even interventions in an LLM’s behavior. There is certainly a simplifying appeal to attributing beliefs and desires to systems—it’s annoying, in casual conversation, to feel one must air-quote every use of the word “understand” or “think” in relation to an LLM. And it might even be the best account! But I worry, in a way I admittedly find difficult to articulate, that it might also lead us astray when we think about building safe AI systems.

It might help to focus on a concrete example. Recently, computer scientist (and Nobel Prize winner) Geoff Hinton suggested that avoiding catastrophic risk from superintelligent AI systems might depend on imbuing them with a “maternal instinct”. Hinton’s logic is that a superintelligent AI (defined as something smarter than the smartest humans) will, presumably, be able to outsmart many of the constraints or safeguards we place on it. Why would such a superintelligent, capable system not manipulate or take advantage of humans to achieve its own goals? Hinton points out that the clearest counterexample is a mother taking care of their child. A child is utterly dependent, but because a mother—or indeed, any good caregiver—has an orientation towards nurturing the child, the dependence actually functions as a kind of constraint. What we need, then, is a kind of superintelligent AI “parent” on which we depend.

Let’s set aside, for now, whether this is an appealing scenario we should aim for as a society, and focus instead on the metaphors at play. Hinton’s quoted as follows:

“They have been focusing on making these things more intelligent. But intelligence is just one part of a being. We need to make them have empathy towards us. And we don’t know how to do that yet. But evolution managed and we should be able to do it too.”

In many ways, I agree with what Hinton is saying here, though I might put it another way: instead of just focusing on improving the “raw capabilities” of systems, we need to also do more work to make sure they are safe to deploy and that the contexts and consequences of their deployment are broadly aligned with human values. I think this for a variety of reasons, including the fact that there’s so much epistemological murkiness about how to measure capabilities in the first place. (There are, of course, further questions about whose values we mean by this, but that’s a separate topic.)

Hinton’s metaphor explicitly casts this as a contrast between making systems more intelligent and imbuing them with “maternal empathy”. My claim here is not that this metaphor is definitely wrong or misguided; it might even be the most useful metaphor to use! My concern, rather, is that it might be misleading in ways that are hard to understand or predict ahead of time—and my intuition is that we might get more traction on predicting risks if we think concretely about the causal mechanisms responsible for a system’s behavior, in the same way that we’ll be better at predicting a thermostat’s behavior if we understand how it actually works.

I worry, too, that adopting the intentional stance might have unintended epistemic consequences given the way that conceptual paradigms work. That is, it’s not hard for me to picture a scenario in which people—begrudgingly or not—adopt the intentional stance to explain a system’s behavior in a particular scenario, which subsequently encourages them to adopt the intentional stance in future scenarios, even if, in some counterfactual world in which they hadn’t originally adopted the intentional stance, they would’ve sought (and been satisfied with) a more mechanical explanation first instead. We might think of this as a kind of epistemic path-dependence: adopting the intentional stance might make us more likely to do it again, including in situations where it’s inappropriate.

Another, related risk would be something like overgeneralizing which mental states we attribute to a system. For instance, perhaps we identify a scenario in which it is epistemically justifiable (as per Goldstein & Lederman’s criteria) to attribute certain beliefs or desires, but not necessarily a feeling like “fear”. But having already attributed desires, we are inclined to attribute other mental states as well (like “fear”), and this leads us to incorrect conclusions about the best way to train or deploy models.

Part III: Cause and effect

When I think of misalignment risks from current AI systems, I think primarily of ways in which our instructions, intended to produce a particular outcome, lead to some outcome other than what we intended.12 Importantly, this can happen in the absence of beliefs or competing desires: if we’re in search of a metaphor, then, the mythical golem might feel more appropriate than, say, a “cunning demon” intent on misinterpreting our instructions.

To cite a relatively harmless but revealing example: in his latest post, Steve Newman described the case of a person whose AI “agent” mistakenly deleted many of their emails. They’d instructed the system to check in before taking any actions, but because the inbox was so large, that part of the instructions got lost in the prompt compaction process. Notably, this didn’t occur when they were testing the system on a smaller, “toy” inbox. That is: their expectations were miscalibrated. The specific reason for this miscalibration is that their test environment did not account for a specific variable present in the actual application, i.e., the number and length of emails. Because LLMs have limited context windows, an overly long prompt can be consolidated or “compacted” (i.e., trimmed of unnecessary or redundant tokens); in this case, the compaction process appears to have removed a crucial part of the instructions.13

We could describe this error in terms of Claude’s mistaken beliefs, or even in terms of some misaligned goal, but it’s not clear what we gain from this. The alternative account I described above—the one that feels closest to what actually happened—relies on a more mechanical understanding of how Claude works.
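The mechanical account of the email example can be sketched in a few lines of Python. To be clear about the assumptions: I don't know how the actual compaction step works, so `compact` below is a hypothetical strategy (keep the most recent messages that fit a budget), `count_tokens` is a crude word count rather than a real tokenizer, and the instruction text and email contents are invented. The point is only that the same code behaves differently once the inbox outgrows the budget.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact(messages: list[str], budget: int) -> list[str]:
    """Hypothetical compaction: keep the most recent messages that fit the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))


INSTRUCTION = "Check in with me before taking any action."


def build_prompt(n_emails: int) -> list[str]:
    """The instruction comes first, followed by the contents of the inbox."""
    emails = [
        f"Email {i}: please review the attached quarterly report"
        for i in range(n_emails)
    ]
    return [INSTRUCTION] + emails


toy_inbox = compact(build_prompt(3), budget=100)
real_inbox = compact(build_prompt(500), budget=100)
print(INSTRUCTION in toy_inbox)   # True
print(INSTRUCTION in real_inbox)  # False
```

Nothing about the system's "goals" differs between the two calls. The only thing that changes is the number of emails, which is the variable the toy test environment never exercised.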

A reader might make two reasonable objections here. First, my “mechanical” account above still involves abstractions; we’re speaking roughly at the design level, not in terms of how Claude’s internal mechanisms actually give rise to subsequent token predictions. Doesn’t that show that explanatory abstractions are useful? And second, as I pointed out, this is a relatively low-stakes example of current AI systems failing in a particularly transparent way. Might not the “cunning demon” be a more appropriate metaphor for future, superintelligent systems—or even the systems we have?

I’ll start with the first objection. Indeed, abstractions are useful: I’m certainly not denying that, and I think Dennett is exactly right that the intentional stance is, in many cases, the most helpful perspective. But the critical question is always what we gain from the abstraction. Using Goldstein and Lederman’s criteria, I think the typical case for the intentional stance is that you gain tractability, possibly at the expense of some accuracy. But it’s not clear to me that the “design-level” account above (Claude failed because of errors in prompt compaction) is any less tractable than some hypothetical “intentional-level” account (Claude failed because it believed the user didn’t need it to check in before deleting emails); moreover, the design-level account feels (to me at least) like a more accurate description of the sequence of events leading to the failure. (I’d be remiss if I didn’t mention my friend and former lab-mate Sam Taylor’s work on strategic deception here, which I think is a great example of taking a putatively intentional-level construct like “deception” and addressing it from what I see as a causal perspective.)

With that said, I also take seriously the other half of this objection, which is that one could, in principle, take an even lower-level stance. The distributed architecture we call Claude, for example, presumably consists of at least: some system prompt, an extensively pre-trained and post-trained LLM, some set of tools available to the LLM (e.g., a Python shell, a web search function, etc.), and perhaps a front-end interface for filtering or identifying harmful requests. We can assume that inputs to that system (along with the system prompt) are tokenized and presented to the LLM, which undergoes a forward pass to produce a probability distribution over subsequent tokens. Depending on the nature of the computations within that pass, Claude’s next tokens might be presented to the user, or they might be on a “scratchpad”; alternatively, they might even be API calls to an external application. These generated tokens are then incorporated into the context window for subsequent forward passes, and so on—until, at some point, this process predicts that the message should end.
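The loop described above (tokenize, run a forward pass, pick a next token, append it to the context, repeat until the message ends) can be sketched as follows. This is a toy under loud assumptions: `TOY_MODEL` is a hand-written bigram table standing in for a real forward pass, tokenization is whitespace splitting, and decoding is greedy. Real systems differ on every one of these points, and the sketch leaves out the system prompt, tools, and scratchpad entirely.

```python
END = "<end>"

# Hypothetical stand-in for an LLM: maps the last token to next-token probabilities.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, END: 0.1},
    "sat": {"on": 0.8, END: 0.2},
    "on": {"a": 1.0},
    "a": {"mat": 1.0},
    "mat": {END: 1.0},
}


def forward_pass(context: list[str]) -> dict[str, float]:
    """Return a probability distribution over next tokens given the context."""
    last = context[-1] if context else ""
    return TOY_MODEL.get(last, {END: 1.0})


def generate(prompt: str, max_new_tokens: int = 20) -> list[str]:
    """Greedy autoregressive decoding: each new token is fed back into the context."""
    context = prompt.split()  # crude whitespace "tokenization"
    for _ in range(max_new_tokens):
        probs = forward_pass(context)
        next_token = max(probs, key=probs.get)  # greedy choice
        if next_token == END:  # the process predicts that the message should end
            break
        context.append(next_token)
    return context


print(" ".join(generate("the")))  # the cat sat on a mat
```

Even in this toy, the only "state" is the growing context, so anything that changes what ends up in the context changes what gets generated next.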

Could we produce an account of Claude’s failures that relies on these explanatory constructs and perhaps even makes reference to specific, identifiable internal mechanisms (e.g., dedicated “circuits” for copying previous tokens)? I’d like to think so. But I’m also aware of the fact that this might simply be intractable: maybe the best we can achieve is a bunch of locally coherent theories, some of which invoke low-level mechanisms and others which don’t. The question—which I don’t know how to answer—is whether some of those theories should involve beliefs and desires.

That brings me to the second objection. Perhaps certain alignment failures can be described in mechanical terms, but others cannot—and perhaps mechanical accounts will only become less tractable as models become increasingly complex and intertwined with various applications. At this point, maybe speaking about LLMs in terms of beliefs and desires will be all but unavoidable if we’re to gain any traction on predicting their behavior.

And my response is: maybe! As I wrote earlier, my case here is not that the intentional stance is definitely wrong in all circumstances. But my position, which readers may or may not find convincing, is a meta-scientific one: it’s that we should try, wherever possible, to prioritize causal explanations before reaching for words like “belief” or “desire” when explaining what it is that LLMs are doing in a given situation. This position is, in some ways, an optimistic one—that we can understand LLM behavior at the level of causal mechanism. This optimism might be misplaced, in which case the argument for the intentional stance might come from a place of epistemic pessimism; but I don’t think it’s time to throw in the towel yet.

Epilogue

In case it wasn’t clear from the rest of this article, I’m deeply ambivalent on this topic. I have a strong intuition about the epistemic dangers associated with adopting certain approaches, but I’m not sure how convincing my articulation of those dangers really is, and I’m aware that other people are capable of marshaling persuasive arguments in opposition to them. Beyond the arguments I’ve cited in this article, I’d like to point readers to a really interesting post (“the void”) by nostalgebraist, which tries to answer this question:

When you talk to ChatGPT, who or what are you talking to?

There’s a lot written out there about LLMs and consciousness, but I found this article one of the most compelling in terms of presenting an account of what it might be like, if it is like anything, to be an LLM. I recommend it even if I don’t share all the intuitions and remain relatively convinced that a causal-mechanistic perspective is the right way to understand LLMs.

The other point I’d like to make in closing is that all of this, to me, highlights just how ontologically out-of-depth we are when we’re talking about LLMs. We’re flailing about in the dark, trying to understand these strange artifacts we’ve created in the image of our own imprints. We might need fundamentally different explanatory frameworks and constructs than the ones we find ourselves reaching for. I’m still bullish on what I call “LLM-ology” as a discipline, in part because I think Cognitive Science is particularly well-suited to identifying these novel frameworks. Some have already been proposed, like the notion of AI as a “cultural technology”—like bureaucracies or markets. I’m not sure whether these are the right metaphors either, but it’s a good time for pluralism.

Thanks to Harvey Lederman, Ben Bergen, and Pamela Rivière for extended discussion on this post, and thanks to Harvey Lederman for taking the time to meet and discuss his article in more detail.

From Jack Clark’s appearance on the Ezra Klein Show:

The way that I think of these systems now is that they’re like little troublesome genies that I can give instructions to, and they’ll go and do things for me. But I still need to specify the instruction just right or else they might do something a little wrong.
So it’s very different to typing into a thing, and it figures out a good answer, and that’s the end. Now it’s a case of me summoning these little things to go and do stuff for me, and I have to give them the right instructions because they’ll go away for quite some time and do a whole range of actions.

Though it might be subtly distinct, as Goldstein & Lederman (2025) argue. It’s conceivable that a bat has desires, even if those desires are not particularly humanlike—thus, we can talk about desires without specifically attributing human desires.

E.g., through RLHF or extensive system prompts.

For more, I recommend checking out philosopher Henry Shevlin’s work on social AI.

What, in extreme cases, is sometimes called “AI psychosis”.

Cognition is like this as well. Famously, we can analyze cognitive phenomena in terms of their computational properties (roughly: the problem they’re solving), their algorithmic properties (the putative representations and operations with which they solve the problem), or their implementational properties (the underlying “substrate”, e.g., neural dynamics). While cognitive scientists may disagree about which level of analysis they find most useful or most “real”, many would likely agree with the assertion that in many cases, adopting a fully reductive account is not particularly desirable. More concretely: perhaps (!) we could, with enough time and resources, construct an explanatory model of human belief formation at the level of individual neurons and synapses, but such a model would be very unwieldy and not particularly amenable to interpretation or making predictions. (This is to say nothing of the fact that most reductive explanations could in principle be reduced further: why stop at neurons?)

Importantly, this option is available even to committed eliminativists: that is, you can treat these constructs as instrumentally useful (i.e., “convenient fictions”) even if you think they are not, in some deep sense, ontologically real. You could, of course, also be a realist about the existence of these mental states! In either case, the question is whether the intentional stance is a helpful lens for understanding, explaining, and predicting the behavior of your friend.

It’s worth noting that the authors are arguing in favor of a specific approach called interpretationism, and in doing so, they adopt a realist perspective on beliefs; arguably, however, these criteria are equally useful for adjudicating between accounts purely on their instrumental value. I should also point out that even interpretationism, as presented by the authors, does not require positing a phenomenological dimension to beliefs or desires.

Goldstein & Lederman later make a similar case:

The person conversing with ELIZA might say that its behavior was predicted quite well by the hypothesis that it has beliefs and desires. But they would do so in ignorance of key facts about how ELIZA was built. In light of the full story, there is in fact a better explanation of ELIZA’s behavior: it is a lookup table, and the answers to the questions were pre-programmed by the programmers….in the ELIZA example, the lookup table theory of ELIZA’s behavior is at least as tractable but also more accurate and more powerful than the hypothesis that it has beliefs and desires. (pg. 5)

The newest “constitution” is available here.

I should note here that Goldstein & Lederman argue that the details of how ChatGPT is trained are not actually necessary components of an argument about whether ChatGPT has (say) beliefs or desires—nor is evidence from mechanistic interpretability. The core thing is whether beliefs, desires, or goals provide the best explanatory account of ChatGPT’s behavior.

This is setting aside yet another issue, which is explicit and intentional misuse of AI systems to cause harm. I think this is perhaps the bigger issue to worry about, but it’s not the focus of Hinton’s concern.

Of course, you could also imagine this happening even without compaction. An LLM might fail to “attend” to certain parts of the prompt such that they are effectively irrelevant to its behavior.

In what sense—if any—are LLMs "model organisms" for the study of human cognition?

Sean Trott — Tue, 10 Mar 2026 00:54:42 GMT

My main interest in large language models (LLMs) has always been scientific. I’m interested in whether and how LLMs can inform specific, longstanding debates in Cognitive Science, such as the causal mechanisms underlying the acquisition of grammatical knowledge or Theory of Mind. (I’m also interested in applying Cognitive Science theories and methods to better understand LLMs, but that’s not my focus in this post.)

This sometimes places me in an odd situation discursively relative to the broader context in which LLMs are typically debated. I have opinions, certainly, on these topics. But the bulk of my scientific work is concerned with LLMs of a different nature than the distributed architectures (what I call “LLM-equipped software systems”) that occupy mainstream debate.1 That’s because, in my view, using LLMs to study human cognition demands different epistemic considerations than using, say, Claude Code to build a website or help with data analysis. LLMs, in this context, are serving as models of a particular kind, i.e., to operationalize and test specific theories. Some—including myself—have even analogized this use of LLMs to the use of “model organisms” in fields like Biology, where researchers seek to derive insights about biological mechanisms generally from the study of targeted (and often carefully standardized) organisms other than humans.

But what exactly do we mean when we suggest using LLMs as “model organisms”? This question can be decomposed into two more specific concerns. First, what kinds of research questions are LLMs appropriate for, and when are they inappropriate? And second, what kinds of LLMs are appropriate or inappropriate for addressing these kinds of questions?

These are both really crucial questions for cognitive scientists to consider right now, particularly as researchers are increasingly interested in using LLMs in their own fields. In this post, I focus on the first question: what are LLMs good for when it comes to the study of the human mind? There’s no shortage of possible experiments one could run with LLMs, and while those experiments might inform the study of LLM-ology, not all of them are necessarily informative about human cognition. Some use cases, however, really can advance Cognitive Science. What are those use cases or epistemic modes?

This is the subject of a recent abstract I submitted with my former colleague Ben Bergen, and it’s also the topic we’re currently taking up in a longer paper. Here, I’ll focus on the core arguments from the abstract and paper.

What’s out of scope?

The first order of business is articulating what’s not included when we talk about using LLMs as “model organisms”.

As I mentioned above, there are all sorts of potential uses of LLMs, particularly distributed architectures like Claude Code. In the context of Cognitive Science, a researcher might want to use LLMs to conduct a literature review, design an experiment, analyze data, or even write a paper. I think these are interesting scientific applications, but I’m not concerned here with whether or when these use cases are appropriate because none of them involve the use of LLMs as models or model organisms per se. Rather, they involve using LLMs as tools (or, more generously, as collaborators) to conduct the work of science “on the ground”.

A model organism or model, in contrast, is used to test and inform theories in some target domain. Often, this works by abstracting away certain properties to provide better access into and control over the model. Researchers then seek to generalize (or “extrapolate”) insights gleaned from the model to the target domain; the validity of this generalization depends on whether certain theoretical assumptions are met and community norms are followed. This is the kind of use case we’re concerned with here.

Are LLMs “model organisms” or just models?

One distinction that comes up here is that between a model organism and a computational model.

The term “model organism” typically refers to organisms used in biology, often subjected to various forms of cultivation or standardization, to study specific questions that we can’t test in humans (or in whichever target organism we’re interested in)—either for reasons of ethics or tractability. As scholars Rachel Ankeny and Sabina Leonelli point out in their book on Model Organisms, model organisms in Biology are both “samples of nature and human artifacts”: biologists have sourced the model organism from some “naturally occurring” species, but have also subjected that type of organism to considerable standardization in the lab.

The idea here is that a model organism somehow stands in some representative relation to the target of interest (like human biology). Ankeny and Leonelli (2020) sketch out this relationship as follows: a model (organism) exemplifies certain properties (like genetic pathways, developmental traits, etc.), which—via some set of theoretical commitments—are connected to properties of the target; this connection is used to impute those properties to the target based on findings from the model.

Figure from Ankeny & Leonelli’s Model Organisms (2020).

A computational model follows, in some ways, a similar inferential logic in that it is intended to represent certain key properties of the target domain. The difference, of course, is in how this representative relation is established. In a model organism, the relation might be established through certain theoretical commitments (e.g., common evolutionary descent) or through empirical corroboration (e.g., finding shared genetic pathways in mice and humans).

In computational or mathematical models, these properties are typically instantiated in mathematical rules that allow for the simulation of target processes or properties of interest. For example, the Hodgkin-Huxley model is a model of neuron action potentials, i.e., what makes neurons “fire”. Concretely, the model consists of a series of differential equations intended to capture the flow of electrical current in and out of the cell. Researchers can modify parameters such as the number, distribution, and type of ion channels and analyze the consequences of these modifications on the cell’s behavior. The model can thus be used to generate testable hypotheses about neurons, and it can also be viewed as itself an explanation of how neurons work.
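For readers who want to see what "modifying parameters and analyzing the consequences" looks like in practice, here is a minimal sketch of the Hodgkin-Huxley equations in plain Python. The assumptions are mine: it uses the commonly cited squid-axon parameter values (with a resting potential near -65 mV), simple forward-Euler integration, and a constant injected current. The function names (`vtrap`, `rates`, `simulate`) are illustrative, and the sketch is not tuned for numerical accuracy.

```python
import math


def vtrap(x: float, y: float) -> float:
    """Evaluate x / (1 - exp(-x / y)), using the limit near x = 0 to avoid 0/0."""
    if abs(x / y) < 1e-6:
        return y * (1.0 + x / (2.0 * y))
    return x / (1.0 - math.exp(-x / y))


def rates(v: float):
    """Voltage-dependent opening (a) and closing (b) rates for the m, h, n gates."""
    a_m = 0.1 * vtrap(v + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = 0.01 * vtrap(v + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n


def simulate(i_ext=10.0, g_na=120.0, g_k=36.0, g_l=0.3, t_max=50.0, dt=0.01):
    """Return the membrane voltage trace (mV) under a constant injected current."""
    e_na, e_k, e_l, c_m = 50.0, -77.0, -54.387, 1.0
    v = -65.0
    # Start each gate at its steady-state value for the resting voltage.
    a_m, b_m, a_h, b_h, a_n, b_n = rates(v)
    m, h, n = a_m / (a_m + b_m), a_h / (a_h + b_h), a_n / (a_n + b_n)
    trace = []
    for _ in range(int(t_max / dt)):
        i_na = g_na * m**3 * h * (v - e_na)  # sodium current
        i_k = g_k * n**4 * (v - e_k)         # potassium current
        i_l = g_l * (v - e_l)                # leak current
        a_m, b_m, a_h, b_h, a_n, b_n = rates(v)
        v += dt * (i_ext - i_na - i_k - i_l) / c_m
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        trace.append(v)
    return trace


baseline_peak = max(simulate(i_ext=10.0))
no_sodium_peak = max(simulate(i_ext=10.0, g_na=0.0))
print(baseline_peak > 0.0, no_sodium_peak > 0.0)
```

Setting `g_na` to zero is a crude analogue of removing the sodium channels: the cell still depolarizes a little under the injected current, but it should no longer produce the sharp overshoot that characterizes an action potential. That kind of manipulation is what makes the model usable as a hypothesis generator.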

Which of the two better describes LLMs in the study of human cognition?

On the one hand, LLMs are not “samples of nature”, nor do they (obviously) share a common evolutionary ancestor with humans. That suggests that maybe they’re better construed as “models” than as “model organisms”.

On the other hand, we don’t entirely understand how LLMs work, which is why some have argued they are “grown, not made”. This makes them unlike many mathematical models, in which (usually) there’s a clearer relationship between the mathematical rules underlying the model and the target processes or properties the model is intended to represent. Moreover, LLMs are not, for the most part, explicitly designed as models of specific cognitive processes or properties.

At the same time, artificial neural networks were (of course) inspired by certain computational properties of biological neural networks—even if the two are ultimately very different in mechanism and function—and have long played an explanatory role in Cognitive Science. Specifically, these “connectionist” systems have been used to argue that other explanatory constructs—e.g., innate biases or symbolic representations—are not necessary for accounting for some kind of interesting behavior.

If LLMs are models, then, we might say they represent a broad class of theories concerning how linguistic structures and concepts might be learned from certain kinds of input. (I’ll describe these theories in more detail when I describe the first epistemic mode below.) But the allegedly “emergent”2 nature of certain LLM behaviors might also lead us to analogize them more to model organisms raised in laboratory conditions, which researchers might use to ask about the emergence of certain concepts (e.g., numerosity) or behaviors (e.g., Theory of Mind) considered unique to humans.

In either case, I’m not sure how much the distinction matters for now—but I do think enumerating the respective properties of models and model organisms helps clarify the scientific applications of LLMs.

Epistemic modes

I think there are at least three epistemic modes LLMs might serve.

LLMs as existence proofs

The first is what I think of as the explanatory sufficiency mode (or “existence proof mode”). Here, an LLM is used to ask whether certain kinds of input (the type and volume of training data) and architectural features (e.g., the size or initial parameters of a model) are sufficient to produce interesting cognitive behaviors. The most common of these is about the sufficiency of language input in particular3: which kinds of behaviors can emerge purely from learning to predict the statistical patterns of language?

For example, there’s considerable debate about whether human grammatical knowledge can be induced purely from exposure to linguistic input, or whether it requires something else—such as innate constraints or an “inductive bias” towards learning certain kinds of grammars. There’s a similar debate about the origins of Theory of Mind: is our ability to reason about the mental states of others an innate, evolved capacity or is it constructed from experience? And if the latter, how much does linguistic experience in particular (e.g., learning mental state verbs like “think” or “believe”) scaffold the development of this ability? In both cases, language models provide an estimate of what kinds of behavior do or don’t emerge from exposure to language.

This mode is probably the most well-developed of the three, particularly when it comes to questions about learnability from linguistic input. But there’s still room for methodological and theoretical refinement. For instance, as I’ve written before, LLMs are usually trained on orders of magnitude more language data than humans encounter in a lifetime. How should this affect our inferences about the learnability of language (or other concepts) from language alone? Moreover, language is only one kind of input: in principle, one could test analogous questions about the sufficiency of other modalities as well, such as visual input; or about the impact of one modality on another.

Inferentially, this mode doesn’t tell us anything about the necessary causal origins of some mechanism or capacity in humans. What it does provide is evidence for the sufficiency (or lack thereof) of a particular causal origin: it is, as I wrote above, a kind of existence proof.

LLMs as hypothesis generators

The second mode is the hypothesis generation mode. Researchers have a remarkable degree of control over how LLMs are trained and tested. That means that they can explore novel behaviors that may not surface in human studies. In principle, a researcher could expose an LLM to all sorts of experimental manipulations and measure the effect of each manipulation; these effects could then be used to generate testable hypotheses about human behavior—which, of course, the researcher should then go back and corroborate in a human sample.

This mode is somewhat less developed than the explanatory sufficiency mode, but there has been some work in this vein. For example, Misra & Kim (2024) use a “controlled rearing” paradigm to discover novel patterns of syntactic generalization in LLMs. Their study was concerned with a grammatical phenomenon called cross-dative generalization. Concretely, the “dative alternation” refers to two ways of describing a transfer event: the double-object construction (“I gave Cameron the cup”) and the prepositional-dative construction (“I gave the cup to Cameron”). Some verbs (like “give”) can be used with both constructions, while other verbs seem to “prefer” the prepositional-dative construction (e.g., “I explained the story to Cameron” vs. “I explained Cameron the story”), and still other verbs seem to “prefer” the double-object construction, even if weakly (e.g., “I wished Cameron luck” vs. “I wished luck to Cameron”). (Note that verb “preference” is very much a gradient phenomenon: lots of verbs are somewhere in the middle, and constructional preferences are also driven by other factors, such as the length and animacy of the recipient and “theme” arguments.)

Cross-dative generalization is the question of whether and when speakers infer that a verb can be used in both constructions. That is, if I encounter a novel verb in the prepositional-dative construction (“She pilked it to me”), when do I infer that it can also be used in the double-object construction (“She pilked me it”)? There’s some evidence from work on humans (e.g., using artificial grammar learning paradigms) that cross-dative generalization is asymmetric, i.e., it’s easier to generalize from the double-object to the prepositional-dative than the reverse. But testing related hypotheses in humans is necessarily limited in scope: it’s hard to test every combination of factors that might impact cross-dative generalization.

With an LLM, however, researchers (like Misra & Kim) can parametrically manipulate all sorts of factors, including the number of exposures to a verb in a particular construction, as well as constructional features, like the length of the recipient argument (e.g., “John” vs. “the man I met at a conference on language learning just the other day”). In their paper, Misra & Kim (2024) first replicate the asymmetric pattern attested above: LLMs, like humans, are more prone to generalize from the double-object to prepositional-dative construction than the reverse. This is an important step, because it establishes behavioral congruence on a known phenomenon. They then find specific features—primarily relating to the first postverbal argument—that facilitate cross-dative generalization, some of which represent novel hypotheses about the kinds of features that could drive generalization in humans. Importantly, discovering all these features “bottom-up” in a human sample would’ve been pretty intractable; but now that a clear hypothesis has been established, future work could test that specific hypothesis in humans.
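To make "parametric manipulation" a bit more concrete, here is a minimal sketch of how a grid of training conditions could be enumerated. It is only an illustration under stated assumptions: the factor levels, the `make_sentence` and `build_conditions` helpers, and the sentence template are all hypothetical, and the actual design in Misra & Kim (2024) may differ.

```python
from itertools import product

# Hypothetical factor levels, chosen only for illustration.
EXPOSURES = [1, 5, 25]
RECIPIENTS = {
    "short": "John",
    "long": "the man I met at a conference on language learning just the other day",
}
CONSTRUCTIONS = ["double_object", "prepositional_dative"]


def make_sentence(verb, recipient, theme, construction):
    """Render one exposure sentence for a novel verb in the given construction."""
    if construction == "double_object":
        return f"She {verb} {recipient} {theme}."
    return f"She {verb} {theme} to {recipient}."


def build_conditions(verb="pilked", theme="the cup"):
    """Cross every factor level to get the full grid of exposure conditions."""
    conditions = []
    for n, (length, recipient), construction in product(
        EXPOSURES, RECIPIENTS.items(), CONSTRUCTIONS
    ):
        conditions.append({
            "n_exposures": n,
            "recipient_length": length,
            "exposure_construction": construction,
            "sentence": make_sentence(verb, recipient, theme, construction),
        })
    return conditions


conditions = build_conditions()
print(len(conditions))            # 3 exposure levels x 2 lengths x 2 constructions = 12
print(conditions[0]["sentence"])  # She pilked John the cup.
```

Each cell of the grid would correspond to one training run (or one set of exposure sentences), after which the model's willingness to use the novel verb in the other construction could be measured. Even this toy grid has 12 cells, and it grows multiplicatively with every added factor.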

Zooming out a bit: like the first mode, the hypothesis generation mode seeks to establish behavioral congruence between LLMs and humans on some target task. But it also goes a step further and uses LLMs as a kind of “platform” on which various empirical relationships can be explored and tested.

The search for mechanistic congruence

The third mode is the most speculative, under-developed, and philosophically fraught, which is what (for me at least) makes it the most theoretically intriguing. This is what I call the mechanistic congruence mode.

Here, researchers might first establish behavioral congruence between humans and LLMs on a given task. This is, effectively, the first mode from above. Then, they might conduct mechanistic interpretability research on LLMs to identify the putative mechanisms underlying LLM behavior on that task. And then—and this is the potentially fraught step—they might try to use those results to draw inferences about the human mechanisms involved in the task.

The benefits, if this last step is at all licensed, should be clear: as I pointed out earlier, researchers have a degree of access into and control over LLMs that we simply don’t have when it comes to humans. That means that our ability to identify genuine mechanistic explanations of human behavior is quite limited. We can devise clever behavioral experiments and corroborate these findings with convergent evidence from brain imaging, but at the end of the day, we simply can’t know the representations or mechanisms subserving human behavior on a given task with the same fidelity that we can know those things for LLMs. Thus, extrapolating from LLMs to humans would represent a major leap forward in our ability to characterize the mechanisms underlying interesting human cognitive behaviors.

The question, though, is whether anything like this extrapolation is licensed.

There are at least two major problems that arise.

The first is mechanistic heterogeneity: put simply, different systems can implement the same behavior in different ways, and behavioral convergence does not entail mechanistic convergence. Mechanistic heterogeneity is well-attested in both biological systems and language models; indeed, one of my current research directions explicitly concerns questions about whether and when it makes sense to generalize mechanistic findings across LLMs. (If you’re interested, I wrote about that here.) If two language models might implement the same “computation” in different ways, why should we expect mechanistic convergence between humans and LLMs? Unlike with model organisms, we don’t even have the epistemic benefit of pointing to common evolutionary descent and an assumed principle of evolutionary conservation.

The second problem is in some ways even more challenging to resolve. I am not the first person to note that biological networks (like the human brain) work in very different ways than artificial neural networks (like language models). This matters less for questions about behavioral convergence. But when it comes to extrapolating mechanism, it’s not immediately clear what it even means to generalize from an LLM to a human.

Suppose we map out a circuit in an LLM that implements a particular behavior of interest; perhaps, as per the section above, this is the “Dative Alternation Circuit” (DAC). It’s a network of attention heads and feedforward operations that, together, controls a language model’s “preferences” for a double-object construction (“I gave Cameron the book”) or a prepositional-dative construction (“I gave the book to Cameron”). That alone is a success for improving our understanding of the LLM. But does it tell us anything about analogous mechanisms in humans?

Here, we encounter a question of explanatory abstraction: at which level of analysis, if any, does mechanistic extrapolation from LLMs to humans make sense? Remember that this hypothetical DAC is effectively a characterization of which units in an LLM appear responsible for a particular kind of observable behavior. The implementational level of abstraction—the level of individual biological neurons—thus seems like an unlikely candidate. It’s not clear how a description of attention heads in a feedforward transformer model can tell us about which parts of the human brain facilitate our understanding of the dative alternation (or how).

But the computational level of analysis, by contrast, might be too abstract. Understanding whether two systems perform the same “function” or computation is an interesting question, but it’s one that seems answerable primarily by comparing their behavior on tasks designed to measure that function. What good, then, is the mechanistic analysis?

The analogy, if there is to be one at all, is probably better situated at what’s called the algorithmic level of analysis: a characterization of the kinds of representations and operations underlying some cognitive capacity or behavior. For instance, suppose the hypothetical DAC in this LLM is found to represent a few key variables, such as the length and animacy of the recipient argument. Suppose, too, we discover that there’s an “order of operations” in how these representations are prioritized and integrated, and this ultimately controls the model’s “preference function” over the dative alternation. This level of analysis—more abstract than the individual attention heads, but less abstract than simply describing the overall behavior—feels to me like the advantage conferred (if any) by a mechanistic analysis of LLM behavior.

Returning to the first problem, however: can such a characterization be generalized from LLMs to humans? After all, different LLMs may implement entirely different DACs! One starting point would be to identify the putative DACs in other LLMs. If a similar mechanism is discovered across a large, architecturally and linguistically diverse sample of LLMs, this may indicate a kind of “stable attractor” in the landscape of possible algorithmic solutions to a task. Accordingly, this could (and should) motivate us to test this hypothesis in humans (much like the second mode above). While we can’t directly observe mechanistic implementation of (say) the dative alternation in humans, we can design clever behavioral tasks that provide corroborative evidence for particular algorithmic solutions. If that evidence aligns with the evidence obtained from an LLM, then maybe we can tentatively conclude we’ve learned something about the representational processes underlying human comprehension of the dative alternation.

Where to from here?

Much has been written in the last year about the potential uses of LLMs in Cognitive Science (e.g., Dillion et al., 2023; Jain et al., 2024; Frank & Goodman, 2026; Lin, 2026) and about the epistemic risks of doing so (e.g., Crockett & Messeri, 2025). My intended contribution here, and in the paper I’m currently co-authoring, is to more precisely articulate some of the specific modes in which LLMs might serve as “model organisms” (or just “models”).

Of the three modes I mentioned, the first is (in my view) the most well-developed and also the least contested (which does not imply that there is no dispute). The third is the most speculative and is thus the most questionable; that’s also why it’s rich for philosophical discussion. And the second mode occupies an interesting epistemic position here: the use of LLMs to generate novel hypotheses is still (as far as I can tell) in its infancy, and I think Misra & Kim (2024) offers a helpful methodological template of how this could be carried out procedurally.

It’s unlikely I’ve exhausted the possible epistemic modes here, or that I’ve successfully identified all of their respective complications. But my hope is that enumerating them has two consequences: first, that it prompts healthy debate, critique, and iteration; and second, that it encourages researchers using LLMs as “models” in some capacity to explicitly consider and name the particular mode they’re operating under and articulate which inferences they’re hoping to draw.

Indeed, some of the language models I study are not very large at all.

In one sense, most behaviors of an LLM are “emergent” in that they aren’t explicitly programmed into the architecture from the beginning. But the question of what “counts” as emergent is a philosophically complex one, as other researchers have argued.

In my research, I call this the “distributional baselines” approach.

Situation models: a cognitively plausible abstraction

Sean Trott — Wed, 18 Feb 2026 19:50:20 GMT

A common point of contention in discussions about Large Language Models (LLMs) like ChatGPT revolves around the notion of a “world model”: specifically, whether LLMs acquire something like a robust world model through their training process, or whether they fail to do so—perhaps because of inherent limitations in their design.

World models are in the news these days, though people don’t always agree on how exactly to define them, or how to determine whether a system like an LLM “has”1 one. But most definitions (like Melanie Mitchell’s in this article) do emphasize the role of causal abstractions. The world around us is vast and complex, and rather than reconstructing it atom-by-atom (see Borges’s On Exactitude in Science), it’s probably more useful to compress (or “forget”: see Borges’s Funes the Memorious2) certain details in the interest of building mental representations that can guide our future behavior.

This is, of course, a familiar concept to Cognitive Science. Cognitive scientists don’t always agree on which abstractions organisms form or how they come about, but the basic idea of a representational abstraction or “internal model” has long played a central role in explanations of the mind and brain. Although some researchers do reject the notion of mental representations altogether,3 I’d argue the dominant paradigm is one that views the mind/brain as constructing internal models of some kind.

All of which is to say: debates about “world models” are very familiar to me, and are occasionally frustrating because of a lack of specificity in what, exactly, is meant by the term. In cases like this, I think theoretical specificity is a virtue. Any specific philosophical commitment is almost certainly wrong—in the sense that it must, by definition, neglect certain details—but making (or rejecting) commitments is important for getting clarity, which in turn is crucial for determining what we think and also how to go about testing it empirically.4 As Max Weber argued, clarity is perhaps the best science has to offer.5

My entry point into this question actually has little to do with Artificial Intelligence. When I started graduate school, I was interested in narrative construction and comprehension: what happens in our minds while we read or listen to a story? I’d been reading narratology (e.g., Vladimir Propp’s Morphology of the Folktale), and while there was something appealing about the idea of a “grammar of narrative”, I wanted something more cognitively inspired, which is what led me to the research literature on situation models. Ultimately, I ended up focusing on different (though related) topics in my PhD, but that literature has stayed with me through the years, and I think it contains lessons for contemporary debates about AI world models.

Memory constraints and the power of inference

Humans don’t have infinite memory. When we read or listen to a story, it’s quite hard to remember a specific word we encountered five sentences ago. Thus, linguistic input faces what some researchers call a “now-or-never bottleneck”. To deal with this bottleneck, the mind must rapidly compress that input into some kind of (possibly hierarchical) representation that allows us to capture the “gist” of what was said while also forgetting some of the finer-grained details that may not be crucial for comprehension.

According to this account, memory constraints induce the need for representational abstraction. But which abstractions? At some level, a useful abstraction probably includes a representation of the event or “scene” described in language. This is sometimes glossed as the following question: who did what to whom? It’s no accident that language contains structural regularities that allow comprehenders to quickly extract this information from a sentence: “the lion ate the man” is, after all, very different from “the man ate the lion”.

Moreover, humans often “go beyond” what’s said. These inferences are fundamental to language comprehension. Consider, for instance, the following text:

I went to my friend’s birthday party last night. I had a hard time waking up this morning!

A comprehender likely infers a causal connection between these events: the speaker had a hard time waking up because they went out the night before. Moreover, you might further infer the specific cause: perhaps the speaker had too much to drink at the party. These inferences could in principle be wrong—they’re not logically entailed by the text—but they’re also pretty reasonable. Moreover, some inferences, like the causal connection between the events, feel so obvious that we’d likely balk if the speaker subsequently insisted the events were unrelated.

Once you’re looking for them, you notice these inferences everywhere. So much of the meaning we extract from language is not directly in the text: it’s something we construct by virtue of connecting the specifics of what we read or hear to our background knowledge and expectations. These expectations in turn shape the inferences we make. Returning to the example above: if you know the speaker doesn’t drink, then you probably won’t infer that they over-imbibed the night before—you might instead assume that they simply stayed up too late.

Linguists and psychologists have identified all sorts of inferences that underlie successful communication. Pragmatic inferences have to do with inferring a speaker’s intended interpretation based on what they did and didn’t say, as with scalar implicature: “Some students passed the test” suggests not all students passed, even if the all interpretation isn’t semantically impossible. Instrumental inferences have to do with inferring the specific instruments involved in an event: if we hear “John shaved this morning”, we probably assume John used a razor—if John had instead used something very unusual (like a steak knife), the speaker likely would have specified this. In each case, we bring to bear certain assumptions about how communication works and the ways in which meanings are typically expressed.

None of this tells us what kinds of abstractions humans form. But it does give us some clues. A situation model should be able to accommodate at least two facts: first, abstractions must be formed relatively quickly, in ways that allow us to capture the overall “scene”; and second, the abstractions should allow comprehenders to draw inferences about what wasn’t said—and these inferences, in turn, should influence the content of the situation model.

So what goes into a “situation model”?

One account comes from cognitive scientist Rolf Zwaan, who proposed the event-indexing model. According to this framework, the role of a situation model is to monitor and represent information about events and the actions of entities involved in those events. In a 1995 article, Zwaan and his co-authors suggested that this model might include at least five types of indices: temporality (an event’s time frame), spatiality (an event’s spatial region or location), protagonist(s) (the key “players” involved in an event), causality (an event’s causal relationship to other events), and intentionality (an event’s relationship with the goals of various protagonists).

Consider, for example, the dimensions involved in the following sentence:

Sam signed a lease for an apartment near the Blue Line so he could easily take public transportation to UC San Diego.

We have at least one (named) protagonist: Sam. We also have a general spatial location: an apartment near the Blue Line, one of the routes on the San Diego light rail (though notably, the signing of the lease need not have occurred in this location!). The verb’s in the past tense (“signed”), so it suggests the event already took place. And intentionality is also explicit: Sam’s motivation is being able to easily take public transit to campus.

Key elements of a situation model.

As language unfolds, comprehenders dynamically update their situation model according to changes along each of these indices. Zwaan et al. (1995) write:

Then the reader monitors whether incoming story events require updating an index on any of these situational dimensions. For example, if a clause indicates a time shift compared with the previous clause, then the temporal index of the model needs to be updated. When an incoming event takes place in a different spatial region, the spatial index needs to be updated. When an incoming event involves a different protagonist, the protagonist index needs to be updated. When an incoming event is causally unrelated to the previous event, the causal index needs to be updated. Finally, when an incoming action introduces a new goal structure, the motivational index requires updating.
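The monitoring procedure described in this passage lends itself to a small data-structure sketch. What follows is a toy rendering, not anything taken from Zwaan et al. (1995): the `Event` fields, the `indices_to_update` function, and the hand-annotated example events are all hypothetical, and the sketch assumes the five dimensions have already been extracted from the text (which is exactly the hard part).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One story event, hand-annotated along the five event-indexing dimensions."""
    time: str                 # temporality
    space: str                # spatiality
    protagonist: str          # protagonist
    goal: str                 # intentionality
    caused_by_previous: bool  # causality, relative to the preceding event


def indices_to_update(previous, incoming):
    """List the situational indices that the incoming event requires updating."""
    updates = []
    if incoming.time != previous.time:
        updates.append("temporality")
    if incoming.space != previous.space:
        updates.append("spatiality")
    if incoming.protagonist != previous.protagonist:
        updates.append("protagonist")
    if not incoming.caused_by_previous:
        updates.append("causality")
    if incoming.goal != previous.goal:
        updates.append("intentionality")
    return updates


# Hypothetical annotations for two consecutive events in a story about Sam.
signing = Event(
    time="last month", space="leasing office", protagonist="Sam",
    goal="commute easily to campus", caused_by_previous=False,
)
commute = Event(
    time="this morning", space="Blue Line", protagonist="Sam",
    goal="commute easily to campus", caused_by_previous=True,
)

print(indices_to_update(signing, commute))  # ['temporality', 'spatiality']
```

One thing a sketch like this makes visible is that the number of updated indices is itself a quantity one could relate to behavioral measures. Here the second event shifts time and place but keeps the protagonist, goal, and causal chain intact.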

There’s a lot this initial framework leaves out: for instance, it doesn’t tell us precisely how these indices are identified or updated.

But it does offer a specific, testable theory of what goes into a situation model. The theory is also intuitively plausible, at face value: it makes sense that comprehenders would track these dimensions, given both our own phenomenology of understanding language and also what we know about language structure itself, which contains a number of explicit markers relevant to these indices.

Perhaps more crucially, the model satisfies the two requirements we enumerated above: first, these five dimensions seem like good criteria for the “gist” of linguistic input; and second, they provide the bedrock for future inferences and can themselves be informed by these inferences (e.g., inferring a causal connection between events).

Situation models: the details

From one perspective, the question of whether people form situation models feels quite obvious. It’s hard to argue with the claim that language comprehension consists in part of tracking information about events—such as who was involved, as well as when, where, and why it took place. Moreover, if someone is asked an explicit question about these features, they can probably produce a reasonable answer, provided they attended to the text in question. One might even argue that forming these situation models is constitutive of successful comprehension.

But as always in Cognitive Science, the details matter. For instance: when do people construct these situation models? Perhaps there are some circumstances where comprehenders build rich representations of a text, and others where they extract a relatively shallow or “good-enough” representation. Moreover, do people construct these models automatically while reading/listening to language, or are they built more strategically in a post-hoc manner—perhaps reflecting the context-dependent needs of the comprehender?

You might also wonder whether certain features are more salient than others. Do people attend more to the agents involved in an event than the spatial location of the event? Is there legible variance across individuals or contexts in this tendency?

Additionally, you might ask about the representational format of these situation models. Are they best described as “symbolic” in nature? Or are they more like grounded representations of an event—a kind of “mental simulation” of the scene and its protagonists, as described by language? And again, does this vary across individuals or the kind of scene being described?

These questions and others have been the focus of research on situation models for multiple decades (see Zwaan’s 2025 review paper for a summary of past and current directions). This work has used a variety of methods common to Psycholinguistics—reading time studies, eye-tracking, EEG, verb clustering, and more—to investigate the dimensions along which comprehenders reliably form abstractions while reading or listening to language. Much is still unknown, of course; but in my view, one lasting and important contribution is actually the framework itself—concretizing a “situation model” in terms of specific event features that can subsequently be operationalized and tested.

LLMs and situation models

I started with the observation that many contemporary debates about LLMs revolve around the question of whether these systems form “world models”. This question is often construed as foundational to the deeper question of whether (and what) LLMs understand or even whether and to what extent LLMs are intelligent.

In some cases, these debates are purely theoretical. Critics might suggest that LLMs are by definition incapable of constructing world models because of how they’re trained (e.g., predicting tokens in a string) or designed (e.g., neural networks without explicit symbolic representations). Here, the suggestion is that something like a world model is either implausible or impossible in such a system. In turn, proponents might respond by arguing that the training signal for many contemporary LLMs is typically much richer than characterized by critics (e.g., using reinforcement learning with verifiable rewards), and that even learning to predict text tokens encourages a system to “reverse-engineer” the generative process giving rise to those tokens, i.e., a kind of world model. In response to the point about symbolic representations, they might suggest that this is not actually a limitation: after all, cognitive scientists don’t universally agree that the human mind is best described in terms of explicit, propositional representations either—there’s a long tradition of scholars describing the mind as a continuous state-space.

Some of these debates recruit empirical evidence as well. For example, skeptics might point to common errors in planning or spatial reasoning. Anecdotally, I’ve noticed that even commercial LLMs like ChatGPT sometimes struggle with route planning, mixing up cardinal directions or suggesting a “short drive” to locations that are multiple hours away.6 That said, even “base” LLMs demonstrate a surprising ability to recall facts, answer comprehension questions about passages of text, display “commonsense” knowledge about the world, and solve Theory of Mind tasks.7

Where does that leave us?

My point in briefly laying out “both sides” here is not to suggest that both are equally plausible. Rather, my goal is to illustrate that the theoretical and empirical cases marshaled on each side are often somewhat ad hoc, unanchored by a precise, testable framework for what constitutes a coherent “world model”. In the absence of such a framework, it is easy for arguments to go in circles: any piece of evidence can be seized upon or dismissed as needed because the inferential stakes have not been determined.

As I argued earlier, this is why I think specificity is a virtue. While the event-indexing model is by no means the only plausible framework of what it means to understand language, it does offer specific, testable criteria for what constitutes a situation model. Its specificity also leaves it open to correction and revision—a crucial part of any theory. In my view, research on LLMs and whether (or what) they “understand” would do well to adopt such a framework as a theoretical anchor. It will, at any rate, likely be a central topic of my coming research.

In quotes, here, because even the notion of “possessing” something like a world model is itself a kind of philosophical commitment.

I think a reader of Borges could arrive at a fairly robust philosophical understanding of many key topics in Cognitive Science.

A topic for another post, perhaps (and relevant to the footnote above).

One of the worst criticisms a theory can get, after all, is that it is “not even wrong”.

I also appreciated this quote from an essay by Philip Agre pointing out that the notion of a “model” is often under-specified and thus, in some sense, indisputable:

It is found, for example, in the notion that knowledge consists in a model of the world, so that the world is effectively mirrored or copied inside each individual’s mind. This concept of a “model”, like that of a “plan”, has no single technical specification. It is, rather, the signifier that indexes a technical schema: it provides a way of talking about a very wide range of phenomena in the world, and it is also associated with a family of technical proposals, each of which realizes the general theme of “modeling the world” through somewhat different formal means. Just as disagreements with the planning theory are unintelligible within AI discourse, it makes virtually no sense to deny or dispute the idea that knowledge consists in a world model. The word “model”, like the word “plan”, is so broad and vague that it can readily be stretched to fit whatever alternative proposal one might offer. AI people do not understand these words as vague when they are applied to empirical phenomena, though, since each of them does have several perfectly precise mathematical specifications when applied to the specification of computer programs.

Of course, these errors could probably be addressed in part by integrating some kind of map API—though I suspect some skeptics would suggest that this actually corroborates their argument that systems need explicit, structured representations to solve complex planning tasks. Part of the problem here is actually disagreeing on what constitutes evidence for one “side” or another in the first place!

The evidence on Theory of Mind is quite complicated, as some LLMs also fail Theory of Mind tasks when stimuli are modified in subtle ways. The question of what an LLM passing a Theory of Mind task “means” is thus very difficult to answer.

The necessity (and complexity) of fixed points in measurement

Sean Trott — Wed, 11 Feb 2026 14:00:50 GMT

An ongoing challenge in developing a reliable science of large language models (LLMs) is understanding and predicting their behavior. This is, in part, a problem of measurement—in both an empirical and a philosophical sense. Long-time readers will know I’ve been preoccupied with this problem for a while now. In an effort to better understand it, I’ve recently begun reading more broadly about measurement and the validation of standards throughout various areas of science. That includes Inventing Temperature by Hasok Chang.

While there are clearly a number of disanalogies between measuring temperature and measuring the behavior of LLMs1, I think the field of thermometry can nonetheless be instructive to modern discussions of LLM-ology or measurement more generally. Indeed, temperature might be a particularly useful point of comparison: despite the fact that thermometric methods have seen such success (reliable thermometers can now be purchased cheaply at CVS!), and despite the apparent “simplicity” of a construct like temperature, thermometry has faced multiple philosophical and empirical challenges. In future posts, I’ll write more about which insights from this history may be applicable to current debates; for now, however, I will focus on the efforts of scientists to resolve these past challenges—which I think should imbue us all with humility.

Hasok Chang’s Inventing Temperature is about the historical and philosophical background of thermometry: the development of practices and instruments to measure temperature. There are, of course, a number of deep metaphysical questions one can ask about temperature (e.g., “what is heat?”), but Chang focuses more narrowly on the question of how scientists went about devising systems for measuring it in the first place, as well as the various epistemological challenges they faced in the process. By doing so, he identifies a kind of pragmatic schema by which scientists make progress in a new, thorny field, which he calls epistemic iteration.

The book covers several distinct challenges in the history of thermometry, but I’m going to focus here on the problem of fixedness.

If you’re devising a completely novel measuring device (say, a thermometer), you need some kind of reference point against which to assess the fidelity of that device. Intuitively, this reference point should be fixed or at least relatively stable: if it’s highly variable—fluctuating wildly between successive measurements—then you have no way of determining the reliability of your device. But this raises a kind of chicken-and-egg problem: how do you know if the fixed point is actually fixed?

Of course, if you already have a reliable instrument, measuring the stability of this reference point is easy. Indeed, the instrument itself may come to be seen as the reference point: nowadays, we use thermometers to establish the temperature of something (e.g., whether one has a fever) and don’t typically think to question whether the thermometer is providing a valid measurement.

At some point, however, such an instrument had to be created (and validated) for the first time. Accordingly, some external fixed point was required to establish the reliability of the instrument. Various reference points were used at different times in history, such as blood temperature or “blood heat” (used by Daniel Gabriel Fahrenheit in establishing the Fahrenheit scale), the boiling and freezing points of water (used by Anders Celsius for the Celsius scale), or even the temperature of a deep cellar (used by Edme Mariotte). The hope in each case was that the reference point was stable enough that it could be relied upon as a consistent index with which to validate a new thermometer. (It’s worth noting that in 2019, the kelvin was redefined in terms of the Boltzmann constant: a theoretically defined fixed point from fundamental physics.)

Of course, as we know, these fixed points were not perfectly fixed. Even something like the boiling point of water is subject to change under various conditions, such as altered atmospheric pressure, the addition of dissolved solutes in the water, or the material of the container. Even the question of what constitutes “boiling” itself is surprisingly difficult to answer.
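To see why an unstable fixed point matters, here is a minimal sketch of a two-point calibration. It makes assumptions that go beyond anything in Chang's account: the instrument reading is taken to vary linearly with temperature, the freezing and boiling points are assigned 0 and 100 degrees, and the `make_scale` function and the column heights are hypothetical values picked for easy arithmetic.

```python
def make_scale(reading_at_freezing, reading_at_boiling):
    """Return a function mapping a raw instrument reading to degrees.

    Assumes the reading varies linearly with temperature between the two
    fixed points, which are assigned 0 and 100 degrees respectively.
    """
    span = reading_at_boiling - reading_at_freezing
    if span == 0:
        raise ValueError("The two fixed points must give different readings.")

    def to_degrees(reading):
        return 100.0 * (reading - reading_at_freezing) / span

    return to_degrees


# Hypothetical column heights (in mm) for an imaginary thermometer.
ideal = make_scale(reading_at_freezing=20.0, reading_at_boiling=220.0)

# Same instrument, but the "boiling" mark was made when the water boiled
# slightly cooler than usual, so the column only reached 218 mm.
shifted = make_scale(reading_at_freezing=20.0, reading_at_boiling=218.0)

reading = 120.0
print(ideal(reading))              # 50.0
print(round(shifted(reading), 2))  # 50.51
```

Because every reading is interpolated between the two marks, a small shift in where "boiling" was recorded does not stay at the top of the scale: it propagates to every temperature the instrument reports.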

Towards a “theory of boiling”?

Chang spends the first half of this chapter on a period of historical debate (spanning the 18th and 19th centuries) that occurred over exactly this question. For context: at this point in time, thermometers already existed, many of which relied on the Celsius scale, which in turn used the boiling and freezing points of water as its fixed reference points.

Despite this progress, some scientists—such as Jean-André de Luc—were skeptical about the reliability of these points. In 1772, De Luc wrote:

Today people believe that they are in secure possession of these [fixed] points, and pay little attention to the uncertainties that even the most famous men had regarding this matter, nor to the kind of anarchy that resulted from such uncertainties, from which we still have not emerged at all. (pg. 11 of Inventing Temperature)2

A few years later, the Royal Society created a committee to work on this problem and make concrete recommendations about which fixed points should be used, and how.

Some causes of variation were well-understood, such as atmospheric pressure. But others were more conceptually thorny, such as the fact that the precise temperature at which water boils depends on what one means by “boiling”. Scientists had long acknowledged different “types” of boiling (e.g., “slow” vs. “fast”), as well as various “stages” (e.g., “beginning to boil” vs. “boiling vehemently”). These types and stages corresponded, unsurprisingly, to different measured temperatures. Consistent with these observations, De Luc carried out a series of experiments demonstrating a range of boiling points, from “hissing” to what he called “true ebullition” (corresponding to a range of about 1.5 degrees Celsius).

Perhaps even more problematic and confusing was De Luc’s discovery of “superheating”, a phenomenon whereby water is heated well past its typical boiling point without actually boiling. De Luc was trying to figure out the temperature of the “first layer” of water that began to boil; one of the experiments he carried out involved introducing tiny drops of water into hot oil, which would explode into vapor when the oil was hot enough—which in this case meant 112 degrees Celsius. He then replicated this effect using a different methodology (heating water slowly in a glass flask with a long narrow neck), showing that superheating was a real phenomenon.

Worried that the presence of dissolved air in water was interfering with the boiling point, De Luc sought to create a solution of “pure” water that had been completely deaerated. The process he went through was quite remarkable and involved shaking a flask continually for about four weeks:

This operation lasted four weeks, during which I hardly ever put down my flask, except to sleep, to do business in town, and to do things that required both hands. I ate, I read, I wrote, I saw my friends, I took my walks, all the while shaking my water… (pg. 19)

Under normal atmospheric pressure, this airless water withstood temperatures of up to 112 degrees Celsius without boiling—at which point it boiled off explosively—again confirming De Luc’s observation of superheating.

Theoretically, then, thermometry was in a difficult spot: it needed a fixed reference point with which to validate thermometers (indeed, it was already dependent on the boiling and freezing points of water), but articulating a theoretical definition of “boiling” was surprisingly challenging, not least because the actual boiling point was not, in fact, fixed; this, in turn, called into question the feasibility of the whole enterprise.

Serendipity and steam

We clearly do have reliable thermometers that we trust will yield stable measurements. How did thermometry escape the problems associated with finding the boiling point?

Chang suggests that it was a combination of serendipity and dogged empirical investigation. First, serendipity: although the boiling point problem seems intractable in its description, the problem is greatly simplified when one considers that under most “normal” conditions, the boiling point is, in fact, relatively stable. In some sense, the problem only arises when trying to draw a universal generalization about boiling, e.g., “Water always boils at 100 degrees Celsius”. As we’ve seen, this is not true. However, the appropriate conditions for getting water to boil at 100 degrees can, in Chang’s words, mostly be achieved by simply “not bothering too much” (pg. 24). He writes:

It was a great blessing for early thermometry that the temperature of boiling was quite fixed under the sort of circumstances in which water tended to be boiled by humans living in ordinary European-civilization conditions near the surface of the earth without overly advanced purification technologies. (pg. 24).

This insight is really remarkable to me. The most stable conditions were actually produced when using the “impure” water (e.g., the water with dissolved air particles), which happened to be impure in roughly the right way.

A further stroke of serendipity came from another realization around this time, which is that the temperature of steam is, in fact, even more reliable than the boiling point. Steam temperature was not affected by the depth of water, the speed of boiling, the distance across the surface of the pot, or many other of the factors that plagued scientists trying to identify a stable boiling point. The Royal Society’s decision, then, was to use steam temperature as a fixed point instead:

The most accurate way of adjusting the boiling is, not to dip the thermometer into the water, but to expose it only to the steam, in a vessel closed up in the manner represented. (pg. 24)

The fixity of this steam point appears to have been largely uncontested, and by the mid-19th century, the sources Chang draws on suggest it was universally accepted.

As Chang notes, the question of why the steam point was so dependable continued to occupy scientists. Interestingly, here, too, the answer depends in part on serendipity: there are actually conditions under which steam does not occur, or under which steam can be cooled below the “steam point”; moreover, the conditions under which steam does occur include certain impurities in the air (e.g., dust particles). Chang writes:

Now we can see that it was only some peculiar accidents of human life that gave the steam point its apparent fixity: air on earth is almost always dusty enough, and no one had thought to filter the air in the boiling-point apparatus. (pg. 36)

The role of epistemic iteration

So far, I’ve focused narrowly on a series of specific challenges faced by researchers of the time. Is there anything more general that we can learn about how scientists go about justifying and validating new standards of measurement?

Here, Chang contrasts two hypothetical modes of justification.

In the first, which he calls “justificatory descent”, researchers might seek to justify a system of measurement by grounding it in some other accepted (and more foundational) way of knowing. For example, the readings on a thermometer might be validated with respect to sensation: we are, after all, equipped with sensory systems that happen to discriminate between hot and cold. Yet this epistemological approach quickly runs into an infinite regress. On what basis is sensation justified as a “ground truth”? Indeed, even in the absence of a thermometer, we might already suspect that sensation can be erroneous: depending on whether a person first places their hand in a bowl of hot or cold water, they will subsequently perceive the temperature of a lukewarm bowl of water very differently. Chang argues that this approach leads to a kind of epistemic trap, and further, that it’s simply not an accurate characterization of the kind of activity scientists engage in when justifying their measures.

The second approach, which Chang prefers, is called “epistemic iteration”. Here, scientists don’t seek to ground claims in foundational axioms. Some assumptions or pieces of knowledge might be prior to others—like sensation to readings on a thermometer—but only for the pragmatic reason that they are, definitionally, prior; in the beginning, sensation is all we have. Scientists thus begin with an imperfect system for acquiring knowledge about the world (e.g., our senses), then seek to improve that system via a process of incremental trial-and-error (i.e., epistemic iteration). This framework acknowledges the inherent uncertainty of any given system of knowing, and is thus more accommodating of the observation that we regularly use instruments to “correct” the perceptions of our senses—even if, in some grand historical sense, those instruments were initially validated using sensory experience.

Chang describes this process in stages. Thermometry begins with the ability to discriminate hot and cold, as well as the recognition that this distinction is somehow important. In stage two, we might build elementary systems that capture this variation; we might notice, for example, that liquids expand when heated and contract when cooled, and we might exploit this regularity to construct a simple thermoscope—whose readings reflect ordinal variance in temperature. And in stage three, we might refine these elementary systems by identifying stable fixed points.

This is the stage at which De Luc and others described above were operating: semi-reliable systems of measurement had already been invented, but scientists were seeking ways to make them more reliable still. Thus, scientists might realize that the boiling point is less stable than initially thought, and settle on steam temperature as a reference point instead. Presumably, this drive towards stabilizing and formalizing a fixed point is what eventually leads to the use of theoretical constants, such as the Boltzmann constant.

Galileo’s thermoscope, on display at Musée des Arts et Métiers in Paris. Image from Wikipedia.

For those inclined towards theoretical formalization, Chang’s account of epistemic iteration might feel unsatisfying: the process is messy, empirical, ungrounded in any universal axioms. But I think it is an extremely compelling account, especially of how the early stages of a field operate. I also think it is surprisingly inspiring—in a sense, it relieves one of the pressure to prematurely formalize each construct of interest; instead, it suggests there is value in trusting the process of empirical trial-and-error.

For one, the measure (e.g., the height of liquid in a thermometer or thermoscope) is causally and transparently tied to the underlying construct of interest (temperature) and also, in many cases, to direct sensation (as we’ll discuss later); in contrast, the relationship between an LLM’s behavior on a relatively artificial benchmark and the purported “capability” that benchmark measures (e.g., “reasoning”) is much less clear.

Page numbering refers to Chang’s book, not the text Chang is quoting (and will for future quotations as well).

Healthy minds and sick souls

Sean Trott — Sat, 31 Jan 2026 19:45:58 GMT

As some readers might know, I occasionally write about books I’m reading over at The Leaky Margin. In some cases, when those essays touch on themes that are also relevant to what I discuss at The Counterfactual, I post them here too. This post doesn’t directly connect to issues in Cognitive Science or AI, but it does establish some concepts and themes I’ll likely be turning to this year as I write (as I have before) about the role of academic research. Besides, I just think The Varieties of Religious Experience is an amazing book and I encourage anyone who hasn’t already to read it; some other themes from the book are addressed in my post on Language Models and the Ineffable.

A good dichotomy is hard to find. Many attempts to cleave the world in two miss the mark: they feel at times overly reductive, or draw a line that doesn’t quite fit. A good contrast, in my view, often clarifies useful explanatory boundaries while ideally retaining some degree of conceptual flexibility.

William James presents a positive example of such a contrast in Varieties of Religious Experience, where he distinguishes between two religious temperaments: that of healthy-mindedness, and that of the sick soul. I’ll describe these temperaments in more detail later on, but I’d like to linger for a moment on the explanatory context for this contrast, which I think is relevant. James is articulating, in a series of lectures, his definition of religious belief (a question of explanatory scope), the factors that underlie conversion and belief (a question of psychological experience), and the significance of religious belief in guiding one’s orientation and actions (a question of practical impact).

The contrast between healthy minds and sick souls is mentioned in the specific context of circumscribing and defining his explanatory goals. James has already defined, provisionally, religious belief as follows:

…the belief that there is an unseen order, and that our supreme good lies in harmoniously adjusting ourselves thereto. This belief and this adjustment are the religious attitude in the soul. (pg. 53)

James spends some time articulating what he means by a belief in the “reality of the unseen”, drawing extensively on quotes from individuals with religious belief and experiences. Towards the end of this chapter, he elaborates on the particular attitudes that religious belief awakens, which is where the distinction between the two temperaments first emerges (bolding mine):

We have already agreed that they are solemn; and we have seen reason to think that the most distinctive of them is the sort of joy which may result in extreme cases from absolute self-surrender. The sense of the kind of object to which the surrender is made has much to do with determining the precise complexion of the joy; and the whole phenomenon is more complex than any simple formula allows. In the literature of the subject, sadness and gladness have each been emphasized in turn….Sometimes the joy has been primary; sometimes secondary, being the gladness of deliverance from the fear. This latter state of things, being the more complex, is also the more complete; and as we proceed, I think we shall find abundant reason for refusing to leave out either the sadness or the gladness, if we look at religion with the breadth of view which it demands. Stated in the completest possible terms, a man’s religion involves both moods of contraction and moods of expansion of his being. (pg. 75)

We notice, already, the outlines of the contrast, which is not so simple as the distinction between “optimists” and “pessimists”.

Rather, it is a question of whether the quality of religious experience is shaped by knowledge of evil’s absence from the world (the religion of healthy-mindedness) or, conversely, by the knowledge that evil, or at least pain and suffering, is in some sense inextricable from good, and may be kept at bay only by the grace of some power beyond one’s control (the religion of the sick soul). He points out that the “quantitative mixture” of these moods varies across individuals, time periods, and schools of thought; presumably, there is a healthy mind and sick soul in all of us. But much can also be learned by examining the extremes of this contrast1, which are in some sense incommensurate perceptions of the world and might ultimately demand different religious paradigms.

One common answer to the question of life’s purpose is happiness. As James points out, there are entire schools of philosophical ethics that deduce their sense of right and wrong from an analysis of what produces pleasure (and its opposite, displeasure). There’s a through-line from this axiom to the inference that the extent to which a religious belief facilitates happiness is, in some fundamental way, evidence of its truth (bolding mine):

…we must also acknowledge that the more complex ways of experiencing religion are new manners of producing happiness, wonderful inner paths to a supernatural kind of happiness, when the first gift of natural existence is unhappy, as it so often proves itself to be.
With such relations between religion and happiness, it is perhaps not surprising that men come to regard the happiness which a religious belief affords as a proof of its truth. If a creed makes a man feel happy, he almost inevitably adopts it. (pg. 78)

For some people this kind of happiness may seem almost congenital, as if they struggle to conceive of the possibility of feeling otherwise.2 As James points out, these “healthy-minded individuals” are roughly analogous to what Francis W. Newman (younger brother of Cardinal Newman) referred to as the “once-born”, which Newman defined as those who “see God, not as a strict Judge, not as a Glorious Potentate; but as the animating Spirit of a beautiful harmonious world, Beneficent and Kind, Merciful as well as Pure” (pg. 80-81).

Importantly, we observe this temperament (as well as the sick soul temperament) not only in the explicitly, institutionally religious but also in those we might call “spiritual”. Here, James uses Walt Whitman as an example of a healthy-minded individual, quoting from Richard Maurice Bucke’s book Cosmic Consciousness, which contains a lengthy sketch of Whitman’s character (bolding mine):

His favorite occupation seemed to be strolling or sauntering about outdoors by himself, looking at the grass, the trees, the flowers, the vistas of light, the varying aspects of the sky, and listening to the birds, the crickets, the tree frogs, and all the hundreds of natural sounds…He was very fond of flowers, either wild or cultivated; liked all sorts. I think he admired lilacs and sunflowers just as much as roses. Perhaps, indeed, no man who ever lived liked so many things and disliked so few as Walt Whitman. All natural objects seemed to have a charm for him…He never complained or grumbled either at the weather, pain, illness, or anything else. He never swore. He could not very well, since he never spoke in anger and apparently never was angry. He never exhibited fear, and I do not believe he ever felt it. (pg. 84)

Now, to be clear, this quote is from a follower of Whitman’s, in the context of a book expounding upon the notion of transcendental consciousness. It is possible, therefore, that Whitman’s healthy-minded temperament is exaggerated in this quote, either intentionally or unintentionally. At the same time, we can look to this version of Whitman as a kind of template for healthy-mindedness—again, following James’s wisdom about using the most extreme examples as an entry point into studying a phenomenon.

And in this template, we notice several qualities: a kind of relentless and expansive energy3; an inability or refusal to see evil4; and an expulsion of fear. These qualities recur in other reports peppered throughout the chapter. Some of these descriptions are quite beautiful; one of my favorites comes from the transcendentalist and Unitarian minister Theodore Parker (bolding mine):

I have swum in clear sweet waters all my days; and if sometimes they were a little cold, and the stream ran adverse and something rough, it was never too strong to be breasted and swum through. From the days of earliest boyhood, when I went stumbling through the grass, … up to the gray-bearded manhood of this time, there is none but has left me honey in the hive of memory that I now feed on for present delight. When I recall the years… I am filled with a sense of sweetness and wonder that such little things can make a mortal so exceedingly rich. But I must confess that the chiefest of all my delights is still the religious. (pg. 83)

The expulsion of fear or, indeed, any negative feelings is a common theme. Here, James connects the religion of healthy-mindedness to a trend of the time called “mind-cure” or “New Thought” (which I’ve written about before). These faiths have little patience for emotions like anxiety or worry (dubbed “fear-thought”), which are seen as basically useless; much, they argue, can be accomplished and overcome if one sets one’s mind to it.

And crucially, this claim does not appear to be all empty talk: it bears both moral and physical fruits. James includes a number of testimonials from individuals who suffered for years from various ailments, who then overcame those ailments in a matter of weeks or even days upon encountering the wisdom of healthy-mindedness:

…my earlier life bears a record of many, many years of bedridden invalidism, with spine and lower limbs paralyzed…but since my resurrection in the flesh, I have worked as a healer unceasingly for fourteen years without a vacation, and can truthfully assert that I have never known a moment of fatigue or pain… (pg. 102)
Life seemed difficult to me at one time. I was always breaking down, and had several attacks of what is called nervous prostration, with terrible insomnia, being on the verge of insanity; besides having many other troubles, especially of the digestive organs…I never recovered permanently till this New Thought took possession of me. (pg. 102)
I had been a sufferer from my childhood till my fortieth year…The healer said: ‘There is nothing but Mind; we are expressions of the One Mind; body is only a mortal belief; as a man thinketh so is he.’…I felt the next day like an escaped prisoner, and believed I had found the secret that would in time give me perfect health. Within ten days I was able to eat anything provided for others, and after two weeks I began to have my own positive mental suggestions of Truth, which were to me like stepping-stones. (pg. 105)

There are, I think, some fairly recognizable historical and modern-day analogues to the notion of a “mind-cure”, some of which are explicitly religious and some of which are not.

Some interpretations may seem rather fanciful or even dangerous: in its worst incarnation, this worldview might seem to discourage individuals from seeking help from “mainstream” medical institutions, or implicitly blame the sick for their misfortune; these limitations of the “once-born” mentality will be explored below.

But in a more tempered form, there’s clearly something true that the worldview is capturing, which I’ve written about before in the context of back pain. Our minds, after all, are part of our bodies, and therefore our mental states can interact with our physical states in strange and mysterious ways. And it is also true that many people have some modicum of control over how they orient towards the world, and that a healthy-minded orientation may simply be a more congenial way to live—not only for one’s self, but for the others in one’s life.

If the healthy-minded individual refuses to entertain the thought of evil, the sick soul refuses to look away. The sick soul believes that the evil aspects of life are part of its essence, and indeed that “the world’s meaning most comes home to us when we lay them most to heart” (pg. 131). In its most extreme form, this temperament perceives evil not only in certain external conditions but also as pervading one’s very interiority, i.e., a “wrongness or vice in his essential nature” (pg. 131).

As with healthy-mindedness, this burden appears to be congenital in many cases, even if it’s not constant within a given individual’s life:

There are men who seem to have started in life with a bottle or two of champagne inscribed to their credit; whilst others seem to have been born close to the pain-threshold, which the slightest irritants fatally send them over.
Does it not appear as if one who lived more habitually on one side of the pain-threshold might need a different sort of religion from one who habitually lived on the other? (pg. 135)

One might wonder—especially if inclined towards healthy-mindedness—why these individuals are burdened with such despair and melancholy preoccupations. James flips the question: how could one not be burdened as such in a world as fraught and dangerous and unstable as this one? Happiness, pleasure, joy: for the sick soul, these cannot be disentangled from suffering and loss; and as anyone who has experienced these melancholy feelings likely knows, they come with an almost noetic (to borrow another word from James) quality of absolute certainty, as if they more accurately ascertain the underlying truth of things. James writes:

Unsuspectedly from the bottom of every fountain of pleasure, as the old poet said, something bitter rises up: a touch of nausea, a falling dead of the delight, a whiff of melancholy, things that sound a knell, for fugitive as they may be, they bring a feeling of coming from a deeper region and often have an appalling convincingness. (pg. 136)

This recognition that there is no happiness without despair seems, at least on my reading, to be central to the worldview of the sick soul. Death comes for us all, so the joys of life are at best temporary; knowing this, what kind of satisfaction can we continue to find in them? Perhaps the best-known expression of this sentiment can be found in Ecclesiastes (bolding mine):

What profit hath a man of all his labour which he taketh under the Sun? I looked on all the works that my hands had wrought, and behold, all was vanity and vexation of spirit…Truly the light is sweet, and a pleasant thing it is for the eyes to behold the Sun: but if a man live many years and rejoice in them all, yet let him remember the days of darkness; for they shall be many.

To the healthy-minded, this mindset may seem inexplicable or even frustrating. I suspect some readers of this essay may feel this way: it is not difficult to find, particularly on the Internet, followers of the doctrine of healthy-mindedness encouraging individuals to simply get outside and breathe the fresh air—to appreciate the gift of life and find happiness in this gift. Despair, they might argue, is a kind of choice; why choose something that makes yourself (and those around you) miserable?

To this, the sick soul has a ready rejoinder: even the most healthy-minded among us are not immune; in fact, many who consider themselves “enlightened” in their healthy-minded state have not considered the fact that this state rests only on a temporary abatement of pain and suffering in their life. But this abatement cannot last, especially in the face of death:

A little cooling down of animal excitability and instinct, a little loss of animal toughness, a little irritable weakness and descent of the pain-threshold, will bring the worm at the core of all our usual springs of delight into full view, and turn us into melancholy metaphysicians…The old man, sick with an insidious internal disease, may laugh and quaff his wine at first as well as ever, but he knows his fate now, for the doctors have revealed it; and the knowledge knocks the satisfaction out of all these functions. (pg. 141)

Here, James points out that although the ancient Greeks are sometimes held up as models of healthy-mindedness, this ignores the fact that many of their philosophies, such as Epicureanism or Stoicism, eschew the pursuit of ecstatic pleasures or happiness; instead, they advocate for a kind of modest contentment (facilitated by the temporary freedom from pain) or simply lowering one’s expectations.

The Epicurean said: ‘Seek not to be happy, but rather to escape unhappiness; strong happiness is always linked with pain; therefore hug the safe shore, and do not tempt the deeper raptures. Avoid disappointment by expecting little, and by aiming low; and above all do not fret.’ The Stoic said: ‘The only genuine good that life can yield a man is the free possession of his own soul; all other goods are lies.’ (pg. 143)

It is hard, in my view (and I think James agrees), not to feel that these worldviews have a kind of maturity (or, in James’s words, a “completeness”) that, at least as earlier characterized, the worldview of healthy-mindedness does not; perhaps this is what is meant by “twice-born”. At its best, the sick soul does not wallow in despair, but it equally does not pretend that evil can be forgotten or set aside, even for a moment.

As the paragon example of the sick soul, James draws from Leo Tolstoy’s Confession. Tolstoy wrote this autobiographical work in his early fifties about his struggle during a period of intense existential despair, what now might be called, somewhat demeaningly, a “mid-life crisis”. It is worth noting that this period fell upon him during a time of considerable personal and professional success. He had already published, to great acclaim, both War and Peace and Anna Karenina: two books5 that are considered by many (correctly, in my view) to be among the best of all time. As Tolstoy himself points out: he had a large family with a loving wife; he lived on a large and prosperous estate; he enjoyed considerable fame and admiration; and he was physically and intellectually quite healthy. Sounding almost boastful, he writes:

I had the kind of strength in mind and body that I rarely encountered in my age-group. Physically, I could go out mowing with the peasants and keep up with them. Intellectually, I could work for eight or ten hours at a stretch and not feel any ill-effects from the strain of it. (pg. 18)

And yet:

…in this situation I had come to a point where I couldn’t go on living, but I was afraid of death and so I used every ruse against myself in order to avoid taking my life. (pg. 18)6

Tolstoy describes, in some detail, the measures he took to avoid the temptation to commit suicide: hiding the rope; no longer going hunting. He did not know, in short, why he ought to continue living—and yet he was desperately afraid to die.

This existential crisis appears to have been driven by the same question that haunts many people: what’s the point of all this? What is it all for? He could give, he writes, “no reasonable meaning to any actions of [his] life”.

He explains his state of despair by referencing an “Eastern fable”, in which a traveller fleeing a wild beast jumps into a dried-up well. Yet at the bottom of the well, the traveller sees a dragon, waiting with its jaws wide open. Caught between the beast above and the dragon below, the traveller clutches onto the branch of a bush growing in the well. He writes (bolding mine):

At any moment the bush is going to snap and break off, and he will fall into the dragon’s jaws. The traveller sees this and knows his destruction is inevitable, but while he is dangling there he searches around, and on the leaves of the bush he comes across a few drops of honey, and he goes after them with his tongue and licks them up. I was hanging on to a branch of life, knowing that the dragon of death was waiting inevitably, ready to tear me limb from limb, and I had no idea why I had come in for torment like this. And I tried to suck up the honey that had once consoled me, only to find that this honey gave me no pleasure, and meanwhile the white and black mice—day and night—were gnawing away at the branch I was clinging to. I could see the dragon clearly, and the honey had lost its sweetness for me. (pg. 19).7

There are a few things to notice here. First, the sensation of being trapped between two terrors—presumably the despair of living and the fear of death. Second, the inevitability of the latter, as conjured by the evocative image of those black and white mice eating around the edges of the branch. And third, pleasure is construed—as we saw earlier, in the section on healthy-mindedness—as honey. Yet here, the honey brings Tolstoy no pleasure. Compare this to Theodore Parker’s characterization of the memories of life’s pleasures:

…there is none but has left me honey in the hive of memory that I now feed on for present delight… (pg. 83)

I only noticed this contrast while putting together examples for this essay, and I think it’s instructive. Both men view pleasure as honey (a kind of distilled sweetness), so the difference does not lie in what pleasure is for each of them. But they relate to it very differently: for Parker, the honey is a comfort in his old age, reminding him of the sweetness of life; for Tolstoy, the sweetness cannot be enjoyed because he is all too aware that it will come to an end for him, for his wife, for everyone he loves and, indeed, everyone who will ever live.

Tolstoy considers various solutions to this problem, and dismisses each in turn: the first, ignorance, is no longer on the table for him; the second, Epicureanism (enjoying the honey while it lasts), does not work for him, knowing as he does that others, and eventually he himself, will suffer; the third, suicide, he is too afraid to pursue. By default, then, he falls into the fourth “solution”: hanging onto the branch despite the hopeless absurdity of doing so.

It is worth noting that, in the end, Tolstoy does find a sort of solace: Confession is not merely a memoir of despair. His comfort comes in the form of faith in God, though not, crucially, a faith driven by any kind of rational argumentation or formal theology; it really is—fittingly with James’s own view on religious experiences—faith in the deepest, most mystical and intuitive sense of the word.

It is also worth noting that, after this period, scholars point to a change in Tolstoy’s writing: he became much more preoccupied with religion, and in some cases his writing became overtly moralistic; in others, as in The Death of Ivan Ilyich, we see what might almost be characterized as a repudiation of a certain kind of healthy-minded individual—no one, Tolstoy seems to be saying, can be saved from the inevitability of suffering and fear; if there is redemption to be had, it is through that process of despair.8

There is, as James points out, a natural antagonism that arises between these two temperaments: the healthy-minded find the sick souls frustrating or perhaps even incomprehensible in their insistence on locating despair in all things; whereas the sick souls find the healthy-minded to be naive at best, and at worst, filled with a kind of unearned confidence.

While James does suggest that there’s a kind of completeness and maturity in the sick soul (that you don’t observe in the healthy-minded), this isn’t, I think, the sort of contrast where one side clearly “comes out ahead”: different individuals likely need different religions. Indeed, the same individual may find themselves drawn towards different worldviews at different points in their own life, depending on their circumstances.

Some may find it tempting, when presented with a dichotomy, to “make the case” for one side or the other; or, alternatively, to stake out one’s own place in that contrast. But I think the more interesting—and the more useful—perspective comes from understanding that both temperaments likely live in all of us (this is the “divided self”), and that they each have their place.

This is not to say that one should strive to fall somewhere between the temperaments, i.e., a kind of “average”. In the first place, I’m not sure it makes sense to use language like “should” when it comes to these temperaments at all; but second, I think it is more helpful to view the divided self as engaging in something like an internal dialectic, oscillating between the two poles. A good dichotomy, as I argued in the beginning of this essay, should clarify our understanding of a space and also retain some degree of conceptual flexibility; in this case, I think James succeeds in characterizing a contrast that helps us understand a constant dialogue within ourselves.

James provides an excellent articulation of studying the most extreme cases of a phenomenon in the service of better understanding the phenomenon itself:

It is a good rule in physiology, when we are studying the meaning of an organ, to ask after its most peculiar and characteristic sort of performance, and to seek its office in that one of its functions which no other organ can possibly exert. Surely the same maxim holds good in our present quest. The essence of religious experiences, the thing by which we finally must judge them, must be that element or quality in them which we can meet nowhere else. And such a quality will be of course most prominent and easy to notice in those religious experiences which are most one-sided, exaggerated, and intense. (pg. 45)

He writes:

I mean those who, when unhappiness is offered or proposed to them, positively refuse to feel it, as if it were something mean and wrong. We find such persons in every age, passionately flinging themselves upon their sense of the goodness of life, in spite of the hardships of their own condition, and in spite of the sinister theologies into which they may be born. (pg. 79)

Which is fitting, from the poet who wrote “I contain multitudes”.

Or, in some cases, to reclaim evil as part of some beneficent transcendental spirit that is itself good, thus negating the evil.

Tolstoy was emphatic that War and Peace was not, technically, a “novel”.

Page number from Tolstoy’s Confession, not James’s quotation.

Same as above.

In this, I am reminded too of Flannery O’Connor’s writing, in which grace is found only (if at all) when the confident bluster of her protagonists is obliterated through personal tragedy.

Calibrating expectations

Sean Trott — Sat, 17 Jan 2026 01:08:30 GMT

Early on in graduate school, I was interested in trust: specifically, the (normative) question of how much to trust new technologies enabled by machine learning (ML), and the (descriptive) question of which factors led individuals to trust or distrust these systems. One way to think about this is as a problem of calibration. From this perspective, users of a tool should be well-calibrated to what it can and can’t do, so that they make effective use of the tool but don’t deploy it in inappropriate settings—i.e., they should trust it “just the right amount”.

Notably, this was years before the advent of ChatGPT. At the time, research on trust focused on emerging tools like self-driving cars and “AI-assisted” decision-making technologies (e.g., for domains like healthcare). As a cognitive scientist, I was especially drawn to the question of whether and how designing these tools to appear more “human”—such as the use of a language interface—might influence human judgments about their capabilities or even the degree to which humans trusted them (whether that trust was appropriate or not).

My graduate research went in a different direction for a variety of reasons, one of which was that the language interfaces of the time were simply not particularly capable. But I’ve rekindled my interest in the topic in the last year or so, as large language models (LLMs)—or more accurately, LLM-equipped software tools—are deployed in more and more settings, despite a lack of consensus on how to properly evaluate these systems. Individuals thus struggle to calibrate their expectations: as Kelsey Piper wrote in a recent article for The Argument, the same system (e.g., Claude Code) might produce clever, functional code in a matter of minutes and generate infuriating, inexplicable errors.

I don’t think this problem is going away anytime soon, and I’m certainly not going to solve it in a single essay. But it is a longstanding problem in human-machine interaction, and as always, I think contextualizing the problem in that history can help us understand both the parallels and particularities of the current moment with respect to the past.

Language interfaces and the habitability problem

I’ll start with some more personal history: before I started graduate school, I worked for several years at a research institute in Berkeley called the International Computer Science Institute (ICSI). I worked in Jerry Feldman’s research group on a project designing a natural language interface (NLI) for a simulated robotics application. These days, NLIs are commonplace—some have even argued text is the “universal” interface—but at the time, they were quite limited in scope and deployment.

A figure from our 2015 paper describing the architecture for a simple “agent-based” language interface system. Linguistic input was processed by a tool (the “ECG Analyzer”) that made use of a hand-built grammar and ultimately produced a representation (an “n-tuple”) usable by another component of the system (the “problem solver”), which in turn executed API calls in a downstream application—in this case, a simulated robotics application.

The major bottleneck is, of course, that building a system that understands language is very hard! By “understand” language I don’t merely mean processing it (i.e., converting strings of text into some other representational format), but somehow producing an action in response to linguistic input—such as planning trips, controlling an operating system, or interacting with the world (or a simulation of the world) in some way, as in Terry Winograd’s famous SHRDLU.

The working assumption of much research (including our own) was that natural language understanding (or “NLU”) was an inherently domain-constrained problem: that is, while it might be possible (though difficult) to build end-to-end systems that mapped a limited set of linguistic commands to actions for specific applications, building a “general-purpose” language interface was hopelessly out-of-scope.

Part of the challenge with a “general-purpose” system is, obviously, that “general-purpose” includes quite a lot. For what it’s worth, that challenge is very much alive with current LLM-equipped software tools (as I discuss later on in this post).

But another challenge lay with what you might think of as the “front end” of these tools: the component responsible for processing linguistic input and converting it into a usable format for some downstream application. Human language is incredibly flexible—there are all sorts of ways one can say the same thing, or almost the same thing—and it’s very difficult to design, by hand, a system that can competently process all of these linguistic formulations.

This is where the habitability problem comes in. Suppose you, a computational linguist, write a series of rules that can cover a variety of sentences; let’s call the set of these rules a “grammar”. This could mean different things to different people, but in short, I mean something like: convert a string like “Put the red block on the blue block” into some kind of syntactic parse. You’ve also written a component that converts this parse into some kind of usable action specification, e.g., move(moved_object(type = block, color = red), target_location(object(type = block, color = blue), relation = on)).1 Finally, you’ve written code that uses this action specification to actually make the relevant changes in the downstream application—in this case, perhaps making API calls to a simulated robotics application.

Simplified schema of how the input/output for such a system might work.

Now, such a system could fail in at least two obvious ways. First, our grammar might fail to identify a suitable syntactic parse, i.e., there is no grammatical rule that produces a complete analysis of the sentence. And second, even if we produce a syntactic analysis, our system for producing an action specification might not “know” what to do with that analysis—or which API calls to make. In both cases, the user of the system has produced a sentence that the system does not “understand” in the sense that no suitable output is produced.
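To make the two failure points concrete, here is a minimal Python sketch of such a pipeline. It is not the system we built at ICSI: the single regular-expression "grammar", the names `PUT_RULE`, `KNOWN_TYPES`, `ParseFailure`, and `ActionFailure`, and the dictionary-shaped action specification are all hypothetical, chosen only to illustrate where each kind of failure would surface.

```python
import re

# Hypothetical one-rule "grammar": it only covers commands shaped like
# "Put the red block on the blue block".
PUT_RULE = re.compile(
    r"^put the (?P<color1>\w+) (?P<type1>\w+) on the (?P<color2>\w+) (?P<type2>\w+)$",
    re.IGNORECASE,
)

# Object types the (hypothetical) downstream application knows about.
KNOWN_TYPES = {"block"}


class ParseFailure(Exception):
    """No grammar rule yields a complete analysis of the sentence."""


class ActionFailure(Exception):
    """A parse exists, but it cannot be turned into an action."""


def parse(sentence):
    """Map a sentence to a flat parse, or raise ParseFailure."""
    match = PUT_RULE.match(sentence.strip().rstrip("."))
    if match is None:
        raise ParseFailure(sentence)
    return {key: value.lower() for key, value in match.groupdict().items()}


def to_action(parsed):
    """Map a parse to an action specification, or raise ActionFailure."""
    for key in ("type1", "type2"):
        if parsed[key] not in KNOWN_TYPES:
            raise ActionFailure(f"unknown object type: {parsed[key]}")
    return {
        "action": "move",
        "moved_object": {"type": parsed["type1"], "color": parsed["color1"]},
        "target_location": {
            "object": {"type": parsed["type2"], "color": parsed["color2"]},
            "relation": "on",
        },
    }


def understand(sentence):
    """Full pipeline: sentence -> parse -> action specification."""
    return to_action(parse(sentence))
```

Under these assumptions, `understand("Put the red block on the blue block")` returns a move action, "Stack the red block atop the blue one" raises `ParseFailure` (no rule covers it), and "Put the red pyramid on the blue block" parses but raises `ActionFailure` (the system has no idea what to do with a pyramid).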

The failures above correspond to what we might think of as overestimation: the user has produced a command under the impression that the system will be able to execute it, but the system is unable to do so. Yet another, more subtle failure mode corresponds to underestimation: that is, a user fails to fully exploit the range of linguistic formulations and actions available to them. To make this concrete: the system might be able to execute commands like “Put the red block on the blue block”, but for whatever reason, the user doesn’t think of this as the type of thing the system can do—so they never ask.

These two failure modes are the twin sides of the habitability problem, as originally defined by Watt (1968):

A “habitable” language is one in which its users can express themselves without straying over the language’s boundaries into unallowed sentences.

It might help to depict this problem visually. We can think of two “sets” of sentences: first, sentences the system understands (i.e., produces the correct action from linguistic input); and second, sentences the user produces (i.e., commands or “intended actions”). In the ideal case, these two sets should have complete overlap—everything the user produces should be comprehensible by the system, and everything the system can do is (at least in principle) accessible to the user. In other words, we want the intersection between these sets to be as large as possible.

Visual illustration of the habitability problem. Ideally, there should be complete overlap between the sentences the user produces and the sentences the system understands.

Framing it this way allows us to flesh out what we mean by underestimation and overestimation. In this framework, underestimation occurs when the set of sentences a system understands is larger than the set of sentences the user produces; overestimation is when the set of sentences a user produces is larger than the set of sentences a system understands. Finally, appropriate calibration is when these sets are, effectively, identical.

Different failure modes (underestimation and overestimation) and appropriate calibration can all be depicted as the intersection of two sets: the set of sentences the system understands and the set of sentences the user produces.
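The set framing translates almost directly into code. The sketch below is only an illustration: the function name `calibration` and the example sentences are hypothetical, and real systems obviously don't come with an enumerable list of sentences they understand.

```python
def calibration(system_understands, user_produces):
    """Compare two sets of sentences and report where they diverge."""
    # Sentences the user tries that the system cannot handle.
    overestimated = user_produces - system_understands
    # Sentences the system could handle that the user never tries.
    underestimated = system_understands - user_produces
    return {
        "shared": system_understands & user_produces,
        "overestimation": overestimated,
        "underestimation": underestimated,
        "calibrated": not overestimated and not underestimated,
    }


system = {
    "put the red block on the blue block",
    "move the green block to the left",
}
user = {
    "put the red block on the blue block",
    "build me a tower",
}

report = calibration(system, user)
print(report["overestimation"])   # {'build me a tower'}
print(report["underestimation"])  # {'move the green block to the left'}
print(report["calibrated"])       # False
```

Note that both failure modes can occur at once, as in this example: the user asks for something the system can't do and also never discovers something it can.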

For engineers building NLIs, then, the challenge becomes how to effectively “steer” users into using language that the system understands and away from using language that the system won’t understand. This is not an easy problem to solve, and approaches might include strategies as diverse as “wizard-of-oz” studies (in which researchers gather data about the kinds of language a person might use with such a system), instructional demos (in which the range of abilities and limitations of a system can be illustrated), and more indirect approaches (e.g., having the system use certain linguistic formulations to “prime” those same formulations in users).

The habitability problem was originally defined in the context of mostly symbolic, rule-based systems for processing natural language—an approach now sometimes called “good old-fashioned AI” (or “GOFAI”). The problem was also identified at a time when users interacted with computers through text-based interfaces; as others have pointed out, interest in the habitability problem waned as graphical user-interfaces (or “GUIs”) became more popular in the following decades, even though some researchers continued to develop language interfaces.2

With tools like ChatGPT, however, language interfaces are very much back—and it’s clear that the habitability problem, or at least a version of it, never really went away.

From GOFAI to LLMs

LLMs are an example of the so-called “connectionist” approach to AI, which is usually placed in contrast to symbolic, rule-based approaches. Specifically, LLMs are large neural networks trained to predict tokens (e.g., words) in the context of other tokens (e.g., a sentence). These systems can be fine-tuned and modified in all sorts of ways, such as using human feedback or training them to solve questions with objective answers (“verifiable rewards”); they can also be trained on other modalities, such as images or audio. Finally, such a model can be used as a component of a larger system, which I call “LLM-equipped software tools”.

But at their core, each underlying system is essentially the same:

Some input is presented to the model.
The model produces a probability distribution over subsequent tokens.
Some other process is used to sample from this probability distribution.
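The three steps above can be sketched in a few lines of Python. To be clear, this is a toy: `toy_logits` is a hypothetical lookup table standing in for a trained neural network, the four-word `VOCAB` is invented, and the sampling step shown here (drawing directly from the softmax distribution) is just one of many possible sampling strategies.

```python
import math
import random

VOCAB = ["the", "red", "block", "<eos>"]


def toy_logits(context):
    """Stand-in for a trained network: one score per vocabulary item."""
    last = context[-1] if context else None
    table = {
        None: [3.0, 0.0, 0.0, -2.0],
        "the": [-2.0, 2.0, 2.0, -2.0],
        "red": [-2.0, -2.0, 3.0, -2.0],
        "block": [-2.0, -2.0, -2.0, 3.0],
    }
    return table.get(last, [0.0, 0.0, 0.0, 0.0])


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, rng, max_tokens=10):
    tokens = list(prompt)                     # step 1: input is presented
    for _ in range(max_tokens):
        probs = softmax(toy_logits(tokens))   # step 2: distribution over next tokens
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # step 3: sample
        if next_token == "<eos>":
            break
        tokens.append(next_token)             # the output becomes part of the next input
    return tokens


print(generate(["the"], random.Random(0)))
```

The loop in `generate` is the part that feeds each sampled token back in as context; everything that makes a real LLM interesting lives inside the function this sketch replaces with a lookup table.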

What’s surprising—and in many ways remarkable—about modern LLMs is that this relatively simple process, applied recursively to its own output, often produces coherent, contextually appropriate outputs. It’s worth noting that this is itself a nontrivial finding: for many years, neural networks trained on the statistics of language struggled to produce grammatical sentences—let alone semantically sensible ones. But training larger models on larger amounts of data seems to go a long way towards improving the grammaticality and sensibility of their outputs. Moreover, the various “post-training” techniques that have been applied to these systems have resulted in tools that solve difficult reasoning tasks and generate computer code.

I mention all this because I think it’s relevant to understanding the excitement over LLMs. As I noted in the previous section, there was always an understandable skepticism that an AI system composed of hand-written rules could ever be “general-purpose”. The set of possible English sentences—let alone desired actions—seemed too vast and too under-determined to build a system that could contend with all of them. By contrast, LLMs appear to be able to contend with a variety of linguistic formulations; they can also be trained to produce API calls to other system components (like a calculator). There’s a temptation, then, to hope that LLMs could, one day, serve as something like a “general-purpose” language interface—or, to use a more popular (though controversial) term, that they might one day represent something like “Artificial General Intelligence” (or “AGI”).

Now, I do think LLMs are quite interesting; that’s why I research them and write about them so often! But there’s a gap between: observing, first, that LLMs can learn, in some bottom-up way, statistical patterns that allow them to produce coherent language and computer code; and inferring, second, that they are (or will be) “general-purpose” language interfaces, i.e., that the habitability problem has somehow been circumvented. I think we should be very careful not to elide or minimize this gap.

It’s also notable, in my view, that even the strongest proponents of LLMs as general-purpose systems will often readily acknowledge that there are a number of “alignment problems” to solve. The term “alignment” signifies many different (related) concepts in the AI discourse, but one useful, straightforward definition refers simply to ensuring that a user’s intentions are in some sense aligned with the response of the system they’re using. This definition helpfully covers a variety of failure modes, from cases of obvious incompetence to more subtle misinterpretations. It also makes it clear that the problem of “alignment” shares many features with the habitability problem, or more generally with the problem of calibration.

Calibration is also about us

One thing I appreciate about the calibration perspective is that it makes it clear this is a two-way street. Building and validating a more capable system that correctly interprets a user’s requests is, of course, half the battle. But the other half concerns the human side. What kinds of mental models do people have about the abilities and limitations of LLMs? When do people underestimate or overestimate their abilities? And perhaps most crucially, how can we encourage the proper degree of calibration?

As I mentioned at the start of this post, I don’t know the answers to these questions—they are, to my mind, questions that require both empirical research and discussion of what our actual societal values are and should be. Part of the answer surely lies in educating individuals about how LLMs and related systems actually work. But I will close with several observations that I believe to be true about LLMs and calibration.

First, I suspect that one source of calibration difficulty arises from the fact that these systems are language interfaces in particular. Humans tend to associate the systematic, appropriate use of linguistic symbols with other cognitive abilities, or even emotions and an interior life. For some domains, those correlations might hold for LLMs as well, if only because the statistics of language contain a remarkable degree of information about the world and the way it works. At the same time, there will be many domains in which LLMs produce surprising, inexplicable failures—and our surprise at these failures should be viewed as a warning, in my view, that we don’t understand why or how these systems produce the behaviors they do.

That brings me to my second observation. With many tools, we know what they should and should not be used for; we have a general understanding of their affordances. We also know—or trust that someone, somewhere knows—why a given tool works the way it does. In many cases we even extrapolate this assumption to the mechanics of the world around us. This fundamental assumption underlies much of what sociologist Max Weber called “disenchantment”:

It is the knowledge or the conviction that if only we wished to understand them we could do so at any time. It means that in principle, then, we are not ruled by mysterious, unpredictable forces, but that, on the contrary, we can in principle control everything by means of calculation. That in turn means the disenchantment of the world. (pg. 12-13)

As I’ve argued before, this assumption of legibility starts to break down with non-deterministic, opaque systems like LLMs. While we certainly understand many low-level properties of LLMs (e.g., how they’re trained), we lack an understanding of which internal mechanisms produce the behaviors we observe—which is precisely why research on interpretability is so important.

More generally, this leads to a strange predicament in which a tool is available for use, but it’s not necessarily clear what we should use it for. It might be useful for a range of downstream applications, but it also might not be. This is, to me, one of the most aggravating facts about how many LLM products are marketed: customers pay some amount of money for access to a system with under-specified capacities; if they deploy that system in a context that’s inappropriate and it leads to failures, the company selling the product has some degree of plausible deniability—after all, the product specification never stated the model could be used in this exact circumstance; and perhaps it simply wasn’t prompted in the right way (i.e., “user error”).

One might imagine addressing this problem of under-specified capacities by developing and running benchmarks that evaluate those capacities. In theory, then, you could imagine paying more for a model (and deploying it more broadly) that performs better on more benchmarks than a model that performs poorly. But here, we run, as always, into the problem of construct validity: are these benchmarks actually measuring the capability we think they’re measuring? Is measuring something like “programming ability” (let alone “reasoning”) even a coherent goal? How exactly should we use the results of such a benchmark to guide actual decisions about safely, reliably deploying a model?

This is also why I often feel a deep sense of ambivalence when arguments about the capabilities of LLMs break out. On the one hand, I agree with articles like this one pointing out that it’s not necessarily appropriate to conflate the training objective of a model (e.g., predicting the next word) with the capabilities that model develops. At the same time, I continue to think there’s considerable uncertainty about what those capabilities actually are—and I agree with Melanie Mitchell’s arguments here that we ought to exercise restraint in ascribing human-like cognitive capacities on the basis of an LLM’s behavior on tests designed for human participants. As I’ve written before, the same behavior can be produced by multiple mechanisms, and the same test may not necessarily “mean the same thing” for humans and LLMs.

Once again, I don’t have the answers. But my underlying ethos on this matter is one of epistemological caution. There is much we don’t know—perhaps much we can’t ever know—and I think it’s worth calibrating to that uncertainty.

With the caveat, of course, that this example is probably not well-designed in all sorts of ways.

It’s also worth noting that a version of the habitability problem very much applies to GUIs as well!

Evaluating empirical claims: practical advice

Sean Trott — Tue, 30 Dec 2025 20:05:30 GMT

Claims about the world and its inhabitants abound, both in the scholarly literature and in public discourse. One of the most important things a person engaging with these claims can do is learn how to evaluate them, sometimes (though not always) with a critical eye.

I want to be clear about what I mean by this. I’m not endorsing here a kind of epistemic nihilism: I think most people are already aware, on some level, that at least some of the claims they encounter are misleading at best and false at worst—and unfortunately this leads, all too often, to a kind of reflexive disbelief that is often just as bad as its inverse. I’m also aware that most people simply do not have the time to investigate the epistemic foundations of every claim they encounter, which is precisely why networks of trust are so crucial to constructing knowledge.

What I am suggesting, rather, is as follows: first, if you come across a claim, it is useful to wonder what theoretical assumptions or empirical evidence that claim relies on; second, this process of “wondering” can be made much easier by using some kind of conceptual framework, which offers a defined lexicon for doing this kind of cognitive work. There are many such frameworks one could choose from (e.g., many perspectives from critical theory), but I’ll focus here on an approach commonly taught and used in Cognitive Science for evaluating empirical claims in particular.1 I originally taught this framework in a Research Methods course (using materials from Beth Morling’s excellent textbook on Research Methods in Psychology), but it’s gone on to inspire much of my current focus on the epistemological foundations of research on large language models (LLMs). Many people already implicitly adopt aspects of this framework, but individual critiques (or responses to critiques) can be made more precise by understanding it in more detail.

Fortunately, the framework itself is relatively simple and primarily consists of answering two questions:

First, what kind of claim is being made?
Second, is that claim valid?

Crucially, the validity of claims can be evaluated in multiple ways, and different kinds of validity are more or less important for different kinds of claims. That’s why it’s important to start by identifying the kind of claim in the first place.

What kind of claim is being made?

Oversimplifying a bit, there are roughly three types of empirical claims one could make: frequency claims, association claims, and causal claims.

A frequency claim is a claim about the rate or degree of something. For example, the assertion “4 in 10 people text while driving” is a claim about the rate (40%) of people putatively texting while driving. Another example would be “Human adults encounter an average of 100M words in their lifetime”. (Note that I’m not arguing either of these claims are true or false!) In both cases, the claim is about a single variable (e.g., texting while driving or number of words encountered), and expresses some descriptive statistic about that variable (e.g., the rate or average2).

An association claim is (unsurprisingly) a claim about the direction and strength of association between two or more variables. With continuous variables, the direction of association is most intuitively illustrated using something like a scatterplot. For instance, two variables might be positively associated (as X increases, so does Y), negatively associated (as X increases, Y decreases), or not associated (Y is unrelated to changes in X). The strength of association refers to something like the extent to which values of Y can be reliably estimated from values of X. Both these dimensions of associational claims are depicted (along with an associated Pearson’s correlation) in the figure below, ranging from a perfectly linear positive correlation to a perfectly linear negative correlation.

Examples of correlation between two variables. Copied from the Wikipedia page on correlation.

Of course, associations can be non-linear as well. That same image from Wikipedia includes visual examples of relationships that seem clearly structured but correspond to a correlation of 0: a good example (along with Anscombe’s Quartet) of the perils of relying on statistics without plotting your data.

Examples of real associations that would be measured as having a Pearson’s correlation of 0. Always plot your data!
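Both points, the linear cases and the structured relationship with a correlation of 0, can be checked with a small hand-rolled version of Pearson’s correlation. The data below are invented for illustration, and `pearson_r` is simply the textbook formula written out rather than a call to any particular statistics library.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])  # perfectly linear, positive: 1.0
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly linear, negative: -1.0
r_moderate = pearson_r(xs, [2, 1, 4, 3, 5])   # noisier positive relation: 0.8

# A perfectly structured but non-linear relationship: y = x squared,
# sampled symmetrically around zero.
sym_xs = [-2, -1, 0, 1, 2]
r_parabola = pearson_r(sym_xs, [x ** 2 for x in sym_xs])  # 0.0

print(r_positive, r_negative, r_moderate, r_parabola)
```

In the last case, Y is completely determined by X, yet the correlation is 0 because the positive and negative halves of the parabola cancel out. A scatterplot would reveal the structure immediately; the statistic alone would not.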

So far, my definition of associational claims has been rather abstract. But you likely encounter associational claims all the time: anytime two or more variables are described as being related to each other (though not necessarily in a causal relation), that’s an associational claim. For example, Our World in Data contains a scatterplot showing the relationship between GDP per capita and self-reported life satisfaction, depicted below.

Scatterplot from Our World in Data showing the relationship between GDP per capita and self-reported life satisfaction.

Judging by eye, this relationship seems positive and of moderate strength. Someone might assert something like: “Countries with higher GDP per capita also have a higher average self-reported life satisfaction.” We could also put this in terms of prediction: “A person from a country with a higher GDP per capita is likely to have a higher average self-reported life satisfaction.”

Note that these statements make no claims about whether this relationship is causal (nor in which direction). It could very well be that residing in a country with higher GDP per capita makes people more satisfied with their lives—indeed, this empirical evidence could be consistent with such a claim. But the evidence is also consistent with the inverse claim (i.e., that higher life satisfaction leads to higher GDP per capita), and it’s also entirely possible that the relationship is spurious or dependent on some hidden confound. We’re also, for now, not assessing the validity of the claim: we’re just describing what kind of claim it is.

A causal claim, in contrast, does claim that one or more variables is somehow causally responsible for changing some other variable(s). Such claims are often (though not always) accompanied by certain verbs expressing a causal relationship, such as “affects” or “changes”.3 For example, the claim “Taking music lessons improves pitch perception” is a causal claim about the efficacy of music lessons. Interestingly, it does contain within it a kind of association claim: presumably, people who took music lessons scored more highly on some measure of pitch perception than people who didn’t. But the expression of a causal relationship between these variables assumes that something about our design or collection of the data justifies the stronger, causal claim: for example, perhaps participants were randomly assigned to take music lessons vs. some control condition.

Association and causation are often conflated—so often that many readers will likely have encountered the mantra “Correlation does not imply causation”. There’s a rich philosophical literature on the nature of causality and when, epistemologically, we are justified in drawing causal inferences; I won’t go into too much detail here, but I do discuss specific inferential issues (like internal validity) below. Suffice to say: first, the issue is complicated; second, random assignment is generally viewed as the “gold standard” for establishing causality; but third, there are various other data collection and statistical modeling techniques for trying to establish causation when random assignment is impossible or unethical. These techniques (like regression discontinuity design, the use of instrumental variables, etc.) are common in fields like Econometrics or Epidemiology, where researchers tend to rely on observational data rather than controlled experiments.
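A small simulation can illustrate why random assignment matters. Everything in this sketch is made up: I assume a hidden confound (growing up in a musical household, `musical_home`) that raises pitch-perception scores and also makes people more likely to take lessons, while the lessons themselves have no effect at all. The probabilities and effect size are arbitrary choices, not estimates from any real study.

```python
import random


def mean(values):
    return sum(values) / len(values)


def score_gap(n, randomize, rng):
    """Mean pitch score of lesson-takers minus mean score of everyone else."""
    lesson_scores, control_scores = [], []
    for _ in range(n):
        musical_home = rng.random() < 0.5  # hidden confound
        if randomize:
            # Experiment: a coin flip decides who takes lessons.
            takes_lessons = rng.random() < 0.5
        else:
            # Observation: people from musical homes opt in more often.
            takes_lessons = rng.random() < (0.8 if musical_home else 0.2)
        # By construction, lessons do nothing; only the confound shifts scores.
        score = (1.0 if musical_home else 0.0) + rng.gauss(0, 1)
        (lesson_scores if takes_lessons else control_scores).append(score)
    return mean(lesson_scores) - mean(control_scores)


rng = random.Random(0)
observational_gap = score_gap(20000, randomize=False, rng=rng)
randomized_gap = score_gap(20000, randomize=True, rng=rng)
print(round(observational_gap, 2), round(randomized_gap, 2))
```

Under these assumptions, the observational data show a sizeable gap (around 0.6 in expectation) even though lessons have no causal effect, because lesson-takers disproportionately come from musical homes. With random assignment, the confound is balanced across groups and the gap shrinks to roughly zero.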

Having established the kind of claim being made, we can turn to identifying whether that claim is valid.

The four validities

There are (at least) four ways in which a claim’s validity can be assessed: construct validity, external validity, statistical validity, and internal validity.

Construct validity

Construct validity refers to how well a variable is operationalized. As I’ve written before, many variables of interest in Cognitive Science (and throughout much of science!) are relatively abstract, and cannot be measured “by eye”. Instead, they require some kind of instrument to measure them, hopefully with some degree of reliability and precision. Even something that seems as concrete as temperature requires specific instruments for measuring the variable; and as Hasok Chang describes in Inventing Temperature, the historical path towards building and refining these instruments often consists of considerable trial-and-error. Variables like “life satisfaction” or “Theory of Mind” are even more fraught: researchers may not even agree on the correct definition of the variables in the first place, much less how to measure the variable—indeed, some might even argue that the construct cannot be measured in any useful or reliable way.

Most readers have probably encountered critiques of construct validity before. It’s an especially common argument in the discourse on the “capabilities” of Large Language Models (LLMs): namely, that the performance of an LLM on some “benchmark” designed to assess some ability does not reflect the actual capacity in question, but rather some “shortcut” or “cheap trick” that produces indistinguishable behavior. Adjudicating between these possibilities requires identifying potential “shortcuts” a system (or human!) might be using and perhaps designing an evaluation that removes the possibility of using such shortcuts: in short, doing the hard theoretical and empirical work of validating an instrument.

It’s also common when discussing variables such as “happiness” or “life satisfaction”, which some view as the kind of variable that inherently resists quantification. Take, for example, the scatterplot we saw earlier showing a positive relationship between GDP per capita and self-reported life satisfaction. The associational claim that “Wealthier countries contain happier people” could be critiqued on grounds of construct validity: namely, someone might argue that self-reported life satisfaction is not a good measure of “actual” happiness. In response, someone who does think that it’s a good measure would need to defend the construct validity of the claim.

Establishing the validity of an instrument depends on several factors. The instrument should ideally be reliable (i.e., it produces consistent results), face valid (i.e., it seems theoretically and intuitively plausible as a measure of the construct), and exhibit both convergent validity (it correlates with other measures of the same construct) and predictive validity (it predicts outcomes that the construct should, in theory, be related to). Readers interested in this topic might enjoy my previous post describing the process of establishing construct validity in more detail.

External validity

External validity refers to how well a given claim generalizes to the population of interest.

This is another topic I’ve written about extensively, both in the context of studying human cognition and studying systems like LLMs. In each case, researchers are typically interested in drawing conclusions about some “population of interest” (e.g., human cognition). But researchers can’t, of course, study every human that’s ever existed, so they make do with a sample. If the sample is random and representative of the underlying population, then facts about the sample can sometimes be generalized to the population. However, if the sample is not representative (as is the case with so-called “WEIRD” psychology), then drawing these generalizations is unjustified.

External validity is also at the heart of recent debates about the validity and utility of political polling. The point of a poll is to produce an estimate of the current degree of support for particular candidates or ballot measures. Unlike a census, a poll is generally conducted on a sample rather than the entire population. Typically, researchers will conduct a large-scale survey of people drawn from their intended population (e.g., citizens of the United States) and then extrapolate from the results of that survey to their population of interest.

A key problem in this approach is selection bias: if some people are systematically more likely to respond to a poll than others, and if this systematicity is correlated with what the poll is measuring (e.g., support for Democratic vs. Republican candidates), then the poll’s results might be biased. Here, “bias” has a specific technical definition: namely, that an “estimator” (e.g., a statistic calculated on a sample) will systematically misestimate the underlying parameter of interest. While all sample statistics will be a bit wrong (because of sampling error), the errors of an unbiased estimator average out to zero across repeated samples, so overestimates and underestimates cancel out in the long run. Bias, on the other hand, is more problematic because the errors consistently point in the same direction, leading researchers to make systematic errors in judgment.
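To make the distinction between sampling error and bias concrete, here is a minimal simulation sketch. The population size, the 50% true level of support, and the response rates are all hypothetical numbers chosen for illustration; the only assumption that matters is that supporters are more likely to respond than everyone else.

```python
import random

def simulate_poll(n_population=100_000, true_support=0.50,
                  response_rate_supporter=0.30, response_rate_other=0.20,
                  seed=0):
    """Return (actual support in the population, support among respondents).

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Each person either supports the candidate (True) or does not (False).
    population = [rng.random() < true_support for _ in range(n_population)]
    # Whether someone responds depends on what the poll is trying to measure.
    respondents = [
        supports for supports in population
        if rng.random() < (response_rate_supporter if supports
                           else response_rate_other)
    ]
    actual = sum(population) / len(population)
    estimated = sum(respondents) / len(respondents)
    return actual, estimated

# Differential response rates: the estimate is pushed in one direction.
actual, biased_estimate = simulate_poll()
# Equal response rates: only ordinary sampling error remains.
_, fair_estimate = simulate_poll(response_rate_supporter=0.20,
                                 response_rate_other=0.20)

print(f"actual support:        {actual:.3f}")
print(f"biased poll estimate:  {biased_estimate:.3f}")
print(f"unbiased poll estimate: {fair_estimate:.3f}")
```

Note that collecting more responses would not fix the biased poll: a larger sample shrinks sampling error, but the estimate would still converge on the wrong value.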

Establishing external validity is also complicated! It depends, at minimum, on two factors: first, a coherent definition of the “population of interest”; and second, a method for reliably producing representative samples of the population (such as random sampling). In some cases, researchers also try to use more sophisticated statistical approaches to counteract potential selection biases: for example, pollsters sometimes factor the state of the economy or previous voting patterns in a region into their estimates.

Statistical validity

Statistical validity refers to whether a given conclusion about the data is justified or reasonable on the basis of the statistical evidence presented.

As I’ve discussed before, scientists rely on various statistical techniques for modeling their data and testing hypotheses about the relationships between variables. The valid use of these techniques depends, in turn, on various assumptions. For instance, ordinary least squares (OLS) regression assumes that each observation is independent; violation of this assumption can artificially decrease the standard error estimates in an OLS model, increasing the probability of a false positive error. This is why researchers studying datasets with non-independent observations now rely on more sophisticated techniques that explicitly account for these sources of non-independence, such as mixed effects models.
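A rough simulation can show why non-independence matters. The sketch below assumes a hypothetical design with two conditions, five participants per condition, and twenty trials per participant, with no true difference between conditions. It compares a naive t-test that treats every trial as independent against a simple alternative that first averages within each participant. Aggregation is a stand-in here, not a mixed effects model, and the critical values are the standard two-tailed 5% cutoffs for the relevant degrees of freedom.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_statistic(a, b):
    """Pooled two-sample t statistic (assumes equal group sizes)."""
    pooled_var = (variance(a) + variance(b)) / 2
    return (mean(a) - mean(b)) / (2 * pooled_var / len(a)) ** 0.5

def false_positive_rates(n_sims=1000, n_participants=5, n_trials=20, seed=1):
    """Simulate experiments where the null is true; return the rejection
    rate for (naive trial-level analysis, participant-level analysis)."""
    rng = random.Random(seed)
    naive_hits = aggregated_hits = 0
    for _ in range(n_sims):
        conditions = []
        for _ in range(2):
            participants = []
            for _ in range(n_participants):
                # Each participant has their own baseline, so their trials
                # are correlated with one another (non-independent).
                baseline = rng.gauss(0, 1)
                participants.append(
                    [baseline + rng.gauss(0, 1) for _ in range(n_trials)])
            conditions.append(participants)
        # Naive: pretend all 100 trials per condition are independent (df = 198).
        flat = [[x for p in cond for x in p] for cond in conditions]
        if abs(t_statistic(flat[0], flat[1])) > 1.972:
            naive_hits += 1
        # Aggregated: one mean per participant (df = 8).
        means = [[mean(p) for p in cond] for cond in conditions]
        if abs(t_statistic(means[0], means[1])) > 2.306:
            aggregated_hits += 1
    return naive_hits / n_sims, aggregated_hits / n_sims

naive_rate, aggregated_rate = false_positive_rates()
print(f"false positive rate, naive:      {naive_rate:.2f}")
print(f"false positive rate, aggregated: {aggregated_rate:.2f}")
```

Under these assumptions the naive analysis rejects a true null far more often than the nominal 5%, because it underestimates the standard error, while the participant-level analysis stays close to 5%.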

In general, drawing conclusions from statistical analyses depends on all sorts of assumptions, which aren’t always met. Issues like insufficient statistical power and multiple comparisons without correction are other examples of practices that could in principle jeopardize statistical validity. When an analysis is not statistically valid, the chance that some kind of inferential error is made increases: whether that error is a false positive or a false negative depends on the nature of the statistical mistake.
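The multiple comparisons problem can be illustrated with a short calculation. The sketch below assumes that all tests are independent and that every null hypothesis is true, and it uses the Bonferroni correction as one common (and conservative) example of a correction; it is not the only option.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n_tests
    independent tests, assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / n_tests

for k in (1, 5, 20):
    uncorrected = familywise_error_rate(k)
    corrected = familywise_error_rate(k, bonferroni_alpha(k))
    print(f"{k:>2} tests: uncorrected = {uncorrected:.3f}, "
          f"Bonferroni-corrected = {corrected:.3f}")
```

With 20 uncorrected tests, the chance of at least one false positive is well over half, even though each individual test uses a 5% threshold.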

My sense is that, in part due to consequences of the replication crisis, researchers have become more aware of statistical validity as a concern. At the same time, statistical modeling techniques are not always taught in sufficient detail for researchers to be able to recognize these issues when they arise.

Internal validity

Internal validity refers to how successfully a piece of evidence supports a causal claim.

While causal inference is by no means a settled issue, researchers tend to agree that causal claims like “A causes B” require that at least three criteria be met: temporal precedence (A comes before B); association4 (A is related to B); and the elimination of confounds (i.e., accounting for alternative causal pathways for B). The phrase “internal validity” typically refers to that last criterion (eliminating confounds), though I’ve sometimes seen it used to refer to all three.

A confound is simply a variable that affects both the outcome variable (B) and the explanatory variable of interest (A). If some confound C causes changes in both A and B, it is plausible that A and B will be correlated even if they are causally unrelated. Crucially, if we fail to account for C—either in our experimental design or in our statistical analysis—we might infer that A and B are in fact causally related.5

A common confounding scenario occurs when a hidden variable (the confound) causes changes both in our outcome variable and the variable of interest. This produces a spurious association between the variable of interest and the outcome.

An example that’s often taught in introductory statistics classes is the positive correlation between the number of ice cream sales in coastal towns and the number of shark attacks in those same towns. There is, in some datasets, a genuinely significant and positive relationship here: that is, knowing the number of ice cream sales gives you some ability to predict the number of shark attacks—better than knowing nothing, at any rate. But this relationship clearly makes no sense from a causal perspective: how exactly could people buying more ice cream lead to more shark attacks (or vice versa)? The hidden confound, of course, is temperature: people are more likely to buy ice cream in hotter weather, and they’re also more likely to swim in the ocean, increasing the chance of a shark attack.
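The ice cream example can be simulated in a few lines. In the sketch below, every coefficient and noise level is made up for illustration, and "shark attacks" is a stylized continuous index rather than a realistic count. Temperature drives both variables, and neither has any effect on the other. The partial correlation uses the standard first-order formula for the correlation between two variables after adjusting for a third.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def partial_r(r_ab, r_ac, r_bc):
    """Correlation between A and B after adjusting for C."""
    return (r_ab - r_ac * r_bc) / ((1 - r_ac ** 2) * (1 - r_bc ** 2)) ** 0.5

rng = random.Random(42)
# C: the hidden confound (daily temperature, hypothetical range).
temperature = [rng.uniform(10, 35) for _ in range(2000)]
# A and B each depend on C plus independent noise, but not on each other.
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]
shark_index = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

raw_r = pearson_r(ice_cream, shark_index)
adjusted_r = partial_r(raw_r,
                       pearson_r(ice_cream, temperature),
                       pearson_r(shark_index, temperature))

print(f"raw correlation:              {raw_r:.2f}")
print(f"adjusted for temperature:     {adjusted_r:.2f}")
```

The raw correlation is clearly positive, but it shrinks to roughly zero once temperature is accounted for. This only works here because the confound was measured; a truly hidden confound cannot be adjusted away after the fact.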

How, then, do researchers eliminate confounds?

The gold standard, as I mentioned above, is the use of experiments with random assignment: that is, participants are first randomly assigned to one of multiple experimental conditions (say, Treatment vs. Control); then, some intervention is applied; and finally, outcomes are compared across conditions. The core intuition here is that if participants are equally likely to have been assigned to the Treatment vs. Control condition—since the assignment is purely random—we are in some sense “automatically” accounting for any differences in the sample of Treatment participants and the sample of Control participants.
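Here is a small sketch of that intuition, loosely based on the music lessons example from earlier. It assumes a hypothetical pre-existing trait ("aptitude") that raises pitch perception scores on its own, plus a true treatment effect of 1.0; all of the numbers are invented. The only thing that differs between the two runs is how people end up in the Treatment group.

```python
import random

def estimate_effect(assign, n=20_000, true_effect=1.0, seed=7):
    """Return the difference in mean scores (treatment minus control).

    `assign` is a function deciding whether a participant with a given
    aptitude takes lessons. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        aptitude = rng.gauss(0, 1)  # pre-existing individual difference
        takes_lessons = assign(aptitude, rng)
        score = (2.0 * aptitude
                 + (true_effect if takes_lessons else 0.0)
                 + rng.gauss(0, 1))
        (treated if takes_lessons else control).append(score)
    return sum(treated) / len(treated) - sum(control) / len(control)

def random_assignment(aptitude, rng):
    # A coin flip that ignores aptitude entirely.
    return rng.random() < 0.5

def self_selection(aptitude, rng):
    # People with higher aptitude are more likely to opt in.
    return rng.random() < (0.8 if aptitude > 0 else 0.2)

randomized = estimate_effect(random_assignment)
self_selected = estimate_effect(self_selection)

print("true effect:                  1.00")
print(f"estimate, random assignment:  {randomized:.2f}")
print(f"estimate, self-selection:     {self_selected:.2f}")
```

With random assignment, aptitude is balanced across groups on average, so the group difference lands close to the true effect. With self-selection, aptitude acts as a confound and the naive comparison substantially overstates the effect of lessons.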

Random assignment helps address confounds arising from pre-existing variation in the participant population. But researchers must also eliminate potential confounds in the design of their experimental conditions as well—i.e., differences in the stimuli or procedures that participants are exposed to other than the primary experimental contrast of interest. For instance, suppose a researcher is interested in whether people respond faster to more frequent words than less frequent words. If they don’t account for the empirical fact that frequent words also tend to be shorter (Zipf’s law of abbreviation), they might incorrectly attribute a treatment effect to word frequency rather than word length. These confounds are typically best addressed through careful experimental design (e.g., matching stimuli for length across conditions).6

Which kind of validity matters most?

A natural question to ask at this point is whether one of these four validities is particularly crucial and perhaps worth devoting special attention to.

The answer is that it really depends on the nature of the claim being made. Internal validity is very important for causal claims, in which a causal link is expressed between two or more variables (“A causes B”); in such cases, it is critical that researchers do their best to eliminate alternative explanations (e.g., “C causes both A and B”). On the other hand, for frequency claims (e.g., “The average A is 4.5”), the notion of a confound is much less relevant, so internal validity never really surfaces as an issue. The same could in principle be said of association claims, since a true association claim does not express a causal link—the only caveat is that some putatively associational claims are actually causal claims, so readers should be careful to identify whether someone has snuck causal language in somewhere.

External validity, on the other hand, is key for frequency claims. Descriptive statistics (e.g., the average height of California residents; or the number of people who text while driving) don’t generally grant anything like mechanistic insight; instead, their primary utility is summarizing some property of a population. Here, it really matters that the sample is representative of the population of interest. In contrast, while external validity does matter for causal claims—this is a central critique of laboratory experiments—there’s still considerable merit in demonstrating a causal link even in the absence of a representative sample; as psychologist Paul Bloom pointed out in a recent post, such work can provide mechanistic insights about how some system behaves under controlled conditions.

Statistical validity is relevant to some degree for all types of claims: it always matters that one’s conclusions rest on solid statistical principles. In my experience, however, issues with statistical validity are more likely to arise with claims about association or causation, since those claims rely on more complicated statistical techniques that quantify the relationship between two or more variables, and it’s easier to make mistakes in the execution or interpretation of these techniques. Of course, that doesn’t mean researchers can’t make statistical errors with frequency claims—an example would be conflating the mean and median of a distribution.7

Construct validity, as I’ve argued before, is perhaps the most omnipresent form of validity. Any kind of claim is vulnerable to a criticism about how the constructs have been operationalized. For instance, the frequency claim “4 in 10 people text while driving” hinges on how exactly this estimate has been produced. Did survey respondents self-report whether they ever texted while driving? Did researchers install cameras on street corners and annotate how often individuals texted while driving? Similarly, the association claim “Wealthier countries contain happier people” relies on an appropriate operationalization of both wealth and happiness. And while no metric is perfect, we might be more confident about any given metric (e.g., self-reported life satisfaction) if it’s been shown in independent studies to correlate with other measures of the construct (i.e., it demonstrates convergent validity). Finally, causal claims are effectively association claims with a causal link, so construct validity applies there as well. As with external validity, some researchers might (fairly) argue that it’s worth sacrificing some construct validity to demonstrate a causal link. Such an argument is entirely reasonable—but it’s also why, as I’ve argued before, it’s important to pair such controlled studies with observational studies that sacrifice internal validity for improved construct and external validity.

Putting the framework to use

I write frequently about Large Language Models (LLMs), with an emphasis on how we might try to reliably learn things about them. As such, I encounter claims about LLMs quite often—as I’m sure many readers of this newsletter do. Having established the kinds of empirical claims and the four validities with which to evaluate them, we can now apply this conceptual framework to claims about LLMs.

Consider the following hypothetical claims:

(1) The average hallucination rate of frontier LLMs is 20%.
(2) The size of a model is positively correlated with reasoning performance.
(3) Chain-of-thought prompting improves reasoning performance.

In evaluating each claim, we might first identify what type of claim it is. (1) appears to be a frequency claim: it is a summary statistic (the average hallucination rate) about some population (frontier LLMs). (2) appears to be an association claim: it expresses a relationship (a positive correlation) between two variables (model size and reasoning performance). And (3) appears to be a causal claim: it expresses a causal link (“improves”) between two variables (chain-of-thought prompting and reasoning performance).

We can now interrogate the validity of each claim. Let’s start with (1), which we already established was a frequency claim. There are two crucial dimensions to notice here: first, the variable being summarized (hallucination rate); and second, the population of interest (frontier LLMs). The question of how hallucination rates were assessed is a question of construct validity—how confident can we be that the evaluation used is both reliable and a valid predictor of hallucination rates “in the wild”? The question of which LLMs were studied is a question of external validity—what exactly is meant by “frontier LLMs”, and which specific LLMs were studied in the context of this study? Are those specific LLMs representative of the population being described?

We can ask similar questions of (2), which is a claim about association: larger models display better reasoning performance. Construct validity applies now to two variables instead of one. Model size is typically measured in terms of the number of parameters (weights) in the model, but we’d want to make sure that’s what the metric refers to in this particular context. Reasoning performance is much more abstract and presumably relies on specific reasoning benchmarks; whether or not we think those benchmarks are actually reflective of true reasoning capacity presumably depends both on our a priori theoretical perspective (whether we think LLMs are capable of reasoning) and empirical facts about the benchmarks themselves (e.g., whether they correlate with other indices of reasoning). In this case, external validity refers (again) to the sample of LLMs tested. This is relevant to knowing whether the positive correlation observed is likely to be true across the board, or whether it applies only to a subset of LLMs: is it always the case that larger LLMs will perform better on these tasks? Finally, statistical validity is relevant here: how exactly was the positive correlation calculated, and how strong is the measured association?

(3) is a claim about a causal link between two variables: chain-of-thought prompting (in which a model is instructed to think through a solution step-by-step) and reasoning performance. As with (2), the same construct validity questions about reasoning performance apply: how confident should we be that the evaluations assess reasoning in particular as opposed to some other latent variable? Are we primarily assessing a certain kind of reasoning (e.g., mathematical reasoning or analogical reasoning) or are we making a claim about reasoning “in general”? We might also (again) be concerned about external validity: which LLMs were assessed to make this claim? We will be more confident about the generalizability of the result if we observe a similar effect across 100 LLMs than if we tested only a single model. Finally, because this is a causal claim, we also care about internal validity. To make a claim like this, researchers would probably conduct an experiment in which models were prompted either with chain-of-thought prompting or with some “control” condition. But the validity of such a causal claim depends on the researchers successfully eliminating all differences between conditions except for the experimental manipulation. Even something as seemingly trivial as the length of the prompt might matter, so it could be helpful to include multiple control conditions, which account for different potentially confounding factors.

There’s also a fourth kind of claim about LLMs that might be even more common: that is, claims about the performance of a specific LLM, such as “GPT-4 passes the bar exam”. In terms of the framework I’ve described here, these are clearly not association claims or causal claims, but it’s also not clear that they’re intended as frequency claims about a population of LLMs; if they are frequency claims, they’re claims with very poor external validity, given that only a single model was assessed. In some ways, they’re arguably closer to something like a “case study” or perhaps an “existence proof” for some phenomenon. The key validity to focus on, then, is probably construct validity: what instrument was used to assess the performance of the model, and how reliable and valid is this instrument as an operationalization of the underlying construct people might be interested in?

Crucially, none of these questions about any of the claims we’ve considered here can be answered without knowing more details about how each claim was actually produced. A good research paper should ideally provide all the details necessary to answer such questions, but in some cases, answering them will be impossible. A failure to answer such questions is not a failure on the part of the reader; indeed, this kind of uncertainty is itself useful information for deciding how much credence to assign to a given claim. For example, if it’s not clear how the researchers in a study assessed “reasoning performance”, then any claims about the reasoning performance of LLMs—or the variables that affect reasoning performance—should be taken with a fairly large grain of salt.

My hope is that equipping readers with this conceptual framework—three kinds of claims, and four validities with which to assess them—can help them navigate the epistemic landscape. In turn, the people producing claims might consider consulting a framework like this to help make their epistemic contributions more precise.


1. There are other kinds of claims one could make that don’t depend on empirical evidence, such as definitional claims.

2. Frequency claims could, of course, reference other descriptive statistics as well, such as the median or even a measure of the variance of a variable.

3. Note, too, that the presence of hedging (“may change”) does not change the kind of claim being made, just the speaker’s epistemic stance towards the claim.

4. Note that the absence of a significant association between two variables does not imply no causal relationship. There are all sorts of reasons why a real causal relationship could be masked such that one would fail to detect an association.

5. Confounds don’t always produce false positives; in some cases, failing to account for a hidden cause can actually mask a real association between A and B, producing a false negative.

6. If the design does not address these issues, researchers can attempt to address them after the fact in their statistical analysis (e.g., including word length as a covariate). This introduces problems of its own and is sometimes misinterpreted as “controlling for” an effect, when it is in fact adjusting for some covariate in a way that may not always eliminate actual confounding.

7. This is most relevant with skewed distributions or distributions with very extreme outliers. The presence of skew or outliers affects the mean more than the median, but these summary statistics are often conflated.

Who's afraid of the null hypothesis?

Sean Trott — Mon, 22 Dec 2025 19:46:20 GMT

Cognitive scientists routinely run experiments in the lab on samples of human participants. As any experimental researcher would readily admit, virtually all lab experiments are utterly unlike the “real world”. They’re (intentionally) stripped of all context except for the key variables the researcher is hoping to manipulate. For instance, in a lexical decision task, a participant might view a series of strings on a computer screen and indicate which are valid words; this is not, of course, something that one is typically asked to do outside of the lab—but the researcher might be interested in whether participants respond faster to words that are more frequent (or shorter, or more concrete, and so on).

Someone concerned about this fact might argue that these studies lack “ecological validity”, i.e., the conclusions reached by the study may not generalize to the real world. Is this true, and does it matter? It depends crucially on what experiments are for in the broader context of Cognitive Science.

What are experiments for?

Cognitive scientists are not the only scientists that run experiments. The experimental method is widely used across scientific disciplines as a tool for isolating how different variables interact and, ideally, determining causal relationships between those variables.1 The “real world” is full of noise and confounds: threats to the internal validity of a claim. An experiment—whether it’s in physics, chemistry, biology, or psychology—sacrifices some amount of external validity (generalizability and applicability) for the sake of tighter control.

Earlier this year, the psychologist Paul Bloom wrote a good article about this very point. In it, he argued that the concern over whether a study “works” outside the lab is sometimes misguided, since the point of experiments in Cognitive Science is not merely to identify useful, practical interventions—at least in the ideal case, the goal is to inform mechanistic theories of how the mind works. He wrote:

Laboratory studies are one way (not the only way) that science comes to know things. By establishing “perfectly controlled conditions” for our experiments, we’re doing what every other scientist does—testing our theories by focusing on specific contrasts.

I’ve heard versions of this argument from other experimentalists as well, ranging from those studying human behavior to those investigating the neural underpinnings of spatial navigation in rodents. The goal of an experiment is to isolate specific variables and understand how they relate to each other. That understanding can in turn be brought to bear on the construction (and modification or even falsification) of scientific theories. For instance, the relative speed with which humans respond to abstract vs. concrete words might inform theories about how we represent the meaning of words in a “mental lexicon”.

One way to think about this is that an experiment represents a kind of “micro-world”: a simulation of certain relevant parameters of reality that allows us to determine, under controlled conditions, how those parameters interact. Observing an “effect” in an experiment—e.g., a non-zero difference between experimental conditions—can be helpful in at least two ways. First, it serves as an empirical existence proof that there exists some set of conditions under which parameter A (e.g., word concreteness or frequency) influences parameter B (e.g., response time). And second, it can provide an estimate of the “effect size”: does “A” influence “B” a little or a lot?

The second question is often more interesting, but also hard to determine reliably via experimentation, since the effect size in the lab may not actually be representative of the effect size outside the lab—that’s a case where ecological validity really does matter. But the first question is also an important one for triangulating the conditions under which two or more variables causally interact. It might sound like I’m underselling experiments by calling them existence proofs, but a well-designed existence proof can be really helpful for informing theory! But this hinges, in many cases, on how surprising or informative it is to know that “A” can influence “B”.

When are existence proofs useful?

An existence proof is, intuitively, a demonstration that something exists.

Existence proofs are most helpful for disproving universal claims. For instance, the claim that “All swans are white” can be disproven by demonstrating the existence of a single black swan. This is a form of deductive reasoning: we can falsify the first claim (“All swans are white”) by finding evidence contrary to the implications of that claim (“There exists at least one black swan”).

Now, it’s still entirely possible that most swans are white. From another perspective, identifying a single black swan may not do much to change our expectations of observing white swans in general. Conversely, if we observe numerous black swans, we might gradually update our beliefs about the relative preponderance of white and black swans. This is, essentially, a form of inductive reasoning,2 in which evidence is used to support probabilistic inferences about the state of the world. Nonetheless, it remains true that the universal claim has been falsified by a single counterexample.

Which kind of reasoning is what scientists do? And what is the role of experiments in that reasoning? Much confusion and debate, in my view, can be traced to differences in how people might answer these questions. It’s made more complicated by the fact that there are normative answers (e.g., Karl Popper’s falsificationism) and descriptive answers (e.g., maybe scientists actually do something more like inductive reasoning under certain paradigmatic assumptions). To make matters worse, the statistical techniques many scientists rely on for drawing inferences about experimental results are grounded in a kind of falsificationism (disconfirming the null hypothesis), but experiments are not always designed with explicit falsification in mind. All of this complicates the actual utility of an experiment.

Below, I’ll work through each of these issues in turn.

Falsify, not verify?

Karl Popper famously argued in support of developing falsifiable theories. Science, he argued (roughly), should proceed not by finding evidence in support of our theories, but by finding evidence that allows us to eliminate theories—letting stand the theories that have not (yet) been explicitly falsified.

A core motivation here is that it’s very hard (arguably impossible) to prove a scientific theory true using empirical evidence. This is a specific form of the more general problem of induction: even if all available evidence is consistent with a proposition (e.g., “The sun will rise again tomorrow”), it is entirely possible that future evidence will come to light disproving that proposition. Indeed, even the fact that this assumption has generally held true and been helpful in the past is itself a form of inductive reasoning—and there’s no way to know with certainty that it won’t fail in the future.

Popper’s solution to this was to argue in favor of falsification, which (as noted above) operates as a kind of deductive reasoning. A good scientific theory is one which can be rigorously tested and falsified; accordingly, theories are not “proven right” but rather “not yet proven wrong”. In principle, this allows us to distinguish scientific from non-scientific claims (Popper was very concerned with the question of demarcation): if a theory can’t be falsified, it’s not a scientific theory. It also provides a clear mechanism for making progress: science proceeds by pruning away falsified theories. When it comes to any particular theory, we ought to try to prove it wrong, not prove it right.

This brings me to null hypothesis significance testing, or “NHST”.

NHST is kind of weird

With some exceptions, NHST is the conceptual foundation of statistical analysis in much of Psychology (and indeed, much of experimental science). There is a litany of “hypothesis tests” (t-tests, chi-squared tests, and many more, a number of which can actually be represented in terms of linear regression), but all of them operate under the assumption that theoretical inferences are driven by our decision to either reject or fail to reject the so-called “null hypothesis”.

Consider the schema of a simple experiment with two conditions: “A” and “B”. Participants are randomly assigned to conditions, the intervention is applied, and some dependent variable (DV) is measured. Typically, the “null hypothesis” in such an experiment is that there is no real difference between the conditions, i.e., that the true difference in the DV is zero. Of course, sampling error means that there’ll likely be some marginal difference across conditions even if the intervention had no “true” effect. How do we distinguish random noise from a real effect?

This is where statistical tests come in. We can compare the observed effect to the distribution of effects we’d expect under the null hypothesis. This distribution can either be derived empirically (e.g., through permutation testing) or assumed (as in a t-test). In either case, we can calculate the probability, under that distribution, of obtaining an effect at least as extreme as the one we observed; we call this probability the p-value.

Illustration of NHST in practice. If we assume a null (normal) distribution of effect sizes centered at 0 with a standard deviation of 1, we can calculate the probability of obtaining an effect size at least as large as 1.8 under that distribution. If that probability (p-value) is sufficiently small, we might choose to reject the null hypothesis.

Intuitively, a larger effect size is less likely to have occurred under the null distribution.3 Thus, if the p-value is sufficiently small (say, p < 0.05), we might choose to provisionally “reject” the null hypothesis. That is, we would be unlikely to observe an effect this large if the null hypothesis were true. If the p-value doesn’t cross some predefined threshold for “statistical significance”, we might instead “fail to reject” the null hypothesis. This is what I mean when I write that NHST is deeply rooted in the logic of falsification.
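To make the illustration concrete, here is a minimal sketch of the calculation it describes: a normal null distribution centered at 0 with a standard deviation of 1, and an observed effect of 1.8. The 0.05 threshold is the conventional one mentioned above; the function and variable names are my own, and reporting both one-sided and two-sided versions is an assumption, since the illustration doesn’t say which is intended.

```python
import math


def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Values taken from the illustration: null distribution N(0, 1), observed effect 1.8.
null_mean, null_sd = 0.0, 1.0
observed_effect = 1.8

# Express the observed effect in standard-deviation units of the null distribution.
z = (observed_effect - null_mean) / null_sd

# One-sided: probability of an effect at least this large.
p_one_sided = normal_upper_tail(z)
# Two-sided: probability of an effect at least this far from zero in either direction.
p_two_sided = 2 * normal_upper_tail(abs(z))

alpha = 0.05  # conventional significance threshold
print(f"one-sided p = {p_one_sided:.4f}, reject null: {p_one_sided < alpha}")
print(f"two-sided p = {p_two_sided:.4f}, reject null: {p_two_sided < alpha}")
```

With these numbers the one-sided p-value is roughly 0.036 and the two-sided p-value roughly 0.072, so the decision to reject depends on which version of the test was chosen in advance.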

If this all seems a bit strange to you—we’re trying to reject the null hypothesis rather than prove the alternative hypothesis?—you’re not alone. NHST is not intuitive, and even practiced scholars (including myself) are not always careful with how they describe statistical results. But technically, the point of NHST is to provide a kind of conceptual framework for deciding when to reject or fail to reject a null hypothesis.

The problem arises when the null hypothesis simply isn’t very interesting.

Who’s afraid of the null hypothesis?

Earlier, I argued that one goal of an experiment is to provide an empirical existence proof that there exists at least some set of conditions under which (say) “X” affects “Y”. Using the language of falsification and NHST, we might say that the null hypothesis of many experiments is that “X” does not affect “Y”; thus, if “X” is empirically related to “Y” more than we’d expect under the null hypothesis, we can reject (falsify) the null.

I also argued that the extent to which this matters depends on how surprising or informative it is to learn that “X” might affect “Y” in at least some conditions. Put another way: how strongly held was the null hypothesis in the first place?

It’s easy to imagine null hypotheses that could be easily disproven, and that no one would find interesting to disprove. For instance, suppose the null hypothesis is something like: “An anvil and a feather are equally likely to break a glass cup”. We could design a study to test this null hypothesis. First, we might select or “sample” 100 glass cups of identical make and size. Then, we might randomly assign 50 of these cups to the Anvil condition and 50 of them to the Feather condition. Finally, we would measure how many cups broke when we dropped an anvil on them and how many broke when we dropped a feather on them. If the difference between conditions (Anvil - Feather) is larger than we’d expect under the null hypothesis, we can successfully reject the null.
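As a rough sketch, the analysis for this hypothetical study could be run as a permutation test, the empirical approach to deriving a null distribution mentioned earlier. The outcome data below are invented for illustration (every cup in the Anvil condition breaks, none in the Feather condition does), and the function name, number of permutations, and seed are all my own choices.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under the null, condition labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if diff >= observed:
            at_least_as_large += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Invented outcomes: 1 = cup broke, 0 = cup survived.
anvil = [1] * 50
feather = [0] * 50

observed, p_value = permutation_test(anvil, feather)
print(f"observed difference (Anvil - Feather) = {observed:.2f}")
print(f"estimated p-value = {p_value:.5f}")
```

The statistical machinery works exactly as intended here, and the null is rejected decisively. That is the point of the example: the procedure can be sound while the result tells us nothing we didn’t already know.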

Any reasonable observer would (correctly) ask: what would be the point of such an experiment? Surely we already know that anvils are heavier than feathers, and thus more likely to break a glass cup! No one would be surprised by finding that this is the case.

I don’t think most experiments test something as obvious as the relative heaviness of anvils and feathers. But I do think many experiments (including some I’ve run) are designed without a particularly interesting null hypothesis in mind. Instead, they’re designed to find corroborative evidence for a theory. That is, they identify the predictions of a theory (“X affects Y”) and design an experiment to test those predictions. The problem in such a scenario is that if we think of experiments as existence proofs (as I’ve argued we can), then perhaps it’s not so surprising that a researcher with considerable degrees of freedom in how they design their study can construct some scenario under which their theory’s predictions (“X affects Y”) are corroborated. If the null hypothesis corresponded to a different, competing theory (as I discuss below), that would be a really interesting result. But if the null hypothesis is a theory no one believes—simply the negation of the theory under investigation—then it’s more of a demonstration that the researcher’s intuitions are borne out with empirical evidence in at least one case. That’s not nothing, but it’s epistemologically fraught to lean entirely on such a methodological paradigm in constructing theories about the world.

What should researchers do?

Where does this leave us?

One obvious (but challenging) solution is to actually design experiments with an interesting null hypothesis in mind and try to disconfirm that theory: essentially what Popper suggested. This is hard work: it requires carefully enumerating the predictions of psychological theories and figuring out not only what would be consistent with a given theory but also what would be inconsistent with the theory. Then, once you’ve figured this out, you try to design an experiment in which the theory predicts no difference in response to some experimental manipulation; in such a design, finding a significant difference—rejecting the null hypothesis—would equate to a kind of falsification of that theory, requiring some kind of modification (assuming you’ve operationalized your constructs appropriately).

In the ideal case, researchers might even design experiments that pit multiple theories against each other, such that the results actually help adjudicate between competing theories. This is what biophysicist John Platt called “strong inference”. It’s a really effective approach because it helps avoid the inevitable (and very human) urge to “stack the deck” experimentally in favor of finding some experimental effect. Here, the goal is to design a study such that any result would be informative: Result A disconfirms Theory B in favor of Theory A, while Result B disconfirms Theory A in favor of Theory B.

I want to emphasize that this is really hard work. It’s especially hard when different psychological theories don’t make precise predictions; in some cases, it’s difficult to determine whether different theories make competing predictions at all! That’s not a reason not to do it—I absolutely think researchers should strive to design experiments that actually adjudicate between competing theories. But some verbal theories simply aren’t sufficiently developed for this kind of quantitative comparison, so comparing them requires making some strong (often contestable) assumptions on the part of the researcher.

Do experiments still have a role in these fuzzier contexts, absent strong inference? I think the case is less clear, but there are a few possible affirmative responses I can imagine someone making.

The first (and weakest) is the view I pointed out in the previous section: even if experiments are designed to corroborate a theory (rather than disconfirm it), and even if the results are not particularly surprising, there is some information provided above and beyond the researcher’s intuitions—i.e., it is an empirical demonstration that some set of experimental conditions can be devised to produce an effect.

The second is to suggest that the goal of science isn’t always investigating whether X affects Y; it’s to determine the size of this effect or relationship (sometimes called parameter estimation). Here, the role of experiments might be to help estimate an effect size under certain controlled conditions. As I argued earlier, such a finding may not always be reliable or useful given the lack of ecological validity in most experiments: the extent to which “X affects Y” in the lab may not tell us much about the extent to which “X affects Y” in the real world. But this approach can be more useful if multiple studies are conducted, and their results are integrated with studies conducted using more naturalistic, ecologically valid data—providing something like convergent evidence across multiple methodological approaches.

Here, a reader might also suggest that the problem lies in the binary logic of NHST; perhaps scientists should adopt a more “Bayesian” approach instead, in which no single experiment is decisive but each experiment’s results might “update” our beliefs in one direction or another. I think Bayesian statistics are great, but I’m not actually confident the problem here is about which statistical paradigm we use. In my view, the problem is more about how we go about selecting theories to test and designing experiments to test them—not how we analyze the data from those experiments. I think the results of the Anvil vs. Feather experiment would be uninteresting regardless of which statistical paradigm one used to analyze them.

With that said, one insight that the Bayesian paradigm does attempt to capture is the notion of the “weight of evidence” and cumulatively building up beliefs for or against a theory. To me, the best version of this approach looks something like finding convergent evidence across multiple studies (as I argued above)—what I think of as a process of triangulation. Finding a relationship between X and Y across multiple experiments and multiple naturalistic studies might increase our confidence that there is, in fact, a relationship between X and Y, even if none of those studies were designed with falsification in mind.

This is, ultimately, a kind of inductive reasoning, and it comes with well-known philosophical problems. The kind of knowledge obtained through this process is inherently uncertain and unstable, but that may simply be the reality of the epistemological situation we find ourselves in. I do think, however, that we need to be honest about when we’re actually doing falsification and when we’re doing something more like this process of fuzzy triangulation.

Notably, this is why randomized controlled trials—a kind of experiment—are generally considered the “gold standard” of evidence in medicine.

Not the same thing as a proof by induction, which is actually a form of deductive reasoning!

Generally this is also affected by things like sample size, which (along with the sample variance) determines the variance of the null distribution.

Is ChatGPT "grown, not made"?

Sean Trott — Mon, 24 Nov 2025 04:56:03 GMT

A feature of modern large language models (LLMs)—the architecture underlying technologies like ChatGPT—that is often hard to convey to non-experts is their opacity.

Systems like ChatGPT, after all, are human artifacts. They didn’t spontaneously materialize one day from the sky or from the depths of the ocean; they are the result of human research, engineering, and product design. We tend to assume such artifacts are, ultimately, comprehensible: many of us live, arguably, in a “disenchanted world”, in which we assume that much (if not all) of that world—at minimum, the human-designed portion of it—can be explained in terms of legible, mechanical principles.

Put another way: I don’t know exactly how a toaster oven works, but I assume that someone, somewhere does.

It seems strange, then, to assert that even the people building LLMs don’t entirely understand how or why they work. How could we build them if we didn’t understand them? I don’t want to overstate the case here—as I point out below, we obviously know some things about how LLMs work. But there are, nonetheless, epistemic gaps. The reason for these gaps is sometimes conveyed in the following metaphor:1

LLMs are grown, not made.

As I’ve written before, all metaphors have their strengths and weaknesses: any given metaphor highlights certain aspects of the frame it’s describing and obscures other aspects. This metaphor is no different. I like some aspects of it, and I’ve relied on it (or a variant of it) in the past to communicate the basic idea of LLM opacity, but I also have some reservations about applying it uncritically. Below, I attempt to convey more precisely what I mean when I say “we don’t understand how or why LLMs work”; at the end of the post, I return to the “grown, not made” metaphor and discuss my mixed feelings in more detail.

What we do know

There is much we do know. We know, first, what LLMs are trained to do: predict upcoming words on the basis of their prior context.2 For so-called “open”3 models, we also know what data the LLM was trained on and, ideally, in which order.

We also know the mechanisms underlying this training process: as Tim Lee and I wrote in our explainer on LLMs, you can think of training as updating a bunch of “knobs”—like changing the temperature in the shower—to help the LLM get better and better at predicting upcoming words. The direction and magnitude by which we turn each knob is based on the error signal the LLM gets during training. After lots of examples, it gets quite good at predicting words. Moreover, for open-weight models, we know the final weights—roughly, the values of these “knobs”—resulting from this training process, and we know the mathematical operations by which particular inputs are transformed (via a “forward pass”) into predictions about the next word.
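The “knobs” picture can be sketched in a few lines of code. The toy below is nothing like a real LLM: it assumes a twelve-word corpus I made up, one knob per (previous word, next word) pair, knobs that start at zero rather than at random values, and a learning rate chosen arbitrarily. What it does show is the basic loop described above: make a prediction, measure the error, and turn each knob a little in the direction that reduces that error.

```python
import math

# Invented toy corpus; a real model sees vastly more text and far longer contexts.
corpus = "the cat sat on the mat the cat sat on the rug".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

V = len(vocab)
# weights[i][j] is the knob scoring word j as the successor of word i.
weights = [[0.0] * V for _ in range(V)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Mean negative log-probability assigned to the word that actually came next."""
    return -sum(math.log(softmax(weights[c])[n]) for c, n in pairs) / len(pairs)


learning_rate = 0.5  # arbitrary choice for this sketch
loss_before = average_loss()

for _ in range(200):
    for context, target in pairs:
        probs = softmax(weights[context])
        for j in range(V):
            # Error signal: predicted probability minus 1 for the true next word, 0 otherwise.
            error = probs[j] - (1.0 if j == target else 0.0)
            # Turn the knob against the error, scaled by the learning rate.
            weights[context][j] -= learning_rate * error

loss_after = average_loss()
best_after_cat = vocab[max(range(V), key=lambda j: weights[index["cat"]][j])]

print(f"loss before training: {loss_before:.3f}")
print(f"loss after training:  {loss_after:.3f}")
print(f"most likely word after 'cat': {best_after_cat}")
```

Even in this tiny case, the loss never reaches zero, because “the” is followed by different words in different places; the best the knobs can do is spread probability across them.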

Of course, systems like ChatGPT are also subject to extensive “post-training”, which includes teaching them to follow instructions, giving them human feedback, and incentivizing them to “think step by step” when solving hard problems. We also know, more or less, how this works.

It’s also worth noting, here, that even though I don’t know all the training details or weights for closed-source models like ChatGPT, OpenAI employees presumably do (even if that cumulative knowledge is distributed across multiple individuals). The information is knowable and known.

Why is this not enough?

What we don’t know

For some people, it is. As I wrote in my mechanistic interpretability explainer, some have argued that interpreting LLMs is effectively a solved problem. In fact, some researchers in the field don’t even seem to think it’s a relevant problem: recently, a manuscript I submitted was criticized by one reviewer on the grounds that we already know, more or less, how transformer language models work—there was no need to conduct further research on the mechanisms they use to contextualize the meaning of words. I don’t think this is a widespread opinion in the field (the other reviewers liked the paper, and found the topic interesting and important), but it is held by some.

That review was a good opportunity to clarify exactly what it means to say “we don’t understand how or why LLMs work”. When people say this, they usually mean at least one of two things (or both).

First, LLMs appear capable of doing things that they weren’t explicitly trained to do. In the course of learning to predict upcoming words, LLMs acquire internal representations and mechanisms that enable them to do this more effectively; these mechanisms, in turn, allow LLMs to produce predictions that are sensitive to myriad factors, like whether an utterance is grammatically well-formed, whether a sentence conveys a plausible event, or even what a character in a story knows or doesn’t know.

Now, as I’ve argued before, the question of whether LLMs actually possess certain capacities (like “grammar” or “Theory of Mind”) is a deep question about the construct validity of the measures we use to assess those capacities, which will require extensive philosophical and empirical work to resolve. The crucial thing, however, is that LLMs weren’t explicitly trained to produce these interesting behaviors: they are a byproduct of the thing the LLM has actually been trained to do.4 Moreover, we lack a thorough understanding of which capacities we should expect to develop through this training process, or why these capacities would develop at all. We have hypotheses (e.g., the idea that learning to predict text encourages a model to “reverse-engineer” the causal process giving rise to that text), but this is very different from the mechanical understanding we have about (say) a toaster oven.

Second, although we know the values of the weights (“knobs”) in an LLM, we don’t know how exactly those weights implement the computations and behaviors we observe. Epistemologically, you might think of this as a gap between levels of analysis. We have a high-level description of what the components of a transformer language model do (e.g., attention heads are a “matchmaking service for words”); we also know, at a very low-level, the precise mathematical operations underlying a specific instantiation of those components (e.g., the result of the query/key/value operation for a particular attention head for a particular input). But in the absence of empirical investigation, we don’t know which behaviors those individual components subserve.

Figuring this out is, in large part, the goal of mechanistic interpretability. And I’d argue we’re making some progress! For instance, interpretability researchers have discovered interesting components like “induction circuits” in a number of different LLMs, which determine whether a given token (e.g., “the”) has previously occurred in the context (e.g., at position t), then predict that the subsequent token will be the one following the previous one (e.g., at position t + 1). There’s even some evidence that these circuits might play a role in important abilities like in-context learning (though recent evidence suggests the situation is more complex than it initially appeared).
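To pin down what that behavior amounts to, here is a small sketch of the input-output rule an induction circuit is described as computing. It is only a behavioral description written as ordinary code, with a function name and example of my own; it says nothing about how attention heads actually implement the rule inside a model.

```python
def induction_prediction(tokens):
    """Return the token that followed the most recent earlier occurrence of the
    final token, or None if the final token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for t in range(len(tokens) - 2, -1, -1):
        if tokens[t] == current:
            # The token at position t + 1 is the predicted continuation.
            return tokens[t + 1]
    return None


context = ["the", "quick", "fox", "saw", "the"]
print(induction_prediction(context))  # → quick
```

Writing the rule down is easy; the interpretability challenge is showing that a particular set of components in a trained model computes something like it.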

At the same time, interpretability has a long way to go: many fundamental questions remain about the nature of circuits and the best ways to go about finding them.

More broadly, it’s notable that this picture is, again, very different from the epistemic picture of a toaster oven. With LLMs, we attempt to identify circuits in a post-hoc manner, i.e., after the system has been trained; and even in the best-case scenario, we generally can’t be certain that we’ve accurately identified the correct function of a given circuit.

An analogy might be in order here. Suppose we construct a mechanical system of cogs and levers, which can be modified according to the various inputs to that system. We then “train” that system to produce certain patterns of exhaust fumes in response to various inputs. The system excels at this task, and in the process, we learn—to our surprise—that training it to produce these exhaust fumes has, inadvertently, taught the system to move forward and backward; moreover, these movements appear to be appropriately calibrated in response to different patterns of input. That is, the system can “drive”. We don’t know how or why this happened, though we have some plausible hypotheses. We also don’t know exactly which cogs and levers are responsible for the system’s “decisions” to move forward or backward in different contexts, though, again, we’re making some progress in the endeavor to map these components onto observable behaviors.

Grown, not made?

Let’s return, then, to the metaphor at hand: is a system like ChatGPT better described as “grown” than “made”?

In the language of conceptual metaphor theory, metaphors typically work by construing some target frame (e.g., ChatGPT) in terms of a source frame (e.g., biological growth processes). As I noted at the start, metaphors don’t necessarily construe every aspect of the target frame, nor do they use every dimension of the source frame. My sense is that the “grown, not made” metaphor emphasizes the contingency and unpredictability of living organisms. While fields like genetic engineering have clearly made incredible strides in recent years, I think most people would still endorse the claim that we understand less about the processes underlying the growth and development of biological organisms than those underlying the construction of a toaster oven or a combustion engine.

But there are lots of things we don’t understand as well as toaster ovens: dark matter, anesthesia, consciousness. Why select the grown vs. made contrast in particular? I think there are a couple aspects of the biological growth and development frame that make it a convenient vehicle for the goals of this metaphor.

First, there’s a human element to both growing and making that makes them similar enough to be meaningful counterparts. Biological organisms grow on their own, too, of course, but human societies have long reshaped their environments to better suit their needs, which includes practices like agriculture and animal husbandry. Even the verb “grow” acknowledges this ambiguity: the subject of the verb can be the thing growing (“plants grow when provided sunlight”) or it can be the agent controlling the growth process (“humans grow plants for food”). Framing ChatGPT as something “grown” accommodates the fact that it was, after all, created by humans—unlike, presumably, dark matter.

Second, there’s much we still don’t understand (and can’t control) about growing life. For instance, a gardener can establish the conditions that facilitate growth (e.g., the approximate quality of the soil, the number and depth of seeds planted, the volume of water, the exposure to sunlight), but they can’t directly control the growth processes themselves, and there’s always some element of contingency. A proponent of the “grown, not made” metaphor might argue that there’s something similar about training LLMs: engineers set the training conditions (e.g., the training data, the initial parameters5, the training objective, and the architecture), but typically don’t control the specific representations or mechanisms developed by the LLM.

As noted earlier, I have mixed feelings about this metaphor. Its strength is that it concisely and effectively conveys both that humans do create LLMs and that the engineers creating them lack the kind of direct control and understanding we typically associate with technology. An argument in favor of the metaphor would thus point out that it’s a quick way to illustrate to someone why it’s possible to create something without fully understanding it.

That said, the metaphor doesn’t, on its own, convey what it is that we don’t understand or why, which is what I’ve tried to convey in this post. The engineers training LLMs know what they’re trained to do, and how they’re trained to do it, but they don’t always know why LLMs end up doing other things besides what they’re trained to do—or how they do those other things. I don’t think that’s a problem with the metaphor, per se; a metaphor can’t do everything. But the one-sentence summary I just wrote also does (I think) a reasonably effective job of conveying, in the absence of metaphor, one of the core points that the metaphor is also trying to convey. I think there’s virtue in saying things as precisely as one can while still getting the basic point across, especially when discussing scientific topics, so I will strive to describe it this way in the future.

You see this metaphor, for instance, in discussion of the new book If Anyone Builds It, Everyone Dies, as well as this article breaking down the strange “seahorse emoji glitch” observed in ChatGPT and other models.

For auto-regressive models; bidirectional LLMs are trained to predict masked tokens using both “left” and “right” context.

I’m glossing over, here, the distinction sometimes made between “open-data” and “open-weight” models, as well as “fully open-source” (in which all training code is also made available). Openness, safe to say, is a gradient.

I’ve tried to avoid, here, relying on the terminology of “emergence”, though that is typically how these “byproducts” are described. This paper by David Krakauer, John Krakauer, and Melanie Mitchell discusses the technical definition of “emergence” in depth, along with when and why it may not apply to LLMs.

Literally referred to as the “random seed” in some cases.

On what to work on

Sean Trott — Tue, 18 Nov 2025 21:54:00 GMT

Our time is finite. For many, a natural question thus arises about how best to use (or “spend”) that time. A microcosm of this question emerges in the case of academic research, where researchers are typically privileged to have a fair degree of autonomy over what they work on. How should researchers prioritize among the possible projects that present themselves?1

This is, fundamentally, a good kind of challenge. It’s one I faced often in graduate school: I was surrounded by interesting and intelligent peers working on all sorts of questions I’d never thought of before, and each of those questions felt like a tantalizing alternative to whatever project I was currently stuck on. Further, my PhD advisor was (fortunately) more of a mentor than a manager, which meant that he never forced me to work on “his” projects, but rather, used his expertise to help sculpt my vague interests into something resembling a research direction; that, in turn, required me to think about what I was actually interested in doing.

One of the most useful things my advisor did for me was to consistently ask, when presented with a proposed project, why one would do that project—what would they learn? I learned that there’s a difference between figuring out how to do something (and whether it’s even possible) and deciding that you should do it.

Still, I never really considered the question in terms of longer arcs of time. I thought about which project to work on next but not which kinds of projects to work on over many years. I was forced to consider that latter question in recent years as I applied for faculty jobs and was asked to develop a research vision; it was hard for me, but I do think I learned something about myself and how to think about the relationship between my interests and the ways I move through time.

In this post, I’ll describe a few ways of thinking about this question:

(1) Putting one foot in front of the other: figuring out your next project.
(2) Finding an “important problem” to solve, and how to find it.
(3) Developing a sense of taste.

Notably, (1) and (2) conceivably fall under the same conceptual frame—i.e., “spending” your time effectively—and differ primarily in their sense of time horizon. (3), in contrast, focuses less on characteristics of the work itself (e.g., whether it is impactful) and more on one’s relation to that work. These days, I’m increasingly drawn to (3), but I think each of these approaches has merit.

(1) One foot in front of the other

A reasonable starting point might be to think primarily in terms of your next project. In the most extreme case, this might involve thinking day to day: each day you do some work, and you think about what you should do the next day. At a slightly less granular level, perhaps you think in terms of roughly “paper-sized” projects: the set of activities that will ultimately constitute a contribution to the scientific literature.2

Truthfully, I think this is how many people actually operate, including myself—and I think that’s entirely fine. To some extent, we engage in a kind of hubris when we suppose that we can plan for anything more distant than a single “move” away. There’s a sense in which the wisest strategy is to surrender your sense of control over the long arc of your life and instead focus on the daily choices that make up that life.

Moreover, you can still endeavor to make these choices intelligently, using whatever decision framework you prefer. You can think, for instance, about whether this paper or that paper would be a more useful contribution to the field (i.e., its “impact”), how much effort would be involved (i.e., its “tractability”), and whether someone else is likely to do it anyway (i.e., its “neglectedness”). In short, you can ask: is it worth doing?

Of course, looking only a single move ahead has its drawbacks. It’s already very easy in academic research to get lost in the details and lose sight of your broader goal; if you don’t even have a broader goal other than your current (or next) activity, the winds of fate can easily buffet you into a region of decision-space that’s hard to escape. Maybe that means chasing down an endless series of package installations (and subsequent version issues) without asking whether there’s an alternative approach; or maybe it means chasing down a chain of references for a topic you weren’t even that interested in.

Again, there’s nothing bad about this: it’s more or less how I do things, and it leaves the door open for serendipity. But as a researcher, one of the most useful skills you can develop is an ability to “toggle between levels”, so to speak—it’s crucial to immerse yourself in the details of an analysis, but it’s equally crucial to have a kind of internal meta-cognitive monitor that can check in on your activities every so often and ask whether this is, in fact, a valuable use of time.

(2) Finding “important problems”

If (1) looks one “move” ahead, then the aim of (2) is to look at your “destination” and figure out how to get there. What kinds of problems do you want to solve in your career?

The underlying assumptions here are that it is possible to reason at all about something with such a broad temporal horizon, and that it makes sense to think of those things as “problems to be solved”. I’m not sure I fully endorse either of those assumptions, but the view is still a really useful counterpoint to the short-sightedness of (1) above.

This frame is the one that’s implicitly adopted anytime someone asks about your “10-year plan”, or indeed, anytime you plan something as long-term as a “career”. Organizations such as 80,000 Hours (80K Hours) explicitly advocate for an approach towards identifying a career that addresses important problems:

You have about 80,000 working hours in your career: 40 years x 50 weeks x 40 hours.
If you want to have a positive impact with your life, your choice of career is probably your best opportunity to do that.
That means it’s worth thinking hard about how to use this time most effectively. If you can make your career 1% higher impact (whatever that means to you), it would in theory be worth spending up to 800 hours working out how.
We aim to help you work out how you can best use your 80,000 hours to help others, and to take action on that basis.

In my experience, online discussions about 80K Hours (or related organizations) sometimes get hung up on what kinds of things the organization considers a major “problem area” (e.g., AI risk, engineered pandemics, or factory farming). But it’s entirely possible to apply the underlying framework to any kind of long-term planning about how you spend your time. In fact, I’ve met plenty of cognitive scientists who do view their career in terms of a big-picture problem (e.g., “solving vision”); they usually have a set of reasons for working on this problem that aren’t too different, ultimately, from the decision framework 80K Hours suggests using.

Specifically, many people present a rationale rooted in (roughly) three criteria mirroring the 80K Hours framework: impact (how important is the problem?), tractability (can you actually make progress on the problem?), and neglectedness (is it likely to get solved without you?). Readers will likely notice that these are the same criteria I presented in section (1) above: indeed, you can apply this framework both to short-term decisions and to career planning.

One way to think about this framework is as a kind of algorithm. In theory, you could actually quantify these things for yourself: that is, if you consider the range of possible options, and you situate each option in some three-dimensional space parameterized by these criteria (impact, tractability, and neglectedness, or “ITN”), you can identify the options that reside in the most “optimal” part of that space.
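Taken literally, that algorithm might look something like the sketch below. Everything in it is an assumption made for illustration: the project names and their 1-10 ratings are invented, and combining the three criteria by multiplying them is just one simple choice among many (a weighted sum would be another).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One candidate project, located in the three-dimensional ITN space."""
    name: str
    impact: float
    tractability: float
    neglectedness: float

    def score(self) -> float:
        # Assumed combining rule: a simple product, so a very low rating on
        # any one criterion pulls the whole score down.
        return self.impact * self.tractability * self.neglectedness


# Hypothetical projects with invented subjective ratings on a 1-10 scale.
options = [
    Option("replicate a known effect", impact=3, tractability=9, neglectedness=2),
    Option("build a new benchmark", impact=6, tractability=6, neglectedness=5),
    Option("tackle an open theoretical question", impact=9, tractability=2, neglectedness=8),
]

# Rank options from highest to lowest combined score.
ranked = sorted(options, key=Option.score, reverse=True)
for option in ranked:
    print(f"{option.score():6.0f}  {option.name}")
```

The ranking is only as good as the ratings that go into it, and it shifts with the combining rule, which is one reason to treat the output as a prompt for reflection rather than an answer.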

Alternatively, you can simply think of the criteria as heuristics to remember whenever you’re faced with a decision or as you’re thinking about the arc of your career. As an exercise, I assigned subjective ITN ratings to a bunch of my projects recently. It didn’t change anything about what I wanted to work on—that’s really a matter of taste, as I argue below—but it was informative about my own process: for instance, it revealed that I’ve historically over-emphasized tractability, perhaps at the expense of impact. That encouraged me to be a bit more ambitious in terms of finding projects that I think will be important, rather than finding projects I know I can complete.
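To make the "algorithm" reading of the framework concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the 1-to-5 rating scale, the choice to combine the three criteria by multiplying them, and the project names and ratings (which are hypothetical, not my actual ratings).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One candidate project, with subjective ratings from 1 (low) to 5 (high)."""
    name: str
    impact: float
    tractability: float
    neglectedness: float


def itn_score(option: Option) -> float:
    # Assumption: multiply the ratings, so a very low rating on any single
    # criterion pulls the whole score down. A weighted sum is another option.
    return option.impact * option.tractability * option.neglectedness


def rank_options(options):
    # Sort from highest to lowest combined score.
    return sorted(options, key=itn_score, reverse=True)


# Hypothetical projects and ratings, purely for illustration.
projects = [
    Option("safe replication", impact=2, tractability=5, neglectedness=2),
    Option("ambitious new question", impact=5, tractability=2, neglectedness=4),
    Option("incremental follow-up", impact=3, tractability=4, neglectedness=1),
]

ranked = rank_options(projects)
for project in ranked:
    print(f"{project.name}: {itn_score(project):g}")
```

With these made-up ratings, the ambitious project ranks first despite its low tractability. A ranking like this is only as good as the subjective ratings that go into it, which is one reason to treat the output as a prompt for reflection rather than a verdict.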

(3) A matter of taste

There’s much one could say about research taste.

Some people think of it as something that can be good or bad, as in “this person has good taste in research questions”. For me, a more precise rendering casts taste in terms of its degree of attunement, as in “this person has a really well-developed sense of research taste”. Different people will, naturally, be drawn to different kinds of questions or theories—just as different people are drawn to different constellations of flavors—but each individual can work to acquire better self-knowledge about their own tastes and, if they so desire, work to either refine or expand those tastes.

Unlike something like the ITN framework described in (2), taste is (in my view) an approach towards research that can’t be generalized, represented algorithmically, or decomposed into its constituent parts; it is by its nature both holistic and highly individual. That’s part of why I don’t think taste can really be “good” or “bad”: some people will have a taste in questions that better aligns with the rest of their field (and, accordingly, makes them more successful in terms of their number of publications, promotions, etc.), but that doesn’t necessarily make their taste better. We can, however, strive to better understand and express our individual tastes. The motivation for doing so might be something like self-knowledge (it is interesting, and fruitful, to learn what you like and don’t like); it might also be crucial to honing your taste so you can more readily and intuitively identify what draws you to particular research topics.

An example here might clarify what I mean. I’ve worked on a number of projects throughout my research career thus far, but as I stand here with the benefit of hindsight and attempt to make sense of the overall shape of these projects and papers, one thing that stands out is that I’m consistently drawn to projects that question the link between some empirical analysis and the theoretical conclusions drawn from that analysis. This particular taste might be described as an epistemological frustration (a “distaste”) with what I take to be unwarranted claims. This was the motivation for some of my earlier work on the role of efficiency in language evolution: it’s not that I disagreed a priori with the argument that languages are shaped by a pressure for efficiency, but I felt that some of the pieces of evidence used to bolster this claim were also consistent with other theories. It’s also the driving force behind much of my work on LLM-ology, which focuses on whether we’re measuring LLM capabilities in the right way and whether our results generalize to a broader population of LLMs.

Of course, many scientists—including myself—are drawn not only by their distastes but by an appreciation of something like beauty: a positive, rather than a negative, tropism. The writer and scholar Emma Stamm recently pointed me towards an interesting study examining aesthetic experiences among biologists and physicists. A substantial percentage of practicing scientists in both disciplines reported regularly experiencing a sense of awe and wonder with respect to their subject of research. Yet the source of beauty varied by individual and by discipline: to cite a particularly illustrative example, some people find beauty in simplicity, while others find beauty in complexity; perhaps some find beauty in both!

The notion of “elegance” in a theory is, perhaps, overstated by some—certainly, elegance is not identical with accuracy. But there is something that feels important about the ineffable sense of resonance one occasionally feels in the moment of encountering a particularly elegant theory or explanation. For me, elegance often manifests as something like explanatory insight. One paper that fits into this category for me is Jeff Elman’s 2009 paper “On the Meaning of Words and Dinosaur Bones”3; another is Len Talmy’s 1988 “Grammatical Construal”. Both papers, notably, present a theory—motivated by psycholinguistic evidence in Elman’s case, and linguistic examples in Talmy’s—of linguistic meaning: what it is and how different parts of language contribute to it. Another writer who consistently strikes me with his capacity for insight is William James, who is almost literary in the way he weaves together case studies and introspection to craft a coherent account of something as seemingly intangible as religious conversion.

But that’s just me. Different people will presumably be drawn to different sources of beauty. This is precisely my point: it’s not that some theories (ideas, results, etc.) are more beautiful than others, but that we can learn to recognize what we find beautiful and why, and that alone is a useful and illuminating insight about ourselves and the way we choose to spend our time. I’m not really sure how one develops this sense of taste—but I do think that taste, perhaps more so than any other aspect of experience, is not something that can be arrived at via shortcut, and it is possibly not a destination at all.

Another version of this challenge might be framed in terms of journalism or public-facing writing more generally: what should you choose to write about?

Again, from the journalistic perspective, maybe this could be cast in terms of “your next article”.

See this article for a more in-depth description of that paper and my experience with it.

The epistemology of lower-back pain (pt. 2)

Sean Trott — Sun, 26 Oct 2025 19:42:26 GMT

About nine months ago, I wrote an article about the putative causes of (and possible treatments for) lower back pain. The article was inspired by my own experiences: I was dealing with an intense bout of sciatica, and one way I navigated the pain was by trying to learn more about the condition. I was surprised, and honestly quite moved, by the response to the article—a number of readers reached out directly to share their own experiences, offer advice, or simply express their hopes that I’d recover soon.

Of course, I shouldn’t have been surprised. Lower back pain is extremely prevalent1, and many people who’ve suffered from some kind of chronic pain feel a sense of recognition when they encounter echoes of what they’ve gone through2; moreover, many people are quite kind and empathetic, so even those who haven’t personally experienced a certain flavor of pain are compelled to reach out in sympathy. Nonetheless, I was both grateful and moved, so I’d like to say thank you to those who reached out or even just read the article.

Getting to the bottom of lower back pain is an extremely difficult epistemological challenge.

Pain, after all, is an experiential phenomenon. Even if the causes are physical, it’s hard (perhaps impossible) to establish the precise mechanistic causes of subjective experience. While we can point to potential causes, the thing we’re trying to explain—the pain itself—is a property of an individual’s experience. Here, I think a quote from Elaine Scarry’s The Body in Pain (which I included in a previous post on ineffability) might help illustrate the problem:

Thus when one speaks about “one’s own physical pain” and about “another person’s physical pain”, one might almost appear to be speaking about two wholly distinct orders of events. For the person whose pain it is, it is “effortlessly” grasped (that is, even with the most heroic effort it cannot not be grasped); while for the person outside the sufferer’s body, what is “effortless” is not grasping it (it is easy to remain wholly unaware of its existence; even with effort, one may remain in doubt about its existence or may retain the astonishing freedom of denying its existence; and finally, if with the best effort of sustained attention one successfully apprehends it, the aversiveness of the “it” one apprehends will only be a shadowy fraction of the actual “it”). (Pg. 4)

In the last nine months or so, I’ve read a variety of sources from the literature on back pain, some of which were suggested by readers in email exchanges; I’ve also experienced more ups and downs myself (more on that below). Together, these input streams have gone some way towards constructing a kind of rudimentary mental model of lower back pain—both in terms of an appreciation for the diversity of potential mechanisms underlying the thing itself, and a better understanding of why our epistemology is so fraught on this topic. Change is the only constant, but this post is intended to give a snapshot of that mental model as of October 2025.

Where I’m at now

Things are, in general, much better but not totally normal. This is typically what I say when asked about my back: compared to January of 2025 (when the pain was at its worst), I’m really grateful for the progress.

These days, I feel functional: I go to work, teach, play with (and take care of) my daughter, and even do some light weight lifting (more on that below). Pain, or the possibility of pain, is sometimes present and occupies some part of my attention some of the time, but to a lesser degree than earlier this year.

Another question I sometimes get is what’s worked for me. I’ll discuss various treatment approaches throughout the course of the article, but part of what makes this so challenging is that time is the great confounder: were the treatments I tried early on less effective than the ones I tried later on, or is it simply that healing follows a slow but inexorable path that would’ve looked exactly the same had the order of operations been reversed? Nevertheless, the routine I’ve settled on is a combination of over-the-counter medications (calibrated to minimize other side effects), careful weight training and mobility work (under the guidance of a great trainer at Kinetic Impact; more on that below), and, honestly, trying not to think about it too much. Other treatments I’ve tried or been prescribed (including acupuncture, massage, physical therapy, believing the pain is only in my head, osteopathic manipulative treatment, dry needling, epidural steroid injections) have, for me, had little to no efficacy. That’s not to say they don’t work for anyone or even most people; back pain is complicated and you can’t generalize from an N of 1!

Of course, I recognize that most people don’t follow this newsletter to read updates about my back pain. I mention all this to provide additional context for what’s below: my limited experience has in some sense run the gamut of theoretical views about the underlying mechanistic causes of back pain, from the purely “bottom-up” approach to what I think of as a kind of modern “mind-cure” movement. I sit now at a tentative synthesis between these views, which may well change.

The bottom-up view

Perhaps the most straightforward account of lower-back pain—and the one I covered in a post nine months ago—is what I think of as the “bottom-up” view. In this account, an injury in the lower back (such as a herniated disc) causes pain signals to be sent from the affected area through the nervous system to the brain, which is where the “experience” of pain takes place. One concrete mechanism by which this could take place is compression of the sciatic nerve (e.g., by a herniated disc).

The reason I call this “bottom-up” is that the brain’s role in the process is primarily as a recipient of neural signals (sometimes called afferent), which have themselves been issued from (roughly3) the site of injury. A further assumption, of course, is that the brain is the seat of consciousness: it’s where these electrochemical signals are somehow (mysteriously, magically) converted into experience. I’m not going to take issue with the latter assumption here, but alternatives to the former assumption are explored in the sections below.

Now, even though the bottom-up mechanisms are relatively straightforward, the prognosis is still complicated. Herniated discs can heal on their own, but it can take a while; moreover, nerve tissue itself can be damaged, and nerves are complicated beasts with unpredictable recovery rates. As I wrote before, doctors may suggest a variety of interventions to facilitate recovery, including: surgery (cutting out the offending disc tissue); epidural steroid injections (intended to reduce inflammation in the area); or simply emphasizing careful movement patterns and exercises (like McGill’s “Big 3”) that reinforce core stability and avoid further aggravating the tissue.

There’s a lot to like about this bottom-up view: notably, it’s quite parsimonious, and the posited mechanism clearly fits with broader physiological theories. Nothing mysterious—save the mystery of conscious experience itself, but that’s a problem with any theory—need be posited. The view might well be right, and should probably be viewed as the default hypothesis. Nevertheless, there are a few inconvenient observations that complicate the reality of the situation.

A modern mind-cure

Many people who’ve experienced back pain have probably heard of John Sarno. Sarno was a professor and physician who treated a number of patients with back pain. At some point, he came to believe that the cause of the pain for a nontrivial percentage of these people was, effectively, psychological: repressed emotions (especially anger and stress) caused the brain to “generate” the pain (e.g., by decreasing blood flow to certain muscles or nerves), possibly as a way to distract itself from those unpleasant emotions. He called this Tension Myositis Syndrome (or TMS). In Sarno’s view—presented in books like Healing Back Pain (1991) and in other writing—the best treatment for TMS was simply learning about the syndrome and accepting that the pain was, in some sense, “in your head”. Once you learned that the pain was a distraction from these unpleasant emotions, it no longer functioned as effectively as a distraction—so the pain would go away.

To some people, Sarno’s advice sounds like magical thinking. Yet many former sufferers of chronic pain swear by it: in fact, there’s a website called “Thank you, Dr. Sarno” dedicated to testimonials from people who suffered (often for years) from various forms of chronic pain (in their backs, their wrists, their knees, etc.) and who, upon reading the book, experienced a rapid and complete recovery that years of treatment in the medical system could not accomplish. I would advise any skeptical readers to check out some of these testimonials: not because I think it will convince you that Sarno is right, but because I think it’s useful for inculcating some epistemic humility about our ability to understand what’s really going on here.

I’ll be honest: I struggled quite a bit with Sarno’s view when I read his book, and I think that experience is not uncommon.

One common source of resistance comes from a preconception about what, exactly, Sarno is saying. These days, there’s a very understandable (and I think correct) stigma against asserting that pain is “in your head”. Attributing pain to deficits in a person’s psychology has historically been used to dismiss the very real medical complaints of many people. It’s pretty natural, I think, to equate the argument that “the pain is psychological in nature” with the argument that “the pain is not real”. This is aggravating to hear, especially when you’re in pain, and especially when you’ve already been told by a doctor that the cause is a herniated disc (which you have MRI evidence for)!

Of course, these two statements do not mean the same thing: the former is a claim about the underlying cause of an unpleasant experience, while the latter is, essentially, a claim of astonishing doubt (to paraphrase Elaine Scarry) about the pain’s existence. Sarno is saying the former and not the latter. His writing, in fact, is extremely compassionate towards those experiencing back pain, and he is careful to point out that TMS can and does afflict anyone.

It’s also worth noting—and this is me editorializing, not a quote from Sarno—that unless one is a dualist (which some may well be!), one probably thinks the mind is part of the body. Therefore, saying that the pain is “psychological” is, again, a statement about the causal origin of the pain, not necessarily a dismissal of the pain’s reality. This account could thus be nicely contrasted with the “bottom-up” account I described earlier: in the top-down view, the relevant signals originate (somehow) in the brain and are issued towards the muscles and nerves in the lower back (presumably via efferent pathways).

Another source of my resistance, however, is that the mechanism Sarno describes is (in my view) rather vague. The constructs Sarno invokes—e.g., the unconscious mind “needing” to distract itself from repressed anger—are all fairly high-level, and it’s not clear to me how they connect to the proposed mechanism (i.e., the brain “cutting off” blood flow to certain muscles or nerves). Unlike the bottom-up account I described earlier, the causal story doesn’t fit clearly with my understanding of physiology. That doesn’t mean it’s not right! Paradigms shift all the time, and I’d be among the first to say our understanding of physiology (especially as it relates to the mind) is woefully incomplete. But coherence with established mechanisms is still an important factor when it comes to evaluating a new theory.4

Still, efficacy is another crucial factor, and the success of Sarno’s work can’t easily be ignored. (Again, I urge skeptics to investigate websites such as Thank You, Dr. Sarno, or any of the many online testimonials by those whom Sarno’s advice has helped.) That’s certainly the most convincing aspect to me.5 Moreover, there are other presentations of the “top-down” view that I find more intuitively and mechanistically appealing, such as the idea of central sensitization. I’ll discuss those in a moment, but first, I want to briefly touch on the connection between Sarno’s view and what the psychologist William James called the mind-cure movement.

James associates “mind-cure” with a broader quasi-religious movement of the 19th century, which he dubs the religion of healthy-mindedness.6 The religion of healthy-mindedness takes as doctrine that to be human is to be made in God’s image, and thus to be whole and healthy is a natural state of affairs, and thus much of what we experience as “affliction” can be remedied by, more or less, believing (or realizing, depending on one’s perspective) that we are all part of some unified, divine presence—no part of which can really be unhealthy. James writes (bolding mine):

But the most characteristic feature of the mind-cure movement is an inspiration much more direct. The leaders in this faith have had an intuitive belief in the all-saving power of healthy-minded attitudes as such, in the conquering efficacy of courage, hope, and trust, and a correlative contempt for doubt, fear, worry, and all nervously precautionary states of mind. Their belief has in a general way been corroborated by the practical experience of their disciples; and this experience forms to-day a mass imposing in amount. (pg. 94-95)

In The Varieties of Religious Experience, James quotes at length from various individuals who suffered for years from various afflictions (digestive issues, chronic fatigue, aching joints, etc.), and who experienced almost immediate relief once they accepted the gospel of mind-cure (or “New Thought”, as it was called). The parallels to Sarno’s account of back pain are clear: in both cases, the majority of medical practitioners failed to alleviate the suffering of these individuals; and in both cases, the thing that did alleviate their suffering was the power of belief.

I draw this connection in order to illustrate that this way of thinking is, in fact, quite old. It also helps illustrate the epistemological puzzle at play: towards the end of the chapter on healthy-mindedness, James points out that the mind-cure movement inverts the standard scientific worldview of the time (and, indeed, of the current age), allowing for the possibility that consciousness can, contrary to the dictates of scientific thought, influence the world around us (bolding mine):

Now science, on the other hand, these positivists say, has proved that personality, so far from being an elementary force in nature, is but a passive resultant of the really elementary forces, physical, chemical, physiological, and psycho-physical, which are all impersonal and general in character…Follow out science’s conceptions practically, they will say, the conceptions that ignore personality altogether, and you will always be corroborated. The world is so made that all your expectations will be experientially verified so long, and only so long, as you keep the terms from which you infer them impersonal and universal.
But here we have mind-cure, with her diametrically opposite philosophy, setting up an exactly identical claim. Live as if I were true, she says, and every day will practically prove you right. That the controlling energies of nature are personal, that your own personal thoughts are forces, that the powers of the universe will directly respond to your individual appeals and needs, are propositions which your whole bodily and mental experience will verify. (pg. 119)

And as with Sarno’s advice for back pain, mind-cure enjoyed actual successes, which, as James points out, is its own source of “verification”.

Central sensitization and the role of attention

Above, I mentioned that there are other flavors of the top-down view. These approaches are broadly consonant with Sarno’s presentation, but tend to tone down the role of repressed emotions—instead emphasizing the twin roles of predictive processing and attention.

For instance, the YouTube channel “Pain Free You” (run by Dan Buglio) contains a number of videos describing what Buglio calls Perceived Danger Pain (or “PDP”). The notion of PDP begins with the observation that pain is, in some sense, the body’s response to perceived danger. In most cases, this danger is real, and the pain serves as a useful, often life-saving, signal to avoid some action that could cause harm to the body—this is why congenital analgesia, a genetic insensitivity to pain, can be so dangerous. But in some cases, the brain is effectively “misfiring”; it’s perceiving danger where there is, in fact, none. There are all sorts of reasons why this could happen: maybe bending in precisely this way in the past caused a flare-up, and the body has in some sense “remembered” that association, thus leading the brain to (unhelpfully, in this case) “generate” a pain signal in the absence of real danger.

Rachel Zoffness, a psychologist and professor at Stanford, makes a similar case. Her Pain Management Workbook opens with a “tale of two nails”: two stories that serve to highlight the role of belief—and in particular, predictive processing—in the creation of pain signals. Here’s the first of those stories excerpted from her article on Psychology Today:

In 1995, the British Medical Journal reported on a 29-year-old construction worker who’d suffered an accident: after jumping onto a plank, a 7-inch nail pierced his boot clear through to the other side (Fisher et al, 1995). In terrible pain, he was carted off to the ER and sedated with opioids. When the doctors removed his boot, they discovered a miracle: the nail had passed between his toes without penetrating his skin! There was zero damage to his foot: no blood, no puncture wound, not even a scratch. But make no mistake: despite the absence of injury, the pain was real. What happened?

Zoffness argues (I think convincingly) that this man’s brain, having processed the visual signal of a nail piercing his boot, perceived a threat to his safety, setting off a “cascade of biological and neurochemical processes”. That is, pain was generated as a response to the prediction that this event was causing the man harm.

She follows this up with another tale:

On the flip side, another construction worker (dangerous job, that!) was using a nail gun when it unexpectedly discharged, clocking him in the face (Dimsdale & Dantzer, 2007). Other than a mild toothache and a bruise under his jaw, he thought he’d escaped relatively unscathed. Six days later—six days of eating, sleeping, and going to work—he went to the dentist. Much to his surprise, an X-ray revealed a 4-inch nail that was embedded in his head! Indeed, the nail had pierced his cerebral cortex, putting him in potentially grave danger. However, because contextual cues failed to put his brain on high alert, his pain system remained quiet—despite actual bodily harm and the need for medical intervention (#fail).

She draws several lessons from this pair of stories. The first is that the experience of pain is not 100% reliable as an indicator of tissue damage. To borrow terminology from machine learning, we might cast this in terms of precision and recall: there are cases in which tissue damage can occur in the absence of a pain signal (failed “recall”), and there are cases in which a pain signal can occur in the absence of tissue damage (failed “precision”). This doesn’t entail that pain is a poor indicator of damage—precision and recall might well be high overall—but merely that the connection is not 1-1.
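To make the precision and recall framing concrete, here is a minimal sketch in Python that treats the pain signal as a "prediction" and tissue damage as the "ground truth". The counts are entirely made up for illustration; they are not drawn from Zoffness or from any study.

```python
def precision_recall(cases):
    """Compute precision and recall of pain as a detector of tissue damage.

    `cases` is a list of (pain_signal, tissue_damage) boolean pairs.
    """
    true_pos = sum(1 for pain, damage in cases if pain and damage)
    false_pos = sum(1 for pain, damage in cases if pain and not damage)
    false_neg = sum(1 for pain, damage in cases if not pain and damage)

    # Precision: of all the times pain fired, how often was there damage?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else float("nan")
    # Recall: of all the times there was damage, how often did pain fire?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else float("nan")
    return precision, recall


# Hypothetical cases (made-up counts).
cases = (
    [(True, True)] * 8       # pain and damage: the signal works as intended
    + [(True, False)] * 1    # pain without damage, like the nail through the boot
    + [(False, True)] * 1    # damage without pain, like the nail-gun accident
    + [(False, False)] * 10  # no pain and no damage
)

precision, recall = precision_recall(cases)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy example both numbers are high but below 1, which is the point: the two nail stories correspond to the single false positive and the single false negative, and neither makes pain a useless signal.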

The second lesson is that our experience of pain is very much affected by context: our emotional state, who we’re with, what we’re thinking about. Attention to pain can amplify the signal; attention to something else can “turn down the volume” of pain, so to speak—or even reinterpret the signal as something more positively valenced. This is why Zoffness emphasizes the metaphor of a “pain dial”. Some activities can increase a person’s attention to pain (e.g., lying in bed and thinking about every little twinge), while others can decrease it (e.g., talking with a good friend). Her pain management workbook encourages readers to identify the parameters of their own pain dial: which activities turn up your pain, and which turn it down?

The idea that people can develop a heightened sensitivity to pain, and thus experience pain signals in the absence of direct tissue damage, is sometimes called central sensitization. Central sensitization is one of the proposed mechanisms underlying “nociplastic pain”. Nociplastic pain can be contrasted with nociceptive pain (caused by direct stimulation of pain receptors) and neuropathic pain (caused by some kind of lesion or disorder in the nervous system). We still don’t understand exactly how central sensitization works, but it’s sometimes presented as a form of “dysregulation” in how the brain creates and processes pain signals: somewhere along the line, something has gone haywire in the nervous system, and signals no longer mean quite what they used to. In this way it seems almost akin to the dysregulation observed in autoimmune diseases, where a system evolved to protect the body somehow—for hard-to-understand reasons—responds in ways that, paradoxically, hurt the body.

At this point, a couple questions naturally arise. First, is any of this actually right? And second, if it is right, where do you go from here?

A tentative synthesis

Personally, I think there’s a lot to like about each of the views I’ve presented here. As a scientist, I tend to be attracted to views that emphasize concrete, physical mechanisms; that’s the appeal of the straightforward “bottom-up” account I started with. But I’m also generally predisposed towards the idea that things are rarely as simple as we believe them to be, and that learning how things “really” are is quite challenging—that’s a large part of my ongoing interest in epistemology. That part of me is most compelled by Zoffness’s presentation: namely, that attention and nervous system dysregulation can play some nontrivial role in the experience of back pain. This is what I mean when I say I’ve arrived at a “tentative synthesis” between these views.

Of course, this does not tell us anything about the relative balance of these causes. Is it mostly bottom-up, with some minor role for central sensitization? Is it 50/50? Or is it mostly in our heads? My sense is that the balance varies considerably across individuals, and also varies considerably across time within an individual. For example, a plausible scenario might be that the initial stages of pain are caused by actual tissue damage (say, a herniated disc compressing a nerve); in some individuals, this takes a while to heal, and the body’s adjustments to the pain—both in terms of neural reorganization and in terms of actual changes to posture or gait—can “lock in” the pain, turning it from something acute into something chronic; this is when, according to Zoffness, the signal (pain) is decoupled from the thing it’s taken to indicate (damage).

How do you know which situation you’re in?

You don’t, of course! That’s partly why this is so hard. My current sense is that you have to take it gradually, using a process of trial-and-error.7 For example, I’ve been working for months now with an excellent rehab trainer (Donald Mull) at Kinetic Impact, and much of that training revolves around incrementally reintroducing movement patterns under safe conditions (with an emphasis on functional movements, such as deadlifts), then observing the effect on the body over the next few days. The logic is so simple it’s profound: if something makes the pain worse, back off; if it doesn’t make it worse, or it makes it better, explore it further. Reintroducing movement patterns can be scary for someone in pain, which is one reason why it’s so helpful to do it under the guidance of a trainer who’s developed hands-on expertise in shepherding people through this process and, importantly, can calibrate their advice to each individual person. It also involves learning how to listen to your body—something that can be difficult if you’ve been experiencing chronic pain.

One perspective I’ve appreciated on the problem of central sensitization comes from Brendan Backstrom, the creator of the YouTube channel (and rehab program) Low-Back Ability. Backstrom argues that chronic back pain comes in part from a vicious cycle: we injure our backs, so we avoid engaging our backs at all; this makes our backs weak and extremely sensitive, which leads to triggering injuries more easily in the future. He suggests that we need to gradually “build evidence” that our backs can work again. Opinions will likely differ on the best way to do this (his approach is to begin with isometric holds on a back extension machine), but the underlying philosophy resonates deeply with me: the brain and the rest of the body need to be gradually convinced that they can move normally without fear of injury, and that requires striking the right balance between reintroducing movement patterns (as I wrote above) and not pushing through the pain (lest you inadvertently reinforce a particular “story” of injury).

The argument that back pain can sometimes be caused by protective mechanisms in the absence of actual tissue damage can be pretty frustrating to hear, or even hard to believe. Why would the body have evolved mechanisms to cause pain when it doesn’t need to? To be clear, I’m not asserting that this is true, or even that it accounts for the majority of back pain (including my own). But I don’t think the retort that it’s implausible holds water: the body, unfortunately, “misfires” in all sorts of ways. You can think of this as the misapplication of a rule that was formed, at some point, for good reasons—it’s just hard to be at the receiving end of that misapplication.

The epistemological challenge

I started this post with the observation that getting to the root cause of chronic back pain is a challenging epistemological problem. The main reason for this difficulty is that pain is experiential and we don’t have a good theory of how conscious experience arises from physical mechanisms. The body is a very complicated system, and the way in which it relates to the mind is even more complicated.

The bottom-up view of back pain does seem like a good starting point, and I still think there’s a lot to be said for it: it’s entirely possible that the vast majority of lower back pain cases can be explained by the fact that herniated discs sometimes compress the sciatic nerve, and that this, unfortunately, can take quite some time to heal. But it’s also possible that there’s some kind of interaction with how the nervous system responds to this: that the brain does, in some cases, turn up (or down) the “pain dial”, and this can affect recovery. Certainly, this insight seems to help some people, and that’s worth something.

This challenge connects more generally to my interest in the limits of what we can know about ourselves and about the world around us. Sitting as it does at the intersection of the experiential and the physical, pain represents a particularly interesting and important epistemological niche.

I’ll close with what I think is a fitting quote from William James, again from that chapter on healthy-mindedness (bolding mine):

I believe that the claims of the sectarian scientist are, to say the least, premature. The experiences which we have been studying during this hour (and a great many other kinds of religious experiences are like them) plainly show the universe to be a more many-sided affair than any sect, even the scientific sect, allows for. What, in the end, are all our verifications but experiences that agree with more or less isolated systems of ideas (conceptual systems) that our minds have framed? But why in the name of common sense need we assume that only one such system of ideas can be true? The obvious outcome of our total experience is that the world can be handled according to many systems of ideas, and is so handled by different men, and will each time give some characteristic kind of profit, for which he cares, to the handler, while at the same time some other kind of profit has to be omitted or postponed. Science gives to all of us telegraphy, electric lighting, and diagnosis, and succeeds in preventing and curing a certain amount of disease. Religion in the shape of mind-cure gives to some of us serenity, moral poise, and happiness, and prevents certain forms of disease as well as science does, or even better in a certain class of persons. Evidently, then, the science and the religion are both of them genuine keys for unlocking the world’s treasure-house to him who can use either of them practically. Just as evidently neither is exhaustive or exclusive of the other’s simultaneous use. And why, after all, may not the world be so complex as to consist of many interpenetrating spheres of reality, which we can thus approach in alternation by using different conceptions and assuming different attitudes, just as mathematicians handle the same numerical and spatial facts by geometry, by analytical geometry, by algebra, by the calculus, or by quaternions, and each time come out right? (pg. 122-123)

According to this article, it’s considered the number one cause of disability globally. Some estimates for lifetime prevalence (i.e., the rate of people experiencing it at least once) are as high as 80%.

One need only look as far as the comments in nearly any YouTube video about exercises for dealing with chronic back pain to see evidence for this claim.

With sciatica, the pain radiates down the leg, so technically extends beyond the original site of injury.

There’s also the problem that, like many Freudian theories, TMS is challenging to falsify. If someone says they tried to follow the “prescription” (i.e., believing their back pain is caused by TMS, not by a disc herniation) but it didn’t affect their back pain, it’s always possible they didn’t really believe sufficiently. Of course, many back pain treatments have mixed efficacy, but at least with something like surgery, we can agree whether the prescription was followed and thus evaluate the outcome as a function of the prescription.

Other arguments, like the fact that herniations don’t always cause pain and therefore the pain is not attributable to the herniation, are less convincing: it seems intuitively quite plausible to me that herniations would vary in size and scope, and further, that individuals would vary in anatomy (e.g., the precise location of the sciatic nerve), which seems like more than enough to reconcile the bottom-up view with the observation that many people have herniations without back pain.

It’s worth noting that James follows this discussion with an alternative route to conversion, namely the sick soul.

Sarno’s recommendation is, effectively, to stop worrying about it and get back as soon as you can to your old activities. But if you’re in the throes of back pain, it’s hard to imagine going out and playing tennis. Moreover, if your pain is related to tissue damage, you might simply injure yourself further.

The scientific vocation in a disenchanted world

Sean Trott — Wed, 01 Oct 2025 18:53:35 GMT

As I’ve mentioned before, I occasionally write reviews of books or other pieces of media over at The Leaky Margin. For the most part, I don’t publish those reviews here. But in some cases, when they touch on the kinds of topics I write about on the Counterfactual, I do.

I have an ongoing interest in disenchantment: the notion that many aspects of society have undergone a gradual stripping away of belief in the world of magic, spirits, and other entities that lived alongside us and sometimes intervened in our affairs. I was first introduced to this idea in Meghan O’Gieblyn’s excellent God, Human, Animal, Machine, which launched me on a path to learn more about how we (humans) understand ourselves in relation to the universe around us and to the realm of possibility. I am interested, broadly, in how disenchantment came about, whether it is as far-reaching or definitive as sometimes claimed (some disagree), and whether contemporary technologies like Artificial Intelligence represent a potential “re-enchantment” or are simply another step towards the logic of mechanization.

All that, though, is secondary to the purpose of this post. Although I first encountered disenchantment in God, Human, Animal, Machine, the idea is typically associated with the sociologist Max Weber. Weber is known for many things, such as his essay on The Protestant Ethic and the Spirit of Capitalism, and more generally, for playing a pivotal role in the development of sociological theory. As far as I know, he first discussed disenchantment in a 1917 lecture entitled “Science as a Vocation”.

I wanted to understand the concept at its “source”, so to speak, so about two years ago, I read (or attempted to read) Science as a Vocation. I struggled to get through it, even though it is very short. Earlier this year, I revisited it, and for whatever reason, reading it felt much more fluid—perhaps because I had already familiarized myself with more of the underlying ideas. Then, just the other week, I picked it up again one evening and found myself marveling at the clarity and concision of his prose and the breadth of his ideas.

Weber’s lecture discusses disenchantment, but its theme is really about what it takes to have and to follow a vocation for science, particularly against the backdrop of changes like the bureaucratization of the university and the disenchantment of society. It is striking to me how so many of the issues that Weber describes in 1917—more than a century ago—are relevant today. In the spirit of a leaky margin, I wanted to use this space to jot down some of what stuck out on this reread.

Weber begins by “pedantically” (his own words) enumerating the details of university life: the conditions under which science is conducted and scientific knowledge is produced. As a work of sociology, this section is both illuminating and—as I wrote above—remarkable in how well it matches my understanding of the current conditions of academic science.

First, he argues that while Germany and America have thus far maintained distinct institutional structures, the German university system was at that time (1917) “moving in the same direction as in America”:

The major institutes of science and medicine are “state-capitalist” enterprises. They cannot be administered without funding on a huge scale. So we see the situation that exists wherever capitalist operations are to be found, namely the “separation of the worker from the means of production”. The worker, in this instance the assistant, is dependent on the resources that are provided by the state. (pg. 3)

Weber believes that this industrialization of university life will eventually spread to even his own discipline, “where the artisan is still the owner of his own resources” (pg. 4), and that while the “technical advantages” of this paradigm cannot be doubted, it does bring with it a fundamentally different “spirit”:

Both in essence and appearance, the old constitution of the university has become a fiction. (pg. 4)

My understanding of what Weber is saying here is that the university (and the scientific process) has undergone many of the same changes that other aspects of society underwent with “changes to scale”. At a certain size, an institution must become a kind of bureaucratic machine. This machine may function efficiently, but a consequence of this mechanization is that it turns individuals into “parts” of the system and alienates them from their labor—in much the same way that an artisan craftsman was displaced by the assembly line.

One of the only features that has survived these changes (in Weber’s view) is the element of chance in determining success in a scientific career. This element of chance manifests itself partly in the conflation of duties that accompanies professorial life: a professor must not only produce scientific knowledge (research) but also teach, ideally large bodies of students. As Weber points out (and as any university student could report) the abilities to conduct research well and to teach well do not necessarily correlate:

Every young man who feels he has a vocation as a scholar must be aware that the task awaiting him has a dual aspect. He must be properly qualified not only as a scholar, but also as a teacher. And these two things are by no means identical. A man can be both an outstanding scholar and an execrable teacher. (pg. 5-6)

Here, Weber also argues that the metrics for evaluating teaching quality at the time (e.g., the number of students attending a lecture) are quite obviously not an indication of how effectively an instructor is teaching those students:

But the question of whether an academic is a good teacher or a bad one is answered with reference to the frequency with which students honor him with their presence.1 However, it is also true that the fact that students flock to a teacher is determined largely by purely extraneous factors such as his personality or even his tone of voice—to a degree that might scarcely be thought possible. (pg. 6)

A further element of arbitrariness enters into the nature of success:

Do you believe that you can bear to see one mediocrity after another being promoted over your head year after year, without your becoming embittered and warped? Needless to say, you always receive the same answer: of course, I live only for my “vocation”—but I, at least, have found only a handful of people who have survived this process without injury to their personality. (pg. 7)

Weber doesn’t discuss these issues in a tone of complaint. He is dispassionate and descriptive, and his point is that the conditions of university life and a scientific career are, in many ways, not as externally rewarding as one might assume.2 Scientists, then, must harbor an inner vocation for science that allows them to weather these conditions and avoid resentments.

Weber then turns to the nature of scientific work itself.

In his view, a vocation for science requires a capacity and passion for intellectual specialization. Again, I think he’s largely right. While cross-field collaboration is exciting and sometimes even helpful, making a scientific contribution demands an ability to narrow one’s focus to a particular research question:3

And anyone who lacks the ability to don blinkers for once and to convince himself that the destiny of his soul depends upon whether he is right to make precisely this conjecture and no other at this point in his manuscript should keep well away from science. He will never be able to submit to what we may call the “experience” of science. In the absence of this strange intoxication that outsiders greet with a pitying smile, without this passion, this conviction that “millennia had to pass before you were born, and millennia more must wait in silence” to see if your conjecture will be confirmed—without this you do not possess this vocation for science and should turn your hand to something else. For nothing has any value for a human being as a human being unless he can pursue it with passion. (pg. 8)

Perhaps cast in stronger terms than I’d put it, but fundamentally, I agree that good science depends in part on the ability to focus with almost obsessive detail on exactly which claim can be made from exactly which pieces of evidence.

Passion, though, is not enough. Science also relies on inspiration. Here, Weber argues that while the notion of reducing science to a series of rote actions, much as in a factory, has become fashionable, it doesn’t stand up to scrutiny: for one, in both laboratories and factories, it is “necessary for something, and the right thing at that, to occur to people if they are to achieve anything worthwhile” (pg. 8).4 This “occurring of the right thing” is, essentially, inspiration. The problem is that inspiration cannot be summoned on demand: the best ideas might occur to us while smoking a cigar on the sofa or going for an evening walk, and even then, we cannot engage in those activities with any certainty that inspiration will strike.

Moreover, a necessary precondition for inspiration is hard work:

At any rate, ideas come when they are least expected, rather than while you are racking your brains at your desk. But by the same token, they would not have made their appearance if we had not spent many hours pondering at our desks or brooding passionately over the problems facing us. (pg. 9)

This section is, in my view, one of Weber’s strongest insights, and it reflects my own experience of scientific practice.

In order for me to make real progress on something, I have to care about it. “Caring about it” manifests in different ways, but one such way is obsession: I ruminate on ideas, turning them over and over again in my mind and inspecting them from various angles. Sometimes this work involves reading or analyzing data, but much of it is intangible and occurs entirely in my head at various hours of the day or night. It is also, for the most part, lacking in inspiration; but when inspiration does strike, it only does so when I have laid the appropriate foundation.

Science involves progress. Even if one rejects a strictly teleological view of the history of science, it is not particularly controversial to assert that successive generations of scientists often (though not always) construct “better” (more accurate, more useful, more parsimonious) models of the phenomena they are engaged in trying to explain. A corollary of this is that success, in some sense, requires a kind of obsolescence (bolding mine):

Contrast that5 with the realm of science, where we all know that what we have achieved will be obsolete in ten, twenty, or fifty years. That is the fate, indeed, that is the very meaning of scientific work…Every scientific “fulfillment” gives birth to new “questions” and cries out to be surpassed and rendered obsolete. Everyone who wishes to serve science has to resign himself to this. The products of science can undoubtedly remain important for a long time, as “objects of pleasure” because of their artistic qualities, or as a means of training others in scientific work. But we must repeat: to be superseded scientifically is not simply our fate but our goal. We cannot work without living in hope that others will advance beyond us. In principle, this progress is infinite. (pg. 11)

In science, then, one’s “legacy” is in having contributed some number of steps to the onward march of a particular theoretical paradigm. Not only that, but we cannot assume that we (broadly construed) will ever “arrive” at some “destination”; there are only new questions to be asked.

There is something noble in how this all sounds, but it is also at odds with a certain desire to be remembered as more than a link on an infinite chain. What, Weber asks, is the point of doing something that cannot be finished?

One solution is pragmatic: if our research enables some kind of technical achievement (say, building a faster train), we can take solace in having had some practical benefit; but most science does not have a direct application, and many scientists, if pressed—or even if not—would admit that something other than practical application drives their interest in their work.

This is where Weber turns at last to disenchantment.6 According to Weber, a crucial effect of scientific progress over the last few centuries is that an increasing number of people operate according to the assumption that the world is in some sense fundamentally comprehensible. We do not assume that we understand everything, or even that for every possible phenomenon there is someone that already understands it, but rather, we have a faith—my word, not Weber’s—that everything could be understood:

It is the knowledge or the conviction that if only we wished to understand them we could do so at any time. It means that in principle, then, we are not ruled by mysterious, unpredictable forces, but that, on the contrary, we can in principle control everything by means of calculation. That in turn means the disenchantment of the world. (pg. 12-13)

Leaning on Tolstoy’s Confession, Weber continues by suggesting that disenchantment robs even death of intrinsic meaning. In an enchanted world, our life represents part of some grander “cosmic tale”, and there is some sense in which our death represents a kind of “arrival” at a destination (the exact nature of which depends on the particularities of the beliefs involved). In a disenchanted world, there is no point of arrival: consequently, we can become tired of life but cannot be fulfilled by it.

The same is true of scientific progress. In an enchanted (or even simply Christian) world, science was viewed as a path to understanding God’s intentions: nature was the “book” of God’s design, and science (especially naturalism) was the means by which we could read it. This is the paradigm that gave us quotations such as these:

“I bring you the proof of God’s providence in the anatomy of a louse.” (Jan Swammerdam, 1658).

In a secular, disenchanted world, there are no intentions to be read off the book of nature, so this cannot be the source of meaning derived from science:

Who imagines nowadays that a knowledge of astronomy or biology or physics or chemistry could teach us anything about the meaning of the world? (pg. 16)

Weber then considers the possibility that science could deliver answers to questions about the meaning of life and how to achieve happiness. Tolstoy dismissed this idea, as does Weber. Referencing Thus Spoke Zarathustra, Weber writes:

But after Nietzsche’s annihilating criticism of those “last men” “who have discovered happiness”, I can probably ignore this completely. After all, who believes it—apart from some overgrown children in their professorial chairs or editorial offices? (pg. 17)

Science, then, cannot tell us what to do or how to live. It cannot even answer the fundamental presuppositions that drive scientific investigation itself: physics and chemistry can give us knowledge about the laws governing the world, but they cannot tell us why it is worth knowing these things. Similarly, practical sciences like modern medicine presuppose that it is worth preserving lives and reducing suffering—a supposition that most of us would hopefully agree with, but which cannot itself be demonstrated scientifically. Underneath all realms of scientific investigation are assumptions about what is worth knowing or doing in the world—values, in other words—and these values are not themselves amenable to the scientific method.7

What, then, is the value of science and a scientific vocation?

One contribution is something like a set of epistemic techniques:

…science provides methods of thought, the tools of the trade, and the training needed to make use of them. (pg. 25)

Perhaps this seems prosaic, but I’m not so sure. It’s similar to what I’ve argued elsewhere about the possible value of applying insights from Cognitive Science to the study of Large Language Models (LLMs), which represent a kind of “moving target”: at minimum, Cognitive Science provides a set of theoretical tools and methods for thinking about how to approach a research question—even if the specific inferences or insights from a particular research study don’t generalize to future LLMs.

This points to the more substantive contribution of science, which is clarity about how different courses of action relate to the values we hold and wish to achieve—which, again, are not themselves determined by science. Weber writes:

If you take up this or that attitude, the lessons of science are that you must apply such and such means in order to convert your beliefs into a reality. These means may well turn out to be of a kind that you feel compelled to reject. You will then be forced to choose between the end and the inevitable means. Does the end “justify” these means or not? The teacher can demonstrate to you the necessity of this choice. As long as he wishes to remain a teacher, and not turn into a demagogue, he can do no more. Of course, he can say to you that if you wish to achieve this or that end, you will have to put up with certain accompanying consequences that experience tells us are bound to make their appearance…To put it metaphorically, if you choose this particular standpoint, you will be serving this particular god and will give offense to every other god. (pg. 26)

Science can help demonstrate which conclusions follow from which premises, and thus what the likely consequences will be of adopting various attitudes or enacting various behaviors. This has practical relevance, of course, but also moral relevance: if we wish to uphold certain values, it is helpful to know how taking certain positions or doing certain things relates to those values. The scientific process cannot answer these questions definitively—and it certainly cannot tell us what those values should be in the first place—but it can often provide additional clarity, even if that clarity amounts, in the end, to more uncertainty than we started out with.

This portrait of the scientific enterprise resonates deeply with me. Clarity is not everything, but it is not nothing. It is what I hope to achieve through my research and my writing. I do not imagine that I can firmly and unequivocally draw this or that conclusion, but I hope that I can at least articulate the conditions under which this or that conclusion might hold, describe how the existing pieces of evidence meet or don’t meet those conditions, and suggest how we might go about collecting evidence to gain further clarity. It is, perhaps, a less exalted view of a scientific vocation than is held by some, but I think that is both appropriate and much more sustainable. As Weber argues, science cannot tell us the meaning of life, and even its strongest supporters will be disappointed if scientists pretend otherwise and fail to deliver.

Interestingly, I suspect some would also argue that many contemporary research universities suffer from the inverse of what Weber describes: research is often felt to be incentivized more than quality teaching. Unfortunately, I do think this is true, though there are ongoing attempts to remedy this throughout higher education, such as the creation of tenure-track lines for teaching-focused professors. At the same time, Weber’s insight about class sizes still holds true, just in a slightly modified sense. For a number of reasons (many of them well-intentioned), instructors are encouraged to teach larger and larger classes, such that even the act of teaching itself has undergone the kind of industrialization Weber describes with respect to scientific research: instructors speak of “managing” or “administering” a class with large (though sometimes insufficiently so) “teaching teams” divided into various “roles”.

I think Weber is both right in an absolute sense and wrong in a relative sense, at least with respect to certain academic positions. The role of a tenured scientist is perhaps less artisanal than it used to be or than people imagine, but my understanding is that it still affords relatively more autonomy than most other “professionalized” fields and certainly than most occupations one could conceivably have in society. Of course, many researchers occupy more precarious, non-tenured positions, where the “external conditions” of science do in some cases (not all) seem more challenging.

Paul Bloom makes a similar case in his recent essay.

Weber also makes the point that industrialists, too, depend on inspiration:

A businessman or a big industrialist without “commercial imagination”, that is to say, without inspiration or brilliant ideas will continue his whole life long to be someone who ought rather to be a clerk or a technical official.

In other words, academics—and artists, for that matter—would do well to remember that they’re not the only ones who depend to some extent on the mystery of sudden insight.

Here, Weber is drawing a contrast with art, where (arguably) a “product” does not become obsolete in the same way as it might in a domain like science: we still marvel over ancient works of art and do not feel that more modern works have somehow “displaced” them.

The next few pages read as a kind of précis for later works such as Charles Taylor’s A Secular Age, and serve as an excellent example of the kind of clarity and concision that Weber achieves in this lecture.

This is also why, Weber argues, there should be a distinction between the academic analysis of political institutions and party politics themselves, and why lecturers should be very careful not to proselytize from the podium:

And if he feels he has a vocation to intervene in the conflict of worldviews and party opinions, let him do so outside in the marketplace of life, in the press, at public meetings, in associations, or wherever he wishes. But it is all too easy for him to display the courage of his convictions in the presence of people who are condemned to silence even though they may well think differently from him. (pg. 25)

Further, as Weber points out later on, science can provide clarity about the actions that might help us achieve our values—but those values are not themselves derived from science.

Generalizability in LLM-ology, revisited

Sean Trott — Tue, 30 Sep 2025 03:14:53 GMT

Much of my research focuses on what I call LLM-ology: the scientific study of the behavioral capacities of large language models (LLMs), as well as the mechanisms underpinning those behaviors. That is, I’m interested in what LLMs can and can’t do, how they behave under different conditions, and which internal mechanisms give rise to the behaviors we observe.

But as I’ve written before, one of the central challenges in this relatively novel field is the problem of generalizability. Any given empirical study of LLMs must necessarily focus on a sample of model instances, or even a single model instance (e.g., GPT-2). It’s unclear whether and to what extent the findings obtained on that single instance generalize to other model instances we haven’t studied (say, GPT-3).1 We don’t even have a set of coherent principles we can use to guide decisions about which findings might generalize and which might not.

When it comes to mechanistic interpretability research, this challenge is compounded by another: not only is it unclear which “reference class” a given model instance belongs to, it’s also unclear what it means to assert that a given mechanism (such as a circuit) can be found in multiple models. Concretely, when can we say that two models have the “same circuit”? What theoretical parameters can we use to define circuit identity or similarity?

This is the problem I set out to characterize (and partially address) in a recent paper (now available on arXiv here) that’s just been accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025. Below, I sketch out some of the main arguments in that paper.

The problem of circuit identity, briefly explained

To understand the problem, we have to understand what it means to define and identify a circuit or other putative “mechanisms” in an LLM.

Suppose you’re interested in which components of a model implement a particular function: say, tracking the modifiers (adjectives, prepositions, etc.) attached to a particular noun. Take the following sentence:

The quick brown fox jumped over the lazy dog.

We might expect that certain components of a language model have learned to specialize in representing noun phrases, such that “quick” and “brown” are attached to “fox”, while “lazy” is attached to “dog”. This problem could, in principle, be solved by any combination of model components: static word embeddings, attention heads, feed-forward layers, residual stream, and more.

For the sake of simplicity, let’s assume that you (an interpretability researcher) decide to focus on attention heads. Using a range of interpretability techniques (probing, ablation, etc.), you identify two candidate attention heads in a specific model (say, Pythia-14m) that seem to be responsible for this function. The model has six layers with four heads in each layer, and the heads you’ve identified are both in the same layer (say, layer 4). You’re pretty confident that these are indeed “noun phrase heads”2, so you write up a paper concluding the following:

We identified two “noun phrase heads” in Pythia-14m, both in layer 4: [L4, H1] and [L4, H2].

The paper is sent out for review, and it comes across my desk. Here is my central question: what is the nature of the scientific claim (if any) that can be generalized to other model instances from this empirical result?

To make this concrete: if we retrained Pythia-14m with a different random seed (or on a different dataset, etc.), would you expect the same heads to implement the same function? What if we tested an entirely different model (say, Pythia-70m) or a model trained on a language other than English?

There is, in fact, basically no reason to expect that the exact same head indices should implement the same function in a different model instance. The index of a given head within a layer is arbitrary: head 2 is not “closer” to head 1 than head 4 is in any meaningful sense (except in some modified architectures). Thus, the narrow claim that heads [L4, H1] and [L4, H2] are “noun phrase heads” cannot reliably be generalized to other model instances.
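
To make the arbitrariness of head indices concrete, here is a minimal sketch in Python (NumPy). It is a toy attention layer with random weights, not Pythia-14m itself: the four heads match the example above, but the hidden size of 16, the causal mask, and the omission of biases, layer norm, and positional embeddings are all simplifying assumptions of mine, and the function names (`multi_head_attention`, `ablate`) are hypothetical. The sketch relabels the heads and checks two things: the layer's output is unchanged, and ablating "head 1" no longer removes the same computation.

```python
import numpy as np

N_HEADS, D_MODEL = 4, 16   # four heads as in the example; D_MODEL is an illustrative assumption
D_HEAD = D_MODEL // N_HEADS


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Toy causal self-attention.

    x: (seq, d_model); w_q, w_k, w_v: (heads, d_model, d_head);
    w_o: (heads, d_head, d_model). Each head's contribution is summed.
    """
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    out = np.zeros_like(x)
    for h in range(w_q.shape[0]):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores = np.where(future, -np.inf, scores)  # block attention to later tokens
        out += softmax(scores) @ v @ w_o[h]
    return out


def ablate(w_o, head):
    """Knock out one head by zeroing its slice of the output projection."""
    w = w_o.copy()
    w[head] = 0.0
    return w


rng = np.random.default_rng(0)
x = rng.normal(size=(9, D_MODEL))  # nine token positions with random "embeddings"
w_q, w_k, w_v = (rng.normal(size=(N_HEADS, D_MODEL, D_HEAD)) for _ in range(3))
w_o = rng.normal(size=(N_HEADS, D_HEAD, D_MODEL))

# Relabel the heads: position j in the relabeled layer holds original head perm[j].
perm = np.array([2, 0, 3, 1])
relabeled = (w_q[perm], w_k[perm], w_v[perm])

# 1. Relabeling heads leaves the layer's output unchanged (up to float rounding).
same_output = np.allclose(
    multi_head_attention(x, w_q, w_k, w_v, w_o),
    multi_head_attention(x, *relabeled, w_o[perm]),
)

# 2. Ablating "head 1" by index removes different computations in the two layers.
ablated_original = multi_head_attention(x, w_q, w_k, w_v, ablate(w_o, 1))
ablated_same_index = multi_head_attention(x, *relabeled, ablate(w_o[perm], 1))
index_transfers = np.allclose(ablated_original, ablated_same_index)

# 3. Ablating the position that actually holds original head 1 does match.
matching_position = int(np.flatnonzero(perm == 1)[0])
ablated_matched = multi_head_attention(
    x, *relabeled, ablate(w_o[perm], matching_position)
)
function_transfers = np.allclose(ablated_original, ablated_matched)

print(same_output, index_transfers, function_transfers)
```

This is the easiest possible case, since the two "models" here are the same network with its heads shuffled. Two independently trained model instances will differ in far more than head order, so the sketch only illustrates why an index like [L4, H1] carries no information on its own, not how to find the corresponding head in practice.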

But is there any broader claim that can be generalized? That is, in what sense might we say that two different circuits in two different models do the “same thing”?

Axes of potential correspondence

This problem is not unique to mechanistic interpretability. Neuroscientists must also contend with the fact that every brain is unique (even within the narrow set of species their research might focus on). Nonetheless, they work within a set of assumptions (informed by empirical observations) about potential axes of correspondence between the brains they study, which allow them to hopefully draw broader conclusions about “rodent brains” or “human brains” as a general category. For example, researchers might taxonomize brain cells according to their gene expression, their anatomical connectivity patterns, their developmental profile, their morphology, their firing rate, and more. Similarly, mesoscopic and macroscopic structures (such as structures or gross brain regions) might be defined in terms of their putative function, their relative position within the brain, and their relationship to other neural components.

These axes of correspondence are not perfect, and they’re also under constant revision. But they serve as useful organizing principles for making sense of findings and guiding future research.

What might “axes of correspondence” look like for the study of circuits in LLMs? In the paper, I proposed five:

Functional. This is the simplest, most important, and also perhaps the most intuitive axis. Two circuits in two different models can be defined as “doing the same thing” to the extent that they meet the same functional criteria: for instance, knocking out each circuit results in similar behavioral changes in each model instance.
Developmental. A growing body of research is focused on “training dynamics”, i.e., the developmental trajectory of specific model behaviors and mechanisms throughout pretraining. Just as biological organisms reach certain “developmental milestones” at systematic points during development, we might expect certain behaviors or circuits to “come online” at similar points in training. Intuitively, two circuits in two different models seem more similar if they not only meet the same functional criteria, but also arise at relatively similar points during training.
Positional. Earlier, I mentioned that the index of particular attention heads within a layer is arbitrary—but the layer itself is not. We might expect some systematicity across models in terms of where certain mechanisms develop: for instance, perhaps earlier layers track more superficial relationships between tokens, while later layers represent more “abstract” relationships. In models of different sizes, we can further ask whether the absolute or relative position is what matters most.
Relational. In biological neural networks, circuits are defined not only in terms of their function, but in terms of their compositional structure and their relationship to other circuits. Similarly, we might expect the “same” circuit across model instances to exhibit similar internal structure (e.g., roughly analogous networks of attention heads occupying similar relative positions), as well as similar relationships to other model components (e.g., perhaps those “noun phrase heads” always connect to simpler “previous token heads”).
Configurational. Finally, we might expect two circuits doing the “same thing” to share more fundamental properties as well, such as their relative position in weight-space. There’s still much we don’t know about why certain attention heads end up specializing in certain ways, or why exactly a certain configuration of weights corresponds to a given function, but my intuition is that there’s some comprehensible systematicity here.

Importantly, this list is not meant to be exhaustive, nor is each axis meant to be equally important. My goal in the paper (and here) was to propose a set of plausible organizing principles that interpretability researchers can use—I expect (and hope) that researchers will build on these principles and test them more rigorously.

One utility of these principles is that they allow us to assess various circuits that have already been identified in terms of their axes of correspondence across model instances. For example, induction circuits—which track previous instances of a token, then copy information about the token that occurred subsequently to that previous instance—appear reliably across a range of model instances, meeting the same functional criteria and also, intriguingly, emerging at roughly similar points during training at roughly similar relative positions in each model. Further, induction circuits are partially defined in terms of their own internal relational structure.3

These axes of correspondence can also be used to motivate future research. For example, a researcher interested in a particular mechanism (say, noun phrase heads) might explicitly set out to investigate that mechanism across a range of model instances with respect to these axes of correspondence. That’s what I did in the paper, focusing on a very simple kind of attention head: 1-back heads (or “previous token heads”).

Shared developmental trajectories of 1-back heads

1-back attention heads are among the simplest mechanisms one can imagine in a transformer language model: their main “job” is simply to direct attention from some target token to the token immediately preceding that token.

Here, their simplicity was actually a benefit, since my goal was to stress-test the conceptual framework I’d proposed for identifying commonalities across model instances. Because 1-back heads are so simple, I was pretty confident I could find them (or something like them) across a range of model instances, even very small models. That meant that I could compare the properties of 1-back heads across all models in which they showed up, such as when and where they appeared in each model.

There are a number of ways one could define and measure “1-back heads”. I chose a relatively straightforward approach: I presented a series of English sentences to a model instance; then, for each attention head in that model, I measured the average attention directed from each token to the token immediately preceding that token. I implemented this procedure for four models in the Pythia suite (14m, 70m, 160m, and 410m), testing all nine random seeds of each model; the Pythia suite also makes pretraining checkpoints publicly available, so I measured these heads’ behavior at select checkpoints throughout pretraining to characterize their developmental trajectory.
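To make the measurement concrete, here is a minimal sketch of a per-head “1-backness” score. It is not the code used in the paper: the function name `one_back_score`, the array layout, and the toy attention matrices are all assumptions made for illustration. The only idea carried over from the text is averaging the attention each token directs to the token immediately before it.

```python
import numpy as np

def one_back_score(attn):
    """Mean attention from each token to the token immediately before it.

    attn: array of shape (n_heads, seq_len, seq_len), where attn[h, i, j]
    is the attention that head h directs from token i to token j.
    Returns one score per head.
    """
    attn = np.asarray(attn, dtype=float)
    # The first sub-diagonal holds the weights from token i to token i - 1.
    prev = np.diagonal(attn, offset=-1, axis1=1, axis2=2)
    return prev.mean(axis=-1)

# Toy example with made-up weights: 2 heads over a 4-token sequence.
seq_len = 4
# Head 0 always attends to the previous token (token 0 attends to itself).
head_prev = np.eye(seq_len, k=-1)
head_prev[0, 0] = 1.0
# Head 1 spreads attention uniformly over the causal context.
head_uniform = np.tril(np.ones((seq_len, seq_len)))
head_uniform /= head_uniform.sum(axis=1, keepdims=True)

scores = one_back_score(np.stack([head_prev, head_uniform]))
print(scores)  # head 0 scores far higher than head 1
```

In a real analysis the attention matrices would come from a model checkpoint and the scores would be averaged over many sentences; this sketch only shows the per-sentence calculation.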

I should emphasize, here, that this is a purely behavioral assessment of “1-backness”. Technically, this procedure does not identify the function of these heads (for which we’d need a causal intervention). That said, my view is that if a head always “looks” from the current token to the immediately preceding token, it’s a reasonably strong candidate for being a 1-back head.

Once I took these measurements, I could address all sorts of interesting questions. What was the overall distribution of “1-backness” across heads in each model? When do 1-back heads start to emerge in each model? Where do they emerge? Are there any interesting points of convergence or divergence across model instances?

The paper has a number of empirical results, and I won’t go into all of them here. I want to focus on a few specific findings relating to the developmental trajectory of these 1-back heads:

First, I found that different random seeds of the same model (e.g., all nine random seeds of 14m) were highly aligned in when putative 1-back heads started to show up. That is, there was extremely high inter-seed consistency.
There was also surprisingly high inter-model consistency (e.g., 14m vs. 70m vs. 160m vs. 410m). Seeds of different models were less aligned than seeds of the same model, but the overall correlation was still quite high.
That said, there were interesting (and systematic) points of divergence. Specifically, larger models tended to show an earlier onset of 1-back attention than smaller models. The slope of 1-back attention was also steeper, i.e., 1-back attention developed not only earlier but at a faster rate in larger models. Finally, larger models had a higher peak: they had individual heads that showed a higher degree of “1-backness” than smaller models.

Figure adapted from my forthcoming paper on axes of correspondence in LLM circuits. Each panel shows the trajectory of putative 1-back attention heads across random seeds of a given model. Red line shows the predictions of a GAM predicting 1-back attention from all seeds of all models. In general, there was high inter-seed consistency (i.e., within each model) and relatively high inter-model consistency: that is, 1-back heads developed at roughly similar points. That said, larger models displayed an earlier onset of 1-back attention, as well as a steeper slope and higher peak.

I found these results really exciting. The high degree of inter-seed consistency was a reassuring proof-of-concept that, at least when it came to a simple mechanism, models of the same size trained on the same data (but with different initial random weights) developed that mechanism at very similar timepoints. I was also struck by the relatively high degree of inter-model consistency: a priori, I didn’t know whether models of different sizes would show much temporal alignment. And finally, the fact that the points of divergence (onset, slope, and peak) were systematic suggests that understanding why models differ might actually be somewhat tractable.

What’s next?

One theme tying together many of my disparate research projects—which I’m planning to focus on more in the coming months and years, especially as I start my own research lab at Rutgers University—is that of trying to make headway on the epistemological challenges I’ve outlined in previous posts. When it comes to generalizability in particular, there are tons of possible extensions to the work I’ve described here, including investigating models trained on other languages and trying to determine what causes different heads to specialize.

More broadly, I remain interested in issues like construct validity and how different metaphorical construals of LLMs shape our understanding of what “kind of thing” an LLM is. These are all questions about how to figure out what we want to know, and they are by no means unique to LLM-ology. They might seem abstract or even navel-gazing, but in my view, they’re fundamental: it’s very hard to make progress—or to know whether we’ve made progress—without some kind of epistemological framework to guide us.

This is a problem for the study of human behavior as well, which is why it’s so important that Cognitive Science researchers account for linguistic and cultural diversity when investigating research questions.

Determining the functional scope of a model component is itself a very hard (and related) problem.

As far as I know, the only axis that has not been investigated with respect to induction heads is the configurational axis, but I might simply be unaware of that research.

The search for more efficient LLMs

Sean Trott — Wed, 17 Sep 2025 22:16:57 GMT

A couple months ago, readers voted on some topics they’d like to see me cover; one of them was an explainer on techniques for model distillation and building more efficient LLMs generally.

State-of-the-art large language models (LLMs) require huge amounts of resources to train and deploy. Historically, most of this cost has come from training (training GPT-4 is estimated to have cost about $100M), though widespread usage has also increased the total costs associated with deployment, especially with the rise of chain-of-thought “reasoning” techniques.

Naturally, there’s considerable interest in reducing these costs. The companies building these models would presumably like to spend less money on compute, assuming they could guarantee similar performance. Moreover, these processes are expensive because they use lots of energy, much of which involves some amount of carbon emissions1: while there are reasonable debates about how to quantify and assess the actual environmental harms, it seems (to me, at least) straightforwardly true that training and deploying models more efficiently would reduce whatever environmental harms there are.2

Finally, there are more theoretical reasons to prefer more efficient models—particularly if by “more efficient” we mean smaller and/or trained on less data. As I’ve written before, humans seem to acquire remarkable linguistic competence despite encountering fewer words than LLMs; theories of language learning would be informed by the success or failures associated with training models on more plausible volumes of linguistic input. Similarly, smaller models may be more interpretable than larger ones: intuitively, it seems easier to try to understand 100M parameters than 100B parameters (though whether or not this is actually true remains to be seen3). And from a purely self-interested perspective, I think basic scientific research on LLMs would benefit from more efficient models: it’s hard to run (and even harder to train) state-of-the-art LLMs on an academic budget, which limits the kinds of questions we can ask.

With all that in mind: why is it that models are so expensive to train and deploy, and what are researchers doing to reduce these costs?

Why are models so hungry for compute?

We can roughly divide compute costs into those associated with training and those associated with inference.

Before training, an LLM consists of layer upon layer of random weights (or “parameters”). Such an LLM could be used to generate predictions about upcoming words, but those predictions would be very bad: we might say (informally) that it doesn’t “know” anything about language. Typically, training consists of presenting many examples from a language corpus to the model and comparing its predictions to the actual word in a given context. Learning, then, is informed by an error signal. The model’s mistakes are used to make “updates” to the parameters; over time, these updates produce parameters that make better predictions.

The reason I’m describing these details (and there’s lots I’m glossing over) is that an intuition for what training involves is helpful for understanding why it’s so compute-hungry. Concretely, an “update” to the model looks something like this:

Pass in a sentence or “batch” of sentences and run a “forward pass” through the model (i.e., derive predictions for each example).
Compute the error associated with each prediction.
Use that error to figure out how to update each model parameter; the technique for “walking” the error backwards through each layer is called backpropagation.
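The three steps above can be sketched with a deliberately tiny stand-in for a language model. Everything here is an assumption made for illustration: the “model” is a single weight matrix `W` (so backpropagation collapses to one gradient calculation rather than a walk through many layers), the data are random, and the names `forward`, `loss_fn`, and `train_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 5, 3

# Toy "model": one weight matrix mapping a context vector to token logits.
W = rng.normal(scale=0.1, size=(hidden, vocab_size))

def forward(W, x):
    logits = x @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def loss_fn(probs, targets):
    # Error = negative log probability assigned to the correct token.
    return float(-np.log(probs[np.arange(len(targets)), targets]).mean())

def train_step(W, x, targets, lr=0.1):
    probs = forward(W, x)                        # step 1: forward pass
    loss = loss_fn(probs, targets)               # step 2: compute the error
    grad_logits = probs.copy()                   # step 3: gradient of the error
    grad_logits[np.arange(len(targets)), targets] -= 1.0
    grad_W = x.T @ grad_logits / len(targets)
    return W - lr * grad_W, loss                 # apply the update

x = rng.normal(size=(8, hidden))                 # a "batch" of 8 contexts
targets = rng.integers(0, vocab_size, size=8)    # the correct next tokens

losses = []
for _ in range(200):
    W, loss = train_step(W, x, targets)
    losses.append(loss)
print(losses[0], losses[-1])  # the error shrinks as updates accumulate
```

A real LLM repeats this same loop, but with billions of parameters spread over many layers and with far more data, which is where the compute goes.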

Each of these steps must be performed for each training example, and datasets often contain billions or trillions of tokens. Moreover, in a state-of-the-art LLM, there are often tens of billions or even hundreds of billions of parameters4, each of which must be used in calculations during each forward pass and subsequently updated based on the error signal from each training example. To make matters worse, successful training often involves multiple passes through the dataset, meaning each example might be observed many times by the model. Finally, because of how attention in transformer LLMs works (more on that below), longer and longer context windows produce quadratic increases in compute costs.

That’s a lot of calculations! As we wrote in our explainer on LLMs:

OpenAI estimates that it took more than 300 billion trillion floating point calculations to train GPT-3—that’s months of work for dozens of high-end computer chips.
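As a rough sanity check on that figure, a widely used rule of thumb estimates training compute as about 6 floating point operations per parameter per training token (roughly 2 for the forward pass and 4 for the backward pass). That rule is an assumption brought in here, not something from the quoted explainer; the parameter and token counts are the commonly reported figures for GPT-3.

```python
# Back-of-the-envelope estimate of GPT-3 training compute.
params = 175e9                  # ~175 billion parameters
tokens = 300e9                  # ~300 billion training tokens
flops_per_param_per_token = 6   # rule of thumb: ~2 forward + ~4 backward

total_flops = flops_per_param_per_token * params * tokens
billion_trillion = 1e9 * 1e12

print(f"{total_flops:.2e} floating point operations")
print(f"about {total_flops / billion_trillion:.0f} billion trillion")
```

The estimate lands in the same range as the quoted number, and the 2-versus-4 split also hints at why a training step costs several times more than a forward pass alone.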

Another way to get intuition for this is to think about what it would take to do all this by hand. As I’ve written before, even a single forward pass through the model could take thousands of years.5 That’s not even considering the calculations involved in computing the error signal and backpropagating that signal through the network to update the weights.

That’s training. Inference refers to using the model purely for a forward pass, that is, with the “weights frozen” so that no updates are made based on the model’s output. This should already give you some intuition as to why inference is generally cheaper than training: training a model requires both a forward and backward pass through the model. That said, as Tim Lee has written, inference costs are now non-trivial. We can attribute that to a few causes: first, more people are using tools like ChatGPT, and each “query” requires a number of forward passes to generate the model’s response tokens; second, models and context windows have gotten bigger and bigger; and third, models increasingly deploy chain-of-thought or “scratchpad” techniques, which involve generating lots of tokens on the “path” to producing the final answer.6

Understanding this context is (hopefully) useful for understanding how researchers might in turn make headway on building more efficient models. A few techniques are focused on training models more efficiently, while others are focused on deploying an already-trained model more efficiently. The rest of this post will walk through some of the techniques researchers are exploring. Because there are more techniques aimed at making already-trained models more efficient, that’ll be the primary focus, but I’ll also discuss some of the efforts to speed up training towards the end.

Knowledge distillation

Knowledge distillation (KD) refers to the process of compressing (or “distilling”) the representations and functions (or “knowledge”) of a large model into a smaller model. You can think of KD as a general class or approach, as there are many specific techniques for implementing it. The goal of knowledge distillation is to produce a smaller, more efficient model from a bigger, already-trained model: thus, it’s a technique for making deployment more efficient.

The core intuition is that a large model is able to extract complex relationships from its training data that a small model might not be able to. Once trained, however, it may be possible to use the representations or outputs from that large model as a training signal for a smaller model. That is, the large model may produce representations that serve as a helpful training signal for a smaller model—perhaps more helpful than the original training data.

There are various reasons why this might be true, but one way of thinking about this is that the original training signal is sometimes pretty coarse. Consider an image classifier trained to sort images into the following bins: cats, dogs, tables, and other. Typically, that classifier will be trained on images with a single label (e.g., “cat”), and its error signal will thus fail to incorporate the similarities between the classes of interest. Suppose, for instance, that the model misclassifies a picture of a cat as a dog; intuitively, this seems like a less egregious error than misclassifying it as a table, but a typical training signal will simply take into account the probability assigned to cat.

The direct training signal for LLMs is similarly coarse. Typically, an LLM is penalized according to (essentially) the inverse probability7 assigned to the correct token in a given context: specifically, assigning a probability of 1 (100%) to the correct word would result in a loss of 0, resulting in no penalty; anything less than a probability of 1 results in a larger loss (thus necessitating updates to the weights). But clearly, some mistakes are larger than others, and it seems plausible that predicting the word “car” when the correct word is “automobile” should incur a smaller cost than predicting “elephant”.
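A small worked example may help here. The sketch below assumes the standard formulation in which the penalty is the log of the inverse probability assigned to the correct token; the function name `token_loss` and the two made-up predictions are hypothetical.

```python
import math

def token_loss(p_correct):
    """Log of the inverse probability assigned to the correct token."""
    return math.log(1.0 / p_correct)

# The penalty is 0 at probability 1 and grows as the probability shrinks.
for p in (1.0, 0.5, 0.1, 0.01):
    print(p, round(token_loss(p), 3))

# Two made-up predictions for a context whose correct word is "automobile".
# Both give "automobile" the same probability; they differ only in where
# the rest of the probability mass goes.
pred_near_miss = {"automobile": 0.2, "car": 0.7, "elephant": 0.1}
pred_far_miss = {"automobile": 0.2, "car": 0.1, "elephant": 0.7}

loss_near = token_loss(pred_near_miss["automobile"])
loss_far = token_loss(pred_far_miss["automobile"])
print(loss_near == loss_far)  # the hard-label penalty cannot tell them apart
```

Because only the probability of the correct token enters the calculation, favoring “car” and favoring “elephant” are penalized identically.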

In both cases, the challenge arises from the fact that learning to classify hard labels (a given token or image category) doesn’t take into account that there is useful information in the kinds of errors a model makes; Geoff Hinton once called this “dark knowledge”. Of course, it’s not that engineers are unaware of this fact! It’s just much harder to identify a scalable metric for penalizing models in this way—and large-enough models seem to achieve pretty good performance with the training protocols we already use, even if smaller models struggle to get off the ground.

To return to where we started: model size may be a useful proxy for a model’s capacity to extract complex relationships from the training data even with a somewhat coarse error signal. Concretely, a larger model contains more possible “sub-models” (Frankle & Carbin, 2018) and thus more opportunities to identify the function that best approximates the data.8 The insight of knowledge distillation is that once that large model has been trained, we may not need all those parameters to perform the same computations. Instead, we can distill the hard-won knowledge from the bigger model into a smaller model by exploiting the observation above: there is information in the kinds of errors or representations produced by that bigger model, which acts as a finer-grained training signal for a smaller model.

One specific technique is known as logit distillation (Gou et al., 2021). The idea here is that we can use the outputs of a bigger “teacher” model (the probabilities assigned to each possible token or image label) as a training signal for a smaller “student” model. The procedure looks something like this:

Researchers present the same input (e.g., a sentence or image) to the already-trained teacher model and untrained student model.
They then extract the outputs (e.g., logits or probability distribution) from the teacher model.
Those outputs are used as the training signal for the student model (i.e., instead of simply measuring the probability assigned to the correct output).
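The procedure above can be sketched as a loss function that compares the student’s output distribution to the teacher’s. This is one common way to set it up, not necessarily the one used in any particular paper: the KL divergence and the softening `temperature` are assumptions, and the logits for the cat/dog/table/other classifier are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

# Classes: cat, dog, table, other. The input is a picture of a cat.
teacher_logits = [4.0, 2.5, -1.0, 0.0]   # teacher: "cat, but a bit dog-like"
student_a = [3.0, 2.0, -1.0, 0.0]        # second guess: dog
student_b = [3.0, -1.0, 2.0, 0.0]        # second guess: table

loss_a = distillation_loss(student_a, teacher_logits)
loss_b = distillation_loss(student_b, teacher_logits)
print(loss_a, loss_b)  # student_b is penalized more heavily
```

Both students assign exactly the same probability to “cat”, so a hard-label loss would treat them identically; the teacher’s full distribution is what separates the student whose second guess is “dog” from the one whose second guess is “table”.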

This procedure is depicted in the visual illustration below.

A schematic depicting logit distillation, a form of knowledge distillation in which the outputs of one model (the “teacher”) are used as the training signal for another model (the “student”).

Empirically, techniques like logit distillation can work surprisingly well. The benefit, of course, is that the end result is a smaller model that runs more efficiently and, at least in principle, is about as good as the bigger model. A related approach is to train the smaller model on the features (i.e., internal representations) from the bigger model, as opposed to the outputs; this approach is sometimes called feature distillation.

More recently, researchers have explored other techniques, enabled by the success of systems like LLMs in generating content (like coherent paragraphs). One such technique (discussed in this survey; Yang et al., 2024) is called In-Context Learning Distillation. In-context learning (ICL) consists of presenting an LLM with a series of questions and answers (or example tasks) in the “context” of the question or task the user wants the LLM to solve; the examples in the context help steer the LLM towards the kinds of responses the user is looking for.9 You can think of ICL Distillation, then, as a kind of modified version of logit distillation, using generated sequences as the training signal instead of the logits. A teacher model (say, GPT-4) is presented with a given context and asked to generate some output; that entire string (context + output) is then presented to a student as the training signal to mimic.

The advent of “reasoning” approaches like chain-of-thought (CoT) has also motivated researchers to train on these generated sequences. Here, the approach is similar to ICL Distillation, but the teacher model is asked to produce a series of reasoning “steps” involved in producing a given answer; the student model can then be trained directly on those steps. If we assume that the reasoning steps in the teacher model are correct (a nontrivial assumption, to be clear), this has the advantage of automatically producing a training signal for elaborate reasoning chains.

These latter techniques are sometimes called “black-box distillation techniques” because the student model is not being trained directly with representations or logits from the teacher—instead, the generated behavior (token sequences) of the teacher model is used as the target of training. Nonetheless, the approaches are conceptually analogous to some extent: in each case, a student model is trained to reproduce the outputs (logit distillation or ICL Distillation) and/or intermediate steps (feature distillation and CoT distillation) of a teacher model.

Why does any of this work? We can’t know for sure, but I think part of the answer comes back to what I mentioned earlier: bigger models might be better than small models at extracting useful information from a coarse training signal, but once that information has been extracted, it’s possible to distill it into a more compressed form. Of course, it’s also hard to judge how reliable the approach is: even if the smaller model performs as well as the larger model on certain benchmarks, it’s always possible that we’ve “distilled away” truly important representations or mechanisms; we just can’t detect it in the benchmarks because measuring model capabilities is really hard. I’ll discuss this challenge more later in the post, but one possibility is that mechanistic interpretability techniques could help us decompose the teacher and student model to determine whether they are, in fact, performing similar functions, or whether the student is implementing more “superficial” heuristics.

Pruning

Like knowledge distillation, pruning assumes that once a large model has been trained, there exists some smaller model that can perform roughly the same set of functions about as well as the larger model. Pruning is thus also similar to distillation in that it boosts efficiency during inference, not training.

Unlike knowledge distillation, however, pruning works by systematically removing (or “pruning”) components of the original model directly. The resulting model is thus sparser: many of the original parameters have been set to 0.

The underlying assumption of pruning is that many of the parameters in the original model are somehow redundant or not particularly important. That is, the original model is overparameterized. Indeed, evidence for apparent redundancy—at least for select tasks—in transformer language models dates back at least to a 2019 paper (Michel et al., 2019) entitled “Are 16 Heads Really Better Than One?” The paper’s central question is right there in the title: most transformer models use multi-headed attention, meaning that each layer of the model has multiple “heads”, each of which could (in theory) learn to track different kinds of relationships in the context—but are all these heads really necessary?

The authors tested this question using an ablation study, a technique I’ve written about before. Ablation involves removing or “knocking out” different components of a model and asking whether the model’s performance on some task changes at all. The logic is that if a component is important for solving a task, then performance should decrease when that component has been ablated; if the component is redundant or unnecessary, then performance should not change. Concretely, the difference in performance between the original model and the version with a given component ablated is a measure of how much that component mattered for the task.
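Here is a minimal sketch of that logic on a toy stand-in for a model. Nothing here comes from the paper: the “model” is just a weighted sum of four made-up head signals, two of which are given zero weight so that they are redundant by construction, and `task_error` is a hypothetical evaluation function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_examples = 4, 200

# Toy stand-in for a model: the output is a weighted sum of per-head signals.
head_outputs = rng.normal(size=(n_examples, n_heads))
head_weights = np.array([2.0, 0.0, 0.5, 0.0])  # heads 1 and 3 contribute nothing
targets = head_outputs @ head_weights

def task_error(mask):
    """Mean squared error when only heads with mask == 1 are active."""
    preds = (head_outputs * mask) @ head_weights
    return float(np.mean((preds - targets) ** 2))

baseline = task_error(np.ones(n_heads))

importance = []
for h in range(n_heads):
    mask = np.ones(n_heads)
    mask[h] = 0.0                                # "knock out" head h
    importance.append(task_error(mask) - baseline)

print(importance)  # large for head 0, small for head 2, zero for heads 1 and 3
```

The per-head change in error is the importance measure described above: ablating a head that matters hurts performance, while ablating a redundant one leaves it unchanged.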

The authors focused on two models: WMT, the original transformer architecture from the famous “Attention is all you need” paper (Vaswani et al., 2017), which has 6 layers and 16 heads per layer; and BERT, which was considered “state of the art” at the time (though no longer), and which has 12 layers with 12 attention heads each. For each model, the authors applied the same procedure: one by one, they selectively ablated each head in each layer and quantified the resulting change in performance on a machine translation task. This amounts to asking how much worse WMT and BERT performed when equipped with one fewer head than the model was originally trained with. The authors found that removing most heads didn’t matter much (bolding mine):

Notably, we see that only 8 (out of 96) heads cause a statistically significant change in performance when they are removed from the model, half of which actually result in a higher BLEU score. This leads us to our first observation: at test time, most heads are redundant given the rest of the model.

But removing one of 96 (or 144) heads isn’t actually ablating a huge chunk of the model. In the next experiment, the authors inverted the approach from the first study: instead of ablating only a single head, they ablated all heads but one from a given layer, allowing them to ask whether that head was sufficient for solving a given task. Strikingly, they found that for most layers, one head was enough to retain good performance on the machine translation task:

We find that, for most layers, one head is indeed sufficient at test time, even though the network was trained with 12 or 16 attention heads.

The authors also report the difference in runtime efficiency between the original model and the pruned versions: unsurprisingly, pruning resulted in substantive efficiency gains.

Of course, as the authors note, these results were obtained for a specific task (machine translation) on specific models. It’s unclear how well they’d generalize to other tasks or other models, and there’s always a risk of being too eager to attribute redundancy to a system: maybe those other heads were important but it’s just hard to measure them. But I think the paper is primarily interesting as a proof-of-concept, both in terms of the method used (pruning) and the results (for a machine translation task, one head is sometimes as good as 12 or 16).

Schematic of a simple approach to pruning. Each model component is selectively ablated and the resulting effect on performance is analyzed. If ablating a component did not hurt model performance, it can be removed from the model.

There is a growing body of work exploring redundancy in pre-trained transformer models, particularly with respect to the attention heads. Again focusing on machine translation, the authors of a 2019 paper (Voita et al., 2019) trained multiple transformer models to perform the translation task. They then ranked the relative “importance” of each head (the extent to which it impacted the model’s predictions), as well as its “confidence” (the average maximum attention to particular tokens10): in general, a small number of heads emerged as most “important”, and importance was positively correlated with head confidence. That is, heads that consistently distributed attention to particular tokens (as opposed to distributing it more uniformly across tokens) made stronger contributions to model outputs.
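To illustrate the “confidence” measure, here is a minimal sketch that takes each token’s largest attention weight and averages those maxima per head. The function name `head_confidence` and the two toy attention matrices are assumptions for illustration, and any details of the paper’s exact definition beyond “average maximum attention” are not reproduced here.

```python
import numpy as np

def head_confidence(attn):
    """Average, over query tokens, of each token's largest attention weight.

    attn: array of shape (n_heads, seq_len, seq_len); each row sums to 1.
    Returns one confidence value per head.
    """
    return np.asarray(attn, dtype=float).max(axis=-1).mean(axis=-1)

# Made-up attention matrices for a 3-token sequence.
focused = np.array([[0.90, 0.05, 0.05],
                    [0.10, 0.80, 0.10],
                    [0.00, 0.05, 0.95]])
diffuse = np.full((3, 3), 1 / 3)   # attention spread evenly across tokens

confidence = head_confidence(np.stack([focused, diffuse]))
print(confidence)  # the focused head scores much higher than the diffuse one
```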

The authors then characterized the putative function of the most important heads by analyzing their attention matrices, classifying them as follows:

Positional heads were those that consistently (≥90% of the time) attended to specific relative token positions (e.g., 1-back heads).
Syntactic heads were those whose attention weights correlated with particular syntactic dependencies, e.g., subject-verb relations.
Rare word heads tended to point towards the least frequent token in a sequence.
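The positional criterion lends itself to a short sketch. The version below is an assumption about how one might operationalize “attends to a specific relative position at least 90% of the time”: it checks where each token’s maximum attention weight falls. The function name `positional_offset` and the toy matrices are hypothetical.

```python
import numpy as np

def positional_offset(attn, threshold=0.9):
    """Return the relative offset a head consistently favors, or None.

    attn: (seq_len, seq_len) attention matrix for a single head.
    An offset of -1 means "the previous token".
    """
    attn = np.asarray(attn, dtype=float)
    # For each query token i, the offset (j - i) of its most-attended token j.
    offsets = attn.argmax(axis=-1) - np.arange(attn.shape[0])
    values, counts = np.unique(offsets, return_counts=True)
    best = counts.argmax()
    if counts[best] / len(offsets) >= threshold:
        return int(values[best])
    return None

seq_len = 10
# A 1-back head: every token attends to its predecessor (token 0 to itself).
prev_head = np.eye(seq_len, k=-1)
prev_head[0, 0] = 1.0
# A head that spreads attention evenly over the causal context.
uniform_head = np.tril(np.ones((seq_len, seq_len)))
uniform_head /= uniform_head.sum(axis=1, keepdims=True)

print(positional_offset(prev_head))     # -1
print(positional_offset(uniform_head))  # None
```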

One of the things I found most interesting about this paper was that these “important” heads really did seem to have relatively interpretable functions. Moreover, these functions actually make sense in the context of what the models were trained to do: at minimum, translating from one language to another seems like it would benefit from paying attention to the relative position of words in the input and output sequence, the syntactic relations of those words, and the use of particularly unusual or infrequent words. (This is a good place to note that a good machine translation model would almost certainly need to pay attention to more than just those things; it’s also entirely possible that the heads analyzed performed more complex or sophisticated functions that were simply harder to characterize.11)

But again: these are small models trained on a specific task, and we should be cautious about extrapolating to other model classes or task contexts.

Setting aside, for now, the question of whether and which model components might be redundant, we can turn to a second, equally challenging question: given that our goal is to identify some subset of a model’s parameters that are sufficient for equivalent performance, what’s the best way to identify that subset? This is a version of what’s called model selection in statistics and machine learning.

Naively, one might imagine that you could find the “best” model simply by testing every possible subset of parameters. But this quickly becomes computationally intractable. Suppose you have 100 parameters (k = 100), and are willing to entertain subset models ranging from those with no parameters (k = 0) to those dropping only a single parameter (k = 99)—and everything in between. There are roughly 10^30 such possible models. If we assume each candidate model takes only one second to fit and evaluate, it would take 10^30 seconds to evaluate all those candidates: for comparison, the likely age of the universe is about 10^17 seconds. That means it would take longer than the entire history of the universe thus far simply to identify the best subset of 100 possible parameters for a given task.
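The arithmetic behind that estimate is easy to check. The sketch below assumes, as the paragraph above does, that each of the 100 parameters is simply kept or dropped and that each candidate takes one second to evaluate; the variable names are illustrative only.

```python
# Each of the 100 parameters is either kept or dropped, so there are 2**100
# possible subsets (this count also includes the full model, which changes
# nothing at this order of magnitude).
n_params = 100
n_subsets = 2 ** n_params

seconds_per_model = 1               # assumption from the text: one second per candidate
age_of_universe_seconds = 10 ** 17  # rough order of magnitude quoted in the text

total_seconds = n_subsets * seconds_per_model
universe_lifetimes = total_seconds // age_of_universe_seconds

print(f"candidate models: {n_subsets:.2e}")
print(f"universe lifetimes needed: {universe_lifetimes:.2e}")
```

Under these assumptions, the exhaustive search would take on the order of ten trillion times the current age of the universe.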

Clearly, we need a better way. There are a number of different model selection techniques used in statistics, such as forward or backward stepwise regression, in which variables are added or removed one at a time according to some evaluation criterion; researchers also make use of regularization methods like Lasso, which impose an additional sparsity constraint on the final model. These methods don’t necessarily find the globally optimal set of parameters, but they work well as heuristics for finding a “good enough” model.

Similar principles apply in the context of pruning neural networks, though the scale and complexity of the problem is considerably greater than most linear regressions. “Greedy search” methods make the best locally optimal choice at each decision point and are roughly analogous to stepwise regression. For instance, you might start by finding the best single attention head out of all possible attention heads (say, 100)12; once you’ve found that head, you proceed by finding the best second head to add to that head out of the remaining heads (99); you then continue this process until you’ve added all heads in order of importance.13 This might still take a while—especially if you have many attention heads—but it’s much, much faster than searching all possible combinations of heads.
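As a rough illustration of the greedy procedure described above, here is a minimal sketch. The scoring function and the "usefulness" numbers are hypothetical stand-ins for whatever evaluation one would actually run on a model; nothing here comes from a real pruning library.

```python
from typing import Callable, FrozenSet, List

def greedy_forward_selection(
    candidates: List[int],
    score: Callable[[FrozenSet[int]], float],
) -> List[int]:
    """Return candidates in the order a greedy search would add them."""
    selected: List[int] = []
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining head to the current set; keep the best one.
        best = max(remaining, key=lambda h: score(frozenset(selected + [h])))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scoring function (hypothetical): each head has a fixed "usefulness",
# and heads 0 and 1 are largely redundant with one another.
USEFULNESS = {0: 5.0, 1: 4.5, 2: 3.0, 3: 1.0}

def toy_score(heads: FrozenSet[int]) -> float:
    total = sum(USEFULNESS[h] for h in heads)
    if 0 in heads and 1 in heads:
        total -= 4.0  # penalty: the two heads mostly duplicate each other
    return total

order = greedy_forward_selection(list(USEFULNESS), toy_score)
print(order)  # [0, 2, 3, 1]
```

Head 1 looks second-best in isolation but is added last, because most of what it contributes is already covered by head 0. With n heads, this procedure evaluates n + (n - 1) + ... + 1 candidate sets (5,050 for 100 heads), rather than every possible combination.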

In general, pruning methods tend to trade off between computational complexity (i.e., the efficiency of the pruning procedure) and fidelity (i.e., whether you end up removing the right parameters). These range from removing parameters with the smallest magnitudes (highly efficient, but imprecise) to more sophisticated methods like the “Optimal Brain Surgeon” and its modern counterparts. The most accurate methods tend to rely on “second-order” metrics that account for the curvature of the overall loss landscape when removing model parameters; this is in contrast to “first-order” metrics that use only the local gradient information (i.e., increases or decreases in loss) to estimate a parameter’s impact. (Metrics like parameter magnitude could be considered “zeroth-order” in the sense that they’re not calculated as a function of the loss at all.) Unfortunately, second-order metrics are the most expensive to compute, whereas zeroth-order metrics are much faster. There are, however, clever workarounds that attempt to balance these trade-offs. For instance, the SparseGPT method relies on second-order metrics, but computes them layer by layer (as opposed to the entire model) using relatively efficient methods, which makes the problem more computationally tractable.
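The simplest end of that spectrum, magnitude-based ("zeroth-order") pruning, fits in a few lines. This is a minimal sketch on a flat list of made-up weights, not how any particular library implements it; the function name and `sparsity` parameter are illustrative.

```python
from typing import List

def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest |value|."""
    n_to_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(by_magnitude[:n_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6]  # made-up values
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.6]
```

Note that the loss never appears in this code: the criterion looks only at the weights themselves, which is what makes it cheap and also why it can remove a small weight that happens to matter a great deal.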

Putting this all in context: like knowledge distillation, the key assumption of pruning is that many LLMs are overparameterized—they have more parameters than they need. In principle, then, one could identify a subset of these parameters that runs more efficiently without a loss in performance. The main challenge is figuring out which parameters are actually necessary and which are redundant, and doing this in a way that doesn’t take longer than the age of the universe to calculate.

What about training?

So far, I’ve focused on methods that make already-trained models run more efficiently. But as I wrote earlier, much of the energy costs come from training. Can we speed that up as well?

Training has indeed gotten more efficient in some ways, mostly through a combination of hardware advances and algorithmic approaches that better exploit these advances. Some benefits have come through improved optimization methods: while neural networks were historically trained using stochastic gradient descent (SGD), optimizers like Adam generally result in faster convergence than SGD by adapting the learning rates (how quickly parameters are updated) for each parameter based on its historical trajectory. Other techniques, such as mixed-precision training, reduce the numerical precision used to store and operate on many of the model’s values (e.g., using 16-bit rather than 32-bit floating-point numbers), which speeds up both inference and training.

One innovative technique that’s been particularly impactful for making fine-tuning more efficient in particular is LoRA, or “low-rank adaptation” for LLMs (Hu et al., 2021). The underlying assumption of LoRA is that most of the changes to weight matrices induced by fine-tuning can actually be represented in a lower-dimensionality matrix than the original model matrices. That is, if the original matrix has 100 columns, you might be able to represent the changes to that matrix in terms of just 5-10 columns of updates.

Conceptually, this makes sense if we assume that fine-tuning mostly introduces a few new “directions” in weight-space, while keeping other aspects of weight-space relatively intact. Oversimplifying a bit: training on medical textbooks might introduce new information about medicine and health, but not necessarily new grammatical rules or general facts about the world. Thus, we might be able to represent those updates using fewer dimensions than the original weight matrix.

The technical details of LoRA are, of course, more complicated. LoRA works by “freezing” the actual weight matrix W (i.e., preventing it from updating) and instead learning updates to two smaller matrices: let’s call them A and B. Respectively, these matrices help summarize the core dimensions for the updates (A) and spread their influence (B) across all the original dimensions of W. Concretely, if W is 100x100 dimensions, A might be 100x10 and B might be 10x100. In this case, LoRA allows the model to learn roughly 10 “directions” of weight updates, which in turn can be distributed back across the original 100 dimensions. Thus, instead of learning updates for the full 100x100 grid (10K parameters total), LoRA would learn updates for each 10x100 grid (for a total of 2K parameters): a 5x reduction in the number of parameters the model had to learn!
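The parameter counting in that example can be made concrete with a toy sketch. It uses plain Python lists rather than a deep learning framework, follows the naming in the paragraph above (A is 100x10, B is 10x100), and zero-initializes one factor so the update starts at zero; the initialization scale and helper names are assumptions made for illustration.

```python
import random

random.seed(0)
d, r = 100, 10  # original dimension and LoRA rank, matching the example above

def rand_matrix(rows, cols, scale=0.01):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    # Plain-Python matrix product; fine for a 100x100 toy example.
    Y_cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols] for row in X]

W = rand_matrix(d, d)              # frozen pretrained weights (never updated)
A = rand_matrix(d, r)              # trainable, 100x10
B = [[0.0] * d for _ in range(r)]  # trainable, 10x100, zero-initialized here

# The effective weights are W plus the low-rank update A @ B (100x100).
delta = matmul(A, B)
W_effective = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

full_params = d * d          # parameters updated by ordinary fine-tuning
lora_params = d * r + r * d  # parameters updated by LoRA
print(full_params, lora_params, full_params / lora_params)  # 10000 2000 5.0
```

Because B starts at zero, the effective weights are initially identical to W; training then only ever touches the 2,000 entries of A and B.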

In practice, the savings of LoRA are often even more dramatic. The authors of the original paper report as much as a 10,000x reduction (bolding mine) with no apparent loss in performance:

Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.

It’s worth noting that the logic here is, in some ways, similar to the ideas underlying knowledge distillation and pruning—only applied to the dimensionality of weight updates as opposed to the weights themselves. LoRA’s assumption is that when it comes to fine-tuning, updating every parameter in the full model would in some sense be an “overparameterized” update. Instead, this update can be distilled in a lower-rank matrix that nonetheless effectively summarizes the important directions of change.

LoRA has had a major impact on the field in terms of making fine-tuning much more efficient. Some researchers have tried to develop LoRA-like methods for pretraining as well, which have had some success, though my understanding is that their impact is considerably limited relative to LoRA. Again, this makes some conceptual sense if we keep in mind the distinction between training from scratch vs. fine-tuning: when training from scratch, the model “knows nothing” and thus likely benefits more from exploiting the full dimensionality of its parameters—if you could summarize the updates with fewer dimensions, you might just be able to use a smaller model in the first place. In fine-tuning, the model has already learned a lot about how language works, so the updates can be more targeted.

The age of small language models?

While I was writing this explainer, The Economist published an article suggesting that “small language models” (SLMs) might become increasingly relevant in the future. Aside from the fact that there’s something funny about calling them “SLMs”—the term large language model was coined to indicate a contrast with traditionally-sized (smaller) models of the past—I agree that smaller models are likely better-suited for many applications, especially any application that requires the model to run locally on a device.

And as I noted in the introduction, I also think there’s something philosophically interesting about trying to build more efficient systems. For the past several years, AI discourse has largely centered around the question of whether building bigger neural networks trained on more data will continue to yield performance improvements—sometimes called the scaling hypothesis. I think there’s merit to either side here, and I also think the debate about the exact shape of the function relating model size to model performance is interesting.14

But even if scaling holds to some degree, I like the idea of building systems that deploy resources more efficiently: not just because it’s financially or environmentally prudent (though this obviously matters too), but because much of what I find interesting about human cognition is the fact that it operates in a resource-constrained setting. Biological organisms operate under some pretty serious metabolic constraints; moreover, while the human brain is certainly energy-hungry, it’s seen by many as a remarkably efficient system in terms of what it achieves relative to its energetic costs. Efforts to do more with less might thus bring us closer towards building biologically plausible models of cognition.

The extent to which this is true is, of course, dependent on the grid used to power the data centers used to train the models.

At the very least, the ratio of harm to usage would decrease.

It’s possible that a model with fewer parameters may end up “loading” more concepts into a smaller number of dimensions, which makes those dimensions hard to interpret. Techniques like sparse auto-encoders (SAEs) actually “explode” the dimensionality of a model, suggesting that in some cases, interpretability may be easier with more dimensions (though there’s ongoing debate about the utility of SAEs themselves).

We don’t know exactly how many parameters are in GPT-4, but some estimates are as high as 1.7 trillion. GPT-5 is a “multi-model system”, so each model in that system would have an associated number of parameters; even if we knew the parameter count of each constituent model, it’s not clear to me the best way to estimate the actual “parameter count” of GPT-5 itself.

The full post contains the assumptions needed to get this estimate, which include: the number of operations a person could perform per minute and the number of operations you’d need to perform. Obviously this is all back-of-the-napkin math, but my sense is that the order of magnitude seems roughly right. As I mentioned in the original post, if anyone disagrees with that estimate, let me know and I can correct it.

This is, in part, why leaderboards for benchmarks like ARC-AGI sometimes display performance relative to the compute costs required to solve a given task.

Or technically, the surprisal.

As described in that linked article, this is sometimes called the “lottery ticket hypothesis”. From the abstract:

Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation— reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.

Technically, no “learning” occurs through weight updates—which is one reason for the popularity of the technique.

The logic here was that more “confident” heads tended to focus their attention on particular tokens, rather than distributing it across multiple tokens.

Another 2019 paper (Kovaleva et al., 2019) conducted an even more comprehensive analysis of putative head functions and attention patterns in BERT.

You could also do this in reverse, e.g., by starting with 100 heads and iteratively removing the heads that matter the least.

As described above, this won’t necessarily yield the optimal configuration of k heads, because the solution is path-dependent. It’s possible the best 6 heads are different than the best 5 heads plus the head that’s best to add next.

A fact that’s sometimes glossed over in contemporary discussion of the scaling hypothesis is that the original scaling laws paper actually showed some evidence of diminishing returns to model size. That doesn’t necessarily mean scaling has diminishing returns in practice, just that the function relating performance to model size increases at a sub-linear rate. Of course, diminishing returns are still returns!

Language models and language change: recent evidence

Sean Trott — Fri, 29 Aug 2025 18:43:46 GMT

Back in July of 2022, I wrote a post speculating about whether the introduction of large language models (LLMs) into communications technology could end up shaping language itself.

The reasoning was straightforward: the more we allow our linguistic choices to be determined (or at least influenced) by an LLM, the more we might start to internalize those choices, such that they become the “default” words or grammatical constructions we turn to in a given context. This hypothetical scenario is importantly distinct from a scenario in which a higher proportion of the linguistic content we encounter is LLM-generated. I’m interested in that as well (and its relationship to ideas like Dead Internet theory), but I’m particularly drawn to the question of whether our internal sense of language will change. At that point, we can refine the question further: will LLMs homogenize our language or fracture it into billions of tiny idiolects?

Since I wrote that post, we’ve seen more than a few technological shifts: ChatGPT was released in late 2022, followed by other chat-based models like Claude; we’ve also seen the rise of “LLM agents”, practices like “vibe coding” and warnings about students cheating their way through college with ChatGPT; and finally, I can report that at least anecdotally, I encounter a great deal of content that has the “flavor” of being LLM-generated.

In the meantime, I’ve maintained an interest in that original question. I’m planning to write a longer academic piece about this soon, so I’ve tried to stay abreast of the literature on LLM-generated text, and I’m still working on connecting it to longstanding theories of technology-induced language change. In this post, I wanted to provide a brief review of some of that relevant empirical evidence, as I imagine it’s a topic of interest to many readers of this newsletter.

Refining the research question

In order to figure out what evidence is relevant, it’s important to be clear about our research question(s). I’m interested in two broad questions here:

(1) Will language models change language in some way?
(2) How might language models change language?

Of course, no single study is capable of fully addressing either question.

For (1), the best we can hope for is to enumerate various preconditions of influence and ask which of them are satisfied. Briefly, we might expect an affirmative answer here to depend on at least three factors: first, the degree to which LLM usage permeates a particular sociolinguistic domain or multiple domains; second, the degree to which LLM-generated text is measurably distinct from human-generated text; and third, the extent to which LLM users are “primed” by the linguistic outputs of that LLM. That is, language change (at least in a particular domain of communication) depends on lots of people using LLMs in the first place, on LLMs producing content that’s somehow different from what someone might have produced in a counterfactual scenario without LLMs, and on users being primed by recent linguistic encounters (particularly those with LLMs).

For (2), we can start by formulating different hypotheses. I suggested two competing hypotheses in a previous post: the homogenization hypothesis proposes that widespread LLM usage will ossify certain linguistic practices in place, collapsing variance and preventing further change; the nova hypothesis proposes the opposite—namely, that widespread LLM usage will lead to a kind of “Cambrian explosion” of micro-languages, each finely tuned to the styles and practices of a given user (or a given LLM).1 Here we might look to a few sources of evidence to adjudicate between these hypotheses: first, whether most people use the same (non-customized) LLM or whether there are a multiplicity of models; and second, the extent of variance in LLM-generated content (particularly relative to analogous measures of human writing).

Relevant pieces of evidence for addressing the question of whether and how LLMs might change language. This list is not exhaustive!

None of these pieces of evidence are easy to come by, but they’re critical for informing the bigger-picture questions about whether and how LLMs might change language. Measuring them also requires making certain assumptions about the relevant constructs, e.g., about how to assess “variance” in writing and what counts as an appropriate sample of human-generated and LLM-generated text.

In this post, I’m going to narrow my focus further to a subset of these pieces of evidence: first, on the alleged distinctiveness of LLM-generated text; and second, on the relative variance in LLM-generated output. There is actually good evidence for priming in human communication, but I think that’s a topic for another post—as with the factors relating to actual usage of LLMs.

As I discuss the evidence, there’s a big caveat I need to point out up front: any conclusions we draw from empirical results depend on the representativeness of both the human-written text and LLM-generated text in question. I’m increasingly skeptical of the idea that either of these are helpful unitary constructs: each human contains multitudes, and there are also lots of different humans. Similarly, there are many “LLMs” on the market, and each LLM can be prompted in numerous ways that might elicit different linguistic behaviors. I’m still not sure how to square these issues with the broader goal of, say, identifying signatures of LLM-generated text—it’s possible the best we can do is keep them in mind and limit our conclusions accordingly. I’ll do my best to be clear about what sample is being compared in each study and how constructs like variance are being assessed.

With that in mind, let’s turn to the evidence.

LLM-generated output is (somewhat) distinctive

The rise in LLM usage has, for a number of reasons, led to considerable interest in developing methods to detect LLM-generated text. As I’ve written before, I think this is a hard problem—there’s no reason in principle that LLM-generated text has to be distinguishable from human-written text. But even though I’m concerned about the actual application of these methods, this work has been helpful from the perspective of basic research: it turns out there are various signatures of LLM-generated text, some more interpretable than others.

I described some of these potential signatures in a recent post. Systems like DetectGPT rely on a notion called curvature, which the authors defined as follows:

This paper poses a simple hypothesis: minor rewrites of model-generated text tend to have lower log probability under the model than the original sample, while minor rewrites of human-written text may have higher or lower log probability than the original sample.

That is, because of the way applications sample tokens from LLMs, LLM-generated text will generally reflect something like a local maximum in probability-space: the observed sequence is more probable than other counterfactual sequences. The claim is that this is more true—at least in the samples used—for LLM-generated text than human-written text. Focusing on fake news articles generated by GPT-NeoX, the authors find that their method distinguished these articles from the original human-written articles with relatively high success.2
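The core quantity can be sketched as "log probability of the original text minus the average log probability of lightly rewritten versions". The code below is a toy illustration of that idea only: the unigram "model" and the word-swap rewriter are hypothetical stand-ins, whereas the actual method scores text with an LLM and generates rewrites with a separate model.

```python
import math
import random
from typing import Callable, Dict, List

def perturbation_discrepancy(
    tokens: List[str],
    log_prob: Callable[[List[str]], float],
    perturb: Callable[[List[str]], List[str]],
    n_perturbations: int = 100,
) -> float:
    """log p(original) minus the average log p of lightly rewritten versions."""
    original = log_prob(tokens)
    rewrites = [log_prob(perturb(tokens)) for _ in range(n_perturbations)]
    return original - sum(rewrites) / len(rewrites)

# Toy stand-ins (hypothetical): a unigram "model" and a word-swap rewriter.
UNIGRAM: Dict[str, float] = {
    "the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.1, "zebra": 0.05, "pondered": 0.05,
}

def toy_log_prob(tokens: List[str]) -> float:
    return sum(math.log(UNIGRAM[t]) for t in tokens)

rng = random.Random(0)

def toy_perturb(tokens: List[str]) -> List[str]:
    # Replace one randomly chosen position with a random vocabulary word.
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(list(UNIGRAM))
    return out

likely_text = ["the", "the", "the", "cat"]            # near a peak of the toy model
unlikely_text = ["zebra", "pondered", "zebra", "mat"]  # far from any peak

d_likely = perturbation_discrepancy(likely_text, toy_log_prob, toy_perturb)
d_unlikely = perturbation_discrepancy(unlikely_text, toy_log_prob, toy_perturb)
print(d_likely, d_unlikely)
```

Text sitting near a peak of the scoring model gets a positive score, because almost any rewrite makes it less probable; text far from a peak does not. The detection claim is that LLM-generated text tends to look like the first case when scored by the model that produced it.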

GPT-NeoX isn’t exactly state-of-the-art at this point. A more instructive example might come from recent work by some of my colleagues applying the curvature method to text generated by various LLMs in the Turing Test. Here, accuracy was about 69%, and all of the LLMs compared (including GPT-3.5 and GPT-4, both of which were prompted in multiple ways) had higher curvature on average than humans in the sample. Notably, this is better performance than the human judges in the original paper, as well as a separate pool of humans judging the conversations in an “offline” task.

Most importantly for our purposes, these results are consistent with the idea that LLM-generated text is “distinctive”: there exists at least one measure that distinguishes LLM-generated text from human-written text—in at least two different communicative settings (news articles and chat messages).

It’s also consistent with the original empirical analyses I presented in that same post. I found that words in LLM-generated passages were more predictable—as measured by independent open-source language models—than words in human-written passages. Words in LLM-generated passages also exhibited less variance in their predictability:

Left: mean surprisal of words in passages generated by students, ChatGPT-3, and ChatGPT-4. Higher values indicate more “surprising” words overall (i.e., less predictable). Right: standard deviation of surprisal for words in passages across conditions. Here, higher values indicate more variance in the predictability of words.
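For readers who want to see how the two measures in that figure are computed, here is a minimal sketch. The per-word probabilities are invented for illustration; in practice they would come from a language model scoring each word in context.

```python
import math
import statistics
from typing import List

def surprisals(token_probs: List[float]) -> List[float]:
    """Surprisal in bits: -log2 of the probability a model assigned to each word."""
    return [-math.log2(p) for p in token_probs]

# Made-up per-word probabilities, purely for illustration.
passage_a = [0.5, 0.25, 0.5, 0.25]      # consistently predictable words
passage_b = [0.5, 0.03125, 0.5, 0.125]  # a mix of predictable and surprising words

for name, probs in [("A", passage_a), ("B", passage_b)]:
    s = surprisals(probs)
    # Mean surprisal (left panel) and its standard deviation (right panel).
    print(name, statistics.mean(s), statistics.pstdev(s))
```

Passage A has both lower mean surprisal and lower variance in surprisal than passage B, which is the pattern reported above for LLM-generated relative to human-written passages.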

That said, in the Turing Test setting, average predictability was less effective at identifying LLM-generated messages (about 62% accuracy); and perhaps more crucially, messages generated by GPT-4 with the best-performing prompt were statistically indistinguishable from human-generated messages in terms of average predictability. That suggests that curvature might be a better indicator of synthetic text than outright predictability.

Another interesting piece of evidence comes from searching for specific linguistic features: beyond the average predictability of words, do LLMs choose certain kinds of words or certain kinds of grammatical constructions? One 2024 paper measured the different “syntactic templates” used by multiple LLMs, including closed models like GPT-4o and open models like OLMo. Text was generated from each model using multiple approaches, from entirely open-ended elicitation (e.g., prompting the model with a start token and sampling subsequent tokens) to more targeted tasks (e.g., summarization); the authors used human-written summaries as a baseline.

A “syntactic template” can be thought of as a higher-order abstraction describing some sentence. While two sentences might contain totally different words, they might correspond to the same parts of speech in the same order—and thus make use of the same “template”. That would be true, for instance, of the following two phrases:

a romantic comedy about a corporate executive
a humorous insight into the perceived class

According to a standard syntactic parser, both phrases use the “DT JJ NN IN DT JJ NN” template, which stands for (roughly) “determiner + adjective + noun + preposition + determiner + adjective + noun”. The authors used a parser to identify different templates in LLM-generated (and human-written) text; these were defined as a sequence of length n3 of part-of-speech tags that occurred some minimum number of times in a corpus. Then, they calculated the template rate for each model/task (the proportion of documents with one of those top templates), as well as something called the compression ratio: a measure of how much the set of part-of-speech sequences in a given corpus could be compressed. (The intuition here was that more compressible corpora entailed greater redundancy between sequences.)
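Both measures can be sketched on a toy, hand-tagged corpus. The code below is an illustration of the general idea and not the authors' pipeline: the tags are written by hand rather than produced by a parser, the thresholds are arbitrary, and `zlib` is used as a stand-in compressor.

```python
import zlib
from collections import Counter
from typing import List, Set, Tuple

def pos_ngrams(tags: List[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def find_templates(docs: List[List[str]], n: int, min_count: int) -> Set[Tuple[str, ...]]:
    """POS n-grams that occur at least `min_count` times across the corpus."""
    counts = Counter(g for tags in docs for g in pos_ngrams(tags, n))
    return {g for g, c in counts.items() if c >= min_count}

def template_rate(docs: List[List[str]], templates: Set[Tuple[str, ...]], n: int) -> float:
    """Proportion of documents containing at least one template."""
    hits = sum(any(g in templates for g in pos_ngrams(tags, n)) for tags in docs)
    return hits / len(docs)

def compression_ratio(docs: List[List[str]]) -> float:
    """Original size divided by compressed size of the POS-tag sequences."""
    raw = "\n".join(" ".join(tags) for tags in docs).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

# Pre-tagged toy corpus (tags written by hand for illustration).
docs = [
    "DT JJ NN IN DT JJ NN".split(),  # "a romantic comedy about a corporate executive"
    "DT JJ NN IN DT JJ NN".split(),  # "a humorous insight into the perceived class"
    "PRP VBD RB".split(),
]
templates = find_templates(docs, n=4, min_count=2)
rate = template_rate(docs, templates, n=4)
print(len(templates), rate)
```

Here two of the three documents contain a repeated 4-tag sequence, so the template rate is 2/3. A corpus made of many copies of the same tag sequence would also compress far better than a varied one, which is what the compression ratio is meant to capture.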

The authors report a number of results, but two are most relevant to the discussion here. First, models generating open-ended text produced a lower rate of templates overall than models generating summaries: that is, the task mattered. Second, focusing specifically on the summarization task, models produced a much higher rate of templates than humans when summarizing movies—independent of the value of n used to define templates; this gap was reduced when analyzing biomedical summaries (the Cochrane dataset), possibly because the human-written summaries involved more templatic language in general.4

Another recent paper asked whether certain words have increased in frequency since the release of ChatGPT, above and beyond what you’d expect from extrapolating their pre-ChatGPT trends. Focusing on scientific abstracts in the PubMed database, they calculated a measure called “excess words”, which reflects the difference between the actual number of times a given word appeared in a given year and the expected number of times that word “should” have appeared that year; the latter measure was derived by extrapolating a linear trend from previous years. This allowed the authors to determine which words occurred more times post-ChatGPT than we would’ve expected in a world without LLMs.5 Some of the words they found are often cited as ChatGPT “tells”, such as delves or underscores; other examples included potential and crucial. Beyond suggesting that LLM-generated text might be distinctive, these results are also consistent with the idea that the introduction of a new technology (ChatGPT) correlates with a change in the composition of words we encounter. Moreover, other recent work suggests that this fact isn’t limited to scientific abstracts: a study of podcast transcripts and YouTube academic talks points at similar trends in scientific communication.
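The "excess words" logic amounts to fitting a trend to pre-ChatGPT years, projecting it forward, and subtracting. Here is a minimal sketch using an ordinary least-squares line; the counts are invented, and the exact fitting procedure and years used in the paper may well differ.

```python
from typing import Dict

def linear_extrapolate(history: Dict[int, float], target_year: int) -> float:
    """Fit a least-squares line to (year, value) pairs and project to target_year."""
    years = list(history)
    values = [history[y] for y in years]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values)) / sum(
        (x - mean_x) ** 2 for x in years
    )
    return mean_y + slope * (target_year - mean_x)

# Invented counts per 10,000 abstracts for a single word; not real PubMed data.
pre_llm_counts = {2019: 10.0, 2020: 11.0, 2021: 12.0, 2022: 13.0}
observed_2024 = 40.0

expected_2024 = linear_extrapolate(pre_llm_counts, 2024)
excess = observed_2024 - expected_2024
print(expected_2024, excess)  # 15.0 25.0
```

In this made-up case the word was growing by one occurrence per year, so the counterfactual expectation for 2024 is 15; observing 40 leaves an excess of 25 that the pre-existing trend cannot account for.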

A final piece of evidence regarding the distinctiveness of LLM-generated text comes from this very recent (May 2025) paper exploring whether people can accurately distinguish human-authored vs. synthetic text. The authors recruited 9 annotators and asked them to read 30 human-written and 30 LLM-generated articles, indicating for each text whether they believed it was written by a human or LLM.6 Of these 9 annotators, five frequently used LLMs for writing tasks, while the other four had only limited experience. The authors found that the four with limited experience achieved an average accuracy of 56.7%—roughly random chance—though they nonetheless reported high confidence in their decisions. The five that frequently used LLMs achieved an average accuracy of 92.7%; moreover, the majority vote of these five achieved almost perfect performance (99.9%), which was considerably better than any of the automatic detection methods tested. Expert annotators also retained high accuracy in the face of various adversarial methods, including paraphrasing the LLM-generated text and attempting to “humanize” it (which involved explicitly instructing an LLM to avoid certain “tells”).

This is actually consistent with work from the Turing Test paper I mentioned earlier: although humans performed quite poorly on average, humans with more knowledge about LLMs and more experience playing the game performed better. Both results point to the conclusion that there’s something about interacting with LLMs over time that hones one’s detection ability. This suggests that LLM-generated text really might have its own signature, but it’s hard to identify it without extensive experience.

So what is it that those expert annotators are doing? Fortunately, the authors also asked each annotator to explain how they made their decision. These explanations are helpfully summarized in Table 3 of the paper. Here, two things stood out to me: first, expert annotators (unlike non-experts) seemed aware of which words in particular were overused by LLMs (like “testament” or “crucial”), and used those words as clues; and second, they pointed to consistent grammatical clues, i.e., predictable structural patterns (e.g., “not only …. but also …”)7.

We started this section by asking whether LLM-generated text is truly distinctive. For me, this evidence points to a tentative “yes”: at least right now, text generated by an LLM using a range of elicitation methods can be distinguished from human-authored text at a reasonably reliable rate; distinguishing factors include the relative predictability of LLM-generated passages, as well as the use of specific words (like “delves”), and syntactic templates. It is entirely possible that future LLMs or different ways of prompting current LLMs will produce text that is impossible to distinguish from human-authored text. It is also worth noting (again) that the very construct of “humanlike text” is a fraught one: there is, after all, no average human. But in the context of speculating about the practical and societal consequences of LLM usage, I think the evidence suggests at least one of the preconditions I mentioned earlier could be met.

Homogenization: a thought experiment

There are at least two ways that widespread usage of LLMs could change language: first, people might converge on a more similar set of expressions (i.e., the homogenization hypothesis); alternatively, each individual person might “drift” further and further from other people in terms of their linguistic idiosyncrasies, relying increasingly on LLMs to “translate” what they mean to interlocutors (i.e., the nova hypothesis8).

One way to get traction on this question is to ask whether LLM-generated text exhibits more or less variance (or diversity) than human-authored text. There are all sorts of ways one could operationalize “variance”, but I imagine most readers will have some sort of intuition for the underlying construct: different humans write in different ways, and an (arguably) important part of becoming a good writer is “finding your voice”; LLM-generated text, by contrast, is sometimes described as “slop” or “filler”, implying a kind of contentless homogeneity. But is it actually the case that LLM-generated text as a general category necessarily exhibits less variance than human-authored text?

Before I turn to the evidence, I’ll start with a thought experiment meant to complicate our intuitions.

Suppose that we compared a text corpus authored by millions of individuals to a text corpus (of the same size) authored by a single person: our intuition is that the million-person corpus probably contains more variance—both in the breadth of topics and the diversity of linguistic styles. At first glance, this might seem analogous to the case of comparing a bunch of documents written by different humans to a bunch of documents written by the same LLM. Clearly, we might say, the LLM-generated text will exhibit less variance.

The first complication concerns the scope of this inference. The intuition above rests on the assumption that the million-person corpus was produced by a greater variety of generative processes, whereas the LLM-generated corpus was produced by a single process. That means the intuition doesn’t necessarily extend to comparing LLM-generated text to documents written by a single person: the inference only holds when comparing text produced by different people to text produced by the same LLM.

Second, and more importantly, it’s not actually clear that there’s a direct correspondence between a single-person corpus and an LLM-generated corpus. An LLM has ingested many more words than a single person; although these words are a biased sample of the remarkable variety of human languages, an LLM trained on billions or trillions of words has probably still encountered a greater variety of writing styles and perhaps languages than a random person. Further, there are many ways one could prompt an LLM, and also many ways one could sample tokens from the probability distribution (e.g., adjusting the temperature, using beam search, etc.). Once we take these additional factors into account, our intuition might be less clear.

Third, it’s conceivable that even a million-person corpus would be relatively constrained in terms of its topics and linguistic styles. Users of a specific forum, for instance (like Reddit), might converge on similar world models, similar interests, and similar ways of using language. My intuition is still that there’d be more variability between two posts sampled from different users than between two posts generated by the same LLM, but it also feels like there could be many exceptions to this.

All this is to say: our intuitions about LLM-generated text and homogeneity might not necessarily be valid, and we should do our best to ground our beliefs in the evidence—while still acknowledging the real limitations in empirically measuring these sorts of things.

Homogenization: the evidence

Based on my read of the emerging research literature, most (but not all) of the research seems to support the homogenization hypothesis (as opposed to the nova hypothesis).

One relevant piece of evidence comes from a 2024 paper comparing writing produced with and without assistance from LLMs. Specifically, the authors looked at GPT-3 and InstructGPT: relatively “early” models, and less performant than GPT-4 or GPT-5, but state-of-the-art as of 2023 (when this work was done). They recruited participants online and asked them to write argumentative essays on different topics; subjects wrote the essays on the CoAuthor platform, which allows users to obtain text continuations contingent on text they’ve already written (which they can accept or reject). Subjects were assigned either to a control group (where they were not given LLM assistance) or one of two treatment groups, which allowed them to receive suggestions from either GPT-3 (a “base” model) or InstructGPT (a model fine-tuned on human feedback).

The authors then calculated a “homogenization score” for each condition, which was based on the textual similarity between all essays produced for a given condition/topic by different subjects.9 On average, essays written with assistance from InstructGPT showed significantly more homogenization than essays written without any assistance or, interestingly, essays written with assistance from GPT-3 (the “base” model). Further, essays written with assistance from InstructGPT also involved more repetition of the same sequences: that is, two randomly sampled essays from the InstructGPT condition were more likely to contain the same 5-word sequence than were two randomly sampled essays written without LLM assistance.
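To make the two measures above a bit more concrete, here is a minimal sketch of how one might compute them. To be clear, this is not the authors’ code: I’m using Jaccard overlap between word sets as a stand-in for the similarity metrics they actually used, and the function names and toy essays are invented purely for illustration.

```python
from itertools import combinations

def ngrams(text, n):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def homogenization_score(essays, n=1):
    """Mean pairwise similarity over all pairs of essays in one condition.

    Assumption: Jaccard overlap of n-gram sets stands in for the paper's metrics.
    """
    pairs = list(combinations(essays, 2))
    return sum(jaccard(ngrams(x, n), ngrams(y, n)) for x, y in pairs) / len(pairs)

def shared_ngram_rate(essays, n=5):
    """Fraction of essay pairs that share at least one n-word sequence."""
    pairs = list(combinations(essays, 2))
    return sum(1 for x, y in pairs if ngrams(x, n) & ngrams(y, n)) / len(pairs)

# Invented toy "essays" (not data from the study).
assisted = [
    "social media has a profound impact on society today",
    "i think social media has a profound impact on us all",
    "social media has a profound impact on how we talk",
]
unassisted = [
    "my cousin quit every app last spring and seems happier",
    "honestly the feeds are mostly ads and arguments now",
    "we organize our book club through a group chat",
]

print(homogenization_score(assisted), homogenization_score(unassisted))
print(shared_ngram_rate(assisted), shared_ngram_rate(unassisted))
```

In this toy example the “assisted” essays all reuse the same opening phrase, so every pair shares a 5-word sequence and the pairwise similarity is higher. The real analysis works on full essays, but the logic of comparing all pairs within a condition is the same.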

There are some clear limitations to this work (like the fact that the models used are now outdated, and the fact that the CoAuthor interface seems pretty different from contemporary LLM usage), but one interesting conclusion is that there’s something about fine-tuning the model specifically that leads towards homogenization.

In fact, there’s a growing body of evidence that techniques like reinforcement learning from human feedback (RLHF) and instruction-tuning do actually reduce variance in LLM-generated text. These are methods used during “post-training” (i.e., after an LLM has already been “pre-trained” on a large corpus of text) to make the LLM better at following directions and less likely to produce undesirable behaviors; for example, RLHF might “teach” an LLM to avoid producing sentences that a human would find hateful or otherwise offensive.

Conceptually, I can see how these techniques as currently applied would reduce diversity in LLM-generated text. Suppose we imagine a “base” model in terms of an attractor landscape: there are different valleys or basins that the model might “visit” depending on the input, and the overall shape of this landscape is determined by its pre-training. Now suppose we apply RLHF to this model, which encourages it to produce certain desirable behaviors and avoid undesirable behaviors. Intuitively, this might make some regions of the landscape much more “attractive” (the model will visit them more often) and others much less “attractive” (the model will visit them less often); some regions might even disappear entirely (what Janus calls “mode collapse”). In other words, the RLHF process exaggerates existing structural variability in the landscape. This seems to match up with the evidence: instruction-tuned models show less entropy in their outputs (i.e., a sharper, less uniform probability distribution over tokens).
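For readers who want a concrete handle on “less entropy”, here is a small sketch comparing two next-token distributions. The numbers are invented: `base_logits` and `tuned_logits` are hypothetical scores over a five-token vocabulary, chosen only to illustrate a flatter versus a sharper distribution, not taken from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; higher means a flatter distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits over the same five candidate tokens.
base_logits = [2.0, 1.5, 1.0, 0.5, 0.0]     # relatively flat
tuned_logits = [5.0, 1.5, 0.0, -2.0, -4.0]  # one continuation dominates

base_entropy = entropy(softmax(base_logits))
tuned_entropy = entropy(softmax(tuned_logits))
uniform_entropy = entropy([0.2] * 5)  # the maximum possible for five tokens

print(base_entropy, tuned_entropy, uniform_entropy)
```

The “tuned” distribution puts almost all of its probability on one token, so its entropy is much lower than that of the “base” distribution, which in turn is lower than the uniform case.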

This, in turn, could explain all sorts of other empirical observations about text generated by RLHF-tuned models. Compared to base models, RLHF-tuned models use a smaller variety of proper names, nationalities, ages, and personality types. Similarly, other work on relatively state-of-the-art RLHF-tuned models (like GPT-4) has found that the stories they write seem to “recycle” a larger number of plot devices than stories sampled from human authors (as far as I know, this study did not assess stories produced by “base” models).

Now, it’s entirely possible some of this apparent homogenization is, as they say, a “skill issue”: presumably, more effective prompts could elicit a greater range of styles, plot contrivances, and proper names from an RLHF-tuned LLM. But the fact remains that the underlying probability distribution really does seem less entropic in RLHF-tuned LLMs; to me, that suggests that while better prompts might elicit more diversity, text produced by current RLHF-tuned models will still be relatively more homogeneous than text produced by their “base” equivalents.10

This isn’t necessarily a bad thing! Presumably, we’re not simply looking for the most entropic distribution—if we wanted that, we could just sample words at random from the vocabulary. Indeed, raising the “temperature” in the sampling process does make text a bit more novel, but it also makes it less coherent. Ideally, we’d avoid homogenization while also ensuring that the text makes sense.
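Here is a rough sketch of what “raising the temperature” does mechanically: the model’s scores are divided by the temperature before being turned into probabilities. The vocabulary and scores below are made up, and the sketch says nothing about coherence; it only shows how the sampled tokens spread out as temperature rises.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def with_temperature(logits, temperature):
    """Divide scores by the temperature, then normalize.

    Low temperature sharpens the distribution; high temperature flattens it.
    """
    return softmax([x / temperature for x in logits])

# Hypothetical five-token vocabulary and scores (illustration only).
vocab = ["the", "a", "this", "one", "every"]
logits = [2.0, 1.5, 1.0, 0.5, 0.0]

rng = random.Random(0)  # fixed seed so runs are repeatable

def sample(temperature, k=1000):
    """Draw k tokens from the temperature-adjusted distribution."""
    probs = with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=k)

cold = sample(0.1)    # nearly always picks the top-scoring token
hot = sample(100.0)   # close to sampling uniformly at random
cold_share = cold.count("the") / len(cold)
hot_share = hot.count("the") / len(hot)

print(cold_share, hot_share)
```

At very high temperatures the distribution approaches the “sample words at random” extreme mentioned above, which is why more novelty tends to come at the cost of sense.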

Moving forward, a major open question in this space is just how constraining this “attractor landscape” is, and just how successful different prompts or contexts can be in eliciting diverse behavior. For instance, a recent preprint suggests that providing sufficient context erodes some of the previously observed differences in textual variance between LLM-generated and human-authored text.

So will LLMs change language? And if so, how?

It’s impossible to answer either of these questions with absolute certainty.

To the first, however, we can already point to some positive examples, such as changes in the vocabulary composition of scientific abstracts and podcast transcripts. Further, LLM-generated text does appear to be at least somewhat discriminable, both by automated metrics and by humans with extensive experience interacting with LLMs. I didn’t discuss evidence for the other preconditions I proposed: the degree of LLM usage and whether human language use is reliably primed by linguistic inputs. But ChatGPT usage does seem relatively widespread (and growing), which suggests that the first precondition may well be met, at least in certain domains; and as I noted earlier, there’s good evidence that the words and grammatical constructions humans deploy are influenced by the words and grammatical constructions they encounter.11

The second question, I think, is harder to address, and the answer is even more contingent: how many customized LLMs do we expect to be in circulation in 1-2 years, and just how variable can the outputs of even a single LLM be? Right now, my sense is that the current equilibrium points towards homogenization: most people seem to use one of a handful of models, and (with some exceptions) mostly don’t seem to be invested in explicitly trying to elicit diverse linguistic behaviors from those models. And since there is good evidence that these RLHF-tuned models exhibit less variance, my takeaway is that the “default” scenario is more likely to be one of homogenization than what I’ve called the “nova” hypothesis. Put another way: if this were a betting market, I’d bet that we’d see more evidence for homogenization or convergence than fracturing over the next few years.12

Yet if there’s anything I’ve learned in the last 4-5 years, it’s that technological changes can take you by surprise—so I wouldn’t bet too much on this outcome.

In retrospect, one hypothesis I might’ve overlooked is what I’ll call the polarization hypothesis: the idea that people will begin using language that’s unlike LLM-generated text (like avoiding the em-dash).

They report an AUROC of .95, which means (roughly) that if you sampled one human-written passage and one LLM-generated passage at random, the method would correctly rank them (i.e., score the LLM-generated passage as more likely to be LLM-generated) about 95% of the time.
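That “ranking” interpretation can be computed directly. The sketch below is my own illustration, not the authors’ code: the detector scores are invented, and I’m assuming higher scores mean “more likely LLM-generated”.

```python
def auroc(human_scores, llm_scores):
    """Probability that a randomly chosen LLM passage outscores a randomly
    chosen human passage; ties count as half a correct ranking."""
    wins = 0.0
    for h in human_scores:
        for l in llm_scores:
            if l > h:
                wins += 1.0
            elif l == h:
                wins += 0.5
    return wins / (len(human_scores) * len(llm_scores))

# Invented detector scores (higher = judged more likely to be LLM-generated).
human_scores = [0.1, 0.3, 0.35, 0.6]
llm_scores = [0.4, 0.7, 0.8, 0.9]

print(auroc(human_scores, llm_scores))
```

Here 15 of the 16 possible human/LLM pairs are ranked correctly; the one miss is the human passage scored 0.6 against the LLM passage scored 0.4. A detector that guessed at random would land around 0.5.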

The authors performed this analysis for multiple values of n, from 4 to 8.

Also relevant was their finding that, on average, LLMs produced more templates that were also identified in model training data than did humans.

The authors also validated this approach by testing it for other events that changed the frequency of certain words: for instance, the COVID pandemic led to (as you might expect) words like “pandemic” being used much more often than you’d expect simply by extrapolating pre-pandemic trends in their usage.

The authors conducted multiple studies, some of which contained texts generated by ChatGPT and some of which contained texts generated by Claude; the results were qualitatively similar in each study.

One annotator wrote:

"One pattern I’ve been noticing with AI, and I think I’ve stated this before, is the comparison of ‘it’s not just this, it’s this’ and I’m seeing it here, along with listings of specifically three ideas."

Thusly named for an analogous point made by Charles Taylor about religious choice in A Secular Age.

They used two different metrics here: ROUGE-L, which is based on the longest common subsequence of words shared by two texts; and BERTScore, which compares the similarity of contextual token embeddings from the BERT language model.

Once again, this doesn’t seem to me to be an in-principle kind of thing. It seems conceivable that model designers could implement RLHF in such a way that this kind of homogenization did not occur, though I could be wrong. Janus also reports an interesting tale from the early days of OpenAI:

Another example of the behavior of overoptimized RLHF models was related to me anecdotally by Paul Christiano. It was something like this:
While Paul was at OpenAI, they accidentally overoptimized a GPT policy against a positive sentiment reward model. This policy evidently learned that wedding parties were the most positive thing that words can describe, because whatever prompt it was given, the completion would inevitably end up describing a wedding party.
In general, the transition into a wedding party was reasonable and semantically meaningful, although there was at least one observed instance where instead of transitioning continuously, the model ended the current story by generating a section break and began an unrelated story about a wedding party.

An open question here concerns the longevity of these effects, since they’re typically established in a laboratory setting. There is some evidence that structural priming can persist for as long as a week, though I don’t know how robust that effect is. It also may not matter that much if LLM usage is sufficiently pervasive: that is, if people use LLMs multiple times every day, they’re repeatedly “primed” by LLM linguistic styles.

I think it’s an interesting question how one would operationalize this kind of question in terms of clear, concrete predictions; if any readers are interested, let me know!