In cautious defense of LLM-ology
It's useful to try to peek inside the black box, but it should be done rigorously.
In the span of just a few months, Large Language Models (LLMs) like ChatGPT have gone from being relatively fringe topics to dominating the news discourse.1
Much of the coverage focuses on what they can do. For example, ChatGPT achieved a passing grade on both medical and law exams. These articles typically include speculation about whether and when ChatGPT will replace human workers in various labor sectors, though sometimes they discuss how ChatGPT could be used to augment human performance––e.g., there are now tutorials on how to improve your programming skills with ChatGPT, as well as how to integrate ChatGPT into the classroom.
The most recent coverage has centered around Bing’s new chatbot (which is based on GPT), after an interview with a New York Times reporter went viral. These articles also discuss the model’s capabilities, but with a particular focus on the model’s “misbehavior” and what that might mean for the future of Artificial Intelligence.
I don’t have much to say about the economic implications of a model like ChatGPT; there are people much better-placed and better-informed than me to comment on this.
But I hope to provide a different angle: the question of how these models work in the first place––and why figuring that out is surprisingly hard.
The world before ChatGPT; or, the rise of BERT-ology
Back in the pre-ChatGPT days, one of the largest and most successful LLMs was called BERT.
BERT exploded onto the Natural Language Processing (NLP) scene in 2018. It used the now-ubiquitous “transformer” architecture, and performance-wise, blew other language models out of the water. By 2020, thousands of papers had been published using BERT for various NLP tasks; the latest count (as of March 2, 2023) is approximately 60K citations.
Much of that work was fairly standard for NLP: comparing different models on various “benchmarks” of model performance, e.g., part-of-speech tagging, semantic role labeling, and more.
But some of that work was aimed more at understanding how BERT worked. BERT, after all, was a very large language model––especially for the time––with over 100M parameters.2 Although BERT did very well on NLP benchmarks, it wasn’t entirely clear why. In the terminology of machine learning and computer science, BERT was a “black box”––and researchers wanted to look inside.
This field of work gradually acquired the somewhat tongue-in-cheek term BERT-ology, i.e., the “study of BERT”. In 2020, Anna Rogers and co-authors published a literature review explicitly aimed at summarizing some of the insights this research had yielded. There’s now even a page in the HuggingFace3 web documentation about BERT-ology specifically.
What did BERT-ology look like?
A now-classic 2019 paper by Ian Tenney and colleagues, “BERT rediscovers the classical NLP pipeline”, used a technique called “probing” to identify which kinds of information are encoded in various layers of BERT.
Some important bits of context:
LLMs like BERT are composed of various layers.
Each layer contains some number of “artificial neurons”––individual nodes that connect to nodes in successive layers with some particular connection strength (called a weight).
The “base” version of BERT has 12 layers, and the “large” version has 24 layers. This image from a tutorial on how BERT works summarizes the point nicely:
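To make those numbers concrete, here’s a minimal sketch (assuming the HuggingFace transformers library, which I mention again below, is installed) that loads the “base” version of BERT and counts its layers and parameters:

```python
# A minimal sketch: load BERT-base via HuggingFace transformers and
# confirm its layer count and (approximate) parameter count.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

num_layers = model.config.num_hidden_layers              # 12 for the "base" model
num_params = sum(p.numel() for p in model.parameters())  # roughly 110M

print(f"Layers: {num_layers}")
print(f"Parameters: {num_params / 1e6:.0f}M")
```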
End-to-end, these layers take an input and transform it into some predicted output, which (sometimes) enables BERT to solve a given NLP task. But how do they do this?
That was the central question of Tenney et al. (2019). To answer it, they trained a classifier––a model that uses various features to sort input into different categories (e.g., “nouns” vs. “verbs”)––on successive layers of BERT, and measured how well this classifier could recover different kinds of information. For example, how well could a classifier trained on only the first two layers recover information about a word’s part-of-speech? What about the first five layers? What about the first nine?
Interestingly, they found that syntactic features––such as part-of-speech––could be recovered successfully in early layers of BERT. In contrast, tasks that required semantic and pragmatic features––such as resolving a pronoun to its antecedent––relied more on later layers. The paper gets its title (“BERT rediscovers the classical NLP pipeline”) from the observation that this is also how many traditional NLP systems––i.e., those whose components were built by hand, as opposed to using a deep neural network––were constructed: syntax first, then semantics.
This technique came to be called classifier probing, or just “probing” for short, and established an important method for trying to peek inside the “black box” of LLMs.
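To give a flavor of what probing looks like in practice, here’s a deliberately tiny sketch using HuggingFace’s transformers and scikit-learn. It trains a logistic-regression probe on the activations of a single, arbitrarily chosen BERT layer to recover part-of-speech; the handful of hand-labeled examples and the layer choice are purely illustrative, whereas real probing studies like Tenney et al.’s use large annotated corpora, held-out test data, and careful controls.

```python
# Illustrative sketch of classifier probing: fit a simple probe on the
# activations of one BERT layer and see how well it recovers part-of-speech.
import torch
from transformers import AutoTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# A toy, hand-labeled dataset: (sentence, index of target word, POS label).
examples = [
    ("the dog chased the ball", 1, "NOUN"),
    ("the dog chased the ball", 2, "VERB"),
    ("birds sing in the morning", 0, "NOUN"),
    ("birds sing in the morning", 1, "VERB"),
    ("she reads a long book", 1, "VERB"),
    ("she reads a long book", 4, "NOUN"),
]

def word_vector(sentence, word_index, layer):
    """Return the target word's activation vector at a given layer."""
    enc = tokenizer(sentence.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**enc).hidden_states  # embeddings + one entry per layer
    token_pos = enc.word_ids(0).index(word_index)   # first sub-token of the target word
    return hidden_states[layer][0, token_pos].numpy()

layer = 3  # an "early" layer; vary this to mimic the layer-by-layer comparison
X = [word_vector(s, i, layer) for s, i, _ in examples]
y = [label for _, _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Layer {layer} probe accuracy (on its own training data): {probe.score(X, y):.2f}")
```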
Why is BERT-ology necessary?
You might be wondering why this is necessary at all.
After all, an LLM is at its core just a large matrix of numbers (weights), which provide instructions about how to produce output from input––like a giant equation. And unlike the human brain, we know exactly what those weights are. We also know what data the model was trained on, and we know (more or less) which algorithm was used to optimize the network during its training. Why, then, do people think of LLMs as a “black box”?
The issue here really comes down to what it means to understand a system. What makes for a good explanation?
On one level, the weights are really all there is to it. But a matrix of numbers doesn’t tell us much on its own. We might know, vaguely, that these numbers reflect an optimized distribution of weights to help the model succeed in next-token prediction. But we don’t know how this arrangement of weights helps the model succeed in that task.
Knowing the weights does allow us to determine how the system will respond to a particular stimulus. That is, given the input “I like my coffee with cream and ____”, we can inspect exactly how much each artificial neuron in each layer was activated, and we can even derive a probability distribution over the tokens that are most likely to fill in the blank. For example, perhaps BERT considers “sugar” more likely than “salt”, and both more likely than “soap”.
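Here’s a rough sketch of what that looks like in code, using BERT’s masked-language-modeling head via HuggingFace transformers. The exact probabilities will depend on the checkpoint, and I’m assuming each candidate word is a single token in BERT’s vocabulary:

```python
# Sketch: inspect BERT's probability distribution over a masked ("blank") token.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("I like my coffee with cream and [MASK].", return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits[0, mask_pos], dim=-1)  # distribution over the vocabulary
for word in ["sugar", "salt", "soap"]:
    token_id = tokenizer.convert_tokens_to_ids(word)
    print(f"P({word}) = {probs[token_id]:.4f}")
```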
But even in this latter case, the weights themselves don’t necessarily allow us to generalize about how the system responds to a class of stimuli.
That final point is crucial. Explanations are a form of compression: a map is useless if it simply reconstitutes, atom by atom, every point in the original territory. Explanations must at some level generalize beyond the particulars of any given situation––ideally, they should identify the principles or laws governing the behavior of a system.
That’s where BERT-ology comes in. The goal of BERT-ology has always been to identify generalizable principles about how BERT works, typically relying on theoretical constructs and methodologies from neighboring domains, such as math, psychology, and linguistics.
From BERT-ology to LLM-ology
NLP moves fast.
In 2020, GPT-3 was released, representing yet another advance. And since 2020, various iterations of GPT-3 have been released, such as InstructGPT. The conversation has shifted from BERT and its siblings (RoBERTa, ELMo, etc.) to GPT-3 and the topic of LLMs more generally.
Hence, “LLM-ology”, which I’ll define as follows:
LLM-ology: a research program that studies how Large Language Models (LLMs) work “under the hood”. LLM-ology is distinguished by its focus on the internal mechanisms of LLMs (i.e., which computations underlie the transformation of input to output) as well as the comparison to humans or human populations (e.g., administering classic psychological tasks to an LLM).
It might also be useful to further sub-divide LLM-ology into what I see as two broad, mutually compatible approaches:
Behaviorist LLM-ology: analyzes LLM outputs (e.g., predicted text or probabilities assigned to upcoming tokens) in response to particular inputs with the goal of establishing overall system capabilities and limitations. In the ideal case, this is done with reference to some underlying theoretical framework, using carefully curated stimuli that allow the researcher to make inferences about what capabilities could in theory allow for specific behavioral responses. An analogy from Cognitive Science would be the field of Cognitive Psychology, and an example would be this recent PNAS paper, which is conveniently titled: “Using cognitive psychology to understand GPT-3”.
Internalist LLM-ology: analyzes LLM internal states (e.g., activations across units in a given layer) in response to particular inputs with the goal of establishing mechanistic explanations of how information passes through the system. An analogy from Cognitive Science would be the field of Neuroscience. Some of this work is inspired by work analyzing the mechanics of visual neural networks, e.g., Chris Olah’s work on interpretability. An example of this approach would be the already-mentioned Tenney paper, as well as this more recent paper by Atticus Geiger and others probing language models for causal abstractions.
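To make the “behaviorist” flavor concrete, here’s a minimal sketch of the kind of analysis it involves: comparing the probability a model assigns to two continuations of a carefully constructed minimal pair. I’m using GPT-2 here simply because its weights are publicly available; the stimulus and helper function are my own illustrative choices, not anyone’s published protocol.

```python
# Sketch of "behaviorist" LLM-ology: compare the log-probability a causal LM
# assigns to two continuations of a minimal-pair stimulus.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prefix, continuation):
    """Summed log-probability of `continuation` given `prefix`.

    Assumes the tokenization of `prefix` is a prefix of the tokenization of
    `prefix + continuation` (true for simple examples like the one below).
    """
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prefix_len, full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()  # prediction comes from the previous position
    return total

prefix = "The keys to the cabinet"
for continuation in [" are on the table", " is on the table"]:
    print(f"{continuation!r}: {continuation_logprob(prefix, continuation):.2f}")
```

A higher log-probability for the grammatical continuation would be one small piece of behavioral evidence that the model tracks subject–verb agreement across the intervening noun.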
As I noted above, these approaches are mutually compatible and complementary. In some sense they mirror (though not perfectly) Marr’s levels of analysis. For example, a researcher might conduct “behaviorist” LLM-ology to chart the boundaries of an LLM’s abilities, then use “internalist” methods to identify the mechanisms underlying these abilities.
But is this useful?
Not everyone is convinced. There are a few arguments against treating LLM-ology as a field of study worth pursuing.
One argument is that LLMs change too quickly and that the research we conduct on one LLM won’t generalize to other LLMs. BERT was considered “state of the art” in 2019, but in just four years has been supplanted by GPT-3––and GPT-4 is likely coming later this year. The concern is that if we spend all of our time and resources studying GPT-3, then even if we achieve something like understanding, that knowledge might be rendered obsolete as soon as GPT-4 is released (and so on for GPT-5, etc.).
Another perspective is that LLMs are not a “natural kind”. They are a tool built by humans to accomplish a task, and are therefore not particularly interesting to study in and of themselves. To draw a very rough analogy: we wouldn’t necessarily want to spend our time conducting experiments on toaster ovens, as toaster ovens don’t reflect something “real” and human-independent about the world.
A final argument is that this is simply too difficult. Even our best examples of LLM-ology don’t really add up to “understanding”––at best, they are explanations of a very specific, constrained set of behaviors, but not really of the system as a whole.4 Further, there are conceptual concerns with dominant methodologies like probing: for example, the fact that a probe can decode certain information (e.g., part-of-speech) from a given layer of an LLM does not imply that this information actually plays a functional role downstream. These issues raise the question of whether the insights we glean from approaches like probing are even insights at all.
These counterarguments seem even stronger in concert. If LLMs aren’t inherently interesting, then why should we spend time figuring out how they work––especially if that knowledge might be obsolete in a year, and especially if we may not even succeed in the first place?
Does LLM-ology even need to be defended?
This article is entitled “In cautious defense of LLM-ology”. So far I haven’t done much defending––I’ll get to that in the next section.
But first, I think it’s important to point out that LLM-ology is already happening. Further, it’s well-represented not just in academic research but also in industry: companies like Anthropic are specifically focused on developing a better understanding of LLMs.
From this perspective, then, defending LLM-ology feels a little bit like defending Goliath. The players involved already have clout. And admittedly, I’ve wondered if perhaps my energies are better spent attacking the worst offenders within LLM-ology rather than defending its merits.
Which brings me to the purpose of this essay: my goal is to defend what I find useful about the best kind of LLM-ology, and in doing so, hopefully enumerate some of the pitfalls I see in the less rigorous implementations of this work.
Why LLM-ology could be useful
I think the critiques of LLM-ology I raised above are all valid. But I also think there are some good arguments in its favor.
LLMs are increasingly prevalent, so we might as well understand them.
Language models have been deployed in consumer technologies for a while now; as I’ve discussed before, “auto-complete” tools are typically rooted in some kind of language model––a probability distribution over possible upcoming words, given the words you’ve just typed. There are also tools like Grammarly, which use language models (along with more traditional rule-based systems) to check your grammar and spelling and offer suggestions.
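For intuition, that “probability distribution over possible upcoming words” can be as simple as counting which words follow which in a corpus. Here’s a toy bigram sketch (a far cry from a modern LLM, but the same basic idea of predicting the next word):

```python
# Toy bigram "language model": estimate a distribution over the next word
# from co-occurrence counts, which is the basic idea behind auto-complete.
from collections import Counter, defaultdict

corpus = "i like my coffee with cream and sugar . i like my tea with lemon .".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_distribution(word):
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("with"))  # {'cream': 0.5, 'lemon': 0.5}
print(next_word_distribution("my"))    # {'coffee': 0.5, 'tea': 0.5}
```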
But LLMs like GPT-3 represent a significant advance––if only in terms of sheer size––beyond earlier auto-complete tools. Over the last few weeks, I’ve seen people propose a huge variety of uses, from writing (and fixing) code to offering medical advice. A recent essay argued that LLMs could mark the beginning of “auto-complete for everything”, which would ostensibly include writing essays about LLMs:
AI-based word processors will automate this boring part of writing – you’ll just type what you want to say, and the AI will phrase it in a way that makes it sound comprehensible, fresh, and non-repetitive. Of course, the AI may make mistakes, or use phrasing that doesn’t quite fit a human writer’s preferred style, but this just means the human writer will go back and edit what the AI writes.
An important question is whether these newer LLMs are ready for widespread deployment. Opinions differ: one view is that the best way to find out is to deploy them at scale (i.e., to “move fast and break things”); the other view is that the downside risks are too large and we should proceed with extreme caution (a view I associate with concerns about AI alignment).5
Either way, I think a discipline focused specifically on peeking inside the black box would be of use. For those who want to deploy LLMs immediately, LLM-ology provides a toolkit for unraveling what went wrong, and how, when something inevitably breaks. For those who favor caution, LLM-ology may provide the mechanistic justification for doing so––and help adjudicate disagreements about exactly what a given LLM is capable of and what’s beyond its ken.
(There’s a deeper question here––relevant mostly for the cautious among us––about the extent to which conducting LLM-ology perpetuates or gives ideological cover to the development of bigger, better (and possibly more dangerous) LLMs. I’ve seen versions of this view expressed by people concerned with AI alignment and also those concerned with more proximal harms of AI; I’m definitely sympathetic to the idea that we should be working to slow down developments in AI and adopt a more cautious approach. But this does suggest that LLM-ology, done “right”, exists in an awkward relation with groups actively building LLMs; I’m not sure what to do about this.)
LLMs may not be natural kinds, but they’re also not toaster ovens.
Earlier, I suggested that studying LLMs is akin to studying a toaster oven––with the implication that doing so is patently absurd.
But LLMs are clearly not like toaster ovens. Some people (not me, admittedly) know how toaster ovens work; and presumably, when the first toaster ovens were built, they were built with certain design principles and mechanisms in mind. If that wasn’t the case––if someone accidentally hooked some wires together and found themselves with some burnt toast––then I can only assume there was a mad rush to figure out what, exactly, was happening.
And so it goes with LLMs. On some level, we know how they work––we know (more or less) what they’re trained on; we know their weight matrix (at least for publicly available models); we know how they’re trained. But we don’t know how the behaviors we observe––the things that make us surprised, impressed, or even a little unsettled––emerge from all this training.
So while it’s true that LLMs are a tool built by humans, as opposed to something “natural” we’ve discovered, we don’t seem to understand them as well as most tools––which suggests, at least to me, that we ought to try.
LLMs are interesting socio-technical systems.
In graduate school, I read a paper called “How a Cockpit Remembers its Speeds”, by Ed Hutchins. It was my first exposure to the theory of distributed cognition (or “D-cog” for short).
Unlike traditional approaches to the mind––which locate intelligence in the brain––D-cog adopted a wider lens on cognition. A cognitive system could be a brain inside a body, but it could also be many bodies working together to solve a problem, or even the intelligent use of space.
I think this lens of analysis is helpful as we think about what, exactly, an LLM is. An LLM has been trained on orders of magnitude more words than a human sees within their lifetime, representing a remarkable diversity of language use (though strikingly homogeneous in other ways). LLMs may in turn be deployed in consumer applications, where they will perhaps be allowed to “learn” in novel contexts (i.e., change their weights according to the task at hand); this deployment may then lead to strange feedback loops, wherein humans are themselves “trained” by LLMs that have been trained on human output. LLMs may be used for propaganda or for productive classroom pedagogy.
In all these cases, LLMs sit at an interesting intersection of language, culture, and society more broadly. LLMs are a new kind of cognitive system––and so is any distributed system that makes use of an LLM, for better or for worse. And I think that’s also worthy of study.
LLMs may help us understand the human mind.
Lastly, LLMs can give us insights into how human cognition works.
Current LLMs are trained on linguistic input alone; they’ve never felt grass beneath their feet or tossed stones across a stream. They’re also not provided with any “innate” knowledge about how the world or even language works. This makes them particularly well-suited as baselines––a working model of just how much knowledge one could expect to extract from language alone.
This property happens to be incredibly useful for addressing many entrenched debates within Cognitive Science.
For example, do humans have an innate language module? Decades of debate haven’t really made much headway, but one view (initially put forward by Noam Chomsky) is that some amount of our linguistic knowledge is hard-wired––that the input we receive as children is simply too impoverished to account for how much we know. LLMs are a great way to put this hypothesis to the test: if LLMs exposed to a developmentally realistic amount of linguistic input do manage to learn things about language––e.g., syntax, some semantics––it suggests that the stimulus isn’t so impoverished after all.
A similar debate has played out with respect to Theory of Mind, the ability to reason about the mental states of others. For years, a dominant view was that Theory of Mind was both unique to humans (there’s now evidence for Theory of Mind in other great apes) and innate––i.e., humans have a biologically evolved module for seeing the world through the eyes of another. But another, opposing view is that Theory of Mind emerges at least in part from exposure to language itself––that learning words like “know” and “believe” helps bootstrap the process of reasoning about mental states. LLMs are a perfect way to test this language exposure hypothesis––and indeed, I recently wrote a paper with my colleagues doing exactly this. We found that GPT-3 did display sensitivity to knowledge states––though not to the same degree as humans did on the same task––which could be interpreted as evidence of a nascent Theory of Mind (or as a reductio of the task itself); importantly, this work allowed us to quantify exactly how far one can get from language alone––something that wouldn’t have been possible without LLMs.
There’s plenty of other work in this vein, and it’s a growing field. LLMs, when used in the context of a careful, rigorous experimental design, can provide insights into the human mind.
LLM-ology needs to be done well.
As someone who’s been working with LLMs for a few years now, the last couple of months have been a little surreal. I’ve seen article after article about ChatGPT and Bing’s chat-bot. This is to say nothing of the tweets, many of which include screenshots of interactions with ChatGPT.
These screenshots––and transcripts, quotes, etc.––often purport to demonstrate something about ChatGPT. Sometimes that’s a demonstration of ChatGPT’s remarkable capacities, e.g., the quotes from the NYT transcript. Sometimes it’s a demonstration of ChatGPT’s surprising incompetence, e.g., as in this extract of ChatGPT extolling the virtues of eating crushed glass.
If I’m being fair, I think these demonstrations do count as LLM-ology, at least of the behaviorist kind; at the very least, it’s very hard to draw a line in the sand distinguishing them from the papers I’ve discussed earlier in this post. Further, I’m on record as being a “duck tester”, which means I’ve got to accept the baggage of that view. This means that I––along with anyone else invested in empirical inquiries into how LLMs work––need to contend with the question of what counts as knowledge and when.
I think these demonstrations can be useful contributions to the discussion of what, exactly, ChatGPT can and can’t do. But I also worry about what seems to me a rather large gap between the empirical demonstrations people point to and the inferences they make about underlying capacities. And importantly, this is true regardless of whether someone is expressing skepticism about ChatGPT’s abilities or arguing that ChatGPT represents an Artificial General Intelligence.
Can a screenshot of a single interaction with ChatGPT be interesting or even informative? I think it can. However, it’s extremely unlikely that such a demonstration could ever provide definitive proof for or against a hypothesis (except the most narrow of hypotheses). But these one-off demonstrations are often treated as confirmation or disconfirmation of some underlying capability. The problem, then, lies in the mismatch between the magnitude of the evidence presented and the theoretical claims derived from that evidence. Or perhaps it’s even deeper: perhaps the problem is simply that the hypotheses are rarely spelled out in sufficient detail to know what would count as evidence for or against them.
This problem is not isolated to screenshots of ChatGPT conversations. All empirical science operates in a zone of uncertainty––including Cognitive Science. Perhaps most crucially, we don’t know whether we’ve divided the conceptual space of what cognition “is” in the appropriate way: do the set of theoretical constructs we’re interested in (e.g., Theory of Mind, language comprehension, working memory) map in a meaningful way onto the “true” division of the mind? And are the instruments we’ve developed to measure these constructs actually measuring what we’re interested in?6
We can’t know the answers to these questions. But at least when we’re doing things right, we ask the questions and work together––albeit sometimes in adversarial collaboration––to arrive at something closer to an answer than where we were before we started.
This, to me, is what LLM-ology done right looks like as well. It requires:
Being explicit about what we’re looking for––what capabilities, exactly, do we think we’re measuring?
Justifying our instrument––why is this the right way to measure that capability?
Proper scoping of our claims––which inferences or generalizations are we licensed to draw about the thing we’re interested in?
We won’t agree on the answers. But here’s the thing: the questions are going to surface one way or another––you can’t escape construct validity––and it’s helpful to address them head on, with a clear sense of what they mean and why you’re arguing about them in the first place.
LLM-ology moving forward: what’s the right model?
A question’s been gnawing at me throughout the whole post. What’s the right model for understanding how LLMs work? Is it the human mind––bounded, as in traditional cognitive psychology, by the skull/brain barrier––or is it something entirely more distributed?
So far, my analogies have focused on comparisons between LLMs and individual humans. Behaviorist LLM-ology compares an LLM’s responses to some stimulus to individual (or sometimes averaged) human responses to that very same stimulus; internalist LLM-ology deploys the tools and metaphors of neurophysiology, which were developed to pick apart the workings of a brain, or more typically individual neurons and neural circuits within a brain.
Both approaches assume on some level that the most appropriate analog for an LLM is an individual human agent. This is what’s inspired lines of research like assessing the political orientation or intelligence––or, indeed, the Theory of Mind––of LLMs. It’s also what’s inspired questions like: are LLMs sentient? But what if that’s wrong? What if we’re making a category error?
Another way of asking this question is to ask: what, specifically, is it about an LLM that makes us think the natural analog is an individual human mind? And further: what other kind of thing could an LLM be?
In answer to the first question, my gut response is that it has something to do with properties of the underlying stimulus: i.e., language. A transformer model (like GPT-3) could in principle be trained on any kind of training data, but there’s something about language that feels particularly human––and so we interpret the result of training such a network on language data to be analogous, in some way, to a human.
But what other kind of thing could an LLM be? And what might we learn by construing it as such? Here, the possibilities seem boundless. Is an LLM more like a collection of human agents––a chorus of voices, rather than a solo? (Which of course leads to the countervailing question: to what extent can a collection of human agents be interpreted as a mind-like system?) Is it like a society? Or a mycorrhizal network? Or is it more appropriate to construe an LLM like an object from physics––a dynamical system cycling through different states? Is an LLM like other human artifacts, and if so, which? Is it most like language itself? What kind of thing, for that matter, is language?
Ultimately, of course, an LLM is what it is: itself. But analogies are useful because they give us conceptual purchase, along with a suite of tools with which to better understand a thing. Right now, I’m (cautiously) excited about LLM-ology and I hope it matures into a science with both a strong theoretical framework and a rigorous empirical method. Yet I also hope that we avoid falling into the trap of anthropocentrism: there’s a lot to see in the world besides ourselves.
“Fringe” in terms of mainstream discourse, I mean. Plenty of people have obviously been discussing LLMs for some time.
In a neural network, a “parameter” is a learned value; most commonly, the weight on the connection between a neuron in one layer and a neuron in the next.
A popular Python library for using pre-trained language models.
The same point could be said about Cognitive Science more generally. But with Cognitive Science, there are other, independent reasons to focus on this object of study––people care about how minds and brains work, and the way in which they work matters not just for basic research but also for things like medicine. (The counter-counter argument here is that knowing how LLMs work is also important.)
My personal view is that the downside risks, while uncertain, are numerous enough––and the upside benefits comparatively small––that we should at the very least move slowly and preserve a “human in the loop” in most, if not all, applications.
Here, I’m drawn back to this Raji et al. (2021) paper about the same issue in AI “benchmarks” and the difficulty in knowing whether we’re assessing what we think we’re assessing.