Earlier this week, I gave a talk for the UC San Diego Linguistics department entitled: “Large Language Models as Model Organisms—Opportunities and Challenges”. It was a lot of fun: I was really grateful for the invitation to speak, and people asked great and thought-provoking questions. The talk covered some of the topics I write about here, like whether LLMs have a Theory of Mind and some of the challenges involved in assessing LLM capacities.
In my view, LLMs (and LLM-ology) are a huge opportunity for Cognitive Science and Linguistics. I think it’s possible we’re at the cusp of a paradigm shift.1 First, for research on human cognition, LLMs could serve as “model organisms” to test questions we can’t test in humans; and second, LLMs themselves are interesting subjects of scientific research. But while paradigm shifts are exciting, they’re also periods of methodological and theoretical instability. There are still many unanswered questions when it comes to LLM-ology. And crucially, these aren’t the kinds of questions that can be addressed in an experiment—they’re questions about the very epistemological foundations of this new field: how should LLM-ologists and cognitive scientists using LLMs go about learning things?2
What kind of thing is an LLM?
As I’ve written before, people use a variety of metaphors when talking about LLMs. I’ve seen LLMs described as “agents”, “crowds”, “stochastic parrots”, “tools”, and even “aliens”. And as I wrote in this post, this isn’t just a question of what language we use: how we construe LLMs matters for how we study them scientifically.
Much of what I’ve described as LLM-ology implicitly or explicitly treats LLMs as akin to individual human minds or brains. We assess their capacities using tests designed for individual humans, and we investigate their representations and internal mechanisms using methods (and assumptions) developed to study brains. This seems like a reasonable working assumption, but it’s entirely possible it’s the wrong assumption. It could be that we should study LLMs the way we study groups of people, or even the way we study physical systems like clouds or the ocean—i.e., without reference to their “representations”.
I don’t think this question has to be answered right away. In fact, I think we should avoid committing prematurely to any particular methodological approach. The field is young, and I think that calls for methodological and theoretical pluralism. But that means it’s something we need to think seriously about. If we want to study LLMs, we need to think about what kind of thing an LLM is, and thus how best to learn about that system.
The other two questions follow naturally from this fundamental uncertainty about what kind of thing we’re dealing with.
How do we measure what we want to measure?
Crucial to making a valid claim is construct validity: ensuring that we’ve operationalized the construct we’re interested in measuring in a fair way. Many debates in Cognitive Science center on construct validity: we might agree on the data but disagree about what those data represent.
And you can’t escape construct validity—certainly not in the study of LLMs. It’s still unclear to me how to determine whether we’re studying an LLM’s capabilities or mechanisms in the right way. For example, if an LLM passes tests developed to assess Theory of Mind in humans, does that mean the LLM has Theory of Mind? What kinds of theoretical or methodological principles can we use to guide our inferences here?
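To make that question a bit more concrete, here’s a minimal sketch (in Python) of what administering one such test to an LLM might look like. The query_model function is a stand-in for whatever API or local model you’d actually use, and the vignette and keyword-based scoring rule are purely illustrative rather than a real protocol; the point is just that every step involves an operationalization choice.

```python
# A minimal sketch of administering a False Belief Task item to an LLM.
# `query_model` is a placeholder for whichever API or local model you actually use;
# here it returns a canned response so the sketch runs end to end.
# The vignette and keyword-based scoring rule are illustrative, not a real protocol.

def query_model(prompt: str) -> str:
    """Placeholder: swap in a real API call (or local inference) here."""
    return "Sally will look in the basket, where she left her marble."

VIGNETTE = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble to the box. "
    "Sally comes back. Where will Sally look for her marble?"
)

def score_false_belief(response: str) -> bool:
    """Crude keyword scoring: credit answers that mention the original location
    ('basket') and not the new one ('box'). Every choice here is an
    operationalization decision in disguise."""
    text = response.lower()
    return "basket" in text and "box" not in text

answer = query_model(VIGNETTE)
print("Passed this item:", score_false_belief(answer))
```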
One way in which construct validity rears its head here is what I’ve previously called differential construct validity: the position that the same test doesn’t “mean the same thing” for humans vs. LLMs. Using the example above, perhaps the False Belief Task is a good indicator of Theory of Mind in humans3, but not in LLMs. This position has some intuitive appeal, perhaps in part because it lets us steer between the Scylla of attributing Theory of Mind to a language model on the basis of its test performance and the Charybdis of rejecting the test altogether. But I’ve yet to see a clear set of theoretical principles that indicate when differential construct validity is at play. Presumably some tests are appropriate to use in both humans and LLMs, and presumably some aren’t. Which ones are appropriate for which constructs, and, more importantly, how would we know?
A related issue is what I think of as “generalization”. Any time we test an LLM, it’s important to rule out the possibility that the test was included in its training data: a problem known as data contamination. But even if we’re confident the test itself wasn’t in the training data, it’s possible a similar test was. And because modern LLMs are so powerful, they’re good at identifying structural correspondences between examples, which means a skeptic can always suggest that good performance is “just pattern-matching” rather than the model “truly” displaying the underlying construct we’re trying to assess. I tend to think this critique is under-specified (and just as easily lobbed at human cognition), but it points to a genuine epistemological problem. We can all agree that data contamination invalidates a test result. But how different does a test need to be from anything the model has seen in training for good performance to count as an “interesting” result? Once again, we don’t have much in the way of formal theories to help us answer this question, especially for the more complex constructs people are often interested in (like Theory of Mind).
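For concreteness, one widely used heuristic for flagging contamination is surface n-gram overlap between a test item and candidate training text. Here’s a toy sketch of that idea; the texts are made up, and the point is how little a surface check like this tells us about the deeper kind of similarity the “pattern-matching” critique worries about.

```python
# A toy sketch of a surface-level contamination check: n-gram overlap between a
# test item and a chunk of candidate training text. The texts below are invented.
# A check like this can flag near-verbatim duplication, but it says nothing about
# structural similarity between a test item and things the model has seen before.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-grams (as tuples of lowercased tokens) in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training text."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(training_text, n)) / len(test_grams)

item = "Sally puts her marble in the basket and leaves the room while Anne watches"
corpus_chunk = ("In the classic task, Sally puts her marble in the basket and "
                "leaves the room while Anne watches, and then ...")
print(f"8-gram overlap: {overlap_fraction(item, corpus_chunk):.2f}")
```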
What’s the relationship between samples and populations?
It’s virtually impossible for any study of LLMs to consider every LLM that’s ever been trained—to say nothing of all the LLMs that could’ve been trained but were not. Instead, we rely on a sample. But what’s the relationship between the LLMs in that sample and the broader population of possible models? And if we’re using LLMs as model organisms, to what extent does the sample in question reflect the population of humans we’re interested in?
I wrote about this in much more detail in a recent post, so I won’t say too much about it here. In a nutshell, the problem is this: every LLM reflects a host of constitutive design decisions (architecture, training data, scale, fine-tuning, and so on), but we lack a systematic theory that connects those design factors to the properties and behaviors that emerge after training. This is especially true when it comes to measuring more complicated “capabilities” and characterizing the underlying representations and mechanisms subserving those capabilities, as in the field of mechanistic interpretability.
Fundamentally, we just don’t know what the space of possible LLMs looks like. We also don’t know whether or how the differences across LLMs affect our ability to generalize from one LLM to another.
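To see the problem in miniature, here’s a sketch of the typical workflow: administer the same test to a convenience sample of models and summarize. The model names and scores below are invented, and run_test stands in for whatever evaluation you actually administer; the aggregation is the easy part, while the inference from this sample to “LLMs in general” is exactly where the missing theory would have to do its work.

```python
# A sketch of the sampling problem in practice. The model names and scores are
# placeholders, and `run_test` stands in for whatever evaluation you actually run.

from statistics import mean, stdev

# A convenience sample, not a principled one.
SAMPLE = ["model-a", "model-b", "model-c", "model-d"]

def run_test(model_name: str) -> float:
    """Placeholder: return the model's accuracy on some battery of test items."""
    canned_scores = {"model-a": 0.92, "model-b": 0.55, "model-c": 0.88, "model-d": 0.61}
    return canned_scores[model_name]

scores = [run_test(name) for name in SAMPLE]
print(f"mean accuracy: {mean(scores):.2f}, sd: {stdev(scores):.2f}")
# Nothing here tells us how these four models relate to the space of possible
# LLMs, or whether the variation across them reflects architecture, training
# data, scale, or something else entirely.
```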
The path forward
To be honest, I don’t know how best to address these challenges.
As I mentioned in my external validity post, part of the answer is surely careful descriptive work: it’s pretty hard to construct a typology of butterflies—or languages, for that matter—without knowing lots of facts about the thing you’re studying. That seems at least somewhat tractable, modulo some technical and sociological limitations.
But I think we’ll also need some philosophy of science. The questions I’m raising in this post are epistemological, and as a community, we don’t have a shared set of epistemological norms about how to construct knowledge about LLMs. I’m not sure how we get from here to there, and it doesn’t help that Cognitive Science has its own set of unresolved paradigmatic conflicts about which I’m equally uncertain. It’s possible that the epistemological norms I’m referring to just emerge gradually over time—but even if that’s true, I think it’d be worthwhile to characterize what those norms actually are as they develop.
1. Artificial neural networks, of course, are not new, but neural language models with this kind of expressive power and flexibility very much are.
2. See also this conversation with Ben Riley for some more thoughts on these questions.
3. It’s worth noting that many researchers criticize its use in humans as well! Theory of Mind is just a hard construct to assess.