Should psycholinguists use LLMs as "model organisms"?
Models are useful representations; models are also not the thing itself.
In biology, researchers often rely on “model organisms” to study specific mechanisms or the progression of various diseases, particularly where human-subjects research would be impossible or unethical. For example, in neurophysiology, researchers are interested in the behavior and function of individual neurons. Recording the activity of these individual neurons requires invasive surgery, e.g., for implanting electrodes directly into the tissue of the brain. This kind of practice is prohibited in humans, so instead, researchers conduct their studies on the brains of rats.
The use of model organisms is not without controversy.
The first issue is ethics. Animal rights advocates often object that animals cannot consent to participation, and further, that they are harmed by their participation. Accordingly, researchers who use animal models are urged to limit the number of animals used in their studies, seek replacements (e.g., computational models) where possible, and conduct their studies in such a way so as to limit the suffering or pain involved on the part of the animals. Proponents of animal models argue that the benefits to humans outweigh the costs incurred by animals; opponents disagree, obviously—it’s not a settled matter, and intuition might vary on a case-by-case basis.
The second issue, which is more relevant to the current post, is about external validity. Namely, are model organisms a good model of humans? Skepticism about this is the inspiration behind Twitter accounts like “Just Says In Mice”, which adds caveats to press releases or scientific abstracts specifying the population actually studied. Mice and humans differ in a number of ways; it’s not always clear that the effects of a treatment tested in the former generalize to the latter. The situation gets even more complex when one considers that, historically, research using model organisms has relied predominantly on male animals, which further limits generalizability. (This is starting to change, as the National Institutes of Health now expects NIH-funded research to consider sex as a biological variable.)
Limited generalizability is a big problem. Yet proponents point out (justifiably) that we’ve got to start somewhere, and it’s better to have some data than no data at all. As long as researchers are careful about drawing generalizations, this research can provide valuable insights that motivate further research in humans.
No model organism for language?
Unfortunately, for researchers interested in language, identifying a suitable model organism is much harder.
If you’re studying vision, then you can study the visual system of a related animal. A rat’s visual system won’t be exactly the same as a human’s, but rats at least have a visual system that can be studied.
In contrast, rats do not use a detectable communication code that’s directly analogous to human language. It’s possible, of course, that rat social communication is just as complex and multifaceted as human language, and that we simply lack the tools to characterize it. The history of Linguistics is rife with debates about whether language is unique to humans, as well as which features (if any) make it unique. But a world in which rats use an imperceptible, humanlike communication code is empirically indistinguishable from a world in which rats simply don’t have anything like human language. Thus, for practical purposes, rats can’t be used as model organisms for the study of human language.
That doesn’t mean no insights can be gleaned from the study of animal communication systems. Considerable research has been done on gestural communication between great apes (see my colleague Dr. Federico Rossano’s work on the ritualization of bonobo gestures); other work focuses on birdsong, which shares certain structural and developmental properties with human language (see this review article for more details). This work is fascinating and informs theories of when and how human language and other tools for social coordination evolved.
But these animal models can’t really be used to address many of the questions psycholinguists are interested in. To name a few:
Which factors influence when children learn which words?
Which cues influence how comprehenders interpret a sentence, and how are these cues integrated over time?
How do comprehenders represent the meaning of words with multiple meanings?
Does the language one speaks influence how one sees the world?
To what extent does language comprehension involve the activation of non-linguistic representations, such as sensorimotor “traces” or even representations of an interlocutor’s mental states?
How do language producers translate the message they want to convey into a sequence of motor articulations?
It’s not clear to me how any of these questions could be answered directly by studying non-human animals. Yet these are, in my view, some of the “big questions” that psycholinguists want to answer!
And one reason (among many) we have such difficulty answering them is precisely the lack of an animal model: we’re stuck trying to answer questions of mechanism at the level of observable human behavior. At best, we can study large-scale changes in brain activity: for example, fMRI measures changes in blood oxygenation across brain regions. But that resolution is nowhere close to what we’d need to describe how the activity of individual neurons gives rise to what we think of as cognitive functions like language comprehension.
The promise of “in silico” experimentation
A recent paper published in the Neurobiology of Language suggests an intriguing alternative to animal models: as Large Language Models (LLMs) improve, perhaps we can use them to test theories and generate novel hypotheses about the mechanisms underlying human language acquisition, comprehension, and production.
There are a few ways in which psycholinguists could use LLMs to answer the questions they’re interested in. As I’ve written before, LLMs are useful as “distributional baselines”:
Current LLMs are trained on linguistic input alone; they’ve never felt grass beneath their feet or tossed stones across a stream. They’re also not provided with any “innate” knowledge about how the world or even language works. This makes them particularly well-suited as baselines––a working model of just how much knowledge one could expect to extract from language alone.
This means that LLMs are good for answering questions like: is language “innate”?
Decades of debate haven’t really made much headway, but one view (initially put forward by Noam Chomsky) is that some amount of our linguistic knowledge is hard-wired––that the input we receive as children is simply too impoverished to account for how much we know. LLMs are a great way to put this hypothesis to the test: if LLMs exposed to a developmentally realistic amount of linguistic input do manage to learn things about language––e.g., syntax, some semantics––it suggests that the stimulus isn’t so impoverished after all.
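To make that concrete: one common way of operationalizing “learned things about syntax” is a minimal-pair test, i.e., checking whether a model assigns higher probability to a grammatical sentence than to a minimally different ungrammatical one. Here’s a rough sketch of what that looks like; I’m using off-the-shelf GPT-2 and a subject-verb agreement pair of my own choosing purely for illustration, whereas a genuine poverty-of-the-stimulus test would use a model trained on a child-sized amount of input.

```python
# Minimal-pair sketch: does the model prefer the grammatical sentence?
# (Illustrative only; GPT-2 sees far more data than a child does.)
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log-probabilities of each token given its preceding context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    return log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```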
Now, some people (like Noam Chomsky) disagree that LLMs can be useful models of language or language acquisition. But I think Chomsky’s wrong on this. In my view, LLMs provide a useful empirical test of how much about the structure of language can be learned through statistical regularities alone.
Further, as that recent paper argues, psycholinguists can be even more ambitious than using LLMs as distributional baselines.
An additional benefit of LLMs, after all, is that we have access to their internal behavior. Unlike with the human brain, we can identify what each “neuron” in a given LLM is doing in response to a given input. In principle, this should help us construct mechanistic, low-level theories of how the behavior of individual neurons gives rise to emergent behavior on various linguistic tasks.
To draw another analogy: the reason studying the visual system of a rat is so useful is that we can inspect the behavior of specific neurons in each layer of the rat’s visual system; this gives us a level of precision that we simply don’t have with humans. We can even “knock out” or “lesion” certain neurons, which allows us to identify their causal impact on downstream behavior of the visual system. Proponents of using LLMs as model organisms would argue that they provide similar epistemic leverage.
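To give a flavor of what that could look like in practice, here’s a minimal sketch of an “in silico lesion”: zero out a handful of hidden units in one GPT-2 MLP layer and check whether the model’s next-word prediction changes. The layer, the unit indices, and the prompt are arbitrary choices I made for illustration, not a recommended protocol.

```python
# "Lesion" sketch: silence a few units in one MLP layer via a forward hook
# and compare the model's top next-word prediction before and after.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The doctor told the nurse that the", return_tensors="pt").input_ids

def top_prediction() -> str:
    with torch.no_grad():
        logits = model(ids).logits
    return tokenizer.decode(int(logits[0, -1].argmax()))

print("intact:  ", top_prediction())

def lesion_hook(module, inputs, output):
    output[..., :50] = 0.0  # zero the first 50 units of this block's MLP
    return output

handle = model.transformer.h[6].mlp.c_fc.register_forward_hook(lesion_hook)
print("lesioned:", top_prediction())
handle.remove()  # restore the intact model
```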
Of course, actually doing this work will be extremely challenging; it may even be impossible. But if it’s impossible to do for an LLM, then I struggle to see how we’d ever do it for the human brain. I’m not an optimistic person by nature, and I do think it’s wise to be cautious around the current LLM hype. But I also think that being a practitioner of science requires a commitment to a certain level of optimism, and so I try to be optimistic about the possibility of making genuine, lasting discoveries about the human language faculty. Thus, I prefer to think of LLMs as representing a genuine opportunity to make discoveries about human language.
All models are wrong
There’s an old saying (usually attributed to George Box) that scientists often like to quote:
All models are wrong, but some are useful.
I like this aphorism because it gets at something fundamental about the practice of science. Any given model makes a set of assumptions and is therefore in some sense “wrong”, i.e., it is not a perfect reconstruction of reality. (Indeed, a perfect reconstruction of reality would not be all that useful.)
It also emphasizes the pragmatic nature of model construction. Models are built to help us ask and answer certain questions. Thus, for any given purpose, some models are more useful than others.
It might be helpful to view LLMs through this lens. First, what makes LLMs “wrong”—what are some limitations we should be mindful of? And second, are LLMs nonetheless useful?
What makes LLMs bad models?
I’m certainly not the first person to note the limitations of LLMs, either as scientific models (e.g., Mitchell & Krakauer, 2022) or in terms of their ethical implications (e.g., Bender et al., 2021).
But I’ll do my best to summarize some of the oft-stated limitations here, focusing on issues that limit the conceptual mapping from LLM to humans in particular.
Most LLMs are exposed to very different training data than humans are. Human children do not, by and large, learn language by reading all of Wikipedia.
Relatedly, most LLMs see much more training data than human children do. Human children are exposed to approximately 10M words during their first 10 years of life, whereas GPT-3 was trained on hundreds of billions of tokens.
LLMs also face a different, and in some ways simpler, prediction task than humans. LLMs are trained to predict upcoming tokens (e.g., “the ___”), which sidesteps the hard work of figuring out what the words are in the first place. Humans, in contrast, come into the world with little in the way of phonological knowledge, and certainly no lexical knowledge. When humans process language, they’ve got to recognize the sounds (or signs) they encounter, combine them into recognizable units of language, and use that information to predict or comprehend linguistic input.
Further, the training data LLMs do see is biased. Because LLMs are trained on written text, they’re limited to languages or language varieties with a written form (i.e., they can’t be used for signed languages); they also tend to perform worse for language varieties used by marginalized communities, and over-represent the “voices” you’re most likely to find on easily available online forums. LLMs are also primarily trained on English, which as I’ve written before, is not representative of the world’s languages.
Some LLMs, like GPT-4, undergo further training known as “Reinforcement Learning from Human Feedback” (RLHF). I won’t go into all the details here, but RLHF provides LLMs with more information than training on textual distributions alone, which means that researchers should be cautious about drawing conclusions about the distributional hypothesis from RLHF-trained models, as they might have an unfair advantage. (There’s a separate question of whether RLHF makes models more analogous to humans overall, but regardless, it certainly complicates their use as distributional baselines.)
In my view, these are the main, concrete reasons why LLMs may not be good models.
There are other issues with LLMs, to be sure, and also other differences between LLMs and humans. For example, most LLMs are trained on text alone, whereas humans have experience in a physical, situated environment. However, this difference does not undermine the utility of LLMs as models: if anything, it makes them more useful, since we can assess how that difference affects the downstream behavior of each system in turn.
LLMs aren’t the only (limited) model organism in town.
Before I ask whether LLMs are nonetheless useful, I want to address another issue that I think is sometimes overlooked in these debates.
Namely, psycholinguists already rely heavily on a “model organism” of their own: (mostly English-speaking) university undergraduate students. As I’ve written before, Cognitive Science in general (and psycholinguistics in particular) suffers from a lack of linguistic diversity; further, participants in experimental studies are overwhelmingly English-speaking, and are often undergraduates at four-year universities in Western, industrialized societies.
“WEIRD” subjects are weird in a number of ways. They rely on different norms of cooperation and negotiation than humans in small-scale, hunter-gatherer societies. They may even be more susceptible to certain visual illusions.
Thus, Cognitive Science is facing a bit of an external validity reckoning. Many results we thought applied to “all humans, everywhere for all time” may apply more narrowly to the subjects we’ve empirically studied. This is a problem for the generalizability of our results.
At the same time, this doesn’t entail that research conducted on university undergraduate students is useless. It’s just limited in its universal applicability. The hard problem before us is to establish exactly how limited these results are. What generalizes and what doesn’t?
Are LLMs useful?
LLMs are limited in their use as model organisms, and so are university undergraduate students. I’ve already claimed that university undergraduate students are still useful. What about LLMs?
I think LLMs can still be useful for investigating at least two types of research questions.
First, to what extent can human linguistic behavior be explained as a function of linguistic input alone? This is what I call the distributional baseline question: if human behavior on a psycholinguistic task can be approximated by an LLM, it suggests that the mechanisms responsible for generating human behavior could in principle be the same as those responsible for generating LLM behavior. That is, linguistic input is sufficient to account for human behavior. (Note that sufficiency ≠ necessity.)
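As a concrete illustration of this first approach, here’s a minimal sketch of how a distributional baseline is often computed in practice: extract per-token surprisal (negative log-probability) from a language model, which can then be compared with human reading times on the same words. The choice of GPT-2 and the garden-path sentence below are illustrative assumptions on my part, not a fixed recipe.

```python
# Distributional-baseline sketch: per-token surprisal from GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentence = "The horse raced past the barn fell."  # classic garden-path sentence
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Surprisal (in nats) of each token given its preceding context.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
surprisal = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze()

for token_id, s in zip(ids[0, 1:], surprisal):
    print(f"{tokenizer.decode(int(token_id)):>8}  {s.item():6.2f}")
```

In a real study, these surprisal values would be aligned with word- or region-level reading times (or other behavioral measures), and the fit between model and human data would be assessed statistically.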
Second, what kinds of representations mediate the transformation from input to output, and when are they accessible? Given that an LLM approximates human behavior on some task, we can then “look inside” the LLM and ask about what different layers of that network are doing. For example, at what point in the network is a given semantic distinction (e.g., the difference in meaning between “financial bank” and “river bank”) accessible? Is that before or after other distinctions (e.g., verb vs. noun) become available?
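And here’s a rough sketch of what “looking inside” might involve: compare the hidden-state representation of “bank” in two disambiguating contexts at every layer, and ask where the two senses begin to diverge. The sentences and the single cosine-similarity measure are stand-ins I chose for illustration; a real study would use controlled stimuli and a proper probing analysis.

```python
# Layer-by-layer sketch: where do the two senses of "bank" come apart?
import torch
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def bank_states(sentence):
    """Hidden state of the token ' bank' at every layer (embeddings + 12 blocks)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    bank_pos = ids[0].tolist().index(tokenizer.encode(" bank")[0])
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return [h[0, bank_pos] for h in hidden]

river = bank_states("They sat on the grassy bank of the river.")
money = bank_states("She deposited the check at the bank downtown.")

for layer, (a, b) in enumerate(zip(river, money)):
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity = {sim:.3f}")
```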
The first approach corresponds to questions about what kinds of information or inputs are important for human language acquisition and comprehension. The second approach corresponds to questions about what kinds of representations facilitate language comprehension, and when they are activated.
My claim is that: 1) both research questions are of clear interest to psycholinguists; and 2) LLMs can be used fruitfully to address both types of research questions. To my mind, that makes LLMs useful—even if “wrong”, in some ways—as model organisms.
It’s also very possible that LLMs have broader use as models than I’ve described here. If I’ve missed something you think is important, I’d love to hear it.
Similar proposals have been made elsewhere, e.g., in this recent paper on the “neuro-connectionist research programme”; I’ll dedicate a longer post to the details of this latter paper.