How could we know if Large Language Models understand language?

Two perspectives on a thorny debate.

Oct 18, 2022

Earlier this year, Google fired Blake Lemoine, an engineer who claimed Google’s dialogue model––LaMDA––had developed sentience.

As someone who works adjacent to the field, I’ve been asked a couple times by friends and family what I think about that claim: can a language model have a mind? If so, should we be worried––are we exploiting these models? Could they exploit us? And ultimately, how could we know?

This question is not new, of course. I’m certainly not capable of definitively answering the question of “what it is like” to be a language model, if it is like anything at all.1 And as such, it’s also not the question I’m aiming to address in this essay.

Yet there’s an even simpler question that people outside the field often don’t think to ask, and that people inside the field can’t seem to agree on. The question is: Do language models even understand language?2 And importantly, how would we know?

Brief Note of Clarification

Although I opened this essay with a discussion of sentience, I want to clarify that the arguments presented below are about language understanding, not answering the hard problem of consciousness.

The two are often conflated: arguments about language-understanding machines tend to devolve into questions about whether those machines have something like a conscious experience. And it’s possible the two are, in fact, related––I’m not taking a strong stance on that.

What I am taking a stance on is that if we’re going to make progress on language understanding as a topic of scientific inquiry, then we need to approach it as a concept that’s clearly defined and also clearly demarcated from the question of whether it’s “like” something to be a machine.

The challenge, as you’ll see below, is coming up with a consensus definition of what understanding really means. But the answer, in my view, can’t require sentience as a pre-condition. (As an analogy, it’s much less controversial to view something like syntactic parsing––identifying the parts of speech in a sentence, re-arranging them into syntactic units like noun phrases, verb phrases, etc.––as an in-principle mechanical operation or computation that can be rigorously defined and that, critically, doesn’t require consciousness.)

Large Language Models: A Primer

I’ve written a fair bit about Large Language Models (LLMs) before, so I’ll keep this brief.

First, a language model is, in essence, a probability distribution over sequences of words. For example, a model of the English language should assign a higher probability to “salt and pepper” than to “salt and computer”; similarly, “walk the dog” is a more likely expression than “walk the tree”.

There are many ways to build such a model. The simplest possible language model is what’s called a unigram model, which estimates the probability of each word from how often it appears in a corpus, ignoring context entirely. The hitch is that the same word is differentially likely in different contexts. “The” is the most common word in the English language, but it’s not particularly likely to occur after another “the” (the band excepted). Such a model could be improved by widening the context: a bigram model represents the probability of a particular word, given the previous word; a trigram model widens the context to include the previous two words; and so on.
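To make the unigram/bigram distinction concrete, here is a minimal sketch in Python. The toy corpus and the function names (`unigram_prob`, `bigram_prob`) are invented for illustration, and the estimates are raw relative frequencies with no smoothing.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "please pass the salt and pepper",
    "she added salt and pepper to the soup",
    "i walk the dog every morning",
    "the dog likes to walk",
]
sentences = [sentence.split() for sentence in corpus]

# Unigram model: P(word) is the word's relative frequency in the corpus.
unigram_counts = Counter(tok for sent in sentences for tok in sent)
total_tokens = sum(unigram_counts.values())

def unigram_prob(word):
    return unigram_counts[word] / total_tokens

# Bigram model: P(word | previous word), estimated from adjacent pairs.
bigram_counts = Counter(
    (prev, word) for sent in sentences for prev, word in zip(sent, sent[1:])
)
# How often each word occurs as the first element of a pair.
context_counts = Counter()
for (prev, _word), n in bigram_counts.items():
    context_counts[prev] += n

def bigram_prob(prev, word):
    if context_counts[prev] == 0:
        return 0.0  # context never seen in this corpus
    return bigram_counts[(prev, word)] / context_counts[prev]

print(unigram_prob("the"))             # "the" is frequent overall...
print(bigram_prob("the", "the"))       # ...but never follows itself here
print(bigram_prob("and", "pepper"))    # seen continuation
print(bigram_prob("and", "computer"))  # unseen continuation
```

Because nothing is smoothed, any word pair that never appears in the corpus gets probability zero. Practical n-gram models reserve some probability mass for unseen sequences.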

For a long time, these language models were pretty mediocre. But the application of neural networks to the language modeling task really changed the game. Neural networks represent each word as a vector of real-valued numbers in a continuous space (a word embedding); these word embeddings are learned by giving a neural network a huge (usually written) corpus, and training that network to predict which words co-occur with which other words. Over time, that network tunes its representations of each word to allow for much better predictions.
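The intuition behind word vectors can be sketched without a neural network at all. The example below swaps in raw co-occurrence counts for learned embeddings, so it is not how a neural language model is trained; it only illustrates the idea that words appearing in similar contexts end up with similar vectors. The corpus, the `WINDOW` size, and the `cosine` helper are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "i walk the dog every morning",
    "i walk the puppy every morning",
    "she fed the dog some food",
    "she fed the puppy some food",
    "add salt to the soup",
    "add pepper to the soup",
]

WINDOW = 2  # number of neighbors counted on each side of a word

# cooc[word] is a sparse vector: neighbor -> co-occurrence count.
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        start = max(0, i - WINDOW)
        end = min(len(tokens), i + WINDOW + 1)
        for j in range(start, end):
            if j != i:
                cooc[word][tokens[j]] += 1

def cosine(a, b):
    """Cosine similarity between the co-occurrence vectors of two words."""
    va, vb = cooc[a], cooc[b]
    dot = sum(va[k] * vb[k] for k in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine("dog", "puppy"))  # same contexts, so very similar
print(cosine("dog", "salt"))   # share only "the", so less similar
```

A neural model replaces these sparse counts with dense vectors that it adjusts during training, but the same signal (which words keep which company) is what drives the representations.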

The last few years have seen an explosion in these “neural” language models and their capabilities. There are a few reasons for this:

Computing resources have gotten much more abundant, so we can build bigger, “deeper”3 models.
We also have larger text datasets, so we can train the models on more data.
Researchers developed the transformer architecture, which seems to have led to a qualitative change in performance.

The result of all this progress is large language models (LLMs) like GPT-3 and LaMDA.

These LLMs appear surprisingly capable at generating coherent and meaningful text––a far cry from the state of the art ten or even five years ago. If you haven’t seen them in action, I do recommend taking a look at some examples. The writer Gwern has logged some examples of GPT-3 writing fiction and non-fiction. I also included an example in a recent post. You can also request access to use GPT-3 via OpenAI’s “Playground”.

If you never encountered statistical language models before the recent wave of LLMs, these improvements might be hard to appreciate. Suffice it to say, early bigram and trigram models were nowhere near this level of coherence––they struggled with basic grammar, let alone generating sentences and even entire essays that make sense. Modern LLMs generate text that’s coherent and even context-sensitive; GPT-3, for example, is fairly good at producing poems or essays in the style of a particular writer. Another of Google’s models (PaLM) has made great strides in tasks involving reasoning about causation and events in the world.

But do LLMs understand?

LLMs have clearly come a long way.

But do these advances really constitute understanding? Or are these models just stochastic parrots––regurgitating slight variations on the input they’ve been given and hallucinating “facts” without truly understanding the text they’re producing?

Opinions differ, to say the least.

From what I’ve seen, I think people’s views on the matter tend to fall into one of two categories:

An a priori rejection of the possibility that LLMs understand language, given that they lack certain necessary conditions. Let’s call this the axiomatic rejection view.
A view that this is primarily a question of behavior––if models behave like they understand language, then they understand. Let’s call this the duck test view.

Both these views have surprisingly radical consequences. In the sections below, I describe each view in more detail, then explore what the implications of adopting each perspective might be.

Axiomatic Rejection

One of the clearest expressions of what I take to be the axiomatic rejection view is found in the abstract of Bender & Koller (2020):

In this position paper, we argue that a system trained only on form has a priori no way to learn meaning.

In the paper, the authors argue that meaning, and in particular the notion of communicative intent, is something extrinsic to language itself. Language clearly mediates much of our communication, and as such, linguistic signs are an important component of how humans construct meaning. But because meaning depends on more than simply linguistic form, and because LLMs are trained only on linguistic form4, LLMs cannot understand the meaning of language. At best, they simply learn to rearrange linguistic symbols––the form of language––in ways that resemble meaning.

The philosopher John Searle made a similar argument with his “Chinese Room” thought experiment. The point of the argument is that a system can behave as though it understands language when in reality, it’s simply executing a series of programmed instructions––i.e., “when you receive X input of characters, produce Y output of characters”. And so it goes, argues Searle, for language understanding more generally: without some notion of semantics, without a way to ground our understanding of words in the world and in social interaction, those words have no meaning at all. They’re just lines of ink on a page, or pixels on a screen.

It’d be like relying entirely on a dictionary for your understanding of what words mean; you just end up in an infinite regress, defining each word in terms of other words and so on.

And grounding symbols in the real world isn’t the only criterion people have proposed for true understanding. Other criteria often involve the kinds of cognitive capacities or internal representations humans likely draw on for language understanding. For example, Gary Marcus suggests that compositionality––the ability to build an expression out of its constituent parts and to decompose it back into them––is key. Alternatively (or additionally), perhaps human language understanding requires something like a situation model: a “coherent and non-linguistic mental representation of the ‘state-of-affairs’ described in a text” (Bos et al., 2016).

Where does this view take us?

In the axiomatic rejection view, these criteria are used to support the same underlying point: LLMs are incapable of understanding language, by definition.

Yet this assumption takes us to a surprisingly radical conclusion: if LLMs cannot understand language, then achieving human-like performance on a language comprehension task should actually be taken as a kind of reductio of that task. We’ve already assumed the LLM doesn’t understand––it’s a kind of stochastic parrot, reproducing minor variations on the language statistics it’s ingested––so any metric that suggests it does must therefore be a bad metric.5 This might mean that many tasks we currently use as acceptable measures of human understanding should be discarded if an LLM "solves" those tasks.

A brief caveat.

I’ll discuss this again later, but I want to note here that endorsing the criteria mentioned above is not synonymous with endorsing the axiomatic rejection view.

One could, for example, believe that something like a situation model is necessary for language understanding––but also believe that LLMs could, over time, acquire something like a situation model; further, and most critically, one might believe that an LLM could be studied in such a way as to provide evidence for or against the possibility of it having acquired a situation model.

The Duck Test

The essence of the duck test6 view can be summarized in the following statement:

If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.

Under this view, the question of whether an LLM understands language can be answered by looking at how that LLM behaves. If it behaves like something that understands human language, then it probably understands human language. This view is arguably analogous to what the philosopher Henry Shevlin calls the superficial view––the idea that what matters for attributing some kind of capacity to an agent is its publicly observable behavior.

One of the oldest examples of something like a duck test for machines is, of course, the Turing Test. The basic premise was that if a machine could––through conversation––convince an unknowing human that it was a human, then that machine was an intelligent, thinking being. This test has been implemented in various ways, and many different computer programs have “passed” since its proposal, despite being little more than template-based text generators that exploit our well-documented tendency towards anthropomorphism (e.g., ELIZA).7 Humans see little bits of ourselves in everything around us, and so perhaps it's no surprise that a program equipped with stock phrases ("And how does that make you feel?") might seem to us a kindred spirit.

Because of the inherent subjectivity involved in the Turing Test––and because of our clear tendency to see minds where there may in fact be none––more rigorous tests have since been devised.

The Winograd Schema Challenge, for example, is a test of pronoun resolution that purportedly requires world knowledge. In the sentence below, the pronoun “they” resolves to either “the city councilmen” or “the demonstrators”, depending on whether the verb is feared or advocated.

The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

As it turns out, some of those initial sentences can actually be solved by modern LLMs. However, stimuli involving physical world knowledge, like the ones designed by my colleagues, appear harder to crack.

Nonetheless, LLMs have made incredible progress on language understanding “benchmarks” like SuperGLUE––which might, for our purposes, be considered another kind of duck test.

Where does this take us?

This view also has some pretty radical consequences.

If we decide that passing a test––or perhaps some number of tests––is sufficient to merit “understanding”, then we have to face up to what that means as LLMs pass more and more of the tests we devise.

Perhaps we find it reasonable to accept that an LLM can successfully resolve pronouns with their antecedents; this seems like a relatively benign and almost mechanical skill. But what about when LLMs start to pass the False Belief Task, a hallmark test for Theory of Mind? Recent work I’ve done with my colleagues at UCSD suggests that GPT-3 is, in fact, getting closer and closer to human-level performance on the False Belief Task––does that mean that GPT-3 has Theory of Mind? And if so, what does that mean? Does GPT-3 have beliefs, as well as beliefs about the beliefs of others?

This view also raises some really hard questions. How many tasks must GPT-3 excel at before we’re willing to grant it a particular capacity (e.g., Theory of Mind)? Is a single task sufficient? What about five? And what constitutes “passing” here––performance equal to or greater than average human performance? These questions can and should be answered, but I also suspect that many answers will feel more than a little arbitrary and ad hoc.

And the above complications don’t even touch on the deeper philosophical question of what a benchmark is. We rely on tasks as a proxy to measure some latent construct––something we assume is both “real” and impossible to observe directly. These tasks stand in for that construct but they’re also not, of course, equivalent to it; nothing but the thing itself is the thing itself. As Raji et al. (2021) argue, we have to consider the possibility that these tasks are not valid operationalizations of the constructs we’re interested in measuring.

Resisting equivocation

I hope that, up to this point, it hasn’t been obvious which view (if either) I endorse––I’ve tried my best to be fair to each.

I considered ending this essay by equivocating. “Both sides have their merits”, “there are advantages and disadvantages of each”, and so on. But I’m not sure that’s particularly useful, and I’m also not sure it’s particularly true.

In the end, I think scientific debates should––in an ideal world––center around claims that are testable and falsifiable. If “language understanding” is to be a topic of scientific inquiry (and I think it should be) then it needs to be something amenable to empirical investigation. That includes empirical investigation of LLMs. So of the two extremes, I guess I’m ultimately more of a duck tester at heart.

Now, that doesn’t mean we throw theory out the window. Rather, we should let theories about language understanding––what it requires, what it looks like––guide our empirical investigations. For example, if we think situation models are a critical component of language understanding, then we should design studies that investigate whether and to what extent LLMs display something like a situation model (like this one).

I think we also need to be open to the possibility that “language understanding” is not an all-or-none phenomenon. Perhaps it comes in multiple flavors. An LLM might display some properties of language understanding but not others. This means we should exercise caution in both directions––and critically, the more empirical tests we have of different “dimensions” of understanding, the better we can characterize the affordances and limitations of these models.

Below, I consider two of the objections to this perspective that I find most compelling.

Possible objections

Some criteria aren’t amenable to empirical investigation.

Perhaps some criteria aren’t something we can “test” for. They are design features of a system, rather than observable behavior. For example, one might object that grounding refers to a property of a system’s design as opposed to how it behaves. Perhaps a “grounded model” (like this recent work by Google) and an “ungrounded model” (like GPT-3, arguably) even display indistinguishable behavior, but we want to be able to say that the grounded model is capable of a deeper kind of understanding.8

I think this is a fair objection. But if we’re going to raise the possibility of axiomatic features or criteria, I think we need to be very explicit about exactly what would satisfy these criteria. What, exactly, constitutes “grounding”? That’s hard to do, but without clarity on these necessary conditions, skeptics can always move the goalposts. And I think specifying those criteria more precisely helps us develop better theories.

This is something that’s always frustrated me about Searle’s Chinese room argument: it’s unclear what mechanisms we could point to that would allow us to say the system does understand.9 And in contrast, that's something I appreciate about Bender & Koller (2020): the paper gives some explicit examples of what would satisfy (or at least go some way towards satisfying) the grounding criterion.

Benchmarks don’t tell the whole story.

As Raji et al. (2021) argue, performance on some task is not equivalent to having a capability that task is intended to measure. In particular, they express concern about the construct validity of the tasks we use in AI and machine learning:

[Construct validity] concerns how well designed — or rather, how well constructed — the experimental setting is in relation to the research claim.

Too often, they argue, researchers in NLP (and beyond) rely on benchmarks to make claims about the general-purpose capabilities of systems (like LLMs) that simply aren’t warranted by the limited scope and representativeness of those tasks. They don’t dismiss benchmarks altogether, but suggest that we should view them as helpful probes into the mechanics of a system:

The effective development of benchmarks is critical to progress in machine learning, but what makes a benchmark effective is not the strength of its arbitrary and false claim to “generality” but its effectiveness in how it helps us understand as researchers how certain systems work— and how they don’t.

I basically agree with this objection. I think any hard-core duck testers need to take it very seriously. If our primary method for adjudicating “understanding” is observable behavior on various tasks, then the effectiveness and representativeness of those tasks are of central importance.

There’s the Clever Hans problem, for one: perhaps our tasks can be “solved” through simpler methods than the capabilities they’re designed to test. There are already examples of this in action: a dataset designed to test “natural language inference” actually contained lexical confounds in its stimuli, which BERT (another LLM) used to “hack” its way to successful performance. Once the authors removed these confounds, BERT was at chance.
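One way to look for this kind of lexical confound is to ask how well a label can be predicted from a single cue word, with no understanding of the sentence at all. The sketch below is only an illustration of that idea and is not the analysis from the study mentioned above; the examples, labels, and helper names (`cue_stats`, `cue_only_accuracy`) are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy examples, invented for illustration: (text, label).
examples = [
    ("the man is not sleeping", "contradiction"),
    ("the dog is not outside", "contradiction"),
    ("the woman is not singing", "contradiction"),
    ("the child is not running", "contradiction"),
    ("the man is resting", "entailment"),
    ("the dog is outside", "entailment"),
    ("the woman is performing", "entailment"),
    ("the child is sleeping", "contradiction"),
]

def cue_stats(examples, min_count=3):
    """For each word, report how often it appears and how well it predicts a label."""
    labels_by_word = defaultdict(Counter)
    for text, label in examples:
        for word in set(text.split()):
            labels_by_word[word][label] += 1
    stats = {}
    for word, counts in labels_by_word.items():
        n = sum(counts.values())
        if n < min_count:
            continue  # too rare to be a useful shortcut
        top_label, top_n = counts.most_common(1)[0]
        stats[word] = {
            "coverage": n / len(examples),  # share of examples containing the word
            "purity": top_n / n,            # share of those with the majority label
            "label": top_label,
        }
    return stats

def cue_only_accuracy(examples, cue, cue_label, default_label):
    """Accuracy of a 'model' that looks at nothing except one cue word."""
    correct = 0
    for text, label in examples:
        guess = cue_label if cue in text.split() else default_label
        correct += guess == label
    return correct / len(examples)

stats = cue_stats(examples)
best_cue = max(stats, key=lambda w: stats[w]["purity"])
accuracy = cue_only_accuracy(examples, best_cue, stats[best_cue]["label"], "entailment")
print(best_cue, stats[best_cue], accuracy)
```

If a one-word rule like this scores well above chance, high benchmark accuracy on that dataset is weak evidence of the capability the benchmark was meant to measure.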

And the Clever Hans problem is perhaps just the most extreme example of a continuum of problems concerning construct validity. There is not simply one thing called “understanding” that we can directly measure; we must rely on various theoretical definitions of understanding, along with tasks designed to measure behavior we think would be consistent or inconsistent with understanding. But there’s always a potential gap between what we’re actually measuring and what we think we’re measuring.

This brings me back to something I mentioned above, and which I think is very much in agreement with the authors here. I think we should adopt a pluralistic view of understanding––an admission that it’s a multi-faceted construct––and accordingly, design many different kinds of empirical tests of understanding so that we can conduct a more thorough accounting of where a given system falls short.

Looking forward

If recent advances are any indication, LLMs are probably going to get bigger in the years to come.

There are various ethical and societal reasons why one might object to this development––this paper articulates those objections in much more detail than I can here.

However, the impending growth of these LLMs also has important implications for debates about whether or not they “understand” language, especially since increases in LLM size are generally (though not always) correlated with improvements on various tasks, including language-related tasks. In this essay, I’ve described two opposing views one might take on this question, as well as my own. Moving forward, I think both perspectives would benefit from making some more explicit commitments as to what would constitute evidence for or against understanding. This could take the form of empirical benchmarks (e.g., observable behavior) or explicitly defined criteria (e.g., design features).

As I noted above, this will be hard, but it’s well worth it, and our theories will become better defined in the process.

1. For the record, I’ve yet to meet someone in the field who believes current language models are sentient in any way. And many researchers, like Gary Marcus, have argued that such a claim is essentially nonsense.

2. I think it’s quite important to separate the question of language understanding from the hard problem of consciousness. The two are conflated occasionally (which I’ll discuss in another post), but in my view, we’re not going to get any traction on adjudicating whether and to what extent a model “understands” language if we think of understanding as requiring a conscious, subjective experience.

3. Deeper = more layers.

4. It’s important to note that the authors allow for the possibility of multi-modal models or models trained in social interaction as having the capacity for partial understanding.

5. I do think there’s a weaker and more defensible version of this claim, which I’ve written about before. Specifically, an LLM’s performance on a comprehension task reflects the extent to which that task can be “solved” using distributional statistics. Of course, this is almost a tautological claim: an LLM has access to linguistic input alone, so any behavior it produces is by definition behavior that can be produced using linguistic input alone.

6. Somewhat ironically, as the Wikipedia article discusses, this phrase seems to date back at least to the mechanical duck invented by Jacques de Vaucanson. The mechanical duck exhibited certain duck-like behavior––it quacked, moved its head, and even excreted a mixture resembling duck droppings––but was, of course, only a mechanical being in the end.

7. I assume some readers are now thinking to themselves: couldn’t humans be described this way?

8. Note that I’m not necessarily endorsing this objection or point of view here.

9. In fact, I think mechanistic descriptions of language understanding in some system will often generate skepticism that the system understands, precisely because we conflate understanding with conscious experience. But that’s a topic for a different post.

The Counterfactual
