How do we know (how) humans understand language?
At the end of the day, theories of language understanding in humans all rely on some kind of evidence.
In a recent post, I outlined two opposing views on the question of whether Large Language Models (LLMs) could understand language:
The “duck test” view holds that if a system behaves as if it understands language, then it probably understands language.
The “axiomatic rejection” view holds that behavior alone is insufficient––that true language understanding requires some foundational property or mechanism, such as grounding, compositionality, or a situation model.
As I argued, both views have their merits and deserve to be taken seriously, though (currently) I’m more aligned with the pragmatic approach of the duck testers for the following reason:
In the end, I think scientific debates should––in an ideal world––center around claims that are testable and falsifiable. If “language understanding” is to be a topic of scientific inquiry (and I think it should be) then it needs to be something amenable to empirical investigation. That includes empirical investigation of LLMs. So of the two extremes, I guess I’m ultimately more of a duck tester at heart.
But I also pointed out that a philosophical definition of “understanding” may involve criteria that aren’t easily measurable––they’re properties of a system, not behavior on a task. In that case, I argued:
…if we’re going to raise the possibility of axiomatic features or criteria, I think we need to be very explicit about exactly what would satisfy these criteria. What, exactly, constitutes “grounding”? That’s hard to do, but without clarity on these necessary conditions, skeptics can always move the goalposts. And I think specifying those criteria more precisely helps us develop better theories.
In the next few posts, I want to examine some of these axiomatic criteria more closely.
In particular, I’m going to focus on what we know about whether these criteria apply to human language understanding. After all, human language understanding is––at least implicitly––the bar against which we’re comparing systems like LLMs. So if we’re going to demand that LLMs (or other computational models) exhibit some critical property (grounding, compositionality, etc.) to merit the badge of understanding language, it seems fair to investigate whether humans exhibit that property––and how we’d know.
Do humans understand language?
I think it’s safe to say that––with some rare, often tragic exceptions––adult humans understand language. And of course, most children understand language to some extent too: the entire field of Developmental Linguistics is devoted to charting which aspects of language are understood at which points in development.
We can debate about what understanding is, how it varies across individuals and situations, whether some types of understanding are “deeper” or more “shallow” than others, when a typically developing human understands which kinds of language, which cognitive mechanisms are involved, whether language is uniquely human, and so on.
But human language understanding is, to the best of my knowledge, the prototypical case of language understanding out there, and I think this assumption is so widely shared that it’s rarely even questioned in the first place.
The more interesting question, to me, is how we know this.
How do we know that humans understand language?
This question may seem strange given how foundational the assumption of human language understanding truly is. But I think it’s worth asking: which criteria do we apply when we interrogate this assumption?
If you’re like me, the first things you think about are probably behavioral in nature. We know that someone understands language by the way they use language. They produce responses to linguistic input that indicate they understood that input. Sometimes these responses are linguistic (Q: “Would you like to see a movie tonight?”, A: “I can’t, I have to study for a test tomorrow.”), and sometimes they consist of a facial expression (surprise, amusement, sympathy, etc.) or a gesture (a shrug, etc.). If this person is in a literate society, we might also consider whether they’re able to extract meaning from written text––and the way we typically evaluate this is by how successfully they discuss that text or answer questions about it.
We can extend this a bit further: humans understand at least one language1, but no human understands every language. When addressed in English, I produce behavior that indicates I understood that input––my behavior is in some sense contingent on the input.2 But if someone addresses me in a language I don’t know––i.e., any language but English and perhaps Spanish––then I’m basically flying blind. If I’m very lucky, I can rely on the speaker’s prosody or gestures, as well as the situational context, to deduce roughly what’s being said. But this approach will be much less successful and much less reliable than if I knew the language.
It’s worth pointing out here that this kind of behavioral evaluation is essentially the duck test view, applied to humans. People behave as though they understand language, therefore we assume they probably understand language. As I wrote above: it’s so foundational, so obvious, we rarely think to question it.
Is behavior all there is?
One objection at this point is that there’s a set of background assumptions we bring to the table when interacting with other humans, and it’s in light of these assumptions that we make inferences about their linguistic capacity.
For example, we assume that other people we interact with have conscious experiences––Descartes aside––and further, that they have memories, experiences, goals, and desires. There are many different reasons why we assume these things to be true. But the underlying point is that these assumptions are also part of the calculus when interacting with another person: it’s not just about their behavior––these assumptions are a lens through which we interpret that behavior.
In contrast, we don’t necessarily assume these things to be true about a computer, so the “same” linguistic behavior produced by a computer may not lead to the same inferences and interpretations as if it was produced by a human. (This objection is complicated by the fact that the use of language––perhaps because it’s so strongly associated with humans––makes us more likely to anthropomorphize a system.)
Unpacking this objection.
Note that as I’ve framed it, this objection seems to waver between a descriptive claim and a normative one.
The descriptive claim is: people will not make the same inferences about the linguistic capacities of a computer (or other artificial system) as they would about a human, even if the two produce the same behavior––because of other background assumptions people bring to bear on this process of interpretation. This claim may or may not be true, but it’s empirically testable, and importantly, it’s also distinct from the normative claim, which is: people should make different inferences about the linguistic abilities of a computer because of those background assumptions.
I’m not going to adjudicate here whether either claim is correct. I do think the descriptive claim is interesting (and testable). Of course, if it is empirically verified, that doesn’t entail that the normative claim is true as well. Conversely, empirically disconfirming the descriptive claim doesn’t falsify the normative claim.
For the normative claim to be true, we have to think that “language understanding”––to the extent that this is a clearly demarcated construct––somehow depends crucially on these background assumptions and can’t really be conceptualized independently of them. In my view, that’s a pretty strong claim. It bundles together whatever it means to understand language with the broader category of whatever it means to have goals, desires, memories, experiences, agentivity, and more. Perhaps that’s appropriate: it’s never been clear, after all, whether our understanding of word meanings is more like a dictionary or like an encyclopedia. (Are word meanings atomic and discrete, or are they part of a web of beliefs about the world?)
But either way, the debates I referred to in my last post mostly aren’t about this specific claim.
So what’s actually under debate?
From what I’ve seen in the field, objections to the claim “LLMs understand language” center around specific mechanisms or properties that humans purportedly have, and that LLMs arguably don’t––such as compositionality, grounding, and situation models.
Thus, in this series of posts, I’m going to focus on these axiomatic properties, which LLM skeptics use to demarcate what LLMs do from what humans do.
(It’s worth noting that––at least in my experience––most outsiders to the field don’t usually mention these properties when asked about whether LLMs understand language. Thus, in an odd sense, the question of which axiomatic properties matter is flipped inside and outside the field. To simplify a bit: it seems like insiders care more about concepts like grounding and compositionality, while outsiders care more about concepts like agentivity, sentience, or goals.3)
Why care?
I’m going to be focusing on three properties: grounding, compositionality, and situation models.
Above, I noted that many people don’t necessarily think about these properties when asked about whether LLMs understand language. So why do these properties matter at all?
I think defenders of the Axiomatic Rejection view might say something like the following:
The duck test view is a kind of folk psychological view of understanding, but doesn’t give us insight into what understanding is (and what it’s not). Behavior is only a reflection of what a system is actually doing “under the hood”. As scientists interested in language understanding, we care about how understanding really happens. There are certain core properties or mechanisms that are necessary conditions for human understanding––and so if a system does not have those properties, the behavior it produces is (by definition) just a kind of mimicry, designed to imitate the output but not the process of language comprehension.
Now, if this is one’s view, it matters quite a lot that those properties are indeed necessary components of human language comprehension.
Thus, in the upcoming posts, we’ll consider evidence for and against the claim that humans have these axiomatic properties––all in the hope of informing the ongoing debate about whether and to what extent LLMs can be said to understand language.
1. In fact, most people understand more than one language: http://ilanguages.org/bilingual.php.
2. Or at least I hope it is.
3. Again, this is painting with a broad brush––and I’d be happy to be convinced otherwise.