Humans, LLMs, and the symbol grounding problem (pt. 2)
Evidence for knowledge in the absence of sensorimotor input.
In this series of posts, I’m exploring the debate around whether Large Language Models (LLMs), such as GPT-3, can be truly said to understand language.
For background, check out:
Does understanding language require sensorimotor experience in the world?
This is a major research question in psycholinguistics; it’s also, as I argued previously, an important question when it comes to the debate around whether Large Language Models (LLMs) understand language. If humanlike language understanding crucially depends on embodiment, it suggests that most LLMs––which are trained on text input alone––are incapable of understanding language, at least in the same way that humans do. If, however, language understanding does not depend on embodied simulation, it leaves the door open for LLMs.
My last post considered the evidence in favor of sensorimotor simulation in humans. I argued that although there’s considerable evidence suggesting embodied simulation happens, there’s less convincing evidence that it’s necessary.
But even if the evidence is weak, it doesn’t disconfirm the embodied simulation hypothesis––it just leaves us with our priors. Is there any evidence against the claim that language understanding requires sensorimotor experience? And what would that look like?
Why it’s harder to build a negative case
Developing a case against embodied simulation is a little different than demonstrating that it happens or that it’s necessary, and there are a few challenges involved.
One challenge is that it’s hard to think of a clear alternative to what “meaning” is, if not some kind of connection to sensorimotor experience or real-world references. At the end of the day, I think the symbol grounding problem taps into a deep intuition many people share: the meaning of a symbol can’t just be other symbols. So what’s the alternative explanation, if we rule out embodied simulation?
The second challenge follows directly from the first. Because it’s hard to clearly state an alternative to grounding, it’s also hard to devise experiments to demonstrate evidence in favor of this alternative. Of course, there are plenty of experiments that fail to find evidence in support of grounding––but as I’ve already noted, an absence of evidence does not entail evidence of absence. Typically in scientific practice, a hypothesis of interest (e.g., embodied simulation) is contrasted with some null hypothesis; evidence for that hypothesis of interest often comes in the form of some non-zero effect in an experiment where the null hypothesis would predict an effect of zero––thus allowing us to “reject” the null hypothesis. As far as I know, there’s a dearth of work that takes grounding as the null hypothesis, and some alternative to grounding as the hypothesis of interest.1
These two challenges point to a third problem, which is that there’s simply a less coherent evidentiary basis against grounding. There are over twenty years of research investigating the embodied simulation hypothesis (and at least one book-length treatment); this research involves studies that build on past studies, resulting in a relatively coherent picture of when simulation happens and whether or not it’s necessary. It’s harder to point to an equivalent sub-field building an explicit case against grounding in favor of some alternative hypothesis. This means that the case against grounding is less a connected body of research and more a loose collection of observations.
The case against grounding
In this section, I lay out a few observations that complicate the pro-grounding position––primarily by way of pointing to conceptual knowledge with no apparent sensorimotor correlate. In the section below, I’ll then respond to these observations with some of the solutions that various theorists have proposed.
Observation 1: Abstract concepts
If an alien somehow managed to read the academic research on embodied simulation, they might conclude that human language consists primarily of sentences like:
Bob kicked you the ball.
Or:
The eagle is in the sky.
But as Gary Lupyan and Bodo Winter argue in this 2018 article, humans often talk about more abstract concepts––things that don’t refer to objects or events we can directly perceive. Oft-cited examples of abstract concepts are lofty ideals like “freedom” or “justice”, but equally nebulous are words like “fun” or “chance”. What, exactly, is a “chance” and what would be involved in a sensorimotor simulation of its meaning?2
Things get even harder to pin down when we turn to verbs. What would it mean to ground words like “imagine” or “agree”? By the time we arrive at adjectives (“normal”, “irrelevant”) and adverbs (“especially”, “maybe”), it’s very unclear what sensorimotor or real-world referents we could possibly be simulating.
And importantly, we talk about abstract concepts all the time. The authors write:
suppose we select a random noun, verb or adjective weighed by its frequency…we discover that we have a 59% chance of selecting a word that is above the median level of abstractness (M = 2.15). Example words in this part of the concrete/abstract distribution are extrovert, uncomfortable, innovating, immodest and flamboyant.
And:
How many words before encountering a word at least as abstract as words like freedom, idea and fun? … Given an utterance of only five words, there is a 73% chance of coming across a word that is as abstract as idea and 95% chance of coming across a word that is as abstract as freedom.
Given that abstract concepts are by definition difficult to perceive or imagine, it’s unclear how we’d simulate them during language comprehension. Thus, their existence––and their frequency of use––poses a challenge to the claim that comprehension depends on simulation: how, then, could we ever hope to understand words like “freedom” or “idea”?
Observation 2: Visual knowledge in congenitally blind individuals
According to philosophers like John Locke, congenitally blind individuals are fundamentally incapable of understanding visual concepts like color. This position is echoed in the “strong embodiment” position: if our understanding of “that car is red” depends on a sensorimotor simulation of the percept red, then someone who’s never experienced color should lack some fundamental understanding of the sentence.
Yet research dating back to the 1980s suggests that blind individuals do have some understanding of color concepts. There are multiple studies demonstrating that although blind individuals obviously cannot perceive color, their similarity judgments about color––which colors are most similar and most different––are often quite correlated with judgments made by sighted individuals. This is not to say there are no differences––judgments made by blind individuals tend to be more variable, for example––but it does seem as though their underlying representations of the color space are quite similar to those of sighted individuals.
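To make the logic of these studies concrete, here is a minimal sketch in Python of the kind of analysis involved, using invented ratings rather than data from any actual study: each group rates how similar various pairs of colors are, and we then correlate the two groups’ ratings.

```python
# Sketch of the analysis behind "blind and sighted similarity judgments are
# correlated." The ratings below are invented for illustration.
import numpy as np

color_pairs = [("red", "orange"), ("red", "blue"), ("yellow", "green"),
               ("blue", "purple"), ("green", "blue"), ("yellow", "purple")]

# Hypothetical mean similarity ratings (1 = very different, 7 = very similar)
sighted_ratings = np.array([6.2, 1.8, 4.5, 5.9, 4.1, 1.5])
blind_ratings = np.array([5.8, 2.3, 4.0, 5.2, 3.6, 2.1])

r = np.corrcoef(sighted_ratings, blind_ratings)[0, 1]
print(f"correlation between groups: r = {r:.2f}")  # high r = similar color spaces
```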
More recent work by Judy Kim and others expands upon these early insights. In a 2021 study, they asked both blind and sighted participants two questions about various kinds of objects––including natural kinds (e.g., “strawberries” or “banana”), artifacts in which color plays a non-functional role (e.g., “pants” or “book”), and artifacts for which color plays a functional or identifying role (e.g., “fire truck” or “stop sign”). The questions were:
What is a common color of a _____?
If you picked two ______(s) at random, how likely are they to be the same color?
For the first question, sighted participants displayed higher within-group agreement in general: that is, sighted individuals were more likely to agree with one another when responding to questions about the color of particular objects.
Yet for the second question, there was no difference in performance between blind and sighted participants. Both groups agreed that natural kinds and certain artifacts (e.g., “fire truck”) are more likely to have consistent color than other artifacts in which color plays a non-functional role (e.g., “pants”). When asked, both groups also provided similar explanations as to why this was:
For natural kinds, both blind and sighted appeal to an object’s intrinsic nature (e.g., “that’s just how it is,” “that’s nature”) or describe processes such as photosynthesis, growth, or evolution. For artifacts, participants consistently cite individuals’ or groups of people’s needs and intentions (e.g., culture, aesthetic preference, visibility)
Further, even when blind and sighted individuals disagreed with each other on the specific color that an object or animal had, blind individuals generated coherent explanations for the origins of that color:
For example, while both groups’ explanations for the color of polar bears mention their arctic habitat, almost all sighted participants explain that their white fur allows camouflage in the snow while some blind participants explain that polar bears are black to absorb heat in the cold. (Polar bears indeed have black skin underneath their white fur, and these features are thought to have evolved for camouflage and heat absorption)
The takeaway from this work is that despite not having access to particular phenomenological experiences, blind individuals display a coherent conceptual understanding of color––and importantly, one that is often correlated with sighted individuals’ understanding.
The authors of that study have also studied other domains, such as animal appearance (e.g., “scaly” or “furry”) and vision verbs (e.g., “glimmer” or “flicker”), with qualitatively similar results. In many cases, blind and sighted individuals produce indistinguishable responses. Thus, contrary to the claims of John Locke and other Empiricist philosophers, it seems as though one can learn and understand visual concepts without direct visual experience.
Observation 3: Individuals without mental imagery
The final piece of evidence is the existence of individuals with aphantasia. Aphantasia is defined as the inability to voluntarily create mental images in one’s mind. Although the condition was first described at least as early as the 1880s, it remained relatively unstudied until quite recently. It’s also just very hard to study, since it concerns subjective mental experience––something that, by definition, other people can’t really access.
As far as I know, there hasn’t been much work linking aphantasia and embodiment specifically. It’s unclear whether individuals with aphantasia display the same embodied simulation effects I’ve described in a previous post. If they don’t, it’s yet another complication for the strong embodiment view: people with aphantasia demonstrably understand language––so if they’re not simulating the world, how is this understanding achieved?
Rejoinders and alternative accounts
How, then, might we learn about the world and understand language about it––if not via sensorimotor experience and sensorimotor simulation?
There are a few responses to these questions, including one that simply refuses to accept the premise.
Account 1: Conceptual metaphor theory
As I observed above, a central challenge for embodied theories of language understanding is abstract concepts.
Proponents of grounding sometimes argue that these abstract concepts are grounded through systematic metaphors. Just as we often describe concepts like TIME using language imported from other domains like SPACE (“Christmas is fast approaching”), we repurpose the neural systems dedicated to spatial processing to understand time. Similarly, just as we describe PITCH in terms of HEIGHT (a “high” or “low” pitch), our conceptual representation of pitch is grounded in our understanding of heights.
I think this is a really intriguing and powerful account. It’s conceptually elegant; metaphor is indeed pervasive in language; and there’s also considerable experimental evidence backing it up. Much of this work has been done in the domain of time/space metaphors (e.g., Boroditsky, 2000; Hendricks & Boroditsky, 2017), with some fascinating applications to cross-cultural differences––e.g., different spatial conceptions of time as a function of writing direction.
There’s also been work comparing different metaphors for pitch: although some languages (such as Dutch and English) describe pitch using terms like “high” or “low”, other languages (such as Farsi) describe it using terms like “thin” or “thick”; accordingly, there’s evidence that speakers of these languages also think about pitch differently.
Overall, I think conceptual metaphor theory is a promising and probably underrated account. But at the same time, there are clearly explanatory gaps that remain. For example, it’s not clear how conceptual metaphor theory can account for visual knowledge in blind individuals––at least based on the evidence presented thus far, it doesn’t seem like blind individuals are “grounding” their knowledge of color in other domains they do have sensorimotor experience of. And even among abstract domains, there are many words and concepts that don’t fit neatly into the theory––like “fun” or “chance”.
Account 2: language as a source of knowledge
A second response, which builds on work by researchers like Gary Lupyan and Bodo Winter, holds that language itself is a rich source of knowledge about the world. They suggest a few mechanisms by which this might occur:
(i) Language as a source of propositions
(ii) Language as a categorical cue
(iii) Language statistics as knowledge
The first mechanism (“language as a source of propositions”) is not particularly controversial. A core use of language is communicating information about the world, such as:
(1) relatively specific facts, e.g. that the mayor of Talkeetna, Alaska from 1997 to 2017, was a cat named Stubbs, (2) facts that help guide action, e.g. that sticking a fork in an electric outlet is a bad idea, and (3) more abstract knowledge, e.g. that a year is 365 days, that an even number is divisible exactly by two, and so on.
The second mechanism (“language as a categorical cue”) is more disputed, and connects to the debate around linguistic relativity. But if you accept that: 1) words act as labels of the world around us; and 2) there are multiple ways to carve up that world (e.g., different languages carve up the color space in different ways); then I think it’s clear that words––and language more generally––present a kind of map of the world, offering a particular category system with particular contours of abstraction.
And crucially, the third mechanism (“language statistics as knowledge”) taps directly into the debate around whether LLMs understand language. LLMs, after all, work by exploiting statistical regularities in which words co-occur with other words. According to distributional semantics, words with similar meanings will appear in similar contexts. The linguist JR Firth famously expressed this intuition with the pithy dictum:
You shall know a word by the company it keeps.
The remarkable success of LLMs in producing coherent text serves as a kind of proof-of-concept of this dictum. At the very least, it’s clear that you can know some aspects of a word by the company it keeps––and LLMs exploit this to the fullest. GPT-3 has clearly never seen a table, but it can answer questions about the purposes that different tables serve; that sure looks like evidence that it understands something about the table concept.
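To illustrate the basic intuition behind distributional semantics, here is a minimal sketch. The toy corpus, window size, and counting scheme are my own illustrative choices, not any particular model; the point is just that words which keep similar company end up with similar co-occurrence vectors.

```python
# Toy demonstration of distributional semantics: build co-occurrence vectors
# from a tiny corpus, then compare words with cosine similarity.
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "freedom is an abstract idea",
    "justice is an abstract idea",
]

window = 2  # count co-occurrences within +/- 2 words
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[word][tokens[j]] += 1

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Words used in similar company ("cat"/"dog", "freedom"/"justice") score higher
# than words drawn from unrelated contexts.
print(cosine(cooc["cat"], cooc["dog"]))
print(cosine(cooc["freedom"], cooc["justice"]))
print(cosine(cooc["cat"], cooc["freedom"]))
```

Real LLMs learn dense, contextual representations rather than raw counts, but the underlying principle is the same: meaning is inferred from the statistics of word co-occurrence.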
Account 3: inferential, causal models of the world
Another perspective has been put forward by Judy Kim, Marina Bedny, and others.
These authors agree that language itself is a valuable source of information about the world. But they suggest that humans also rely on rich causal or taxonomic models of the world, which we use to transform the information that’s present in language into a more structured representation––the kind that can be used to answer questions like, “How likely are two bananas to have the same color?”
This view also helps distinguish what it is that humans know about visual concepts (and how we represent it) from what an LLM could be said to “know”. The authors write:
Evidence from people who are blind highlights the differences between how people and current text analysis algorithms learn about appearance through language. Unlike such algorithms, people incorporate linguistic information into causal intuitive theories through inference.
Under this view, then, the kind of statistical regularities exploited by an LLM are only part of the story––the true potential of this information is only unlocked when combined with the inferential powers of the human brain, which, according to this theory, includes rich causal and taxonomic representations of the world.
Also, this should hopefully go without saying, but the point here is not that LLMs and blind individuals are somehow comparable or that they even need to be distinguished in the first place (Emily Bender has a longer, more in-depth essay articulating why this analogy is not a good one). The whole point is that blind individuals do display evidence of visual knowledge––knowledge that’s comparable in many cases with that of sighted individuals––and that this in turn calls into question the strict necessity of direct visual experience for understanding language about visual concepts.
Account 4: “hybrid” models
In many entrenched psychological debates, the truth lies somewhere in between the strong positions that various adherents take. That’s basically the position of so-called “hybrid” models, which argue that our representation of different concepts involves multiple input streams––sensorimotor experience, distributional regularities in language use, and perhaps even inferential, causal models of the world.
On the one hand, perhaps this insight seems so obvious it’s hardly worth pointing out.
But like any good theory, hybrid models don’t just make indiscriminate predictions. According to these models, different concepts involve different weighting of these sources of knowledge. Specifically, concrete concepts may rely more on perceptual experience, while abstract concepts may rely more on distributional statistics.
And in fact, there are convergent sources of evidence pointing to exactly this conclusion. For example, a 2018 paper gave a model access to both a large text corpus (i.e., the kind of data an LLM is trained on) and a huge collection of Google Images; the model was then allowed to flexibly pull from each source of information when predicting human similarity judgments. Crucially, the model did best at predicting concrete words when it drew primarily from the image data, and best at predicting abstract words when it drew primarily from the text data. Other papers have implemented similar approaches with similar results.
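To give a flavor of the hybrid-model logic, here is a sketch with simulated data (this is not the actual method from that 2018 paper): we fit a single weight that determines how much to lean on image-based versus text-based similarity when predicting human judgments, and check whether the best weight differs for concrete and abstract words.

```python
# Sketch of the hybrid-model logic: predicted similarity is a weighted blend of
# text-derived and image-derived similarity, and the best weight can differ for
# concrete vs. abstract words. All data below are simulated for illustration.
import numpy as np

def blended_similarity(text_sim, image_sim, w):
    """Weighted blend of two similarity estimates (w = weight on image data)."""
    return w * image_sim + (1 - w) * text_sim

def best_image_weight(text_sim, image_sim, human_sim):
    """Grid-search the image weight that best correlates with human judgments."""
    weights = np.linspace(0, 1, 101)
    scores = [np.corrcoef(blended_similarity(text_sim, image_sim, w), human_sim)[0, 1]
              for w in weights]
    return float(weights[int(np.argmax(scores))])

rng = np.random.default_rng(0)
image_sim = rng.random(50)  # image-based similarity for 50 hypothetical word pairs
text_sim = rng.random(50)   # text-based similarity for the same pairs

# Simulated human judgments: concrete pairs track image similarity,
# abstract pairs track text similarity.
human_concrete = image_sim + 0.1 * rng.standard_normal(50)
human_abstract = text_sim + 0.1 * rng.standard_normal(50)

print("best image weight (concrete):", best_image_weight(text_sim, image_sim, human_concrete))  # near 1
print("best image weight (abstract):", best_image_weight(text_sim, image_sim, human_abstract))  # near 0
```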
One reason I like these models is that they add nuance to a debate that’s in danger of taking on overly entrenched positions––i.e., Grounding vs. No Grounding. They suggest that grounding is not an all-or-none phenomenon, and that it might be differentially useful for different concepts.
Multimodal models: is this debate irrelevant?
Within the field of Natural Language Processing, there is a growing appreciation of the importance of grounding––as well as the nuance involved in what one means by “grounding”. More and more researchers are training language models with access to some kind of non-linguistic input, such as image or video data; some of this work even hooks models up to simulated physics environments.
For the time being, this research is still limited in that the kind of “grounding” involved is typically visual, partly because we have tons and tons of image data. That’s limiting because other senses––such as touch, taste, and smell––are really important to human experience, and also because (in my view) a big part of our knowledge of the world comes from taking action in it, not merely from being a passive observer (which is why the work using simulated physics engines is particularly interesting). Additionally, most of the LLMs getting the most attention right now––such as GPT-3 and its variants––are still not grounded.
However, I do think this is changing, and that the next few years will see more and more models trained with multimodal input. Will that render this debate entirely irrelevant?
One answer is that of course it will. If LLMs are grounded, perhaps it no longer matters whether grounding is necessary for human language understanding––at the very least, it’s no longer something proponents of the “Axiomatic Rejection” view can point to as a line in the sand separating humans from LLMs (or “LLM+” models, as David Chalmers has called them).
Yet as I noted above, I think the debate over grounding in humans is more complex than simply whether grounding is necessary. In my view, neither language comprehension nor grounding are “all-or-none”. Accordingly, it’s possible that some degree (or kind) of comprehension is possible with some degree (or kind) of grounding, and that some other degree (or kind) of comprehension would require a different degree (or kind) of grounding.
One way I think about the utility of multimodal information is that it offers a kind of stability to our representations. If you have only one source of information about a particular stimulus, you’re more prone to catastrophic failure––i.e., if that input stream fails in some way, your representation of that stimulus is completely distorted. Multiple sources of information about the world offer redundancy, which could prevent that kind of catastrophic failure. Moreover, the information content of different modalities is inter-correlated, which means we can sometimes fill in the gaps of one input using information from another. Ultimately, this stability may go a long way towards addressing the fragility of current language models––and, for that matter, many image recognition models––which often “break” under slight perturbations in the stimulus.
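Here is a toy simulation of that redundancy argument (my own illustration, not drawn from any of the work discussed above): when one input stream is corrupted, an estimate based on that stream alone fails badly, while an estimate fused across two correlated streams degrades much more gracefully.

```python
# Toy illustration of redundancy across correlated input streams: corrupting
# one modality distorts a single-stream estimate far more than a fused one.
import numpy as np

rng = np.random.default_rng(1)

true_feature = rng.standard_normal(1000)                      # the "real" property of a stimulus
visual = true_feature + 0.2 * rng.standard_normal(1000)       # correlated modality 1
linguistic = true_feature + 0.2 * rng.standard_normal(1000)   # correlated modality 2

# Simulate a failure in the visual stream: scramble 20% of its signal.
corrupted_visual = visual.copy()
broken = rng.random(1000) < 0.2
corrupted_visual[broken] = 3 * rng.standard_normal(broken.sum())

def mse(estimate):
    """Mean squared error relative to the true feature."""
    return float(np.mean((estimate - true_feature) ** 2))

print("vision only, corrupted: ", mse(corrupted_visual))
print("fused (average of both):", mse((corrupted_visual + linguistic) / 2))
```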
Under this view, the question is not whether ungrounded LLMs understand language; some understanding––the kind afforded by exploiting large-scale statistical regularities––is possible. The more relevant question is about whether and to what extent that understanding is robust.
1. A big exception is work by Marina Bedny on visual knowledge among blind individuals, which I discuss below.

2. Pro-embodiment readers might reply that they imagine a pair of dice, or perhaps Chance the Rapper. I’ll admit that it’s hard to define exactly what makes something abstract, and that our brains are prone to trying to ground these concepts in something––even if we end up simulating very different things. As I argue later, it’s possible this is one of those “differences in degree” situations: abstract concepts are much more diffuse in the sensorimotor experiences they recruit, and also perhaps rely more heavily on distributional cues like associated words.