Does GPT-3 read between the lines?
Some (but not all?) of these results make sense.
Humans are quite good at reading between the lines.
In fact, this ability might form the bedrock of everyday communication. According to scholars like Paul Grice, speakers and listeners follow a set of rules, or maxims, about how a conversation is supposed to go. When speakers say something vague or ambiguous, listeners can use these maxims to infer what the speaker probably meant, given what they said. Hence: listeners read between the lines, going beyond the literal meaning of what they hear.
Some, but not all.
The field of Pragmatics abounds with examples. But one particularly classic case is scalar implicature. Suppose I tell you: “Some of my students passed the test”. Do I mean all of my students passed the test? Or is it more likely that more than zero––but not all––passed?
Most adult humans will infer the latter. Even though “some” could technically mean all[1], we infer that in this case, the speaker probably meant some, but not all.
One explanation for this could be those Gricean maxims: specifically, the maxim of “Quantity” claims that speakers should be as informative as is required, but no more. If a speaker wanted to communicate that all students had passed the test, they would’ve just said “all”. Because they used a more ambiguous term, “some”, we infer that it’s less likely that they meant all. This is the explanation that underlies computational models of pragmatics, such as the Rational Speech Act (RSA) model: by assuming that speakers are “rationally” informative, listeners are able to infer more from what they said than what their words literally mean.
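To make that reasoning concrete, here is a minimal sketch of the basic RSA recursion for the “some students passed” example. It is an illustration under assumptions of my own, not the model from any particular paper: a toy class of 3 students, a uniform prior over how many passed, only three possible utterances, and a rationality parameter `alpha` of 1.

```python
# Toy Rational Speech Act (RSA) model for "some" vs. "all".
# Assumptions (not from the post): 3 students, uniform prior, 3 utterances.

STATES = [0, 1, 2, 3]                  # how many of the 3 students passed
UTTERANCES = ["none", "some", "all"]


def is_true(utterance, state, total=3):
    """Literal truth conditions; note that 'some' is literally true of 'all'."""
    if utterance == "none":
        return state == 0
    if utterance == "some":
        return state >= 1
    if utterance == "all":
        return state == total
    raise ValueError(f"unknown utterance: {utterance}")


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def literal_listener(utterance):
    """L0: spread belief evenly over the states where the utterance is true."""
    return normalize({s: float(is_true(utterance, s)) for s in STATES})


def speaker(state, alpha=1.0):
    """S1: prefer utterances that lead the literal listener to the true state."""
    return normalize(
        {u: literal_listener(u)[state] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance, alpha=1.0):
    """L1: infer the state by reasoning about what the speaker would have said."""
    return normalize({s: speaker(s, alpha)[utterance] for s in STATES})


if __name__ == "__main__":
    print("literal  :", literal_listener("some"))
    print("pragmatic:", pragmatic_listener("some"))
```

Under these assumptions, the literal listener gives “all 3 passed” a probability of 1/3 after hearing “some”, while the pragmatic listener gives it only 1/9: a speaker who knew that everyone passed would most likely have said “all”.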
This 2013 paper by Noah Goodman and Andreas Stuhlmüller corroborates this idea. They devise a set of scenarios in which a character uses the word “some” to describe how many objects have a given property (e.g., how many letters have a check in them); participants are then asked how many objects they think have that property.
Letters to Laura’s company almost always have checks inside. Today Laura received 3 letters. Laura tells you on the phone: “I have looked at 3 of the 3 letters. Some of the letters have checks inside.”
Now how many of the 3 letters do you think have checks inside?
As expected, participants consistently think it’s more likely that only 2 of the letters have checks, rather than 3. That is, they consistently derive scalar implicatures.
But the authors take it a step further. If people are deriving these implicatures in a rational, context-sensitive way––as opposed to just assuming that “some” always means not all––we should be able to cancel those implicatures by manipulating something about the world.
For example, if Laura has only looked at 1 of the 3 letters, she doesn’t have any idea about whether the other letters have checks. Thus, when she says “Some of the letters have checks inside”, it’s still entirely possible that the other letters also have checks––she’s just following the maxim of Quantity, so she’s being careful not to overstate how much she really knows about the situation.
To test this, the authors manipulated how many letters the speakers had inspected (“access X” in the figure below). They then asked participants to assign “bets” to each possibility, i.e., whether participants thought it was more likely that “some” meant 0 letters, 1 letter, 2 letters, or 3 letters.
When the speaker had knowledge of all the letters, participants very consistently interpreted “Some” as meaning 2 rather than 3. This is a classic case of scalar implicature: the speaker has full knowledge of the situation, so if they had meant all, they would’ve said “all” (see Figure 2 from the paper for a visualization).
But when the speaker has only opened 1 or 2 of the letters, this inference is considerably weaker. In fact, when the speaker has opened only 1, participants have a relatively uniform distribution over 1, 2, or 3 of the letters having checks (with a slight preference for all 3).
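The qualitative pattern can be reproduced with a small knowledge-sensitive listener model. What follows is a simplified sketch in the spirit of that account, not the authors’ implementation. Several pieces are assumptions of mine: a uniform prior over how many letters have checks (the scenario itself says checks are very common), an always-true silent `"null"` utterance so that the speaker always has something truthful to say, and `alpha = 1`.

```python
import math
from math import comb

N = 3                                   # total number of letters
STATES = list(range(N + 1))             # how many letters actually have checks
UTTERANCES = ["none", "some", "all", "null"]   # "null" = say nothing (assumed)


def is_true(utterance, state):
    """Literal truth conditions for each utterance."""
    return {
        "none": state == 0,
        "some": state >= 1,
        "all": state == N,
        "null": True,
    }[utterance]


def normalize(weights):
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def literal_listener(utterance):
    """Uniform belief over the states where the utterance is literally true."""
    return normalize({s: float(is_true(utterance, s)) for s in STATES})


def obs_prob(observed, state, access):
    """Hypergeometric chance of seeing `observed` checks among `access`
    opened letters when `state` of the N letters contain checks."""
    return comb(state, observed) * comb(N - state, access - observed) / comb(N, access)


def speaker_belief(observed, access):
    """What the speaker believes about the state after opening some letters."""
    return normalize({s: obs_prob(observed, s, access) for s in STATES})


def speaker(observed, access, alpha=1.0):
    """Choose utterances by expected log-probability under the speaker's belief.
    An utterance that might be false given that belief gets weight 0."""
    belief = speaker_belief(observed, access)
    scores = {}
    for u in UTTERANCES:
        l0 = literal_listener(u)
        if any(p > 0 and l0[s] == 0 for s, p in belief.items()):
            scores[u] = 0.0
        else:
            utility = sum(p * math.log(l0[s]) for s, p in belief.items() if p > 0)
            scores[u] = math.exp(alpha * utility)
    return normalize(scores)


def pragmatic_listener(utterance, access, alpha=1.0):
    """Infer the state, averaging over what the speaker might have observed."""
    weights = {
        s: sum(
            obs_prob(o, s, access) * speaker(o, access, alpha)[utterance]
            for o in range(access + 1)
        )
        for s in STATES
    }
    return normalize(weights)


if __name__ == "__main__":
    for access in (1, 2, 3):
        print(access, pragmatic_listener("some", access))
```

With these assumptions, hearing “some” from a speaker who opened all 3 letters leaves about a 0.16 probability that all 3 have checks, whereas hearing it from a speaker who opened only 1 letter leaves a probability of 0.5. The exact numbers depend heavily on the assumed prior and utterance set; the point is only that the implicature weakens when the speaker’s access is partial.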
Why does this matter?
This might all seem a bit in the weeds.
The reason I’m describing this particular experiment in this much detail is that it provides a nice demonstration of a popular theory by which scalar implicature works. Based on these results, it seems like people are able to use their knowledge of a situation––including what the speaker knows or doesn’t know about the situation––to derive inferences about what the speaker meant by what they said.[2] This is all consistent with the classic Gricean model of pragmatic inference.
As you can probably guess from the title of this post, I’m not only interested in what humans do or don’t do; I want to see whether GPT-3 can do it too.
Why test this in an LLM?
There are two main reasons why I think it’s interesting to test whether an LLM can perform scalar implicature.
Improving Understanding of LLMs
First, as LLMs––and other models, such as DALL-E 2––get bigger and apparently more effective, it’s going to be increasingly important to understand the limits of their affordances, as well as the mechanisms by which they accomplish what they accomplish. This is true especially from the perspective of AI safety and ensuring that any system companies hope to deploy (commercially or otherwise) acts like we think it should, in a given situation; increasing the transparency of these “black boxes” is the logic behind a company like Anthropic. These questions have led to spirited debates, including a recent back-and-forth between Scott Alexander and Gary Marcus, as well as this paper arguing that LLMs are akin to “stochastic parrots”––repeating text without understanding it––by Emily Bender and colleagues.[3]
Scalar implicature is an important part of human language comprehension. And critically, the paper I described above tests how scalar implicature works in context. In order for humans to do what they did on that task, it really seems like they need to use some kind of representation of the speaker’s likely knowledge state, which requires something like an elementary Theory of Mind. And further, perhaps they need some basic understanding of the notion of “communicative intent”––the idea that a speaker means something by what they said, and that this intent might be different in different contexts, even for the same sentence.
Critically, some researchers (like Emily Bender and Alexander Koller) have argued that those skills are lacking in LLMs. This seems intuitively true to me as well; what would it mean for a model like GPT-3 to “have” Theory of Mind or an understanding of communicative intent? These models, as Bender has pointed out, are trained to do string prediction: given a string of tokens, they predict which token comes next. Testing this task in GPT-3 represents an empirical test of the claim.[4]
Improving Understanding of Human Language Comprehension
Yet this argument can also be flipped on its head.
Critically, GPT-3 was exposed to language input alone, given the sole task of predicting strings. Thus, to the extent that GPT-3 does display humanlike performance on a given language task, it suggests that this task can be “solved” using a very sophisticated understanding of language statistics.
Now, some might object that perhaps GPT-3––during its training––acquires something like a Theory of Mind. For the moment, and for this post, I’m choosing to remain agnostic on this point. As I’ve noted elsewhere, it all comes down to how strong one’s a priori assumption is that GPT-3 definitely doesn’t have those abilities. But I think this is actually orthogonal to the point I’m trying to make.
Because the fact remains that humans are exposed to many different inputs. We encounter language, yes, but critically, we also encounter other humans, whom we interact with in all sorts of ways: we see them, we smell them, we touch them, we laugh with them, we imitate them. We sit in chairs and run across fields and feel the rain on our backs. The assumption behind much of embodied cognition is that some combination of these experiences forms the basis of our rich cognitive models of the world. And to my knowledge, very few (if any) Cognitive Scientists think that abilities like Theory of Mind or pragmatic inference are derived purely from our understanding of language and which words co-occur with which other words.
From this perspective, then, GPT-3 represents a kind of useful baseline against which to compare human performance. In Cognitive Psychology (and other fields), we use experimental tasks to test a theory, e.g., whether humans use their understanding of a situation to understand what someone means; the assumption is typically that if we manipulate a situation and it affects how people behave, then people are performing according to our theory. But if an LLM––which, remember, has no way to ground the linguistic symbols it encounters––performs the same way, it suggests that this task could in principle be solved by a kind of “unimodal agent”: someone or something that only has exposure to strings of symbols that occur in regular, systematic orders.[5]
What does GPT-3 do?
Part 1: Scalar implicature
For this experiment, I’m limiting myself to Experiment 1 in the main paper. I’m also limited to the passage/item they include in the published paper, which is the one I described earlier (i.e., with Laura and the letters). Finally, in this section, I’m focusing on how GPT-3 responds to the main critical question: How many of the 3 letters do you think have checks inside?[6]
I used the Python OpenAI API to access GPT-3. To test it, I presented it with a text passage describing the scenario, followed by the critical question: How many of the 3 letters do you think have checks inside?
Unlike the original paper, I included two experimental manipulations:
Access: How many letters Laura has checked (this was in the original paper).
Quantifier: Which quantifier Laura uses, i.e., “Some”, “None”, or “All” (this was not in the original paper, but served as a useful comparison point for the “Some” results).
Altogether, this yielded 9 possible conditions: 3 Quantifiers and 3 Access points (1, 2, or 3).
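For concreteness, the 3 × 3 design can be sketched as a small grid of prompts. The passage template below is adapted from the example passage quoted earlier; the exact wording used for the other eight conditions isn’t given, so treat `build_prompt` and its template as my assumption rather than the literal prompts.

```python
from itertools import product

QUANTIFIERS = ["Some", "None", "All"]
ACCESS_LEVELS = [1, 2, 3]
QUESTION = "How many of the 3 letters do you think have checks inside?"


def build_prompt(quantifier, access, total=3):
    """Fill the Laura passage for one condition (template wording is assumed)."""
    return (
        "Letters to Laura's company almost always have checks inside. "
        f"Today Laura received {total} letters. "
        "Laura tells you on the phone: "
        f'"I have looked at {access} of the {total} letters. '
        f'{quantifier} of the letters have checks inside."\n'
        f"{QUESTION}"
    )


# One prompt per (quantifier, access) pair: 3 x 3 = 9 conditions.
conditions = list(product(QUANTIFIERS, ACCESS_LEVELS))
prompts = {condition: build_prompt(*condition) for condition in conditions}

if __name__ == "__main__":
    print(len(prompts), "conditions")
    print(prompts[("Some", 1)])
```

Each prompt would then be sent to the model separately, with one response collected per condition.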
I used GPT-3 with a temperature setting of 0[7] and allowed it to produce up to 60 tokens in response. I then extracted GPT-3's response, which was nearly always something like "2" or "3". Because a couple of answers were something like "All of the letters have checks", I wrote a little Python script to present each answer to me––blind to condition, of course––so that I could hand-code it into one of three categories: 1, 2, or 3.
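A blind-coding script along these lines might look like the sketch below. It isn’t the actual script: the function name `blind_code`, the shuffling, and the injectable `ask` function (which defaults to `input`, and makes the script testable without a keyboard) are all illustrative choices of mine.

```python
import random


def blind_code(responses, ask=input, seed=0):
    """Show each raw model response without its condition and collect a label.

    responses: dict mapping a condition, e.g. ("Some", 2), to the raw text.
    ask:       function that shows a message and returns the coder's answer.
    Returns a dict mapping each condition to an integer code (1, 2, or 3).
    """
    items = list(responses.items())
    random.Random(seed).shuffle(items)      # presentation order reveals nothing

    labels = {}
    for condition, text in items:
        # Only the response text is shown; the condition stays hidden.
        answer = ask(f"Response: {text!r}\nCode as 1, 2, or 3: ").strip()
        while answer not in {"1", "2", "3"}:
            answer = ask("Please enter 1, 2, or 3: ").strip()
        labels[condition] = int(answer)
    return labels
```

Calling `blind_code(raw_responses)` in a terminal would walk through the responses one at a time; the condition labels are only reattached after every response has been coded.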
The other thing to note here is that I didn’t experiment at all with using different prompts. I’m aware that this matters, so I’m certainly curious to see how this might affect the results.
There’s only one item and 9 “conditions”, which means there are only 9 data points. As is often the case, the easiest way to describe them is to show them:
The first thing to note is that GPT-3 infers that “some” does not mean all. When Laura says “Some of the letters have checks”, GPT-3 responds that this means that 2 of the 3 letters have checks. Based on this alone, GPT-3 is doing something akin to deriving a classic scalar implicature.
But critically, we also see that GPT-3 is not modulating its response to “Some” as a function of how many letters the speaker has opened. That is, regardless of whether Laura has opened 1 letter or 3 letters, GPT-3 always thinks “Some” means 2.
Humans, on the other hand, appear to have a much more flexible understanding of “Some”. Depending on whether Laura has full knowledge of the situation or not, humans interpret “Some” to mean something different––suggesting that at least in this case, pragmatic inference is integrated with their understanding of a situation.
Based on these results, it seems like two things are true:
GPT-3 assumes that “Some” means more than zero but not all.
This assumption––it’s unclear whether we ought to call it an inference––is unaffected by features of the situation, e.g., whether the speaker could possibly know whether all of the letters have checks or not.
If anything, GPT-3’s performance seems more like lexicalized theories of scalar implicature: i.e., that rather than performing all this complicated, context-dependent inferencing, people just store a meaning of “Some” corresponding to more than zero but not all.
All of this raises a second question: does GPT-3 even know what Laura knows about the situation? That is, does GPT-3 know that when Laura has opened all 3 letters, she knows how many have checks––but that when she’s opened only 1, she can’t really know that?
Part 2: Modeling Knowledge
I presented the prompts to GPT-3 in the same way, but asked a different question (also from the paper): Do you think Laura knows exactly how many of the 3 letters have checks inside?
Then, as before, I hand-coded each response. I gave “No” responses a 0 and “Yes” responses a 1.
All other parameters were kept constant.
I’ll be honest: these results were not really what I was expecting.
In the graph below, “correct behavior” by GPT-3 should be 100% of yes responses when Laura has opened all three letters, and something like 0% when she’s opened only one or two. That’s not what we see.
The first thing to notice is that in only 2 of the 9 scenarios does GPT-3 respond with “yes” (a 1). And importantly, one of these scenarios is the situation where Laura has only opened 2 letters. That is, even though Laura can’t possibly know whether all 3 letters have checks––she’s only opened 2 of them, after all––GPT-3 responds that she does know how many have checks. Additionally, when Laura has opened all 3 letters, it’s only in one condition that GPT-3 responds with “yes”. In other words, even if Laura does have full knowledge of the situation, GPT-3 responds with “no” in two of the cases.
To better understand these data, I also broke the results down by quantifier (see below).
These data give some insight into what’s happening. Based on my read, it seems like GPT-3’s response is dependent primarily on whether Laura says “All”. That is, rather than basing its response on how many letters Laura has opened––which is more akin to what humans do on this task––GPT-3 derives a different inference about Laura’s knowledge state based on what she says.
From one perspective, this actually makes sense. It’s almost like an inversion of the typical direction of scalar implicature. In the typical model, we derive a different understanding of what someone said based on what we think they know. But in this case, it seems like GPT-3 is deriving a different interpretation of what someone knows based on what they said. After all, if someone says “All”, they must be pretty confident. (Side note: why, then, does GPT-3 not also answer with “yes” when Laura says “None”?)
Note that this isn’t 100% the case: if Laura has opened only one letter, GPT-3 seems to downgrade her knowledge of the situation––even if she says “All”.
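One way to make the comparison explicit is to write down both candidate rules and count how many of the 9 conditions each one matches. The `observed` pattern below is reconstructed from the prose description above (a “yes” only when Laura says “All” and has opened 2 or 3 letters), not read off the raw responses, so treat it as my reading of the results. The two rule functions are likewise just illustrative.

```python
from itertools import product

QUANTIFIERS = ["None", "Some", "All"]
ACCESS_LEVELS = [1, 2, 3]

# 1 = "yes, Laura knows exactly how many"; 0 = "no".
# Reconstructed from the description in the text, not from raw model output.
observed = {
    (q, a): int(q == "All" and a >= 2)
    for q, a in product(QUANTIFIERS, ACCESS_LEVELS)
}


def access_rule(quantifier, access):
    """Humanlike rule: Laura knows only if she has opened all 3 letters."""
    return int(access == 3)


def quantifier_rule(quantifier, access):
    """Alternative rule: Laura 'knows' whenever she says 'All'."""
    return int(quantifier == "All")


def agreement(rule):
    """Number of conditions (out of 9) where the rule matches the pattern."""
    return sum(rule(q, a) == answer for (q, a), answer in observed.items())


if __name__ == "__main__":
    print("access rule    :", agreement(access_rule), "of", len(observed))
    print("quantifier rule:", agreement(quantifier_rule), "of", len(observed))
```

On this reconstruction, the access-based rule matches 6 of the 9 conditions and the quantifier-based rule matches 8, with the one miss being the case where Laura says “All” after opening a single letter.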
For me, this raised one more question: how much of GPT-3’s response here is contingent on that final statement by Laura––All/Some/None of the letters have checks––and how much is contingent on how many letters she says she’s opened?
Part 3: Modeling Knowledge (again)
I did exactly the same thing as Part 2, but took out the final statement with the quantifier. The passage ends with Laura saying how many letters she’s opened; then, GPT-3 is asked whether Laura knows how many letters have checks.
The results (see below) are clear: when we take out that final statement by Laura, GPT-3 seems to answer in accordance with what Laura could be presumed to know. When Laura has opened all three letters, GPT-3 responds with “Yes”; otherwise, GPT-3 responds with “No”.
Granted, our number of items is now a third of what it was originally, since we no longer have the manipulation of Quantifier. But to me, this seems like compelling evidence that in the absence of a verbal cue otherwise, GPT-3 is correctly modeling Laura’s knowledge of the world state (again with the caveat that this is only one particular scenario).
So what’s going on?
Broadly, there are three key results:
(1) GPT-3 interprets “Some” as meaning 2 (out of 3), regardless of how many letters Laura has opened (Pt. 1). Humans, in contrast, modulate their interpretation depending on what Laura knows about the situation.
(2) When asked about Laura’s knowledge state before Laura says “Some/None/All have checks”, GPT-3 seems to correctly model what she knows. When she’s opened all three letters, GPT-3 says she knows exactly how many letters have checks; otherwise, GPT-3 says “no”.
(3) But when asked about Laura’s knowledge state after she says “Some/None/All have checks”, GPT-3’s response seems heavily conditioned on which quantifier she uses!
What are we to make of this?
Based on point (1) alone, my conclusion would’ve been something like: GPT-3 has a lexicalized interpretation of “Some” as meaning more than zero but not all.
This would be consistent with GPT-3 either not modeling Laura’s knowledge state at all, or with not integrating that information into its pragmatic inference. And further, the latter interpretation is more consistent with (2): namely, GPT-3’s model of what Laura knows seems correct, as long as we ask this before she uses a quantifier––which suggests that GPT-3 “knows” what Laura knows, it just isn’t using that information.
But that’s not entirely right. Because as we saw with (3), GPT-3’s model of Laura’s knowledge state changes depending on which quantifier she uses. The easy––and cool––interpretation of this would be that GPT-3 is sensitive to Laura’s confidence level, like a reversal of the typical Bayesian inference performed in the Rational Speech Act model. But if that were true, GPT-3 should also respond with “Yes” when Laura says “None”: after all, “None” conveys a lot of confidence about the situation too, and the question is just whether she knows exactly how many letters have checks. But this confidence-based interpretation––if that is indeed what’s happening––seems restricted to when Laura says “All”.
Perhaps this is where we throw our hands up and say it’s too complicated to make sense of. That might indeed be the case––certainly, I’d want to see these results generalized to a lot more items and scenarios before I drew any strong conclusions.
For me, though, the key takeaway is that as presented, GPT-3 is not deriving scalar implicatures in the same way that humans do: unlike humans, GPT-3 responds that “Some” always means more than zero but not all, regardless of what a speaker knows. The question to be determined in future investigations is why.
[1] There are some good reasons to think that the literal meaning of some could include ALL. For example, if you later find out that all of the students passed the test, and that I knew this fact, most (though not all!) people wouldn’t think I’d been lying when I said “some”. You might think I was being an uncooperative conversational partner, but not necessarily that I was stating an outright falsehood.
[2] Note that this is not a claim about the time course by which this happens.
[3] I should note that the paper makes a fair number of additional points, including the fact that training these LLMs involves a high amount of carbon emissions, as well as the now well-documented observation that their representation of language seems to encode social biases (e.g., relating to gender, race, and more).
[4] With the caveat, of course, that even if GPT-3 “fails”, a bigger, better model might always out-perform it sometime in the future.
[5] There’s an even deeper point to explore there, which I hope to write a separate post about, which is that the very structure of language encodes our knowledge of the world. Thus, learning language is almost like a kind of shortcut to learning about how the world works. As others have argued, it’s a kind of cognitive enhancement––one of our first and greatest technologies.
[6] The authors also asked two other questions. One was designed to obtain a subject’s “prior” over how many letters had checks (i.e., before Laura says anything); I can tell you that GPT-3 responds with: “All three of the letters Laura received have checks inside.” The other question was designed to probe whether subjects had encoded information about Laura’s likely knowledge state; Pt. 2 has the results for that.
[7] Some preliminary tests suggest that other temperature settings don’t have a big impact on the main results, but I can’t speak definitively to this.