Orange peels, human tests, and LLMs
Why it's hard to measure an LLM's "capacities"—and whether we ought to use tests originally designed for humans to do so.
Figuring out what Large Language Models (LLMs) can and can’t do is important. It’s relevant to our understanding of how these systems work, and it also matters for our decisions about when and where to deploy them. But determining a system’s capacities is a surprisingly difficult task.
“Capacities”, after all, are relatively abstract constructs. Take the capacity to “reason”. Academic and journalistic articles alike are replete with arguments about whether or not LLMs can reason. This debate stems in part from the difficulty in establishing consensus on what constitutes reasoning capacity. In order to measure reasoning capacity, we need a clear definition of “reasoning” and a clear way to operationalize this definition in a measurable task. Moreover, as the authors of “AI and the Everything in the Whole World Benchmark” argue, many of the benchmarks developed to evaluate LLMs may not be fully representative of the capacities they’re trying to assess. How can we go about designing better benchmarks? Or are benchmarks the wrong approach entirely?
This is ultimately a question about construct validity, which I’ve written about before. One potential solution could be to adapt tests of human performance for LLMs. If researchers generally agree that a given test is good at assessing human capacities—and that’s a big “if”—then perhaps it can also serve as a window into how LLMs work. Yet not everyone agrees with this logic.
In this post, I’ll present some of the arguments for and against using human benchmarks. I’ll then describe the trajectory of my own thinking on this topic. While I can’t promise a definitive answer, I can at least try to provide a conceptual framework for thinking through the problem.
Using human tests to benchmark LLMs
Cognitive scientists try to understand the mind by studying the mechanisms that underpin human cognition. This often involves hypothesizing about the specific “capacities” that make up the mind, such as working memory, Theory of Mind, logical reasoning, and more. Researchers also design tasks to measure these capacities. For example, as I’ve written about before, Theory of Mind is often assessed using the False Belief Task.
Cognitive scientists are no strangers to construct validity. Researchers may disagree about which capacities are “real” and which are convenient fictions. Researchers might disagree even more about how to measure those capacities. Nonetheless, some tasks are used more frequently and are met with less criticism than others. Presumably, if one were to poll researchers in a given domain of human cognition, some tasks would come out ahead.
One argument, then, is that we ought to use these generally-agreed-upon tasks to assess LLMs as well. The logic is straightforward: if a given task is accepted as a measure of some capacity in humans, then that same task could be used to measure the same capacity in LLMs. This approach solves the problem of construct validity by passing the buck, so to speak, to researchers of human cognition. And I don’t mean that negatively—rather than reinvent the wheel, it makes sense to adopt tests that are already widely used in humans.
This is the logic underlying much of what constitutes “behavioral LLM-ology”. I’ve written about Theory of Mind before, but researchers have applied other measures too, such as tests of analogical reasoning, personality, and even political orientation. It’s also the logic behind other highly publicized results, such as GPT-4 passing the bar exam, medical licensing exam, and the sommelier exam.
Crucially, in each case the assumption is that a given task, as well as any results obtained using that task, means the same thing for both humans and LLMs.
Differential construct validity; or, “it doesn’t mean the same thing”
Not everyone agrees with that assumption. In fact, as described in this article, many researchers have argued that the same test means different things when applied to humans vs. LLMs. Here are two choice quotes from the article (bolding mine):
“People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”
…
“There is a long history of developing methods to test the human mind,” says Laura Weidinger, a senior research scientist at Google DeepMind. “With large language models producing text that seems so human-like, it is tempting to assume that human psychology tests will be useful for evaluating them. But that’s not true: human psychology tests rely on many assumptions that may not hold for large language models.”
I’ve started calling this the differential construct validity (DCV) view. The DCV view holds that a measure of some capacity (e.g., reasoning) can be valid when applied to one type of subject (e.g., a human) but not to another type of subject (e.g., an LLM). This view is conceptually attractive because it allows psychologists to continue using their assessments for human subjects, while still arguing that the tests are not meaningful for LLMs.
The question is why the same test should have differential construct validity. There are a couple possibilities.
Data contamination can invalidate a test
The first is data contamination. Data contamination occurs when a system is tested on something that was explicitly in its training data. If GPT-4 was trained on the exact bar exam it’s being tested on, it’s not particularly impressive that it can pass the test. At minimum, then, researchers need to ensure that the assessments they’re using were not observed in the training data. (This is why my co-authors and I devised completely novel stimuli for our recent paper on Theory of Mind.)
As the article points out, it’s also possible that LLMs were trained on similar but not identical tests:
In its work with Microsoft involving the exam for medical practitioners, OpenAI used paywalled test questions to be sure that GPT-4’s training data had not included them. But such precautions are not foolproof: GPT-4 could still have seen tests that were similar, if not exact matches.
I’m not sure whether this is outright data contamination. Certainly it’s an example of a model learning the structure of a task from the training data, and it also opens up the results to potential confounds—e.g., if the correct answer is always signaled by some simple lexical cue.
But if confounds have been controlled for, then this latter scenario feels more similar to a student preparing for the SAT using a test prep course. Does a test prep course invalidate one’s SAT score? I’m not sure, but your answer to that question will inform how seriously you take this latter kind of data contamination for interpreting an LLM’s results. If there is a difference, however, it’s likely a difference in scale—LLMs see a lot of training data—and perhaps quantity has a quality all of its own. Ultimately, an LLM’s performance on a test would be most convincing if that LLM has not seen any similar tests in its training dataset. (This is also a good reason to be open and transparent about an LLM’s training data!)
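For the outright case, at least, there are rough ways to audit a benchmark when the training data is accessible. Below is a minimal sketch (my own illustration, assuming access to at least a sample of the training corpus) of a word-level n-gram overlap check. Note that this would only catch near-verbatim matches, not the “similar but not identical” tests discussed above.

```python
# Minimal sketch of a verbatim-overlap contamination check (my own illustration).
# `training_corpus` is a stand-in: a real audit would need access to (at least a
# sample of) the model's actual training data.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(test_item: str, document: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in a document."""
    item_ngrams = ngrams(test_item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & ngrams(document, n)) / len(item_ngrams)

# Flag any test item that shares a large fraction of its n-grams with some
# training document; such items are candidates for outright contamination.
test_items = ["Sally puts her marble in the basket and leaves the room ..."]
training_corpus = ["..."]  # placeholder documents

flagged = [
    item for item in test_items
    if any(overlap_score(item, doc) > 0.5 for doc in training_corpus)
]
print(flagged)
```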
Humans and LLMs may solve the same problem in different ways
The second possibility is more subtle. It’s usually framed as an objection (see the quote just below), though I think it’s actually more of an interesting research question—namely, whether LLMs and humans arrive at the same solution (task behavior) using different mechanisms. Here’s one version of that objection from the same article I linked above:
Does GPT-4 display genuine intelligence by passing all those tests or has it found an effective, but ultimately dumb, shortcut—a statistical trick pulled from a hat filled with trillions of correlations across billions of lines of text?
The explicit contrast in this framing is between “genuine intelligence” vs. “dumb shortcut”. You might also see “dumb shortcut” glossed as “cheap tricks” or “superficially impressive auto-complete”.
The implication in each case is that there are two ways the same behavior could emerge from a system:
Some cognitive capacity (the thing we’re interested in) gives rise to the behavior.
Some alternative “trick” (not the thing we’re interested in) gives rise to the behavior.
One might go further and assume that (1) holds when the test is applied to humans, and that (2) holds when the test is applied to LLMs. This is a way of restating the Differential Construct Validity view.
However, I’m not convinced these are the only two options on the table. For one, it’s possible that it’s humans who are using the cheap tricks—after all, there’s ample evidence that people rely on good-enough heuristics to solve problems under resource constraints, including in language comprehension. But more importantly, there might be multiple, valid solutions to a given problem, and it’s worth knowing what those are.
There’s more than one way to peel an orange
It’s true that the same behavior could in principle be generated by distinct mechanisms. But this doesn’t automatically invalidate all but one of those mechanisms. Rather, in my view, it calls for a more thorough evaluation of what those distinct mechanisms are.
If the argument is that humans solve a task using some cognitive capacity and that LLMs do something else, what is that “something else” and how does it diverge from what humans are doing? It’s possible that LLMs are using “cheap tricks” and that humans are not, but I don’t think it’s fair to assert this without empirical demonstration. Notably, even the article I cited includes a quote from psychologist Tomer Ullman making a similar point:
But Ullman thinks that it’s possible, in theory, to reverse-engineer a model and find out what algorithms it uses to pass different tests. “I could more easily see myself being convinced if someone developed a technique for figuring out what these things have actually learned,” he says.
“I think that the fundamental problem is that we keep focusing on test results rather than how you pass the tests.”
It may be helpful to be precise about exactly where humans and LLMs diverge in the way they solve a task. The neuroscientist David Marr argued that analyses of information-processing systems could operate along three complementary levels of analysis:
The computational level: what is the goal of a system? What are its inputs and outputs?
The representational level: what representations or algorithms does that system use to mediate between input and output?
The implementational level: how can that representation be realized physically?
By way of illustration, Marr references a commonplace device: a cash register. One way of understanding what a cash register does is to describe its goal—a cash register sums up a bunch of numbers, so its goal (its computation) is addition. But you could also describe the representations such a system uses (like Arabic numerals) or the algorithms it uses to combine those representations. And finally, you might describe how those representations and algorithms are implemented on a physical device.
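To make that concrete, here is a toy sketch (my own illustration, not anything from Marr): two functions that share the same computational-level description (both perform addition) but differ at the representational and algorithmic level. One operates over Arabic numerals, digit by digit with carries; the other operates over a unary, tally-mark representation, where addition is just concatenation. The implementational level adds still more degrees of freedom (silicon, neurons, mechanical gears).

```python
# Toy illustration (mine, not Marr's): the same computational-level goal, addition,
# realized by two different representations and algorithms.

def add_arabic(a: str, b: str) -> str:
    """Add two numbers represented as Arabic-numeral strings, digit by digit with carries."""
    a, b = a[::-1], b[::-1]  # work from the least-significant digit
    digits, carry = [], 0
    for i in range(max(len(a), len(b))):
        total = carry + (int(a[i]) if i < len(a) else 0) + (int(b[i]) if i < len(b) else 0)
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

def add_unary(a: str, b: str) -> str:
    """Add two numbers represented as tally marks: addition is just concatenation."""
    return a + b

assert add_arabic("27", "15") == "42"
assert add_unary("|||", "||") == "|||||"  # 3 + 2 = 5
```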
You can also apply this framework in the study of cognitive or perceptual functions like vision. One of the “goals” of the visual system is detecting objects in a visual scene. But there might be distinct representational mechanisms by which object detection is achieved—different ways of combining perceptual features like color, shadow, or visual acuity to infer the boundaries of different objects. And there might be even more distinct possibilities for how such mechanisms could be implemented in a physical system (biological or not).
This conceptual framework could help us identify exactly how humans and LLMs diverge in how they solve a given task. For example, if both humans and LLMs produce analogous behavior on a task designed to assess some capacity (e.g., Theory of Mind), we can ask whether that behavior is also produced by analogous mechanisms. This is, unfortunately, not an easy task. For one, we don’t actually know which representations humans use to solve any given task—we have to infer mechanisms from human behavior on carefully designed experiments.
But that’s a game cognitive psychology is already familiar with: by systematically manipulating parameters of a task, we can figure out which manipulations produce measurable changes in behavior; this, in turn, carries implications about the different “algorithms” or mechanisms that might be responsible for generating behavior on the original task.
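As a rough illustration of what that kind of systematic manipulation might look like for an LLM, here is a minimal sketch using a made-up false-belief vignette and a stand-in query_model function (a placeholder, not any particular API). The idea is simply that perturbations which remove the false belief should flip the expected answer, and a system that genuinely tracks beliefs should flip with them.

```python
# Minimal sketch of perturbation-based evaluation (my own illustration). The
# vignette and its variants are made up, and `query_model` is a placeholder for
# a real LLM call, not any particular API.

def query_model(prompt: str) -> str:
    """Stand-in for an actual LLM query; swap in a real client here."""
    return "<model answer>"

story = (
    "Anna puts her keys in the drawer and leaves the room. "
    "While she is gone, Ben moves the keys to the shelf."
)
question = "When Anna returns, where will she first look for her keys?"

# Each variant keeps the surface form of the task but changes whether Anna
# actually holds a false belief, so the expected answer changes too.
variants = {
    "original": (story, "drawer"),
    "anna_watches": (
        story.replace("While she is gone", "While she watches through the window"),
        "shelf",
    ),
    "ben_tells_anna": (
        story + " Ben then calls Anna and tells her where he put the keys.",
        "shelf",
    ),
}

for name, (vignette, expected) in variants.items():
    answer = query_model(f"{vignette} {question}")
    print(f"{name}: expected roughly {expected!r}, model said {answer!r}")
```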
This is roughly what Tomer Ullman did for LLMs in a recent preprint showing that LLMs are not robust to small alterations to Theory of Mind (ToM) tasks, concluding:
Has Theory-of-Mind spontaneously emerged in large language models? Probably not. While LLMs such as GPT-3.5 now regurgitate reasonable responses to basic ToM vignettes, simple perturbations that keep the principle of ToM intact flip the answers on their head.
In this case, the goal was deflationary, i.e., arguing that LLMs do not have ToM. But even a deflationary account calls for a mechanistic explanation: if LLMs are not using ToM to solve a given task, then what exactly are they doing? And is that distinct from what humans are doing?
Again, these are not easy questions to answer. But my basic argument is that it’s not enough to simply assert that LLMs solve a problem in a different way than humans and thus a test “doesn’t mean the same thing”; this may well be true, but I think more evidence or at least theoretical elaboration should be marshaled to support the claim. And to be clear, I think the same is true of positive claims about LLM capacities based on task performance, which should be tempered by consideration of counterfactual and possibly “deflationary” explanations.
This is roughly the stance adopted by computer scientist Ellie Pavlick in a recent paper. From the abstract (bolding mine):
While debate on this question typically centres around models’ performance on challenging language understanding tasks, this article argues that the answer depends on models’ underlying competence, and thus that the focus of the debate should be on empirical work which seeks to characterize the representations and processing algorithms that underlie model behaviour.
Personally, I still think it’s fair to focus on explicit behavior as well. But I agree with Pavlick that the debate over which capacities give rise to which behaviors may be intractable without diving deeper into either the representational or implementational levels of analysis.
Where I stand now
For a while, this was where I landed:
If a task designed for humans is accepted as a valid assessment of some capacity, then it’s also a valid computational-level assessment of that capacity in LLMs. At the same time, LLMs may arrive at the behavior via different mechanistic solutions, some of which could be considered “cheap tricks”, which calls for a deeper analysis of the likely representations giving rise to the behavior in both humans and LLMs.
I still think that’s basically right, and that the Differential Construct Validity (DCV) view is sometimes levied unfairly without empirical evidence to back it up.
But a couple of weeks ago, I started having doubts—I realized that there’s a concrete way in which the DCV view could be correct, and it has to do with the predictive validity of a measure.
Suppose an LLM (like GPT-4) passes the bar exam (which it does).
My original position would’ve been that insofar as the bar exam is an assessment of lawyerly capacities in humans—let’s set aside whether it actually is for now—it’s also a good assessment of lawyerly capacities in LLMs. It’s possible, perhaps even likely, that GPT-4 solved the exam using different resources and different mechanisms from humans. But that’s a separate question from whether it passed. If we think of the bar exam as a prediction of future lawyer performance—again, for the sake of argument—then passing the bar exam is a kind of “prediction” that a given test-taker would make a good lawyer.
My concern now is that the same test may be differentially predictive of future performance for humans and LLMs. And I don’t just mean that being a lawyer obviously involves many more capacities than those which are assessed in the bar exam; as Timothy Lee aptly pointed out in his recent post, many jobs depend on more than just abstract reasoning ability. But even if the bar exam was all you needed to predict a human lawyer’s future performance, I’m not confident that it’d be a good metric for an LLM lawyer’s performance.
Put another way: the correlation between test performance and in situ performance may be weaker (or at least different) for LLMs than for humans.
I’m not asserting that this is true. But it feels plausible to me. And it actually relates to deeper questions about aligning Artificial Intelligence systems, as Pam Rivière writes in her recent post on sandboxing (bolding mine):
Ultimately, the challenge with the “sandboxing” approach is this: it’s hard to foresee all of the relevant factors that might lead to misalignment, which means that a model’s behavior in the sandbox may not predict its behavior in the wild.
Of course, the best way to determine how a system will perform in the wild is to actually deploy it and see how it performs. Forget all these murky debates about which “capacities” are relevant, and which tests measure which capacities for which experimental subjects, and just observe what a system does in the real world. It’s certainly true that the best predictor of performance is performance itself. But this is also the riskiest option. After all, one of the reasons it’s so important to have good assessments is that we want to know ahead of time if a system is safe to deploy. And that, I think, calls for an approach that tries to assess LLM capacities prior to deployment.
The silver lining is that the solution may be the same as in the alignment case:
A possible alternative to this gloomy state of affairs might be to have a ton of test environments…the more sandboxes we have—where each sandbox tests a unique configuration of situational factors—the more likely one of those sandboxes will reveal some kind of misalignment.
Similarly, I’d advocate for a multi-pronged approach in how we assess LLMs.
First, I do think that human assessments will continue to be useful. I’m more concerned about their predictive validity than I used to be, but I still think that a behavioral result provides information. And I still think we shouldn’t automatically assume a test has differential construct validity for humans and LLMs.
But second, in line with Ellie Pavlick’s argument, I think we need more investigations of the representational or algorithmic level at which a given system solves a given task. In terms of the work being done now, this is more similar to “mechanistic interpretability” or what I’ve previously called “internalist LLM-ology”.
Success is not guaranteed, just like it’s not guaranteed in the study of human cognition. But my current view is that this is the best approach to minimizing the downside risk associated with deploying LLMs in the wild, and it’s a gamble I think is worth making.
Great stuff here Sean, as always. I've been thinking a lot lately about this passage from Stanford historian Jessica Riskin, writing about the history of AI and the Turing Test specifically:
Recently I was talking with a group of very smart undergraduates, and we got to discussing the new AIs and what sort of intelligence they have, if any. Suddenly one of the students said, “I wonder though, maybe that’s all I do too! I just derive patterns from my experiences, then spit them back out in a slightly different form.” My answer came out of my mouth almost before I could think: “No! Because you’re you in there thinking and responding. There’s no ‘I’ in ChatGPT.” He smiled uncertainly. How can we tell there’s no “I” in there, he and the others wondered? To insist that ChatGPT can’t be intelligent because it’s a computer system and not a living thing is just a dogmatic assertion, not a reasoned argument.
How do we know when we’re in the presence of another intelligent being? Definitely not by giving it a test. We recognize an intelligent being by a kind of sympathetic identification, a reciprocal engagement, a latching of minds. Turing was definitely on to something with his idea about conversations, and if we were able to have conversations like the ones he imagined with machines, that might be a different matter. It wouldn’t be a test of artificial intelligence, but it might be a compelling indication of it. Such machines, though, would be fundamentally different from the generative AIs. To contemplate what they might be like, I think we’d need to draw upon the very sort of intelligence whose existence the founders of AI denied: an irreducibly reflective, interpretive kind of thinking. In fact, the sort Turing used to imagine conversing with intelligent machines.
***
I don't know if that will resonate, hardcore empiricist that I believe you to be. But I think it's interesting to ponder a move away from "test" and toward "indications," fuzzy as that may sound. It's a sort of buzzing inside my head, which is how Turing fuzzily described his own process of thought.
https://www.nybooks.com/online/2023/06/25/a-sort-of-buzzing-inside-my-head/