Language models vs. "LLM-equipped software tools"
Cognitive architectures, extended minds, and distributed cognition.
Discussions about the capabilities (or lack thereof) of large language models (LLMs) sometimes treat LLMs as “end-to-end” systems. This perspective makes sense when the specific goal is assessing what kinds of abilities can emerge from training a next-token prediction system on lots and lots of text. Notably, however, many of the commercial LLMs people interact with are not pure “vanilla” LLMs: they’ve undergone various forms of fine-tuning, and they often have access to external applications, like a Python shell or a search engine. Sometimes they’re even programmed to run somewhat autonomously, executing many “actions” in the absence of direct user input.
In fact, many potential use cases of LLMs will likely involve embedding an LLM in some kind of broader software architecture. Because the capabilities of these systems depend not only on the LLM but also on the software it connects to, I think it’s more appropriate to refer to them as something like “LLM-equipped software tools”. In this post, I’m going to explain why I think this distinction is important—and then I’ll describe a few conceptual frameworks that might help us think about the issue more clearly, which ultimately connect to deep questions in Cognitive Science about the boundaries of the self.
I should also note up front that this distinction is not intended as a criticism of either LLM-equipped software systems or vanilla LLMs. For reasons that will hopefully become clear later in this article, I’m less interested in which “side” is correct here than in a more precise articulation of what we’re even talking about when we talk about LLMs.
What system are we referring to?
Calling something an “LLM” vs. an “LLM-equipped software tool” might seem like just the sort of pedantic distinction you’d expect an academic to make. But I do think it’s an important one, and it matters for discussions about LLMs and their capabilities—regardless of what “side” in a capabilities debate you find yourself arguing for. We ought to be precise about credit assignment when arguing about what a system can or can’t do.
One reason this matters is practical. An LLM equipped with a calculator application programming interface (API) will likely produce more accurate answers to arithmetic questions than an LLM trained only to predict the next token. That’s largely the point of papers like this one, which demonstrate that transformer language models trained not only to predict strings but also to output API calls outperform vanilla LLMs on a range of tasks, including arithmetic. Suppose, then, that someone wants to use LLMs in a context that requires them to perform arithmetic (e.g., as a math tutor). It’s important for that person to know which of the two systems actually produces consistent, reliable responses to math questions! Similarly, an LLM with the ability to search a repository of academic papers will likely generate more reliable citations than an LLM relying solely on its training data to generate text.
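To make the credit assignment concrete, here’s a minimal sketch (in Python) of how a calculator-equipped system might be wired up. The function names and the [Calculator(...)] convention are my own inventions for illustration, not any particular product’s API; the point is simply that the arithmetic is done by ordinary code, and the LLM only decides when to ask for it.

```python
import re

def calculator(expression: str) -> str:
    # The "tool" is ordinary software, not the LLM. (Illustrative only:
    # a real system would use a proper math parser rather than eval.)
    return str(eval(expression, {"__builtins__": {}}, {}))

def answer_with_calculator(prompt: str, llm) -> str:
    """One pass of tool routing. `llm` stands in for any function that
    maps a prompt string to a completion string."""
    draft = llm(prompt + "\nYou may write [Calculator(<expression>)] to compute.")
    match = re.search(r"\[Calculator\((.+?)\)\]", draft)
    if match is None:
        return draft  # the model's unaided (vanilla) answer
    result = calculator(match.group(1))
    # Splice the tool's output back into the context and let the model finish.
    return llm(prompt + f"\nCalculator result: {result}\nFinal answer:")
```

In a sketch like this, the reliable arithmetic lives in calculator and in the routing code around the model, which is exactly why “the LLM can do arithmetic” under-describes what’s going on.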
If the impressive performance comes from an LLM hooked up to some external API, then asserting that “LLMs can do X” is, at best, under-specified—since it’s not just a next-token prediction system that achieves that performance. At worst, it’s misleading: in the arithmetic case, for instance, it would be incorrect for someone to infer that they can expect the level of performance achieved with an LLM hooked up to a calculator from a vanilla language model.
I’m also far from the first person to make this point. Computational linguist Tal Linzen, for example, argued against “terminological creep” in a Twitter/X post back in May, writing:
From a scientific perspective it's best to work with a clear definition of an LLM (e.g. trained with self-supervision only). Once you have a system that combines an LLM with, say, a Python interpreter, it's no longer an LLM (though it may look like one from the OpenAI API).
Terminological precision also matters for building generalizable theories about what kinds of systems or system components are necessary and sufficient for which kinds of abilities—whether we’re talking about Theory of Mind (ToM) or debugging computer code. In the case of ToM, there are important theoretical debates about the role of language input specifically in shaping our ability to represent the mental states of other social agents. If a vanilla LLM displays behavior consistent with ToM, that lends credence to the “language alone is sufficient” view. But if ToM-consistent behavior requires “something more”—like, say, a symbolic “working memory” system hooked up to an LLM—then it’s important to know that.
And just in case the notion of “theoretical debates” seems irrelevant to application-oriented readers: a good theory arguably matters not only for academic questions but also for decisions about how to engineer a useful LLM-based application. And if you’re thinking about the best way to incorporate an LLM into a product, you’re probably working with some kind of implicit theory anyway, e.g., about how to connect a series of prompts and state-dependent variables to the operations of a vanilla LLM. In my view, we’re unfortunately still too early in the game to make strong theoretical claims about how these system components fit together, but that’s exactly why I think the distinction is so important: terminological precision is a precondition for good LLM-ology.
That’s where the notion of a cognitive architecture comes in.
Language agents and cognitive architectures
The notion of a “language agent” has become more popular recently. Although the term “agent” has (at least for me) connotations of intentionality or even sentience, here it really just indexes something about whether and to what extent an LLM is connected to some external environment—whether that environment is a web browser or a robot moving about the physical world. As I mentioned above, decisions about how to connect an LLM to that external environment invariably involve drawing upon some kind of implicit theory of the best way to do so; in my view, it would be helpful to be as explicit and precise as possible about the design of these LLM-equipped software systems. Fortunately, an excellent paper published in early 2024 entitled “Cognitive Architectures for Language Agents” does exactly that.
The paper, led by Theodore Sumers and Shunyu Yao, introduces a conceptual framework for thinking about how different parts of an LLM-equipped software system fit together. Specifically, they cast the different components in terms of cognitive abilities, like memory and reasoning: hence, “cognitive architectures for language agents” (CoALA). Under this framework, we can first divide a system’s actions into those that interact with some external environment (e.g., navigating a website or controlling a robot) and those that interact with some internal representations (e.g., maintaining and searching a memory database).
To help illustrate what they mean here, I’ll focus on the structure of memory specifically. As the authors point out, vanilla transformer language models are essentially stateless: they don’t “remember” what happens across different forward passes through the system. There are all sorts of workarounds for this, such as creating ever-larger context windows or using retrieval-augmented generation (RAG). And there’s nothing necessarily wrong with these workarounds, but the authors suggest that it’s helpful to view different solutions to the problem of “statelessness” through the lens of human memory and cognition.
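To make that statelessness concrete, here’s a minimal sketch of a chat wrapper (in Python; the llm and retriever callables are placeholders for whatever model and index you actually use, not a specific library’s API). Nothing persists inside the model between calls: any appearance of memory comes from the surrounding code deciding what to re-send, whether that’s the whole transcript, a few retrieved passages (RAG-style), or both.

```python
def chat_turn(history: list[str], user_message: str, llm, retriever=None) -> str:
    """One turn of a chat loop around a stateless LLM.

    `llm` maps a prompt string to a completion; `retriever` (optional) maps a
    query to a short list of relevant documents. Both are stand-ins.
    """
    retrieved = retriever(user_message) if retriever else []
    # The model only ever "knows" what appears in this prompt, on every call.
    prompt = "\n".join(retrieved + history + [f"User: {user_message}", "Assistant:"])
    reply = llm(prompt)
    # Whatever counts as "memory" lives out here, in ordinary application state.
    history.append(f"User: {user_message}")
    history.append(f"Assistant: {reply}")
    return reply
```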
Specifically, theories of human memory often divide it into different components. There’s working memory, which maintains information that’s readily available (like remembering a short list of numbers); there’s episodic memory, which stores experiences along with their sensorimotor or social-interactional associations (like a childhood memory of playing in the park); there’s semantic memory, which we often think of as “declarative knowledge” about the world or ourselves (like the fact that Paris is the capital of France); and then there’s procedural memory, which is knowledge about how to do things (like riding a bicycle).
How would these components map onto LLM-equipped software systems? The authors suggest that the weights learned by LLMs are a kind of procedural memory, as is any code written by the engineers describing how to execute certain actions. In contrast, methods like RAG are most similar to semantic memory: a large store of (approximate and lossy!) information about things the system has encountered and “encoded”; depending on the system’s design, that memory store might be relatively static or susceptible to change. And methods that encourage LLM-equipped systems to generate intermediate representations—like chain-of-thought prompting or “pause” tokens—might be analogous to the notion of working memory, as would any method that connects the system’s current state to other components of the memory system.
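Here’s one way that mapping might look in code: a loose, illustrative sketch whose names and structure are my own rendering of the idea, not the paper’s implementation. It’s just meant to show where each kind of memory would live relative to the LLM itself.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """An illustrative, CoALA-style memory layout (my own loose rendering)."""

    # Working memory: the agent's current state -- the active prompt,
    # intermediate "thoughts" (e.g., chain-of-thought text), step variables.
    working: dict = field(default_factory=dict)

    # Episodic memory: a log of past interactions the agent can revisit.
    episodes: list[str] = field(default_factory=list)

    # Semantic memory: a store of facts or documents, e.g. a RAG-style index.
    facts: list[str] = field(default_factory=list)

    # Procedural memory isn't a data field at all: it's the LLM's weights
    # plus the agent code itself (including methods like the one below).

    def retrieve_facts(self, query: str, k: int = 3) -> list[str]:
        # Placeholder retrieval; a real system would use embeddings or a
        # search index rather than substring matching.
        return [f for f in self.facts if query.lower() in f.lower()][:k]
```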
(Of course, there are many different ways these system components could be operationalized. For instance, RAG is only one (limited) way of thinking about semantic memory—papers like this one suggest connecting language models to external memory systems based on theories of brain function.)
The authors suggest that the CoALA framework is helpful both for evaluating existing “language agents” and for designing new systems. I think they’re right. As I argued above, it’s helpful to be as explicit as possible about how the components of a system fit together, particularly if we’re interested in making claims about what that system can or can’t do.
Finally, the authors raise a number of both conceptual and practical questions. How much does it matter whether an agent’s environment is digital or physical? How does the advent of more powerful LLMs, or simply different architectures, change the necessity of auxiliary components? And what exactly is the boundary between the agent and its environment?
It’s this last question I want to turn to next.
The extended—and distributed—LLM
I’ve been describing vanilla LLMs and LLM-equipped software systems as if they’re coherent things with well-defined boundaries. For vanilla LLMs, maybe that’s fair: we can define a vanilla LLM to be the components involved in tokenizing a string, passing the tokens through a bunch of weight matrices, and deriving predictions about the next token. It’s trickier with LLM-equipped software systems: if an LLM has access to a search engine, is the set of web pages indexed by the search engine “part” of that system—or does the system include only some other, more explicit memory store? Similarly, if a system has access to a Python shell, is all of Python now “part” of the system or is it more akin to the environment with which the system interacts? And what even is the difference between the two?
This question is not unique to LLMs: it’s also an old debate about the boundaries of human cognition and selfhood. What defines the boundaries of the human mind? Is it the boundaries of the brain or the entirety of the nervous system? What about other parts of our body—or even tools we interact with frequently?
The argument that the mind extends beyond the brain and body is sometimes called the extended mind thesis. This thesis was most clearly articulated in its current form by philosophers Andy Clark and David Chalmers in a 1998 article called “The Extended Mind”. Through a series of thought experiments, the authors challenge the common assumption that the boundaries of the mind are defined by “skin and skull”.
Perhaps the most famous of these thought experiments is the case of “Otto’s notebook”. Typically, we think of memories as being encoded in the brain, and of retrieving these memories as akin to accessing some kind of internal memory store. For example, suppose Inga hears about a new museum exhibit and decides she wants to see it—after thinking for a moment, she remembers that the museum is located on 53rd street and walks in that direction. This process of “remembering” most likely involves calling up some kind of knowledge that’s “stored” (for lack of a better word) in her brain (perhaps her hippocampus). Now consider the case of Otto, who has a very poor memory and must rely on the aid of external tools. Otto writes down everything he learns in a little notebook, which he carries everywhere he goes. When he hears about the museum exhibit, he decides he wants to see it too—and so he consults his notebook to determine where the museum is located.
Both Inga and Otto receive the same “input” and ultimately take the same action: heading to the museum. In a certain sense, they even access the same kind of information: namely, the location of the museum. The only difference—at least according to Clark and Chalmers—is how and where that information is represented, and thus how exactly Inga and Otto retrieve it. But functionally, Otto’s notebook plays much the same role as the part of Inga’s brain responsible for storing memories. Is it fair, then, to say that Otto’s notebook is “part” of Otto’s mind? Clark and Chalmers submit that it is, putting forth an essentially functionalist thesis about the nature of cognition and memory (pg. 14):
The moral is that when it comes to belief, there is nothing sacred about skull and skin. What makes some information count as a belief is the role it plays, and there is no reason why the relevant role can be played only from inside the body.
The authors argue that critics of this view must defend the opposing proposition that there is some importantly relevant distinction between how Inga and Otto access information about the museum’s location. For instance, someone might argue that Inga’s access to memory is more reliable, and that the potential unreliability of Otto’s notebook means that it cannot be included in our definition of Otto’s mind. But of course, memories represented within the skin-skull barrier are not always reliable either—and in a deeper sense, even an unreliable memory or faulty belief is still part of someone’s mind.
Accepting that Otto’s notebook is part of his mind opens the door to all sorts of other questions. In the age of ubiquitous smart phones and access to Google search, do the boundaries of our minds include the Internet? For that matter, do our minds include ChatGPT? There’s a real concern about “cognitive bloat” here: at what point can we safely draw a line such that the entire universe is not included in our minds? Most things, after all, are causally connected to most other things in some way. And we use plenty of tools—and coordinate our actions with plenty of other people—as we go about our lives. Is it fair to say these tools, or even these people, are part of our mind?
There’s no easy answer here. Some people default to the skin/skull barrier: a defensible response. But at least in my view, it’s helpful to switch the emphasis from “what is part of my mind?” to “what is part of a given cognitive system?” When I use a computer, it’s perhaps more accurate to say that a new, coupled cognitive system has been created. That system is defined by a time, a place, and the particular causal relations between its components, which in this case might include my brain, body, and the computer. We might even be more precise: for the specific cognitive operations in question, the cognitive system need not include every part of the computer—or for that matter, every part of my body. This perspective is very much informed by the extended mind thesis, but it’s also informed by the theory of distributed cognition, which originated here at UC San Diego: the notion that we can usefully describe distributed sociotechnical systems (like a bunch of people navigating a ship, along with the ship itself) through the lens of Cognitive Science.
This notion of a distributed cognitive system might seem unintuitive, especially because we tend to conflate cognition with consciousness. But I’m not asserting that the set of people working together on a ship produce an emergent consciousness (even if that might well be true). Rather, the claim is that it’s useful and interesting to describe their collective behavior as a cognitive system.
Not everyone will be on board with my argument here. But I think (or hope) it’s helpful for conceptualizing the LLM case we started out this section with. The key insight is this: if we want to understand how a system produces some behavior, we need to map out which components of the system appear to be relevant for that behavior. Definitionally, it seems weird to say the Internet is “part” of an LLM-equipped software tool. But maybe it seems less weird (maybe not—you tell me!) to refer to the distributed system composed of the LLM, the set of API calls used by the LLM, and at minimum the web pages accessed in a given operation. And an important part of doing this right will involve using careful, precise terminology.
Hey Sean, I've been talking about something similar with teachers in my CRAFT Program, but in a different way and (I think) for different reasons.
I talk about "grading the chats," and I show teachers that the type of bot you choose dramatically changes the nature of the chat - or the interaction - itself, which subsequently changes your grading rubric and the way that you approach the evaluation.
I also call them Vanilla LLMs, but when it comes to "Custom GPTs," or LLMs that are equipped with a software API, as you describe, I lean towards calling them "personality bots."
Consider -- When I add a capability to an entity that already acts "like" a human being, couldn't you say I am giving it a character trait? The LLM that is attached to Python is "Coder Bot" (or whatever). In human terms, "coder" is a word that describes a human being's skillset, hobbies, or identity. A "Contrarian Bot," which is not necessarily attached to another piece of software -- but has been adjusted in some way -- is also a human-mimicking bot that has been "given" a personality trait. "Contrarian" is an adjective that we use to describe our uncle at Thanksgiving dinner who is just dying to have an argument. It's who he is (identity).
What say you? This links to a deep discussion about whether or not to anthropomorphize AI, which is a debate I have been having with Rob Nelson from the AI Log for some time now. To me, that's the foundational question - and everything else flows out from it. Anyway, would love to hear your thoughts as I agree with your premise overall absolutely but come at it from a different perspective.