In my day job, my research involves large language models (LLMs) and how they work: a discipline I’ve called “LLM-ology”. I also teach classes at UC San Diego on the intersection of LLMs and Cognitive Science. Because of this, students and other researchers often reach out wondering how to get involved in LLM-ology. I try to meet with as many of them as I can, but I realized it would be useful to have a more general guide for anyone interested in this area. This was also a winning topic in my last poll (tied with an explainer on mechanistic interpretability, which I wrote and published last month).
This post is intended to be an entry point into LLM-ology. It’s by no means comprehensive, and it’ll necessarily be influenced by my own idiosyncratic tastes, but my hope is that it’ll provide useful summaries and links that people can follow to start building up their own knowledge on the subject. You can think of it as a kind of syllabus, composed of curated readings on specific topics along with commentary on why those readings are important for the field.
I’ve structured this post as follows:
The “basics”: what to know about LLMs.
“Capabilities”, or the problem(s) with benchmarks.
What is LLM-ology?
Philosophy of LLMs.
Understanding LLMs through their behavior.
Understanding LLMs through their internal mechanisms.
Challenges and the road ahead.
Three final notes. First, this guide is a work in progress, and may well be updated in the future—if you have questions or suggestions, feel free to reach out and I can try to address them. Second, even though the sections are in some sense “ordered”, don’t feel pressured to read everything in one section before progressing to the next: I certainly didn’t learn these things linearly, and I suspect you’ll learn most effectively if you get what you can from a given section and then move on to what interests you the most—knowing that you can circle back for more details if and when you need them. And third, try not to feel too intimidated or frustrated if things don’t make sense right away; this stuff can be very complicated, and at least for me, understanding it takes time—there’s much I’m still learning, and I’ve been working in this area for years at this point.
The basics: large language models
In order to study LLM-ology, you’ll need to know what large language models (LLMs) are and how they’re built. That doesn’t mean you have to know how to build them from scratch, but you should have a working vocabulary that includes an understanding of their training objective, dominant architectures, and key concepts like “embeddings” and “self-attention”. The amount you absolutely need to know about these things will depend in turn on exactly what you want to do, but in my view, the details are often handy.
Large language models (LLMs) are computational systems trained to predict words (or “tokens”) from their context. For example, an LLM might be presented with the following string:
A language model is trained to predict the next ____
By observing lots of strings like this one, an LLM will hopefully learn to make better predictions about the missing word. Here, “word” is a more likely completion than something like “house” or “dog”, so an LLM should assign that token a higher probability.
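To make this a bit more concrete, here’s a minimal sketch (my own example, not taken from any of the readings below) of how you might query an open model like GPT-2 for next-token probabilities, assuming the HuggingFace transformers library:

```python
# A minimal sketch: ask GPT-2 which token it thinks comes next.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A language model is trained to predict the next"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, vocab_size)

# Turn the final position's logits into a probability distribution over the vocabulary.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Compare a few candidate completions; "word" should come out ahead.
for candidate in [" word", " house", " dog"]:
    token_id = tokenizer.encode(candidate)[0]
    print(f"{candidate!r}: {next_token_probs[token_id].item():.4f}")
```

If you run something like this, you should see the model assign noticeably more probability to “ word” than to the other candidates.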
But many readers probably already know, on some level, that LLMs are “trained to predict the next word”. The harder part is learning how this actually works in practice. To understand that, you’ll need to do some reading (or watching).
Suggested materials
For a thorough understanding, it might be helpful to start with non-technical “explainers”, then progressively work your way through more technical, academic papers.
For the non-technical side, I can recommend this piece I co-authored with Timothy Lee of Understanding AI, which is intended to be a gentle but thorough introduction (including concepts like embeddings and self-attention). If you prefer visual explanations, I recommend this video by 3Blue1Brown on “Attention in Transformers, Visually Explained”. The channel also has helpful videos on lots of other topics, ranging from the central limit theorem to the basics of GPT. I also really like “The Illustrated GPT-2” by Jay Alammar.
Once you’ve acquired a basic vocabulary, you can deepen your understanding with some chapters from “the” textbook on Speech and Language Processing by Dan Jurafsky and James Martin. It’s an excellent textbook. It obviously wouldn’t hurt to read the whole thing, but the key chapters (in my view) are:
Chapter 3 on N-gram models: This will give you a working knowledge of more “traditional” language modeling approaches.
Chapter 6 on embeddings: This chapter covers everything from tf-idf to the more powerful “neural” embeddings that power modern LLMs; thinking about words and meanings as vectors in some high-dimensional space is really central to understanding how LLMs work.
Chapter 7 on neural networks: This chapter provides a history of neural networks, including concepts like perceptrons and feedforward language models.
Chapter 9 on RNNs and LSTMs: This chapter motivates and explains “recurrent” models like RNNs and LSTMs; although these are no longer dominant architectures (see below), I think they are very elegant, and there is, in fact, growing interest in newer architectures with recurrence, such as “Mamba”.
Chapter 10 on Transformers: Transformers are now the dominant architecture for language models (and are even sometimes used for other modalities, like vision); in some ways, they represent a step back from RNNs (they throw out recurrence and the hidden state, requiring a fixed context window), but in other ways, they’ve clearly been a big step forward (self-attention being the key innovation). I especially recommend Figure 10.3 (a graphical illustration of self-attention), as well as Figure 10.6 (a depiction of the “transformer block”); a bare-bones sketch of self-attention also follows this list.
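As promised, here’s that bare-bones sketch of (single-head, unmasked) scaled dot-product self-attention, written to mirror the operation described in Chapter 10; it’s my own illustrative toy, not code from the textbook:

```python
# Toy single-head self-attention: every token's new representation is a
# weighted average of all tokens' "value" vectors, with weights determined
# by query-key similarity. (Decoder-only LLMs add a causal mask so that
# tokens can only attend to earlier positions.)
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                  # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    return weights @ V                                       # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                  # 4 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))  # 4-dim attention head
print(self_attention(X, W_q, W_k, W_v).shape)                # (4, 4): one vector per token
```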
If you want to read the primary source material for some of these concepts, I recommend: the original Mikolov et al. (2013) paper on word2vec (for more on embeddings); Jeff Elman’s “Finding Structure in Time” (for theoretical background on RNNs); and of course, Vaswani et al. (2017) “Attention is All You Need” (for more on attention).
Additionally, many well-known models (like GPT-4) are subject to further training, such as “reinforcement learning from human feedback” (RLHF). RLHF plays an important role in updating model weights and accordingly changing their behavior, so if you want to study models like GPT-4, it’s good to know the basics of how it works. This HuggingFace explainer on RLHF is a nice place to start.
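To give a flavor of the mechanics, here’s a tiny hypothetical sketch (mine, not the HuggingFace explainer’s) of the pairwise objective typically used to train the reward model at the heart of RLHF: the reward model should score the human-preferred response above the rejected one.

```python
# Bradley-Terry-style preference loss used for reward-model training in RLHF:
# loss = -log sigmoid(r_chosen - r_rejected), averaged over preference pairs.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for two preference pairs (higher = better, per the reward model).
print(reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])).item())
```

The trained reward model is then used as the optimization target for a reinforcement learning step (e.g., PPO) that actually updates the language model’s weights.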
Finally, all language models rely on some form of tokenization: segmenting the input strings into discrete chunks (or “tokens”). There are different tokenization methods, and the implications of these methods for LLM behavior—particularly across languages with different structure and writing systems—are still not well understood. Again, HuggingFace has a good tutorial on the basics of tokenization, and I’ve also written an explainer.
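As a quick illustration (my own, using the standard transformers API rather than anything from those tutorials), you can see how two different tokenizers segment the same string quite differently:

```python
# Compare how a byte-level BPE tokenizer (GPT-2) and a WordPiece tokenizer (BERT)
# split the same string into tokens.
from transformers import AutoTokenizer

text = "Tokenization affects multilingual behavior."
for name in ["gpt2", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(text))
```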
“Capabilities”, or the problem with benchmarks
Within the field of Natural Language Processing (NLP) and Artificial Intelligence (AI) more generally, a central goal is measuring the capabilities of systems like LLMs. It’s a natural question: we want to know what LLMs can and can’t do.
Here, “capability” can mean a lot of things, from “writing code” to “analogical reasoning”. As I’ve written before, many of these capabilities (like “reasoning”) are pretty abstract, which means we have to find a concrete way to operationalize them (you can’t escape construct validity). Typically, researchers develop tasks, which are intended as a kind of proxy for some higher-level capability: for example, the false belief task is often used to assess Theory of Mind, including for LLMs (though there are well-documented issues with it). A collection of these tasks is called a benchmark, and there are many such collections in NLP—often with somewhat strange names: GLUE, SuperGLUE, HellaSwag, and more.
Building and validating benchmarks is an important part of NLP research. However, there are also major challenges involved, including fundamental philosophical questions:
What exactly is the relationship between a task (or set of tasks) and the underlying capability we’re trying to measure? How can we be sure that performance is really attributable to the capability and not just a “clever Hans” effect?
What level of performance is sufficient to say an LLM has a given capability? If there are 10 tasks intended to measure “reasoning”, how well must the LLM do on how many tasks? (You can think of this as a variant of the Sorites paradox.)
How representative are the tasks of the broader capability, particularly as it manifests “in the wild”? And what theoretical and empirical tools do we have to make this determination?
These are really hard questions, and I don’t have the answers. I think measuring LLM capabilities is one of the biggest challenges facing NLP. People might disagree on whether constructing benchmarks should count as LLM-ology—as a methodological pluralist I’m inclined to include it—but I definitely think the thoughtful evaluation of those benchmarks is an important part of LLM-ology. That’s why the section below focuses on work interrogating the utility and representativeness of existing benchmarks.
Suggested materials
One of my favorite papers on benchmarks is “AI and the Everything in the Whole Wide World Benchmark” (Raji et al., 2021). The paper offers a really valuable critique of existing benchmarks, including their limited scope and biased data. It also suggests a shift away from trying to benchmark “general capabilities” (like language understanding). Instead, the authors urge researchers to focus on other methods of evaluation, such as error analysis (figuring out where a system’s performance goes wrong), ablation testing (“knocking out” parts of a system to better understand how it works), and measurement of other model properties (such as energy usage or memory requirements). I largely agree with the authors’ suggestions—in fact, I view LLM-ology as trying to do exactly what they argue we should do, as informed by theories and methods from Cognitive Science.
Another, more recent preprint called “Benchmarks as Microscopes: A Call for Model Metrology” (Saxon et al., 2024) makes the case for developing more targeted, practical benchmarks. The authors argue that current static benchmarks don’t reflect dynamic, real-world conditions, and that a new field of “Metrology” should focus on building evaluation methods that actually measure LLM performance on concrete tasks in ecologically valid settings. I think it’s a good argument that can be seen as complementary to the Raji et al. (2021) paper.
Finally, a 2023 preprint by Anna Ivanova (“Running cognitive evaluations on large language models: the dos and the don’ts”) approaches the question from a different perspective: what if we repurposed tests originally developed for humans to study LLMs? It offers some helpful guidelines for doing so, including common pitfalls (e.g., don’t assume models solve the test in the same way as humans). This is probably the closest to what I think of as “LLM-ology”, and the paper is accessible even for someone with limited background in LLMs.
It also leads nicely into the next part of this guide, which is defining LLM-ology and diving into the details.
What is LLM-ology?
In a previous post, I defined LLM-ology as follows:
LLM-ology: a research program that studies how Large Language Models (LLMs) work “under the hood”. LLM-ology is distinguished by its focus on the internal mechanisms of LLMs (i.e., which computations underlie the transformation of input to output) as well as the comparison to humans or human populations (i.e., administering classic psychological tasks to an LLM).
If that definition sounds a lot like Cognitive Science, that’s not an accident: I’m a cognitive scientist first and foremost, and my hope is that Cognitive Science can provide insight into the study of LLMs.
In that same post, I went on to provide definitions of more specific disciplines or subfields:
Behaviorist LLM-ology: analyzes LLM outputs (e.g., predicted text or probabilities assigned to upcoming tokens) in response to particular inputs with the goal of establishing overall system capabilities and limitations. In the ideal case, this is done with reference to some underlying theoretical framework, using carefully curated stimuli that allow the researcher to make inferences about what capabilities could in theory allow for specific behavioral responses. An analogy from Cognitive Science would be the field of Cognitive Psychology, and an example would be this recent PNAS paper, which is conveniently titled: “Using cognitive psychology to understand GPT-3”.
Internalist LLM-ology: analyzes LLM internal states (e.g., activations across units in a given layer) in response to particular inputs with the goal of establishing mechanistic explanations of how information passes through the system. An analogy from Cognitive Science would be the field of Neuroscience. Some of this work is inspired by work analyzing the mechanics of visual neural networks, e.g., Chris Olah’s work on interpretability. An example of this approach would be the Tenney et al. (2019) paper mentioned below, as well as this more recent paper by Atticus Geiger and others probing language models for causal abstractions.
These definitions already provide some links to relevant papers, but the sections below will go into considerably more detail. They’ll also (hopefully) illustrate why a field like LLM-ology is useful, and they’ll cover additional sub-fields like the philosophy of LLMs.
Philosophy of LLMs
Philosophy is an important part of Cognitive Science, and it’s also an important part of LLM-ology. Philosophers were asking questions (and challenging intuitions) about the mind long before LLMs came around, and many of those questions are extremely pertinent to current debates about LLMs. For example:
What constitutes the meaning of a word or a sentence?
What kind of thing can understand language? For that matter, what kind of thing has a “mind”?
What is a belief and what are the conditions for saying that something has a belief?
It’s not hard to see how these questions apply to LLMs. Regardless of your intuitions on what kind of thing an LLM is, it’s helpful to ground those intuitions in previous debates—if only to recognize that certain lines of argument have been made before.
Philosophy is also concerned with the nature of explanation. What counts as a good scientific explanation? This is something I think about a lot with LLM-ology. What is the most useful (and accurate) lens with which to study LLMs, and what would constitute a satisfying account of their behavior? One’s answer probably depends in part on how one construes an LLM and what level of analysis seems most appropriate for understanding computational systems.
Suggested materials
I highly recommend reading the two-part “Philosophical Introduction to Large Language Models” by Raphaël Millière and Cameron Buckner (part I, part II). The papers are rigorous but also accessible, and both contain “glossaries” with helpful definitions and examples of key terms (e.g., “Blockhead” thought experiment, grokking, etc.). If you’re looking for a thorough overview of both the philosophical literature and a good introduction to LLMs themselves, this is a great place to start.
That introduction references a bunch of classic philosophy papers that it’s helpful to be familiar with. I’m not a philosophy expert, but I am a fan of reading the source material when time and resources allow, so I’d recommend actually reading Searle’s “Minds, brains, and programs”, Ned Block’s “Psychologism and Behaviorism”, and Daniel Dennett’s “Real Patterns”.
In terms of more recent work, Bender & Koller (2020) has had a big impact on the field. You can think of the paper as a modern update on the symbol grounding problem, applied specifically to LLMs, with recommendations and critiques about the way that the NLP field sees “progress”. An even more recent paper is Linzen & Mandelkern (2024), which asks whether LLMs’ words refer. It’s an excellent paper and, like the Millière and Buckner (2024) introductions, also does a good job of situating the reader within classic philosophical debates. Finally, Murray Shanahan has written a lot about LLMs, but a good place to start is his recent “Talking about LLMs”.
Understanding LLMs through their behavior
One way to study a system (like an LLM) is to analyze its behavior. In the case of LLMs, “behavior” could include the probabilities assigned to upcoming tokens in context or the sampled tokens themselves. Clearly, the latter are a function of the former—but in some cases, it might be more practical to study, say, a sequence of 100 tokens actually generated by the model rather than the relative joint probabilities assigned to all possible sequences of length 100. Crucially, the study of LLM behavior isn’t just arbitrary: the goal—as it is in Cognitive Science—is to craft situations (i.e., experiments) in which an LLM’s behavior tells us something useful about how it actually works. Ideally, “how it actually works” should be grounded with respect to some underlying theory or posited capability: questions about an LLM’s grammatical knowledge, say, or its ability to track the mental states of characters in a story. This approach is closest to the one I take in my own research.
Previously, I referred to this approach as “Behaviorist LLM-ology”, but I now think that’s overly narrow and connotes work in the Behaviorist tradition specifically. I’m not opposed to Behaviorism, but I’m also supportive of work that uses the behavior of a system to make inferences about the representations and processes producing that behavior (i.e., what is sometimes called Cognitivism).
For example, we might present a language model with grammatical (“the keys are on the table”) and ungrammatical (“the keys *is on the table”) sentences, then ask whether the model assigns higher probability to the grammatical strings than the ungrammatical ones. If it does—and if we’ve done a careful job controlling our stimuli for other confounds—then we might infer that the model has acquired basic syntactic knowledge (e.g., subject-verb agreement). In fact, this example is from a 2016 paper by Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg, which I think is a great model of the kind of work I’m imagining here.
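Here’s what the basic logic might look like in code (a minimal sketch using GPT-2 and whole-sentence probabilities; note that this is my own illustration, not the exact setup from the Linzen et al. paper, which compared probabilities of the competing verb forms):

```python
# Does the model assign higher probability to the grammatical sentence?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log probability of each token given the preceding tokens, summed over the sentence.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).sum().item()

print(sentence_logprob("The keys are on the table.") >
      sentence_logprob("The keys is on the table."))  # expect True if agreement was learned
```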
Another approach is to explicitly compare LLM behaviors in these contexts to measures of human behavior, such as how quickly people read a word or sentence. The extent to which LLM behavior on an experimental task predicts human behavior on that task is sometimes called psychometric predictive power (PPP), and there’s now a large body of research using varieties of this paradigm. The advantage is twofold: first, PPP is a direct index of the “humanlike-ness” of LLM behaviors; and second, it quantifies the extent to which human behavior can be predicted by a statistical model trained on linguistic input alone (a big question in Cognitive Science).
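The standard ingredient in these PPP studies is per-token surprisal, i.e., -log p(token | context), which is then used to predict human reading times. A rough sketch (again my own, with GPT-2 standing in for whichever model you’d actually evaluate):

```python
# Compute per-token surprisal (in bits) for a classic garden-path sentence.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The horse raced past the barn fell.", return_tensors="pt").input_ids
with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)

for position, token_id in enumerate(ids[0, 1:].tolist()):
    surprisal = -log_probs[position, token_id].item() / math.log(2)
    print(f"{tokenizer.decode([token_id]):>10}  {surprisal:.2f}")
```

These surprisal values would then be entered into a regression predicting reading times (or some other behavioral measure) to estimate the model’s psychometric predictive power.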
Suggested materials
As I mentioned above, there’s a huge literature that I could include here. I write about a lot of this research on the Counterfactual, and a useful starting place might be my post on whether LLMs have a Theory of Mind, since it covers both the methodological approach and the various debates about interpretation of results.
In terms of actual academic papers, I think the Linzen et al. (2016) paper I referenced earlier is an excellent example of behavior-focused LLM-ology. This paper used LSTMs, an older architecture that is much less performant than the models we use now; it also studied subject-verb agreement, a phenomenon that might be seen as relatively “simple” compared to something like Theory of Mind. Yet that’s exactly why I like using this paper as an example: it shows (in my view) that the research methods I’m in support of apply equally well to models of varying quality and to phenomena of varying complexity.
If you’re looking for an overview of what we know about LLM behavior, this review paper by Tyler Chang and Ben Bergen (2024) is very thorough. It covers topics like syntax, semantics, pragmatics, and commonsense reasoning, and also discusses more concerning LLM behaviors like toxicity or bias.
Finally, here’s a quick list of other papers to send you on potentially fruitful directions: Binz & Schulz (2023) for an analysis of LLM behavior on common psychology tasks; Jones et al. (2024) for an analysis of LLM behavior on several Theory of Mind tasks specifically, and Shapira et al. (2023) for a fine-grained error analysis on Theory of Mind tasks; Turpin et al. (2024) for a creative and insightful analysis of whether “chain-of-thought reasoning” leads to faithful explanations of LLM behavior; and Kuribayashi et al. (2022) for a comparison of psychometric predictive power across various model sizes and languages.
Understanding LLMs through their internal mechanisms
A complementary route to understanding how LLMs work is to study their internal mechanisms. Here, the ultimate goal is to describe how the configuration of an LLM—including its training data, its architecture, and even specific “circuits” within the network—leads to categories of behavior. If understanding LLMs through their behavior is roughly analogous to cognitive psychology, then this subfield is roughly analogous to neuroanatomy and neurophysiology, which (respectively) seek to understand the structure and function of the nervous system. Alternatively, if we’re thinking in terms of “levels of analysis”, this approach is closer to what David Marr called the implementational level of analysis—i.e., focusing on the mechanics of the substrate—whereas the behavior-focused approach would fall under the computational or representational levels of analysis. This subfield, then, is basically mechanistic interpretability.
LLM-ologists might analyze the activation of units in different layers in response to different kinds of input, study the representational geometry of those activations, assess the contribution of different attention heads to the final output of a given layer, and much more. Here, LLM-ologists have an advantage over neuroscientists (particularly those interested in the human brain): in principle, an LLM-ologist has much more information about the system they’re trying to understand. Assuming they’re working with an open-source model, they know the training objective, the training data, and the value of each and every parameter. They can also perform causal interventions much more easily.
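For a sense of what that raw material looks like in practice, here’s a minimal sketch (my own, assuming the standard transformers API and an open model like GPT-2) of pulling out every layer’s hidden states and attention weights:

```python
# Extract layer-by-layer hidden states and per-head attention weights from GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The keys are on the table.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

print(len(outputs.hidden_states))       # 13: the embedding layer plus 12 transformer layers
print(outputs.hidden_states[-1].shape)  # (1, num_tokens, 768): activations you might probe
print(outputs.attentions[0].shape)      # (1, 12, num_tokens, num_tokens): attention per head
```

From here, you could train classifier probes on the hidden states, inspect the attention patterns, or patch in activations from a different input to test causal hypotheses.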
Suggested materials
Last month, I wrote an explainer on mechanistic interpretability (MI), which also provided motivation for the field itself and an overview of three key tools in the MI toolbox; my hope is that this is a good place to start. Another accessible post that helps motivate this approach is “Interpretability Dreams” by Chris Olah, which emphasizes the AI safety perspective.
There are also a few academic review papers of mechanistic interpretability, which give a history of the field and discuss the main techniques and research questions. I used this primer by Javier Ferrando and co-authors in creating my explainer, and this very recent review (Mueller et al., 2024) discusses the field from a more theoretical perspective.
In terms of individual contributions, the following seem representative to me of the field’s trajectory and current suite of methods: this classic paper by Tenney et al. (2019) on BERT (one of the original “BERTology” papers), which makes use of classifier probes; this paper on “Locating and Editing Factual Associations in GPT” (Meng et al., 2023), which makes use of patching techniques; and this 2023 report by Anthropic using sparse auto-encoders.
Of course, I’m not an expert in mechanistic interpretability, and there’s also no way I could include every relevant paper. But Neel Nanda is an expert, and he’s fortunately written an “extremely opinionated annotated list” of his favorite MI papers. I won’t pretend to have read all the papers there, but Nanda includes helpful annotations summarizing what he thinks is important about each one.
Challenges, and the road ahead
As I’ve tried to emphasize, LLM-ology is in its infancy. We’re still figuring out what research questions we ought to be asking and what methods we ought to use to answer those questions. There are also major challenges—both theoretical and applied—that (more optimistically) can be seen as opportunities for real contributions. Thus, to close this post, I want to briefly discuss a few challenges that I think are facing the field.
One central challenge, which I’ve written about a number of times before (here and here), is the question of generalizability. When we study a given LLM—GPT-2, say—to what extent can we generalize the results we obtain to other LLMs? This is a question that faces cognitive scientists too: when we study English-speaking college students in 2024, how much can we generalize to other groups of humans? With LLMs, the task is in some ways even more complicated. LLMs can vary in their architecture (e.g., bidirectional vs. auto-regressive transformers; or transformers vs. LSTMs), source of training data (e.g., Wikipedia vs. Reddit vs. conversation transcripts), language and language variety (e.g., English vs. Spanish; or American English vs. British English), size (e.g., 100M vs. 1T parameters), training duration, and more. Many LLMs are also now multimodal (e.g., trained on both text and images), and there’s further variation in how multimodal architectures are designed. A given LLM trained on a given dataset containing a given language variety represents a tiny fraction of the possible space of LLM configurations: what’s the broadest population we can make generalizations about from that sample? As far as I’m concerned, this is an unanswered question, and it affects everything from research on LLM capabilities to mechanistic interpretability.
Another set of challenges concerns the societal implications or effects of LLMs. Due in part to biases in their training data—but also to the circumstances around how they are built and deployed—LLMs may well have serious negative impacts on people and institutions. A landmark paper on this topic is Bender et al. (2021), which also introduces the “stochastic parrots” term. LLM-ologists ought to take these ethical concerns seriously and perhaps even build tools to help mitigate them. For example, this 2023 EMNLP paper (Ma et al., 2023) developed a framework to identify and reduce harmful stereotypes and biases in pre-trained language models.
A related challenge is that many of the best models (like GPT-4) are “closed-source”, meaning they are built and maintained by companies that don’t allow access to the model internals. This creates a few problems. First, without knowing what the model was trained on, we can’t ask questions about how the type or amount of training data leads to certain behaviors; it’s also hard to ask questions about model capabilities because of the possibility of data contamination—i.e., the training data might’ve contained the answers to the tests we’re using. Second, mechanistic interpretability is out of the question. End users access GPT-4 through an API, which is helpful and easy to use, but they can’t even access the full probability distributions that GPT-4 assigns to words in context (or the underlying “logits”)—let alone all of the hidden states and weight matrices constituting the model itself. And third, because companies like OpenAI sometimes update the models or even deprecate older models, closed models also pose a big problem for scientific reproducibility. Altogether, then, LLM-ologists have to navigate the trade-off between using open (but less powerful) models and using closed (but more performant) models. I tend to think we should use open models where we can, but it’s not always a straightforward decision—especially when the research question is about figuring out the upper bound on performance.
Finally, what counts as a “large language model” is by no means static. The language models of 2024 look a little different from the language models of 2022, which are certainly different from the language models of 2016 (i.e., before the transformer architecture). This raises further challenges for generalizability: even if we studied all available models from the current year, we might see new models that exhibit different behavior in the following year. We might even see the adoption of new architectures, which in turn could require changes to the tools we use and the questions we want to ask. It’s hard to study a moving target.
I’m not raising these challenges to discourage potential LLM-ologists out there. But they are very real challenges, so LLM-ologists will need to be comfortable working on questions we don’t know the answers to, and which may not even have satisfying answers in the end—something that’s common in science, but which can also be a source of understandable frustration. Ultimately, we need more eyes on the problem: my hope is that this guide will inspire readers to pursue one or more of the threads I’ve outlined here, whether that’s asking careful philosophical questions about LLMs, characterizing their internal dynamics, or conducting rigorous experiments to better understand their behavior.