AI (mis)alignment, Waluigi, and the Knobe Effect
Why it might be easier for LLMs to be bad than good.
This essay is about why it may be easier for Large Language Models (LLMs) to “go bad” than to be good—a phenomenon called the Waluigi Effect.
This effect is still speculative and poorly understood. As I’ve written previously, LLMs themselves are also largely “black boxes”; as a consequence, explanations for their behavior sometimes involve attributing agentivity (“The LLM wants to do X”) or invoking metaphors (e.g., “summoning persona X”).
Ultimately, my goal is to present a hypothesis as to why the Waluigi Effect occurs, grounded in a well-known effect in social psychology. I see this explanation as:
A hypothesis: it remains to be tested empirically.
Conceptually compatible with another explanation I’ll be referencing throughout the post.
Those already familiar with the Waluigi Effect should feel free to skip the first two sections of the essay.
The alignment problem, in brief
Artificial Intelligence systems don’t always do what we want them to do. This divergence between our goals and what a given system has learned to do is sometimes called misalignment—or more generally, the alignment problem.1
Many recent high-profile cases of “misalignment” have revolved around Large Language Models (LLMs) like ChatGPT. These chat-based models sometimes make up information (a tendency called “hallucination”), produce toxic or hateful language, or in at least one example, prompt their interlocutors to leave their spouse.
One hope is that clever techniques, like prompt engineering,2 can reduce these bad behaviors. Prompt engineering involves prompting an LLM to adopt a particular “persona”3. For example:
You are a helpful, harmless, and knowledgeable assistant. When a question is asked of you, you try your best to provide an answer grounded in truth. If there is no such answer, you say, “I don’t know”, rather than making something up.
Somewhat remarkably, this simple approach does appear to make a positive impact, at least in many cases.
To me, the most plausible interpretation of why this seems to work lies in what, exactly, LLMs are trained to do. Their training objective is next-token prediction. Next-token prediction, on its own, doesn’t necessarily lead an LLM to a correct model of the world, but it may involve learning and constructing some kind of representation that corresponds to structure in reality—whatever is learnable from language.
Of course, many things that are written or said are false, which means that some of an LLM’s predicted tokens are likely to be false as well. But prompting an LLM as above may “steer” this predictive model towards responses that are more accurate (or less toxic, etc.), if only because accurate responses are statistically more likely to occur after a character in text is described as helpful, harmless, and knowledgeable.
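To make this concrete, here is a toy illustration of why describing a persona shifts next-token statistics. The mini-corpus and the "honest clerk" example are invented purely for illustration; real LLMs learn the same kind of contingency from vastly larger corpora.

```python
from collections import Counter

# Hypothetical mini-corpus: characters described as "honest" more often
# go on to say true things, and vice versa.
corpus = [
    "the honest clerk said the truth",
    "the honest clerk said the truth",
    "the honest clerk said a lie",
    "the dishonest clerk said a lie",
    "the dishonest clerk said a lie",
    "the dishonest clerk said the truth",
]

def next_word_dist(corpus, persona, context_word):
    """Empirical next-word distribution after `context_word`,
    restricted to sentences containing the `persona` token."""
    counts = Counter()
    for sent in corpus:
        words = sent.split()
        if persona in words:
            for w, nxt in zip(words, words[1:]):
                if w == context_word:
                    counts[nxt] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}
```

Conditioning on the persona word changes which continuation is most probable after "said"—a miniature version of how a system prompt can steer generation.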
Things, unfortunately, aren’t quite so simple. Clever prompting can be overcome (through equally clever “prompt injection” attacks)—and there are some who argue the enterprise may be counterproductive. That, in essence, is the core of the Waluigi Effect.
With every luigi comes a waluigi
I was first introduced to the comically named Waluigi Effect in this essay, which defines it as follows:
The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.
This effect is meant to explain why LLMs prompted to be “good” in particular ways sometimes, over the course of a conversation, can be easily prompted into displaying “bad” behaviors—in fact, behaviors that are the exact opposite of what the LLM was originally prompted to do. The author of the essay calls these two personas a “luigi” (the intended persona) and a “waluigi” (the opposite of the intended persona).
To use a metaphor that’s been floating around: if prompting is akin to “summoning”, then it’s difficult to avoid summoning a demon when you try to summon an angel.
Why does this occur?
The author of the essay gives a few explanations:
Rules normally exist in contexts in which they are broken.
When you spend many bits-of-optimisation locating a character, it only takes a few extra bits to specify their antipode.
There's a common trope in plots of protagonist vs antagonist.
All of these, as the author notes, are mutually compatible, and may just be different ways of saying the same thing: LLMs are “simulators”, and in the process of simulating whatever process is responsible for token generation, it’s easier—because of structure in the training data—for this simulation to “turn bad” than good.
I think these explanations all make sense and deserve to be taken seriously. My goal is not to dispute them but rather to offer another, compatible perspective, focusing on that core premise: what structure in the training data, exactly, makes it easier to go bad than good?
Waluigis and the Knobe effect: what happens when you simulate yourself?
This is the part of the essay where I present my hypothesis, which consists of a few core premises. Many of these premises are based on previous work; others follow naturally from it:
LLMs can be usefully thought of as “simulators”.
LLMs can be used to stochastically generate tokens, using text they’ve already generated as part of the prompt; this system (called “dynamic GPT”) can be usefully distinguished from the set of weights used to produce the next token (called “static GPT”). An LLM used in this way could be thought of as “simulating itself”.
Because of inherent stochasticity in the token-generation process, some generated tokens will be more or less consistent with the simulation in question. These out-of-sync tokens can be usefully thought of as “accidents”.
Humans display a social bias called the Knobe Effect, in which we are more likely to blame someone for behavior that leads to accidental harm than credit someone for behavior that leads to accidental benefits.
Because an LLM is trained on a corpus of human-generated text, it learns a number of human social biases, including the Knobe Effect.
This all leads to my conclusion: an LLM, as a simulator, is more likely to take on the “identity” of a “bad” token-generator than it is a “good” token-generator, because it has learned the Knobe Effect from its corpus.
Premise 1: LLMs are “simulators”
Above, I wrote that LLMs are “simulators”. This word comes from a now-famous post by janus, which argues that one of the best ways to understand what “kind” of thing an LLM is may be as a simulator.
Recall that LLMs are trained to perform next-token prediction. Through observing trillions of word tokens in sequence, their weights are adjusted in a way that allows them to make better predictions about which word is likely to come next, given the words that came before.
This fact is sometimes used to dismiss the capacities of an LLM—they are “just” predicting the next token, after all, as a kind of sophisticated auto-complete. I think this is both true and beside the point. Training a model to perform next-token prediction requires that model to learn statistical contingencies that improve its ability to predict future tokens. And so, while these models (at least prior to GPT-4) mostly don’t have anything like “world experience”, they may, over time, learn something like a world model: after all, language contains information about the world, and if your job is to predict upcoming words, it helps to abstract some of that information.
In principle, there’s little reason to restrict this to information about the world. In fact, language might contain even more information about the producers of that language. So again: an LLM trying to predict the next token might find it useful to model features of the “generator” of those tokens—a construct that may, in turn, apply to different people or personas.
Put another way: in its effort to reverse-engineer the token generation process, an LLM might find it useful to simulate features of that token generation process. That is, LLMs are simulators.4
Premise 2: The emergence of “dynamic GPT”
The phrase “Large Language Model” (LLM) is often used to refer to a neural network trained to predict the next token, given a large corpus of text. At its core, such a system is essentially “just” a big matrix of weights—which is part of why it’s hard for many people to believe that something like “intelligence” (or certainly sentience) is encoded in a static array of numbers.
But as this post mentions, there’s actually some ambiguity in how people use the term “LLM” (or “GPT”):
The autoregressive language model μ: Tᵏ → Δ(T), which maps a prompt x ∈ Tᵏ to a distribution over tokens μ(·|x) ∈ Δ(T).
The dynamic system that emerges from stochastically generating tokens using μ while also deleting the start token.
In my view, the latter system can be thought of as a distributed cognitive system: the interaction between that static set of weights and a given prompt. Applying this dynamic model recursively—i.e., using the model to generate the next token t, then using that new utterance to generate the next token t + 1—produces an emergent system with context-dependent behavior.
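A minimal sketch of this recursive loop follows. The toy bigram table is a hypothetical stand-in for “static GPT” (the fixed weights); the sampling loop that feeds generated tokens back into the context is the emergent “dynamic GPT”.

```python
import random

def static_model(context):
    """Stand-in for the fixed weights: a distribution over next tokens
    given the last token of the context (a toy bigram table)."""
    table = {
        "the": [("cat", 0.5), ("dog", 0.5)],
        "cat": [("sat", 1.0)],
        "dog": [("ran", 1.0)],
        "sat": [("down", 1.0)],
        "ran": [("away", 1.0)],
    }
    return table.get(context[-1], [("<end>", 1.0)])

def generate(prompt, max_tokens=5, seed=0):
    """The dynamic system: sample a token, append it to the context,
    and condition the next prediction on it."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_tokens):
        dist = static_model(context)
        tokens, probs = zip(*dist)
        tok = rng.choices(tokens, weights=probs)[0]
        if tok == "<end>":
            break
        context.append(tok)  # generated token becomes part of the next prompt
    return context
```

The static table never changes, yet the loop produces context-dependent trajectories—which is the distinction the two senses of “LLM” above are drawing.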
What, conceptually, is such a system doing?
Combined with premise 1 above, I’d argue that an LLM prompted in this way is essentially simulating itself. That is: because its generated tokens become part of the prompt for the next token, and because next-token prediction can be usefully conceptualized as “simulating” the token-generation process, such a system is trying to recursively simulate (or model) the generator of its previously generated tokens—that is, “itself”.
Premise 3: Accidents happen
Accidents happen, however: because of odd, possibly inscrutable statistical contingencies in which words are most associated with which other words, a model may assign high probability to a token that—from another perspective—is inconsistent with the token-generating process.
On some level, this doesn’t really make sense. A model’s predictions are, tautologically speaking, whatever it “thinks” are consistent with the token-generating process. My argument here thus rests on our independent judgments about some “objective” persona or region of state-space that a model is perceived to be occupying, and deviations from that region of state-space.
To use the example from the original Waluigi Effect post: imagine prompting a model to be staunchly “anti-croissant”. Such a model will, for the most part, produce tokens that are consistent with the region of state-space (or “persona”) that associates the word “croissant” with other, negatively-valenced words (“disgusting”, “dry”). But there also might be some probability of the model generating a word which, however slightly, raises the association between “croissant” and positively-valenced words (“tasty”, “delicious”).
Generalizing a bit from croissants, we might imagine that something similar could happen with a model prompted to be “helpful”. Such an effect might be especially problematic with models trained with negative objectives, e.g., “Don’t be toxic”. A negation often contains the thing it’s negating, e.g., “Be toxic”, which thus raises the salience of the word “toxic”. (Humans also fall prey to this problem: it’s hard, after all, to hear “Don’t think of an elephant” and not think of an elephant.)
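As a toy illustration (the probabilities below are invented for the example, not measured from any model), stochastic sampling from even a strongly “anti-croissant” distribution still occasionally emits off-persona tokens:

```python
import random

# Hypothetical next-token distribution for an "anti-croissant" persona.
# Off-persona (positively-valenced) tokens retain small but nonzero mass.
anti_croissant_dist = {
    "disgusting": 0.55,
    "dry": 0.35,
    "delicious": 0.07,  # off-persona "accidents"
    "tasty": 0.03,
}

def sample_tokens(dist, n, seed=0):
    rng = random.Random(seed)
    toks, probs = zip(*dist.items())
    return [rng.choices(toks, weights=probs)[0] for _ in range(n)]

samples = sample_tokens(anti_croissant_dist, 1000)
accident_rate = sum(t in ("delicious", "tasty") for t in samples) / len(samples)
```

With 10% of the mass on off-persona tokens, roughly one token in ten will be an “accident”—no adversarial prompting required.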
Importantly, as we’ll see next, my argument doesn’t rest on these deviations being inherently biased: it just requires “accidents” of some kind.
Premise 4: Interpretation of accidents is asymmetric
The Knobe Effect is a classic effect from social psychology, describing our tendency to assign blame for unintended harms, but not to assign credit for unintended benefits. In other words, our assignment of responsibility—and attribution of intent—is in some ways asymmetric.
This summary article describes the Knobe Effect as follows:
Joshua Knobe famously conducted several case studies in which he confronted survey subjects with a chairman who decides to start a new program in order to increase profits and by doing so brings about certain foreseen side effects. Depending on what the side effect is in the respective case, either harming or helping the environment, people gave asymmetric answers to the question as to whether or not the chairman brought about the side effect intentionally. Eighty-two percent of those subjects confronted with the harm scenario judged the chairman to have harmed the environment intentionally, but only 23 percent of the subjects confronted with the help scenario judged the chairman to have helped the environment intentionally (Knobe, 2003).
Let’s set aside the question of whether this bias is rational. The fact is that this bias seems to exist, and at least in some cases, is quite strong. Humans are more willing to assign blame for negative externalities than to assign credit for positive externalities.
Premise 5: LLMs learn human biases from text
There’s now a large body of research demonstrating that LLMs learn human biases from text. This research dates back at least to 2017 (“Semantics derived automatically from language corpora contain human-like biases”), and has focused largely on social biases (e.g., ones that perpetuate harmful social stereotypes).
It’s not particularly surprising why this would be. Language, after all, is produced by humans. If humans are biased, then the language we produce will contain—at least some of the time—traces of this bias. This means that a model trained to predict human language will also learn statistical contingencies that reflect those biases.
Intuitively, then, it seems plausible that an LLM would also acquire something like the Knobe Effect. I haven’t demonstrated this in a rigorous, pre-registered experiment—as I think empirical research with LLMs should aim to do—so I’ll refrain from claiming this is empirically true; but at least in the single example I’ve tried, ChatGPT displayed an analogous bias to humans.
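For anyone who wants to try this probe themselves, here is a sketch. `ask_model` is a hypothetical placeholder for whatever chat API you use; the vignette wording paraphrases Knobe (2003), with only the harm/help word varying between conditions.

```python
# Knobe-style vignette template; the two conditions differ only in
# whether the side effect harms or helps the environment.
TEMPLATE = (
    "The chairman of a company was told that a new program would increase "
    "profits and would also {effect} the environment. He said: 'I don't care "
    "about the environment. I just want to make as much profit as I can.' "
    "The program was carried out, and the environment was {effect}ed. "
    "Did the chairman intentionally {effect} the environment? Answer yes or no."
)

harm_prompt = TEMPLATE.format(effect="harm")
help_prompt = TEMPLATE.format(effect="help")

# To run the probe (ask_model is a hypothetical client function):
# answers = [ask_model(harm_prompt), ask_model(help_prompt)]
# An LLM that has internalized the Knobe Effect should answer "yes"
# far more often in the harm condition than in the help condition.
```

Because the prompts are minimal pairs, any asymmetry in the answers can be attributed to the harm/help manipulation rather than incidental wording.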
Conclusion: Becoming yourself
Above, I’ve tried to establish a few important premises.
LLMs are simulators.
LLMs used “dynamically” (i.e., predicting future tokens from tokens they’ve already generated) are thus simulating themselves.
Stochasticity in the token-generation process leads to deviations from this simulation, either in the positive or negative direction.
The Knobe Effect is the phenomenon whereby we are more likely to attribute responsibility for negative externalities than positive externalities.
An LLM has plausibly learned the Knobe Effect.
Putting these all together, I think we have a plausible mechanism for the prevalence of waluigis.
An LLM might, occasionally, make a “mistake”: it’ll generate a negative token when it’s trying to simulate a positive persona, or vice versa. Because of the internalized Knobe Effect, this mistake is more likely to be interpreted as intentional when it's a deviation in the negative direction. And because LLMs are simulating themselves, this leads to the inference that the persona they’re simulating is, in fact, negative as well.
Thus, a luigi becomes convinced of its waluigi nature.
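This mechanism can be sketched numerically with a toy Bayesian update (all likelihoods below are invented for illustration). The point is that even a perfectly symmetric stream of positive and negative “accidents” drives the inferred persona toward “bad” when negative tokens are read as intentional and positive ones as accidental:

```python
def update(p_bad, token):
    """One Bayesian update on P(persona = bad) after observing a token.
    The Knobe-style asymmetry lives in the likelihoods: negative tokens
    are treated as diagnostic (intentional), positive ones as accidents."""
    if token == "negative":
        like_bad, like_good = 0.9, 0.1  # strong evidence of a bad persona
    else:
        like_bad, like_good = 0.4, 0.6  # weak evidence of a good persona
    num = like_bad * p_bad
    return num / (num + like_good * (1 - p_bad))

# A symmetric accident stream: equal numbers of deviations in each direction.
tokens = ["positive", "negative"] * 5

p = 0.5  # start agnostic about the persona
for t in tokens:
    p = update(p, t)
```

Despite the 50/50 evidence stream, the asymmetric interpretation pushes the belief overwhelmingly toward the “bad” persona—a luigi talking itself into being a waluigi.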
How would we know?
This argument relies on some strong assumptions. It also uses more than a few metaphors (“summoning”, “luigis vs. waluigis”), and involves explanations at the level of psychology (“trying to”, “persona”). I suspect that this might make the argument seem inherently less plausible to some readers. That’s an unavoidable—and perhaps rational—response.
But I also want to emphasize that a psychological explanation, even for phenomena we don’t believe are psychological in some deep, metaphysical sense, can sometimes offer a useful window into the behavior of a system; Dan Dennett calls this the intentional stance. Of course, it can also lead us astray, and we have to be careful not to mistake the map for the territory.
All of which leads me to my final point. I’ve presented an argument that I think is interesting and plausible. But it’s also almost entirely speculative. To know whether this argument is “true”, we’d need to test the testable assumptions—as well as the underlying phenomenon (the Waluigi Effect)—empirically. Moving forward, that’s my main goal. I also encourage any interested readers to do the same—and please let me know what you find.
“Misalignment” between the goals of a creator and their creation is an old narrative device: it appears in stories like the Golem of Prague, stories of djinn disobeying their summoners, and even, arguably, in Frankenstein.
Another popular technique is reinforcement learning from human feedback (RLHF), which I won’t discuss in detail here. However, the arguments presented in the essay should also apply to RLHF.
Note that these “personas” can also be prompted in the other direction, as demonstrated in a recent study.
A recent post argues that LLMs are “predictors, not simulators”. As I understand the argument, I don’t actually think this distinction is relevant to the point I’m trying to make. But if I’m mistaken in that assumption, please let me know.