2 Comments
skybrian

This can be explained more simply using literary terms. The LLM always pretends to be someone, either the author of a document or one of the characters in a dialog, due to its training. You can arrange for it to pretend to be a different character by changing the story, and it will go along. Plot twists can be described in a few words.

Also, the thing to ask about the simulator hypothesis is "simulator of what?" The training set is mostly not a bunch of game logs; it's a bunch of documents in English (or another human language), and the "rules" of storytelling apply. These rules are so flexible that it's more like Calvinball than anything we'd normally call a simulation. So to me, "simulator" seems like a worse metaphor than "storyteller," though this is just a preference; we don't know how LLMs work in concrete terms.

Also, the hypothesis that it's easier to make LLMs imitate a bad person doesn't seem well justified? I haven't seen any real evidence for it.

Sean Trott

Thanks, these are all good points.

> Also, the hypothesis that it's easier to make LLMs imitate a bad person doesn't seem well justified? I haven't seen any real evidence for it.

I mostly agree and I think this is the weakest part of the argument. The "waluigi effect" still seems pretty speculative. As with much discussion in this space, I think it'd benefit a lot from more rigorous investigation or demonstration, ideally with some kind of pre-registered experiment. This article was just my attempt at explaining what could cause the effect, assuming it's real.

> So to me, "simulator" seems like a worse metaphor than "storyteller," though this is just a preference; we don't know how LLMs work in concrete terms.

Yeah, I think that's fair. I think it also depends on what you "want" from your explanation. My goal here was to occupy a sort of explanatory middle ground that's not just dissecting the weights themselves (for one, because I don't have access to OpenAI's weights) but also not *purely* metaphorical. Hence my focus on modeling the token generation process. But I think you're right that this could be construed as either "storytelling" or "simulating," and I'm not sure the latter grants any special insight into the model's mechanisms.
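For concreteness, here's a minimal sketch of the kind of token-generation loop I mean. The `toy_logits` function is a made-up stand-in: a real LLM computes its logits from billions of learned weights, but the autoregressive sampling loop has the same shape, and that's the level my explanation was aimed at.

```python
import math
import random

# Tiny vocabulary for the toy example.
VOCAB = ["the", "cat", "sat", "mat", "."]

def toy_logits(tokens):
    # Hypothetical stand-in for a real model: returns an unnormalized
    # score (logit) for each vocabulary item given the context so far.
    # Here we just score unseen tokens higher than already-used ones.
    return [0.5 if w in tokens else 1.0 for w in VOCAB]

def sample_next(tokens, temperature=1.0):
    # Softmax over the logits (with temperature), then sample one token.
    logits = toy_logits(tokens)
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(VOCAB, weights=probs, k=1)[0]

# Generate a short continuation, one token at a time.
tokens = ["the"]
for _ in range(5):
    tokens.append(sample_next(tokens))
print(" ".join(tokens))
```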

Thanks!
