2 Comments
skybrian

This can be explained more simply using literary terms. The LLM always pretends to be someone, either the author of a document or one of the characters in a dialog, due to its training. You can arrange for it to pretend to be a different character by changing the story, and it will go along. Plot twists can be described in a few words.

Also, the thing to ask about the simulator hypothesis is "simulator of what?" The training set is mostly not a bunch of game logs; it's a bunch of documents in English (or another human language), and the "rules" of storytelling apply. These rules are so flexible that it's more like Calvinball than anything we'd normally call a simulation. So to me, "simulator" seems like a worse metaphor than "storyteller," though this is just a preference; we don't know how LLMs work in concrete terms.

Also, the hypothesis that it's easier to make LLMs imitate a bad person doesn't seem well justified? I haven't seen any real evidence for it.

Sean Trott

Thanks, these are all good points.

> Also, the hypothesis that it's easier to make LLMs imitate a bad person doesn't seem well justified? I haven't seen any real evidence for it.

I mostly agree and I think this is the weakest part of the argument. The "waluigi effect" still seems pretty speculative. As with much discussion in this space, I think it'd benefit a lot from more rigorous investigation or demonstration, ideally with some kind of pre-registered experiment. This article was just my attempt at explaining what could cause the effect, assuming it's real.

> So to me, "simulator" seems like a worse metaphor than "storyteller," though this is just a preference; we don't know how LLMs work in concrete terms.

Yeah, I think that's fair. I think it also depends on what you "want" from your explanation. My goal here was to occupy a sort of explanatory middle ground that's not just dissecting the weights themselves (for one, because I don't have access to OpenAI's weights) but also not *purely* metaphorical. Hence my focus on modeling the token generation process. But I think you're right that this could be construed as either "storytelling" or "simulating," and I'm not sure the latter grants any special insight into the model's mechanisms.
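For concreteness, here's a minimal sketch of the kind of token-generation loop I mean. The `toy_logits` function is a made-up stand-in: a real LLM computes its logits from billions of learned weights, but the autoregressive sampling loop has the same shape, and that's the level my explanation was aimed at.

```python
import math
import random

# Tiny vocabulary for the toy example.
VOCAB = ["the", "cat", "sat", "mat", "."]

def toy_logits(tokens):
    # Hypothetical stand-in for a real model: returns an unnormalized
    # score (logit) for each vocabulary item given the context so far.
    # Here we just score unseen tokens higher than already-used ones.
    return [0.5 if w in tokens else 1.0 for w in VOCAB]

def sample_next(tokens, temperature=1.0):
    # Softmax over the logits (with temperature), then sample one token.
    logits = toy_logits(tokens)
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(VOCAB, weights=probs, k=1)[0]

# Generate a short continuation, one token at a time.
tokens = ["the"]
for _ in range(5):
    tokens.append(sample_next(tokens))
print(" ".join(tokens))
```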

Thanks!
