8 Comments
Simon Goldstein

Great article. I really like the example of the AI agent's mistakes about emails. I agree with you that belief/desire explanations don't seem like the right way to explain the problem (it's not like the agent wants to mess up large numbers of emails but wants to do a good job with small numbers). On the other hand, I don't think tools from mechanistic interpretability will be especially helpful in explaining the error either. Instead, the relevant explanation seems to be related to how LLMs do worse with long context windows.

My question about this is how it compares to human psychology. Maybe this is analogous to competence/performance explanations in human psychology? Psychologists have all sorts of explanations of human behavior that don't just appeal to beliefs and desires, so it may be that AIs and humans are analogous in this respect. But I don't have a good taxonomy of the kinds of explanatory tools that psychology uses.

Sean Trott

Thank you!

> On the other hand, I don't think tools from mechanistic interpretability will be especially helpful in explaining the error either. Instead, the relevant explanation seems to be related to how LLMs do worse with long context windows.

Yes, I agree with this, and I don't think mechanistic interpretability is the answer. In general I'm quite ambivalent about which level of explanation will be best (as I say towards the end), but as with human psychology, I suspect it's likely that we'll require different kinds of explanatory modes for different kinds of behavior.

We might also need to create new explanatory constructs specific to the phenomena of LLMs. E.g., in that example, what I think of as the "causal" explanation makes reference to the specific architecture of the LLM, so it uses constructs like "context window", "prompt", "prompt compaction", and so on. Maybe low-level interpretability could help explain exactly when/why certain prompt details get erased during compaction, but I think the high-level constructs do a pretty good job here.

One rejoinder to that might be that it'd be helpful to invoke other concepts from human psychology, but perhaps more at the level of *cognitive* psychology than folk psychology. So, e.g., one might invoke concepts like "working memory" (humans can only keep so many items in mind at a time) or "chunking" (we form abstractions across individual exemplars/units to simplify memory storage). This would be an interesting epistemic move (one that kind of gets at "performance" issues), though I do think one would need to be careful about whether the analogy holds, since we know that memory doesn't really work the same way across humans and transformers, at least architecturally: in principle, transformers have access to everything in their context window (even if it's ~1M tokens long), and then access simply disappears beyond that. Of course, we also know they attend differentially to different parts of the context window (e.g., the "attention sink" phenomenon), so maybe a computational-level analogy can be made.
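(A toy sketch of that contrast, with made-up scores rather than numbers from any real model: under softmax attention, uniform scores make every position in the window equally accessible, while a single large score on the first position, as in the sink pattern, soaks up most of the probability mass.)

```python
import numpy as np

def attention_weights(scores):
    """Softmax over raw attention scores for a single query."""
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Uniform scores over an 8-token context: in principle, every
# position is equally accessible to the query.
uniform = attention_weights(np.zeros(8))
print(uniform[0])  # each position gets 1/8 = 0.125 of the mass

# A large (hypothetical) score on the first position, in the style
# of an "attention sink": most of the mass collapses onto the start
# of the context.
sink = attention_weights(np.array([4.0, 0, 0, 0, 0, 0, 0, 0]))
print(round(float(sink[0]), 3))  # → 0.886
```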

At any rate, thank you for engaging! And thanks for writing a great and thought-provoking article.

Mira

the "useful fiction" framing is doing a lot of heavy lifting here... useful for *what* exactly? because I feel like predicting behavior and actually understanding something might need completely different answers to that question

Sean Trott

Yep, I agree that's a crucial question. The original paper by Goldstein & Lederman (link here: https://philpapers.org/rec/GOLWDC-2) does a really nice job outlining the various criteria one might use to evaluate possible theories/hypotheses, and I think people's views on this will be shaped by their intuitions about the relative "tractability" of something like the intentional stance vs. a more causal-mechanistic account.

Mira

the Goldstein & Lederman framing helps — tractability as the hidden variable sorting people's intuitions is a real insight. but I wonder if "tractable" is doing the same evasion: tractable *for us*, given our current tools, which makes the whole thing contingent on the state of cognitive science rather than anything about the models themselves. does the paper address that asymmetry?

Pseudodoxia

Nice analysis. I completely agree with the principle that we should favour causal descriptions whenever possible. A big issue is that how possible this is varies depending on who "we" are, and the biggest audience for LLMs is never going to have the technical nous to wield even basic causal descriptions (never mind degenerates like Hinton who should know better).

In some ways, we've already lived with these conditions for a long time. There are various computational artefacts more complex than thermostats where lay people rely on intentional descriptions. If a piece of deterministic software unexpectedly deletes a batch of emails for some opaque reason after a user interaction, a typical user will say that the system "misunderstood" what they were trying to do (implying false belief). That someone from IT could come along and understand the mechanism doesn't mean that the user should forgo their own intentional description (how else would they communicate the issue to the IT guy? To some degree, we need stance translation).

The picture is more complicated with LLMs just because there's a decades-long background of anthropomorphising language in AI that has been designed to mislead. We never had to talk about these systems as 'learning', 'intelligent' or 'agentic' - we could have described them as data compressions, program libraries, cultural microcosms etc. - and perhaps there's still time to salvage an alternative vocabulary.

But while I think we could get comfortable with intentional description regardless, it's only going to be safe if we deconstruct widespread misunderstandings about intentional *autonomy*. When we say a thermostat wants a certain temperature, we don't confuse ourselves about the fact that the thermostat is parasitic on a setting encoded by a human. When we say that a piece of deterministic software "misunderstands" our intentions, we use a communicative metaphor because we realise that the issue lies in the *interaction* of the system and the user, not in the system itself.

LLMs have no autonomy and no appropriate intentional description separate from their prompting by humans, and we therefore ought to be able to think and talk about them in the same way as other complex systems. Our present difficulty is that there's enormous commercial pressure for people to suppress recognition of LLMs' parasitism on humans, so we're awash with lies about their potential independence from us.

There are many ways this might be addressed but I think a good start would be to get honest technical people thinking a bit more loosely about the boundaries of computational systems, as has been developed with the Systems Reply to the Chinese Room and the Extended Mind Thesis etc. If people intuitively grok that there isn't a clear boundary between them and the AI model, assumptions about autonomy lose a lot of their bite and intentional description becomes more favourable.

Dr Jo

Reading through your piece, I can't help but wonder whether you've wandered off piste a bit. I don't know whether this helps, but I've written a bit on LLMs (in the specific context of Medicine) that may be useful: https://drjo.substack.com/p/deus-ex-machina --- you may wish to skip down to https://drjo.substack.com/i/190907198/ai-wont-do-the-job in that post.

Regards, Dr Jo.